Clinical NLP · PyTorch · ONNX · XGBoost · Journal Submission · Edge Deployment

TinySurgicalBERT — Distilled Clinical LM for OR Scheduling

A 0.75 MB INT8 BERT student model distilled from Bio-ClinicalBERT on 180,370 surgical cases, achieving the same predictive accuracy as 436 MB clinical encoders at 43.7× faster inference — designed for real-time operating-room scheduling on standard CPU hardware.

2026 · Mohammad Noorchenarboo · 180,370 surgical cases · 4 encoders · 8 regression models
180K
Surgical cases, single tertiary institution
0.75 MB
TinySurgicalBERT on-disk size (INT8 ONNX)
26.38
MAE (min) — XGBoost + TinySurgicalBERT
580×
Compression vs. Bio-ClinicalBERT (436 MB)
0.484 ms
Per-case inference — CPU, no GPU needed
Methodology Overview

End-to-End Pipeline

The pipeline begins with a 194,661-case retrospective EHR dataset filtered to 180,370 valid records. Four text-encoding strategies compete — from a structured-only baseline that discards all free text to TinySurgicalBERT, a student distilled specifically from surgical procedure language. After feature assembly (text embedding ‖ structured vector), eight regression models are evaluated under 5-fold cross-validation with Optuna hyperparameter optimisation.
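The feature-assembly step (text embedding ‖ structured vector) can be sketched as a plain concatenation. This is a minimal illustration, assuming the 256-d TinySurgicalBERT output and the 38-dim structured vector quoted on this page; the function name `assemble_features` and the placeholder values are hypothetical.

```python
def assemble_features(text_embedding, structured):
    """Concatenate a text embedding with the structured EHR vector."""
    return list(text_embedding) + list(structured)


TEXT_DIM = 256        # TinySurgicalBERT output size quoted on this page
STRUCTURED_DIM = 38   # structured EHR vector size quoted on this page

# Placeholder values only; real inputs would come from the encoder and the EHR.
text_embedding = [0.0] * TEXT_DIM
structured = [1.0] * STRUCTURED_DIM

features = assemble_features(text_embedding, structured)
print(len(features))  # 294
```

Under these assumptions each case reaches the regression model as a single 294-dimensional row.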

🏥
EHR Data
194,661 cases → 180,370 after QC filters
🔤
Text Encoding
4 strategies incl. TinySurgicalBERT distillation
🔗
Feature Assembly
Text embed ‖ 38-dim structured vector
⚙️
HPO + CV
Optuna TPE, 20 trials, 5-fold CV
📊
Evaluation
MAE, RMSE, sMAPE, R² + Wilcoxon FDR-BH
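The four evaluation metrics can be written out in a few lines. This is a sketch under stated assumptions: sMAPE has several definitions in the literature, and the variant below (mean of 2|y − ŷ| / (|y| + |ŷ|), in percent) is one common choice, not necessarily the one used in the study. The example durations are made up for illustration.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def smape(y_true, y_pred):
    """Symmetric MAPE in percent (assumed variant; requires |t| + |p| > 0)."""
    terms = (2 * abs(t - p) / (abs(t) + abs(p)) for t, p in zip(y_true, y_pred))
    return 100 * sum(terms) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Illustrative durations in minutes (not study data).
y_true = [60.0, 120.0, 180.0]
y_pred = [70.0, 110.0, 200.0]
print(mae(y_true, y_pred), rmse(y_true, y_pred), smape(y_true, y_pred), r2(y_true, y_pred))
```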
💡

Why Procedure Text Matters

Models that use only structured EHR fields (surgeon ID, specialty, ASA grade) plateau at MAE ≈ 34.8 min. Adding any text encoding drops MAE to ≈ 26.4 min, a robust 24% improvement that is consistent across all eight non-linear models and all five folds. The procedure description encodes the specific surgical plan at a granularity no categorical code can match.

TinySurgicalBERT Components

Distillation & Compression Pipeline

📝 BPE Tokeniser
Domain-Specific Vocabulary
Trained on all 180,370 surgical procedure strings. Eliminates OOV tokens entirely, compared with ~8% OOV rate for the standard Bio-ClinicalBERT WordPiece vocabulary on the same data.
Vocab size: 2,500 tokens
OOV rate: 0% (vs. ~8% Bio-ClinicalBERT)
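For readers unfamiliar with byte-pair encoding, the merge-learning loop at its core can be sketched in pure Python. This follows the classic BPE training procedure and is only an illustration: the page does not say which tokeniser implementation was used, and the toy corpus, the `</w>` end-of-word marker, and the name `learn_bpe` are assumptions. A production vocabulary of 2,500 tokens would be built with a dedicated tokeniser library.

```python
from collections import Counter


def learn_bpe(corpus, num_merges):
    """Learn up to num_merges BPE merge rules from an iterable of strings."""
    # Represent each word as a tuple of symbols plus an end-of-word marker.
    words = Counter(
        tuple(word) + ("</w>",)
        for line in corpus
        for word in line.lower().split()
    )
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair wins; ties are broken lexicographically.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        merged_words = Counter()
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged_words[tuple(out)] += freq
        words = merged_words
    return merges


# Hypothetical procedure strings, for illustration only.
toy_corpus = [
    "laparoscopic cholecystectomy",
    "laparoscopic appendectomy",
    "open appendectomy",
]
toy_merges = learn_bpe(toy_corpus, num_merges=10)
print(toy_merges[:3])
```

Because every merge is learned from the target corpus, any string made of characters seen in training can be segmented without an unknown token, which is the property behind a 0% OOV rate.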
🧠 Student Architecture
Compact Transformer
Two transformer layers, four attention heads, 128-dimensional hidden representation with a learned 128 → 256 output projection. Only 0.63M parameters vs. 110M in Bio-ClinicalBERT.
Parameters: 0.63M (vs. 110M teacher)
Layers / Heads / Hidden: 2 / 4 / 128
🎓 Distillation Objective
MSE + Cosine Loss
Combined loss: α·‖eₛ − eₜ‖² + (1−α)·(1 − cos(eₛ,eₜ)). MSE enforces magnitude agreement while the cosine term preserves the directional embedding geometry of Bio-ClinicalBERT.
Teacher: Bio-ClinicalBERT (MIMIC-III)
Embedding dim: 256-d (PCA-projected teacher)
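The combined objective α·‖eₛ − eₜ‖² + (1−α)·(1 − cos(eₛ,eₜ)) can be checked on small vectors. The sketch below implements the formula exactly as written, using the squared Euclidean distance; an MSE averaged over dimensions would differ only by a constant factor. The value of α is not given on this page, so the default of 0.5 is an assumption.

```python
import math


def distillation_loss(e_s, e_t, alpha=0.5):
    """alpha * ||e_s - e_t||^2 + (1 - alpha) * (1 - cos(e_s, e_t)).

    alpha=0.5 is an assumed default; both vectors must be non-zero.
    """
    sq_dist = sum((s - t) ** 2 for s, t in zip(e_s, e_t))
    dot = sum(s * t for s, t in zip(e_s, e_t))
    norm_s = math.sqrt(sum(s * s for s in e_s))
    norm_t = math.sqrt(sum(t * t for t in e_t))
    cosine = dot / (norm_s * norm_t)
    return alpha * sq_dist + (1 - alpha) * (1 - cosine)


# Identical embeddings give zero loss; orthogonal unit vectors give 1.5 at alpha=0.5.
print(distillation_loss([1.0, 0.0], [1.0, 0.0]))
print(distillation_loss([1.0, 0.0], [0.0, 1.0]))
```

The second call shows how the two terms split the penalty: the squared distance contributes 0.5 · 2 and the cosine term contributes 0.5 · 1.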
⚡ INT8 ONNX Export
Static Quantisation
After distillation, the student is statically quantised to 8-bit integers and exported to ONNX Runtime format. Inference runs entirely on CPU — no GPU or network dependency at prediction time.
Final model size: 0.752 MB
Per-case inference: 0.484 ± 0.072 ms
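The idea behind static 8-bit quantisation can be illustrated with affine quantisation on a handful of floats: a scale and zero-point are fixed ahead of time from calibration data, then reused for every value. This is a conceptual sketch only, not the ONNX Runtime procedure used for the model; the function names and calibration values are hypothetical.

```python
def calibrate(values, qmin=-128, qmax=127):
    """Fix scale and zero-point from calibration data (assumes non-constant data)."""
    lo = min(min(values), 0.0)  # keep 0.0 inside the range so it maps exactly
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin)
    zero_point = round(qmin - lo / scale)
    return scale, zero_point


def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Map a float to a signed 8-bit code, clamping to the int8 range."""
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    """Recover an approximate float from an 8-bit code."""
    return (q - zero_point) * scale


calibration = [-1.0, 0.0, 0.5, 2.0]  # illustrative values
scale, zero_point = calibrate(calibration)
codes = [quantize(v, scale, zero_point) for v in calibration]
restored = [dequantize(q, scale, zero_point) for q in codes]
print(codes, restored)
```

Each stored value takes one byte, and the round-trip error for in-range values is bounded by half the scale. At one byte per weight, 0.63M parameters come to roughly 0.63 MB, the same order as the 0.752 MB file reported above.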
🔍 Optuna HPO
Tree-Structured Parzen Estimator
Each of the 32 encoding × model combinations receives 20 Optuna TPE trials (first 5 random), tuned independently per fold to prevent information leakage from validation data.
Trials per combination: 20 (5 random warm-up)
Total combinations: 32 (4 encoders × 8 models)
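The leakage-avoiding structure of per-fold tuning can be sketched without any tuning library. In the skeleton below, plain random search stands in for Optuna's TPE sampler, and the 80/20 inner split, the callback signatures, and the toy objective are all assumptions made for illustration. The point is only that hyperparameters are chosen from each fold's training indices and the held-out fold is scored once, afterwards.

```python
import random


def kfold(n_samples, k, seed=0):
    """Shuffle indices once and deal them into k disjoint folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def tune_per_fold(n_samples, sample_params, score, k=5, n_trials=20, seed=0):
    """Select hyperparameters independently inside each outer fold.

    score(params, fit_idx, eval_idx) returns an error to minimise
    (hypothetical callback that fits on fit_idx and evaluates on eval_idx).
    """
    rng = random.Random(seed)
    folds = kfold(n_samples, k, seed)
    results = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        cut = int(0.8 * len(train_idx))  # assumed 80/20 inner split
        inner_fit, inner_val = train_idx[:cut], train_idx[cut:]
        # Random search as a stand-in for TPE: the test fold is never seen here.
        candidates = [sample_params(rng) for _ in range(n_trials)]
        best = min(candidates, key=lambda p: score(p, inner_fit, inner_val))
        # Only now is the held-out fold used, once, for the final score.
        results.append((best, score(best, train_idx, test_idx)))
    return results


# Toy problem: predict the training mean plus a tunable offset.
y = [float(v) for v in range(100)]


def sample_params(rng):
    return {"offset": rng.uniform(-10.0, 10.0)}


def score(params, fit_idx, eval_idx):
    pred = sum(y[i] for i in fit_idx) / len(fit_idx) + params["offset"]
    return sum(abs(y[i] - pred) for i in eval_idx) / len(eval_idx)


results = tune_per_fold(len(y), sample_params, score)
print(len(results))  # 5
```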
📐 Statistical Testing
Wilcoxon + FDR-BH
Pairwise Wilcoxon signed-rank tests on 5-fold metric vectors. Multiple comparisons corrected via Benjamini–Hochberg FDR across 12 test pairs (3 baselines × 4 metrics).
vs. Structured Only: W = 0, p = 0.0625 (strongest)
vs. large encoders: W ≥ 4, p ≥ 0.44 (no diff.)
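The two p-values quoted here follow directly from the exact null distribution of the signed-rank statistic at n = 5, which has only 2⁵ = 32 equally likely sign patterns. The sketch below enumerates them; it assumes no zero differences and no tied magnitudes, and the example differences are illustrative rather than the study's fold values.

```python
from itertools import product


def wilcoxon_exact(diffs):
    """Two-sided exact Wilcoxon signed-rank test (no zeros, no tied magnitudes)."""
    n = len(diffs)
    # Rank the differences by absolute value, smallest first.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0] * n
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)
    # Under the null hypothesis every sign pattern is equally likely.
    total = n * (n + 1) // 2
    extreme = 0
    for signs in product((0, 1), repeat=n):
        wp = sum(r for r, s in zip(range(1, n + 1), signs) if s)
        if min(wp, total - wp) <= w:
            extreme += 1
    return w, extreme / 2 ** n


# All five folds favour the same method: W = 0.
print(wilcoxon_exact([1.0, 2.0, 3.0, 4.0, 5.0]))   # (0, 0.0625)
# One fold of rank 4 disagrees: W = 4.
print(wilcoxon_exact([-4.0, 1.0, 2.0, 3.0, 5.0]))  # (4, 0.4375)
```

With five folds, 2/32 = 0.0625 is the smallest two-sided p-value the test can produce, which is why W = 0 is described as the strongest attainable evidence. W = 4 gives 14/32 = 0.4375, matching the p ≥ 0.44 bound after rounding.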
Encoding Strategies Compared

Four Text Encoders

All four strategies share the same downstream regression suite and 5-fold CV protocol, ensuring a fair paired comparison. TinySurgicalBERT is the only encoder designed for edge deployment: it needs 580× less storage than Bio-ClinicalBERT (0.75 MB vs. 436 MB) with no statistically significant loss in accuracy.

TinySurgicalBERT (ours)
0.63M params · 0.75 MB · 0.484 ms/case · 256-d output · CPU ONNX Runtime
Proposed
Bio-ClinicalBERT
110M params · 436 MB · 21.19 ms/case · 768-d → PCA 384 · Teacher model
Teacher
SentenceBERT
22.7M params · 91 MB · 2.57 ms/case · 384-d embeddings · General domain
Baseline
Structured Only
38-dim EHR features only — no text encoding. φ(p) = 0. Strong structured baseline.
Baseline
⚙️

Deployment Gap Closed

Prior work forced a binary trade-off: high accuracy with a 436 MB encoder impractical at the point of care, or fast scheduling with a structured-only model that ignores the richest pre-operative signal. TinySurgicalBERT occupies the intersection — statistically equivalent accuracy at 0.75 MB and 0.484 ms/case.

Results from 5-fold cross-validation on 180,370 retrospective surgical cases. Statistical comparisons via two-sided Wilcoxon signed-rank with Benjamini–Hochberg FDR correction.

Performance Snapshot

Results at a Glance

MAE by Encoder (XGBoost): Mean Absolute Error (minutes ↓) for XGBoost paired with each encoding strategy. All three text-augmented encoders reduce MAE by ~24% over the structured-only baseline. TinySurgicalBERT achieves 26.38 min, statistically indistinguishable from Bio-ClinicalBERT (26.40) and SentenceBERT (26.35).

Model Size (log scale): On-disk model file size in MB (log scale ↓). TinySurgicalBERT (0.75 MB INT8 ONNX) is 121× smaller than SentenceBERT (91 MB) and 580× smaller than Bio-ClinicalBERT (436 MB), enabling deployment on edge and mobile hardware common in OR scheduling systems.

Inference Time (log scale): End-to-end per-case inference time in ms (log scale ↓), measured over N=30 repeated passes on 200 procedure descriptions under single-item CPU inference. TinySurgicalBERT (0.484 ms) is 5.3× faster than SentenceBERT and 43.7× faster than Bio-ClinicalBERT. Both differences are significant (p < 0.001, Wilcoxon FDR-BH).
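The headline ratios can be reproduced from the sizes and latencies reported on this page. The short check below uses only those published figures; the helper name `ratio_vs_student` is hypothetical.

```python
# Figures as reported on this page.
sizes_mb = {"TinySurgicalBERT": 0.752, "SentenceBERT": 91.0, "Bio-ClinicalBERT": 436.0}
latency_ms = {"TinySurgicalBERT": 0.484, "SentenceBERT": 2.570, "Bio-ClinicalBERT": 21.190}
mae_min = {"TinySurgicalBERT": 26.38, "Structured Only": 34.80}


def ratio_vs_student(table, student="TinySurgicalBERT"):
    """Ratio of every other entry to the student's value."""
    return {name: value / table[student] for name, value in table.items() if name != student}


compression = ratio_vs_student(sizes_mb)
speedup = ratio_vs_student(latency_ms)
mae_drop = 1 - mae_min["TinySurgicalBERT"] / mae_min["Structured Only"]

for name in compression:
    print(f"{name}: {compression[name]:.1f}x smaller, {speedup[name]:.2f}x faster")
print(f"MAE reduction vs. structured only: {mae_drop:.1%}")
```

The size ratios come out at about 580 and 121, the latency ratios at about 43.8 and 5.3 (the page quotes the former as 43.7×), and the MAE reduction at about 24%.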

Deployment Profile

Efficiency Comparison

TinySurgicalBERT is evaluated against three baselines on two deployment-critical dimensions: on-disk model size and end-to-end per-case CPU inference time. Downstream XGBoost prediction contributes <0.005 ms/case for all encoders (negligible).

⚡ TinySurgicalBERT OURS
0.75 MB
0.484 ± 0.072 ms · 2L · 4H · 128d · 0.63M params
🔤 SentenceBERT
91 MB
2.570 ± 0.136 ms · 6L · 12H · 384d · 22.7M params
🏥 Bio-ClinicalBERT
436 MB
21.190 ± 0.111 ms · 12L · 12H · 768d · 110M params
Method | MAE ↓ (min) | RMSE ↓ (min) | sMAPE ↓ (%) | R² ↑
TinySurgicalBERT + XGBoost (ours) | 26.38 ± 0.09 | 41.87 ± 0.77 | 21.61 ± 0.15 | 0.8543 ± 0.0051
SentenceBERT + XGBoost | 26.35 ± 0.07 | 41.89 ± 0.75 | 21.56 ± 0.06 | 0.8542 ± 0.0049
Bio-ClinicalBERT + XGBoost | 26.40 ± 0.11 | 41.93 ± 0.81 | 21.62 ± 0.09 | 0.8539 ± 0.0054
Structured Only + XGBoost | 34.80 ± 0.12 | 51.77 ± 0.67 | 28.84 ± 0.12 | 0.7773 ± 0.0050

Mean ± SD over 5-fold CV. Statistical comparisons: TinySurgicalBERT vs. Structured Only — W = 0, p = 0.0625 (all folds consistently signed, strongest evidence achievable at n = 5). vs. SentenceBERT / Bio-ClinicalBERT — W ≥ 4, p ≥ 0.44 (no significant difference). FDR-BH corrected across 12 pairs.
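For reference, the Benjamini–Hochberg adjustment mentioned in the note above can be sketched in a few lines. The input p-values below are illustrative only and are not the study's 12 test results; the function name is hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Illustrative p-values, not study results.
raw = [0.01, 0.04, 0.03, 0.005]
print(benjamini_hochberg(raw))  # approximately [0.02, 0.04, 0.04, 0.02]
```

Each raw p-value is scaled by m / rank, then capped by the adjusted value of the next-larger p-value so that the adjusted sequence never decreases with rank.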

Design Decisions

Why These Choices Work

📝
Domain BPE
A vocabulary of 2,500 tokens trained exclusively on surgical procedure text yields 0% OOV rate — eliminating the tokenisation noise that degrades general-purpose encoders on surgical terminology.
🎓
580× Smaller
Knowledge distillation transfers Bio-ClinicalBERT's embedding geometry into a 2-layer student via combined MSE + cosine loss. INT8 quantisation then stores each weight in 8 bits, a quarter of the 32 bits used by a single-precision float, with negligible accuracy loss.
43.7× Faster
ONNX Runtime CPU inference at 0.484 ms/case enables real-time point-of-care scheduling without GPU or cloud dependency — a hard requirement on the commodity hardware used in most OR systems.
🔬
24% MAE Drop
Including any text encoding reduces MAE from 34.80 to ~26.4 min — a robust effect consistent across all 8 non-linear models and all 5 folds, confirming that procedure free-text provides independent predictive signal.
🛡️
Rigorous CV
Optuna TPE tuning is conducted per-fold on the inner training split, preventing validation leakage. Wilcoxon signed-rank tests with FDR-BH correction guard against false positives across 12 pairwise comparisons.
🏥
Clinical Ready
The complete pipeline produces a single .onnx artefact requiring no HuggingFace runtime, no GPU, and no internet at inference time — designed from the ground up for deployment in the resource-constrained OR scheduling environment.