TinySurgicalBERT — Surgical Duration Prediction

Methodology Overview

End-to-End Pipeline

The pipeline starts from a retrospective EHR dataset of 194,661 cases, filtered to 180,370 valid records. Four text-encoding strategies are compared, ranging from a structured-only baseline that discards all free text to TinySurgicalBERT, a student model distilled from Bio-ClinicalBERT on surgical procedure text. After feature assembly (text embedding ‖ structured vector), eight regression models are evaluated under 5-fold cross-validation with Optuna hyperparameter optimisation.

🏥

EHR Data

194,661 cases → 180,370 after QC filters

🔤

Text Encoding

4 strategies incl. TinySurgicalBERT distillation

🔗

Feature Assembly

Text embed ‖ 38-dim structured vector

⚙️

HPO + CV

Optuna TPE, 20 trials, 5-fold CV

📊

Evaluation

MAE, RMSE, sMAPE, R² + Wilcoxon FDR-BH
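The four evaluation metrics can be sketched in plain Python as follows. This is a minimal illustration, not the project's evaluation code: the sMAPE variant (mean of |t − p| over the average of |t| and |p|, in percent) is an assumption, since several definitions are in use, and the toy durations are made up.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error, in the units of the target (minutes here)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error; penalises large misses more than MAE."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def smape(y_true, y_pred):
    """Symmetric MAPE in percent (assumed variant, see the note above)."""
    terms = []
    for t, p in zip(y_true, y_pred):
        denom = (abs(t) + abs(p)) / 2
        terms.append(0.0 if denom == 0 else abs(t - p) / denom)
    return 100 * sum(terms) / len(terms)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Toy surgical durations in minutes (illustrative values only).
actual = [60.0, 120.0, 90.0, 200.0]
predicted = [70.0, 110.0, 95.0, 180.0]

print(mae(actual, predicted), rmse(actual, predicted))
print(smape(actual, predicted), r2(actual, predicted))
```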

TinySurgicalBERT Components

Distillation & Compression Pipeline

📝 BPE Tokeniser

Domain-Specific Vocabulary

Trained on all 180,370 surgical procedure strings. Eliminates OOV tokens entirely, compared with ~8% OOV rate for the standard Bio-ClinicalBERT WordPiece vocabulary on the same data.

Vocab size: 2,500 tokens

OOV rate: 0% (vs. ~8% Bio-ClinicalBERT)

🧠 Student Architecture

Compact Transformer

Two transformer layers, four attention heads, 128-dimensional hidden representation with a learned 128 → 256 output projection. Only 0.63M parameters vs. 110M in Bio-ClinicalBERT.

Parameters: 0.63M (vs. 110M teacher)

Layers / Heads / Hidden: 2 / 4 / 128
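As a rough sanity check on the 0.63M figure, the parameter count of a BERT-style encoder can be tallied by hand. The feed-forward width (256) and maximum sequence length (64) below are assumptions, not values stated above; they were chosen because they land close to the reported total. The number of attention heads does not change the count.

```python
def transformer_param_count(vocab, hidden, layers, ffn, max_len, out_dim):
    """Count parameters of a BERT-style encoder with an output projection."""
    # Token embeddings + learned position embeddings + embedding LayerNorm.
    embeddings = vocab * hidden + max_len * hidden + 2 * hidden
    # Q, K, V and attention-output projections, each with a bias.
    attention = 4 * (hidden * hidden + hidden)
    # Two-layer feed-forward block with biases.
    feed_forward = hidden * ffn + ffn + ffn * hidden + hidden
    # Two LayerNorms per layer, each with scale and shift.
    layer_norms = 2 * 2 * hidden
    per_layer = attention + feed_forward + layer_norms
    # Learned hidden -> out_dim projection with bias.
    projection = hidden * out_dim + out_dim
    return embeddings + layers * per_layer + projection

student_params = transformer_param_count(
    vocab=2500,    # BPE vocabulary size from the text
    hidden=128,    # hidden size from the text
    layers=2,      # layer count from the text
    ffn=256,       # ASSUMED feed-forward width
    max_len=64,    # ASSUMED maximum sequence length
    out_dim=256,   # output projection size from the text
)
print(student_params, round(student_params / 1e6, 2))
```

Under these assumptions roughly half of the parameters sit in the token embedding table (2,500 × 128 = 320,000), which is why a small domain vocabulary matters for the final size.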

🎓 Distillation Objective

MSE + Cosine Loss

Combined loss: α·‖eₛ − eₜ‖² + (1−α)·(1 − cos(eₛ,eₜ)). MSE enforces magnitude agreement while the cosine term preserves the directional embedding geometry of Bio-ClinicalBERT.

Teacher: Bio-ClinicalBERT (MIMIC-III)

Embedding dim: 256-d (PCA-projected teacher)
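The combined objective above can be written out for a single pair of embeddings as follows. This is a sketch only: the value of α is not given in the text (0.5 is an assumption), and whether the squared error is summed or averaged over dimensions is also not stated, so the sum is used here to match ‖eₛ − eₜ‖² literally.

```python
import math

def distillation_loss(student, teacher, alpha=0.5):
    """alpha * ||e_s - e_t||^2 + (1 - alpha) * (1 - cos(e_s, e_t))."""
    sq_err = sum((s - t) ** 2 for s, t in zip(student, teacher))
    dot = sum(s * t for s, t in zip(student, teacher))
    norm_s = math.sqrt(sum(s * s for s in student))
    norm_t = math.sqrt(sum(t * t for t in teacher))
    cosine = dot / (norm_s * norm_t)
    return alpha * sq_err + (1 - alpha) * (1 - cosine)

# Same direction, different magnitude: only the MSE term is non-zero.
same_direction = distillation_loss([2.0, 0.0], [1.0, 0.0], alpha=0.5)

# Orthogonal vectors with alpha = 0: only the cosine term contributes.
orthogonal = distillation_loss([0.0, 1.0], [1.0, 0.0], alpha=0.0)

print(same_direction, orthogonal)
```

The two toy cases show why both terms are useful: the cosine term is blind to a pure rescaling of the student embedding, while the MSE term catches it.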

⚡ INT8 ONNX Export

Static Quantisation

After distillation, the student is statically quantised to 8-bit integers and exported to ONNX Runtime format. Inference runs entirely on CPU — no GPU or network dependency at prediction time.

Final model size: 0.752 MB

Per-case inference: 0.484 ± 0.072 ms
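The arithmetic behind 8-bit weight storage can be illustrated with a symmetric per-tensor scheme. This sketch shows only the weight mapping; it does not reproduce the ONNX Runtime quantisation tooling or the activation calibration that static quantisation also involves, and the example weights are made up. The size estimate assumes the unquantised student is stored as 32-bit floats.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantisation: w is approximated by scale * q."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the int8 range used here.
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Map integer codes back to approximate float weights."""
    return [scale * v for v in q]

weights = [0.42, -1.27, 0.05, 0.9]  # illustrative values only
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

# Back-of-envelope storage for 0.63M parameters.
params = 0.63e6
fp32_mb = params * 4 / 1e6   # 4 bytes per weight (assumed FP32 baseline)
int8_mb = params * 1 / 1e6   # 1 byte per weight

print(q, max_error, fp32_mb, int8_mb)
```

The rounding error per weight is at most half a quantisation step. The 0.63 MB estimate is close to the reported 0.752 MB file; the remainder is presumably graph structure, scales, and any tensors left in floating point.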

🔍 Optuna HPO

Tree-Structured Parzen Estimator

Each of the 32 encoding × model combinations receives 20 Optuna TPE trials (first 5 random), tuned independently per fold to prevent information leakage from validation data.

Trials per combination: 20 (5 random warm-up)

Total combinations: 32 (4 encoders × 8 models)
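The tuning budget implied by these numbers can be enumerated directly. The sketch assumes that "per fold" means one independent study per (encoder, model, fold) triple; only XGBoost is named in the text, so the other seven model names are placeholders. In Optuna, a random warm-up of this kind corresponds to the `n_startup_trials` argument of `TPESampler`.

```python
from itertools import product

encoders = ["TinySurgicalBERT", "Bio-ClinicalBERT", "SentenceBERT", "Structured Only"]
# Only XGBoost is named in the text; model_2..model_8 are placeholders.
models = ["XGBoost"] + [f"model_{i}" for i in range(2, 9)]

n_folds = 5      # outer cross-validation folds
n_trials = 20    # Optuna trials per study
n_random = 5     # random warm-up trials before TPE takes over

combinations = list(product(encoders, models))
# ASSUMPTION: one independent study per (encoder, model, fold).
studies = [(enc, mod, fold) for enc, mod in combinations for fold in range(n_folds)]

total_trials = len(studies) * n_trials
random_trials = len(studies) * n_random

print(len(combinations), len(studies), total_trials, random_trials)
```

Under that reading the experiment runs 160 studies and 3,200 trials in total, of which 800 are random warm-up draws.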

📐 Statistical Testing

Wilcoxon + FDR-BH

Pairwise Wilcoxon signed-rank tests on 5-fold metric vectors. Multiple comparisons corrected via Benjamini–Hochberg FDR across 12 test pairs (3 baselines × 4 metrics).

vs. Structured Only: W = 0, p = 0.0625 (strongest)

vs. large encoders: W ≥ 4, p ≥ 0.44 (no diff.)
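The p = 0.0625 floor follows from the size of the sign-flip distribution at n = 5: there are 2⁵ = 32 equally likely sign patterns under the null, and only two of them (all positive, all negative) are as extreme as W = 0. The sketch below computes the exact two-sided p-value by enumeration and adds a Benjamini–Hochberg adjustment. It assumes no ties and no zero differences; the fold differences and the four p-values passed to the adjustment are made up for illustration.

```python
from itertools import product

def rank_abs(diffs):
    """Ranks of |d| from 1..n (assumes no ties)."""
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0] * len(diffs)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    return ranks

def wilcoxon_exact_two_sided(diffs):
    """Exact two-sided Wilcoxon signed-rank test by full enumeration."""
    ranks = rank_abs(diffs)
    total = sum(ranks)
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w = min(w_plus, total - w_plus)
    # Count sign patterns whose statistic is at least as extreme as w.
    count = 0
    for signs in product([0, 1], repeat=len(diffs)):
        pos = sum(r for r, s in zip(ranks, signs) if s)
        if min(pos, total - pos) <= w:
            count += 1
    return w, count / 2 ** len(diffs)

def benjamini_hochberg(pvals):
    """BH-adjusted p-values, returned in the original order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Illustrative per-fold MAE differences (all the same sign).
fold_diffs = [-8.4, -8.5, -8.3, -8.6, -8.2]
w_stat, p_value = wilcoxon_exact_two_sided(fold_diffs)

# Illustrative raw p-values, unrelated to the reported results.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])

print(w_stat, p_value, adjusted)
```

With five folds, no pairwise comparison can reach p < 0.05 under this test, which is why a consistently signed result is described as the strongest attainable evidence.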

Encoding Strategies Compared

Four Text Encoders

All four strategies share the same downstream regression suite and 5-fold CV protocol, ensuring a fair paired comparison. TinySurgicalBERT is the only encoder designed for edge deployment: it needs roughly 580× less storage than Bio-ClinicalBERT (0.75 MB vs. 436 MB) with no statistically significant loss in accuracy.

TinySurgicalBERT (ours)

0.63M params · 0.75 MB · 0.484 ms/case · 256-d output · CPU ONNX Runtime

Proposed

Bio-ClinicalBERT

110M params · 436 MB · 21.19 ms/case · 768-d → PCA 384 · Teacher model

Teacher

SentenceBERT

22.7M params · 91 MB · 2.57 ms/case · 384-d embeddings · General domain

Baseline

Structured Only

38-dim EHR features only — no text encoding. φ(p) = 0. Strong structured baseline.

Baseline
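Feature assembly for the three fixed-width inputs listed above can be sketched as a plain concatenation. Whether the structured-only variant drops the text block or fills it with zeros (φ(p) = 0) is not stated; the sketch drops it, which is an assumption. All vectors are zero-filled placeholders.

```python
def assemble_features(text_embedding, structured):
    """Concatenate the text embedding with the structured EHR vector."""
    return list(text_embedding) + list(structured)

structured = [0.0] * 38          # placeholder 38-dim structured vector
tiny_embedding = [0.0] * 256     # TinySurgicalBERT output width
sbert_embedding = [0.0] * 384    # SentenceBERT output width

feature_dims = {
    "TinySurgicalBERT": len(assemble_features(tiny_embedding, structured)),  # 294
    "SentenceBERT": len(assemble_features(sbert_embedding, structured)),     # 422
    # ASSUMPTION: structured-only omits the text block entirely.
    "Structured Only": len(assemble_features([], structured)),               # 38
}
print(feature_dims)
```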

Interactive Explorer

Encoding Strategy Explorer

The explorer reports XGBoost performance for each encoding strategy across all four metrics (mean ± SD, 5-fold CV). R² is the fraction of surgical duration variance explained.

Results from 5-fold cross-validation on 180,370 retrospective surgical cases. Statistical comparisons via two-sided Wilcoxon signed-rank with Benjamini–Hochberg FDR correction.

Performance Snapshot

Results at a Glance

MAE by Encoder (XGBoost)

Mean Absolute Error (minutes ↓) for XGBoost paired with each encoding strategy. All three text-augmented encoders reduce MAE by ~24% over the structured-only baseline. TinySurgicalBERT achieves 26.38 min, statistically indistinguishable from Bio-ClinicalBERT (26.40) and SentenceBERT (26.35).

Model Size (log scale)

On-disk model file size in MB (log scale ↓). TinySurgicalBERT (0.75 MB INT8 ONNX) is 121× smaller than SentenceBERT (91 MB) and 580× smaller than Bio-ClinicalBERT (436 MB), enabling deployment on edge and mobile hardware common in OR scheduling systems.

Inference Time (log scale)

End-to-end per-case inference time in ms (log scale ↓), measured over N=30 repeated passes on 200 procedure descriptions under single-item CPU inference. TinySurgicalBERT (0.484 ms) is 5.3× faster than SentenceBERT and 43.7× faster than Bio-ClinicalBERT. Both differences are significant (p < 0.001, Wilcoxon FDR-BH).
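The headline ratios can be recomputed from the reported figures. The inputs below are the rounded values quoted on this page, so the results may differ in the last digit from ratios computed on unrounded measurements.

```python
size_mb = {"TinySurgicalBERT": 0.752, "SentenceBERT": 91.0, "Bio-ClinicalBERT": 436.0}
latency_ms = {"TinySurgicalBERT": 0.484, "SentenceBERT": 2.57, "Bio-ClinicalBERT": 21.19}
mae_min = {"TinySurgicalBERT": 26.38, "Structured Only": 34.80}

baseline_size = size_mb["TinySurgicalBERT"]
baseline_latency = latency_ms["TinySurgicalBERT"]

# How many times larger / slower each encoder is than TinySurgicalBERT.
size_ratio = {name: mb / baseline_size for name, mb in size_mb.items()}
speedup = {name: ms / baseline_latency for name, ms in latency_ms.items()}

# Relative MAE reduction of TinySurgicalBERT over the structured-only baseline.
mae_reduction_pct = 100 * (
    mae_min["Structured Only"] - mae_min["TinySurgicalBERT"]
) / mae_min["Structured Only"]

for name in ("SentenceBERT", "Bio-ClinicalBERT"):
    print(f"{name}: {size_ratio[name]:.0f}x size, {speedup[name]:.1f}x latency")
print(f"MAE reduction: {mae_reduction_pct:.1f}%")
```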

Deployment Profile

Efficiency Comparison

TinySurgicalBERT is evaluated against three baselines on two deployment-critical dimensions: on-disk model size and end-to-end per-case CPU inference time. Downstream XGBoost prediction contributes <0.005 ms/case for all encoders (negligible).

⚡ TinySurgicalBERT (ours)

0.75 MB

0.484 ± 0.072 ms · 2L · 4H · 128d · 0.63M params

🔤 SentenceBERT

91 MB

2.570 ± 0.136 ms · 6L · 12H · 384d · 22.7M params

🏥 Bio-ClinicalBERT

436 MB

21.190 ± 0.111 ms · 12L · 12H · 768d · 110M params

Method	MAE ↓ (min)	RMSE ↓ (min)	sMAPE ↓ (%)	R² ↑
TinySurgicalBERT + XGBoost (ours)	26.38 ± 0.09	41.87 ± 0.77	21.61 ± 0.15	0.8543 ± 0.0051
SentenceBERT + XGBoost	26.35 ± 0.07	41.89 ± 0.75	21.56 ± 0.06	0.8542 ± 0.0049
Bio-ClinicalBERT + XGBoost	26.40 ± 0.11	41.93 ± 0.81	21.62 ± 0.09	0.8539 ± 0.0054
Structured Only + XGBoost	34.80 ± 0.12	51.77 ± 0.67	28.84 ± 0.12	0.7773 ± 0.0050

Mean ± SD over 5-fold CV. Statistical comparisons: TinySurgicalBERT vs. Structured Only — W = 0, p = 0.0625 (all folds consistently signed, strongest evidence achievable at n = 5). vs. SentenceBERT / Bio-ClinicalBERT — W ≥ 4, p ≥ 0.44 (no significant difference). FDR-BH corrected across 12 pairs.

Design Decisions

Why These Choices Work

📝

Domain BPE

A vocabulary of 2,500 tokens trained exclusively on surgical procedure text yields 0% OOV rate — eliminating the tokenisation noise that degrades general-purpose encoders on surgical terminology.

🎓

580× Smaller

Knowledge distillation transfers Bio-ClinicalBERT's embedding geometry into a 2-layer student via a combined MSE + cosine loss. INT8 quantisation then stores each weight in 8 bits (a 4× reduction relative to 32-bit floats) with negligible accuracy loss.

⚡

43.7× Faster

ONNX Runtime CPU inference at 0.484 ms/case enables real-time point-of-care scheduling without GPU or cloud dependency — a hard requirement on the commodity hardware used in most OR systems.

🔬

24% MAE Drop

Including any text encoding reduces MAE from 34.80 to ~26.4 min — a robust effect consistent across all 8 non-linear models and all 5 folds, confirming that procedure free-text provides independent predictive signal.

🛡️

Rigorous CV

Optuna TPE tuning is conducted per-fold on the inner training split, preventing validation leakage. Wilcoxon signed-rank tests with FDR-BH correction guard against false positives across 12 pairwise comparisons.

🏥

Clinical Ready

The complete pipeline produces a single .onnx artefact requiring no HuggingFace runtime, no GPU, and no internet at inference time — designed from the ground up for deployment in the resource-constrained OR scheduling environment.
