Benchmarking Text Encoding Strategies for Surgical Case Duration Prediction
Surgeons write rich narratives about every operation. This study benchmarks five text-encoding strategies in a multimodal pipeline that combines clinical text with structured perioperative variables for surgical duration prediction. Evaluated on 180,370 elective cases across 3 tertiary hospitals, contextual embeddings (ClinicalBERT and Sentence-BERT) achieved comparable top overall performance, with small metric differences depending on evaluation view.
Published in IJMI · 2026 · Noorchenarboo et al. · 180,370 surgical cases · 3 tertiary hospitals · London, Canada
Best-model MAE in minutes (XGBoost + contextual embeddings)
16%
Reduction in prediction error vs structured-only baseline
0.86
Best R² achieved (SMAPE 21.6%)
The Core Idea
From a Surgeon's Words to a Prediction
After every surgery, the surgeon writes a narrative note describing what happened. This project is about one question: what's the smartest way to turn that text into numbers a prediction model can learn from? We tested five strategies, ranging from simple label encoding to state-of-the-art clinical transformer embeddings.
📋
Operative Notes
Procedure descriptions & diagnoses (free text)
🔤
Text Encoding
5 strategies — from label encoding to Sentence-BERT
🧬
Feature Fusion
Merged with structured variables (age, BMI, ASA, sex…)
🤖
ML Models
Linear, tree-based ensembles & neural networks
⏱️
Surgery Duration
Predicted minutes from incision to close
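The fusion step in the pipeline above can be pictured as plain concatenation: the encoded text vector and the structured variables are joined into one feature row per case. This is a minimal sketch, not the paper's implementation. The function name, field names, and all values are hypothetical, and the three-number text vector stands in for whatever an encoder would actually produce.

```python
from typing import Dict, List


def fuse_features(
    text_vector: List[float],
    structured: Dict[str, float],
    field_order: List[str],
) -> List[float]:
    """Concatenate a text vector with structured variables in a fixed column order."""
    return list(text_vector) + [float(structured[name]) for name in field_order]


# Hypothetical values for illustration only.
text_vector = [0.12, -0.40, 0.33]  # stand-in for an encoder output
structured = {"age": 64, "bmi": 27.5, "asa": 2, "sex_male": 1}
field_order = ["age", "bmi", "asa", "sex_male"]

row = fuse_features(text_vector, structured, field_order)
print(len(row))  # 3 text dimensions + 4 structured variables
```

Fixing the column order up front matters because every case must place the same variable in the same position before the rows are handed to a model.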
💡
Why Do Embeddings Matter So Much?
A label encoder sees "laparoscopic cholecystectomy" as an arbitrary integer. A clinical transformer embedding understands it as a minimally invasive gallbladder removal with a predictable time profile. The encoding method determines how much medical meaning the model can extract — and the published results show contextual embeddings materially improve prediction accuracy over structured-only inputs.
Head-to-Head
Five Encoding Strategies Compared
🔢 Baseline
Label Encoding
Converts each category to a single integer. No semantic meaning — "appendectomy" becomes 3, "Whipple" becomes 9.
Relative performance
MAE improvement: Minimal
Medical knowledge: None
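To make the idea concrete, here is a minimal sketch of label encoding in plain Python. The helper name and the procedure list are hypothetical, and the integers depend only on sort order, which is exactly why they carry no clinical meaning.

```python
def fit_label_encoder(values):
    """Map each distinct category to an integer, assigned in sorted order."""
    return {label: index for index, label in enumerate(sorted(set(values)))}


# Hypothetical procedure names for illustration only.
procedures = [
    "appendectomy",
    "Whipple",
    "laparoscopic cholecystectomy",
    "appendectomy",
]

mapping = fit_label_encoder(procedures)
encoded = [mapping[p] for p in procedures]
print(encoded)  # [1, 0, 2, 1]
```

Here "Whipple" receives 0 simply because uppercase letters sort first. A model reading these integers might treat 2 as "more" than 0, even though the ordering is arbitrary.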
📊 Frequency
Count Vectorization
Counts raw word occurrences. Captures vocabulary but treats all words equally — "the" and "hemorrhage" weighted the same.
Relative performance
MAE improvement: Limited
Medical knowledge: None
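A minimal sketch of count vectorization, assuming a simple lowercase tokenizer. The function names and the two example notes are hypothetical, and production pipelines would typically rely on a library vectorizer instead.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def count_vectorize(documents):
    """Return a sorted vocabulary and one row of raw term counts per document."""
    vocabulary = sorted({tok for doc in documents for tok in tokenize(doc)})
    rows = []
    for doc in documents:
        counts = Counter(tokenize(doc))
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows


# Hypothetical notes for illustration only.
notes = [
    "repair of the hernia",
    "control of the hemorrhage and repair of the vessel",
]

vocabulary, rows = count_vectorize(notes)
print(vocabulary)
print(rows[1])  # [1, 1, 1, 0, 2, 1, 2, 1]
```

In the second note, "the" gets a count of 2 while "hemorrhage" gets 1, so the filler word ends up with the larger value.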
⚖️ Weighted
TF-IDF
Weights words by rarity, so rare clinical terms score higher. Better than raw counts, but still no understanding of meaning.
Relative performance
MAE improvement: Limited
Medical knowledge: None
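The rarity weighting can be sketched in a few lines. This version assumes one common variant of the formula, term frequency times (log(N / document frequency) + 1). Libraries use slightly different smoothing, and the paper's exact settings are not stated here. The notes and function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf(documents):
    """Return one {term: weight} dict per document using tf * (log(N / df) + 1)."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * (math.log(n_docs / doc_freq[term]) + 1.0)
            for term, count in counts.items()
        })
    return weights


# Hypothetical notes for illustration only.
notes = [
    "repair of the hernia",
    "control of the hemorrhage",
    "excision of the lesion",
]

weights = tfidf(notes)
print(weights[1]["the"], weights[1]["hemorrhage"])
```

Because "the" appears in every note, its rarity factor collapses to 1, while "hemorrhage" appears in only one note and is weighted roughly twice as heavily.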
🏥 Clinical Transformer
ClinicalBERT
BERT further pretrained on real clinical notes (MIMIC-III). In the paper's fold-averaged encoder table, ClinicalBERT is slightly better than Sentence-BERT: MAE 34.66 ± 6.71, SMAPE 32.28 ± 9.40, R² 0.777 ± 0.061. Pairwise tests still show no significant difference.
Statistically comparable to Sentence-BERT
MAE: 34.66 ± 6.71 min
SMAPE: 32.28 ± 9.40%
R²: 0.777 ± 0.061
Pairwise MAE test vs Sentence-BERT: p = 0.9679
🧠 Contextual Top Tier
Sentence-BERT
Sentence-level transformer embeddings optimized for semantic similarity — captures full narrative meaning. In the same fold-averaged encoder table, Sentence-BERT remains very close to ClinicalBERT: MAE 34.82 ± 7.01, SMAPE 32.41 ± 9.68, R² 0.775 ± 0.064.
Statistically comparable to ClinicalBERT
MAE: 34.82 ± 7.01 min
SMAPE: 32.41 ± 9.68%
R²: 0.775 ± 0.064
Pairwise MAE test vs ClinicalBERT: p = 0.9679
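For readers less familiar with the three metrics reported on these cards, here is a minimal sketch of how they are commonly computed. The SMAPE variant shown (mean of the two absolute values in the denominator) is an assumption, since the paper's exact definition is not given on this page. The durations are hypothetical.

```python
def mae(y_true, y_pred):
    """Mean absolute error, in the same units as the target (minutes here)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def smape(y_true, y_pred):
    """Symmetric MAPE in percent, using (|t| + |p|) / 2 as the denominator.

    Assumes durations are positive, so the denominator is never zero.
    """
    terms = [abs(t - p) / ((abs(t) + abs(p)) / 2) for t, p in zip(y_true, y_pred)]
    return 100.0 * sum(terms) / len(terms)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical durations in minutes, for illustration only.
actual = [60.0, 120.0, 240.0]
predicted = [70.0, 110.0, 200.0]

print(mae(actual, predicted))    # 20.0
print(smape(actual, predicted))  # roughly 14.1
print(r2(actual, predicted))     # roughly 0.893
```

MAE reads directly as minutes of error per case, SMAPE expresses error relative to case length, and R² describes how much of the variation in duration the model explains.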
Interactive Explorer
See How Each Encoding Strategy Reads a Surgical Note
Pick a real-world operative note type and watch how each encoding strategy extracts meaning — and how prediction accuracy differs.
[Interactive widget: for the selected note type, shows what TF-IDF sees (ranked key words), what Sentence-BERT sees (a sentence-level embedding), the predicted duration in minutes, and a chart of MAE by encoding strategy (Label / Count, TF-IDF, ClinicalBERT, Sentence-BERT; lower = better).]
Illustrative values based on relative cohort performance. Individual predictions require a deployed model.
Results Deep Dive
Performance by Encoding Strategy & Procedure
MAE by Encoder: Contextual embeddings (ClinicalBERT and Sentence-BERT) are near-tied and both outperform traditional encodings and structured-only baselines.
R² by Procedure Type: Embedding gains are largest for complex, variable-length procedures (Whipple, Colectomy) where operative narratives are richest in semantic content.
Accuracy Bands: Contextual embeddings achieve the highest share of predictions within ±30 minutes, supporting more reliable OR scheduling.
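The "within ±30 minutes" share is simple to compute once predictions exist. Below is a minimal sketch. The function name and the four example cases are hypothetical, and the 30-minute tolerance is taken from the band described above.

```python
def share_within(y_true, y_pred, tolerance_minutes=30.0):
    """Fraction of cases whose absolute prediction error is within the tolerance."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if abs(t - p) <= tolerance_minutes)
    return hits / len(y_true)


# Hypothetical durations in minutes, for illustration only.
actual = [60.0, 120.0, 240.0, 95.0]
predicted = [70.0, 155.0, 215.0, 90.0]

share = share_within(actual, predicted)
print(share)  # 0.75
```

Unlike MAE, this measure is not dragged down by a few very large misses, which is one reason it maps well onto scheduling decisions.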
So What?
Three Things You Can Take Away
🏥
Embeddings add real value
Adding transformer embeddings from clinical text improved prediction accuracy across all models tested — not just some.
🧠
Contextual embeddings lead
ClinicalBERT and Sentence-BERT both achieve top-tier performance in the published benchmark, with no statistically significant difference between them.
🔬
Traditional encodings fall short
Label encoding, count vectorization, and TF-IDF provided only limited benefit. Hospitals using simple encodings are leaving most accuracy gains on the table.
At a Glance
Quick read
The question: Which text-encoding strategy best extracts predictive signal from surgical notes?
The approach: Benchmark 5 strategies on 180,370 real cases with six ML models.
The answer: Contextual embeddings (ClinicalBERT and Sentence-BERT) achieved comparable top performance and reduced error by up to 16% versus structured-only baselines. Their fold-averaged metrics differ slightly, but the difference is not statistically significant.
Noorchenarboo M et al. (2026). Benchmarking text encoding strategies in multimodal clinical data for surgical case duration prediction. International Journal of Medical Informatics, 214, 106416. doi: 10.1016/j.ijmedinf.2026.106416