Do Clinical Transformer Embeddings Actually Outpredict Traditional Encodings by 16%?
Surgeons write rich narratives about every operation. This study benchmarks five text-encoding strategies — from simple label encoding to clinical transformer embeddings — to find out which best converts those notes into accurate surgical duration predictions. Tested on 180,366 real cases across 3 hospitals, the gap between methods is larger than you'd expect.
Real elective surgical cases from 3 tertiary hospitals
27.6
Best MAE in minutes — Sentence-BERT embeddings
16%
Max reduction in prediction error over traditional encodings
0.77
Best R² achieved — Sentence-BERT (SMAPE 23.0%)
The Core Idea
From a Surgeon's Words to a Prediction
After every surgery, the surgeon writes a narrative note describing what happened. This project is about one question: what's the smartest way to turn that text into numbers a prediction model can learn from? We tested five strategies, from simple label encoding to state-of-the-art clinical transformer embeddings.
📋
Operative Notes
Procedure descriptions & diagnoses (free text)
🔤
Text Encoding
5 strategies — from label encoding to Sentence-BERT
🧬
Feature Fusion
Merged with structured variables (age, BMI, ASA, sex…)
🤖
ML Models
Linear, tree-based ensembles & neural networks
⏱️
Surgery Duration
Predicted minutes from incision to close
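The feature-fusion step in the pipeline above can be sketched in a few lines. This is a minimal, illustrative stand-in: the hash-based text encoder below is not one of the study's five strategies, it just produces a fixed-length vector so that the concatenation with structured variables (age, BMI, ASA) is concrete.

```python
import hashlib

def toy_text_vector(note: str, dim: int = 8) -> list[float]:
    """Stand-in text encoder: hashes each token into a fixed-length vector.
    Purely illustrative -- the study used label/count/TF-IDF/BERT encoders."""
    vec = [0.0] * dim
    for token in note.lower().split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    return vec

def fuse(note: str, structured: dict) -> list[float]:
    """Concatenate the note vector with structured variables before regression."""
    return toy_text_vector(note) + [structured["age"], structured["bmi"], structured["asa"]]

features = fuse("laparoscopic cholecystectomy for acute cholecystitis",
                {"age": 54, "bmi": 31.2, "asa": 2})
print(len(features))  # 8 text dims + 3 structured = 11
```

Any downstream regressor (linear, tree ensemble, or neural network) then trains on this fused vector to predict minutes from incision to close.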
💡
Why Do Embeddings Matter So Much?
A label encoder sees "laparoscopic cholecystectomy" as an arbitrary integer. A clinical transformer embedding understands it as a minimally invasive gallbladder removal with a predictable time profile. The encoding method determines how much medical meaning the model can extract, and our results show the gap: up to a 16% reduction in prediction error.
Head-to-Head
Five Encoding Strategies Compared
🔢 Baseline
Label Encoding
Converts each category to a single integer. No semantic meaning — "appendectomy" becomes 3, "Whipple" becomes 9.
Relative performance
MAE improvement: Minimal
Medical knowledge: None
📊 Frequency
Count Vectorization
Counts raw word occurrences. Captures vocabulary but treats all words equally: "the" and "hemorrhage" receive the same kind of weight.
Relative performance
MAE improvement: Limited
Medical knowledge: None
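A toy count-vectorization sketch over two invented notes (scikit-learn's CountVectorizer does the same at scale). Raw counts show the flaw directly: a filler word and a critical clinical term can score identically.

```python
from collections import Counter

# Count vectorization: raw term counts over a shared vocabulary.
notes = [
    "the patient underwent laparoscopic cholecystectomy the procedure was uneventful",
    "open cholecystectomy converted after hemorrhage the hemorrhage was controlled",
]

vocab = sorted(set(" ".join(notes).split()))
vectors = [[Counter(n.split())[w] for w in vocab] for n in notes]

# "the" and "hemorrhage" end up with the same raw count -- no notion of importance
print(vectors[1][vocab.index("hemorrhage")])  # 2
print(vectors[0][vocab.index("the")])         # 2
```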
⚖️ Weighted
TF-IDF
Weights words by rarity, so rare clinical terms score higher. Better than raw counts, but still encodes no understanding of meaning.
Relative performance
MAE improvement: Limited
Medical knowledge: None
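A minimal TF-IDF sketch on three invented notes: term frequency weighted by inverse document frequency, so a word appearing in every document scores zero while a rare clinical term gets a positive weight.

```python
import math
from collections import Counter

docs = [
    "the patient underwent appendectomy",
    "the patient underwent whipple procedure",
    "the hemorrhage required transfusion",
]

def tfidf(term: str, doc: str, corpus: list[str]) -> float:
    tf = Counter(doc.split())[term] / len(doc.split())
    df = sum(term in d.split() for d in corpus)
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

print(round(tfidf("the", docs[2], docs), 3))         # 0.0 -- appears in every doc
print(round(tfidf("hemorrhage", docs[2], docs), 3))  # rare term, positive weight
```

Rarity weighting is still purely statistical: "cholecystectomy" and "gallbladder removal" remain unrelated strings to TF-IDF.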
🏥 Clinical Transformer
ClinicalBERT
BERT fine-tuned on real clinical notes (MIMIC-III). Speaks the language of the OR: abbreviations, surgical slang, and clinical shorthand. Showed statistically significant gains over a structured-only baseline (p < 0.01).
Relative performance
MAE improvement: Significant (p < 0.01)
Embedding type: Token-level (MIMIC-III)
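Because ClinicalBERT produces one embedding per token, the token vectors must be pooled into a single fixed-length note vector before tabular fusion. Below is a sketch of mean pooling with made-up 4-dim vectors standing in for real BERT outputs; in practice you would load a clinical BERT checkpoint via the HuggingFace transformers library.

```python
# Invented token vectors -- stand-ins for ClinicalBERT's per-token outputs.
token_embeddings = {
    "lap":   [0.9, 0.1, 0.0, 0.2],
    "chole": [0.8, 0.2, 0.1, 0.1],
    "s/p":   [0.1, 0.7, 0.3, 0.0],  # clinical shorthand a general-domain model may miss
}

def mean_pool(tokens: list[str]) -> list[float]:
    """Average token vectors column-wise into one note-level vector."""
    vecs = [token_embeddings[t] for t in tokens]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

note_vector = [round(x, 2) for x in mean_pool(["lap", "chole"])]
print(note_vector)  # [0.85, 0.15, 0.05, 0.15]
```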
🚀 Best Encoder
Sentence-BERT
Sentence-level transformer embeddings optimized for semantic similarity capture the full meaning of a surgical narrative, not just individual tokens. Best MAE of 27.6 min, SMAPE 23.0%, R² 0.77, and up to a 16% error reduction over traditional encodings.
Best performing encoder overall
MAE: 27.6 min (best)
SMAPE: 23.0%
R²: 0.77
Error reduction: Up to 16%
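A sketch of why sentence-level similarity helps, using invented 3-dim vectors in place of real Sentence-BERT embeddings (an actual pipeline would call model.encode() from the sentence-transformers package on each note). Paraphrased descriptions of the same operation land close together; an unrelated procedure lands far away.

```python
import math

embeddings = {
    "laparoscopic cholecystectomy":           [0.92, 0.31, 0.10],
    "minimally invasive gallbladder removal": [0.88, 0.40, 0.15],
    "pancreaticoduodenectomy (Whipple)":      [0.05, 0.20, 0.95],
}

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

sim_same = cosine(embeddings["laparoscopic cholecystectomy"],
                  embeddings["minimally invasive gallbladder removal"])
sim_diff = cosine(embeddings["laparoscopic cholecystectomy"],
                  embeddings["pancreaticoduodenectomy (Whipple)"])
print(sim_same > sim_diff)  # True: paraphrases land near each other
```

That geometric closeness is exactly the medical meaning label encoding, counts, and TF-IDF cannot represent.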
Interactive Explorer
See How Each Encoding Strategy Reads a Surgical Note
Pick a real-world operative note type and watch how each encoding strategy extracts meaning — and how prediction accuracy differs.
TF-IDF sees: Key words (ranked)
Sentence-BERT sees: Sentence-level embedding
Predicted Duration
--
minutes
MAE by encoding strategy (lower = better)
Label / Count
TF-IDF
ClinicalBERT
Sentence-BERT
Illustrative values based on relative cohort performance. Individual predictions require a deployed model.
Results Deep Dive
Performance by Encoding Strategy & Procedure
MAE by Encoder
R² by Procedure Type
Accuracy Bands
Sentence-BERT achieves the lowest MAE (27.6 min) — outperforming traditional encodings by up to 16%. ClinicalBERT also shows significant gains (p < 0.01).
Embedding gains are largest for complex, variable-length procedures (Whipple, Colectomy) where operative narratives are richest in semantic content.
Sentence-BERT achieves the highest share of predictions within ±30 minutes, directly enabling more reliable OR scheduling.
So What?
Three Things You Can Take Away
🏥
Embeddings add real value
Adding transformer embeddings from clinical text improved prediction accuracy across all models tested — not just some.
🧠
Sentence context wins
Sentence-BERT, which encodes full-sentence meaning, outperforms every other strategy, including token-level models like ClinicalBERT.
🔬
Traditional encodings fall short
Label encoding, count vectorization, and TF-IDF provided only limited benefit. Hospitals using simple encodings are leaving most accuracy gains on the table.
At a Glance
Quick read
The question: Which text-encoding strategy best extracts predictive signal from surgical notes? The approach: Benchmark 5 strategies on 180,366 real cases. The answer: Sentence-BERT transformer embeddings reduce MAE to 27.6 min and cut prediction error by up to 16% over traditional encodings.
Noorchenarboo M et al. (2025). Benchmarking Text Encoding Strategies in Multimodal Clinical Data for Surgical Case Duration Prediction. SSRN. doi: 10.2139/ssrn.5489578