NLP · Clinical AI Preprint · SSRN 5489578

Do Clinical Transformer Embeddings
Actually Outpredict Traditional Encodings by 16%?

Surgeons write rich narratives about every operation. This study benchmarks five text-encoding strategies — from simple label encoding to clinical transformer embeddings — to find out which best converts those notes into accurate surgical duration predictions. Tested on 180,366 real cases across 3 hospitals, the gap between methods is larger than you'd expect.

SSRN Preprint · 2025 · Noorchenarboo et al. · 180,366 surgical cases · 3 tertiary hospitals · London, Canada
5 · Encoding strategies benchmarked head-to-head
180K+ · Real elective surgical cases from 3 tertiary hospitals
27.6 · Best MAE in minutes (Sentence-BERT embeddings)
16% · Maximum reduction in prediction error over traditional encodings
0.77 · Best R² achieved, by Sentence-BERT (SMAPE 23.0%)
The Core Idea

From a Surgeon's Words to a Prediction

After every surgery, the surgeon writes a narrative note describing what happened. This project is about one question: what's the smartest way to turn that text into numbers a prediction model can learn from? We tested five strategies — from simple label encoding to state-of-the-art clinical transformer embeddings.

📋
Operative Notes
Procedure descriptions & diagnoses (free text)
🔤
Text Encoding
5 strategies — from label encoding to Sentence-BERT
🧬
Feature Fusion
Merged with structured variables (age, BMI, ASA, sex…)
🤖
ML Models
Linear, tree-based ensembles & neural networks
⏱️
Surgery Duration
Predicted minutes from incision to close
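The five-step pipeline above can be sketched end to end. This is a minimal illustration assuming scikit-learn is available; the notes, structured variables, and durations are invented, and TF-IDF stands in here for any of the five encoding strategies:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import GradientBoostingRegressor

# Invented operative notes and outcomes, purely for shape
notes = [
    "laparoscopic cholecystectomy for symptomatic cholelithiasis",
    "open appendectomy for perforated appendicitis",
    "whipple procedure for pancreatic head mass",
    "laparoscopic appendectomy for acute appendicitis",
]
structured = np.array([  # age, BMI, ASA class
    [54, 29.1, 2],
    [31, 24.3, 3],
    [67, 26.8, 3],
    [22, 22.0, 1],
])
durations = np.array([45.0, 80.0, 360.0, 40.0])  # minutes

# 1) Text encoding (TF-IDF as a stand-in for any of the five strategies)
vec = TfidfVectorizer()
text_features = vec.fit_transform(notes).toarray()

# 2) Feature fusion: concatenate text features with structured variables
X = np.hstack([text_features, structured])

# 3) Tree-based ensemble predicting duration in minutes
model = GradientBoostingRegressor(n_estimators=50, random_state=0)
model.fit(X, durations)
pred = model.predict(X)
```

The same fusion step works unchanged whatever the encoder: each strategy just produces a different `text_features` matrix.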
💡

Why Do Embeddings Matter So Much?

A label encoder sees "laparoscopic cholecystectomy" as an arbitrary integer. A clinical transformer embedding understands it as a minimally invasive gallbladder removal with a predictable time profile. The encoding method determines how much medical meaning the model can extract — and our results show the gap is up to 16% in prediction accuracy.
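A toy numpy sketch of this point: the 4-dimensional "embedding" vectors below are invented, but they show the property that matters — embeddings give related procedures a measurable similarity, while arbitrary integers carry no notion of distance at all:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Invented 4-d "embeddings": two gallbladder procedures vs. a pancreas procedure
lap_chole  = np.array([0.9, 0.8, 0.1, 0.0])   # laparoscopic cholecystectomy
open_chole = np.array([0.8, 0.9, 0.2, 0.1])   # open cholecystectomy
whipple    = np.array([0.1, 0.0, 0.9, 0.8])   # pancreaticoduodenectomy

# Related procedures land close together; unrelated ones far apart
print(cosine(lap_chole, open_chole))  # high
print(cosine(lap_chole, whipple))     # low

# A label encoder would instead assign arbitrary integers (3, 7, 4, ...)
# whose numeric distances mean nothing clinically.
```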

Head-to-Head

Five Encoding Strategies Compared

🔢 Baseline
Label Encoding
Converts each category to a single integer. No semantic meaning — "appendectomy" becomes 3, "Whipple" becomes 9.
Relative performance
MAE improvement: Minimal
Medical knowledge: None
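A minimal sketch of this baseline, assuming scikit-learn's `LabelEncoder` (the procedure names are illustrative):

```python
from sklearn.preprocessing import LabelEncoder

procedures = ["appendectomy", "whipple", "cholecystectomy", "appendectomy"]
le = LabelEncoder()
codes = le.fit_transform(procedures)

# Codes are arbitrary integers (alphabetical order here); nothing about
# a Whipple taking far longer than an appendectomy survives the encoding.
print({c: int(i) for c, i in zip(le.classes_, le.transform(le.classes_))})
# {'appendectomy': 0, 'cholecystectomy': 1, 'whipple': 2}
```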
📊 Frequency
Count Vectorization
Counts raw word occurrences. Captures vocabulary but treats all words equally — "the" and "hemorrhage" weighted the same.
Relative performance
MAE improvement: Limited
Medical knowledge: None
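The equal-weighting problem is easy to see in a short scikit-learn sketch (the note text is invented):

```python
from sklearn.feature_extraction.text import CountVectorizer

note = ["the patient had the hemorrhage controlled with the stapler"]
vec = CountVectorizer()
counts = vec.fit_transform(note).toarray()[0]
vocab = vec.vocabulary_

# "the" (count 3) dominates "hemorrhage" (count 1): raw counts capture
# vocabulary but give no sense of which words matter clinically.
print(counts[vocab["the"]], counts[vocab["hemorrhage"]])
```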
⚖️ Weighted
TF-IDF
Weights words by rarity — rare clinical terms score higher. Better than raw counts, but still no understanding of meaning.
Relative performance
MAE improvement: Limited
Medical knowledge: None
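The rarity weighting shows up directly in a small scikit-learn sketch (invented notes): a term that appears in every document gets a minimal IDF, while a rare clinical term is boosted.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

notes = [
    "the hemorrhage was controlled with clips",
    "the gallbladder was removed without incident",
    "the appendix was inflamed and the mesoappendix was divided",
]
vec = TfidfVectorizer()
X = vec.fit_transform(notes).toarray()
v = vec.vocabulary_

# "the" occurs in every note (low IDF); "hemorrhage" occurs in one (high IDF),
# so within the first note it receives the larger weight.
print(X[0, v["hemorrhage"]], X[0, v["the"]])
```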
🏥 Clinical Transformer
ClinicalBERT
BERT fine-tuned on real clinical notes (MIMIC-III). Speaks the language of the OR — abbreviations, surgical slang, and clinical shorthand. Showed statistically significant gains over structured-only baseline (p < 0.01).
Relative performance
MAE improvement: Significant (p < 0.01)
Embedding type: Token-level (MIMIC-III)
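Being token-level, ClinicalBERT emits one hidden vector per token, so a fixed-size note representation must be pooled before fusion with structured features — mean-pooling is a common choice, though the paper's exact pooling is not stated here. A numpy sketch with invented 4-d token vectors (real embeddings would come from a MIMIC-III-trained checkpoint loaded via Hugging Face Transformers):

```python
import numpy as np

# One invented hidden vector per token of "laparoscopic cholecystectomy performed";
# a real ClinicalBERT would emit 768-d vectors instead of these 4-d toys.
token_vectors = np.array([
    [0.2, 0.7, 0.1, 0.4],   # "laparoscopic"
    [0.6, 0.1, 0.8, 0.3],   # "cholecystectomy"
    [0.1, 0.2, 0.1, 0.1],   # "performed"
])

# Mean-pool over tokens to get one fixed-size vector per note
note_vector = token_vectors.mean(axis=0)  # shape (4,), ready for feature fusion
```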
🚀 Best Encoder
Sentence-BERT
Sentence-level transformer embeddings optimised for semantic similarity — captures the full meaning of a surgical narrative, not just individual tokens. Best MAE of 27.6 min, SMAPE 23.0%, R² 0.77. Up to 16% reduction over traditional encodings.
Best performing encoder overall
MAE: 27.6 min (best)
SMAPE: 23.0%
R²: 0.77
Error reduction: Up to 16%
Interactive Explorer

See How Each Encoding Strategy Reads a Surgical Note

Pick a real-world operative note type and watch how each encoding strategy extracts meaning — and how prediction accuracy differs.

[Interactive widget: for a selected operative note type, the page shows the key words TF-IDF extracts (ranked), the sentence-level embedding Sentence-BERT produces, a predicted duration in minutes, and a bar chart of MAE by encoding strategy (Label / Count, TF-IDF, ClinicalBERT, Sentence-BERT; lower is better).]

Illustrative values based on relative cohort performance. Individual predictions require a deployed model.

Results Deep Dive

Performance by Encoding Strategy & Procedure

MAE by Encoder
Sentence-BERT achieves the lowest MAE (27.6 min), outperforming traditional encodings by up to 16%. ClinicalBERT also shows significant gains (p < 0.01).

R² by Procedure Type
Embedding gains are largest for complex, variable-length procedures (Whipple, colectomy), where operative narratives are richest in semantic content.

Accuracy Bands
Sentence-BERT achieves the highest share of predictions within ±30 minutes, directly enabling more reliable OR scheduling.
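For reference, the three metrics quoted throughout can be computed as follows. This is a generic numpy sketch with invented duration values; SMAPE here uses the common factor-of-2 symmetric form, which matches the ~23% scale reported but is an assumption about the paper's exact definition:

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error, in the same units as the target (minutes)."""
    return np.mean(np.abs(y_true - y_pred))

def smape(y_true, y_pred):
    """Symmetric mean absolute percentage error (factor-of-2 form), in %."""
    return 100 * np.mean(2 * np.abs(y_pred - y_true)
                         / (np.abs(y_true) + np.abs(y_pred)))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 minus residual over total variance."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

# Invented durations (minutes), purely to exercise the formulas
y_true = np.array([60.0, 90.0, 120.0])
y_pred = np.array([70.0, 85.0, 110.0])
print(mae(y_true, y_pred), smape(y_true, y_pred), r2(y_true, y_pred))
```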

So What?

Three Things You Can Take Away

🏥
Embeddings add real value
Adding transformer embeddings from clinical text improved prediction accuracy across all models tested — not just some.
🧠
Sentence context wins
Sentence-BERT — which encodes full sentence meaning — outperforms all other strategies, including token-level models like ClinicalBERT.
🔬
Traditional encodings fall short
Label encoding, count vectorization, and TF-IDF provided only limited benefit. Hospitals using simple encodings are leaving most accuracy gains on the table.