NLP · Clinical AI · Published · IJMI 2026

Benchmarking Text Encoding Strategies for Surgical Case Duration Prediction

Surgeons write rich narratives about every operation. This study benchmarks five text-encoding strategies in a multimodal pipeline that combines clinical text with structured perioperative variables for surgical duration prediction. Evaluated on 180,370 elective cases across 3 tertiary hospitals, contextual embeddings (ClinicalBERT and Sentence-BERT) achieved comparable top overall performance, with small metric differences depending on evaluation view.

Published in IJMI · 2026 · Noorchenarboo et al. · 180,370 surgical cases · 3 Tertiary Hospitals · London, Canada
5: Encoding strategies benchmarked head-to-head
180,370: Elective surgical cases from 3 tertiary hospitals
26.4: Best-model MAE in minutes (XGBoost + contextual embeddings)
16%: Reduction in prediction error vs structured-only baseline
0.86: Best R² achieved (SMAPE 21.6%)

The Core Idea

From a Surgeon's Words to a Prediction

After every surgery, the surgeon writes a narrative note describing what happened. This project is about one question: what's the smartest way to turn that text into numbers a prediction model can learn from? We tested five strategies, ranging from simple label encoding to state-of-the-art clinical transformer embeddings.

📋 Operative Notes: Procedure descriptions & diagnoses (free text)
🔤 Text Encoding: 5 strategies, from label encoding to Sentence-BERT
🧬 Feature Fusion: Merged with structured variables (age, BMI, ASA, sex…)
🤖 ML Models: Linear, tree-based ensembles & neural networks
⏱️ Surgery Duration: Predicted minutes from incision to close
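
The Feature Fusion step in the pipeline above can be pictured as plain concatenation: the text vector for a case is appended to its structured variables to form one feature row for the model. The sketch below is a minimal illustration under that assumption. The field names, their numeric encoding, and the 4-dimensional toy embedding are hypothetical and are not taken from the paper.

```python
from typing import Dict, List

# Hypothetical order of structured variables; the paper's feature set may differ.
STRUCTURED_FIELDS = ["age", "bmi", "asa_class", "sex_male"]


def fuse_features(structured: Dict[str, float], text_vector: List[float]) -> List[float]:
    """Concatenate structured variables and a text embedding into one feature row."""
    structured_part = [float(structured[name]) for name in STRUCTURED_FIELDS]
    return structured_part + [float(x) for x in text_vector]


# Toy case: illustrative values only, not real patient data.
case = {"age": 62, "bmi": 27.5, "asa_class": 2, "sex_male": 1}
toy_embedding = [0.12, -0.40, 0.33, 0.05]  # real embeddings have hundreds of dimensions

row = fuse_features(case, toy_embedding)
print(len(row))  # 8 features: 4 structured + 4 text
```

Any downstream regressor (linear, tree-based, or neural) would then be trained on rows of this shape.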

💡 Why Do Embeddings Matter So Much?

A label encoder sees "laparoscopic cholecystectomy" as an arbitrary integer. A clinical transformer embedding represents it as a dense vector positioned near related procedures, so the model can treat it as a minimally invasive gallbladder removal with a predictable time profile. The encoding method determines how much medical meaning the model can extract, and the published results show contextual embeddings materially improve prediction accuracy over structured-only inputs.

Head-to-Head

Five Encoding Strategies Compared

🔢 Baseline
Label Encoding
Converts each category to a single integer. No semantic meaning — "appendectomy" becomes 3, "Whipple" becomes 9.
Relative performance
MAE improvement: Minimal
Medical knowledge: None
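
To make the label-encoding card concrete, here is a minimal sketch of how categories become integers. The procedure names are illustrative, and assigning integers in sorted order is an assumption for this example; the study's actual mapping is not described here.

```python
from typing import Dict, List


def fit_label_encoder(categories: List[str]) -> Dict[str, int]:
    """Map each distinct category to an integer, assigned in sorted order."""
    return {name: index for index, name in enumerate(sorted(set(categories)))}


# Illustrative procedure labels (not taken from the study data).
procedures = ["Whipple", "appendectomy", "laparoscopic cholecystectomy", "appendectomy"]
mapping = fit_label_encoder(procedures)
encoded = [mapping[p] for p in procedures]

print(mapping)
print(encoded)  # [0, 1, 2, 1]
```

The integers are identifiers only: code 2 is not "twice" code 1, and nothing in the numbers says that two procedures are clinically related.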
📊 Frequency
Count Vectorization
Counts raw word occurrences. Captures vocabulary but treats all words equally — "the" and "hemorrhage" weighted the same.
Relative performance
MAE improvement: Limited
Medical knowledge: None
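
The count-vectorization idea can be sketched in a few lines of plain Python. The two notes below are made up for illustration, and the lowercase letters-only tokenizer is a simplifying assumption; a production pipeline would more likely use a library vectorizer.

```python
import re
from collections import Counter
from typing import List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase and split into alphabetic tokens (a simplifying assumption)."""
    return re.findall(r"[a-z]+", text.lower())


def count_vectorize(docs: List[str]) -> Tuple[List[str], List[List[int]]]:
    """Return a sorted vocabulary and one raw-count vector per document."""
    vocab = sorted({tok for doc in docs for tok in tokenize(doc)})
    vectors = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors


# Hypothetical note fragments, not real operative notes.
notes = [
    "the gallbladder was removed",
    "the hemorrhage was controlled and the wound was closed",
]
vocab, vectors = count_vectorize(notes)
print(vocab)
print(vectors[1])
```

In the second note, "the" and "was" each receive a count of 2 while "hemorrhage" receives 1, so the filler words end up with more weight than the clinically informative one.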
⚖️ Weighted
TF-IDF
Weights words by rarity — rare clinical terms score higher. Better than count, but still no understanding of meaning.
Relative performance
MAE improvement: Limited
Medical knowledge: None
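
A rough sketch of the TF-IDF weighting follows, using the textbook form: term frequency times the log of inverse document frequency. This is an assumption for illustration; library implementations often use smoothed variants, and the exact formulation used in the study is not given here. The notes are hypothetical.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase and split into alphabetic tokens (a simplifying assumption)."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf(docs: List[str]) -> List[Dict[str, float]]:
    """Unsmoothed TF-IDF: (count / doc length) * ln(n_docs / doc frequency)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    weights = []
    for toks in tokenized:
        counts = Counter(toks)
        weights.append({
            term: (count / len(toks)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical note fragments, not real operative notes.
notes = [
    "the gallbladder was removed",
    "the hemorrhage was controlled",
    "the wound was closed",
]
weights = tfidf(notes)
print(weights[1])
```

Words that occur in every note ("the", "was") get weight 0, while "hemorrhage" keeps a positive weight. The weighting still knows nothing about what the word means.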
🏥 Clinical Transformer
ClinicalBERT
BERT further pre-trained on real clinical notes (MIMIC-III). In the paper's fold-averaged encoder table, ClinicalBERT is numerically slightly ahead of Sentence-BERT: MAE 34.66 ± 6.71 min, SMAPE 32.28 ± 9.40%, R² 0.777 ± 0.061. Pairwise tests still show no significant difference.
Statistically comparable to Sentence-BERT
MAE: 34.66 ± 6.71 min
SMAPE: 32.28 ± 9.40%
R²: 0.777 ± 0.061
Pairwise MAE test vs Sentence-BERT: p = 0.9679
🧠 Contextual Top Tier
Sentence-BERT
Sentence-level transformer embeddings optimized for semantic similarity, designed to capture the meaning of a whole narrative rather than individual words. In the same fold-averaged encoder table, Sentence-BERT remains very close to ClinicalBERT: MAE 34.82 ± 7.01 min, SMAPE 32.41 ± 9.68%, R² 0.775 ± 0.064.
Statistically comparable to ClinicalBERT
MAE: 34.82 ± 7.01 min
SMAPE: 32.41 ± 9.68%
R²: 0.775 ± 0.064
Pairwise MAE test vs ClinicalBERT: p = 0.9679
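
Sentence-level embeddings are commonly built by pooling a transformer's token vectors into a single vector, which can then be compared by cosine similarity. The sketch below shows mean pooling and cosine similarity on hypothetical 3-dimensional token vectors. It does not call a real model, and the pooling strategy used in the study is not stated here, so treat it as an illustration of the general idea only.

```python
import math
from typing import List

Vector = List[float]


def mean_pool(token_vectors: List[Vector]) -> Vector:
    """Average token vectors component-wise into one sentence vector."""
    n = len(token_vectors)
    return [sum(component) / n for component in zip(*token_vectors)]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical token vectors; a real encoder would produce hundreds of dimensions.
note_a = mean_pool([[1.0, 0.0, 1.0], [3.0, 2.0, 1.0]])  # -> [2.0, 1.0, 1.0]
note_b = mean_pool([[2.0, 1.0, 1.0], [2.0, 1.0, 1.0]])  # -> [2.0, 1.0, 1.0]
note_c = mean_pool([[0.0, 4.0, 0.0], [0.0, 2.0, 0.0]])  # -> [0.0, 3.0, 0.0]

print(cosine_similarity(note_a, note_b))
print(cosine_similarity(note_a, note_c))
```

Notes that land on similar pooled vectors score close to 1, which is what lets a downstream model treat differently worded but similar procedures alike.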
Interactive Explorer

See How Each Encoding Strategy Reads a Surgical Note

Pick a real-world operative note type and watch how each encoding strategy extracts meaning — and how prediction accuracy differs.

[Interactive demo: a TF-IDF view (ranked key words), a Sentence-BERT view (sentence-level embedding), a predicted duration in minutes, and a bar chart of MAE by encoding strategy (lower = better) for Label / Count, TF-IDF, ClinicalBERT, and Sentence-BERT.]

Illustrative values based on relative cohort performance. Individual predictions require a deployed model.

Results Deep Dive

Performance by Encoding Strategy & Procedure

MAE by Encoder: Contextual embeddings (ClinicalBERT and Sentence-BERT) are near-tied and both outperform traditional encodings and structured-only baselines.

R² by Procedure Type: Embedding gains are largest for complex procedures with highly variable durations (Whipple, colectomy), where operative narratives are richest in semantic content.

Accuracy Bands: Contextual embeddings achieve the highest share of predictions within ±30 minutes, supporting more reliable OR scheduling.
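
For readers who want to reproduce the kinds of metrics reported on this page, here is a minimal sketch of MAE, SMAPE, R², and the share of predictions within ±30 minutes. The durations are invented for illustration. SMAPE has several variants, and the one below (absolute error divided by the mean of the absolute actual and predicted values) is an assumption, not necessarily the paper's definition.

```python
from typing import List


def mae(y_true: List[float], y_pred: List[float]) -> float:
    """Mean absolute error, in the same units as the inputs (minutes)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def smape(y_true: List[float], y_pred: List[float]) -> float:
    """Symmetric MAPE in percent; assumes durations are strictly positive."""
    terms = [abs(t - p) / ((abs(t) + abs(p)) / 2) for t, p in zip(y_true, y_pred)]
    return 100 * sum(terms) / len(terms)


def r2(y_true: List[float], y_pred: List[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def share_within(y_true: List[float], y_pred: List[float], tolerance: float = 30.0) -> float:
    """Fraction of cases whose absolute error is at most `tolerance` minutes."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if abs(t - p) <= tolerance)
    return hits / len(y_true)


# Invented durations in minutes, for illustration only.
actual = [60.0, 120.0, 240.0, 90.0]
predicted = [70.0, 100.0, 280.0, 90.0]

print(mae(actual, predicted))           # 17.5
print(share_within(actual, predicted))  # 0.75
print(round(smape(actual, predicted), 2))
print(round(r2(actual, predicted), 3))
```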

So What?

Three Things You Can Take Away

🏥 Embeddings add real value
Adding transformer embeddings from clinical text improved prediction accuracy across all models tested — not just some.
🧠 Contextual embeddings lead
ClinicalBERT and Sentence-BERT both achieve top-tier performance in the published benchmark, with no statistically significant difference between them.
🔬 Traditional encodings fall short
Label encoding, count vectorization, and TF-IDF provided only limited benefit. Hospitals using simple encodings are leaving most accuracy gains on the table.