Benchmarking Text Encoding Strategies for Surgical Duration Prediction · International Journal of Medical Informatics

The Core Idea

From a Surgeon's Words to a Prediction

After every surgery, the surgeon writes a narrative note describing what happened. This project asks one question: what is the most effective way to turn that text into numbers a prediction model can learn from? We tested five strategies, ranging from simple label encoding to state-of-the-art clinical transformer embeddings.

📋

Operative Notes

Procedure descriptions & diagnoses (free text)

🔤

Text Encoding

5 strategies — from label encoding to Sentence-BERT

🧬

Feature Fusion

Merged with structured variables (age, BMI, ASA, sex…)

🤖

ML Models

Linear, tree-based ensembles & neural networks

⏱️

Surgery Duration

Predicted minutes from incision to close
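The fusion step in this pipeline can be pictured as simple concatenation: the text vector for a note is placed next to the structured variables for the same case. The sketch below is a minimal illustration, not the paper's actual pipeline. The helper name `fuse_features`, the field order, and the numeric coding of sex are all assumptions made for the example.

```python
from typing import Dict, List

# Hypothetical field order; the source lists age, BMI, ASA, and sex among the variables.
STRUCTURED_FIELDS = ["age", "bmi", "asa", "sex"]


def fuse_features(text_vector: List[float], structured: Dict[str, float]) -> List[float]:
    """Concatenate a text encoding with structured variables into one feature row."""
    structured_part = [float(structured[name]) for name in STRUCTURED_FIELDS]
    return list(text_vector) + structured_part


# Illustrative values only: a 3-dimensional text vector and one made-up patient.
row = fuse_features([0.12, -0.40, 0.88], {"age": 64, "bmi": 27.5, "asa": 2, "sex": 1})
print(row)  # [0.12, -0.4, 0.88, 64.0, 27.5, 2.0, 1.0]
```

Any of the five text encodings can fill the `text_vector` slot, which is what makes a like-for-like comparison between them possible.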

Head-to-Head

Five Encoding Strategies Compared

🔢 Baseline

Label Encoding

Converts each category to a single integer. No semantic meaning — "appendectomy" becomes 3, "Whipple" becomes 9.

Relative performance

MAE improvement: Minimal

Medical knowledge: None
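As a rough illustration of label encoding, the sketch below maps each distinct procedure name to an integer. The helper `fit_label_encoder` and the sorted-order assignment are assumptions for the example; the specific integers differ from the "3" and "9" mentioned above because they depend entirely on which categories are present.

```python
from typing import Dict, List


def fit_label_encoder(categories: List[str]) -> Dict[str, int]:
    """Map each distinct category to an integer, assigned in sorted order."""
    return {name: index for index, name in enumerate(sorted(set(categories)))}


procedures = ["Whipple", "appendectomy", "colectomy", "appendectomy"]
mapping = fit_label_encoder(procedures)
encoded = [mapping[name] for name in procedures]

print(mapping)  # {'Whipple': 0, 'appendectomy': 1, 'colectomy': 2}
print(encoded)  # [0, 1, 2, 1]
```

The integers are arbitrary identifiers: nothing about 0 versus 2 reflects how similar or how long the procedures are, which is why this encoding carries no semantic meaning.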

📊 Frequency

Count Vectorization

Counts raw word occurrences. Captures vocabulary but treats all words equally — "the" and "hemorrhage" weighted the same.

Relative performance

MAE improvement: Limited

Medical knowledge: None
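A minimal sketch of count vectorization, assuming a simple lowercase alphabetic tokenizer and two made-up note fragments (the helper names are hypothetical, and production libraries offer more options than this):

```python
import re
from collections import Counter
from typing import List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def count_vectorize(notes: List[str]) -> Tuple[List[str], List[List[int]]]:
    """Return the sorted vocabulary and one raw count vector per note."""
    vocabulary = sorted({token for note in notes for token in tokenize(note)})
    vectors = []
    for note in notes:
        counts = Counter(tokenize(note))
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors


# Made-up note fragments, for illustration only.
notes = [
    "the hemorrhage was controlled",
    "the appendix was removed and the wound closed",
]
vocabulary, vectors = count_vectorize(notes)
print(vocabulary)
print(vectors)
```

In the first vector, "the" and "hemorrhage" both receive a count of 1, so the model sees no difference in importance between them.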

⚖️ Weighted

TF-IDF

Weights each word by how often it appears in a note, discounted by how common it is across all notes, so rare clinical terms score higher. More informative than raw counts, but still no understanding of meaning.

Relative performance

MAE improvement: Limited

Medical knowledge: None
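The weighting idea can be sketched with the textbook form of TF-IDF, where a term's count in a note is multiplied by log(N / document frequency). This is an assumption made for clarity: libraries typically apply smoothed or normalized variants, and the notes below are made up.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf(notes: List[str]) -> List[Dict[str, float]]:
    """Weight each term by its count in the note times log(N / document frequency)."""
    tokenized = [tokenize(note) for note in notes]
    n_notes = len(tokenized)
    # Document frequency: the number of notes in which each term appears at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({t: c * math.log(n_notes / doc_freq[t]) for t, c in counts.items()})
    return weights


# Made-up note fragments, for illustration only.
notes = [
    "the hemorrhage was controlled",
    "the appendix was removed",
    "the wound was closed",
]
weights = tfidf(notes)
print(weights[0])
```

Because "the" and "was" occur in every note, their weight is log(3/3) = 0, while "hemorrhage" occurs in one note and gets log(3), roughly 1.10. The weighting still works on surface word forms, so synonyms and context are invisible to it.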

🏥 Clinical Transformer

ClinicalBERT

BERT with additional pre-training on real clinical notes (MIMIC-III). In the paper's fold-averaged encoder table, ClinicalBERT scores marginally better than Sentence-BERT: MAE 34.66 ± 6.71 min, SMAPE 32.28 ± 9.40%, R² 0.777 ± 0.061. The gap is far smaller than the fold-to-fold spread, and the pairwise MAE test shows no significant difference (p = 0.9679).

Statistically comparable to Sentence-BERT

MAE: 34.66 ± 6.71 min

SMAPE: 32.28 ± 9.40%

R²: 0.777 ± 0.061

Pairwise MAE test vs Sentence-BERT: p = 0.9679
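Fold-averaged figures such as "34.66 ± 6.71" are typically obtained by computing the metric on each fold and reporting the mean and standard deviation across folds. The sketch below uses made-up per-fold values, and the choice of the sample standard deviation is an assumption, since the page does not say which form the paper uses.

```python
import statistics
from typing import List, Tuple


def summarize_folds(fold_scores: List[float]) -> Tuple[float, float]:
    """Return the mean and sample standard deviation of a metric across folds."""
    return statistics.mean(fold_scores), statistics.stdev(fold_scores)


# Hypothetical per-fold MAE values in minutes, for illustration only.
fold_mae = [30.0, 34.0, 38.0]
mean_mae, std_mae = summarize_folds(fold_mae)
print(f"MAE {mean_mae:.2f} ± {std_mae:.2f} min")  # MAE 34.00 ± 4.00 min
```

When the spread across folds is several minutes, a difference of a fraction of a minute between two encoders is well inside the noise, which is consistent with the non-significant pairwise test reported above.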

🧠 Contextual Top Tier

Sentence-BERT

Sentence-level transformer embeddings optimized for semantic similarity, so each note is represented by a single vector that reflects the narrative as a whole. In the same fold-averaged encoder table, Sentence-BERT lands very close to ClinicalBERT: MAE 34.82 ± 7.01 min, SMAPE 32.41 ± 9.68%, R² 0.775 ± 0.064.

Statistically comparable to ClinicalBERT

MAE: 34.82 ± 7.01 min

SMAPE: 32.41 ± 9.68%

R²: 0.775 ± 0.064

Pairwise MAE test vs ClinicalBERT: p = 0.9679
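Sentence-level embeddings are commonly built by pooling a transformer's per-token vectors into one fixed-length vector, with mean pooling being a usual choice for Sentence-BERT style models. The sketch below shows that pooling step and a cosine similarity check on made-up 3-dimensional vectors. It does not call a real model, and the helper names are hypothetical.

```python
import math
from typing import List


def mean_pool(token_vectors: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average the token vectors whose mask is 1, ignoring padding positions."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up token vectors (not real model output); the last position is padding.
tokens = [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0], [9.0, 9.0, 9.0]]
mask = [1, 1, 0]
note_vector = mean_pool(tokens, mask)
print(note_vector)  # [2.0, 1.0, 1.0]
```

The resulting note vector has the same length regardless of how long the note is, so it can be fed to a downstream model alongside structured variables.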

Interactive Explorer

See How Each Encoding Strategy Reads a Surgical Note

Pick a real-world operative note type and watch how each encoding strategy extracts meaning — and how prediction accuracy differs.

[Interactive panel: for the selected note, shows the key words TF-IDF ranks highest, the sentence-level embedding Sentence-BERT produces, the predicted duration in minutes, and MAE by encoding strategy (lower = better) for Label / Count, TF-IDF, ClinicalBERT, and Sentence-BERT.]

Illustrative values based on relative cohort performance. Individual predictions require a deployed model.

Results Deep Dive

Performance by Encoding Strategy & Procedure

MAE by Encoder

Contextual embeddings (ClinicalBERT and Sentence-BERT) are near-tied and both outperform traditional encodings and structured-only baselines.

R² by Procedure Type

Embedding gains are largest for complex, variable-length procedures (Whipple, Colectomy) where operative narratives are richest in semantic content.

Accuracy Bands

Contextual embeddings achieve the highest share of predictions within ±30 minutes, supporting more reliable OR scheduling.
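For readers who want to see how the reported quantities relate, the sketch below computes MAE, SMAPE, R², and the share of predictions within ±30 minutes on made-up durations. SMAPE has several definitions in circulation; the form used here (absolute error divided by the mean of the absolute actual and predicted values) is an assumption, as is the helper name `duration_metrics`.

```python
from typing import Dict, List


def duration_metrics(actual: List[float], predicted: List[float],
                     band: float = 30.0) -> Dict[str, float]:
    """Compute MAE, SMAPE (%), R², and the share of predictions within ±band minutes."""
    n = len(actual)
    abs_errors = [abs(a - p) for a, p in zip(actual, predicted)]
    mae = sum(abs_errors) / n
    # SMAPE in one common form: |a - p| / ((|a| + |p|) / 2), averaged, as a percentage.
    smape = 100.0 * sum(
        e / ((abs(a) + abs(p)) / 2) for e, a, p in zip(abs_errors, actual, predicted)
    ) / n
    mean_actual = sum(actual) / n
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1.0 - ss_res / ss_tot
    within_band = sum(e <= band for e in abs_errors) / n
    return {"mae": mae, "smape": smape, "r2": r2, "within_band": within_band}


# Made-up durations in minutes, for illustration only.
actual = [60.0, 120.0, 300.0, 90.0]
predicted = [70.0, 100.0, 340.0, 90.0]
metrics = duration_metrics(actual, predicted)
print(metrics["mae"], metrics["within_band"])  # 17.5 0.75
```

The within-band share is the most direct link to scheduling: it answers how often a predicted slot would have been off by no more than half an hour.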

So What?

Three Things You Can Take Away

🏥

Embeddings add real value

Adding transformer embeddings from clinical text improved prediction accuracy for every model tested.

🧠

Contextual embeddings lead

ClinicalBERT and Sentence-BERT both achieve top-tier performance in the published benchmark, with no statistically significant difference between them.

🔬

Traditional encodings fall short

Label encoding, count vectorization, and TF-IDF provided only limited benefit. Hospitals using simple encodings are leaving most accuracy gains on the table.
