Healthcare AI · OR Scheduling · Published Paper · Surgical Endoscopy 2025

Can AI Predict How Long a Surgery Will Take
Better Than the Surgeon?

Yes — and by a meaningful margin. Using 17,246 real surgical records, we trained an AI that predicts operating room duration with near-zero systematic bias, while surgeon estimates consistently run 18+ minutes short every single day.

Kwong · Noorchenarboo · Grolinger · Hawel · Schlachta · Elnahas
17,246 procedures · 3 hospitals · 5 years · DOI: 10.1007/s00464-025-11885-0
17,246
Real surgical records across 3 tertiary hospitals
8
ML models + surgeon baseline compared head-to-head
31.8
Best MAE in minutes — Neural Network (ANN)
18+ min
Average surgeon underestimation eliminated by AI
−0.37
Bias achieved by ANN — statistically zero (p = 0.34)

① The Problem

Operating rooms run late every day — and it's costing everyone.

18 min

Average scheduling error

Surgeons estimate case duration from memory and experience. Their guesses are systematically off by over 18 minutes — always in the same direction: too short.

$1,000s

Per hour of OR time wasted

Operating rooms are the most expensive resource in a hospital. One overrun case delays the next patient, causes staff overtime, and can cancel planned surgeries.

No two

Patients are the same

A hernia repair in a healthy 30-year-old takes far less time than in a 70-year-old with diabetes and heart disease. Traditional scheduling ignores these differences entirely.


② The Data

5 years of real surgery records from 3 hospitals.

17,246
Surgical procedures used to train & test our AI — every one a real case
16,159
Unique patients
3
Academic hospitals in London, Canada
2015–2020
5-year retrospective window
11
Procedure types studied
20
Input variables per case
56.9 yrs
Average patient age

Cases by procedure type


③ How It Works

From hospital records to accurate predictions — 4 steps.

🗄️

Collect & Clean

5 years of de-identified records. Outliers removed. Only elective adult cases included. 17,246 clean records.

🔍

Extract Patient Signals

20 pre-surgery variables: age, health complexity (ASA score), BMI, procedure type, surgeon, and day of week.

🤖

Train 8 AI Models

From simple statistics to deep neural networks. Each tested with 10-fold cross-validation to ensure fairness.

📊

Benchmark vs. Surgeons

Every AI model compared directly against real surgeon estimates on 5 accuracy and bias metrics.

Data Sources
  • Cerner EMR system
  • 3 tertiary hospitals
  • Elective cases only
  • No emergency or cancelled cases
  • Adults (18+) only
Key Inputs Used
  • Procedure type & approach
  • ASA health complexity score
  • Patient age & BMI
  • Surgeon identity
  • First/last case of day
  • Day of week & ICD code
8 Models Tested
  • Linear / Ridge / Lasso
  • Support Vector Regression
  • Random Forest
  • Gradient Boosting (GBM)
  • XGBoost
  • Neural Network (ANN) ✓ Best
Validation Approach
  • 10-fold cross-validation
  • Per-fold data preprocessing
  • Bayesian hyperparameter tuning
  • Statistical bias testing
  • TRIPOD-AI reporting standard
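The benchmarking setup described above — multiple regressors evaluated with 10-fold cross-validation, with preprocessing refit inside each fold — can be sketched as follows. This is a minimal illustration using scikit-learn on synthetic stand-in data; the variable names, feature mix, and model settings are assumptions for demonstration, not the paper's actual code or dataset.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
n = 500
# Synthetic stand-ins for pre-surgery inputs: age, BMI, ASA score,
# encoded procedure type. Real features would include many more columns.
X = np.column_stack([
    rng.normal(57, 15, n),    # patient age (years)
    rng.normal(28, 5, n),     # BMI
    rng.integers(1, 5, n),    # ASA score (1-4)
    rng.integers(0, 11, n),   # procedure type (integer-encoded)
])
# Synthetic "true" OR durations in minutes, driven by the features plus noise.
y = 60 + 20.0 * X[:, 2] + 7.5 * X[:, 3] + rng.normal(0, 30, n)

models = {
    "ridge": make_pipeline(StandardScaler(), Ridge()),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "gbm": GradientBoostingRegressor(random_state=0),
    "ann": make_pipeline(StandardScaler(),
                         MLPRegressor(hidden_layer_sizes=(32,),
                                      max_iter=1000, random_state=0)),
}

cv = KFold(n_splits=10, shuffle=True, random_state=0)
results = {}
for name, model in models.items():
    # Scaling lives inside the pipeline, so it is refit on each training
    # fold -- no information leaks into the held-out fold.
    mae = -cross_val_score(model, X, y, cv=cv,
                           scoring="neg_mean_absolute_error").mean()
    results[name] = round(mae, 1)
print(results)
```

Keeping the preprocessing inside the pipeline is what the "per-fold data preprocessing" bullet refers to: fitting the scaler on all 17,246 cases before splitting would let test-fold statistics leak into training.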

④ Results

The AI eliminates scheduling bias — outperforming surgeons on every metric.

🤖
AI Model — Neural Network
−0.37 min
Average prediction error — essentially zero bias. The model makes mistakes in both directions equally, with no systematic drift. Statistically unbiased (p = 0.34).
👨‍⚕️
Surgeon Estimate — Current Practice
−18.52 min
Average prediction error — surgeons consistently underestimate how long cases take, causing schedules to run over every single day. Statistically significant (p < 0.001).
31.8 min
Average Error (MAE)
vs 35.3 min for surgeons
= 10% more accurate
26%
% Error (MAPE)
vs 34% for surgeons
= 8 percentage points better
R² 0.78
Variance Explained
The model explains 78% of the case-to-case variation in surgery length
p = 0.34
Bias Test
Only model with no significant bias
(all others p < 0.05)
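The metrics above — average error (MAE), percentage error (MAPE), variance explained (R²), and a bias test on the signed errors — can be computed as follows on synthetic data. The one-sample t-test used here as the bias check is an assumption for illustration; the paper's exact statistical test may differ.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
actual = rng.uniform(60, 240, 1000)                 # true durations (min)
model = actual + rng.normal(0, 30, 1000)            # unbiased predictor
surgeon = actual - 18.5 + rng.normal(0, 35, 1000)   # systematic underestimate

def report(est, y):
    err = est - y                           # signed error in minutes
    return {
        "mae": np.mean(np.abs(err)),        # average absolute error
        "mape": np.mean(np.abs(err) / y) * 100,
        "r2": 1 - np.sum(err**2) / np.sum((y - y.mean())**2),
        "bias": err.mean(),                 # mean signed error
        # H0: mean signed error is zero, i.e. no systematic drift.
        "p": stats.ttest_1samp(err, 0.0).pvalue,
    }

model_m = report(model, actual)
surgeon_m = report(surgeon, actual)
```

On this synthetic setup the "surgeon" estimator shows a large negative bias with a tiny p-value, while the unbiased predictor's mean error is statistically indistinguishable from zero — mirroring the −18.52 min vs. −0.37 min contrast reported above.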

How does each model compare?

Average prediction error in minutes — lower bars are better. The red dashed line shows how surgeons perform today.

Which models have a scheduling bias?

A bar extending left means the model consistently underestimates. Zero is perfect. Only the Neural Network achieves near-zero systematic error.

Model                                    Avg. Error ↓   % Error ↓   Fit (R²) ↑   Scheduling Bias     Systematically Biased?
🤖 Neural Network (ANN) · BEST           31.8 min       26%         0.78         −0.37 min ✓         No (p = 0.34)
XGBoost                                  32.5 min       27%         0.78         −2.63 min           Yes
Gradient Boosting (GBM)                  32.4 min       27%         0.78         −2.67 min           Yes
Random Forest                            36.8 min       26%         0.77         −2.69 min           Yes
Linear / Ridge / Lasso                   36.9 min       31%         0.72         −1.7 to −2.1 min    Yes
👨‍⚕️ Surgeon Estimate (current practice)   35.3 min       34%         0.78         −18.52 min ✗        Yes (p < 0.001)
What this means in practice
🏥

For Hospital Administrators

More accurate OR scheduling means fewer overruns, less overtime pay, and more procedures completed per day — directly improving throughput and revenue per OR.

👨‍⚕️

For Surgical Teams

A schedule that reflects reality — not wishful thinking. Fewer rushed finishes, fewer late-night cases, and more preparation time between procedures.

🧑‍🤝‍🧑

For Patients

Fewer cancellations of same-day surgeries due to time overruns. Less waiting. When your procedure is scheduled, it happens — on time.