Healthcare AI · OR Scheduling · Published Paper · Surgical Endoscopy 2025

Can AI Predict How Long a Surgery Will Take
Better Than the Surgeon?

Yes — and by a meaningful margin. Using 17,246 real surgical records, we trained an AI that predicts operating room duration with near-zero systematic bias, while surgeon estimates consistently run 18+ minutes short every single day.

Kwong · Noorchenarboo · Grolinger · Hawel · Schlachta · Elnahas
17,246 procedures · 3 hospitals · 5 years · DOI: 10.1007/s00464-025-11885-0
17,246
Real surgical records across 3 tertiary hospitals
8
ML models + surgeon baseline compared head-to-head
31.8
Best MAE in minutes — Neural Network (ANN)
18+ min
Average surgeon underestimation eliminated by AI
−0.37
Bias achieved by ANN — statistically zero (p = 0.34)

① The Problem

Operating rooms run late every day — and it's costing everyone.

18 min

Average scheduling error

Surgeons estimate case duration from memory and experience. Their guesses are systematically off by over 18 minutes — always in the same direction: too short.

$1,000s

Per hour of OR time wasted

Operating rooms are the most expensive resource in a hospital. One overrun case delays the next patient, causes staff overtime, and can cancel planned surgeries.

No two

Patients are the same

A hernia repair in a healthy 30-year-old takes far less time than in a 70-year-old with diabetes and heart disease. Traditional scheduling ignores these differences entirely.


② The Data

5 years of real surgery records from 3 hospitals.

17,246
Surgical procedures used to train & test our AI — every one a real case
16,159
Unique patients
3
Academic hospitals in London, Canada
2015–2020
5-year retrospective window
11
Procedure types studied
20
Input variables per case
56.9 yrs
Average patient age

Cases by procedure type


③ How It Works

From hospital records to accurate predictions — 4 steps.

🗄️

Collect & Clean

5 years of de-identified records. Outliers removed. Only elective adult cases included. 17,246 clean records.

🔍

Extract Patient Signals

20 pre-surgery variables: age, health complexity (ASA score), BMI, procedure type, surgeon, and day of week.

🤖

Train 8 AI Models

From simple statistics to deep neural networks. Each tested with 10-fold cross-validation to ensure fairness.

📊

Benchmark vs. Surgeons

Every AI model compared directly against real surgeon estimates on 5 accuracy and bias metrics.

Data Sources
  • Cerner EMR system
  • 3 tertiary hospitals
  • Elective cases only
  • No emergency or cancelled cases
  • Adults (18+) only
Key Inputs Used
  • Procedure type & approach
  • ASA health complexity score
  • Patient age & BMI
  • Surgeon identity
  • First/last case of day
  • Day of week & ICD code
8 Models Tested
  • Linear / Ridge / Lasso
  • Support Vector Regression
  • Random Forest
  • Gradient Boosting (GBM)
  • XGBoost
  • Neural Network (ANN) ✓ Best
Validation Approach
  • 10-fold cross-validation
  • Per-fold data preprocessing
  • Bayesian hyperparameter tuning
  • Statistical bias testing
  • TRIPOD-AI reporting standard
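The benchmarking setup described above — multiple regressors evaluated with 10-fold cross-validation, with preprocessing refit inside each fold — can be sketched as follows. This is a minimal illustration using scikit-learn on synthetic stand-in data; the variable names, feature mix, and model settings are assumptions for demonstration, not the paper's actual code or dataset.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
n = 500
# Synthetic stand-ins for pre-surgery inputs: age, BMI, ASA score,
# encoded procedure type. Real features would include many more columns.
X = np.column_stack([
    rng.normal(57, 15, n),    # patient age (years)
    rng.normal(28, 5, n),     # BMI
    rng.integers(1, 5, n),    # ASA score (1-4)
    rng.integers(0, 11, n),   # procedure type (integer-encoded)
])
# Synthetic "true" OR durations in minutes, driven by the features plus noise.
y = 60 + 20.0 * X[:, 2] + 7.5 * X[:, 3] + rng.normal(0, 30, n)

models = {
    "ridge": make_pipeline(StandardScaler(), Ridge()),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "gbm": GradientBoostingRegressor(random_state=0),
    "ann": make_pipeline(StandardScaler(),
                         MLPRegressor(hidden_layer_sizes=(32,),
                                      max_iter=1000, random_state=0)),
}

cv = KFold(n_splits=10, shuffle=True, random_state=0)
results = {}
for name, model in models.items():
    # Scaling lives inside the pipeline, so it is refit on each training
    # fold -- no information leaks into the held-out fold.
    mae = -cross_val_score(model, X, y, cv=cv,
                           scoring="neg_mean_absolute_error").mean()
    results[name] = round(mae, 1)
print(results)
```

Keeping the preprocessing inside the pipeline is what the "per-fold data preprocessing" bullet refers to: fitting the scaler on all 17,246 cases before splitting would let test-fold statistics leak into training.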

④ Results

The AI eliminates scheduling bias — outperforming surgeons on every metric.

🤖
AI Model — Neural Network
−0.37 min
Average prediction error — essentially zero bias. The model makes mistakes in both directions equally, with no systematic drift. Statistically unbiased (p = 0.34).
👨‍⚕️
Surgeon Estimate — Current Practice
−18.52 min
Average prediction error — surgeons consistently underestimate how long cases take, causing schedules to run over every single day. Statistically significant (p < 0.001).
31.8 min
Average Error (MAE)
vs 35.3 min for surgeons
= 10% more accurate
26%
% Error (MAPE)
vs 34% for surgeons
= 8 percentage points better
R² 0.78
Variance Explained
The model explains 78% of the case-to-case variation in surgery length
p = 0.34
Bias Test
Only model with no significant bias
(all others p < 0.05)
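The metrics above — average error (MAE), percentage error (MAPE), variance explained (R²), and a bias test on the signed errors — can be computed as follows on synthetic data. The one-sample t-test used here as the bias check is an assumption for illustration; the paper's exact statistical test may differ.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
actual = rng.uniform(60, 240, 1000)                 # true durations (min)
model = actual + rng.normal(0, 30, 1000)            # unbiased predictor
surgeon = actual - 18.5 + rng.normal(0, 35, 1000)   # systematic underestimate

def report(est, y):
    err = est - y                           # signed error in minutes
    return {
        "mae": np.mean(np.abs(err)),        # average absolute error
        "mape": np.mean(np.abs(err) / y) * 100,
        "r2": 1 - np.sum(err**2) / np.sum((y - y.mean())**2),
        "bias": err.mean(),                 # mean signed error
        # H0: mean signed error is zero, i.e. no systematic drift.
        "p": stats.ttest_1samp(err, 0.0).pvalue,
    }

model_m = report(model, actual)
surgeon_m = report(surgeon, actual)
```

On this synthetic setup the "surgeon" estimator shows a large negative bias with a tiny p-value, while the unbiased predictor's mean error is statistically indistinguishable from zero — mirroring the −18.52 min vs. −0.37 min contrast reported above.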

How does each model compare?

Average prediction error in minutes — lower bars are better. The red dashed line shows how surgeons perform today.

Which models have a scheduling bias?

A bar extending left means the model consistently underestimates. Zero is perfect. Only the Neural Network achieves near-zero systematic error.

Model                                    Avg. Error ↓   % Error ↓   Fit (R²) ↑   Scheduling Bias     Systematically Biased?
🤖 Neural Network (ANN) · BEST           31.8 min       26%         0.78         −0.37 min ✓         No (p = 0.34)
XGBoost                                  32.5 min       27%         0.78         −2.63 min           Yes
Gradient Boosting (GBM)                  32.4 min       27%         0.78         −2.67 min           Yes
Random Forest                            36.8 min       26%         0.77         −2.69 min           Yes
Linear / Ridge / Lasso                   36.9 min       31%         0.72         −1.7 to −2.1 min    Yes
👨‍⚕️ Surgeon Estimate (current practice)   35.3 min       34%         0.78         −18.52 min ✗        Yes (p < 0.001)
What this means in practice
🏥

For Hospital Administrators

More accurate OR scheduling means fewer overruns, less overtime pay, and more procedures completed per day — directly improving throughput and revenue per OR.

👨‍⚕️

For Surgical Teams

A schedule that reflects reality — not wishful thinking. Fewer rushed finishes, fewer late-night cases, and more preparation time between procedures.

🧑‍🤝‍🧑

For Patients

Fewer cancellations of same-day surgeries due to time overruns. Less waiting. When your procedure is scheduled, it happens — on time.