Benchmarked 8 machine learning models — Linear Regression, Ridge, Lasso, SVR, Random Forest, GBM, XGBoost, and a Neural Network (ANN) — against real surgeon estimates for predicting elective surgical case duration. The study covers 17,246 procedures from 3 academic hospitals in London, Canada (2015–2020), using 20 structured pre-operative features including procedure type, ASA physical status score, patient age, BMI, surgeon identity, ICD code, and day of week. Published in Surgical Endoscopy (2025).
The best model — a Neural Network (ANN) — achieved MAE = 31.8 min, MAPE = 26%, R² = 0.78, and a near-zero systematic bias of −0.37 min (p = 0.34), making it the only model whose bias was statistically indistinguishable from zero. By contrast, surgeon estimates showed a consistent −18.52 min underestimation (p < 0.001). All models were tuned via Bayesian hyperparameter optimization, validated with 10-fold cross-validation, and reported to TRIPOD-AI standards, using Python, scikit-learn, XGBoost, and Keras/TensorFlow.
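The benchmarking loop can be sketched with scikit-learn's cross_val_score; the data here is a synthetic stand-in for the 17,246-case dataset, and only three of the eight models are shown:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the real dataset (20 structured features).
X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=0)

models = {
    "Linear Regression": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# 10-fold CV, scored by mean absolute error (sklearn reports it negated).
results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10, scoring="neg_mean_absolute_error")
    results[name] = -scores.mean()

for name, mae in sorted(results.items(), key=lambda kv: kv[1]):
    print(f"{name}: MAE = {mae:.1f}")
```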
Benchmarked five text-encoding strategies — Label Encoding, Count Vectorization, TF-IDF, ClinicalBERT, and Sentence-BERT — on 180,366 real elective surgical cases from 3 tertiary hospitals in London, Canada (2015–2020). Each strategy was fused with structured pre-operative features (age, BMI, ASA score, sex, procedure type) and evaluated against multiple ML models using MAE, SMAPE, and R² with 10-fold cross-validation. Submitted as an SSRN preprint (2025).
The best encoder — Sentence-BERT — achieved MAE = 27.6 min, SMAPE = 23.0%, and R² = 0.77, reducing prediction error by up to 16% over traditional encodings. ClinicalBERT also showed statistically significant gains over the structured-only baseline (p < 0.01). Results confirm that sentence-level transformer embeddings, which encode the full meaning of a surgical narrative, outperform all token-level and frequency-based strategies. Tech stack: Python, Hugging Face, PyTorch, scikit-learn, XGBoost, SHAP.
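Fusing a text encoding with the structured features reduces, in code, to a column-wise concatenation; the 384-dim embedding matrix below is random as a stand-in for real Sentence-BERT vectors:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_cases = 100

# Hypothetical Sentence-BERT embeddings of the free-text procedure
# descriptions (random stand-in; the study used real 384-dim vectors).
text_emb = rng.normal(size=(n_cases, 384))

# Structured pre-operative features: age, BMI, ASA score, sex, procedure type.
structured = rng.normal(size=(n_cases, 5))

# Fuse: scale the structured block, then concatenate with the embeddings
# so a single regressor sees both views of each case.
fused = np.hstack([StandardScaler().fit_transform(structured), text_emb])
print(fused.shape)  # (100, 389)
```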
Developed a context-aware XAI framework for deep learning-based energy anomaly detection across five real building types — residential, manufacturing, medical clinic, retail, and office. The key innovation replaces SHAP's random background baseline with cosine-similarity-matched historical windows, weighted by Random Forest feature importances, so explanations are grounded in each anomaly's actual operational context. Published in Energy & Buildings (2025), the method was evaluated across 10 deep learning architectures (LSTM, GRU, BiLSTM, BiGRU, 1D-CNN, TCN, DCNN, WaveNet, TFT, TST) and 5 XAI techniques over 250+ model–dataset–XAI combinations.
Results show a 38% average reduction in explanation variability across all datasets and methods, with a maximum of 80.3% reduction on the manufacturing facility using Partition SHAP (p < 0.001). The anomaly detection pipeline uses IQR-based thresholding with 48-hour sliding window inputs and Bayesian hyperparameter tuning on all models. Funded by NSERC and London Hydro; co-authored with K. Grolinger, Western University. Tech stack: Python, SHAP, LIME, Random Forest, Cosine Similarity, Bayesian Optimization, Min-Max Scaling.
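A minimal sketch of the context-matched baseline idea, using random windows in place of real building data; the exact importance-weighting scheme below is an assumption, not the paper's formula:

```python
import numpy as np

def matched_background(anomaly, history, importances, k=20):
    """Pick the k historical windows most cosine-similar to the anomaly
    window, weighting each feature by Random Forest importance first
    (sqrt-weighting here is an illustrative choice)."""
    w = np.sqrt(importances)
    a = anomaly * w
    h = history * w
    sims = (h @ a) / (np.linalg.norm(h, axis=1) * np.linalg.norm(a) + 1e-12)
    top = np.argsort(sims)[-k:]
    return history[top]

rng = np.random.default_rng(0)
history = rng.normal(size=(500, 48))            # 48-hour sliding windows
anomaly = history[7] + rng.normal(scale=0.05, size=48)
imp = rng.dirichlet(np.ones(48))                # stand-in RF importances
bg = matched_background(anomaly, history, imp, k=20)
print(bg.shape)  # (20, 48) — context-matched SHAP background set
```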
Applied a multivariate candidate-gene approach to identify genetic risk factors for sarcopenia in 2,772 elderly Iranians (BEH cohort, aged 60+) genotyped at 663,377 SNPs. Rather than testing muscle mass and grip strength separately, the study used MultiPhen — a joint ordinal regression method that regresses genotype on both outcomes simultaneously — providing up to 2× more statistical power than conventional single-trait GWAS. After LD filtering (r² ≤ 0.4, MAF > 0.01, HWE p > 0.05), 27 independent SNPs within ±50 kb of the IL10 gene were aggregated via the GATES Extended Simes procedure. Published in the Journal of Biostatistics and Epidemiology (2022).
Three intronic IL10 variants reached significance — rs11119603 (p = 0.00384), rs57461190 (p = 0.00411), and rs3950619 (p = 0.03641) — with effect sizes ranging 0.178–0.883. The GATES gene-level test confirmed IL10 at p = 0.046, the first genomic validation of IL10 as a sarcopenia risk gene in an Iranian population. Tech stack: R, MultiPhen, GATES, PLINK.
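The extended Simes combination at the heart of GATES can be sketched in a few lines, assuming the effective numbers of independent tests have already been derived from the LD matrix (the values below are hypothetical):

```python
import numpy as np

def gates(pvals, me_total, me_cum):
    """GATES gene-level p-value: combine SNP-level p-values with the
    extended Simes procedure, where `me_cum[j]` is the effective number
    of independent tests among the j+1 smallest p-values and `me_total`
    the effective number for the whole gene (both from LD eigenvalues)."""
    p = np.sort(np.asarray(pvals))
    return float(np.min(me_total * p / me_cum))

# Toy example: 5 SNP p-values with assumed effective-test counts.
pvals = [0.004, 0.011, 0.03, 0.2, 0.6]
me_cum = np.array([1.0, 1.8, 2.5, 3.1, 3.6])   # hypothetical
p_gene = gates(pvals, me_total=3.6, me_cum=me_cum)
print(round(p_gene, 4))  # 0.0144
```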
Profiled 50 plasma metabolites (30 acylcarnitines + 20 amino acids) via FIA-MS/MS in 1,102 fasting participants from the STEPs 2016 national survey (age 40–79, Tehran University of Medical Sciences). To handle the high dimensionality and multicollinearity of 50 correlated metabolites, the pipeline applied PCA with Varimax rotation (KMO = 0.874, Bartlett p < 0.001) to produce 10 orthogonal metabolite factors, followed by binary logistic regression stratified across four ACC/AHA ASCVD risk groups. Pathway enrichment was performed via MetaboAnalyst v5.0 / KEGG with Benjamini-Hochberg FDR correction. Published in Frontiers in Cardiovascular Medicine (2023).
Multiple linear regression on all 50 metabolites identified 14 significant ASCVD biomarkers — 3 acylcarnitines (C4DC, C8:1, C16OH) and 11 amino acids — with C16OH showing the strongest single-metabolite correlation (r = 0.279, p < 0.001). PCA logistic regression confirmed Factor 10 (ornithine + citrulline) as the highest-risk factor (OR = 1.570, p < 0.001) and Factor 9 (glycine, serine, threonine) as the only protective factor (OR = 0.741, p < 0.001). Tech stack: FIA-MS/MS, PCA (Varimax), SPSS v19.0, MetaboAnalyst v5.0, KEGG.
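A compact sketch of the factor-then-regress pipeline on synthetic data; scikit-learn's FactorAnalysis with rotation="varimax" stands in for the SPSS PCA-Varimax step used in the paper:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 50))       # stand-in for 50 correlated metabolites
y = rng.integers(0, 2, size=300)     # stand-in for high/low ASCVD-risk label

# Reduce 50 collinear metabolites to 10 orthogonal, Varimax-rotated factors.
fa = FactorAnalysis(n_components=10, rotation="varimax", random_state=0)
factors = fa.fit_transform(X)

# Logistic regression on the factor scores; exponentiated coefficients are
# per-factor odds ratios (cf. Factor 10's OR = 1.570 in the study).
clf = LogisticRegression(max_iter=1000).fit(factors, y)
odds_ratios = np.exp(clf.coef_[0])
print(odds_ratios.shape)  # (10,)
```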
Conducted the statistical analysis for a 3-arm randomized clinical trial (n=123 CABG patients, block randomization size 4, n=41 per arm) comparing a custom gamified Android app ("Delban") against teach-back training and usual care over 30 days post-discharge. Delban covered three clinical modules — diet, medication, and movement — with star rewards and a shared social leaderboard. Outcomes were measured via the Sanaie dietary questionnaire (30 items), Sanaie movement scale (19 items), and the MMAS medication adherence scale (Cronbach α = 0.82). Published in JMIR (2021).
Statistical analysis used one-way ANOVA with Dunnett post-hoc tests in SPSS v20 and STATA v12. Gamification significantly outperformed both teach-back and control for diet (F = 71.8, Δ = +1.797, p < 0.001) and movement adherence (F = 124.5, Δ = +2.013, p < 0.001), with non-overlapping 95% CIs confirming a genuine advantage over teach-back. Medication adherence improved in both active arms vs. control (F = 9.66, p < 0.001) but did not differ significantly between the two interventions. Tech stack: SPSS v20, STATA v12, one-way ANOVA, Dunnett post-hoc, Fisher exact test, Cronbach alpha.
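The omnibus ANOVA step can be reproduced with SciPy on hypothetical adherence scores (the trial itself used SPSS/STATA):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical 30-day adherence scores for the three arms (n = 41 each).
gamified   = rng.normal(loc=27.0, scale=2.0, size=41)
teach_back = rng.normal(loc=25.2, scale=2.0, size=41)
control    = rng.normal(loc=22.0, scale=2.0, size=41)

# Omnibus one-way ANOVA across the three arms.
f_stat, p_value = stats.f_oneway(gamified, teach_back, control)
print(f"F = {f_stat:.1f}, p = {p_value:.3g}")

# For the arm-vs-control pairwise step, SciPy >= 1.11 provides
# stats.dunnett(gamified, teach_back, control=control), mirroring the
# Dunnett post-hoc test used in the trial.
```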
Built a Flask web application that converts a plain-language biomedical query into validated human protein drug targets with AlphaFold 3D structures rendered in-browser. The core engine is a 5-model LLM fallback chain — Llama-3.2-3B-Instruct → Mistral-7B-Instruct-v0.3 → Phi-3-mini-4k-instruct → Apriel-5B-Instruct → Llama-3.1-5B-Instruct — where each model is tried with three API methods (conversational → chat_completion → text_generation) before cascading to the next. LLMs return a structured JSON array of 3–5 human protein targets with UniProt IDs, function summaries, and known drugs per query.
Every LLM-suggested UniProt ID is cross-validated live against the UniProt REST API with a strict Homo sapiens organism filter (organism_id:9606) — hallucinated or non-human IDs are dropped automatically. AlphaFold EBI (v4/v3) structures are fetched with RCSB PDB fallback for 3D in-browser rendering. Additional resilience layers include a 3-level UniProt deep search fallback with progressive keyword extraction and a curated disease-protein JSON database as last resort. A per-protein LLM chat assistant handles follow-up Q&A. Deployed on HuggingFace Spaces (2025).
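The organism gate reduces to one field check on the parsed UniProt JSON; the entries below are mocked for offline illustration, with field names following the current UniProtKB response schema:

```python
# Public UniProtKB REST endpoint the app queries per accession.
UNIPROT_ENTRY_URL = "https://rest.uniprot.org/uniprotkb/{acc}.json"

def is_validated_human_entry(entry: dict) -> bool:
    """Accept an entry only if UniProt resolves it to Homo sapiens
    (NCBI taxon 9606); anything else is treated as a hallucinated or
    non-human ID and dropped."""
    organism = entry.get("organism", {})
    return organism.get("taxonId") == 9606

# Mocked responses: a human entry passes, a mouse entry (taxon 10090) fails.
human = {"primaryAccession": "P04637", "organism": {"taxonId": 9606}}
mouse = {"primaryAccession": "P02340", "organism": {"taxonId": 10090}}
print(is_validated_human_entry(human), is_validated_human_entry(mouse))  # True False
```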
Built a full-stack risk intelligence dashboard unifying seven risk modules — credit scoring, fraud detection, market risk (VaR/CVaR), loan portfolio analysis, insurance claims analytics, underwriting risk, and loss-ratio monitoring — into a single Flask + Plotly application deployed on HuggingFace Spaces. Three production-style Scikit-learn ML models (Random Forest, Gradient Boosting, Logistic Regression) are trained at startup on 5,200+ synthetic records across credit, fraud, insurance, and market domains. Three REST API endpoints (/api/credit, /api/fraud, /api/underwriting) deliver real-time ML scoring from browser forms — mirroring the microservice pattern of production risk systems.
The market risk engine computes VaR at 95% & 99%, CVaR (Expected Shortfall), Sharpe ratio, and maximum drawdown via historical simulation on 252 days of synthetic returns with a rolling 21-day window. The credit risk module outputs Probability of Default, LGD proxy, and Expected Loss per applicant. All Plotly figures are serialised server-side to JSON and rendered client-side — responsive, interactive, and theme-consistent. The entire application — data generation, ML training, seven pages, REST endpoints, and CSS — is contained in a single app.py, containerised via Docker for reproducible deployment (2025).
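The historical-simulation metrics are short NumPy computations; the returns below are synthetic, mirroring the app's own data generation:

```python
import numpy as np

rng = np.random.default_rng(0)
returns = rng.normal(loc=0.0004, scale=0.012, size=252)  # synthetic daily returns

def historical_var_cvar(returns, level=0.95):
    """Historical-simulation VaR and CVaR (expected shortfall) at `level`,
    both reported as positive loss fractions."""
    cutoff = np.percentile(returns, 100 * (1 - level))
    var = -cutoff
    cvar = -returns[returns <= cutoff].mean()
    return var, cvar

def max_drawdown(returns):
    """Largest peak-to-trough decline of the cumulative-return curve."""
    wealth = np.cumprod(1 + returns)
    peak = np.maximum.accumulate(wealth)
    return float(np.max(1 - wealth / peak))

var95, cvar95 = historical_var_cvar(returns, 0.95)
sharpe = np.sqrt(252) * returns.mean() / returns.std()  # annualised, rf assumed 0
print(f"VaR95={var95:.4f} CVaR95={cvar95:.4f} Sharpe={sharpe:.2f} MDD={max_drawdown(returns):.4f}")
```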
Built a fully interactive deep learning playground where users configure and train a Variational Autoencoder (VAE) on MNIST from scratch, then explore the learned latent space — all inside a browser-based five-tab interface backed by a live PyTorch training loop. The VAE architecture follows an encoder–reparameterisation–decoder pipeline (784 → 400 → L → 400 → 784) optimised with the ELBO loss (Binary Cross-Entropy + KL divergence) via Adam. Hyperparameters — epochs, batch size, learning rate, hidden dimension, and latent dimension — are fully configurable at runtime, and a live loss curve polls the training thread every 600 ms without page reloads.
The app exposes seven Flask REST API endpoints returning base64-encoded PNG responses for all visualisations (loss curve, latent scatter, reconstruction grid, slider-driven generation). A 2-D latent manifold scatter plot encodes 10,000 MNIST test images and colours them by digit class, making representation learning directly observable. Two sliders (Z₁, Z₂ ranging −3 to +3) let users navigate the latent space in real time and decode any coordinate into a generated digit, alongside a full 15 × 15 manifold grid. Training runs in a Python daemon thread for a non-blocking UI. Deployed on HuggingFace Spaces (2025).
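The ELBO objective restated in NumPy (the app itself trains with PyTorch); inputs are random stand-ins for a binarised MNIST image and the decoder's sigmoid output:

```python
import numpy as np

def elbo_loss(x, x_hat, mu, logvar):
    """VAE objective: binary cross-entropy reconstruction term plus the
    closed-form KL divergence between N(mu, sigma^2) and N(0, I):
    KL = -0.5 * sum(1 + log sigma^2 - mu^2 - sigma^2)."""
    eps = 1e-7
    bce = -np.sum(x * np.log(x_hat + eps) + (1 - x) * np.log(1 - x_hat + eps))
    kl = -0.5 * np.sum(1 + logvar - mu**2 - np.exp(logvar))
    return bce + kl

rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=784).astype(float)  # one binarised 28x28 image
x_hat = rng.uniform(0.01, 0.99, size=784)       # decoder output (probabilities)
mu, logvar = rng.normal(size=2), np.zeros(2)    # 2-D latent code
loss = elbo_loss(x, x_hat, mu, logvar)
print(f"ELBO loss = {loss:.1f}")
```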
Built a 6-step interactive educational playground that trains a vanilla Transformer (encoder–decoder seq2seq) on a handcrafted English-to-French dataset entirely in the browser, powered by a real PyTorch training loop streamed live via Server-Sent Events (SSE). The full Transformer stack — multi-head self-attention, cross-attention, positional encoding, label-smoothed KL divergence loss, and warmup inverse-sqrt LR scheduling — is implemented from scratch and configurable at runtime (d_model 64–256, 1–4 layers, 2/4/8 heads, d_ff 128–512). A live parameter count and ASCII architecture diagram update dynamically on every hyperparameter change. Training runs in a Python daemon thread pushing per-epoch JSON events (loss, perplexity, LR, sample translations) to a Chart.js live loss curve with early stopping and best-checkpoint restoration.
Post-training evaluation computes SacreBLEU corpus scores for both greedy and beam-4 decoding across all 138 sentence pairs with a side-by-side comparison table. The inference tab renders a cross-attention heatmap on an HTML Canvas — weights extracted from the last decoder layer, averaged across heads — showing exactly which source token the model attended to at each decoding step. A live token colour-coder flags out-of-vocabulary (OOV) words in real time with UNK badges before translation. The 6-step wizard enforces sequential step-locking (Data → Vocab → Model → Train → Evaluate → Infer) to mirror the correct conceptual pipeline. Deployed on HuggingFace Spaces via Docker (2025).
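The head-averaging behind the heatmap is a single mean over the head axis; the attention tensor below is hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
n_heads, tgt_len, src_len = 8, 5, 7

# Hypothetical cross-attention weights from the last decoder layer:
# one (tgt_len, src_len) map per head, rows normalised by softmax.
logits = rng.normal(size=(n_heads, tgt_len, src_len))
attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

# Average across heads (as the heatmap does), then read off, for each
# decoding step, which source token received the most attention.
avg = attn.mean(axis=0)         # (tgt_len, src_len), rows still sum to 1
alignment = avg.argmax(axis=1)  # most-attended source index per target step
print(avg.shape, alignment.shape)
```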
Built a production-quality statistical experimentation platform covering the full controlled-experiment lifecycle across four interactive modules: Power Calculator (two-proportion z-test sample sizing with power curves across MDE and power targets), A/B Test Analyzer (two-proportion z-test and Welch's t-test with Cohen's d effect size, CSV upload support, and a 2×2 statistical + practical significance verdict matrix enforcing both p-value and business-relevance thresholds), Two-Factor Design of Experiments / DoE (full two-way factorial ANOVA with interaction via statsmodels OLS for 2×2 and 3×3 designs with replicates), and Sequential Testing Demo (Monte Carlo simulation of the peeking problem vs. O'Brien-Fleming alpha-spending boundary correction). All computation runs server-side in Python using SciPy, NumPy, and statsmodels; results are serialised as Plotly JSON and rendered client-side.
The application is architected as a set of self-contained Flask Blueprints — one per module — keeping statistical logic fully decoupled from presentation with zero cross-module dependencies. A debounced live computation pattern (400–800 ms per module) fires POST requests on input change, giving a live-calculator feel without server overload. The entire stack is containerised via Docker and deployed on HuggingFace Spaces (port 7860). The A/B analyzer's dual-gate verdict — requiring clearance of both statistical significance (p-value) and a user-defined practical significance threshold — directly reflects industry-standard experiment decision frameworks used at data-driven organisations (2024).
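The power calculator's core computation is the standard normal-approximation sample size for a two-proportion z-test; this closed-form sketch may differ slightly from a solver-based implementation:

```python
import math
from scipy.stats import norm

def sample_size_two_prop(p1, p2, alpha=0.05, power=0.8):
    """Per-arm sample size for a two-sided two-proportion z-test,
    via the textbook normal-approximation formula."""
    z_a = norm.ppf(1 - alpha / 2)
    z_b = norm.ppf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_a + z_b) ** 2 * variance / (p1 - p2) ** 2
    return math.ceil(n)

# Detecting a 10% -> 12% conversion lift (2-point MDE) at 80% power:
print(sample_size_two_prop(0.10, 0.12))
```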
Built a full-stack Retrieval-Augmented Generation (RAG) benchmarking platform that fetches real arXiv papers on-demand via the official API, indexes them with both BM25 sparse retrieval (BM25Okapi) and dense vector retrieval (FAISS IndexFlatIP + all-MiniLM-L6-v2, 384-dim L2-normalised embeddings), and fuses ranked results using Reciprocal Rank Fusion (RRF, k=60). All three retrieval methods — BM25, dense, and hybrid — run simultaneously on every query, with live evaluation metrics (context relevance, faithfulness, diversity) visualised in radar and grouped bar charts via Plotly Dash. A Flan-T5-large generator (780M params) receives the 6 most relevant sentences — selected by sentence-level embedding similarity rather than naive truncation — as distilled context for answer generation. The entire stack runs fully open-source with no external API key.
Three configurable chunking strategies (fixed, sentence-window, semantic) and a full JSON REST API (/api/fetch-arxiv, /api/query, /api/compare) make all features accessible programmatically. Every component — RRF fusion, chunker, context distillation, evaluation metrics — is implemented from scratch without LangChain, demonstrating internals-level understanding of each retrieval stage. Evaluation results show hybrid RRF achieves 0.47 average context relevance vs 0.41 BM25 and 0.51 diversity vs 0.38 BM25. Architecture is containerised via Docker and deployed on HuggingFace Spaces (2025).
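RRF itself is a few lines of pure Python; this sketch assumes ranks start at 1 and that documents absent from a ranking contribute nothing:

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: each document scores sum(1 / (k + rank))
    over every ranked list that contains it, rewarding consensus between
    sparse and dense retrieval."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Toy example: BM25 and dense retrieval disagree; d1 and d3 appear in
# both lists, so RRF lifts them to the top of the fused ranking.
bm25  = ["d3", "d1", "d7", "d2"]
dense = ["d1", "d5", "d3", "d9"]
print(rrf_fuse([bm25, dense], k=60))
```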
Built a multi-agent document research system using a LangGraph 0.2 StateGraph with five specialised agents — Planner, Retriever, Grader, Generator, and Critic — executing in a clean linear pipeline with only 3 LLM calls per query (Qwen 2.5-7B-Instruct via HuggingFace Router OpenAI-compatible API). The Grader agent is entirely score-based (hybrid search score × 0.7 + keyword overlap × 0.3), eliminating 5 sequential LLM calls and reducing average query latency by ~60% while remaining instant and deterministic. The Retriever agent runs parallel FAISS IndexFlatIP semantic search and BM25Okapi keyword search over ingested chunks, fusing results via Reciprocal Rank Fusion (RRF, k=60) — improving top-5 recall by ~18% over pure semantic search on technical documents. Embeddings use BAAI/bge-small-en-v1.5 running locally via sentence-transformers, with zero API dependency on retrieval.
The Generator agent produces structured answers with inline [Source: filename, p.N] citations strictly grounded in retrieved context (Qwen at temp 0.4, max 512 tokens). The Critic agent performs hallucination detection and completeness evaluation, outputting an APPROVED or NEEDS_REVIEW verdict with a warning badge in the UI (temp 0.1, near-deterministic). Document ingestion supports PDF uploads (pypdf, up to 10 MB) and public URL scraping (requests + BeautifulSoup), chunked at 500 tokens with 50-token overlap. The entire LLM layer is provider-agnostic via the OpenAI SDK — swappable to OpenAI, Groq, Anthropic, or Ollama with one URL change. All LangChain chains use LCEL pipe syntax (prompt | llm). The Flask backend executes the graph via Python threading with query-ID polling for a live agent trace UI. Deployed on HuggingFace Spaces via Docker (2025).
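The fixed-size chunker with overlap is a simple sliding window; a minimal sketch over placeholder tokens:

```python
def chunk_tokens(tokens, size=500, overlap=50):
    """Split a token list into `size`-token chunks where each chunk
    repeats the last `overlap` tokens of the previous one, so context
    at chunk boundaries is never lost."""
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
    return chunks

tokens = [f"tok{i}" for i in range(1200)]
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])  # [500, 500, 300]
```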
Built an intelligent multi-turn customer support chatbot powered by a LangGraph 0.2 ReAct StateGraph with four nodes — Router, Agent, Tool Executor, and Responder — and conditional edges looping up to 4 ReAct iterations per turn before forcing a Final Answer. The agent dispatches five live tools: FAQ search (keyword-scored knowledge base, 15 entries), order status lookup, support ticket creation, product catalogue search (8 products), and human escalation. Since free-tier HuggingFace models do not support native function calling, tool dispatch uses a layered regex-based ReAct parser extracting Action/Action Input blocks with JSON fallback — covering the full output variance of open models. Four free HuggingFace LLMs are selectable at runtime mid-session (Mistral-7B, Zephyr-7B, Phi-3-Mini, Llama-3-8B), all at temperature 0.25 via HuggingFace Hub InferenceClient with stream=True.
Token streaming to the browser is delivered via Server-Sent Events (SSE), with LangGraph running in a Python daemon thread bridged by a per-session thread-safe queue.Queue — keeping Flask's HTTP response non-blocking. Every graph node emits enter/exit SSE events with millisecond timing, driving a live animated graph trace visualiser in the UI. A session analytics dashboard (Chart.js) tracks tool usage frequency and per-turn response latency in real time. The entire stack is zero-persistence — all state in-memory, no database, no external services beyond the HF Inference API — enabling indefinite free-tier deployment on HuggingFace Spaces via Docker (2025).
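The thread-to-stream bridge is plain stdlib: a daemon thread pushes events into a queue.Queue that an SSE generator drains. Here a stand-in worker replaces the LangGraph run:

```python
import queue
import threading

def worker(events: queue.Queue):
    """Stand-in for the graph execution: emit one SSE frame per node,
    then a None sentinel to signal end of stream."""
    for node in ("router", "agent", "tools", "responder"):
        events.put(f"event: node\ndata: {node}\n\n")
    events.put(None)

def sse_stream(events: queue.Queue):
    """Generator a Flask route would return as text/event-stream:
    it blocks on the queue, keeping the worker thread unblocked."""
    while True:
        item = events.get()
        if item is None:
            break
        yield item

events = queue.Queue()
threading.Thread(target=worker, args=(events,), daemon=True).start()
frames = list(sse_stream(events))
print(len(frames))  # 4 node events
```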