Statistics & Experimentation · Python · Flask · SciPy · Live on HuggingFace Spaces
ExperimentLab: The Complete A/B Testing Suite, from Power Analysis to Production
An interactive end-to-end experimentation platform covering the full controlled experiment lifecycle, from pre-experiment power planning to post-experiment decision-making. Four production-quality modules, each backed by rigorous statistical methods and interactive Plotly visualizations.
2024 · Mohammad Noorchenarboo · Synthetic & user-supplied data · 5 statistical methods
ExperimentLab maps directly onto the five stages of a well-run controlled experiment. Each module is a self-contained Flask Blueprint that handles both full-page rendering (GET) and JSON computation (POST), keeping statistical logic cleanly separated from presentation. The browser receives Plotly JSON objects and renders interactive charts client-side, so no heavy SPA framework is required.
Define Hypothesis: Set baseline, MDE & α
Power Analysis: Compute required n & power curves
Run Experiment: Collect observations per plan
Statistical Test: z-test · t-test · ANOVA
Decision: Ship / Iterate / Hold
Statistical + Practical Significance: Both Required
A common pitfall in experimentation is treating statistical significance as the only decision gate. ExperimentLab's A/B analyzer enforces a 2×2 verdict matrix: an experiment must clear both the p-value threshold and a user-specified practical significance threshold before returning a "PASS - Ship It" verdict. This prevents shipping changes that are statistically detectable but business-irrelevant.
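A two-gate verdict of this kind could be sketched as follows. The function name, the labels for the three non-passing cells, and the default thresholds are hypothetical; the app's actual labels and its handling of effect direction are not stated here.

```python
def verdict(p_value, observed_effect, alpha=0.05, practical_threshold=0.01):
    """Combine the statistical and practical gates into one of four outcomes.

    Assumption: only the magnitude of the effect is checked, not its direction.
    """
    statistically_significant = p_value < alpha
    practically_significant = abs(observed_effect) >= practical_threshold

    if statistically_significant and practically_significant:
        return "PASS"                      # both gates cleared
    if statistically_significant:
        return "DETECTABLE_BUT_SMALL"      # real, but below the business threshold
    if practically_significant:
        return "LARGE_BUT_UNCERTAIN"       # big enough to matter, not yet reliable
    return "NO_EFFECT"                     # neither gate cleared


# A 0.1 pp lift (0.001) with p = 0.03 clears only the statistical gate.
print(verdict(p_value=0.03, observed_effect=0.001))
```

Keeping the two checks as separate booleans makes each cell of the matrix explicit, so a small p-value alone can never produce a passing verdict.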
Module Breakdown
Four Modules, One Workflow
Pre-Experiment
Power Calculator
Computes the required per-group sample size for a two-proportion z-test. Returns power curves across MDE values and a sample-size trade-off chart across power targets.
Route: /power
Method: Two-proportion z-test power
Inputs: Baseline, MDE, α, power, tails
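As a rough illustration of the calculation behind this module, the sketch below uses the standard normal-approximation sample-size formula for comparing two proportions, without a continuity correction. Treating the MDE as an absolute difference, and the function and parameter names, are assumptions; the app may use a different variant of the formula.

```python
from math import ceil, sqrt

from scipy.stats import norm


def required_n_per_group(baseline, mde, alpha=0.05, power=0.80, two_tailed=True):
    """Per-group n for a two-proportion z-test (no continuity correction)."""
    p1 = baseline
    p2 = baseline + mde                  # assumption: MDE is an absolute difference
    p_bar = (p1 + p2) / 2                # average rate used under the null

    z_alpha = norm.ppf(1 - alpha / 2) if two_tailed else norm.ppf(1 - alpha)
    z_beta = norm.ppf(power)

    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / mde ** 2)


# Baseline 10%, absolute MDE of 2 pp, alpha = 0.05, 80% power, two-tailed.
print(required_n_per_group(baseline=0.10, mde=0.02))
```

Because n scales with 1 / MDE², halving the MDE roughly quadruples the required sample size.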
Post-Experiment
A/B Test Analyzer
Runs two-proportion z-test or Welch's t-test and issues a traffic-light verdict combining statistical and practical significance. Supports manual input or CSV upload.
Route: /ab-test
Methods: z-test · Welch's t-test · Cohen's d
Input modes: Manual summary stats · CSV upload
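The two tests named above can be sketched with SciPy as follows. The helper names are hypothetical, and Cohen's d is computed here with the classic pooled standard deviation, which may differ from the variant the app uses.

```python
import numpy as np
from scipy import stats


def two_proportion_ztest(x_a, n_a, x_b, n_b):
    """Two-sided z-test for a difference in proportions using the pooled SE."""
    p_a, p_b = x_a / n_a, x_b / n_b
    p_pool = (x_a + x_b) / (n_a + n_b)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * stats.norm.sf(abs(z))
    return z, p_value


def welch_with_cohens_d(a, b):
    """Welch's unequal-variance t-test (b vs a) plus Cohen's d."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    t_stat, p_value = stats.ttest_ind(b, a, equal_var=False)
    pooled_var = (
        (len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)
    ) / (len(a) + len(b) - 2)
    d = (b.mean() - a.mean()) / np.sqrt(pooled_var)
    return t_stat, p_value, d


# Illustrative inputs only: 100/1000 vs 120/1000 conversions.
print(two_proportion_ztest(100, 1000, 120, 1000))
print(welch_with_cohens_d([1, 2, 3, 4, 5], [2, 3, 4, 5, 6]))
```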
Multi-factor
Two-Factor DoE
Full two-way factorial ANOVA for 2×2 or 3×3 designs. The user enters a response grid (with optional replicates per cell) and receives an ANOVA table, interaction plot, and main-effects chart.
Route: /doe
Method: Two-way ANOVA (statsmodels)
Levels: 2 or 3 per factor · replicate support
Sequential
Sequential Testing Demo
Monte Carlo simulation of the peeking problem and the O'Brien-Fleming correction. Demonstrates how repeated interim checks inflate the false positive rate and how OBF boundaries fix it.
Route: /sequential
Method: OBF alpha-spending (z_α / √(t/T))
Sims: 5–50 Monte Carlo runs
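The peeking problem can be reproduced with a short NumPy simulation. This is a sketch under stated assumptions: a one-sample z-statistic on standard-normal data stands in for an A/B comparison, looks are equally spaced, and many more simulations are run than the app's range so that the estimated rates are stable. With the simplified boundary z_α / √(t/T), the overall false positive rate lands close to, but not exactly at, the nominal α.

```python
import numpy as np
from scipy.stats import norm


def simulate_false_positive_rates(n_looks=5, n_per_look=200, n_sims=20000,
                                  alpha=0.05, seed=0):
    """Estimate the type I error of naive peeking vs an OBF-style boundary under H0."""
    rng = np.random.default_rng(seed)
    z_alpha = norm.ppf(1 - alpha / 2)

    looks = np.arange(1, n_looks + 1)
    info_frac = looks / n_looks                    # t / T at each interim look
    obf_bounds = z_alpha / np.sqrt(info_frac)      # wide early, z_alpha at the end

    # Sum of n_per_look N(0, 1) draws per stage is N(0, n_per_look) under H0.
    increments = rng.normal(0.0, np.sqrt(n_per_look), size=(n_sims, n_looks))
    cumulative = np.cumsum(increments, axis=1)
    z = cumulative / np.sqrt(looks * n_per_look)   # z-statistic at each look

    naive_rate = np.any(np.abs(z) > z_alpha, axis=1).mean()
    obf_rate = np.any(np.abs(z) > obf_bounds, axis=1).mean()
    return naive_rate, obf_rate


naive_rate, obf_rate = simulate_false_positive_rates()
print(f"naive peeking: {naive_rate:.3f}, OBF boundary: {obf_rate:.3f}")
```

Any rejection at any look counts as a false positive, which is why the naive rate grows with the number of looks while the OBF-style boundary stays near α.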
UX Design
Design System
A bespoke CSS design system built from scratch: split-panel layout, collapsible parameter sections, debounced live updates, Ctrl+Enter shortcut, toast notifications, and a responsive mobile layout.
Sidebar: Sticky · 260px · Dark navy
Charts: Plotly JSON from server
Deployment
Docker + HF Spaces
Docker-first architecture exposing port 7860, the default port expected by the HuggingFace Spaces Docker SDK. A single YAML front-matter block in README.md triggers automatic cloud deployment on push.
Port: 7860 (HF standard)
SDK: docker · python:3.10-slim
Statistical Methods
Core Methods & Libraries
All computation runs server-side in Python using scipy, numpy, and statsmodels; results are serialised as Plotly JSON and streamed to the browser. This keeps the client lightweight and the statistical logic easily testable.
Two-Proportion Z-Test Power
Fleiss (1981) formula, used in the Power Calculator to compute n and generate power / trade-off curves
scipy.stats.norm
Two-Proportion Z-Test & Welch's T-Test
A/B analyzer: pooled SE for proportions; unequal-variance t-test for means; Cohen's d for effect size
scipy.stats.ttest_ind
Two-Way ANOVA with Interaction
OLS formula interface: Factor A, Factor B, and A:B interaction term; supports unbalanced and replicated designs
statsmodels ols + anova_lm
O'Brien-Fleming Alpha Spending
Boundary z_α / √(t/T): corrects for type I error inflation from repeated interim analyses; compared to naive p-value peeking via Monte Carlo
numpy simulation
Plotly: Server-Side Chart Generation
All charts are built as Plotly figure dicts on the server, serialised to JSON, and rendered by Plotly.js in the browser, enabling theming and layout control server-side
plotly.graph_objects
Debounced Live Computation
Every numeric input fires a POST request to the computation endpoint, but only after a 400–800 ms debounce (configurable per module). This gives the app a "live calculator" feel without hammering the server on every keystroke. The Sequential Testing module uses a longer 800 ms debounce due to its Monte Carlo cost.
Interactive Explorer
Representative Experiment Scenarios
Select a scenario to see typical outputs from each module. The values are illustrative and match the default parameters of the live app; open the live demo to run real computations in real time.
Performance Snapshot
Statistical Visualizations
Power vs MDE: Statistical power as a function of minimum detectable effect (MDE) at three power targets (baseline 10%, α = 0.05, two-tailed). Power collapses rapidly as MDE shrinks below 2 pp.
Sample Size Trade-off: Required per-group sample size as a function of desired power, for three MDE levels. Targeting 90% power costs roughly 35% more observations than 80% power.
Peeking Problem: Observed false positive rate under naive repeated testing vs the O'Brien-Fleming boundary across increasing numbers of interim looks. The null hypothesis is true in all simulations (α = 0.05).
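The "roughly 35%" figure for moving from 80% to 90% power can be checked with a back-of-the-envelope calculation. The sketch assumes the simplified relationship n ∝ (z_α + z_β)², which ignores the small variance differences in the full two-proportion formula; the function name is hypothetical.

```python
from scipy.stats import norm


def relative_sample_cost(power_hi=0.90, power_lo=0.80, alpha=0.05):
    """Extra observations needed for power_hi vs power_lo, as a fraction (two-tailed)."""
    z_alpha = norm.ppf(1 - alpha / 2)
    ratio = ((z_alpha + norm.ppf(power_hi)) / (z_alpha + norm.ppf(power_lo))) ** 2
    return ratio - 1.0


print(f"{relative_sample_cost():.1%} more observations")
```

Under this approximation the extra cost comes out at about 34%, in line with the rounded figure quoted for the trade-off chart, and it does not depend on the baseline or the MDE.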
Design Decisions
Key Engineering Choices
Blueprint Architecture
Each module is a self-contained Flask Blueprint with its own route, template, and statistical logic. Adding a new module requires no changes to the existing modules; just register the new Blueprint in app.py.
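A minimal sketch of this pattern, assuming a single route that serves the page on GET and returns JSON on POST. The Blueprint name, the `create_app` factory, the payload fields, and the placeholder HTML are hypothetical; a real module would render a template and run its statistical logic.

```python
from flask import Blueprint, Flask, jsonify, request

# One Blueprint per module; this one stands in for the power calculator.
power_bp = Blueprint("power", __name__)


@power_bp.route("/power", methods=["GET", "POST"])
def power():
    if request.method == "POST":
        # JSON computation path: parse inputs and return results as JSON.
        payload = request.get_json(silent=True) or {}
        baseline = float(payload.get("baseline", 0.10))
        mde = float(payload.get("mde", 0.02))
        return jsonify({"baseline": baseline, "mde": mde,
                        "target_rate": baseline + mde})
    # Full-page path: placeholder markup instead of a rendered template.
    return "<h1>Power Calculator</h1>"


def create_app():
    app = Flask(__name__)
    app.register_blueprint(power_bp)   # the only line that changes per new module
    return app


if __name__ == "__main__":
    create_app().run(port=7860)
```

Because the route, inputs, and computation live inside the Blueprint, the application file only needs one `register_blueprint` call per module.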
Server-side Plotly JSON
Charts are constructed in Python (full control over colours, annotations, and layout) and sent as serialised JSON. The client only calls Plotly.newPlot(), so no chart-building logic leaks into the frontend.
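The shape of such a payload can be illustrated without any charting dependency. The app builds its figures with plotly.graph_objects; the sketch below instead assembles a plain dict in the data/layout structure that Plotly.js accepts and serialises it with the standard library. The function name, titles, and the numbers passed in are placeholders.

```python
import json


def power_curve_figure(mde_values, power_values):
    """Build a Plotly-style figure dict on the server and serialise it to JSON."""
    figure = {
        "data": [{
            "type": "scatter",
            "mode": "lines",
            "x": list(mde_values),
            "y": list(power_values),
            "name": "Power",
        }],
        "layout": {
            "title": {"text": "Power vs MDE"},
            "xaxis": {"title": {"text": "MDE (pp)"}},
            "yaxis": {"title": {"text": "Power"}, "range": [0, 1]},
        },
    }
    return json.dumps(figure)


# Placeholder values only, to show the shape of the payload.
payload = power_curve_figure([1, 2, 3], [0.2, 0.6, 0.9])
print(payload)
```

On the client, the parsed object's data and layout can be handed directly to Plotly.newPlot(), so styling decisions stay in Python.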
2×2 Significance Verdict
The A/B analyzer enforces both statistical and practical significance as independent gates. This prevents the common mistake of shipping a statistically significant but business-irrelevant 0.1 pp conversion lift.
At a Glance
Quick read
What it is: A Flask-based statistical experimentation suite with four interactive modules.
Tech: Python, SciPy, statsmodels, Plotly, Jinja2.
Deploy: Docker on HuggingFace Spaces (port 7860).
Scope: Power analysis, A/B testing, factorial DoE, sequential testing.