Statistics & Experimentation · Python · Flask · SciPy · Live on HuggingFace Spaces

ExperimentLab: The Complete A/B Testing Suite, from Power Analysis to Production

An interactive end-to-end experimentation platform covering the full controlled experiment lifecycle, from pre-experiment power planning to post-experiment decision-making. Four production-quality modules, each backed by rigorous statistical methods and interactive Plotly visualizations.

2024 · Mohammad Noorchenarboo · Synthetic & user-supplied data · 5 statistical methods
4 Interactive Modules
5 Statistical Methods
100% Server-side Computation
2-way Factorial ANOVA Support
OBF: O'Brien-Fleming Boundary
Architecture Overview

Experiment Lifecycle Pipeline

ExperimentLab maps directly onto the five stages of a well-run controlled experiment. Each module is a self-contained Flask Blueprint that handles both full-page rendering (GET) and JSON computation (POST), keeping statistical logic cleanly separated from presentation. The browser receives Plotly JSON objects and renders interactive charts client-side, so no heavy SPA framework is required.

Define Hypothesis: Set baseline, MDE & α
Power Analysis: Compute required n & power curves
Run Experiment: Collect observations per plan
Statistical Test: z-test · t-test · ANOVA
Decision: Ship / Iterate / Hold
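
The Blueprint pattern described above (one route serving the page on GET and returning Plotly JSON on POST) might look roughly like the sketch below. It is an illustration only, not the app's code: the blueprint name `power_bp`, the placeholder HTML, the `create_app` factory, and the hand-written figure dict with dummy values are all assumptions made for the example.

```python
from flask import Blueprint, Flask, jsonify, request

# Hypothetical blueprint name; the real module layout is not shown in the source.
power_bp = Blueprint("power", __name__)


@power_bp.route("/power", methods=["GET", "POST"])
def power():
    if request.method == "GET":
        # The real app would render a full page template here.
        return "<h1>Power Calculator</h1>"

    # POST: read JSON parameters and return a Plotly-style figure dict.
    params = request.get_json(silent=True) or {}
    baseline = float(params.get("baseline", 0.10))
    mdes = [0.01, 0.02, 0.03]
    figure = {
        # Placeholder trace: treatment rate implied by each MDE.
        "data": [{"type": "scatter", "mode": "lines",
                  "x": mdes, "y": [baseline + m for m in mdes]}],
        "layout": {"title": {"text": "Treatment rate vs MDE"}},
    }
    return jsonify(figure=figure)


def create_app():
    app = Flask(__name__)
    app.register_blueprint(power_bp)  # the only wiring a new module needs
    return app
```

On the client, the returned `figure` object could be passed more or less directly to `Plotly.newPlot()`.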

Statistical + Practical Significance: Both Required

A common pitfall in experimentation is treating statistical significance as the only decision gate. ExperimentLab's A/B analyzer enforces a 2×2 verdict matrix: an experiment must clear both the p-value threshold and a user-specified practical significance threshold before returning a "PASS - Ship It" verdict. This prevents shipping changes that are statistically detectable but business-irrelevant.
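
A minimal sketch of such a two-gate verdict, assuming the practical threshold is expressed as a minimum absolute lift. The function name, the default thresholds, and every label other than the "PASS" verdict are hypothetical.

```python
def verdict(p_value, observed_lift, alpha=0.05, practical_threshold=0.01):
    """Combine statistical and practical significance into one verdict."""
    stat_sig = p_value < alpha
    # Only a lift in the desired direction counts as practically significant.
    practical = observed_lift >= practical_threshold

    if stat_sig and practical:
        return "PASS - Ship It"
    if stat_sig:
        # Detectable, but too small to matter (hypothetical label).
        return "Significant but below practical threshold"
    if practical:
        # Large point estimate, but could be noise (hypothetical label).
        return "Large but not statistically significant"
    return "No meaningful effect detected"  # hypothetical label
```

For example, `verdict(0.001, 0.001)` does not pass: the 0.1 pp lift is statistically significant but falls below the 1 pp practical threshold.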

Module Breakdown

Four Modules, One Workflow

Pre-Experiment
Power Calculator
Computes the required per-group sample size for a two-proportion z-test. Returns power curves across MDE values and a sample-size trade-off chart across power targets.
Route: /power
Method: Two-proportion z-test power
Inputs: Baseline, MDE, α, power, tails
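
As a rough illustration of the calculation behind this module, the sketch below uses a common textbook form of the two-proportion sample-size formula without continuity correction. Whether the app uses exactly this variant is an assumption, as are the function name and defaults. It relies on the standard library's `NormalDist` rather than `scipy.stats.norm` so that it runs without extra dependencies.

```python
import math
from statistics import NormalDist


def sample_size_two_proportions(baseline, mde, alpha=0.05, power=0.80,
                                two_tailed=True):
    """Per-group n to detect an absolute lift `mde` over `baseline`."""
    p1, p2 = baseline, baseline + mde
    p_bar = (p1 + p2) / 2
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1 - alpha / (2 if two_tailed else 1))
    z_beta = norm.inv_cdf(power)
    # Null-hypothesis term uses the pooled rate; the alternative term
    # uses each group's own variance.
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / mde ** 2)


n_per_group = sample_size_two_proportions(baseline=0.10, mde=0.02)
```

Evaluating the same function over a grid of MDE values or power targets would give the data for the two charts the module returns.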
Post-Experiment
A/B Test Analyzer
Runs a two-proportion z-test or Welch's t-test and issues a traffic-light verdict combining statistical and practical significance. Supports manual input or CSV upload.
Route: /ab-test
Methods: z-test · Welch's t-test · Cohen's d
Input modes: Manual summary stats · CSV upload
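
The two-proportion z-test with a pooled standard error can be sketched from summary counts as follows. The function name and argument order are hypothetical, and `NormalDist` from the standard library stands in for SciPy here.

```python
import math
from statistics import NormalDist


def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates (B minus A)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis of equal proportions.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value, p_b - p_a


z, p_value, lift = two_proportion_ztest(500, 5000, 600, 5000)
```

The returned `p_value` and `lift` are the two quantities a combined statistical and practical verdict would need.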
Multi-factor
Two-Factor DoE
Full two-way factorial ANOVA for 2×2 or 3×3 designs. The user enters a response grid (with optional replicates per cell) and receives an ANOVA table, interaction plot, and main-effects chart.
Route: /doe
Method: Two-way ANOVA (statsmodels)
Levels: 2 or 3 per factor · replicate support
Sequential
Sequential Testing Demo
Monte Carlo simulation of the peeking problem and the O'Brien-Fleming correction. Demonstrates how repeated interim checks inflate the false positive rate and how OBF boundaries fix it.
Route: /sequential
Method: OBF alpha-spending (z_α / √(t/T))
Sims: 5–50 Monte Carlo runs
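
The peeking comparison can be sketched with a small null-hypothesis simulation. This is an illustration under stated assumptions rather than the app's code: it uses the standard library's `random` instead of numpy, treats each look as one equal-sized batch whose standardised sum is N(0, 1), applies the boundary z_α / √(t/T) with the two-sided critical value, and defaults to far more runs than the app's 5–50 so that the estimated rates are stable.

```python
import math
import random
from statistics import NormalDist


def simulate_peeking(n_sims=2000, n_looks=5, alpha=0.05, seed=0):
    """Return (naive_fpr, obf_fpr) when the null hypothesis is true."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    naive_hits = obf_hits = 0
    for _ in range(n_sims):
        running_sum = 0.0
        naive_rejected = obf_rejected = False
        for look in range(1, n_looks + 1):
            running_sum += rng.gauss(0.0, 1.0)   # one new batch of data
            z = running_sum / math.sqrt(look)    # z-statistic at this look
            # Naive peeking: same fixed threshold at every look.
            if abs(z) > z_crit:
                naive_rejected = True
            # OBF-style boundary: very strict early, z_crit at the final look.
            if abs(z) > z_crit / math.sqrt(look / n_looks):
                obf_rejected = True
        naive_hits += naive_rejected
        obf_hits += obf_rejected
    return naive_hits / n_sims, obf_hits / n_sims


naive_fpr, obf_fpr = simulate_peeking()
```

With several looks, the naive rate should come out well above the nominal α, while the boundary-adjusted rate stays close to it.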
UX Design
Design System
A bespoke CSS design system built from scratch: split-panel layout, collapsible parameter sections, debounced live updates, Ctrl+Enter shortcut, toast notifications, and a responsive mobile layout.
Sidebar: Sticky · 260px · Dark navy
Charts: Plotly JSON from server
Deployment
Docker + HF Spaces
Docker-first architecture exposing port 7860, the default port expected by the HuggingFace Spaces Docker SDK. A single YAML front-matter block in README.md triggers automatic cloud deployment on push.
Port: 7860 (HF standard)
SDK: docker · python:3.10-slim
Statistical Methods

Core Methods & Libraries

All computation runs server-side in Python using scipy, numpy, and statsmodels; results are serialised as Plotly JSON and streamed to the browser. This keeps the client lightweight and the statistical logic easily testable.

Two-Proportion Z-Test Power
Fleiss (1981) formula, used in the Power Calculator to compute n and generate power / trade-off curves
scipy.stats.norm
Two-Proportion Z-Test & Welch's T-Test
A/B analyzer: pooled SE for proportions; unequal-variance t-test for means; Cohen's d for effect size
scipy.stats.ttest_ind
Two-Way ANOVA with Interaction
OLS formula interface: Factor A, Factor B, and A:B interaction term; supports unbalanced and replicated designs
statsmodels ols + anova_lm
O'Brien-Fleming Alpha Spending
Boundary z_α / √(t/T): corrects for type I error inflation from repeated interim analyses; compared to naive p-value peeking via Monte Carlo
numpy simulation
Plotly: Server-Side Chart Generation
All charts are built as Plotly figure dicts on the server, serialised to JSON, and rendered by Plotly.js in the browser, enabling theming and layout control server-side
plotly.graph_objects
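
For the Welch's t-test and Cohen's d entry in the list above, a sketch working from manually entered summary statistics might look like this. It uses `scipy.stats.ttest_ind_from_stats` (the summary-statistics counterpart of `ttest_ind`); that choice, the function name, and the pooled-SD definition of Cohen's d are assumptions, not confirmed details of the app.

```python
import math

from scipy import stats


def welch_from_summary(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Welch's t-test (B vs A) and Cohen's d from summary statistics."""
    # equal_var=False selects the unequal-variance (Welch) test.
    t_stat, p_value = stats.ttest_ind_from_stats(
        mean_b, sd_b, n_b, mean_a, sd_a, n_a, equal_var=False
    )
    # Cohen's d standardises the mean difference by the pooled SD.
    pooled_sd = math.sqrt(
        ((n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2) / (n_a + n_b - 2)
    )
    cohens_d = (mean_b - mean_a) / pooled_sd
    return t_stat, p_value, cohens_d


t_stat, p_value, cohens_d = welch_from_summary(20.0, 5.0, 200, 21.5, 6.0, 180)
```

Reporting the effect size next to the p-value gives a scale-free measure of how large the difference is, separate from whether it is statistically detectable.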

Debounced Live Computation

Every numeric input fires a POST request to the computation endpoint, but only after a 400–800 ms debounce (configurable per module). This gives the app a "live calculator" feel without hammering the server on every keystroke. The Sequential Testing module uses a longer 800 ms debounce due to its Monte Carlo cost.

Interactive Explorer

Representative Experiment Scenarios

Select a scenario to see typical outputs from each module. These are illustrative values drawn from the default parameters of the live app.

Illustrative outputs matching the default parameters in the live app. Open the live demo to interact with real computations in real time.

Performance Snapshot

Statistical Visualizations

Power vs MDE: Statistical power as a function of minimum detectable effect (MDE) at three power targets (baseline 10%, α = 0.05, two-tailed). Power collapses rapidly as MDE shrinks below 2 pp.

Sample Size Trade-off: Required per-group sample size as a function of desired power, for three MDE levels. Targeting 90% power costs roughly 35% more observations than 80% power.

Peeking Problem: Observed false positive rate under naive repeated testing vs the O'Brien-Fleming boundary across increasing numbers of interim looks. The null hypothesis is true in all simulations (α = 0.05).
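
The cost of moving from 80% to 90% power can be sanity-checked with a short calculation. The sketch assumes the usual normal approximation, in which required n scales with (z_{α/2} + z_β)², and ignores the small differences between variance terms in the full formula. The function name is hypothetical.

```python
from statistics import NormalDist


def relative_sample_cost(power_hi=0.90, power_lo=0.80, alpha=0.05):
    """Fractional increase in n when raising power (two-tailed test)."""
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1 - alpha / 2)
    z_hi = norm.inv_cdf(power_hi)
    z_lo = norm.inv_cdf(power_lo)
    # n is proportional to (z_alpha + z_beta)^2 for a fixed effect size.
    return ((z_alpha + z_hi) / (z_alpha + z_lo)) ** 2 - 1


extra = relative_sample_cost()
```

Under this approximation the increase works out to about a third more observations, in line with the figure quoted for the trade-off chart.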

Design Decisions

Key Engineering Choices

Blueprint Architecture
Each module is a self-contained Flask Blueprint with its own route, template, and statistical logic. Adding a new module requires no changes to the existing modules; the new Blueprint only needs to be registered in app.py.
Server-side Plotly JSON
Charts are constructed in Python (full control over colours, annotations, and layout) and sent as serialised JSON. The client only calls Plotly.newPlot(), so no chart-building logic leaks into the frontend.
2×2 Significance Verdict
The A/B analyzer enforces both statistical and practical significance as independent gates. This prevents the common mistake of shipping a statistically significant but business-irrelevant 0.1 pp conversion lift.