A/B Testing & Statistical Experimentation: SciPy, Power Analysis, Hypothesis Testing

Architecture Overview

Experiment Lifecycle Pipeline

ExperimentLab maps directly onto the five stages of a well-run controlled experiment. Each module is a self-contained Flask Blueprint that handles both full-page rendering (GET) and JSON computation (POST), keeping statistical logic cleanly separated from presentation. The browser receives Plotly JSON objects and renders interactive charts client-side — no heavy SPA framework required.

🎯 Define Hypothesis: Set baseline, MDE & α

📐 Power Analysis: Compute required n & power curves

⚗️ Run Experiment: Collect observations per plan

⚖️ Statistical Test: z-test · t-test · ANOVA

🚀 Decision: Ship / Iterate / Hold

💡

Statistical + Practical Significance — Both Required

A common pitfall in experimentation is treating statistical significance as the only decision gate. ExperimentLab's A/B analyzer enforces a 2×2 verdict matrix: an experiment must clear both the p-value threshold and a user-specified practical significance threshold before returning a "PASS — Ship It" verdict. This prevents shipping changes that are statistically detectable but business-irrelevant.
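The two-gate logic described above can be sketched as a small pure function. This is a minimal illustration, not the app's code: the function name `verdict`, the parameter `practical_threshold`, and the mapping of the four cells onto the Ship / Iterate / Hold decisions are all assumptions made for the example.

```python
def verdict(p_value, observed_lift, alpha=0.05, practical_threshold=0.01):
    """Combine statistical and practical significance into one decision.

    Hypothetical helper: the cell-to-decision mapping is an assumption.
    """
    # Gate 1: is the effect statistically detectable?
    statistically_significant = p_value < alpha
    # Gate 2: is the improvement at least as large as the business threshold?
    practically_significant = observed_lift >= practical_threshold

    if statistically_significant and practically_significant:
        return "Ship"      # both gates cleared
    if practically_significant:
        return "Iterate"   # promising lift, but not yet statistically reliable
    return "Hold"          # too small to matter, whether detectable or not


print(verdict(p_value=0.01, observed_lift=0.02))
print(verdict(p_value=0.01, observed_lift=0.001))
```

Keeping the two checks as separate booleans makes each cell of the 2×2 matrix easy to unit-test on its own.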

Module Breakdown

Four Modules, One Workflow

📐 Pre-Experiment

Power Calculator

Computes the required per-group sample size for a two-proportion z-test. Returns power curves across MDE values and a sample-size trade-off chart across power targets.

Route: /power

Method: Two-proportion z-test power

Inputs: Baseline, MDE, α, power, tails

⚖️ Post-Experiment

A/B Test Analyzer

Runs two-proportion z-test or Welch's t-test and issues a traffic-light verdict combining statistical and practical significance. Supports manual input or CSV upload.

Route: /ab-test

Methods: z-test · Welch's t-test · Cohen's d

Input modes: Manual summary stats · CSV upload

🔬 Multi-factor

Two-Factor DoE

Full two-way factorial ANOVA for 2×2 or 3×3 designs. The user enters a response grid (with optional replicates per cell) and receives an ANOVA table, interaction plot, and main-effects chart.

Route: /doe

Method: Two-way ANOVA (statsmodels)

Levels: 2 or 3 per factor · replicate support

📈 Sequential

Sequential Testing Demo

Monte Carlo simulation of the peeking problem and the O'Brien-Fleming correction. Demonstrates how repeated interim checks inflate the false positive rate and how OBF boundaries hold it close to the nominal α.

Route: /sequential

Method: OBF alpha-spending (z_α / √(t/T))

Sims: 5–50 Monte Carlo runs

🎨 UX Design

Design System

A bespoke CSS design system built from scratch — split-panel layout, collapsible parameter sections, debounced live updates, Ctrl+Enter shortcut, toast notifications, and a responsive mobile layout.

Sidebar: Sticky · 260px · Dark navy

Charts: Plotly JSON from server

🐳 Deployment

Docker + HF Spaces

Docker-first architecture exposing port 7860, the default port expected by the Hugging Face Spaces Docker SDK. A single YAML front-matter block in README.md configures the Space, so each push triggers an automatic cloud deployment.

Port: 7860 (HF standard)

SDK: docker · python:3.10-slim

Statistical Methods

Core Methods & Libraries

All computation runs server-side in Python using scipy, numpy, and statsmodels — results are serialised as Plotly JSON and streamed to the browser. This keeps the client lightweight and the statistical logic easily testable.

Two-Proportion Z-Test Power

Fleiss (1981) formula — used in Power Calculator to compute n and generate power / trade-off curves

scipy.stats.norm
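As a rough sketch of the sample-size calculation, the function below implements the Fleiss-style formula without continuity correction, using `scipy.stats.norm` for the quantiles. Whether the app applies a continuity correction is not stated, so treat that choice, the function name `required_n_per_group`, and the absolute interpretation of `mde` as assumptions.

```python
from math import ceil, sqrt

from scipy.stats import norm


def required_n_per_group(baseline, mde, alpha=0.05, power=0.80, two_tailed=True):
    """Per-group n for a two-proportion z-test (no continuity correction)."""
    p1 = baseline
    p2 = baseline + mde            # mde is an absolute lift, e.g. 0.02 = 2 pp
    p_bar = (p1 + p2) / 2          # average proportion under the null

    z_alpha = norm.ppf(1 - alpha / 2) if two_tailed else norm.ppf(1 - alpha)
    z_beta = norm.ppf(power)

    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / mde ** 2)


# Baseline 10%, detect a 2 pp absolute lift at alpha = 0.05, 80% power.
print(required_n_per_group(baseline=0.10, mde=0.02))
```

Evaluating the same function over a grid of `mde` or `power` values would give the points for the power and trade-off curves.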

Two-Proportion Z-Test & Welch's T-Test

A/B analyzer — pooled SE for proportions; unequal-variance t-test for means; Cohen's d for effect size

scipy.stats.ttest_ind
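The three calculations named above might look roughly like this. The helper names are hypothetical, and Cohen's d is computed here with the pooled standard deviation, which is one common convention rather than a confirmed detail of the app.

```python
import numpy as np
from scipy import stats


def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in proportions using the pooled SE."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * stats.norm.sf(abs(z))
    return z, p_value


def welch_ttest_with_effect_size(a, b):
    """Welch's unequal-variance t-test plus Cohen's d (pooled SD)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    t_stat, p_value = stats.ttest_ind(b, a, equal_var=False)
    pooled_var = (
        (len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)
    ) / (len(a) + len(b) - 2)
    d = (b.mean() - a.mean()) / np.sqrt(pooled_var)
    return t_stat, p_value, d


# Made-up inputs purely for illustration.
print(two_proportion_ztest(conv_a=100, n_a=1000, conv_b=130, n_b=1000))
print(welch_ttest_with_effect_size([1, 2, 3, 4, 5], [2, 3, 4, 5, 6]))
```

Passing `equal_var=False` is what turns `scipy.stats.ttest_ind` from Student's t-test into Welch's.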

Two-Way ANOVA with Interaction

OLS formula interface — Factor A, Factor B, and A:B interaction term; supports unbalanced and replicated designs

statsmodels ols + anova_lm

O'Brien-Fleming Alpha Spending

Boundary z_α / √(t/T) — corrects for type I error inflation from repeated interim analyses; compared to naive p-value peeking via Monte Carlo

numpy simulation
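The comparison can be sketched with a small simulation under the null hypothesis. This version models each interim look as one equally sized batch contributing a standard normal increment, and it uses far more runs than the app's 5–50 so the rates are stable; both are assumptions for illustration. Note that the boundary z_α / √(t/T) uses z_α as its constant, whereas exact O'Brien-Fleming constants are slightly larger, so the rate under this boundary is expected to land close to, but a little above, the nominal α.

```python
import numpy as np
from scipy.stats import norm


def simulate_false_positive_rates(n_sims=2000, n_looks=5, alpha=0.05, seed=0):
    """Return (naive_rate, obf_rate) when the null hypothesis is true."""
    rng = np.random.default_rng(seed)
    z_alpha = norm.ppf(1 - alpha / 2)

    looks = np.arange(1, n_looks + 1)
    info_fraction = looks / n_looks                  # t / T at each look
    obf_bounds = z_alpha / np.sqrt(info_fraction)    # strict early, z_alpha at the end

    # Each batch adds an independent N(0, 1) increment under the null;
    # the z-statistic at look k is the running sum scaled by sqrt(k).
    increments = rng.standard_normal((n_sims, n_looks))
    z_at_look = np.cumsum(increments, axis=1) / np.sqrt(looks)

    # A run is a false positive if it crosses the boundary at any look.
    naive_rate = np.any(np.abs(z_at_look) > z_alpha, axis=1).mean()
    obf_rate = np.any(np.abs(z_at_look) > obf_bounds, axis=1).mean()
    return naive_rate, obf_rate


naive_fpr, obf_fpr = simulate_false_positive_rates()
print(f"naive peeking: {naive_fpr:.3f}, OBF boundary: {obf_fpr:.3f}")
```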

Plotly — Server-Side Chart Generation

All charts are built as Plotly figure dicts on the server, serialised to JSON, and rendered by Plotly.js in the browser — enabling theming and layout control server-side

plotly.graph_objects

Interactive Explorer

Representative Experiment Scenarios

Select a scenario to see typical outputs from each module. The values are illustrative and match the default parameters of the live app; open the live demo to run real computations in real time.

Performance Snapshot

Statistical Visualizations

Power vs MDE: Statistical power as a function of minimum detectable effect (MDE) at three power targets (baseline 10%, α = 0.05, two-tailed). Power collapses rapidly as MDE shrinks below 2 pp.

Sample Size Trade-off: Required per-group sample size as a function of desired power, for three MDE levels. Targeting 90% power costs roughly 35% more observations than 80% power.

Peeking Problem: Observed false positive rate under naive repeated testing vs the O'Brien-Fleming boundary across increasing numbers of interim looks. The null hypothesis is true in all simulations (α = 0.05).
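The cost of moving from 80% to 90% power can be checked with a short calculation. It relies on the standard approximation that required n scales with (z_{1-α/2} + z_{power})², which ignores the small difference between null and alternative variances; the function name is hypothetical.

```python
from scipy.stats import norm


def relative_sample_size(power_from, power_to, alpha=0.05):
    """Ratio of required n at power_to versus power_from (two-tailed test)."""
    z_alpha = norm.ppf(1 - alpha / 2)
    n_from = (z_alpha + norm.ppf(power_from)) ** 2
    n_to = (z_alpha + norm.ppf(power_to)) ** 2
    return n_to / n_from


ratio = relative_sample_size(0.80, 0.90)
print(f"{ratio - 1:.0%} more observations")
```

Under this approximation the increase comes out at about one third, in line with the rough figure quoted for the trade-off chart.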

Design Decisions

Key Engineering Choices

🏗️

Blueprint Architecture

Each module is a self-contained Flask Blueprint with its own route, template, and statistical logic. Adding a new module requires no changes to any existing module; the only edit outside the new code is registering its Blueprint in app.py.
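A minimal sketch of this pattern, with one Blueprint serving both the page (GET) and the JSON computation (POST). The names `power_bp` and `create_app`, the echoed payload, and the inline HTML string are placeholders; a real module would render a template and run its statistical logic.

```python
from flask import Blueprint, Flask, jsonify, request

# Hypothetical blueprint for one module; the name and route are illustrative.
power_bp = Blueprint("power", __name__)


@power_bp.route("/power", methods=["GET", "POST"])
def power():
    if request.method == "POST":
        # JSON computation path: parse parameters, return results as JSON.
        params = request.get_json(silent=True) or {}
        baseline = float(params.get("baseline", 0.10))
        return jsonify({"baseline": baseline})
    # Full-page path: a real module would call render_template here.
    return "<h1>Power Calculator</h1>"


def create_app():
    app = Flask(__name__)
    # Registering the blueprint is the only change needed outside the module.
    app.register_blueprint(power_bp)
    return app
```

Because the app is built in a factory function, each module can be exercised in isolation with Flask's test client.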

🔬

Server-side Plotly JSON

Charts are constructed in Python (full control over colours, annotations, and layout) and sent as serialised JSON. The client only calls Plotly.newPlot() — no chart-building logic leaks into the frontend.

⚖️

2×2 Significance Verdict

The A/B analyzer enforces both statistical and practical significance as independent gates. This prevents the common mistake of shipping a statistically significant but business-irrelevant 0.1 pp conversion lift.

ExperimentLab — The Complete A/B Testing Suite, from Power Analysis to Production
