NLP · Deep Learning · Seq2Seq · Python · PyTorch · Flask · Live on HuggingFace Spaces
Train a Transformer from Scratch Step by Step in Your Browser
A fully interactive, 6-step educational playground where you configure the dataset, build the vocabulary, design the model architecture, watch it train live, evaluate BLEU scores, and translate new sentences — all powered by a real PyTorch Transformer training loop streamed to your browser via Server-Sent Events.
Interactive steps: Data → Vocab → Model → Train → Evaluate → Infer
138
Handcrafted EN→FR pairs across 8 vocabulary categories
SSE
Server-Sent Events stream live loss, LR, and sample translations per epoch
Beam-4
Greedy vs beam search decoding with SacreBLEU evaluation
Live
Dockerised and deployed on HuggingFace Spaces — no local setup
How It Works
From Raw Sentence Pairs to Live Translation
The app guides you through every conceptual layer of building a Transformer from scratch. You control the dataset size and augmentation factor, watch the vocabulary form token-by-token, dial in architectural hyperparameters, then observe the model learn real English-to-French translation — all the way to cross-attention heatmaps that show exactly which source word the model attended to when producing each French token.
📊
Data
138 pairs · augmentation up to 16× · train/val split
📚
Vocab
EN + FR dictionaries · 4 special tokens · live tokeniser
🏗️
Model
d_model, heads, layers, d_ff configurable · live param count
🎯
Train
SSE loss stream · live translations · early stopping
📈
Evaluate
SacreBLEU · greedy vs beam · sample table
🔤
Infer
Translate any sentence · OOV warnings · attention heatmap
💡
Why Server-Sent Events Instead of WebSockets?
SSE is a unidirectional HTTP stream — the server pushes events as they happen with no client-side socket handshake. For a training dashboard where the browser only ever receives epoch results, batch loss, and LR values, SSE is simpler, firewall-friendly, and reconnects automatically on drop. Every epoch end emits a single JSON event containing losses, perplexity, patience status, and three live sample translations.
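The pattern can be sketched in a few lines of stdlib Python, assuming a thread-safe queue named `events` that the training thread pushes into, and a Flask route that returns this generator with `mimetype='text/event-stream'` (names are illustrative, not the app's actual internals):

```python
import json
import queue

# Thread-safe queue the training thread pushes one dict into per epoch
# (hypothetical plumbing; the live app keeps its own event state).
events: "queue.Queue[dict]" = queue.Queue()

def sse_format(event: dict) -> str:
    """Serialise one epoch result as a Server-Sent Events message."""
    return f"data: {json.dumps(event)}\n\n"

def event_stream():
    """Generator a Flask route can return with mimetype='text/event-stream'."""
    while True:
        event = events.get()
        yield sse_format(event)
        if event.get("done"):   # final epoch or early stop ends the stream
            break
```

Each yielded chunk is one complete `data: …\n\n` frame, which is all the `EventSource` API on the browser side needs to fire a message event.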
Feature Breakdown
Six Steps, One Unified Playground
📊 Step 1
Data Explorer
Choose how many of the 138 pairs to use (20 → 138) and set an augmentation factor (1× → 16×). A live summary shows the resulting total training sample count with the 87.5 / 12.5 train-val split before any computation starts.
Max pairs: slider 20 → 138
Augmentation: 1× → 16× shuffle-repeat
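The shuffle-repeat augmentation and the 87.5/12.5 split reduce to a short helper (hypothetical function name and seed handling; the app keeps its own bookkeeping):

```python
import random

def augment_and_split(pairs, factor=12, val_frac=0.125, seed=0):
    """Repeat the pair list `factor` times with an independent shuffle per
    repeat, then cut the tail off as validation (87.5/12.5 by default)."""
    rng = random.Random(seed)
    augmented = []
    for _ in range(factor):
        batch = list(pairs)
        rng.shuffle(batch)
        augmented.extend(batch)
    n_val = max(1, int(len(augmented) * val_frac))  # never an empty val set
    return augmented[:-n_val], augmented[-n_val:]
```

With the full 138 pairs at 16×, this yields 2,208 training samples before the split, which is the total the live summary reports.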
📚 Step 2
Vocabulary Builder
Builds separate EN and FR token dictionaries from the selected pairs, displays sample word-to-index mappings, and provides a live tokeniser where any sentence is instantly colour-coded: known tokens green, unknown tokens red with an UNK badge.
EN vocab: ~95 tokens (full set)
FR vocab: ~115 tokens (full set)
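Vocabulary construction and the green/red token classification can be sketched as follows, assuming the four specials sit at indices 0-3 with `<unk>` at index 3 (a simplified stand-in for the app's builder):

```python
SPECIALS = ["<pad>", "<sos>", "<eos>", "<unk>"]  # indices 0-3

def build_vocab(sentences):
    """Map each token to an integer index, specials first."""
    vocab = {tok: i for i, tok in enumerate(SPECIALS)}
    for sent in sentences:
        for tok in sent.lower().split():
            vocab.setdefault(tok, len(vocab))
    return vocab

def classify_tokens(sentence, vocab):
    """Return (token, known?) pairs; the UI colours known tokens green
    and unknown tokens red with an UNK badge."""
    return [(tok, tok in vocab) for tok in sentence.lower().split()]
```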
🏗️ Step 3
Model Configurator
Sliders and dropdowns for d_model (64–256), num_heads (2/4/8), num_layers (1–4), d_ff (128/256/512), and dropout. An estimated parameter count updates live and an ASCII architecture diagram redraws on every change.
Default config: d=128 · h=4 · L=2
Param range: ~50 K → ~2 M
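A back-of-the-envelope estimate, ignoring biases and layer norms, shows why the count spans roughly 50 K to 2 M (an approximation, not the app's exact counter):

```python
def estimate_params(d_model, n_heads, n_layers, d_ff,
                    src_vocab=95, tgt_vocab=115):
    """Rough weight count for an encoder-decoder Transformer.
    n_heads does not change the count: heads split d_model."""
    emb = (src_vocab + tgt_vocab) * d_model   # token embeddings
    attn = 4 * d_model * d_model              # Q, K, V, output projections
    ffn = 2 * d_model * d_ff                  # two feed-forward linears
    enc = n_layers * (attn + ffn)             # self-attn + FFN per layer
    dec = n_layers * (2 * attn + ffn)         # self-attn + cross-attn + FFN
    out = d_model * tgt_vocab                 # generator projection
    return emb + enc + dec + out
```

The default config (d=128, 2 layers, d_ff=256) lands around 700 K weights by this estimate, comfortably inside the quoted range.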
🎯 Step 4
Live Training Dashboard
Configures epochs, batch size, patience, and warmup steps. A Chart.js loss curve updates in real time via SSE. Sample translations for three fixed sentences appear every epoch so you can watch the model learn. Early stopping restores the best checkpoint automatically.
LR schedule: warmup inverse-sqrt
Loss: label-smoothed KL div
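Patience-based early stopping with best-checkpoint restore reduces to a small tracker like this (hypothetical class; the app stores a PyTorch `state_dict` rather than a string):

```python
class EarlyStopper:
    """Track validation loss; signal a stop after `patience` epochs
    without improvement, remembering the best epoch's checkpoint."""
    def __init__(self, patience=5):
        self.patience = patience
        self.best_loss = float("inf")
        self.best_state = None
        self.bad_epochs = 0

    def step(self, val_loss, state):
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.best_state = state     # checkpoint to restore later
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience  # True => stop training
```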
📈 Step 5
BLEU Evaluation
After training, computes SacreBLEU corpus scores for both greedy and beam-4 decoding across all 138 pairs. A 10-row side-by-side table shows English, reference French, greedy output, and beam output with colour coding.
Metric: SacreBLEU corpus
Expected range: 70–95+ at convergence
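To make the metric concrete, here is a simplified corpus BLEU with uniform 4-gram weights and a brevity penalty; the app itself calls the real SacreBLEU library, which additionally handles tokenisation and smoothing:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hyps, refs, max_n=4):
    """Simplified corpus BLEU: clipped n-gram precision, geometric mean,
    brevity penalty. A teaching stand-in, not a SacreBLEU replacement."""
    hyp_len = ref_len = 0
    clipped = [0] * max_n
    totals = [0] * max_n
    for hyp, ref in zip(hyps, refs):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            hc, rc = ngrams(h, n), ngrams(r, n)
            clipped[n - 1] += sum(min(c, rc[g]) for g, c in hc.items())
            totals[n - 1] += sum(hc.values())
    if 0 in totals or 0 in clipped:
        return 0.0      # no smoothing: any empty precision zeroes the score
    log_p = sum(math.log(c / t) for c, t in zip(clipped, totals)) / max_n
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * bp * math.exp(log_p)
```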
🔤 Step 6
Inference + Attention Heatmap
Click any of 15 held-out test sentences (10 all-known, 5 intentional OOV) or type your own. As you type, tokens are colour-coded live. OOV words trigger a warning banner. The cross-attention matrix is rendered as a canvas heatmap.
With only ~95 source tokens, hard one-hot targets cause the model to become over-confident very quickly — the cross-entropy loss collapses, but the model generalises poorly to held-out word order. Label smoothing spreads 5% of the probability mass uniformly across the vocabulary, preventing overfitting on this small closed dataset while keeping the loss signal strong enough to train in 30 epochs. A larger ε of 0.1, tried in early experiments, suppressed useful gradient signal on a vocabulary this small.
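The smoothed target distribution is easy to write out. This plain-Python sketch uses one common formulation (true token keeps 1−ε, the rest is shared evenly, padding excluded) and mirrors the ε=0.05 choice; the app builds the same distribution as a tensor for its KL-divergence loss:

```python
def smoothed_targets(target_idx, vocab_size, eps=0.05, pad_idx=0):
    """Label-smoothed target: the true token keeps 1-eps, and eps is
    spread uniformly over the remaining non-pad vocabulary entries."""
    fill = eps / (vocab_size - 2)   # exclude the true token and <pad>
    dist = [fill] * vocab_size
    dist[pad_idx] = 0.0
    dist[target_idx] = 1.0 - eps
    return dist
```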
Dataset Design
138 Pairs Across 8 Vocabulary Categories
All sentence pairs are handcrafted — no external corpus, web scraping, or automatic translation. Each category introduces specific linguistic structures so the model must learn generalisation across subject pronouns, verb tenses, adjective agreement, pluralisation, spatial prepositions, and adverb placement. A separate 15-sentence held-out test set was designed to probe exactly those combinations — 10 using only known vocabulary, and 5 deliberately containing out-of-vocabulary words to demonstrate <unk> degradation.
🔬
Held-Out Test Set: Intentional OOV Probing
Five test sentences were written to include words like fox, student, always, and very — words absent from the training vocabulary. The app shows each unknown token with a red UNK badge as you type and displays an orange warning banner before translation. This teaches users concretely why vocabulary coverage matters and what a model does when it encounters unknown input.
⚠ fox · ⚠ student · ⚠ always · ⚠ very · ⚠ excited · ⚠ an
Interactive Explorer
Visualise Key Concepts Without Running the App
Select a view to see illustrative representations of what each section of the live app produces.
Cross-attention heatmap for the sentence "the dog is running in the park" after training. Each row is a French output token; each column is an English source token. Brighter purple = stronger attention weight. Notice how le attends strongly to the, and parc attends to park.
Illustrative heatmap — live app renders the actual PyTorch attention weights as a canvas element.
The tokeniser lowercases input, inserts spaces around punctuation, then maps each token to an integer index. Tokens outside the training vocabulary are flagged as <unk> (index 3).
Sentence: "the dog is running in the park"
Encoded (with SOS/EOS):
Unknown token example: "the fox is running in the park"
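The steps above reduce to a few lines, assuming the specials sit at indices 1 (SOS), 2 (EOS), and 3 (UNK), as in the app's vocabulary layout:

```python
import re

SOS, EOS, UNK = 1, 2, 3   # special-token indices

def tokenise(sentence):
    """Lowercase, put spaces around punctuation, then split."""
    return re.sub(r"([.!?,])", r" \1 ", sentence.lower()).split()

def encode(sentence, vocab):
    """Token indices wrapped in SOS/EOS; out-of-vocabulary words map to 3."""
    return [SOS] + [vocab.get(t, UNK) for t in tokenise(sentence)] + [EOS]
```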
Greedy decoding selects the highest-probability token at each step — fast but locally suboptimal. Beam search maintains 4 candidate sequences simultaneously and selects the one with the best length-normalised score.
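A length-normalised beam search can be sketched generically: `step_fn` stands in for the trained decoder's next-token log-probabilities and `alpha` for the length-normalisation exponent (an illustrative interface; the app's exact scoring details may differ):

```python
import math

def beam_search(step_fn, sos, eos, beam_width=4, max_len=10, alpha=0.6):
    """Keep the `beam_width` best partial sequences each step; return the
    finished sequence with the best score sum(log p) / len**alpha.
    `step_fn(seq)` must return {next_token: log_prob}."""
    beams = [([sos], 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos:           # completed: retire from the beam
                finished.append((seq, score))
                continue
            for tok, logp in step_fn(seq).items():
                candidates.append((seq + [tok], score + logp))
        if not candidates:
            break
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    finished.extend(b for b in beams if b[0][-1] == eos)
    pool = finished or beams             # fall back if nothing finished
    return max(pool, key=lambda c: c[1] / len(c[0]) ** alpha)[0]
```

Greedy decoding is the `beam_width=1` special case; keeping four hypotheses lets a locally weaker first token win when its continuation scores better overall.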
Training Dynamics
Expected Training Behaviour
Loss Curve
LR Schedule
BLEU vs Epochs
Typical train and validation loss over 30 epochs with d_model=128, 2 layers, 12× augmentation. Early stopping typically triggers around epoch 25–35 depending on configuration.
Warmup LR schedule with 100 warmup steps: LR ramps from 0, peaks around step 100, then decays as step^−0.5. This stabilises early gradient steps when embeddings are randomly initialised.
SacreBLEU corpus score (beam-4) as a function of training epochs on the 138-pair vocabulary set. Beam search consistently outperforms greedy by 3–8 BLEU points throughout training.
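The warmup schedule shown in the middle panel is the inverse-sqrt rule from the original Transformer paper; with warmup=100 it ramps linearly, peaks at step 100, then decays as step^−0.5 (a sketch, the app's constants may include an extra scale factor):

```python
def lr_at(step, d_model=128, warmup=100):
    """Inverse-sqrt schedule: linear ramp for `warmup` steps, then
    step**-0.5 decay, both scaled by d_model**-0.5."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```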
Engineering Decisions
Three Architectural Choices That Define This App
📡
SSE Live Streaming
Training runs in a Python daemon thread and pushes JSON events per epoch — loss, PPL, LR, sample translations — to a persistent HTTP connection. The chart updates in real time with zero polling overhead.
🔒
Step-lock Navigation
Each of the 6 steps unlocks only after the previous one completes. This enforces the correct conceptual order and prevents the common beginner mistake of trying to translate before the vocabulary or model exists.
🧠
Attention Heatmap Canvas
Cross-attention weights are extracted from the last decoder layer, averaged across heads, then rendered directly on an HTML canvas — no image encoding required, enabling crisp dynamic scaling to any token-sequence length.
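The head-averaging step is a one-liner on a tensor (`attn.mean(dim=0)`); this plain-Python stand-in shows the same reduction on a nested list shaped [heads][tgt_len][src_len]:

```python
def heatmap_matrix(cross_attn):
    """Average per-head cross-attention into one [tgt x src] matrix,
    ready to paint cell-by-cell onto the canvas."""
    heads = len(cross_attn)
    tgt_len, src_len = len(cross_attn[0]), len(cross_attn[0][0])
    return [[sum(cross_attn[h][t][s] for h in range(heads)) / heads
             for s in range(src_len)]
            for t in range(tgt_len)]
```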
At a Glance
Quick read
What it does: Trains a Transformer seq2seq model in the browser with live loss streaming, configurable hyperparameters, and step-by-step pedagogical unlocking.
How: PyTorch + Flask + SSE → Chart.js live chart + Canvas attention heatmap.
Why it's educational: Every component of the Transformer — tokenisation, positional encoding, multi-head attention, label smoothing, beam search — is observable and interactive.