NLP · Deep Learning · Seq2Seq · Python · PyTorch · Flask · Live on HuggingFace Spaces
Train a Transformer from Scratch Step by Step in Your Browser
A fully interactive, 6-step educational playground where you configure the dataset, build the vocabulary, design the model architecture, watch it train live, evaluate BLEU scores, and translate new sentences — all powered by a real PyTorch Transformer training loop streamed to your browser via Server-Sent Events.
Interactive steps: Data → Vocab → Model → Train → Evaluate → Infer
138
Handcrafted EN→FR pairs across 8 vocabulary categories
SSE
Server-Sent Events stream live loss, LR, and sample translations per epoch
Beam-4
Greedy vs beam search decoding with SacreBLEU evaluation
Live
Dockerised and deployed on HuggingFace Spaces — no local setup
How It Works
From Raw Sentence Pairs to Live Translation
The app guides you through every conceptual layer of building a Transformer from scratch. You control the dataset size and augmentation factor, watch the vocabulary form token-by-token, dial in architectural hyperparameters, then observe the model learn real English-to-French translation — all the way to cross-attention heatmaps that show exactly which source word the model attended to when producing each French token.
📊
Data
138 pairs · augmentation up to 16× · train/val split
📚
Vocab
EN + FR dictionaries · 4 special tokens · live tokeniser
🏗️
Model
d_model, heads, layers, d_ff configurable · live param count
🎯
Train
SSE loss stream · live translations · early stopping
📈
Evaluate
SacreBLEU · greedy vs beam · sample table
🔤
Infer
Translate any sentence · OOV warnings · attention heatmap
💡
Why Server-Sent Events Instead of WebSockets?
SSE is a unidirectional HTTP stream — the server pushes events as they happen with no client-side socket handshake. For a training dashboard where the browser only ever receives epoch results, batch loss, and LR values, SSE is simpler, firewall-friendly, and reconnects automatically on drop. Every epoch end emits a single JSON event containing losses, perplexity, patience status, and three live sample translations.
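The pattern can be sketched in a few lines of stdlib Python, assuming a thread-safe queue named `events` that the training thread pushes into, and a Flask route that returns this generator with `mimetype='text/event-stream'` (names are illustrative, not the app's actual internals):

```python
import json
import queue

# Thread-safe queue the training thread pushes one dict into per epoch
# (hypothetical plumbing; the live app keeps its own event state).
events: "queue.Queue[dict]" = queue.Queue()

def sse_format(event: dict) -> str:
    """Serialise one epoch result as a Server-Sent Events message."""
    return f"data: {json.dumps(event)}\n\n"

def event_stream():
    """Generator a Flask route can return with mimetype='text/event-stream'."""
    while True:
        event = events.get()
        yield sse_format(event)
        if event.get("done"):   # final epoch or early stop ends the stream
            break
```

Each yielded chunk is one complete `data: …\n\n` frame, which is all the `EventSource` API on the browser side needs to fire a message event.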
Feature Breakdown
Six Steps, One Unified Playground
📊 Step 1
Data Explorer
Choose how many of the 138 pairs to use (20 → 138) and set an augmentation factor (1× → 16×). A live summary shows the resulting total training sample count with the 87.5 / 12.5 train-val split before any computation starts.
Max pairs: slider 20 → 138
Augmentation: 1× → 16× shuffle-repeat
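The shuffle-repeat augmentation and the 87.5/12.5 split reduce to a short helper (hypothetical function name and seed handling; the app keeps its own bookkeeping):

```python
import random

def augment_and_split(pairs, factor=12, val_frac=0.125, seed=0):
    """Repeat the pair list `factor` times with an independent shuffle per
    repeat, then cut the tail off as validation (87.5/12.5 by default)."""
    rng = random.Random(seed)
    augmented = []
    for _ in range(factor):
        batch = list(pairs)
        rng.shuffle(batch)
        augmented.extend(batch)
    n_val = max(1, int(len(augmented) * val_frac))  # never an empty val set
    return augmented[:-n_val], augmented[-n_val:]
```

With the full 138 pairs at 16×, this yields 2,208 training samples before the split, which is the total the live summary reports.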
📚 Step 2
Vocabulary Builder
Builds separate EN and FR token dictionaries from the selected pairs, displays sample word-to-index mappings, and provides a live tokeniser where any sentence is instantly colour-coded: known tokens green, unknown tokens red with an UNK badge.
EN vocab: ~95 tokens (full set)
FR vocab: ~115 tokens (full set)
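Vocabulary construction and the green/red token classification can be sketched as follows, assuming the four specials sit at indices 0-3 with `<unk>` at index 3 (a simplified stand-in for the app's builder):

```python
SPECIALS = ["<pad>", "<sos>", "<eos>", "<unk>"]  # indices 0-3

def build_vocab(sentences):
    """Map each token to an integer index, specials first."""
    vocab = {tok: i for i, tok in enumerate(SPECIALS)}
    for sent in sentences:
        for tok in sent.lower().split():
            vocab.setdefault(tok, len(vocab))
    return vocab

def classify_tokens(sentence, vocab):
    """Return (token, known?) pairs; the UI colours known tokens green
    and unknown tokens red with an UNK badge."""
    return [(tok, tok in vocab) for tok in sentence.lower().split()]
```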
🏗️ Step 3
Model Configurator
Sliders and dropdowns for d_model (64–256), num_heads (2/4/8), num_layers (1–4), d_ff (128/256/512), and dropout. An estimated parameter count updates live and an ASCII architecture diagram redraws on every change.
Default config: d=128 · h=4 · L=2
Param range: ~50 K → ~2 M
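A back-of-the-envelope estimate, ignoring biases and layer norms, shows why the count spans roughly 50 K to 2 M (an approximation, not the app's exact counter):

```python
def estimate_params(d_model, n_heads, n_layers, d_ff,
                    src_vocab=95, tgt_vocab=115):
    """Rough weight count for an encoder-decoder Transformer.
    n_heads does not change the count: heads split d_model."""
    emb = (src_vocab + tgt_vocab) * d_model   # token embeddings
    attn = 4 * d_model * d_model              # Q, K, V, output projections
    ffn = 2 * d_model * d_ff                  # two feed-forward linears
    enc = n_layers * (attn + ffn)             # self-attn + FFN per layer
    dec = n_layers * (2 * attn + ffn)         # self-attn + cross-attn + FFN
    out = d_model * tgt_vocab                 # generator projection
    return emb + enc + dec + out
```

The default config (d=128, 2 layers, d_ff=256) lands around 700 K weights by this estimate, comfortably inside the quoted range.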
🎯 Step 4
Live Training Dashboard
Configures epochs, batch size, patience, and warmup steps. A Chart.js loss curve updates in real time via SSE. Sample translations for three fixed sentences appear every epoch so you can watch the model learn. Early stopping restores the best checkpoint automatically.
LR schedule: warmup inverse-sqrt
Loss: label-smoothed KL div
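Patience-based early stopping with best-checkpoint restore reduces to a small tracker like this (hypothetical class; the app stores a PyTorch `state_dict` rather than a string):

```python
class EarlyStopper:
    """Track validation loss; signal a stop after `patience` epochs
    without improvement, remembering the best epoch's checkpoint."""
    def __init__(self, patience=5):
        self.patience = patience
        self.best_loss = float("inf")
        self.best_state = None
        self.bad_epochs = 0

    def step(self, val_loss, state):
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.best_state = state     # checkpoint to restore later
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience  # True => stop training
```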
📈 Step 5
BLEU Evaluation
After training, computes SacreBLEU corpus scores for both greedy and beam-4 decoding across all 138 pairs. A 10-row side-by-side table shows English, reference French, greedy output, and beam output with colour coding.
Metric: SacreBLEU corpus
Expected range: 70–95+ at convergence
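To make the metric concrete, here is a simplified corpus BLEU with uniform 4-gram weights and a brevity penalty; the app itself calls the real SacreBLEU library, which additionally handles tokenisation and smoothing:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hyps, refs, max_n=4):
    """Simplified corpus BLEU: clipped n-gram precision, geometric mean,
    brevity penalty. A teaching stand-in, not a SacreBLEU replacement."""
    hyp_len = ref_len = 0
    clipped = [0] * max_n
    totals = [0] * max_n
    for hyp, ref in zip(hyps, refs):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            hc, rc = ngrams(h, n), ngrams(r, n)
            clipped[n - 1] += sum(min(c, rc[g]) for g, c in hc.items())
            totals[n - 1] += sum(hc.values())
    if 0 in totals or 0 in clipped:
        return 0.0      # no smoothing: any empty precision zeroes the score
    log_p = sum(math.log(c / t) for c, t in zip(clipped, totals)) / max_n
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * bp * math.exp(log_p)
```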
🔤 Step 6
Inference + Attention Heatmap
Click any of 15 held-out test sentences (10 all-known, 5 intentional OOV) or type your own. As you type, tokens are colour-coded live. OOV words trigger a warning banner. The cross-attention matrix is rendered as a canvas heatmap.
With only ~95 source tokens, hard one-hot targets cause the model to become over-confident very quickly — the cross-entropy loss collapses, but the model generalises poorly to held-out word order. Label smoothing spreads 5% of the probability mass uniformly across the vocabulary, preventing overfitting on this small closed dataset while keeping the loss signal strong enough to train in 30 epochs. A larger ε of 0.1, tried in early experiments, suppressed useful gradient signal on a vocabulary this small.
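The smoothed target distribution is easy to write out. This plain-Python sketch uses one common formulation (true token keeps 1−ε, the rest is shared evenly, padding excluded) and mirrors the ε=0.05 choice; the app builds the same distribution as a tensor for its KL-divergence loss:

```python
def smoothed_targets(target_idx, vocab_size, eps=0.05, pad_idx=0):
    """Label-smoothed target: the true token keeps 1-eps, and eps is
    spread uniformly over the remaining non-pad vocabulary entries."""
    fill = eps / (vocab_size - 2)   # exclude the true token and <pad>
    dist = [fill] * vocab_size
    dist[pad_idx] = 0.0
    dist[target_idx] = 1.0 - eps
    return dist
```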
Dataset Design
138 Pairs Across 8 Vocabulary Categories
All sentence pairs are handcrafted — no external corpus, web scraping, or automatic translation. Each category introduces specific linguistic structures so the model must learn generalisation across subject pronouns, verb tenses, adjective agreement, pluralisation, spatial prepositions, and adverb placement. A separate 15-sentence held-out test set was designed to probe exactly those combinations — 10 using only known vocabulary, and 5 deliberately containing out-of-vocabulary words to demonstrate <unk> degradation.
🔬
Held-Out Test Set: Intentional OOV Probing
Five test sentences were written to include words like fox, student, always, and very — words absent from the training vocabulary. The app shows each unknown token with a red UNK badge as you type and displays an orange warning banner before translation. This teaches users concretely why vocabulary coverage matters and what a model does when it encounters unknown input.
⚠ fox · ⚠ student · ⚠ always · ⚠ very · ⚠ excited · ⚠ an
Interactive Explorer
Visualise Key Concepts Without Running the App
Select a view to see illustrative representations of what each section of the live app produces.
Cross-attention heatmap for the sentence "the dog is running in the park" after training. Each row is a French output token; each column is an English source token. Brighter purple = stronger attention weight. Notice how le attends strongly to the, and parc attends to park.
Illustrative heatmap — live app renders the actual PyTorch attention weights as a canvas element.
The tokeniser lowercases input, inserts spaces around punctuation, then maps each token to an integer index. Tokens outside the training vocabulary are flagged as <unk> (index 3).
Sentence: "the dog is running in the park"
Encoded (with SOS/EOS):
Unknown token example: "the fox is running in the park"
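The steps above reduce to a few lines, assuming the specials sit at indices 1 (SOS), 2 (EOS), and 3 (UNK), as in the app's vocabulary layout:

```python
import re

SOS, EOS, UNK = 1, 2, 3   # special-token indices

def tokenise(sentence):
    """Lowercase, put spaces around punctuation, then split."""
    return re.sub(r"([.!?,])", r" \1 ", sentence.lower()).split()

def encode(sentence, vocab):
    """Token indices wrapped in SOS/EOS; out-of-vocabulary words map to 3."""
    return [SOS] + [vocab.get(t, UNK) for t in tokenise(sentence)] + [EOS]
```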
Greedy decoding selects the highest-probability token at each step — fast but locally suboptimal. Beam search maintains 4 candidate sequences simultaneously and selects the one with the best length-normalised score.
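A length-normalised beam search can be sketched generically: `step_fn` stands in for the trained decoder's next-token log-probabilities and `alpha` for the length-normalisation exponent (an illustrative interface; the app's exact scoring details may differ):

```python
import math

def beam_search(step_fn, sos, eos, beam_width=4, max_len=10, alpha=0.6):
    """Keep the `beam_width` best partial sequences each step; return the
    finished sequence with the best score sum(log p) / len**alpha.
    `step_fn(seq)` must return {next_token: log_prob}."""
    beams = [([sos], 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos:           # completed: retire from the beam
                finished.append((seq, score))
                continue
            for tok, logp in step_fn(seq).items():
                candidates.append((seq + [tok], score + logp))
        if not candidates:
            break
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    finished.extend(b for b in beams if b[0][-1] == eos)
    pool = finished or beams             # fall back if nothing finished
    return max(pool, key=lambda c: c[1] / len(c[0]) ** alpha)[0]
```

Greedy decoding is the `beam_width=1` special case; keeping four hypotheses lets a locally weaker first token win when its continuation scores better overall.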
Training Dynamics
Expected Training Behaviour
Loss Curve
LR Schedule
BLEU vs Epochs
Typical train and validation loss over 30 epochs with d_model=128, 2 layers, 12× augmentation. Early stopping typically triggers around epoch 25–35 depending on configuration.
Warmup LR schedule with 100 warmup steps: LR ramps from 0, peaks around step 100, then decays as step^−0.5. This stabilises early gradient steps when embeddings are randomly initialised.
SacreBLEU corpus score (beam-4) as a function of training epochs on the 138-pair vocabulary set. Beam search consistently outperforms greedy by 3–8 BLEU points throughout training.
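The warmup schedule shown in the middle panel is the inverse-sqrt rule from the original Transformer paper; with warmup=100 it ramps linearly, peaks at step 100, then decays as step^−0.5 (a sketch, the app's constants may include an extra scale factor):

```python
def lr_at(step, d_model=128, warmup=100):
    """Inverse-sqrt schedule: linear ramp for `warmup` steps, then
    step**-0.5 decay, both scaled by d_model**-0.5."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```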
Engineering Decisions
Three Architectural Choices That Define This App
📡
SSE Live Streaming
Training runs in a Python daemon thread and pushes JSON events per epoch — loss, PPL, LR, sample translations — to a persistent HTTP connection. The chart updates in real time with zero polling overhead.
🔒
Step-lock Navigation
Each of the 6 steps unlocks only after the previous one completes. This enforces the correct conceptual order and prevents the common beginner mistake of trying to translate before the vocabulary or model exists.
🧠
Attention Heatmap Canvas
Cross-attention weights are extracted from the last decoder layer, averaged across heads, then rendered directly on an HTML canvas — no image encoding required, enabling crisp dynamic scaling to any token-sequence length.
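The head-averaging step is a one-liner on a tensor (`attn.mean(dim=0)`); this plain-Python stand-in shows the same reduction on a nested list shaped [heads][tgt_len][src_len]:

```python
def heatmap_matrix(cross_attn):
    """Average per-head cross-attention into one [tgt x src] matrix,
    ready to paint cell-by-cell onto the canvas."""
    heads = len(cross_attn)
    tgt_len, src_len = len(cross_attn[0]), len(cross_attn[0][0])
    return [[sum(cross_attn[h][t][s] for h in range(heads)) / heads
             for s in range(src_len)]
            for t in range(tgt_len)]
```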
At a Glance
Quick read
What it does: Trains a Transformer seq2seq model in the browser with live loss streaming, configurable hyperparameters, and step-by-step pedagogical unlocking.
How: PyTorch + Flask + SSE → Chart.js live chart + Canvas attention heatmap.
Why it's educational: Every component of the Transformer — tokenisation, positional encoding, multi-head attention, label smoothing, beam search — is observable and interactive.