Variational Autoencoder (VAE): Generative AI & Deep Learning — PyTorch, Neural Networks

How It Works

From Pixel to Latent Point and Back

A VAE learns to compress each 28×28 MNIST digit into a two-number coordinate in "latent space", then reconstruct the original from that coordinate. Because the KL term regularises the latent space toward a smooth standard Gaussian, you can pick a coordinate the encoder never produced during training, and the decoder will usually still generate a plausible-looking digit, especially within a few standard deviations of the origin. This page lets you watch every step of that process unfold in real time.

🖼️

MNIST Input

28×28 image flattened to 784-D vector

🔒

Encoder FC

784 → 400 hidden units (ReLU)

🎲

Reparameterise

μ & log σ² → z = μ + σ·ε

🔓

Decoder FC

2 → 400 → 784 (Sigmoid output)

✨

Generated Digit

Reconstructed or freshly sampled image

Feature Breakdown

Five Interactive Tabs, One Unified Playground

⚡ Training

Live Training Dashboard

Set epochs, batch size, learning rate, hidden dim, and latent dim. A progress bar and loss curve update every 600 ms as the model trains in a background thread — no page reloads.

Backend: Python threading

Poll interval: 600 ms

🏗️ Architecture

Network Topology Viewer

Visual layer-by-layer diagram of the encoder, reparameterisation step, and decoder. Updates dynamically when you change hyperparameters, alongside the ELBO loss formula breakdown.

Loss: BCE + KL divergence

Dims: 784 → H → L → H → 784

🌐 Latent Space

2-D Manifold Scatter Plot

Encodes 10,000 MNIST test images and plots their μ coordinates, coloured by digit class. Tight, separated clusters indicate a well-structured latent representation.

Points: 10,000 encoded digits

Colour: Class 0–9 (tab10)

🔁 Reconstruction

Side-by-Side Comparison

Randomly samples 10 MNIST images, encodes and decodes them, then displays originals and reconstructions in a two-row grid. Slight blurriness is expected: the per-pixel BCE loss pushes the decoder toward an average of the digits consistent with each low-dimensional latent code.

Samples: 10 random per click

Grid layout: 2 rows × 10 columns

✨ Generation

Latent Space Navigation

Two sliders (Z₁ and Z₂, range −3 to +3) let you walk through the latent manifold and instantly decode any coordinate into a digit image, or generate a 15×15 grid of the full space.

Slider range: −3.0 to +3.0

Grid size: 15 × 15 = 225 points
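The 15 × 15 grid can be pictured as an evenly spaced sweep over both sliders. The sketch below builds those 225 coordinates in plain Python; the function name `latent_grid` and the row ordering (Z₂ from +3 at the top to −3 at the bottom) are assumptions for illustration, not details taken from the app.

```python
def latent_grid(n: int = 15, lo: float = -3.0, hi: float = 3.0):
    """Return n*n (z1, z2) pairs covering [lo, hi] x [lo, hi] in row-major order."""
    step = (hi - lo) / (n - 1)
    axis = [lo + i * step for i in range(n)]
    # Assumed layout: rows run from z2 = hi (top) down to z2 = lo (bottom),
    # columns run from z1 = lo (left) to z1 = hi (right).
    return [(z1, z2) for z2 in reversed(axis) for z1 in axis]


coords = latent_grid()
print(len(coords))  # 225
```

Each pair would then be passed through the decoder and the resulting 28×28 images tiled in the same order.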

⚙️ Engineering

Thread-safe Flask API

Training runs in a daemon thread; its endpoints, including /start_training, /latent_space, /reconstruction, /generate, and /generate_grid, are stateless REST calls with base64-encoded PNG responses.

Endpoints: 7 REST routes

Output: Base64 PNG images

Architecture Detail

VAE Layer Diagram & Loss Decomposition

Input · 784-D

28×28 flattened pixel values (0–1 normalised)

↓

Encoder FC · 400-D (configurable)

Linear(784, H) → ReLU

↓ split ↓

μ head & log σ² head · 2-D (configurable)

Two independent Linear(H, L) heads

↓ reparameterise ↓

Latent vector z · 2-D

z = μ + exp(½ log σ²) · ε, where ε ∼ 𝒩(0, I)

↓

Decoder FC · 400-D (configurable)

Linear(L, H) → ReLU

↓

Output · 784-D

Linear(H, 784) → Sigmoid → pixel probabilities
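The layer diagram above maps fairly directly onto a small PyTorch module. The following is a minimal sketch under that reading, not the app's actual source: the class name `VAE` and the attribute and method names are assumptions, while the layer sizes, activations, and reparameterisation formula follow the diagram.

```python
import torch
from torch import nn


class VAE(nn.Module):
    """Fully connected VAE: 784 -> H -> (mu, log-variance) -> z -> H -> 784."""

    def __init__(self, input_dim: int = 784, hidden_dim: int = 400, latent_dim: int = 2):
        super().__init__()
        self.enc = nn.Linear(input_dim, hidden_dim)
        self.mu_head = nn.Linear(hidden_dim, latent_dim)      # mean of q(z|x)
        self.logvar_head = nn.Linear(hidden_dim, latent_dim)  # log sigma^2 of q(z|x)
        self.dec_hidden = nn.Linear(latent_dim, hidden_dim)
        self.dec_out = nn.Linear(hidden_dim, input_dim)

    def encode(self, x):
        h = torch.relu(self.enc(x))
        return self.mu_head(h), self.logvar_head(h)

    def reparameterise(self, mu, logvar):
        # z = mu + sigma * eps with eps ~ N(0, I); the randomness lives in eps,
        # so gradients can flow through mu and logvar.
        std = torch.exp(0.5 * logvar)
        return mu + std * torch.randn_like(std)

    def decode(self, z):
        h = torch.relu(self.dec_hidden(z))
        return torch.sigmoid(self.dec_out(h))  # pixel probabilities in [0, 1]

    def forward(self, x):
        mu, logvar = self.encode(x.reshape(x.size(0), -1))  # flatten 28x28 to 784
        z = self.reparameterise(mu, logvar)
        return self.decode(z), mu, logvar


model = VAE()
x = torch.rand(4, 1, 28, 28)  # random stand-in for a batch of MNIST images
recon, mu, logvar = model(x)
print(recon.shape, mu.shape, logvar.shape)
```

Passing different `hidden_dim` and `latent_dim` values to the constructor corresponds to the configurable H and L in the diagram.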

ELBO Loss (Evidence Lower Bound)

ℒ = ℒ_recon + KL

Reconstruction (Binary Cross-Entropy, sum)

ℒ_recon = −Σ [ x·log x̂ + (1−x)·log(1−x̂) ]

KL Divergence (closed-form Gaussian)

KL = −½ Σ [ 1 + log σ² − μ² − σ² ]
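As a rough illustration, the two terms above can be written in a few lines of PyTorch. The function name `vae_loss` is an assumption; the summed BCE and the closed-form KL follow the formulas as stated, with `logvar` standing for log σ².

```python
import torch
import torch.nn.functional as F


def vae_loss(recon_x, x, mu, logvar):
    """Return (total, reconstruction, KL), each summed over batch and dimensions."""
    recon = F.binary_cross_entropy(recon_x, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl, recon, kl


# Sanity check on made-up inputs: with mu = 0 and log sigma^2 = 0 the posterior
# equals the prior, so the KL term is zero.
x = torch.rand(8, 784)
recon_x = torch.full((8, 784), 0.5)
mu = torch.zeros(8, 2)
logvar = torch.zeros(8, 2)
total, recon, kl = vae_loss(recon_x, x, mu, logvar)
print(float(kl))
```

Because both terms are summed rather than averaged, the loss value scales with batch size; dividing by the number of samples gives a per-image figure that is easier to compare across runs.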

Configurable Hyperparameters

Parameter	Default
Epochs	30
Batch size	128
Learning rate	1e-3 (Adam)
Hidden dim (H)	400
Latent dim (L)	2
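To show how these hyperparameters fit together, here is a compact training-loop sketch. It is an assumption-laden illustration rather than the app's code: it trains on random tensors instead of MNIST, runs 2 epochs instead of the default 30 to keep the example fast, and all variable names are made up.

```python
import torch
from torch import nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

EPOCHS, BATCH_SIZE, LR, H, L = 2, 128, 1e-3, 400, 2  # page default is 30 epochs

torch.manual_seed(0)
data = torch.rand(512, 784)  # random stand-in for flattened MNIST images
loader = DataLoader(TensorDataset(data), batch_size=BATCH_SIZE, shuffle=True)

enc = nn.Sequential(nn.Linear(784, H), nn.ReLU())
mu_head, logvar_head = nn.Linear(H, L), nn.Linear(H, L)
dec = nn.Sequential(nn.Linear(L, H), nn.ReLU(), nn.Linear(H, 784), nn.Sigmoid())
params = [*enc.parameters(), *mu_head.parameters(),
          *logvar_head.parameters(), *dec.parameters()]
optimizer = torch.optim.Adam(params, lr=LR)

history = []  # mean loss per sample, one entry per epoch
for epoch in range(EPOCHS):
    running = 0.0
    for (x,) in loader:
        h = enc(x)
        mu, logvar = mu_head(h), logvar_head(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        recon = dec(z)
        bce = F.binary_cross_entropy(recon, x, reduction="sum")
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        loss = bce + kl
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        running += loss.item()
    history.append(running / len(data))

print(history)
```

A list like `history` is the kind of per-epoch series a live loss curve would plot.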

Interactive Explorer

Visualise VAE Concepts Without Running the App

Select a view below to see illustrative examples of what each tab in the live app produces after training.

Each cell represents a region of the 2-D latent space. After training, digits of the same class cluster together — the VAE has learned to organise the latent manifold semantically.

Stylised representation of the 2-D latent space; colour encodes digit class (0–9).

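Top row: original MNIST samples. Bottom row: VAE reconstructions decoded from the 2-D latent code. Slight blurriness is expected: with a per-pixel BCE loss and only two latent numbers per image, the decoder outputs something close to an average of the digits that share nearby codes.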

Illustrative reconstruction comparison. Live app uses actual PyTorch model outputs.

In the live app, two sliders control Z₁ and Z₂ (each from −3 to +3). The decoder maps that coordinate to a digit image in real time. Moving across the manifold smoothly interpolates between digit classes.

Five sample (Z₁, Z₂) coordinates and their decoded digit representation. The live app lets you explore any point with sliders.

Training Dynamics

Expected Training Behaviour

ELBO Loss Curve

Typical ELBO loss over 30 epochs on the 10,000-sample MNIST subset with LR=1e-3 and hidden dim=400. Loss drops sharply in early epochs as the decoder learns basic digit structure, then flattens as fine detail is refined.

Latent Cluster Quality

Approximate fraction of digit classes that form visually distinct latent clusters, measured by inter-cluster distance. Most classes separate by epoch 10; digits 4/9 and 3/5 remain closest due to visual similarity.

Latent Dim Trade-off

Trade-off between reconstruction quality (lower BCE = better) and latent interpretability as latent dimension increases from 2 to 20. The 2-D default is chosen to maximise visual interpretability at a small quality cost.

Design Decisions

Three Engineering Choices That Define This App

🔄

Non-blocking training

Training runs in a Python daemon thread so the UI stays fully interactive — you can switch tabs, inspect the architecture, and watch the live loss curve while training proceeds.
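One common way to arrange this is a lock-protected status object that the training thread writes and the polling requests read. The sketch below uses only the standard library; the class `TrainingState`, the function `train`, and the placeholder loss values are all hypothetical, and a real version would run actual training epochs inside the loop.

```python
import threading
import time


class TrainingState:
    """Shared progress record guarded by a lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._state = {"running": False, "epoch": 0, "total_epochs": 0, "losses": []}

    def start(self, total_epochs: int) -> bool:
        with self._lock:
            if self._state["running"]:
                return False  # refuse to start a second run while one is active
            self._state.update(running=True, epoch=0,
                               total_epochs=total_epochs, losses=[])
            return True

    def record(self, epoch: int, loss: float) -> None:
        with self._lock:
            self._state["epoch"] = epoch
            self._state["losses"].append(loss)

    def finish(self) -> None:
        with self._lock:
            self._state["running"] = False

    def snapshot(self) -> dict:
        # Return a copy so callers never hold a reference to the live list.
        with self._lock:
            return {**self._state, "losses": list(self._state["losses"])}


def train(state: TrainingState, epochs: int) -> None:
    try:
        for epoch in range(1, epochs + 1):
            time.sleep(0.01)                  # stand-in for one real epoch
            state.record(epoch, 1.0 / epoch)  # placeholder loss, not a real result
    finally:
        state.finish()


state = TrainingState()
started = state.start(total_epochs=5)
worker = threading.Thread(target=train, args=(state, 5), daemon=True)
worker.start()
worker.join()  # a web handler would return immediately instead of joining
final = state.snapshot()
print(final["epoch"], final["running"])
```

A status route polled every 600 ms could simply return `state.snapshot()` as JSON.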

🗜️

Base64 image API

Every visualisation (loss curve, latent scatter, reconstruction grid) is rendered server-side by Matplotlib, encoded as base64 PNG, and injected into the DOM — zero external image hosting needed.
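A minimal sketch of that render-and-encode step, assuming Matplotlib's headless Agg backend: the helper name `figure_to_base64` and the plotted numbers are made up for illustration and are not taken from the app.

```python
import base64
import io

import matplotlib
matplotlib.use("Agg")  # headless backend: render without a display
import matplotlib.pyplot as plt


def figure_to_base64(fig) -> str:
    """Render a Matplotlib figure to PNG in memory and return it base64-encoded."""
    buf = io.BytesIO()
    fig.savefig(buf, format="png", bbox_inches="tight")
    plt.close(fig)  # free the figure so repeated requests do not leak memory
    return base64.b64encode(buf.getvalue()).decode("ascii")


fig, ax = plt.subplots(figsize=(4, 3))
ax.plot([1, 2, 3, 4], [200.0, 170.0, 160.0, 155.0])  # placeholder values
ax.set_xlabel("Epoch")
ax.set_ylabel("Loss")
encoded = figure_to_base64(fig)
data_uri = "data:image/png;base64," + encoded
```

On the client, a string like `data_uri` can be assigned to an image element's `src` attribute, which is what makes separate image hosting unnecessary.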

🎛️

Fully configurable dims

Latent dim and hidden dim are runtime hyperparameters. Changing them re-instantiates the VAE and updates the architecture diagram labels live — the UI always reflects the model you're actually training.

Train a Generative Model from Scratch — Then Explore Its Mind in Real Time
