
But how do AI images and videos actually work? | Guest video by Welch Labs

3Blue1Brown · 6 min read

Based on 3Blue1Brown's video on YouTube. If you like this content, support the original creators by watching, liking, and subscribing.

TL;DR

Diffusion generation starts from random noise and iteratively refines it using a learned transformer, gradually building coherent structure over many steps.

Briefing

Text-to-image and text-to-video systems work because diffusion models can be understood as reversing a physics-like random process—then steering that reversal using a learned “shared space” between language and vision. The payoff is practical: the same geometric ideas that connect diffusion to Brownian motion also yield algorithms that generate sharper images faster, and they explain why prompts can reliably shape what comes out.

The walkthrough starts with a hands-on look at how a diffusion video model turns noise into structure. Generation begins by sampling a random video—pixel intensities chosen at random—then repeatedly feeds that partially structured output through a transformer. Each iteration adds a new refinement step: the result stays mostly noise early on, but gradually accumulates coherent motion and scene details. Even when the prompt is reduced to nothing, the model still produces a coherent video, showing that the core mechanism is learning a powerful prior over plausible visual worlds; prompts then act as guidance rather than full instructions.

The physics connection comes next. Diffusion models are trained by corrupting real images with increasing amounts of noise until almost no signal remains, then learning how to reverse that corruption. A naive “one-step denoising” intuition doesn’t match how modern systems are trained. Instead, the DDPM approach trains the network to predict the total noise added to the original image, not just the noise removed in a single step. In a simplified 2D toy setting, adding noise corresponds to a random walk; reversing it means learning a direction field that points back toward the data distribution. Crucially, the learned direction depends on the diffusion time, so conditioning on time lets the model learn coarse structure when noise is high and fine structure as noise vanishes.

Two practical surprises follow from this view. First, adding random noise during generation (as in DDPM sampling) improves sharpness. Without those stochastic steps, trajectories collapse toward an average of the training data—often producing blurry results—because the model’s learned vector field behaves like a mean-seeking estimator early in the reverse process. Second, later work shows that the same target distribution can be reached without random steps by converting the stochastic process into an equivalent deterministic one. DDIM achieves this by using differential-equation machinery (stochastic differential equations and the Fokker–Planck connection) to match the final distribution while reducing the number of steps.

Prompting enters through CLIP, which learns a shared embedding space where matching image–caption pairs align and mismatched pairs repel using a contrastive objective and cosine similarity. CLIP alone maps text and images into vectors but can’t generate. Diffusion fills that gap: the text embedding can condition the denoising process so the reverse diffusion trajectory lands on images consistent with the prompt. OpenAI’s DALL·E 2 (via unCLIP) is described as training a diffusion model to invert CLIP’s image encoder, yielding strong prompt adherence.

Conditioning alone still isn’t enough for tight control. Classifier-free guidance fixes this by combining an unconditional vector field (no text/class) with a conditioned one, then amplifying the difference. The result is that increasing the guidance scale makes details “grow” toward what the prompt specifies. For video, WAN’s approach extends the idea with negative prompts—explicitly listing unwanted features—then subtracting the negative-conditioned direction to steer away from artifacts like extra fingers or backward walking. The overall message is that diffusion’s physics-inspired geometry, combined with language-vision embeddings and guidance tricks, turns language into a controllable steering signal for a learned generative process.
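The subtract-and-steer arithmetic is simple enough to sketch directly. Below is a minimal NumPy illustration of the negative-prompt variant, assuming (as many open-source pipelines do) that the negative prompt's noise prediction takes the place of the unconditional one in the guidance formula:

```python
import numpy as np

def guided_eps(eps_pos, eps_neg, scale):
    # Steer toward the positive prompt and away from the negative one:
    # amplify the direction separating the two noise predictions.
    return eps_neg + scale * (eps_pos - eps_neg)

# Toy 2-D "predictions": axis 0 = a wanted feature, axis 1 = an artifact
# (e.g. extra fingers) listed in the negative prompt.
eps_pos = np.array([1.0, 0.0])
eps_neg = np.array([0.0, 1.0])
print(guided_eps(eps_pos, eps_neg, 5.0))  # pushes axis 0 up, axis 1 down
```

In typical implementations the two predictions come from two forward passes of the same network under different text conditioning.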

Cornell Notes

Diffusion models generate images and videos by starting from random noise and iteratively reversing a noise-corruption process. The key insight is that this reversal can be interpreted as running Brownian-motion-like random walks backward in a high-dimensional space, guided by a learned time-dependent vector field (a “score function”). DDPM training predicts total noise added to the original sample, and conditioning on diffusion time lets the model learn both coarse and fine structure. During sampling, adding noise prevents collapse toward the dataset mean (which otherwise produces blur), while later methods like DDIM remove stochastic steps to reach the same final distribution more efficiently. Prompt control comes from CLIP-style embeddings plus classifier-free guidance, which amplifies the difference between unconditional and conditioned denoising directions; negative prompts further steer away from specific artifacts.

Why does diffusion generation start from pure noise, and what changes across iterations?

Generation begins by sampling a random video (or image) where pixel intensities are chosen randomly. That noisy sample is passed through a transformer that outputs a refined version that still looks mostly like noise but begins to show structure. The refined output is then added back into the current noisy state and fed through the model again. Repeating this loop gradually shapes the noise into coherent frames and motion consistent with the learned visual prior; prompts then steer the trajectory rather than fully determining every detail.
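That loop can be caricatured in a few lines. The sketch below is purely illustrative: a 1-D signal stands in for a video, and `toy_denoiser` stands in for the transformer, pulling the sample toward a fixed target more aggressively as diffusion time approaches zero:

```python
import numpy as np

rng = np.random.default_rng(0)
target = np.sin(np.linspace(0, 2 * np.pi, 64))  # the "coherent video"

def toy_denoiser(x, t):
    # Stand-in for the transformer: nudges the sample toward the target
    # pattern, more aggressively as diffusion time t approaches 0.
    return x + (target - x) * (1.0 - t)

# Start from pure noise ("random pixel intensities") and refine T times.
T = 50
x = rng.standard_normal(64)
for step in range(T):
    t = 1.0 - step / T              # time runs from 1 down toward 0
    x = toy_denoiser(x, t)
# After many refinements, x is essentially the coherent target signal.
```

The real model's refinement direction is learned from data rather than hard-coded, but the shape of the loop, repeated small corrections conditioned on diffusion time, is the same.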

What does CLIP learn, and why is it useful for prompting diffusion models?

CLIP trains two encoders—one for text and one for images—so that embeddings for matching image–caption pairs are close and embeddings for non-matching pairs are far. Training uses a contrastive objective over all pairs in a batch, with similarity measured by cosine similarity in a shared 512-dimensional embedding space. This creates a geometry where semantic concepts correspond to vector directions, enabling text embeddings to act as guidance signals for other models.
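A minimal NumPy sketch of that geometry, with random 512-dimensional vectors standing in for real encoder outputs (the temperature value is a typical choice, not a quoted constant):

```python
import numpy as np

def cosine_sim_matrix(img_emb, txt_emb):
    # L2-normalize so plain dot products equal cosine similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    return img @ txt.T                       # (batch, batch) similarity grid

def contrastive_loss(sim, temperature=0.07):
    # Symmetric cross-entropy: image i should pick caption i, and
    # caption i should pick image i, against all others in the batch.
    logits = sim / temperature
    logp_rows = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    logp_cols = logits - np.log(np.exp(logits).sum(axis=0, keepdims=True))
    return -0.5 * (np.diag(logp_rows).mean() + np.diag(logp_cols).mean())

# Toy batch: 4 matched pairs share a base direction, plus small noise.
rng = np.random.default_rng(0)
base = rng.standard_normal((4, 512))
img_emb = base + 0.1 * rng.standard_normal((4, 512))
txt_emb = base + 0.1 * rng.standard_normal((4, 512))
sim = cosine_sim_matrix(img_emb, txt_emb)
# Matched pairs (the diagonal) dominate each row of the similarity grid.
```

Minimizing this loss is what pulls matching pairs together and pushes mismatched pairs apart in the shared space.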

Why doesn’t the “denoise one step at a time” intuition match DDPM training?

The DDPM approach doesn’t train the network to reverse each intermediate noise step directly. Instead, it samples a clean image X0, adds scaled random noise ε to obtain a noisy version, and trains the model to predict the total noise ε that was added. In a toy random-walk picture, this corresponds to learning the vector that points back toward the original data distribution, which reduces variance and improves learning efficiency.
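In symbols, the closed-form forward process gives x_t = √ᾱ_t·x₀ + √(1−ᾱ_t)·ε, and training minimizes ‖ε̂ − ε‖². A runnable NumPy sketch, using the linear β schedule from the DDPM paper and a placeholder in place of the trained network:

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear beta schedule and its cumulative products (alpha-bar), as in DDPM.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

def noisy_sample(x0, t):
    # Closed-form forward process: jump straight from x0 to x_t.
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return xt, eps

def ddpm_loss(model, x0):
    # The network sees (x_t, t) and must predict the TOTAL noise eps
    # mixed into x0, not the residual of a single reverse step.
    t = int(rng.integers(0, T))
    xt, eps = noisy_sample(x0, t)
    return np.mean((model(xt, t) - eps) ** 2)

# A model that always predicts zero pays roughly the variance of eps (~1).
x0 = rng.standard_normal(64)
loss = ddpm_loss(lambda xt, t: np.zeros_like(xt), x0)
```

Because x_t is reachable in one jump, training can sample a random (x₀, t) pair per example instead of simulating the whole random walk.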

Why does adding random noise during DDPM sampling improve sharpness?

Without the stochastic noise injection during each reverse step, trajectories tend to collapse toward the mean (average) of the training data distribution. In image space, averaging plausible variations often looks like blur. DDPM’s per-step random noise keeps samples from collapsing to the center by effectively sampling from the appropriate distribution around the learned mean direction.
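One reverse step can be sketched with the noise injection made explicit as a flag; the ε prediction below is a placeholder, and σ_t = √β_t is one standard choice for the injected noise scale:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def ddpm_step(xt, t, eps_hat, add_noise=True):
    # One reverse step: move to the predicted mean, then (optionally)
    # inject fresh Gaussian noise. Skipping the injection makes every
    # trajectory follow the mean path, which averages away detail.
    mean = (xt - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
    if add_noise and t > 0:
        return mean + np.sqrt(betas[t]) * rng.standard_normal(xt.shape)
    return mean

xT = rng.standard_normal(8)
det1 = ddpm_step(xT, 500, np.zeros(8), add_noise=False)
det2 = ddpm_step(xT, 500, np.zeros(8), add_noise=False)
sto = ddpm_step(xT, 500, np.zeros(8), add_noise=True)
# det1 and det2 are identical; sto differs because of the injected noise.
```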

How does DDIM generate images deterministically without changing the final distribution?

DDIM replaces the stochastic reverse process with a deterministic ordinary differential equation that yields the same final distribution as the stochastic differential equation. The connection relies on results like the Fokker–Planck equation from statistical mechanics, which links stochastic dynamics to distribution evolution. Practically, DDIM uses different step-size scaling so trajectories follow the learned vector field more faithfully, reaching high-quality results in fewer steps without random noise at each step.
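The deterministic DDIM update (the η = 0 case) can be sketched as follows; the zero ε prediction is a placeholder for the trained network, and determinism is what allows striding through the schedule in far fewer steps:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

def ddim_step(xt, t, t_prev, eps_hat):
    # Deterministic DDIM update (eta = 0): recover the model's current
    # best guess of the clean sample, then re-noise it to the smaller
    # noise level t_prev. No random draw anywhere in the step.
    x0_hat = (xt - np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alpha_bars[t])
    return np.sqrt(alpha_bars[t_prev]) * x0_hat + np.sqrt(1.0 - alpha_bars[t_prev]) * eps_hat

# Determinism lets us stride through the schedule: 20 jumps, not 1000.
rng = np.random.default_rng(0)
x = rng.standard_normal(16)
timesteps = list(range(T - 1, 0, -50))             # 999, 949, ..., 49
for t, t_prev in zip(timesteps, timesteps[1:] + [0]):
    x = ddim_step(x, t, t_prev, np.zeros_like(x))  # placeholder network
```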

How does classifier-free guidance make prompts “stick” better than conditioning alone?

Classifier-free guidance uses two denoising predictions: one conditioned on the prompt (or class) and one unconditioned (no prompt/class). The method computes the difference between conditioned and unconditioned vector fields and amplifies it by a guidance scale α. This subtract-and-scale trick steers samples more strongly toward the prompt-specific region of the learned manifold, improving adherence; increasing α typically increases detail and size of prompt-relevant features.
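The combination itself is a one-liner; here is a toy sketch with 2-D "predictions" (a guidance scale around 7.5 is a commonly used default, not a value quoted from the video):

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, scale):
    # Start from the unconditional prediction and amplify whatever
    # direction the prompt conditioning adds on top of it.
    return eps_uncond + scale * (eps_cond - eps_uncond)

eps_u = np.array([0.0, 1.0])   # toy unconditional prediction
eps_c = np.array([1.0, 1.0])   # toy prompt-conditioned prediction
for scale in (0.0, 1.0, 7.5):
    print(scale, cfg_combine(eps_u, eps_c, scale))
# Scale 0 recovers the unconditional prediction, scale 1 the plain
# conditional one; larger scales exaggerate the prompt direction.
```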

Review Questions

  1. In the DDPM framework, what exactly is the network trained to predict: the denoised image at each step or the total noise added to the original sample? Why does that matter?
  2. Explain the role of diffusion time conditioning in learning a time-varying vector field. What changes in the learned directions as time approaches 0?
  3. Why does removing the random noise injection during DDPM sampling tend to produce blurry outputs, and how does DDIM address efficiency without changing the final distribution?

Key Points

  1. Diffusion generation starts from random noise and iteratively refines it using a learned transformer, gradually building coherent structure over many steps.
  2. Diffusion models can be interpreted as reversing Brownian-motion-like random walks in high-dimensional space using a time-dependent learned vector field.
  3. DDPM training predicts the total noise added to the original sample (not just one-step denoising), improving learning efficiency through reduced variance.
  4. Adding random noise during DDPM sampling prevents collapse toward the dataset mean, which otherwise produces blur.
  5. DDIM achieves faster, deterministic sampling by using an equivalent differential-equation formulation that preserves the final distribution.
  6. Prompt control comes from combining CLIP-style text embeddings with diffusion conditioning, then strengthening steering via classifier-free guidance (and optionally negative prompts).

Highlights

A diffusion model’s iterative loop can be seen as repeatedly feeding a transformer’s outputs back into the current noisy state, turning chaos into coherent motion and scenes.
DDPM’s per-step random noise is not a cosmetic detail—it counteracts mean collapse that otherwise yields blurry images.
DDIM removes stochastic sampling while keeping the same final distribution by translating the stochastic process into a deterministic one via differential-equation theory.
Classifier-free guidance works by amplifying the difference between conditioned and unconditioned denoising directions, making prompt-specific details emerge more strongly.
Negative prompts steer generation away from unwanted artifacts by subtracting the direction associated with those negative features.

Topics

  • Diffusion Models
  • Brownian Motion
  • CLIP Embeddings
  • DDPM vs DDIM
  • Classifier-Free Guidance

Mentioned

  • OpenAI
  • DALL·E 2
  • Stable Diffusion 2
  • WAN 2.1
  • Welch Labs
  • Stephen Welch
  • CLIP
  • DDPM
  • GPT-3
  • DDIM
  • Fokker–Planck
  • SDE
  • WAN