
Diffusion Policy Controlling Robots - Part 2

5 min read

Based on West Coast Machine Learning's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Diffusion policy generates a full horizon of future actions by denoising a noisy action sequence conditioned on recent robot observations.

Briefing

Diffusion policy for robot control turns a noisy guess of future actions into a smooth, goal-reaching trajectory by repeatedly denoising an action sequence—using robot observations as conditioning. Instead of producing a single next move, the model predicts an entire horizon of actions (e.g., 16 steps of a 2D end-effector position), then the robot executes only part of that sequence in a receding-horizon loop: take new observations, re-run the diffusion denoising process, and continue until the task goal is met. This matters because it reframes robot control as conditional generative modeling over action trajectories, giving a mechanism to enforce temporal consistency and reduce “jitter” that can happen when predicting one action at a time.
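
A minimal sketch of this receding-horizon loop, assuming a Gymnasium-style environment and a trained policy callable; the names and the 8-step execution window are illustrative assumptions, not details given in the summary:

```python
from collections import deque
import numpy as np

def run_episode(env, diffusion_policy, obs_horizon=2, action_horizon=8):
    """Receding-horizon control: predict a full plan, execute part of it, re-plan.

    `env` is assumed to follow the Gymnasium reset/step API, and `diffusion_policy`
    maps stacked recent observations to a (pred_horizon, action_dim) action plan.
    The 8-step action_horizon is an assumed value.
    """
    obs, _ = env.reset()
    obs_history = deque([obs] * obs_horizon, maxlen=obs_horizon)

    done = False
    while not done:
        # One diffusion call denoises an entire action sequence (e.g., 16 steps).
        plan = diffusion_policy(np.stack(obs_history))

        # Execute only the first action_horizon steps, then recondition on fresh observations.
        for action in plan[:action_horizon]:
            obs, reward, done, truncated, info = env.step(action)
            obs_history.append(obs)
            if done or truncated:
                done = True
                break
```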

The core setup uses a conditioned diffusion process that ingests observations—typically two recent state vectors in the push-T example—and refines a sequence of future actions over K diffusion iterations. The action space is position control: where the end-effector should be to move the block (the “T”) toward a target location. Visualizations in the discussion describe how early diffusion steps correspond to far-future, high-uncertainty action distributions (shown in warmer colors), while later steps concentrate into a coherent trajectory that achieves the goal.
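
For concreteness, the tensor shapes implied by this setup might look as follows; the 5-dimensional state (agent x/y, block x/y, block angle) is an assumption about the push-T observation, not something stated above:

```python
import torch

obs_horizon, pred_horizon = 2, 16   # as described: two observations in, 16 actions out
obs_dim, action_dim = 5, 2          # assumed: push-T state vector and 2D end-effector target
batch = 64

obs_cond = torch.randn(batch, obs_horizon * obs_dim)           # flattened conditioning vector
noisy_actions = torch.randn(batch, pred_horizon, action_dim)   # the sequence refined over K steps
```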

Conditioning is implemented in two architectural styles. In the CNN-based variant, observations are encoded with a U-Net backbone and injected via FiLM (feature-wise linear modulation): intermediate feature maps are scaled and shifted using learned functions of the observation embedding. In the Transformer-based variant, observations and actions are embedded and combined through masked attention, again iterating through diffusion steps to denoise the action sequence.
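
A minimal sketch of the FiLM idea in isolation, assuming the observation (and diffusion-step) embedding has already been computed; module and argument names are illustrative, not the paper's exact code:

```python
import torch
import torch.nn as nn

class FiLMBlock(nn.Module):
    """Scale and shift a 1D feature map using a learned function of the conditioning vector."""

    def __init__(self, channels: int, cond_dim: int):
        super().__init__()
        # A single linear layer produces a per-channel scale and shift.
        self.to_scale_shift = nn.Linear(cond_dim, 2 * channels)

    def forward(self, features: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # features: (batch, channels, horizon) from a 1D conv block in the U-Net
        # cond:     (batch, cond_dim) observation/diffusion-step embedding
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        return features * scale.unsqueeze(-1) + shift.unsqueeze(-1)

# Example: modulate a (batch=4, channels=64, horizon=16) feature map.
film = FiLMBlock(channels=64, cond_dim=32)
out = film(torch.randn(4, 64, 16), torch.randn(4, 32))
```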

A major conceptual thread during the session is what diffusion models are actually learning. Rather than directly outputting probabilities over actions, diffusion policy trains a network to predict the noise added to the action sequence, which is equivalent, up to a known scaling, to the score function (the gradient of the log probability of the noised action distribution). That distinction is tied to training stability: compared with an implicit energy-based policy (energy/unnormalized probability estimation), the score-based formulation avoids estimating the normalizing denominator, which can make optimization more stable. The tradeoff is inference cost: generating actions requires iterative denoising, though faster samplers and differential-equation solvers can reduce the number of steps.
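
In symbols, the connection is the standard denoising identity for a Gaussian forward process (a general diffusion result, not specific to this session): predicting the noise ε is, up to a known scale, the same as predicting the score, and neither requires the normalizing constant.

```latex
a_t = \sqrt{\bar{\alpha}_t}\, a_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)

\nabla_{a_t} \log p(a_t \mid o) = -\,\frac{\epsilon_\theta(a_t, t, o)}{\sqrt{1 - \bar{\alpha}_t}}
```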

The practical walkthrough then shifts to data and code. Training relies on a limited set of human demonstrations—about 200 manual episodes in the push-T setup—split into many training fragments. Each fragment uses two observations and asks the model to predict the next 16 actions. For real-world data, demonstrations are collected with a space mouse and recorded with cameras including an overview camera and a wrist camera for fine positioning. The discussion also clarifies how the simulation environment works (a 2D physics model with collision/overlap-based rewards) and how the training loop normalizes observations and actions to a -1 to +1 range, samples diffusion time steps and noise, and trains a conditional 1D U-Net with an MSE loss between predicted and true noise.
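
A minimal sketch of one training step along those lines, using the `diffusers` `DDPMScheduler` to add noise; `noise_pred_net` stands in for the conditional 1D U-Net, and its call signature is an assumption about the reference code:

```python
import torch
import torch.nn.functional as F
from diffusers import DDPMScheduler

scheduler = DDPMScheduler(num_train_timesteps=100)

def train_step(noise_pred_net, optimizer, actions, obs_cond):
    # actions:  (batch, pred_horizon, action_dim), already normalized to [-1, 1]
    # obs_cond: (batch, obs_horizon * obs_dim), also normalized
    noise = torch.randn_like(actions)
    timesteps = torch.randint(
        0, scheduler.config.num_train_timesteps, (actions.shape[0],), device=actions.device
    )

    # Forward-diffuse the clean action sequences to the sampled noise levels.
    noisy_actions = scheduler.add_noise(actions, noise, timesteps)

    # Predict the injected noise and regress it with an MSE loss.
    noise_pred = noise_pred_net(noisy_actions, timesteps, global_cond=obs_cond)
    loss = F.mse_loss(noise_pred, noise)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```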

Finally, the session highlights how diffusion’s “mode decisions” appear to happen early when noise is high: exploratory, multi-modal behavior can emerge at early diffusion steps, but later steps collapse into a committed trajectory, enabling smooth, goal-directed motion rather than oscillation between alternatives.

Cornell Notes

Diffusion policy for robots conditions a denoising model on recent observations to generate an entire action sequence over a fixed horizon (e.g., 16 steps). The model starts from noisy action vectors and iteratively refines them through K diffusion steps, then the robot executes part of the predicted sequence and repeats after collecting new observations (receding-horizon control). Conditioning is injected either via FiLM-modulated CNN/U-Net features or via Transformer embeddings with masked attention. A key conceptual point is that training targets noise prediction, closely related to the score function (gradient of log probability), which can be more stable than energy-based implicit policies that require handling a normalizing denominator. The main cost is iterative inference, partially mitigated by faster diffusion samplers and ODE/SDE-based solvers.

How does diffusion policy turn observations into robot actions over time?

It conditions a diffusion model on recent observations (in the push-T example, two state vectors) and generates a sequence of future actions for a fixed horizon (e.g., 16 steps). During inference, the action sequence begins as noise and is refined over K diffusion iterations by repeatedly predicting noise and applying a noise scheduler. The robot then executes only a subset of the actions (the “action horizon”), collects new observations, and re-runs diffusion until the goal is reached.
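
A minimal sketch of that inference loop, again using a `diffusers` scheduler; the network call matches the training sketch above and is an assumption rather than the exact code:

```python
import torch
from diffusers import DDPMScheduler

@torch.no_grad()
def sample_actions(noise_pred_net, scheduler, obs_cond, pred_horizon=16, action_dim=2, num_steps=100):
    # Start from pure Gaussian noise over the whole prediction horizon.
    actions = torch.randn(1, pred_horizon, action_dim, device=obs_cond.device)

    scheduler.set_timesteps(num_steps)
    for t in scheduler.timesteps:
        # Predict the noise at this diffusion step, conditioned on the observations.
        noise_pred = noise_pred_net(actions, t, global_cond=obs_cond)
        # Let the scheduler remove one step's worth of noise.
        actions = scheduler.step(noise_pred, t, actions).prev_sample

    # Actions were trained in the normalized [-1, 1] range.
    return actions.clamp(-1.0, 1.0)
```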

What are the two conditioning mechanisms discussed, and where do they enter the network?

In the CNN-based approach, observations are encoded and injected into a U-Net using FiLM (feature-wise linear modulation): intermediate feature maps are scaled and shifted using learned functions of the observation embedding. In the Transformer-based approach, observations and actions are embedded and combined with masked attention, with diffusion iteration steps controlling the denoising process. Both methods condition the denoising network on observation information throughout the K steps.
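
A minimal sketch of the Transformer-style variant, read generically as action tokens attending to embedded observation and diffusion-step tokens under a causal mask; this is an illustrative encoder-decoder reading, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class TransformerDenoiser(nn.Module):
    """Denoise action tokens while attending to observation and diffusion-step tokens."""

    def __init__(self, action_dim=2, obs_dim=5, d_model=128, n_heads=4, n_layers=4, num_diffusion_steps=100):
        super().__init__()
        self.action_in = nn.Linear(action_dim, d_model)
        self.obs_in = nn.Linear(obs_dim, d_model)
        self.step_emb = nn.Embedding(num_diffusion_steps, d_model)
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.action_out = nn.Linear(d_model, action_dim)

    def forward(self, noisy_actions, timestep, obs):
        # noisy_actions: (batch, pred_horizon, action_dim); obs: (batch, obs_horizon, obs_dim)
        tgt = self.action_in(noisy_actions)
        memory = torch.cat([self.step_emb(timestep)[:, None, :], self.obs_in(obs)], dim=1)
        # The causal mask keeps each action token from attending to later action tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(tgt.shape[1]).to(tgt.device)
        return self.action_out(self.decoder(tgt, memory, tgt_mask=mask))

# Example forward pass: batch of 4, 16 noisy actions, 2 observations, diffusion step 0.
net = TransformerDenoiser()
eps_pred = net(torch.randn(4, 16, 2), torch.zeros(4, dtype=torch.long), torch.randn(4, 2, 5))
```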

Why do score-based diffusion models often train more stably than implicit energy-based policies?

The discussion contrasts score-based diffusion (predicting noise/score information tied to gradients of log probability) with implicit energy-based models that estimate probabilities without directly modeling the normalizing constant. Avoiding explicit estimation of the denominator (normalizing constant) is presented as a reason diffusion training can be more stable. The session also notes that diffusion still requires iterative sampling at inference time.
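
Stated compactly: an implicit energy-based policy has to contend with the normalizing constant Z, while the score does not, because Z is independent of the action (a standard result, paraphrasing the discussion):

```latex
p_\theta(a \mid o) = \frac{e^{-E_\theta(o, a)}}{Z_\theta(o)}, \qquad Z_\theta(o) = \int e^{-E_\theta(o, a')} \, da'

\nabla_a \log p_\theta(a \mid o) = -\nabla_a E_\theta(o, a) \quad \text{(the $Z_\theta(o)$ term drops out)}
```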

What does the training loop optimize in the push-T code walkthrough?

Training normalizes observations and actions (to a -1 to +1 range), samples diffusion time steps and noise per batch entry, and forms a noisy action sequence. A conditional 1D U-Net predicts the noise given the noisy actions, the diffusion time step, and the observation conditioning. The loss is an MSE between the predicted noise and the true sampled noise, followed by backpropagation and an optimizer step.
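
A minimal sketch of the [-1, 1] normalization described here, using per-dimension min/max statistics from the demonstration data; the helper names are illustrative:

```python
import numpy as np

def get_stats(data: np.ndarray) -> dict:
    # data: (num_samples, dim); statistics are per dimension over the whole dataset.
    return {"min": data.min(axis=0), "max": data.max(axis=0)}

def normalize(data: np.ndarray, stats: dict) -> np.ndarray:
    # Map each dimension to [0, 1], then to [-1, 1].
    unit = (data - stats["min"]) / (stats["max"] - stats["min"])
    return unit * 2.0 - 1.0

def unnormalize(data: np.ndarray, stats: dict) -> np.ndarray:
    # Invert the mapping so predicted actions can be sent back to the environment.
    unit = (data + 1.0) / 2.0
    return unit * (stats["max"] - stats["min"]) + stats["min"]
```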

Why does predicting an action sequence help compared with predicting one action at a time?

The session argues that sequence prediction reduces oscillation/jitter. If a model predicted only one step at a time, it could alternate between left/right modes and “bounce” around. Predicting multiple actions in one denoising run encourages temporal consistency: once the early diffusion steps commit to a mode, later denoising refines a coherent trajectory rather than switching alternatives each step.

What does the session suggest about when diffusion makes its “mode decision”?

A grid-based visualization of the noise/score-like outputs suggests that early diffusion steps (high noise) can be multi-modal—allowing different directions/commitments—while later steps collapse into a single committed trajectory. The intuition is that large uncertainty early permits multiple plausible futures, but as noise decreases, the model performs fine adjustments around the chosen mode.

Review Questions

  1. In the receding-horizon loop, what fraction of the predicted action sequence is executed before re-running diffusion, and why?
  2. How does FiLM conditioning modify a U-Net’s intermediate features, and what information drives the scale/shift parameters?
  3. In the training objective, what are the roles of the sampled diffusion time step and the sampled noise, and what does the MSE loss compare?

Key Points

  1. Diffusion policy generates a full horizon of future actions by denoising a noisy action sequence conditioned on recent robot observations.

  2. Inference uses iterative refinement over K diffusion steps, then a receding-horizon loop repeatedly reconditions on fresh observations.

  3. CNN-based conditioning injects observation embeddings into a U-Net via FiLM (feature-wise linear modulation), while Transformer-based conditioning uses embedded tokens and masked attention.

  4. Training targets noise prediction with an MSE loss between predicted and sampled noise, which is closely related to score-based (log-probability gradient) learning.

  5. Score-based diffusion is framed as more stable than implicit energy-based policies because it avoids estimating the normalizing denominator.

  6. Iterative denoising makes inference slower, but faster samplers and ODE/SDE solvers can reduce the number of steps without changing the core score-based mechanism.

  7. The push-T example uses position control for a 2D end-effector to move a T-shaped block toward a fixed goal, with demonstrations split into many training fragments.

Highlights

  • Diffusion policy predicts an entire action sequence (e.g., 16 steps) in one inference call, then executes part of it and repeats, turning robot control into conditional trajectory generation.
  • FiLM conditioning lets observation embeddings scale and shift intermediate U-Net features, injecting state information throughout the denoising network.
  • A recurring theme: diffusion learns noise/score information (tied to the gradient of the log probability), which can be more stable to train than energy-based implicit probability estimation.
  • Mode selection appears to happen early, when noise is high; later denoising collapses into a single committed trajectory for smooth control.
  • The training loop normalizes observations/actions, samples diffusion time steps and noise, and trains a conditional 1D U-Net with MSE on the predicted noise.

Topics

  • Diffusion Policy
  • Robot Control
  • Score-Based Models
  • FiLM Conditioning
  • Conditional U-Net
  • Receding Horizon

Mentioned

  • MSE
  • U-Net
  • FiLM
  • SDE
  • ODE
  • DDIM