Generative Modelling | Sadhika Malladi | 2018 Summer Intern Open House

OpenAI · 6 min read

Based on OpenAI's video on YouTube. If you like this content, support the original creators by watching, liking, and subscribing.

TL;DR

World modeling aims to learn environment dynamics so agents can generalize and transfer skills with less data, which matters because experience collection is costly.

Briefing

Generative modeling is positioned as a practical way to learn the underlying structure of data—so a model trained on one set of examples (like dogs) can produce plausible, previously unseen samples. That same idea becomes especially valuable in reinforcement learning, where agents need to understand how actions change the world. Instead of learning purely from trial and error, “world modeling” aims to build an internal model of environment dynamics so an agent can plan and generalize with far less data—an outcome that matters because collecting experience in games or real systems is expensive.

In the world-modeling setup, the goal is to learn basic movement skills first—such as stepping forward, turning without falling, or reversing—so later tasks don’t require new, highly specific instructions. A concrete example uses a simple car racing game: the agent learns how to accelerate, slow down, and turn, then can handle a track it has never seen before by relying on learned interactions between its actions and the environment. The promise is sample-efficient, transferable reinforcement learning. But current model-based reinforcement learning often underperforms model-free methods because learned world models can be brittle: small reconstruction errors compound during rollout, creating artifacts that the model can’t correct once they start propagating.

A key technical issue is how frames are represented. The car racing model uses a variational autoencoder (VAE): frames are encoded through convolutional layers into latent variables (Z), decoded back into reconstructions, and trained by comparing reconstructions to the original frames. Those latents are also fed into an RNN to track time-dependent state, such as where obstacles are and what the game situation looks like. The approach works well for car racing because the environment has relatively few elements—mostly track geometry and a simple highway structure—so the latent space can focus on meaningful, task-relevant factors.
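
To make the pipeline concrete, here is a minimal PyTorch sketch of such a frame VAE; the 64x64 frame size, the 32-dimensional latent, and all layer shapes are illustrative assumptions rather than details given in the talk.

```python
# A minimal frame VAE sketch (all sizes are illustrative assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvVAE(nn.Module):
    def __init__(self, latent_dim: int = 32):
        super().__init__()
        # Convolutional encoder: 64x64x3 frame -> flat feature vector.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),    # -> 31x31
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),   # -> 14x14
            nn.Conv2d(64, 128, 4, stride=2), nn.ReLU(),  # -> 6x6
            nn.Conv2d(128, 256, 4, stride=2), nn.ReLU(), # -> 2x2
            nn.Flatten(),
        )
        self.fc_mu = nn.Linear(256 * 2 * 2, latent_dim)
        self.fc_logvar = nn.Linear(256 * 2 * 2, latent_dim)
        # Deconvolutional decoder: latent z -> reconstructed 64x64x3 frame.
        self.fc_dec = nn.Linear(latent_dim, 1024)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(1024, 128, 5, stride=2), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 5, stride=2), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 6, stride=2), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 6, stride=2), nn.Sigmoid(),
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I).
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        x_hat = self.decoder(self.fc_dec(z).view(-1, 1024, 1, 1))
        return x_hat, mu, logvar

def vae_loss(x, x_hat, mu, logvar):
    # Reconstruction term (compare to the original frame) plus a KL term
    # pulling the latent distribution toward the unit Gaussian prior.
    recon = F.mse_loss(x_hat, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```

The latents Z produced by such an encoder are what the text describes being fed into the RNN to track time-dependent state.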

The same approach struggles on Pong. After encoding and reconstructing, the model fails to preserve crucial objects like the ball and the other paddle, making it hard for the agent to learn where key elements are. The proposed fix is to pair the VAE-style encoder with a stronger decoder—specifically a PixelCNN conditioned on the time state—so latents are forced to represent obstacle-related information rather than generic background. With that conditioning, latents become more “time dependent,” and PixelCNN’s ability to generate high-resolution images is expected to improve generalization to richer games.

The discussion then shifts from world modeling to invertible flow models, which are highlighted for expressive latent spaces and efficient sampling. By learning a transformation between a simple prior distribution (random noise) and data (faces, cats, dogs), these models can generate realistic outputs by sampling noise and mapping it through an invertible network. Training maximizes the exact likelihood via the change-of-variables formula and the log-determinant of the transformation, but invertibility constraints reduce expressivity and can make training finicky. Architectures such as RealNVP and its refined variant (GLOW) use multi-scale designs, squeezing operations, activation normalization, invertible 1x1 convolutions, and coupling layers to manage these constraints. A practical improvement described is parameter sharing across coupling layers, cutting memory use dramatically while maintaining sample quality (measured via bits-per-dimension) and improving speed. The work also notes that invertible models can fail when they enter regions where invertibility breaks down, motivating ongoing efforts to improve flow model training and expressivity.

Cornell Notes

Generative modeling learns a distribution so it can produce plausible new samples, and that capability can power reinforcement learning through “world modeling.” In world modeling, a VAE encodes frames into latent variables (Z) and an RNN tracks time-dependent state; a decoder reconstructs frames so the latent space captures useful structure. This works better in simple environments like car racing than in Pong because reconstruction can erase critical objects (ball and paddle), and rollout errors can compound. A proposed improvement pairs the encoder with a PixelCNN decoder conditioned on time state, forcing latents to represent obstacle-relevant information. The talk also covers invertible flow models (RealNVP/GLOW), which map noise to data with efficient sampling and likelihood training, but require strict invertibility that can limit expressivity and complicate training.

Why does world modeling promise sample-efficient reinforcement learning, and what goes wrong in practice?

World modeling aims to learn how actions interact with the environment so an agent can reuse learned dynamics for new tasks. The hope is that once basic movement skills are learned (e.g., stepping, turning, reversing), the agent can generalize without needing new instructions for every start-to-goal pair. In practice, learned world models can be brittle: small reconstruction artifacts during frame generation can accumulate during multi-step rollouts, and the model has no mechanism to correct those errors once they start compounding. That brittleness helps explain why model-based approaches can lag behind model-free methods.
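
That compounding is easy to see in a toy sketch: once the rollout is closed-loop, the model consumes its own predictions, so any one-step error is reused at every later step. The step_model interface below is a generic assumption, not the talk's model.

```python
import torch

def closed_loop_rollout(step_model, z0, actions):
    """Roll a learned one-step dynamics model forward from a real latent z0.

    step_model: any callable mapping (z_t, a_t) -> predicted z_{t+1}.
    After the first step the model only ever sees its own imperfect
    predictions, so small errors accumulate with nothing to correct them.
    """
    z = z0
    trajectory = []
    for a in actions.unbind(dim=1):  # iterate over the time axis
        z = step_model(z, a)         # one-step prediction error enters here...
        trajectory.append(z)         # ...and is baked into every later step
    return torch.stack(trajectory, dim=1)
```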

How does the car racing world model use a VAE and an RNN, and why does it work better than on Pong?

Frames are encoded with a variational autoencoder: convolutional layers produce latent variables Z, then deconvolutions reconstruct the frame. Training compares reconstructions to the original frame to learn useful latent structure. The latents Z are also fed into an RNN to track time-dependent state such as the agent’s position and obstacle layout. Car racing works better because the scene has few elements (track and a simple highway structure), so the latent space can represent task-relevant factors. Pong fails because reconstruction after encoding can remove key objects—like the ball and the other paddle—so the agent can’t infer where the critical game elements are.
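
A minimal sketch of the recurrent part, assuming the latent z_t is simply concatenated with the action a_t and an LSTM carries the time-dependent state (the LSTM choice and all dimensions are illustrative assumptions):

```python
import torch
import torch.nn as nn

class LatentRNN(nn.Module):
    def __init__(self, latent_dim: int = 32, action_dim: int = 3, hidden: int = 256):
        super().__init__()
        # Each step consumes the current latent and action together.
        self.lstm = nn.LSTM(latent_dim + action_dim, hidden, batch_first=True)
        self.to_next_z = nn.Linear(hidden, latent_dim)  # predict z_{t+1}

    def forward(self, z_seq, a_seq, state=None):
        # z_seq: (batch, time, latent_dim); a_seq: (batch, time, action_dim).
        h, state = self.lstm(torch.cat([z_seq, a_seq], dim=-1), state)
        # The hidden sequence h plays the role of the "time-dependent
        # state"; the linear head predicts the next latent.
        return self.to_next_z(h), state
```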

What change is proposed to make latents more task-relevant in complex games?

The proposed fix is to pair the encoder with a stronger, conditional decoder. By using a PixelCNN decoder conditioned on the time state of the game, the model forces the latents to capture obstacle-related information rather than generic background properties. The PixelCNN then generates the detailed obstacle appearance from those latents. The underlying idea is that latents should encode “what matters for not tripping,” not the irrelevant specifics of which background surface is present.
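
The conditioning mechanism can be sketched generically: a masked (autoregressive) convolution whose activations are shifted by a projection of the time state. This is a generic conditional-PixelCNN block under assumed names and sizes, not the exact model from the talk.

```python
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    """Convolution that never reads the current pixel or its successors."""
    def __init__(self, mask_type, *args, **kwargs):
        super().__init__(*args, **kwargs)
        _, _, kh, kw = self.weight.shape
        mask = torch.ones_like(self.weight)
        # Type "A" also hides the center pixel; type "B" keeps it.
        mask[:, :, kh // 2, kw // 2 + int(mask_type == "B"):] = 0
        mask[:, :, kh // 2 + 1:] = 0
        self.register_buffer("mask", mask)

    def forward(self, x):
        self.weight.data *= self.mask  # enforce the autoregressive ordering
        return super().forward(x)

class ConditionalPixelCNNBlock(nn.Module):
    def __init__(self, channels: int = 64, state_dim: int = 256):
        super().__init__()
        self.conv = MaskedConv2d("B", channels, channels, 3, padding=1)
        self.cond = nn.Linear(state_dim, channels)  # project the time state

    def forward(self, x, h_t):
        # The RNN state h_t biases every spatial position, so background
        # detail can come from the state while latents carry obstacles.
        return torch.relu(self.conv(x) + self.cond(h_t)[:, :, None, None])
```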

What makes invertible flow models attractive for generative tasks?

Invertible flow models are attractive because they support efficient sampling: draw random noise from a prior distribution and map it through an invertible network to generate data. They also learn a rich latent space where interpolations between latent codes can produce realistic intermediate images (illustrated with a face-morphing example). Training is tied to likelihood maximization, using the statistical change-of-variables formula, which expresses log-likelihood in terms of the transformation and its log-determinant.
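
Written out, that change-of-variables identity is, for an invertible map f from data x to latent z = f(x) with prior p_Z:

```latex
\log p_X(x) = \log p_Z\bigl(f(x)\bigr)
            + \log \left| \det \frac{\partial f(x)}{\partial x} \right|
```

Training maximizes this exact log-likelihood; sampling runs the map backward, drawing z from p_Z and computing x = f^{-1}(z).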

Why do invertible flow models face expressivity and training challenges?

Invertibility constraints require specific architectural components (e.g., affine transformations, 1x1 convolutions, coupling layers) that are individually limited in expressivity. To achieve strong performance, models often become very large, increasing memory demands. Training can also be finicky: the model can end up in regions where invertibility is problematic, limiting what it can represent and making optimization unstable.
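
As one concrete instance of such a constrained component, here is a minimal sketch of a GLOW-style invertible 1x1 convolution (the initialization and shapes are illustrative). Its log-determinant is cheap to compute, but nothing prevents optimization from driving the weight matrix toward singularity, which is exactly the kind of invertibility breakdown described.

```python
import torch
import torch.nn as nn

class Invertible1x1Conv(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Start from a random rotation so the weight begins invertible.
        w = torch.linalg.qr(torch.randn(channels, channels))[0]
        self.weight = nn.Parameter(w)

    def forward(self, x):
        b, c, h, w_ = x.shape
        z = nn.functional.conv2d(x, self.weight[:, :, None, None])
        # A 1x1 conv applies the same matrix at every pixel, so its
        # log|det| is h*w copies of the matrix log-determinant.
        logdet = h * w_ * torch.slogdet(self.weight)[1]
        return z, logdet

    def inverse(self, z):
        w_inv = torch.inverse(self.weight)  # fails if weight went singular
        return nn.functional.conv2d(z, w_inv[:, :, None, None])
```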

How do RealNVP and GLOW use multi-scale and coupling layers, and what improvement is described?

RealNVP and GLOW use multi-scale architectures because operations like 1x1 convolutions capture features at particular scales. They apply squeezing operations to reshape dimensions, then perform flow steps that combine activation normalization, invertible 1x1 convolutions, and coupling layers. In coupling layers, inputs are split: one part is transformed by a deep network while the other part passes through unchanged, preserving invertibility. The improvement described is sharing parameters across the coupling layers of different flow steps, which reduces memory use substantially (to about 3% of the original) while maintaining sample quality (matching bits-per-dimension) and improving wall-clock speed (about 33% faster).
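
A minimal sketch of an affine coupling layer, plus the parameter-sharing idea: one scale-and-shift network built once and reused by every coupling layer. All sizes are illustrative assumptions, and a real flow would interleave permutations or 1x1 convolutions between steps so both halves eventually get transformed.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    def __init__(self, net: nn.Module):
        super().__init__()
        self.net = net  # may be shared across many coupling layers

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=1)               # split channels in half
        log_s, t = self.net(x1).chunk(2, dim=1)  # scale/shift from one half
        y2 = x2 * torch.exp(log_s) + t           # transform the other half
        logdet = log_s.flatten(1).sum(dim=1)     # exact log-determinant
        return torch.cat([x1, y2], dim=1), logdet

    def inverse(self, y):
        y1, y2 = y.chunk(2, dim=1)
        log_s, t = self.net(y1).chunk(2, dim=1)
        return torch.cat([y1, (y2 - t) * torch.exp(-log_s)], dim=1)

# Parameter sharing: instantiate the scale/shift network once and let
# every flow step's coupling layer reuse it.
channels = 8
shared = nn.Sequential(
    nn.Conv2d(channels // 2, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, channels, 3, padding=1),  # outputs log_s and t stacked
)
flow = nn.ModuleList([AffineCoupling(shared) for _ in range(8)])
```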

Review Questions

  1. In what way can reconstruction artifacts in a learned world model harm multi-step planning, and why is that especially problematic for model-based reinforcement learning?
  2. What specific failure mode appears when applying the VAE+RNN approach to Pong, and how does conditioning a PixelCNN decoder on time state address it?
  3. How do invertible flow models compute likelihood during training, and what architectural constraints are required to keep the transformation invertible?

Key Points

  1. World modeling aims to learn environment dynamics so agents can generalize and transfer skills with less data, which matters because experience collection is costly.

  2. Small frame-generation errors in learned world models can compound during rollout, preventing correction and reducing performance versus model-free methods.

  3. A VAE+RNN latent representation can work well in simple environments like car racing because the scene has few elements that the latent space can capture reliably.

  4. The same reconstruction-based approach can fail in Pong when crucial objects (ball and paddle) disappear after encoding/decoding, making state estimation unreliable.

  5. Conditioning a PixelCNN decoder on time state can force latents to encode obstacle-relevant, time-dependent information rather than generic background features.

  6. Invertible flow models generate by sampling noise from a prior and mapping it through an invertible network, enabling efficient sampling and likelihood-based training via change-of-variables.

  7. RealNVP/GLOW-style architectures manage invertibility using multi-scale squeezing and coupling layers, but strict invertibility constraints can limit expressivity and complicate training.

Highlights

  • Car racing succeeds because the environment’s structure is simple enough that latent reconstructions preserve task-relevant elements; Pong fails when encoding/decoding removes the ball and paddle.
  • Conditioning PixelCNN on the game’s time state is proposed as a way to make latents represent obstacles and their positions, not just background textures.
  • Invertible flow models can interpolate between latent codes while producing realistic intermediate images, reflecting a richly learned latent space.
  • Parameter sharing across coupling layers can drastically cut memory use (to ~3%) while preserving bits-per-dimension and improving speed (~33%).

Topics

  • Generative Modeling
  • World Modeling
  • Variational Autoencoders
  • PixelCNN
  • Invertible Flow Models

Mentioned

  • Satya
  • Sadhika Malladi
  • Prafulla
  • Neil deGrasse Tyson
  • VAE
  • RNN
  • Z
  • PixelCNN
  • GLOW
  • RealNVP