
A pathway to more efficient generative models | Will Grathwohl | 2018 Summer Intern Open House

OpenAI · 5 min read

Based on OpenAI's video on YouTube.

TL;DR

Invertible generative modeling becomes more flexible when continuous-time dynamics replace discrete flow steps.

Briefing

Invertible generative models can become more expressive and potentially more efficient by switching from discrete “flow steps” to a continuous-time formulation that replaces Jacobian log-determinants with an integral of divergence. In standard normalizing flows, each transformation must be invertible and the model must compute the log determinant of the Jacobian at every step—requirements that push architectures toward simpler, computationally friendly transformations. That simplicity forces models to stack many steps (as in Glow) to gain expressiveness, which can increase parameter count and compute.
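For reference, the discrete objective described here is the standard change-of-variables formula (notation assumed, not taken from the talk): with $z_0 = x$ and $z_k = f_k(z_{k-1})$,

$$\log p(x) = \log p(z_K) + \sum_{k=1}^{K} \log \left| \det \frac{\partial z_k}{\partial z_{k-1}} \right|,$$

which is why every layer $f_k$ must be invertible and must expose a cheap log-determinant.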

A continuous normalizing flow reframes the model as parameterizing a dynamical system. Instead of applying a sequence of discrete transformations from time 0 to a final time, the model treats the data as evolving under a learned vector field. The likelihood then follows from the change-of-variables formula: in the continuous limit, the sum of log-determinants becomes an integral over time of the divergence of the vector field. This shift matters because computing log-determinants of arbitrary Jacobians is expensive (typically O(n^3) once the Jacobian is formed), and there is no efficient unbiased estimator in general. Divergence, by contrast, admits an efficient unbiased estimator via automatic differentiation: sample a Gaussian “probe” vector, use automatic differentiation to compute a vector-Jacobian product with it, and take the inner product with the probe again; in expectation this recovers the trace of the Jacobian, which is exactly the divergence. That estimator can be plugged into the continuous likelihood calculation to yield an unbiased estimate of the log-likelihood, enabling training with standard backpropagation tools.
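As a concrete illustration, here is a minimal sketch of that probe-vector estimator in PyTorch (the toy vector field, dimensions, and sample count are assumptions for the example, not details from the talk):

```python
import torch

def divergence_estimate(f, z, n_samples=1):
    """Unbiased probe-vector estimate of div f(z) = Tr(df/dz).

    Samples Gaussian probes v and averages v^T (df/dz) v, using
    vector-Jacobian products so the full Jacobian is never formed.
    """
    z = z.requires_grad_(True)
    fz = f(z)
    est = 0.0
    for _ in range(n_samples):
        v = torch.randn_like(z)  # Gaussian probe vector
        # One reverse-mode pass gives the vector-Jacobian product v^T (df/dz).
        vjp, = torch.autograd.grad(fz, z, v, retain_graph=True, create_graph=True)
        est = est + (vjp * v).sum()  # v^T (df/dz) v
    return est / n_samples

# Usage on a toy linear field, where the Jacobian is just the weight matrix.
f = torch.nn.Linear(4, 4)
z = torch.randn(4)
print(divergence_estimate(f, z, n_samples=500))  # approx. trace of f.weight
print(torch.trace(f.weight))                     # exact value for comparison
```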

The tradeoff is computational. Discrete flows apply a known sequence of transformations; continuous flows require numerically integrating an ordinary differential equation (ODE) to transform samples and to evaluate likelihood terms. Training becomes harder because gradients must be backpropagated through the ODE solution. Work from the University of Toronto provides a method to compute the needed gradients by solving an augmented ODE system, leveraging decades of numerical methods for ODEs to make the approach tractable.
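As one way to make this concrete, the torchdiffeq library released by that group wraps the augmented-ODE gradient computation behind a drop-in solver. The sketch below (network shape, batch size, and loss are placeholder assumptions) shows a training step that backpropagates through the ODE solution:

```python
import torch
from torchdiffeq import odeint_adjoint as odeint  # gradients via an augmented ODE

class VectorField(torch.nn.Module):
    """Learned dynamics dz/dt = f(z, t); the ODE solve is the transformation."""
    def __init__(self, dim):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(dim + 1, 64), torch.nn.Tanh(),
            torch.nn.Linear(64, dim))

    def forward(self, t, z):
        # Condition the field on time by concatenating t to the state.
        t_col = t.expand(z.shape[0], 1)
        return self.net(torch.cat([z, t_col], dim=1))

func = VectorField(dim=2)
z0 = torch.randn(8, 2)            # batch of inputs
t = torch.tensor([0.0, 1.0])      # integrate from t=0 to t=1
zT = odeint(func, z0, t)[-1]      # numerical ODE integration
loss = zT.pow(2).mean()           # placeholder loss for illustration
loss.backward()                   # gradients flow through the solver
```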

Empirically, continuous normalizing flows have been compared against Glow and RealNVP. The reported results are competitive with RealNVP, and on some datasets continuous normalizing flows even outperform Glow, though they have not surpassed it across the board. The main bottleneck is training time: continuous models currently take too long to train, which limits how large they can be made. A visualization example shows a learned gradient field warping digit images through time toward a Gaussian-like distribution; after integrating forward, the model can integrate backward to generate samples. The architecture described uses a single neural network that, at each step of the integration, takes in the current image and the time value to parameterize the vector field, with the ODE integration performing the actual invertible transformation.
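Under the same assumptions as the sketch above, generation is just the reverse integration:

```python
# Sample from the Gaussian-like base distribution at t=1, then run the
# same learned vector field backward to t=0 to get a data-space sample.
z1 = torch.randn(8, 2)
x0 = odeint(func, z1, torch.tensor([1.0, 0.0]))[-1]
```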

Overall, the approach offers a clear path to more flexible invertible generative models: use divergence-based likelihood estimation to avoid Jacobian log-determinant bottlenecks, then rely on ODE solvers and augmented-system gradient methods to keep training feasible. Remaining issues center on making training fast enough to scale model capacity and close the gap with the strongest discrete-flow baselines.

Cornell Notes

Continuous normalizing flows recast invertible generative modeling as a continuous-time dynamical system. The key technical change replaces the discrete sum of Jacobian log-determinants with an integral over time of the vector field’s divergence. Divergence can be estimated efficiently and without bias using Gaussian probe vectors with automatic differentiation, avoiding the expensive Jacobian log-determinant computation. The remaining challenge is training: it requires numerically integrating an ODE and backpropagating through the ODE solution. An augmented ODE method from the University of Toronto enables gradient computation, making standard backprop feasible. Reported experiments show results competitive with RealNVP and sometimes better than Glow, with training speed as the main limiter.

Why do standard normalizing flows often rely on simpler transformations?

Invertibility is mandatory, and the model must compute the log determinant of the Jacobian at each step. Those constraints are easiest to satisfy with architectures where the Jacobian determinant is tractable, so models like Glow use relatively simple transformations and then stack many of them to achieve expressiveness.
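For intuition, here is a minimal sketch of one such tractable building block, an affine coupling layer in the RealNVP style (layer sizes and the conditioning network are illustrative assumptions):

```python
import torch

class AffineCoupling(torch.nn.Module):
    """Leaves half the dimensions fixed and affinely transforms the rest.

    The Jacobian is triangular, so its log-determinant is a cheap sum,
    and inversion never requires inverting the conditioning network.
    """
    def __init__(self, dim):
        super().__init__()
        half = dim // 2
        self.net = torch.nn.Sequential(
            torch.nn.Linear(half, 64), torch.nn.ReLU(),
            torch.nn.Linear(64, 2 * half))

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=1)
        log_s, t = self.net(x1).chunk(2, dim=1)
        y2 = x2 * log_s.exp() + t       # elementwise affine transform
        log_det = log_s.sum(dim=1)      # log|det J| in O(n), no matrix algebra
        return torch.cat([x1, y2], dim=1), log_det

    def inverse(self, y):
        y1, y2 = y.chunk(2, dim=1)
        log_s, t = self.net(y1).chunk(2, dim=1)
        x2 = (y2 - t) * (-log_s).exp()  # exact inverse of the forward pass
        return torch.cat([y1, x2], dim=1)
```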

What changes when moving from discrete flows to continuous-time dynamics?

A discrete flow applies a sequence of transformations from time 0 to a final time. The continuous formulation treats the data as evolving under a learned vector field, integrating an ODE from the initial time to the final time. In the likelihood calculation, the discrete sum of log determinants becomes an integral of the divergence of the vector field over time.
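Written out with assumed notation: if $\frac{dz}{dt} = f(z(t), t)$ with $z(t_0) = x$, the instantaneous change-of-variables result gives

$$\frac{\partial \log p(z(t))}{\partial t} = -\operatorname{div} f(z(t), t), \qquad \log p(x) = \log p(z(t_1)) + \int_{t_0}^{t_1} \operatorname{div} f(z(t), t)\, dt,$$

so a single integral of the divergence replaces the per-layer log-determinants.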

Why is divergence estimation computationally easier than Jacobian log-determinants?

For an arbitrary function from R^n to R^n, computing the log determinant of the Jacobian typically requires forming or factorizing the Jacobian, which scales like n^3, and there is no efficient unbiased estimator in general. Divergence corresponds to a trace, which can be estimated without bias by sampling a Gaussian probe vector and using automatic differentiation to compute a vector-Jacobian product; the expectation over probe vectors recovers the trace.
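The identity behind this estimator is a standard trace-estimation result:

$$\operatorname{div} f(z) = \operatorname{Tr}\!\left(\frac{\partial f}{\partial z}\right) = \mathbb{E}_{v \sim \mathcal{N}(0, I)}\!\left[v^{\top} \frac{\partial f}{\partial z} v\right],$$

where $v^{\top} \frac{\partial f}{\partial z}$ is a single vector-Jacobian product, available from reverse-mode automatic differentiation at roughly the cost of one backward pass.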

How does training work when the model requires solving an ODE?

Training needs gradients of the loss with respect to both inputs and parameters, but the quantities entering the loss come from numerically integrating an ODE, so gradients must be backpropagated through the ODE solver. The approach builds on a University of Toronto method that computes the required gradients by solving an augmented system of ODEs, enabling backpropagation through the continuous dynamics.
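Concretely, the augmented system takes the following form in the adjoint formulation from that line of work (notation assumed): the adjoint state $a(t) = \partial L / \partial z(t)$ obeys its own ODE, and the parameter gradient is an integral along the trajectory,

$$\frac{da(t)}{dt} = -a(t)^{\top} \frac{\partial f(z(t), t, \theta)}{\partial z}, \qquad \frac{dL}{d\theta} = -\int_{t_1}^{t_0} a(t)^{\top} \frac{\partial f(z(t), t, \theta)}{\partial \theta}\, dt,$$

all of which can be solved backward in time with the same numerical ODE machinery.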

What empirical results were reported against Glow and RealNVP?

Continuous normalizing flows have not fully surpassed Glow yet, largely because they currently take too long to train, which limits model size. However, they achieve results competitive with RealNVP, and on some datasets they outperform Glow. The reported visualization shows digit images being warped by an integrated gradient field toward a Gaussian-like target distribution, then reversed to generate samples.

Review Questions

  1. How does the continuous-time change-of-variables formula transform the likelihood computation compared with discrete normalizing flows?
  2. What role do Gaussian probe vectors play in estimating divergence, and why does this avoid the Jacobian log-determinant bottleneck?
  3. What makes gradient-based training harder for continuous normalizing flows, and how does the augmented ODE method address it?

Key Points

  1. Invertible generative modeling becomes more flexible when continuous-time dynamics replace discrete flow steps.

  2. Continuous normalizing flows replace the sum of Jacobian log-determinants with an integral of the vector field’s divergence.

  3. Divergence admits an efficient unbiased estimator using Gaussian probe vectors and automatic differentiation, unlike Jacobian log-determinants.

  4. Training requires numerical ODE integration and backpropagation through the ODE solution, which is computationally demanding.

  5. An augmented ODE gradient method from the University of Toronto enables tractable gradient computation for continuous flows.

  6. Reported experiments show competitive performance with RealNVP and sometimes better-than-Glow results, with training speed as the main remaining limitation.

Highlights

  • The likelihood term shifts from a discrete sum of Jacobian log-determinants to a continuous integral of divergence, changing both the math and the computational bottlenecks.
  • Divergence can be estimated unbiasedly via Gaussian probe vectors using automatic differentiation, avoiding expensive Jacobian log-determinant computation.
  • Backpropagation through an ODE solution is handled by solving an augmented ODE system, making continuous flows trainable with standard autodiff workflows.
  • Continuous normalizing flows can warp digit images through an integrated learned gradient field into a Gaussian-like distribution, then reverse the process to generate samples.

Topics

  • Continuous Normalizing Flows
  • Divergence Estimation
  • ODE-Based Invertible Models
  • Augmented ODE Gradients
  • Generative Model Likelihood

Mentioned

  • Will Grathwohl
  • ODE
  • Divergence