Continuous control with deep reinforcement learning

Q: What are representative results showing DDPG’s effectiveness?

In cartpole, DDPG achieves $R_{best,lowd}=1.115$ and $R_{best,pix}=1.138$. In hardCheetah, DDPG reaches $R_{av,lowd}=1.311$ and $R_{av,pix}=1.204$, with best pixel score $R_{best,pix}=1.431$.

Timothy Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, Daan Wierstra

arXiv (Cornell University)·2016·Computer Science·6,777 citations

6 min read

Read the full paper on arxiv

TL;DR

DDPG adapts deterministic policy gradients to continuous action spaces using an actor-critic architecture with a learned critic $Q (s, a)$ and deterministic actor $μ (s)$ .

Briefing Cornell Notes

Briefing

This paper asks whether deep reinforcement learning for continuous-action control can be made stable and effective without discretizing actions, and whether it can scale to high-dimensional physical control tasks using the same core learning algorithm and hyperparameters. This matters because many real control problems—robot locomotion, manipulation, and driving—require real-valued, multi-degree-of-freedom actions. Prior deep RL success (notably DQN) relied on discrete actions and on computing argmax over actions at every step, which becomes impractical in continuous domains.

The authors’ central contribution is an actor-critic, model-free algorithm they call Deep Deterministic Policy Gradient (DDPG). It adapts the deterministic policy gradient (DPG) idea to continuous action spaces, while importing two key stability mechanisms from DQN: (1) off-policy learning with a replay buffer to reduce sample correlations, and (2) target networks to stabilize temporal-difference targets. They further add batch normalization to handle differing feature scales across tasks and use temporally correlated exploration noise (Ornstein–Uhlenbeck) suited to physical systems with inertia.

Methodologically, the study is an empirical evaluation across a suite of simulated physics environments implemented in MuJoCo. The environments span classic control (cartpole variants, pendulum, cart) and more challenging multi-joint manipulation and contact-rich tasks (e.g., gripper, puck striking/canada, locomotion such as cheetah, hopper, hyq, walker2d), plus a non-physics baseline environment (TORCS). For each task, the authors train DDPG using both low-dimensional state inputs (e.g., joint angles/positions) and high-dimensional pixel observations from a fixed camera. For pixel inputs, they use action repeats: each chosen action is applied for 3 simulation steps while rendering each time, producing stacked RGB frames (downsampled to 64×64) so the agent can infer velocities from frame differences.

The algorithmic details are standard for DDPG: a deterministic actor (s| $θ^{μ}$ ) outputs continuous actions, and a critic approximates the action-value function $Q (s, a ∣ θ^{Q})$ . The critic is trained off-policy by minimizing a mean-squared Bellman error over minibatches sampled uniformly from a replay buffer of size $1 0^{6}$ . The target value uses slowly updated target networks $Q^{'}$ and $μ^{'}$ with soft updates $θ^{'} \leftarrow τ θ + (1 - τ) θ^{'}$ where $τ = 0.001$ . The actor is updated using the deterministic policy gradient via the chain rule, using $\nabla_{a} Q (s, a)$ evaluated at $a = μ (s)$ . Optimization uses Adam with learning rates $1 0^{- 4}$ for the actor and $1 0^{- 3}$ for the critic; the discount factor is $γ = 0.99$ ; the critic uses $L_{2}$ weight decay of $1 0^{- 2}$ . Network architectures differ for low-dimensional vs pixel inputs: low-dimensional models use two hidden layers (400 and 300 units), while pixel models use three convolutional layers followed by two fully connected layers (200 units each). Actions are bounded using a $tanh$ output layer. Training is run for at most 2.5 million environment steps per task; results are averaged over 5 replicas.

The paper’s key findings are that DDPG robustly solves more than 20 simulated physics tasks, often reaching performance competitive with (and sometimes exceeding) a strong model-based planning baseline (iLQG) that has full access to the simulator dynamics and derivatives. The authors report normalized returns where a naive random policy has mean score 0 and iLQG has mean score 1 (except TORCS, which uses raw reward). Across tasks, DDPG’s average and best observed scores are frequently positive and often near or above 1.

For example, in the cartpole swing-up task, DDPG achieves $R_{a v, l o w d} = 0.844$ and $R_{b es t, l o w d} = 1.115$ , while from pixels it achieves $R_{a v, p i x} = 0.482$ and $R_{b es t, p i x} = 1.138$ . In cartpoleBalance, DDPG reaches $R_{a v, l o w d} = 0.951$ and $R_{b es t, l o w d} = 1.000$ , and from pixels $R_{a v, p i x} = 0.335$ and $R_{b es t, p i x} = 0.996$ . In the harder hardCheetah task, DDPG attains $R_{a v, l o w d} = 1.311$ and $R_{b es t, l o w d} = 1.990$ , and from pixels $R_{a v, p i x} = 1.204$ and $R_{b es t, p i x} = 1.431$ , while the original DPG baseline with replay buffer and batch normalization (labeled $c n t r l$ ) has $R_{a v, c n t r l} = - 0.031$ and $R_{b es t, c n t r l} = 1.411$ . In locomotion walker2d, DDPG achieves $R_{a v, l o w d} = 0.705$ , $R_{b es t, l o w d} = 1.573$ , and from pixels $R_{a v, p i x} = 0.944$ , $R_{b es t, p i x} = 1.476$ . In manipulation-like tasks, performance is also strong: for gripper, DDPG yields $R_{a v, l o w d} = 0.655$ , $R_{b es t, l o w d} = 0.972$ , and from pixels $R_{a v, p i x} = 0.406$ , $R_{b es t, p i x} = 0.790$ . For fixedReacherSingle, DDPG reaches $R_{a v, l o w d} = 0.954$ , $R_{b es t, l o w d} = 1.000$ , and from pixels $R_{a v, p i x} = 0.827$ , $R_{b es t, p i x} = 0.995$ . The paper also includes ablation-style observations: removing the target network or batch normalization leads to very poor learning in many environments, and both additions are necessary for consistent performance.

The authors also examine critic value accuracy by comparing learned $Q$ estimates after training to true returns on test episodes. They report that in simpler tasks, DDPG’s value estimates are accurate without systematic bias; in harder tasks, estimates are worse, but the policy learning still succeeds.

A notable qualitative claim is that DDPG can learn end-to-end from raw pixels in many tasks, sometimes as fast as learning from low-dimensional state, plausibly due to action repeats simplifying dynamics and convolutional layers producing useful representations.

Limitations are acknowledged in the conclusion and are also implicit in the methodology. Like many model-free RL methods, DDPG requires a large number of training episodes/interaction steps to find solutions. The paper also notes that theoretical convergence guarantees are lost when using nonlinear function approximators. Additionally, performance variability across replicas is evident in the table (some tasks have low average but high best scores), suggesting sensitivity to initialization and exploration. Finally, while DDPG is compared to a strong planning baseline, the evaluation is in simulation; transfer to real robotics is not demonstrated here.

Practically, the results suggest that a relatively simple actor-critic framework—deterministic policy gradient plus replay buffer, target networks, and batch normalization—can serve as a general-purpose backbone for continuous control, including vision-based control. Researchers and engineers working on robotics, autonomous driving, and simulated manipulation should care because the paper provides a recipe that scales across many tasks with largely unchanged hyperparameters and architectures, and because it demonstrates that end-to-end pixel-to-action learning is feasible in a wide range of continuous control settings.

Overall, the paper’s core message is that stability techniques from deep Q-learning can be successfully transplanted into deterministic actor-critic learning for continuous actions, enabling robust performance across diverse high-dimensional control problems, with competitive results relative to model-based planners and with meaningful success from raw visual inputs.

Cornell Notes

The paper introduces DDPG, an off-policy actor-critic method for continuous control that combines deterministic policy gradients with DQN-style replay buffers and target networks, plus batch normalization and temporally correlated exploration. Experiments in many MuJoCo physics tasks show robust learning from both low-dimensional states and, in many cases, from raw pixels, often matching or exceeding an iLQG planning baseline.

What problem does the paper target in continuous control?

Learning policies for real-valued, high-dimensional action spaces without discretizing actions, which is impractical due to the curse of dimensionality and loss of action structure.

What is the core algorithmic approach of DDPG?

An actor-critic framework where a deterministic actor outputs continuous actions and a critic learns $Q (s, a)$ , with the actor updated using the deterministic policy gradient.

How does DDPG address instability from using neural networks in actor-critic learning?

It uses a replay buffer (off-policy minibatch learning) and target networks with soft updates $θ^{'} \leftarrow τ θ + (1 - τ) θ^{'}$ to stabilize temporal-difference targets.

What additional deep-learning technique is used to improve cross-task learning?

Batch normalization on the state input and layers of both the actor and critic, reducing sensitivity to differing feature scales.

How is exploration handled in continuous action domains?

By adding temporally correlated Ornstein–Uhlenbeck noise to the actor’s deterministic actions: $μ^{'} (s_{t}) = μ (s_{t} ∣ θ^{μ}) + N_{t}$ .

What environments and simulation platform are used for evaluation?

MuJoCo simulated physics tasks spanning cartpole variants, reaching, manipulation/gripper, contact-rich tasks (e.g., canada), and locomotion (cheetah, hopper, hyq, walker2d), plus TORCS.

What is the main performance comparison baseline?

A model-predictive planner using iLQG with full access to dynamics and derivatives; scores are normalized so random policy has mean 0 and iLQG has mean 1 (except TORCS).

What are representative results showing DDPG’s effectiveness?

In cartpole, DDPG achieves $R_{b es t, l o w d} = 1.115$ and $R_{b es t, p i x} = 1.138$ . In hardCheetah, DDPG reaches $R_{a v, l o w d} = 1.311$ and $R_{a v, p i x} = 1.204$ , with best pixel score $R_{b es t, p i x} = 1.431$ .

How does DDPG perform when learning from pixels versus low-dimensional state?

Pixel performance is often lower on average but can still be competitive; in some tasks, learning from pixels is nearly as fast as learning from low-dimensional state, aided by action repeats and convolutional representations.

Review Questions

Which two DQN-inspired mechanisms are essential in DDPG for stability, and how do they modify the learning targets and data distribution?
Explain how the deterministic policy gradient updates the actor without requiring action maximization over a continuous action space.
From the results table, pick one locomotion and one manipulation task and compare $R_{a v, l o w d}$ vs $R_{a v, p i x}$ . What does that tell you about the role of observation modality?
What role does batch normalization play in the paper’s claim of robustness across tasks with different state feature scales?
Why might the critic’s value estimates become less accurate on harder tasks, and why does the paper argue that policy learning can still succeed?

Key Points

1
DDPG adapts deterministic policy gradients to continuous action spaces using an actor-critic architecture with a learned critic $Q (s, a)$ and deterministic actor $μ (s)$ .
2
Stability comes from DQN-style replay buffers plus target networks with soft updates $τ = 0.001$ , which reduce divergence in temporal-difference learning.
3
Batch normalization is used to handle varying feature scales across environments, enabling more consistent hyperparameter transfer.
4
DDPG learns robustly across many MuJoCo continuous control tasks (more than 20), using the same core algorithm and largely fixed hyperparameters.
5
The method can learn end-to-end from raw pixels in many tasks; action repeats and convolutional encoders help infer dynamics from frame differences.
6
Performance is competitive with a strong model-based iLQG planner (normalized mean score 1), and in some tasks DDPG exceeds it (e.g., hardCheetah best pixel score $1.431$ ).
7
The paper acknowledges that nonlinear function approximation removes convergence guarantees and that model-free training still requires many interaction steps.

Highlights

“We present an actor-critic, model-free algorithm based on the deterministic policy gradient that can operate over continuous action spaces.”

“Using the same learning algorithm, network architecture and hyper-parameters, our algorithm robustly solves more than 20 simulated physics tasks.”

In hardCheetah, DDPG achieves Rav,lowd​=1.311 and Rav,pix​=1.204 (normalized so iLQG has mean 1).

“In particular, learning without a target network… is very poor in many environments.”

“Nearly all of the problems we looked at were solved within 2.5 million steps of experience… a factor of 20 fewer steps than DQN requires for good Atari solutions.”

Topics

Reinforcement learning
Deep reinforcement learning
Continuous control
Actor-critic methods
Deterministic policy gradients
Off-policy learning
Stability in deep RL
Vision-based control
Robotics simulation (MuJoCo)
Model-based vs model-free control

Mentioned

MuJoCo
TORCS
iLQG
Adam
batch normalization
Deep Q-Network (DQN)
Ornstein–Uhlenbeck process
ReLU/rectified nonlinearity
Convolutional neural networks
Timothy P. Lillicrap
Jonathan J. Hunt
Alexander Pritzel
Nicolas Heess
Tom Erez
Yuval Tassa
David Silver
Daan Wierstra
Volodymyr Mnih
Koray Kavukcuoglu
Diederik Kingma
Jimmy Ba
Sergey Levine
Pieter Abbeel
Marc Peter Deisenroth
Carl E. Rasmussen
John Schulman
Hado V. Hasselt
DDPG - Deep Deterministic Policy Gradient
DPG - Deterministic Policy Gradient
DQN - Deep Q-Network
iLQG - iterative Linear Quadratic Gaussian
TRPO - Trust Region Policy Optimization
GPS - Guided Policy Search
PILCO - Probabilistic Inference for Learning Control
SVG(0) - Stochastic Value Gradients with parameter 0
NFQCA - Neural Fitted Q-iteration with Continuous Actions
RCT - Randomized Controlled Trial (not used in this paper)