Investigating Model Based RL for Continuous Control | Alex Botev

TL;DR

A dynamics model can look excellent for one-step prediction yet still produce unusable rollouts because small errors grow over time.

Briefing

A learned dynamics model for continuous control can look nearly perfect at one-step prediction, yet still fail badly when it is rolled out over longer horizons. That gap, in which tiny single-step errors compound into large long-horizon drift, helps explain why many model-based methods lag behind model-free approaches in practice, even when the dynamics model appears accurate.

Alexander Botev frames the core tradeoff around what “learning a model” really buys. In model-free RL, agents learn a policy and/or value function from interaction data, relying on reward signals and bootstrapping. In model-based RL, agents also learn an internal dynamics model that predicts how the environment evolves from observations and actions. The ideal case is powerful: with a perfect model, tasks could be solved by simulation without further environment interaction. In reality, learning the model introduces costs and failure modes—especially when planning requires unrolling predictions far into the future.

Botev illustrates the compounding-error problem using a simple torque-controlled ball environment. A one-step dynamics model predicts the next state with extremely small error, appearing almost indistinguishable from the real system. But when the model is unrolled on its own over multiple steps, with each prediction fed back in as the next input, trajectories diverge substantially from reality. Training policies on these long-horizon imagined rollouts produces suboptimal behavior and can even fail outright in the real environment.
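
The effect can be reproduced with a toy example. The sketch below is not Botev's environment or model: it assumes a hypothetical point mass with state (position, velocity) driven by a torque-like input, and a "learned" model whose only flaw is a 1% error in the input gain. All constants and function names are made up for illustration.

```python
DT = 0.05  # integration step (assumed value)

def true_step(state, torque):
    """Ground-truth dynamics of the toy system."""
    pos, vel = state
    return (pos + DT * vel, vel + DT * torque)

def model_step(state, torque):
    """Hypothetical learned model: identical except for a 1% gain error."""
    pos, vel = state
    return (pos + DT * vel, vel + DT * 1.01 * torque)

def rollout(step_fn, state, torques):
    """Unroll step_fn from state, feeding each output back in as the next input."""
    states = [state]
    for u in torques:
        state = step_fn(state, u)
        states.append(state)
    return states

def error(a, b):
    """Largest absolute difference between two states."""
    return max(abs(x - y) for x, y in zip(a, b))

torques = [1.0] * 200
real = rollout(true_step, (0.0, 0.0), torques)
imagined = rollout(model_step, (0.0, 0.0), torques)

# One-step error: the model always starts from the true state.
one_step = max(error(model_step(s, u), true_step(s, u))
               for s, u in zip(real, torques))
# Rollout error: the model starts from its own previous prediction.
final_drift = error(imagined[-1], real[-1])

print(f"worst one-step error: {one_step:.6f}")
print(f"drift after {len(torques)} steps: {final_drift:.6f}")
```

With these made-up constants the velocity error grows linearly with the number of steps and the position error grows roughly quadratically, so the final drift is hundreds of times larger than the worst one-step error.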

Several practical difficulties follow. Not all environment details matter for a given task, so model capacity can be wasted on irrelevant factors. More importantly, long-horizon rollouts amplify errors, and uncertainty estimation for flexible neural dynamics models remains difficult. Finally, sample efficiency is not guaranteed: learning an accurate dynamics model may require so much data that it cancels out any gains from planning, leaving model-free methods competitive or better in many settings.

To address these issues, Botev’s internship research focuses on “value expansion” for actor-critic architectures. After learning a dynamics model, the method uses it to unroll trajectories from offline data and generate multi-step on-policy value targets. Instead of relying on standard bootstrapped targets that can be unstable, value expansion produces training targets over longer horizons, reducing the need for corrections such as importance sampling.
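
One way to picture the target construction is the sketch below. It is an illustration under stated assumptions, not the implementation from the talk: the policy is deterministic, the reward function is known, states are scalars, and `policy`, `model`, `reward_fn`, and `q_fn` are hypothetical stand-ins for learned networks. For each horizon h, the target sums h imagined discounted rewards and then bootstraps from the critic at the imagined state.

```python
def value_expansion_targets(state, policy, model, reward_fn, q_fn,
                            horizon, gamma=0.99):
    """Return [target_0, ..., target_horizon] for a single start state.

    target_h = sum_{t<h} gamma^t * r_t + gamma^h * Q(s_h, policy(s_h)),
    where s_1..s_h are imagined by the model under the current policy.
    """
    targets = []
    discounted_return = 0.0
    discount = 1.0
    for h in range(horizon + 1):
        action = policy(state)
        # Bootstrap from the critic after h imagined steps.
        targets.append(discounted_return + discount * q_fn(state, action))
        if h < horizon:
            discounted_return += discount * reward_fn(state, action)
            state = model(state, action)  # imagined next state
            discount *= gamma

    return targets

# Toy stand-ins (all hypothetical) chosen so the arithmetic is easy to follow.
policy = lambda s: -0.5 * s          # pushes the state toward zero
model = lambda s, a: s + a           # "learned" dynamics
reward_fn = lambda s, a: -(s ** 2)   # penalize distance from zero
q_fn = lambda s, a: -2.0 * s ** 2    # crude critic that is too pessimistic

targets = value_expansion_targets(2.0, policy, model, reward_fn, q_fn,
                                  horizon=3, gamma=0.5)
print(targets)
```

In this toy setup the true discounted return from the start state is about -4.57. The unexpanded critic target (horizon 0) is -8.0, and each extra imagined step replaces part of the critic's estimate with model-generated rewards, moving the target closer to the true value. That only helps as long as the model's rollouts stay accurate.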

Results come from standard Fetch robotics tasks in simulation: reaching, pick-and-place, pushing, and a harder sliding-table scenario. Against a carefully tuned deterministic policy gradient baseline (with replay and double Q-learning), value expansion with learned dynamics sometimes delivers up to five times better sample efficiency, though performance can be fragile to hyperparameters. Across experiments, several patterns stood out: a single dynamics model was insufficient; training dynamics models with multi-step losses (to improve consistency when feeding predictions back into the model) was necessary; and the benefits of value expansion appeared only when expanding beyond one or two steps.

A notable detail is the role of pessimism. When combining multi-horizon targets, taking a minimum across horizons—rather than averaging—improved performance in harder environments, echoing the idea behind reducing overestimation bias in double Q-learning. Overall, the work suggests model-based RL can help continuous control, but only when dynamics learning is robust to rollout errors and value targets are constructed in a way that stabilizes training over meaningful horizons.

Cornell Notes

Continuous-control model-based RL can be misleading: a dynamics model may predict the next state almost perfectly, yet long-horizon rollouts can drift far from reality due to compounding error. Botev’s research targets this by using “value expansion” in actor-critic methods. After learning a dynamics model, the approach unrolls it from offline data to create multi-step on-policy value targets, improving stability and sample efficiency compared with a tuned deterministic policy gradient baseline. Experiments on Fetch tasks show up to about five times better sample efficiency in some cases, but only when dynamics models are trained with multi-step losses and value expansion uses more than one or two steps. Pessimistically combining multi-horizon targets (taking a minimum) helps on harder tasks.

Why can a dynamics model that is nearly perfect at one-step prediction still fail in model-based RL?

Because errors compound when predictions are unrolled. In the torque-controlled ball example, one-step predictions are almost indistinguishable from the real environment, with tiny error. But when the model is rolled out independently over multiple steps, trajectories diverge significantly from reality. Training policies on these imagined long-horizon trajectories yields suboptimal behavior and can even fail when deployed in the real environment.

What is the main purpose of “value expansion” in actor-critic RL?

Value expansion uses a learned dynamics model to generate multi-step training targets that are closer to on-policy behavior. After collecting offline data, the method unrolls the dynamics model multiple steps to produce multi-horizon value targets, reducing reliance on standard bootstrapped targets that can be unstable. This also avoids some correction mechanisms such as importance sampling when constructing targets.

What training changes were necessary for dynamics models to make value expansion work?

Training dynamics models with multi-step losses was necessary, because it improves consistency when the model is fed its own predictions during rollout. In Botev’s experiments, a single-step training objective was not enough to beat the baseline: without multi-step training, the method collapsed to the same value estimates and did not improve sample efficiency.
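
A multi-step loss of this kind can be sketched as follows. This is a minimal illustration, not the loss used in the talk: it assumes scalar states, a squared-error metric, and a hypothetical `model(state, action)` callable. It only evaluates the loss, where a real training loop would differentiate it with respect to the model parameters.

```python
def multi_step_loss(model, states, actions, horizon):
    """Mean squared error over `horizon`-step rollouts.

    For every valid start index, the model is seeded with the real state
    and then fed its own predictions; each prediction is compared with the
    real state at the same time step.
    """
    total, count = 0.0, 0
    for start in range(len(actions) - horizon + 1):
        pred = states[start]
        for k in range(horizon):
            pred = model(pred, actions[start + k])  # feed prediction back in
            total += (pred - states[start + k + 1]) ** 2
            count += 1
    return total / count

# Toy data (hypothetical): the true system is s' = s + a.
actions = [1.0, 1.0, 1.0, 1.0]
states = [0.0, 1.0, 2.0, 3.0, 4.0]

# A biased model that overestimates the effect of the action.
biased_model = lambda s, a: s + 1.5 * a

single = multi_step_loss(biased_model, states, actions, horizon=1)
multi = multi_step_loss(biased_model, states, actions, horizon=3)
print(single, multi)
```

With `horizon=1` the loss sees only the per-step bias. With `horizon=3` the same bias accumulates along the rollout, so the loss penalizes exactly the kind of inconsistency that shows up when the model is unrolled for value expansion.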

How did value expansion perform on the Fetch robotics tasks, and what limited its reliability?

On Fetch tasks (moving to a location, pick-and-place, pushing, and sliding-table pushing), value expansion with learned dynamics sometimes outperformed a deterministic policy gradient baseline and could reach up to about five times better sample efficiency in some cases. However, results were fragile to hyperparameters: some tasks showed larger gains than others, and in at least one environment asymptotic behavior could be worse than the baseline.

Why did pessimism—taking a minimum across horizons—help?

When combining multi-horizon targets, taking the minimum across horizons improved performance in harder environments. The motivation is similar to double Q-learning: it reduces overestimation bias. Instead of averaging horizons (which can keep optimistic errors), pessimistic aggregation makes the targets more conservative when the model’s multi-step predictions become less reliable.
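
The difference between the two aggregation rules fits in a few lines. The function name and the example numbers below are hypothetical; the sketch only contrasts the two ways of combining per-horizon targets for one state.

```python
def aggregate_targets(targets, mode="min"):
    """Combine value targets computed at several expansion horizons."""
    if mode == "min":
        return min(targets)                  # pessimistic: most conservative target
    if mode == "mean":
        return sum(targets) / len(targets)   # averaging keeps optimistic outliers
    raise ValueError(f"unknown mode: {mode}")

# Made-up targets for one state at horizons 0..3; the third is an
# optimistic outlier of the kind an inaccurate rollout might produce.
horizon_targets = [1.0, 1.5, 4.0, 2.0]

pessimistic = aggregate_targets(horizon_targets, mode="min")
averaged = aggregate_targets(horizon_targets, mode="mean")
print(pessimistic, averaged)
```

Here the single optimistic outlier pulls the averaged target up to 2.125, while the minimum stays at 1.0.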

Review Questions

What specific mechanism turns a low one-step dynamics error into poor long-horizon control performance?
How does value expansion modify the target construction compared with standard bootstrapped actor-critic training?
Which two experimental design choices were repeatedly necessary for gains: one related to dynamics-model training and one related to the number of expansion steps?

Key Points

1. A dynamics model can look excellent for one-step prediction yet still produce unusable rollouts because small errors grow over time.
2. Model-based RL’s promise depends on robust long-horizon behavior, not just accurate next-state prediction.
3. Value expansion improves actor-critic training by using a learned dynamics model to generate multi-step on-policy value targets from offline data.
4. Dynamics models must be trained with multi-step losses (feeding predictions back into the model) to prevent rollout inconsistency from erasing sample-efficiency gains.
5. Value expansion benefits require expanding beyond one or two steps; short expansions often fail to improve over baselines.
6. Pessimistically aggregating multi-horizon targets (taking a minimum) can reduce overestimation bias and help on harder tasks.
7. Even with these improvements, results can be hyperparameter-sensitive, and asymptotic performance does not improve in every environment.

Highlights

One-step dynamics error can be tiny while long-horizon unrolls diverge sharply, explaining why model-based training on imagined trajectories can fail.

Value expansion creates multi-step on-policy value targets by unrolling a learned dynamics model from offline data, stabilizing actor-critic learning.

On Fetch tasks, value expansion sometimes delivers up to ~5× better sample efficiency, but gains depend on dynamics training and expansion horizon length.

Training dynamics with multi-step losses and using pessimistic (minimum) aggregation were key practical ingredients for stronger results.

Topics

Model-Based RL
Continuous Control
Value Expansion
Dynamics Modeling
Actor-Critic Targets

Mentioned

Alexander Botev
RL
Q-learning

Investigating Model Based RL for Continuous Control | Alex Botev | 2018 Summer Intern Open House