
Large-Scale Study of Curiosity-Driven Learning

Yuri Burda, Harri Edwards, Deepak Pathak, Amos Storkey, Trevor Darrell, Alexei A. Efros
arXiv (2018) · ICLR 2019 · Machine Learning
7 min read

Read the full paper on arXiv: arXiv:1808.04355

TL;DR

The paper tests whether RL can succeed with only intrinsic curiosity reward (prediction error) and no extrinsic reward or done signal across 54 benchmarks.

Briefing

This paper asks a simple but largely unanswered question in reinforcement learning (RL): can an agent learn useful behavior when it receives only intrinsic reward from curiosity—specifically, a prediction-error (surprisal) signal—without any extrinsic reward or end-of-episode (“done”) signal? The question matters because most RL success depends on carefully engineered external rewards that are dense and well-shaped, yet designing such rewards does not scale to new environments. If curiosity alone can drive competent exploration and skill acquisition, then reward engineering could be reduced to selecting environments rather than specifying detailed reward functions.

The authors conduct what they describe as the first large-scale empirical study of purely curiosity-driven learning across 54 standard benchmark environments, spanning Atari games, Super Mario Bros., Unity-based 3D navigation/maze tasks, Roboschool physics tasks, and two-player Pong. The broader significance is twofold. First, it provides evidence that intrinsic objectives can substitute for extrinsic reward in many human-designed environments, suggesting that extrinsic scoring systems may often correlate with “novelty-seeking” behavior. Second, it investigates how the curiosity mechanism depends on representation learning: curiosity requires learning a forward dynamics model in some feature space, and the choice of feature space affects both learning stability and generalization.

Methodologically, the paper uses a dynamics-based curiosity formulation. For an agent observing state (or observation) x_t, taking action a_t, and transitioning to x_{t+1}, they embed observations via φ(x) and train a forward dynamics model f to predict the next embedding φ(x_{t+1}) conditioned on φ(x_t) and a_t. The intrinsic reward is defined as prediction error/surprisal, implemented as a mean-squared error corresponding to a fixed-variance Gaussian density: r_t = −log p(φ(x_{t+1}) | x_t, a_t), operationalized as r_t = ‖f(x_t, a_t) − φ(x_{t+1})‖² (up to constants). The policy and curiosity dynamics are trained jointly using PPO (Proximal Policy Optimization), with multiple practical stabilizers: reward normalization by a running estimate of the standard deviation of discounted returns, advantage normalization, observation normalization from a short random rollout, and increased parallelism (typically 128 parallel actors; 2048 for the large-scale Mario run). They also note that they remove the “done” signal in Atari to avoid reward leakage through episode termination.
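To make this concrete, here is a minimal PyTorch sketch of the curiosity reward and the running-std reward normalizer described above. All names (ForwardDynamics, intrinsic_reward, RewardNormalizer) are illustrative assumptions, not the authors’ released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ForwardDynamics(nn.Module):
    """Predicts the next-state embedding phi(x_{t+1}) from phi(x_t) and a_t."""
    def __init__(self, feat_dim: int, n_actions: int, hidden: int = 512):
        super().__init__()
        self.n_actions = n_actions
        self.net = nn.Sequential(
            nn.Linear(feat_dim + n_actions, hidden), nn.ReLU(),
            nn.Linear(hidden, feat_dim),
        )

    def forward(self, phi_t: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        a = F.one_hot(action, self.n_actions).float()
        return self.net(torch.cat([phi_t, a], dim=-1))

def intrinsic_reward(fwd: ForwardDynamics, phi_t, action, phi_next):
    """Surprisal reward: mean squared prediction error in feature space,
    i.e. -log p(phi(x_{t+1}) | x_t, a_t) up to constants under a
    fixed-variance Gaussian."""
    with torch.no_grad():
        pred = fwd(phi_t, action)
    return ((pred - phi_next) ** 2).mean(dim=-1)

class RewardNormalizer:
    """Divides rewards by a running estimate of the standard deviation of
    the discounted return (one of the stabilizers described above)."""
    def __init__(self, gamma: float = 0.99, eps: float = 1e-8):
        self.gamma, self.eps = gamma, eps
        self.ret = 0.0                             # running discounted return
        self.n, self.mean, self.m2 = 0, 0.0, 0.0  # Welford accumulators

    def __call__(self, r: float) -> float:
        self.ret = self.gamma * self.ret + r
        self.n += 1
        delta = self.ret - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (self.ret - self.mean)
        if self.n < 2:
            return r
        std = (self.m2 / (self.n - 1)) ** 0.5
        return r / (std + self.eps)
```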

A central experimental axis is the feature space used for the forward model. They compare: (1) raw pixels (the identity embedding φ(x) = x), (2) random features (a fixed, randomly initialized convolutional embedding network), (3) inverse dynamics features (IDF; features learned by training an embedding network to predict the action from consecutive states), and (4) VAE features (the latent mean of a variational autoencoder). Their qualitative and quantitative results show that pixels perform poorly, while random features and IDF are strong baselines. Specifically, on 8 representative Atari games (Figure 2), raw pixels “does not work well across any environment,” VAE is “either same or worse than random and inverse dynamics,” and IDF is better than random features in 55% of Atari games. Across the full Atari suite, they report that an IDF-curious agent collects more extrinsic reward than a random agent in 75% of games, while an RF-curious agent does so in 70%; IDF beats RF in 55% of games. Importantly, these extrinsic returns are used only for evaluation, not training, so the intrinsic objective remains the only learning signal.
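For illustration, here is how two of these feature spaces might be set up in PyTorch. The Atari-style CNN architecture is an assumption made for the sketch, not the paper’s exact network.

```python
import torch
import torch.nn as nn

def make_embedding_cnn(feat_dim: int = 512) -> nn.Sequential:
    """Small Atari-style CNN mapping an 84x84 grayscale frame to features."""
    return nn.Sequential(
        nn.Conv2d(1, 32, 8, stride=4), nn.ReLU(),
        nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
        nn.Conv2d(64, 64, 3, stride=1), nn.ReLU(),
        nn.Flatten(),
        nn.Linear(64 * 7 * 7, feat_dim),
    )

# (2) Random features: the same architecture, frozen at initialization.
rf_embed = make_embedding_cnn()
for p in rf_embed.parameters():
    p.requires_grad_(False)

# (3) Inverse dynamics features: the embedding is trained through a head
# that predicts a_t from (phi(x_t), phi(x_{t+1})).
class InverseDynamicsHead(nn.Module):
    def __init__(self, feat_dim: int, n_actions: int):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(2 * feat_dim, 256), nn.ReLU(),
            nn.Linear(256, n_actions),
        )

    def forward(self, phi_t, phi_next):
        return self.classifier(torch.cat([phi_t, phi_next], dim=-1))

# An IDF training step would minimize cross-entropy on the true action:
#   logits = idf_head(embed(x_t), embed(x_next))
#   loss = nn.functional.cross_entropy(logits, action)
```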

The paper also studies generalization. In Super Mario Bros., they pre-train on Level 1-1 using curiosity only, then fine-tune on novel levels (Level 1-2 and Level 1-3) using curiosity only. They find strong transfer to Level 1-2 for both RF and IDF, but weaker transfer to Level 1-3, which differs more drastically (a day-to-night color scheme shift). Their interpretation is that learned IDF features generalize better when the visual distribution shifts substantially, while random features may suffice when global statistics match.

They further explore how training stability and compute affect outcomes. In a large-scale Mario experiment, they increase the number of parallel environment threads from 128 to 2048. The result is a substantial performance jump: with the larger batch/parallelism, the agent discovers 11 different Mario levels, finds secret rooms, and defeats bosses, indicating that curiosity-driven learning benefits from stronger optimization stability in the underlying PPO training.
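As a sketch of what “more parallel actors” means in practice, here is the idea expressed with the modern gymnasium vector API (an assumption; the API and the ALE environment ID postdate the paper, and ALE/Breakout-v5 requires the ale-py package):

```python
import gymnasium as gym

num_envs = 128      # the paper's baseline configuration
# num_envs = 2048   # the large-scale Mario configuration

# Each PPO batch is collected from num_envs environment copies stepping
# in parallel; larger batches stabilize the policy-gradient updates.
envs = gym.vector.AsyncVectorEnv(
    [lambda: gym.make("ALE/Breakout-v5") for _ in range(num_envs)]
)
obs, info = envs.reset(seed=0)
```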

Beyond pure exploration, the authors test whether curiosity helps when extrinsic reward is sparse or terminal. In a Unity 3D maze with 9 rooms and only a terminal reward of +1 at the goal, extrinsic-only PPO never finds the goal in any trial (so it receives no meaningful learning gradient), while PPO with intrinsic curiosity converges to reaching the goal consistently. They also report preliminary Atari experiments on sparse-reward games: in 4 out of 5 selected games, adding curiosity improves performance (details deferred to an appendix table). They emphasize that these reward-combination experiments are not the paper’s main focus and use a fixed intrinsic coefficient (0.01) without tuning.
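The reward combination itself is a one-liner; a sketch using the fixed coefficient reported above (function name is illustrative):

```python
def combined_reward(r_extrinsic: float, r_intrinsic: float,
                    coef: float = 0.01) -> float:
    """Sparse extrinsic reward plus a small, untuned intrinsic bonus
    (the fixed 0.01 coefficient mentioned above)."""
    return r_extrinsic + coef * r_intrinsic
```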

Finally, they highlight a key limitation of prediction-error curiosity in stochastic settings. If the environment’s unpredictability is driven by stochasticity that the agent cannot control (or that the agent itself can induce), then the agent can be attracted to entropy rather than progress. They demonstrate this with a “noisy-TV” style manipulation: adding a local source of randomness (a TV that changes channels) to the Unity maze. They observe that curiosity learning slows dramatically when such stochasticity is present, and while agents may eventually recover and learn the extrinsic goal given enough time, the attraction to noise is a real failure mode.
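A minimal way to reproduce the flavor of this manipulation in a pixel-based environment is an observation wrapper that injects fresh noise every step. This is a sketch of the idea, not the paper’s Unity implementation; the patch location and size are arbitrary.

```python
import numpy as np
import gymnasium as gym

class NoisyTVWrapper(gym.ObservationWrapper):
    """Paste a patch of fresh random pixels into every observation.
    A prediction-error agent can never model the patch, so it acts
    as a persistent source of intrinsic reward (the noisy-TV trap)."""
    def __init__(self, env, patch_size: int = 16, seed: int = 0):
        super().__init__(env)
        self.patch_size = patch_size
        self.rng = np.random.default_rng(seed)

    def observation(self, obs):
        obs = obs.copy()
        s = self.patch_size
        # A random "channel change" in the top-left corner of the frame.
        obs[:s, :s] = self.rng.integers(0, 256, size=obs[:s, :s].shape,
                                        dtype=obs.dtype)
        return obs
```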

Limitations are implied by the methodology and explicitly discussed in the discussion section. The approach relies on learning a forward dynamics model; if the model class is insufficient, the environment is only partially observable, or the environment is stochastic in ways that create high-entropy transitions, prediction-error rewards can misalign with task progress. Additionally, feature-learning methods like VAE and IDF introduce non-stationarity because the embedding changes as auxiliary models train; the authors mitigate this by using random features (which are stable) and by carefully normalizing inputs and rewards. Another limitation is that the study is empirical and benchmark-focused: while 54 environments are diverse, the results may not generalize to environments with different reward structures, observation modalities, or stochasticity patterns.

Practically, the paper suggests that curiosity-driven exploration can be used as a default intrinsic objective when reward engineering is expensive, and that many standard RL benchmarks may already be “curriculum-like” in ways that align with novelty-seeking. Researchers and practitioners in RL should care because this work provides (i) a scalable recipe for reward-free training using PPO plus dynamics-based curiosity, (ii) evidence that random-feature curiosity is a surprisingly strong and stable baseline, and (iii) concrete warnings about stochasticity-driven exploitation. Teams building agents for new domains may use curiosity pretraining on unlabeled environments and then fine-tune with sparse extrinsic signals, potentially reducing the need for dense reward shaping.

Cornell Notes

The paper performs a large-scale study of reinforcement learning driven solely by dynamics-based curiosity (prediction error) with no extrinsic reward or end-of-episode signal across 54 benchmarks. It shows that curiosity alone often yields meaningful task performance, that random features are a strong and stable representation choice, and that learned inverse-dynamics features can generalize better to novel levels. It also demonstrates a key failure mode: prediction-error curiosity can be misled by stochasticity (e.g., a noisy TV).

What is the core research question of the paper?

Can an RL agent learn useful behaviors using only intrinsic curiosity reward (prediction error) without any extrinsic reward or done signal, and how do representation choices affect performance and generalization?

What intrinsic reward formulation is used?

Dynamics-based curiosity: train a forward model f to predict the next embedding φ(x_{t+1}) from φ(x_t) and a_t, and use prediction error/surprisal as reward, implemented as r_t = ‖f(x_t, a_t) − φ(x_{t+1})‖² (the negative log-density of a fixed-variance Gaussian, up to constants).

What study design and environments are used?

A large-scale empirical study across 54 environments: 48 Atari games, Super Mario Bros., Roboschool (Ant and Juggling), two-player Pong, and Unity mazes/navigation tasks.

How do the authors handle the done signal in Atari?

They remove the end-of-episode signal to prevent reward leakage (e.g., exploiting artificial reward tied to survival/death), ensuring exploration gains are not driven by episode termination.
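A sketch of this “infinite horizon” trick using the gymnasium wrapper API (an assumption; the paper’s own implementation is not shown): reset internally, but never expose the done flag to the agent.

```python
import gymnasium as gym

class HiddenDoneWrapper(gym.Wrapper):
    """Resets the environment on termination but always reports done=False,
    so episode boundaries (e.g., death) cannot leak into the reward signal."""
    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        if terminated or truncated:
            obs, info = self.env.reset()
        return obs, reward, False, False, info
```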

Which feature spaces are compared for curiosity?

Pixels, random features (fixed random embedding network), inverse dynamics features (IDF), and VAE latent features.

What is the main result on Atari performance without extrinsic reward?

Curiosity-only agents often achieve higher extrinsic evaluation returns than a random agent: IDF beats random in 75% of Atari games and RF beats random in 70%; IDF beats RF in 55%.

What happens when using pixels or VAE features?

Pixels perform poorly across environments, and VAE features are either similar to random or worse than random/IDF; the paper reports that curiosity with pixels “does not work well” and VAE is not consistently better.

How does curiosity generalize in Super Mario Bros.?

Pretraining on Level 1-1 and fine-tuning on novel levels shows strong transfer to Level 1-2 for both RF and IDF, but weaker transfer to Level 1-3 (day-to-night shift); IDF transfers better in the harder shift scenario.

What does the paper find about sparse/terminal extrinsic rewards?

In a Unity 9-room maze with only terminal reward, extrinsic-only PPO never reaches the goal, while PPO with intrinsic curiosity converges to reaching the goal consistently; preliminary Atari sparse-reward tests show curiosity helps in 4/5 games.

What limitation do the authors demonstrate for prediction-error curiosity?

In stochastic setups, the agent can be attracted to entropy rather than progress; adding a noisy-TV source of randomness drastically slows learning and can misdirect exploration.

Review Questions

  1. Why is removing the done signal important for interpreting curiosity-only learning in Atari?

  2. What properties of the feature embedding do the authors argue are desirable, and how do random features satisfy them?

  3. How do the authors empirically compare representation learning methods (RF vs IDF vs VAE vs pixels) and what do the percentages on Atari imply?

  4. What evidence supports the claim that learned features generalize better than random features in Mario, and what specific level shift causes weaker transfer?

  5. Explain the noisy-TV failure mode: under what conditions does prediction-error curiosity become misaligned with task progress?

Key Points

  1. The paper tests whether RL can succeed with only intrinsic curiosity reward (prediction error) and no extrinsic reward or done signal across 54 benchmarks.

  2. Curiosity-only agents often achieve non-trivial task performance; on Atari, IDF beats random in 75% of games and RF beats random in 70%.

  3. Pixels-based curiosity performs poorly, while random features and inverse-dynamics features are strong; IDF beats RF in 55% of Atari games.

  4. Random features are surprisingly effective for training, but inverse-dynamics learned features show better transfer to novel Mario levels under larger visual distribution shifts (e.g., day-to-night).

  5. Scaling PPO optimization stability (e.g., increasing parallel actors from 128 to 2048 in Mario) substantially improves exploration outcomes (e.g., discovering 11 levels, secret rooms, bosses).

  6. Curiosity can help with sparse/terminal extrinsic rewards: in a Unity maze, extrinsic-only PPO never finds the goal, while curiosity-augmented PPO converges to goal-reaching.

  7. Prediction-error curiosity has a key stochasticity limitation: agents may seek entropy (demonstrated with a noisy-TV setup), slowing or misdirecting learning.

Highlights

“We perform the first large-scale study of purely curiosity-driven learning, i.e. without any extrinsic rewards, across 54 standard benchmark environments.”
“IDF-curious agent collects more game reward than a random agent in 75% of the Atari games, an RF-curious agent does better in 70%.”
“We find that while random features work well at training, IDF-learned features appear to generalize better (e.g. to novel game levels in Super Mario Bros.).”
“In the infinite horizon setting … we removed ‘done’ to separate the gains of an agent’s exploration from merely that of the death signal.”
“If the agent itself is the source of stochasticity in the environment, it can reward itself without making any actual progress.”

Topics

  • Reinforcement learning
  • Intrinsic motivation and curiosity
  • Exploration strategies
  • Representation learning for RL
  • Model-based RL and forward dynamics prediction
  • Generalization and transfer learning
  • Sparse reward RL
  • Stochastic environments and reward misalignment

Mentioned

  • PPO (Proximal Policy Optimization)
  • Arcade Learning Environment (ALE)
  • Unity ML-Agents
  • VAE (Variational Autoencoder)
  • Batch normalization
  • PyTorch-style convolutional architectures (Atari-like CNNs)
  • Roboschool (physics simulation environments)
  • Yuri Burda
  • Harri Edwards
  • Deepak Pathak
  • Amos Storkey
  • Trevor Darrell
  • Alexei A. Efros
  • RL - Reinforcement Learning
  • PPO - Proximal Policy Optimization
  • IDF - Inverse Dynamics Features
  • VAE - Variational Autoencoder
  • RF - Random Features
  • CNN - Convolutional Neural Network
  • MSE - Mean Squared Error
  • TV - Television (noisy-TV stochasticity manipulation)