
Saving and Loading Models - Stable Baselines 3 Tutorial (P.2)

sentdex
5 min read

Based on sentdex's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Create separate directories for saved models and TensorBoard logs so checkpoints and metrics don’t overwrite each other.

Briefing

Reinforcement learning runs can look “good” early and then collapse later, so the practical fix is to save checkpoints and track training metrics over time—then load the best-performing snapshot for evaluation. This tutorial segment shows a workflow for Stable Baselines 3 that trains PPO and A2C in chunks, saves models every fixed number of timesteps, logs to TensorBoard, and later reloads a chosen checkpoint to watch the agent behave in the environment.

Training starts by creating dedicated directories for saved models and TensorBoard logs (e.g., a models directory and a logs directory). The code then sets a checkpoint interval (10,000 timesteps in the example) and runs training in a loop: each iteration calls model.learn for a fixed number of timesteps with reset_num_timesteps=False so the timestep counter keeps running, then writes a checkpoint with model.save using a filename that encodes total training progress (e.g., models_dir/10000.zip, models_dir/20000.zip, and so on). This chunked approach matters because a single long run makes it harder to recover from volatility, especially when reward trends reverse after reaching a peak.

The tutorial trains PPO first, then repeats the same process for A2C. A key implementation detail is ensuring TensorBoard logging is configured correctly; an initial logging mistake leaves the logs empty and is corrected by setting tb_log_name and pointing TensorBoard at the right log directory. Once training runs long enough (the example continues until 300,000 timesteps for the first comparison), TensorBoard is used to compare reward mean and other signals. PPO is described as typically smoother, with less volatility across metrics like entropy loss, while A2C can be more erratic.

After training, the workflow shifts from charts to reality: a saved checkpoint is loaded with PPO.load(model_path, env=env), and the agent is run for multiple episodes using model.predict to generate actions. The goal is not just to maximize reward mean but to verify that behavior is actually competent—reward spikes can come from short-term tricks that fail to recover. In the environment used here (a simple landing task with flags), the loaded PPO checkpoint shows improved episode length and performance compared with earlier checkpoints and random behavior, though the tutorial notes that some models can still degrade.

To address surprising results—where A2C sometimes appears to overtake PPO early, and both can degrade later—the tutorial retrains multiple times with additional runs and longer training (eventually up to around a million timesteps in later experiments). The final takeaway is that PPO ends up as the best performer on this environment, but the bigger lesson is that random initialization and training duration strongly affect outcomes. Rather than forcing a single fixed random seed for every algorithm, the recommended practice is to train several runs and keep the best checkpoint, then continue training from that winner.

Cornell Notes

Stable Baselines 3 training can be volatile: reward may peak and then drop, so saving checkpoints and logging metrics is essential. The workflow trains PPO and A2C in timestep “chunks,” saves a model every 10,000 timesteps with filenames encoding progress, and logs to TensorBoard with tb_log_name so reward mean and other metrics can be compared. After training, a chosen checkpoint is reloaded via PPO.load(..., env=env) and evaluated by running episodes using model.predict for actions. Multiple retrains show that random initialization can change which algorithm looks best, so selecting the best saved checkpoint (often PPO here) is more reliable than trusting a single run.

Why split training into chunks and save models periodically instead of training once to the final timestep?

Because performance can reverse. In the example, reward trends can go positive early and then end up negative by a later timestep. Chunked training (e.g., model.learn for 10,000 timesteps inside a loop) plus checkpointing (model.save every interval) lets the workflow recover the best snapshot rather than being stuck with a degraded final model.

What does reset_num_timesteps=False accomplish when calling model.learn repeatedly?

It prevents the timestep counter from restarting each time model.learn is called. That keeps console output and TensorBoard curves aligned with the true total training progress, making checkpoint filenames and logged metrics consistent across iterations.

How does TensorBoard logging connect to saved checkpoints and comparisons between PPO and A2C?

Each training run writes logs into a logs directory, and tb_log_name labels the run so TensorBoard can plot curves for PPO (blue) and A2C (orange). Comparing reward mean and other metrics (like entropy loss) across these labeled runs helps identify which checkpoint is likely to perform well before loading it for visual evaluation.

Why evaluate a checkpoint by running episodes after loading it, instead of relying only on reward mean?

Reward spikes can come from short-term behavior that doesn’t generalize or can’t recover. The tutorial emphasizes visually checking behavior—e.g., an agent might exploit a temporary gain and then get stuck—so the only reliable test is running the loaded model in the environment and observing episode outcomes.

What role does randomness play in which algorithm “wins” on a given environment?

Random initialization and the specific training trajectory can change results dramatically. The tutorial notes that multiple runs can show different winners, so it recommends training a handful of models and selecting the best checkpoint rather than assuming one algorithm will always dominate under identical code.

What was the practical conclusion about PPO vs A2C in this landing/flag environment?

After additional retraining and longer runs, PPO checkpoints became the top performers, while A2C either matched briefly or degraded. The final guidance is that PPO was faster and better on this environment, though the tutorial still treats early comparisons as potentially misleading due to volatility and randomness.

Review Questions

  1. When training in a loop, what two settings ensure checkpoint filenames and TensorBoard timelines reflect total progress rather than restarting?
  2. What specific mismatch can occur between reward mean trends and real episode behavior, and how does the checkpoint-loading evaluation address it?
  3. Why might two algorithms that look similar on TensorBoard still require multiple retrains before declaring a winner?

Key Points

  1. Create separate directories for saved models and TensorBoard logs so checkpoints and metrics don’t overwrite each other.

  2. Train in fixed timestep chunks (e.g., 10,000) and save a checkpoint after each chunk to guard against later performance collapse.

  3. Use reset_num_timesteps=False when repeatedly calling model.learn so logged timesteps and console output remain consistent.

  4. Configure tb_log_name and TensorBoard log paths correctly; missing or misdirected logging can leave TensorBoard empty.

  5. Select checkpoints using TensorBoard trends, then verify by loading the model and running episodes with model.predict to confirm behavior.

  6. Expect random initialization to change outcomes; train multiple runs and keep the best checkpoint rather than relying on a single seed or single run.

  7. In this environment, PPO ultimately outperformed A2C after additional training, but early results could be misleading without checkpoint-based evaluation.

Highlights

Checkpointing turns a volatile training process into something recoverable: the best model may occur well before the final timestep.
TensorBoard comparisons (reward mean, entropy loss) are useful, but episode rollouts after loading the checkpoint are the real test.
reset_num_timesteps=False keeps repeated model.learn calls from resetting the timestep timeline, preserving meaningful plots and checkpoint naming.
Random initialization can flip which algorithm looks best; multiple runs plus checkpoint selection is more reliable than a single training run.
PPO ended up as the most dependable performer on the landing/flag task, with smoother metrics and better final behavior after retraining.

Topics

Mentioned

  • PPO
  • A2C
  • TensorBoard
  • GPU
  • RL