Saving and Loading Models - Stable Baselines 3 Tutorial (P.2)
Based on sentdex's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Reinforcement learning runs can look “good” early and then collapse later, so the practical fix is to save checkpoints and track training metrics over time—then load the best-performing snapshot for evaluation. This tutorial segment shows a workflow for Stable Baselines 3 that trains PPO and A2C in chunks, saves models every fixed number of timesteps, logs to TensorBoard, and later reloads a chosen checkpoint to watch the agent behave in the environment.
Training starts by creating dedicated directories for saved models and TensorBoard logs (e.g., a models directory and a logs directory). The code then sets a checkpoint interval (10,000 timesteps in the example) and runs training in a loop: each iteration calls model.learn for that fixed number of timesteps, passes reset_num_timesteps=False so the timestep counter is not reset between calls, and writes a checkpoint with model.save using a filename that encodes total training progress (e.g., models_dir/<timesteps so far>.zip, so 10000.zip, 20000.zip, and so on). This chunked approach matters because a single long run leaves nothing to fall back on when training is volatile, especially when the reward trend reverses after reaching a peak.
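A minimal sketch of that chunked training loop is below. The environment name (LunarLander-v2), the directory names (models/PPO, logs), and the chunk count are assumptions chosen to match the landing task described here; adjust them to your setup.

```python
import os

import gymnasium as gym  # older setups use `import gym` and the 4-value step API
from stable_baselines3 import PPO

models_dir = "models/PPO"  # saved checkpoints
logdir = "logs"            # TensorBoard event files
os.makedirs(models_dir, exist_ok=True)
os.makedirs(logdir, exist_ok=True)

env = gym.make("LunarLander-v2")  # assumed environment; substitute your own
env.reset()

model = PPO("MlpPolicy", env, verbose=1, tensorboard_log=logdir)

TIMESTEPS = 10_000
for i in range(1, 31):  # 30 chunks = 300,000 timesteps
    # reset_num_timesteps=False keeps the global step counter running across
    # calls, so TensorBoard curves and filenames reflect total progress.
    model.learn(total_timesteps=TIMESTEPS, reset_num_timesteps=False, tb_log_name="PPO")
    model.save(f"{models_dir}/{TIMESTEPS * i}")  # e.g. models/PPO/10000.zip, 20000.zip, ...
```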
The tutorial trains PPO first, then repeats the same process for A2C. A key implementation detail is ensuring TensorBoard logging is configured correctly; a logging mistake initially leaves the logs empty and is corrected by setting tb_log_name and pointing the tensorboard command at the right log directory. Once training runs long enough (the example continues to 300,000 timesteps for the first comparison), TensorBoard is used to compare mean episode reward and other signals. PPO is described as typically smoother, with less volatility across metrics like entropy loss, while A2C can be more erratic.
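The A2C pass follows the same pattern, writing to the same log directory under a different run name so both runs appear side by side in TensorBoard. A sketch under the same assumptions as above:

```python
import os

import gymnasium as gym
from stable_baselines3 import A2C

models_dir = "models/A2C"
logdir = "logs"  # same log directory as the PPO run, so both show up in one TensorBoard
os.makedirs(models_dir, exist_ok=True)
os.makedirs(logdir, exist_ok=True)

env = gym.make("LunarLander-v2")  # assumed environment
env.reset()

model = A2C("MlpPolicy", env, verbose=1, tensorboard_log=logdir)

TIMESTEPS = 10_000
for i in range(1, 31):
    model.learn(total_timesteps=TIMESTEPS, reset_num_timesteps=False, tb_log_name="A2C")
    model.save(f"{models_dir}/{TIMESTEPS * i}")

# Launch TensorBoard against the shared log directory to compare the runs:
#   tensorboard --logdir=logs
```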
After training, the workflow shifts from charts to reality: a saved checkpoint is loaded with PPO.load(model_path, env=env), and the agent is run for multiple episodes using model.predict to generate actions. The goal is not just to maximize mean reward but to verify that the behavior is actually competent, since reward spikes can come from short-lived strategies that do not hold up over full episodes. In the environment used here (a simple landing task with flags), the loaded PPO checkpoint shows improved episode length and performance compared with earlier checkpoints and random behavior, though the tutorial notes that some models can still degrade.
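A sketch of this load-and-watch step follows. The checkpoint path is illustrative (pick whichever checkpoint looked best in TensorBoard), and the environment is again assumed to be LunarLander-v2 under gymnasium.

```python
import gymnasium as gym
from stable_baselines3 import PPO

env = gym.make("LunarLander-v2", render_mode="human")  # assumed environment, rendered on screen

model_path = "models/PPO/290000.zip"  # illustrative path: choose your best checkpoint
model = PPO.load(model_path, env=env)

episodes = 5
for ep in range(episodes):
    obs, _info = env.reset()
    done = False
    while not done:
        action, _states = model.predict(obs)
        obs, reward, terminated, truncated, _info = env.step(action)
        done = terminated or truncated  # older gym returns a single `done` flag instead

env.close()
```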
To address surprising results, where A2C sometimes appears to overtake PPO early and both can degrade later, the tutorial retrains both algorithms several times and trains for longer (eventually up to around a million timesteps in later experiments). The final takeaway is that PPO ends up as the best performer on this environment, but the bigger lesson is that random initialization and training duration strongly affect outcomes. Rather than forcing a single fixed random seed for every algorithm, the recommended practice is to train several runs, keep the best checkpoint, and then continue training from that winner.
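One way to continue training from that winner is to reload the best checkpoint with the environment attached and keep calling learn in the same chunked loop. A sketch, with the checkpoint path and chunk indices purely illustrative; passing tensorboard_log to load is used here to keep the resumed run logging to the same directory.

```python
import gymnasium as gym
from stable_baselines3 import PPO

env = gym.make("LunarLander-v2")  # assumed environment
env.reset()

# Resume from the best-performing checkpoint instead of starting from scratch.
model = PPO.load("models/PPO/290000.zip", env=env, tensorboard_log="logs")

TIMESTEPS = 10_000
for i in range(31, 101):  # continue the numbering so filenames keep encoding total progress
    model.learn(total_timesteps=TIMESTEPS, reset_num_timesteps=False, tb_log_name="PPO")
    model.save(f"models/PPO/{TIMESTEPS * i}")
```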
Cornell Notes
Stable Baselines 3 training can be volatile: reward may peak and then drop, so saving checkpoints and logging metrics is essential. The workflow trains PPO and A2C in timestep “chunks,” saves a model every 10,000 timesteps with filenames encoding progress, and logs to TensorBoard with tb_log_name so reward mean and other metrics can be compared. After training, a chosen checkpoint is reloaded via PPO.load(..., env=env) and evaluated by running episodes using model.predict for actions. Multiple retrains show that random initialization can change which algorithm looks best, so selecting the best saved checkpoint (often PPO here) is more reliable than trusting a single run.
Why split training into chunks and save models periodically instead of training once to the final timestep?
What does reset_num_timesteps=False accomplish when calling model.learn repeatedly?
How does TensorBoard logging connect to saved checkpoints and comparisons between PPO and A2C?
Why evaluate a checkpoint by running episodes after loading it, instead of relying only on reward mean?
What role does randomness play in which algorithm “wins” on a given environment?
What was the practical conclusion about PPO vs A2C in this landing/flag environment?
Review Questions
- When training in a loop, what two settings ensure checkpoint filenames and TensorBoard timelines reflect total progress rather than restarting?
- What specific mismatch can occur between reward mean trends and real episode behavior, and how does the checkpoint-loading evaluation address it?
- Why might two algorithms that look similar on TensorBoard still require multiple retrains before declaring a winner?
Key Points
1. Create separate directories for saved models and TensorBoard logs so checkpoints and metrics don’t overwrite each other.
2. Train in fixed timestep chunks (e.g., 10,000) and save a checkpoint after each chunk to guard against later performance collapse.
3. Use reset_num_timesteps=False when repeatedly calling model.learn so logged timesteps and console output remain consistent.
4. Configure tb_log_name and tensorboard log paths correctly; missing or misdirected logging can leave TensorBoard empty.
5. Select checkpoints using TensorBoard trends, then verify by loading the model and running episodes with model.predict to confirm behavior.
6. Expect random initialization to change outcomes; train multiple runs and keep the best checkpoint rather than relying on a single seed or single run.
7. In this environment, PPO ultimately outperformed A2C after additional training, but early results could be misleading without checkpoint-based evaluation.