Reinforcement Learning with Stable Baselines 3 - Introduction (P.1)
Based on sentdex's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Stable Baselines 3 is positioned as a shortcut for reinforcement learning: it standardizes the workflow so people can swap algorithms quickly while keeping the environment, with its rewards and observations, decoupled from the algorithm itself. The core promise is time savings and reduced engineering pain: once an environment is defined, training different RL methods becomes mostly a matter of changing one line of code, rather than rewriting the full training loop each time.
The tutorial starts by mapping the moving parts of reinforcement learning. An environment is the task (for example, LunarLander-v2), a model is the trained algorithm instance, and an agent is the decision-maker that interacts with the environment. Each interaction happens in a loop: the agent observes the current state (observations/states), chooses an action, the environment advances one step, and then returns a new observation plus a reward and a done signal. The action space can be discrete or continuous. Discrete actions are categorical choices (like left vs. right in CartPole). Continuous actions are real-valued ranges (common in robotics, where control signals like torques or positions vary continuously), and the tutorial notes that robotics often gets treated as continuous even when the underlying actuator positions are effectively discretized into many possibilities.
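To make the action-space distinction concrete, here is a minimal sketch using Gym's space classes; the specific sizes and bounds are illustrative, not taken from the tutorial.

```python
import numpy as np
from gym import spaces

# Discrete: a fixed set of categorical choices, e.g. CartPole's two actions (push left / push right).
discrete_actions = spaces.Discrete(2)

# Box: real-valued ranges, the usual choice for continuous control such as joint torques.
# The shape and bounds below are made up for illustration.
continuous_actions = spaces.Box(low=-1.0, high=1.0, shape=(3,), dtype=np.float32)

print(discrete_actions.sample())    # an integer in {0, 1}
print(continuous_actions.sample())  # a float32 array of shape (3,)
```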
On the tooling side, the setup centers on PyTorch as the backend, Stable Baselines 3 for the RL algorithms, and OpenAI Gym for environments. For the example environment, Box2D is installed to support LunarLander. With Gym and Stable Baselines 3 installed, the tutorial demonstrates how to create an environment, reset it, sample an action from the action space, inspect the observation space shape, and take steps using random actions. LunarLander-v2 is used to make these concepts concrete: the observation space is a flat vector of eight values, and stepping the environment yields rewards that eventually reflect failure (e.g., a large negative reward when the lander crashes).
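A minimal sketch of that exploration step, assuming the classic OpenAI Gym API the tutorial uses (reset() returning only the observation and step() returning four values; the newer Gymnasium API differs):

```python
import gym  # plus stable-baselines3 and a Box2D package (e.g. gym[box2d]) installed for LunarLander

env = gym.make("LunarLander-v2")

obs = env.reset()
print(env.observation_space.shape)  # (8,) — a flat vector of eight values
print(env.action_space)             # Discrete(4)

done = False
while not done:
    action = env.action_space.sample()          # random action, no learning yet
    obs, reward, done, info = env.step(action)  # new observation, reward, done flag, extra info
    env.render()
env.close()
```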
The training section then turns these abstractions into a code-level workflow. The key decision for algorithm selection is whether the environment’s action space is discrete or continuous (and sometimes multi-discrete/multi-binary). Using Stable Baselines 3’s supported-algorithm table as a guide, the tutorial picks A2C first for a discrete setting. It trains for 10,000 time steps, then evaluates performance via metrics like episode length and average episode reward. The results are described as underwhelming—better than random, but still weak.
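A sketch of that first training run, assuming the standard "MlpPolicy" and SB3's evaluate_policy helper for the episode-reward check (the tutorial may compute its metrics differently):

```python
import gym
from stable_baselines3 import A2C
from stable_baselines3.common.evaluation import evaluate_policy

env = gym.make("LunarLander-v2")

# A2C supports discrete action spaces, which is why it passes the algorithm-table filter here.
model = A2C("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=10_000)

# Average episode reward over a handful of evaluation episodes.
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=10)
print(f"mean reward: {mean_reward:.1f} +/- {std_reward:.1f}")
```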
To test whether more training helps, the tutorial increases training to 100,000 steps, but performance does not steadily improve; episode reward trends can degrade after a point. It then swaps algorithms to PPO (again by changing the import and the model class) and retrains for 100,000 steps. PPO shows some improvements—such as staying alive longer and training faster—but both A2C and PPO are still far from strong performance at these step counts. The tutorial attributes the poor outcomes to insufficient training duration and emphasizes that real RL runs often require millions to tens or hundreds of millions of steps.
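The swap really is close to a one-line change. A sketch, under the same assumptions as the A2C example above:

```python
import gym
from stable_baselines3 import PPO  # only the import and the model class change from the A2C run

env = gym.make("LunarLander-v2")
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=100_000)
```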
Finally, the tutorial flags the next steps: saving/loading models and tracking progress over time (e.g., with TensorBoard). The takeaway is that Stable Baselines 3 reduces friction around algorithms, but the biggest engineering challenge—and the biggest performance lever—remains the environment design, reward shaping, and observation structure, which will be tackled in later parts of the series.
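As a preview of those next steps, a minimal sketch of SB3's save/load calls and TensorBoard logging; the directory names are placeholders, not from the video:

```python
import gym
from stable_baselines3 import PPO

env = gym.make("LunarLander-v2")

# Point SB3 at a log directory so learning curves show up in TensorBoard.
model = PPO("MlpPolicy", env, verbose=1, tensorboard_log="logs")
model.learn(total_timesteps=100_000)

# Save, then reload later to continue training or run the trained policy.
model.save("models/ppo_lunarlander")
model = PPO.load("models/ppo_lunarlander", env=env)
```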
Cornell Notes
Stable Baselines 3 streamlines reinforcement learning by separating environment engineering from algorithm training. The tutorial breaks down RL fundamentals—environment, agent, observations (states), actions, and the step loop that returns new observations and rewards—then demonstrates these mechanics using LunarLander-v2. It highlights that action space type (discrete vs. continuous, plus multi-discrete/multi-binary) largely determines which algorithms are usable. Training A2C and PPO on LunarLander shows weak performance at 10k–100k steps, reinforcing that RL typically needs far more experience (often millions of steps). The practical next step is learning to save/load models and track learning curves over time.
What are the essential components in a reinforcement learning loop, and how do they interact at each step?
How does action space type (discrete vs. continuous) affect algorithm choice?
Why does the tutorial emphasize environment and reward/observation design over algorithm tweaking?
What does the LunarLander-v2 example show about observations and actions in practice?
What happened when A2C and PPO were trained for 10,000 and 100,000 steps?
Why does the tutorial say saving/loading and tracking matter next?
Review Questions
- Which RL variables are returned by the environment after each action, and what does the done flag indicate?
- How would you decide whether A2C or PPO is appropriate before training, based on the environment’s action space?
- Why might increasing training steps from 10,000 to 100,000 fail to improve performance in LunarLander-v2?
Key Points
1. Stable Baselines 3 separates environment engineering from algorithm training, making it easier to swap RL methods without rewriting the training loop.
2. Reinforcement learning interaction is a step loop: observation → action → environment step → new observation, reward, and done.
3. Action space type (discrete vs. continuous, plus multi-discrete/multi-binary) is a primary filter for which algorithms can be used.
4. For LunarLander-v2, observations are an eight-value flat vector, and each step produces rewards that reflect progress or failure.
5. Default hyperparameter tweaks often yield small gains compared with changes to reward design and observation structure.
6. Training A2C and PPO for only 10k–100k steps produces weak results, underscoring that RL commonly needs millions to tens/hundreds of millions of steps.
7. Saving/loading and performance tracking are essential for diagnosing learning curves and continuing training efficiently.