
Reinforcement Learning with Stable Baselines 3 - Introduction (P.1)

sentdex · 5 min read

Based on sentdex's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Stable Baselines 3 separates environment engineering from algorithm training, making it easier to swap RL methods without rewriting the training loop.

Briefing

Stable Baselines 3 is positioned as a shortcut for reinforcement learning: it standardizes the workflow so people can swap algorithms quickly while keeping the environment—and its reward/observations—separate. The core promise is time savings and reduced engineering pain: once an environment is defined, training different RL methods becomes mostly a matter of changing one line of code, rather than rewriting the full training loop each time.

The tutorial starts by mapping the moving parts of reinforcement learning. An environment is the task (for example, LunarLander-v2), a model is the trained algorithm instance, and an agent is the decision-maker that interacts with the environment. Each interaction happens in a loop: the agent observes the current state (observations/states), chooses an action, the environment advances one step, and then returns a new observation plus a reward and a done signal. The action space can be discrete or continuous. Discrete actions are categorical choices (like left vs. right in CartPole). Continuous actions are real-valued ranges (common in robotics, where control signals like torques or positions vary continuously), and the tutorial notes that robotics often gets treated as continuous even when the underlying actuator positions are effectively discretized into many possibilities.

On the tooling side, the setup centers on PyTorch as the backend, Stable Baselines 3 for the RL algorithms, and OpenAI Gym for environments. For the example environment, Box2D is installed to support LunarLander. With Gym and Stable Baselines 3 installed, the tutorial demonstrates how to create an environment, reset it, sample an action from the action space, inspect the observation space shape, and take steps using random actions. LunarLander-v2 is used to make these concepts concrete: the observation space is a flat vector of eight values, and stepping the environment yields rewards that eventually reflect failure (e.g., a large negative reward when the lander crashes).

The training section then turns these abstractions into code-level workflow. The key decision for algorithm selection is whether the environment’s action space is discrete or continuous (and sometimes multi-discrete/multi-binary). Using Stable Baselines 3’s supported-algorithm table as a guide, the tutorial picks A2C first for a discrete setting. It trains for 10,000 time steps, then evaluates performance via metrics like episode length and average episode reward. The results are described as underwhelming—better than random, but still weak.

To test whether more training helps, the tutorial increases training to 100,000 steps, but performance does not steadily improve; episode reward trends can degrade after a point. It then swaps algorithms to PPO (again by changing the import and the model class) and retrains for 100,000 steps. PPO shows some improvements—such as staying alive longer and training faster—but both A2C and PPO are still far from strong performance at these step counts. The tutorial attributes the poor outcomes to insufficient training duration and emphasizes that real RL runs often require millions to tens or hundreds of millions of steps.

Finally, the tutorial flags the next steps: saving/loading models and tracking progress over time (e.g., with TensorBoard). The takeaway is that Stable Baselines 3 reduces friction around algorithms, but the biggest engineering challenge—and the biggest performance lever—remains the environment design, reward shaping, and observation structure, which will be tackled in later parts of the series.

Cornell Notes

Stable Baselines 3 streamlines reinforcement learning by separating environment engineering from algorithm training. The tutorial breaks down RL fundamentals—environment, agent, observations (states), actions, and the step loop that returns new observations and rewards—then demonstrates these mechanics using LunarLander-v2. It highlights that action space type (discrete vs. continuous, plus multi-discrete/multi-binary) largely determines which algorithms are usable. Training A2C and PPO on LunarLander shows weak performance at 10k–100k steps, reinforcing that RL typically needs far more experience (often millions of steps). The practical next step is learning to save/load models and track learning curves over time.

What are the essential components in a reinforcement learning loop, and how do they connect each step?

An environment defines the task (e.g., LunarLander-v2). An agent uses a model (a trained RL algorithm instance) to choose actions. At each step, the agent receives an observation/state describing the current situation, selects an action from the environment’s action space, and passes it to the environment. The environment advances one step and returns a new observation plus a reward and a done flag indicating whether the episode ended. This repeats until done.
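The loop above can be shown without any RL library at all. The toy environment and agent below (`CoinFlipEnv`, `random_agent`) are invented for illustration; only the interface mirrors Gym's reset/step convention.

```python
# Dependency-free illustration of the RL loop: observe -> act -> step ->
# new observation + reward + done, repeated until the episode ends.
import random

class CoinFlipEnv:
    """Toy environment: guess a coin flip; the episode ends after 5 steps."""
    def reset(self):
        self.t = 0
        return [self.t]                  # initial observation/state

    def step(self, action):
        self.t += 1
        coin = random.randint(0, 1)
        reward = 1.0 if action == coin else 0.0   # reward for a correct guess
        done = self.t >= 5                        # done flag ends the episode
        return [self.t], reward, done, {}         # obs, reward, done, info

def random_agent(obs):
    return random.randint(0, 1)          # pick from a 2-way discrete space

env = CoinFlipEnv()
obs, done, total = env.reset(), False, 0.0
while not done:                          # the loop repeats until done
    obs, reward, done, info = env.step(random_agent(obs))
    total += reward
print(total)                             # total reward over one episode
```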

How does action space type (discrete vs. continuous) affect algorithm choice?

Discrete action spaces require choosing one category per step (e.g., CartPole: left or right). Continuous action spaces output real-valued controls (common in robotics, such as torques or servo positions over a range like -1 to +1). Continuous problems are usually harder to learn and may require different algorithm support. The tutorial also notes multi-discrete/multi-binary cases, such as robots with multiple servos where several decisions happen at once.
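These space types map directly onto `gym.spaces`. A brief sketch, assuming `gym` is installed; the shapes and bounds here are illustrative, not taken from a specific environment.

```python
# The action-space categories discussed above, expressed as gym.spaces.
from gym.spaces import Discrete, Box, MultiDiscrete

cartpole_actions = Discrete(2)                       # categorical: left vs right
robot_torques = Box(low=-1.0, high=1.0, shape=(4,))  # real-valued controls
servo_steps = MultiDiscrete([256, 256, 256])         # several choices at once

print(cartpole_actions.sample())  # an int: 0 or 1
print(robot_torques.sample())     # array of 4 floats in [-1, 1]
print(servo_steps.sample())       # three ints, each in [0, 256)
```

Checking `type(env.action_space)` against these classes is the practical way to match an environment to Stable Baselines 3's supported-algorithm table.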

Why does the tutorial emphasize environment and reward/observation design over algorithm tweaking?

Stable Baselines 3 provides robust default implementations, so small hyperparameter tweaks often yield only marginal gains (roughly 5–10% in the author’s experience). Bigger performance changes come from algorithm choice, but the largest practical impact comes from the environment’s reward mechanism and what the agent observes. If the environment or reward is changed, the algorithm code may not need rewriting—only the environment definition.

What does the LunarLander-v2 example show about observations and actions in practice?

After creating the Gym environment, the tutorial samples an action from the environment’s action space and inspects the observation space shape. For LunarLander-v2, the observation is a flat vector of eight values. The code then steps the environment with random actions for many steps and prints rewards, illustrating that each step yields immediate feedback and that episodes can end with strong negative outcomes when the lander crashes.

What happened when A2C and PPO were trained for 10,000 and 100,000 steps?

With A2C, performance after 10,000 steps is described as weak—episode reward and survival are not impressive compared with the earlier random behavior. Increasing to 100,000 steps does not produce a clear improvement; episode reward can even worsen after a period. PPO is then tried by swapping the algorithm class, and it shows some advantages (training faster and staying alive longer), but both methods still look poor at these step counts, suggesting RL needs far more experience.

Why does the tutorial say saving/loading and tracking matter next?

Because training curves can degrade or plateau, comparing runs requires more than printing final metrics. Saving/loading enables continuing training without restarting from scratch, and tracking (e.g., with TensorBoard) makes it possible to see how episode reward and episode length evolve over time. The tutorial points to the next installment as the place where these practices are introduced.

Review Questions

  1. Which RL variables are returned by the environment after each action, and what does the done flag indicate?
  2. How would you decide whether A2C or PPO is appropriate before training, based on the environment’s action space?
  3. Why might increasing training steps from 10,000 to 100,000 fail to improve performance in LunarLander-v2?

Key Points

  1. Stable Baselines 3 separates environment engineering from algorithm training, making it easier to swap RL methods without rewriting the training loop.

  2. Reinforcement learning interaction is a step loop: observation → action → environment step → new observation, reward, and done.

  3. Action space type (discrete vs. continuous, plus multi-discrete/multi-binary) is a primary filter for which algorithms can be used.

  4. For LunarLander-v2, observations are an eight-value flat vector, and each step produces rewards that reflect progress or failure.

  5. Default hyperparameter tweaks often yield small gains compared with changes to reward design and observation structure.

  6. Training A2C and PPO for only 10k–100k steps produces weak results, underscoring that RL commonly needs millions to tens/hundreds of millions of steps.

  7. Saving/loading and performance tracking are essential for diagnosing learning curves and continuing training efficiently.

Highlights

  • Stable Baselines 3’s main value is workflow: change the algorithm class, keep the environment definition intact, and avoid rebuilding RL plumbing each time.
  • LunarLander-v2 uses an eight-dimensional flat observation vector, and random stepping quickly demonstrates how rewards and episode termination behave.
  • A2C and PPO can both look poor at 10,000–100,000 steps, even when they outperform random—suggesting insufficient training experience rather than a broken setup.
  • The tutorial frames RL difficulty as less about algorithm code and more about environment, reward mechanism, and observation engineering.

Topics

  • Stable Baselines 3 Setup
  • Gym Environments
  • RL Terminology
  • Discrete vs Continuous Actions
  • Training A2C and PPO
