Custom Environments - Reinforcement Learning with Stable Baselines 3 (P.3)
Based on sentdex's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Building a custom reinforcement learning environment hinges on two design choices that aren't handed to you once you leave the built-in benchmarks: what the agent observes and what it's rewarded for. After a simple Snake game is converted into a Gym-style environment compatible with Stable Baselines 3, the training run shows the pipeline works end to end, but the agent's performance plateaus because the observation features and reward shaping still don't give it enough guidance to reliably reach apples.
The tutorial starts by taking a playable Snake implementation (controlled with WASD) and speeding it up so it’s practical for an agent to interact with. The core conversion step is wrapping the game logic into a custom environment class that defines a discrete action space (four actions) and a Gym-compatible reset/step loop. In this setup, reset initializes episode state (including a done flag and score), while step applies an action, updates the game state (snake movement, apple consumption, collisions), and returns the next observation, reward, done status, and an info dictionary.
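A minimal structural sketch of such a wrapper might look like the following. This is not the tutorial's exact code: the class name, Box bounds, and 35-dimensional observation shape are placeholder assumptions, and it assumes the classic gym API (reset returns the observation, step returns a four-tuple).

```python
import gym
import numpy as np
from gym import spaces


class SnakeEnv(gym.Env):
    """Gym-style wrapper around the Snake game (structural sketch only)."""

    def __init__(self):
        super().__init__()
        self.action_space = spaces.Discrete(4)  # up, down, left, right
        # Fixed-size feature vector; the exact features are discussed next
        self.observation_space = spaces.Box(
            low=-500, high=500, shape=(35,), dtype=np.float32)

    def reset(self):
        # Fresh per-episode state: done flag, score, snake, apple, action history
        self.done = False
        self.score = 0
        # ... elided: initialize snake position, apple position, action padding ...
        return self._get_obs()

    def step(self, action):
        # ... elided: move the snake, handle apple consumption and collisions,
        #     and set self.done when the snake dies ...
        reward = 0.0  # reward logic is sketched further below
        info = {}
        return self._get_obs(), reward, self.done, info

    def _get_obs(self):
        # Placeholder vector; the engineered feature construction is sketched below
        return np.zeros(35, dtype=np.float32)
```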
The hardest part is deciding the observation space. Instead of feeding the agent raw pixel/state images (which would include lots of irrelevant “black space”), the environment uses feature engineering: the observation vector includes the snake head coordinates (head x, head y), the apple’s relative position (apple delta x, apple delta y), the current snake length, and a fixed-length history of previous actions padded with -1. This keeps the observation size fixed—an essential requirement for standard RL algorithms.
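As a concrete illustration, the fixed-size vector could be assembled as below. The helper name build_observation, the 30-slot history length, and the sign convention for the apple deltas are assumptions made for the sketch, not values confirmed by the video.

```python
import numpy as np

HISTORY_LEN = 30  # assumed length of the previous-action history


def build_observation(head_x, head_y, apple_x, apple_y, snake_length, prev_actions):
    """Engineered, fixed-size observation (illustrative sketch)."""
    apple_dx = apple_x - head_x
    apple_dy = apple_y - head_y
    # Pad the action history with -1 so the vector length never changes
    history = ([-1] * (HISTORY_LEN - len(prev_actions)) + list(prev_actions))[-HISTORY_LEN:]
    return np.array([head_x, head_y, apple_dx, apple_dy, snake_length] + history,
                    dtype=np.float32)


# Example: early in an episode, only two moves have been taken so far
obs = build_observation(250, 250, 300, 200, 3, prev_actions=[1, 2])
print(obs.shape)  # always (35,), no matter how many actions have occurred
```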
Reward design follows a similarly pragmatic approach. The environment tracks score, incrementing when the snake eats an apple. Death (collision) triggers a large negative reward (set to -10), while non-terminal steps initially yield a reward tied to score (and later the code is adjusted to ensure reward resets correctly on each episode). The intent is to heavily punish failure and provide positive reinforcement for apple consumption, but the training results suggest the agent still needs more informative signals to learn faster.
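A hedged sketch of that scheme, pulled out of step() as a standalone helper: the function name and the exact per-step handling are illustrative, since the video ties the non-terminal reward to score and later adjusts the code so the reward resets cleanly each episode.

```python
def compute_reward(ate_apple, died, score):
    """Reward scheme described above (illustrative sketch)."""
    if died:
        return -10.0           # heavy penalty for any collision
    if ate_apple:
        return float(score)    # positive signal tied to the running score
    return 0.0                 # nothing extra for ordinary survival steps


print(compute_reward(ate_apple=True, died=False, score=3))   # 3.0
print(compute_reward(ate_apple=False, died=True, score=3))   # -10.0
```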
To validate correctness before training, the tutorial uses Stable Baselines 3's check_env utility and then runs a separate "double check" script that executes random actions for many episodes, printing rewards and verifying that episode termination and reward behavior match expectations.
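In code, that two-stage check might look like the snippet below. It assumes the SnakeEnv sketched earlier has its game logic filled in (the bare skeleton never sets done, so the loop would not terminate against it), and the episode count is arbitrary.

```python
from stable_baselines3.common.env_checker import check_env

env = SnakeEnv()   # the wrapper sketched earlier, with game logic filled in
check_env(env)     # validates spaces, reset/step signatures, and dtypes

# "Double check": run random actions for several episodes and inspect rewards
EPISODES = 25
for ep in range(EPISODES):
    obs = env.reset()
    done = False
    total_reward = 0.0
    while not done:
        action = env.action_space.sample()
        obs, reward, done, info = env.step(action)
        total_reward += reward
    print(f"episode {ep}: total reward {total_reward}")
```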
Training then runs with PPO in Stable Baselines 3 for about 41,000 steps. Episode length increases over time, which makes sense given there’s no ongoing penalty for surviving and only a -10 hit on death. Episode reward also trends upward, but the agent is not yet strong. The conclusion is straightforward: the environment conversion and training loop are functioning, yet the current feature engineering and reward shaping don’t provide enough guidance for the snake to consistently reach apples. The next iteration is framed around improving feature engineering and reward design to accelerate learning and produce a better policy.
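For reference, the training call itself is short. This is a minimal sketch assuming the SnakeEnv above and default PPO hyperparameters; the save path is hypothetical.

```python
from stable_baselines3 import PPO

env = SnakeEnv()
model = PPO("MlpPolicy", env, verbose=1)   # vector observation, so an MLP policy
model.learn(total_timesteps=41_000)        # roughly the step count reported above
model.save("ppo_snake")                    # hypothetical save path
```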
Cornell Notes
A custom Snake environment for Stable Baselines 3 is built by defining a Gym-style reset/step interface, a discrete action space of four moves, and a fixed-size observation vector. The observation uses engineered features (snake head position, apple relative position deltas, snake length, and a padded previous-action history) instead of raw images to reduce noise. Rewards heavily penalize death with -10 and give positive reward when apples are eaten (via score), with careful handling so reward resets correctly each episode. Validation uses check_env plus a random-action "double check" script to confirm observations, rewards, and done logic behave as intended. PPO training for ~41k steps improves episode length and the reward trend, but performance remains limited, pointing to the need for better feature engineering and reward shaping.
- Why does the observation design become the hardest part when building a custom RL environment?
- What specific observation features are used, and why are they chosen over raw images?
- How does the environment define actions and episode flow?
- What reward scheme is implemented, and what problem does it create?
- How are environment bugs caught before training starts?
- What training outcome indicates the environment works but learning is still constrained?
Review Questions
- What constraints force the observation to be fixed-size, and how does the chosen Snake observation satisfy them?
- How do the reset and step methods interact to produce correct episode termination (done) and reward behavior?
- Which parts of the reward design likely influence the observed increase in episode length during training?
Key Points
1. Custom RL environments require explicit definitions of both observation space and reward; these choices often determine whether learning progresses.
2. Wrapping a game into a Gym-style environment means implementing reset and step so the agent can repeatedly interact with consistent state transitions.
3. Using engineered features (head position, apple deltas, snake length, previous actions) can reduce noise compared with feeding raw images.
4. Reward shaping with a strong death penalty (-10) and score-based positives can make training trends improve, but it may still fail to teach efficient apple-seeking.
5. Validation tools like check_env and random-action episode runs help catch interface and logic bugs before PPO training.
6. Even when training improves metrics like episode length, limited guidance in observations/rewards can keep the policy from becoming reliably effective.