
Custom Environments - Reinforcement Learning with Stable Baselines 3 (P.3)

sentdex · 5 min read

Based on sentdex's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Custom RL environments require explicit definitions of both observation space and reward; these choices often determine whether learning progresses.

Briefing

Custom reinforcement learning environments hinge on two design choices that aren't handed to you when you leave built-in benchmarks: what the agent observes and what it's rewarded for. After converting a simple Snake game into a Gym-style environment compatible with Stable Baselines 3, the training run shows the pipeline works end-to-end, but the agent's performance plateaus because the observation features and reward shaping still don't give it enough guidance to reliably reach apples.

The tutorial starts by taking a playable Snake implementation (controlled with WASD) and speeding it up so it’s practical for an agent to interact with. The core conversion step is wrapping the game logic into a custom environment class that defines a discrete action space (four actions) and a Gym-compatible reset/step loop. In this setup, reset initializes episode state (including a done flag and score), while step applies an action, updates the game state (snake movement, apple consumption, collisions), and returns the next observation, reward, done status, and an info dictionary.
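The wrapper described above can be sketched as a plain class with the Gym reset/step contract. This is a minimal illustration, not the video's code: in the real environment the class subclasses gym.Env and declares spaces.Discrete(4) as its action space, and the grid size, action encoding, and +1 apple reward here are assumptions.

```python
import random

GRID = 10  # assumed board size for illustration

class SnakeEnvSketch:
    def reset(self):
        # Initialize episode state: done flag, score, snake, and apple.
        self.done = False
        self.score = 0
        self.snake = [(GRID // 2, GRID // 2)]           # head-first segment list
        self.apple = (random.randrange(GRID), random.randrange(GRID))
        return self._observe()

    def step(self, action):
        # Apply one of four discrete moves: 0=up, 1=down, 2=left, 3=right.
        dx, dy = [(0, -1), (0, 1), (-1, 0), (1, 0)][action]
        head = (self.snake[0][0] + dx, self.snake[0][1] + dy)
        reward = 0
        if head in self.snake or not (0 <= head[0] < GRID and 0 <= head[1] < GRID):
            self.done = True
            reward = -10                                # large terminal penalty
        else:
            self.snake.insert(0, head)
            if head == self.apple:
                self.score += 1
                reward = 1                              # assumed per-apple reward
                self.apple = (random.randrange(GRID), random.randrange(GRID))
            else:
                self.snake.pop()                        # move without growing
        # Classic Gym four-tuple: observation, reward, done, info.
        return self._observe(), reward, self.done, {}

    def _observe(self):
        # Stand-in for the engineered feature vector discussed below.
        hx, hy = self.snake[0]
        return [hx, hy, self.apple[0] - hx, self.apple[1] - hy, len(self.snake)]
```

The essential point is the contract: reset returns a first observation, and every step returns the same four-tuple shape so any Gym-compatible algorithm can drive the loop.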

The hardest part is deciding the observation space. Instead of feeding the agent raw pixel/state images (which would include lots of irrelevant “black space”), the environment uses feature engineering: the observation vector includes the snake head coordinates (head x, head y), the apple’s relative position (apple delta x, apple delta y), the current snake length, and a fixed-length history of previous actions padded with -1. This keeps the observation size fixed—an essential requirement for standard RL algorithms.
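A sketch of that feature vector, assuming a history length of 20 (the real environment sizes the history to the snake-length goal) and float32 dtype; the function name and layout are illustrative:

```python
from collections import deque

import numpy as np

HISTORY_LEN = 20  # assumed fixed history length

def make_observation(snake, apple, prev_actions):
    head_x, head_y = snake[0]
    apple_dx = apple[0] - head_x          # apple position relative to the head
    apple_dy = apple[1] - head_y
    snake_length = len(snake)
    # Fixed-length action history, left-padded with -1 for unused slots.
    history = list(prev_actions)
    history = [-1] * (HISTORY_LEN - len(history)) + history
    return np.array(
        [head_x, head_y, apple_dx, apple_dy, snake_length] + history,
        dtype=np.float32,
    )

obs = make_observation(
    snake=[(5, 5), (5, 6)],
    apple=(2, 7),
    prev_actions=deque([0, 3], maxlen=HISTORY_LEN),
)
# The vector always has 5 + HISTORY_LEN entries, regardless of game state.
```

Because the padding keeps the length constant, the same Box observation space works on the first step of an episode and the thousandth.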

Reward design follows a similarly pragmatic approach. The environment tracks score, incrementing when the snake eats an apple. Death (collision) triggers a large negative reward (set to -10), while non-terminal steps initially yield a reward tied to score (and later the code is adjusted to ensure reward resets correctly on each episode). The intent is to heavily punish failure and provide positive reinforcement for apple consumption, but the training results suggest the agent still needs more informative signals to learn faster.
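The reward rule reduces to a small function. The -10 death penalty comes from the video; tying the positive signal to apples (here +1 per apple) is an assumption about how score feeds into reward:

```python
def compute_reward(died, ate_apple):
    if died:
        return -10        # heavy punishment for collision with wall or body
    if ate_apple:
        return 1          # positive reinforcement for reaching an apple
    return 0              # no ongoing penalty for merely surviving

# One bug class the video addresses: score and reward accumulators must be
# re-initialized in reset() so values from a previous episode never leak
# into the next one.
```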

To validate correctness before training, the tutorial uses Stable Baselines 3's check_env utility and then runs a separate "double check" script that executes random actions for many episodes, printing rewards and verifying that episode termination and reward behavior match expectations.
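The second validation layer can be sketched as a random-action smoke test. The DummyEnv below is a stand-in so the sketch runs on its own; in the real workflow the loop drives the Snake environment, and the first layer is a call to stable_baselines3.common.env_checker.check_env(env):

```python
import random

class DummyEnv:
    # Stand-in environment: terminates after 5 steps with a -10 penalty.
    def reset(self):
        self.steps = 0
        return [0.0]

    def step(self, action):
        self.steps += 1
        done = self.steps >= 5
        reward = -10 if done else 0
        return [float(self.steps)], reward, done, {}

def smoke_test(env, episodes=100):
    # Play many random-action episodes and collect total rewards so that
    # bugs in done/reward logic surface before any long training run.
    totals = []
    for _ in range(episodes):
        env.reset()
        done, total = False, 0
        while not done:
            _, reward, done, _ = env.step(random.randrange(4))
            total += reward
        totals.append(total)
    return totals

totals = smoke_test(DummyEnv())
```

If reward were not reset per episode, the totals would drift across episodes instead of matching expectations, which is exactly the failure mode this check exposes.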

Training then runs with PPO in Stable Baselines 3 for about 41,000 steps. Episode length increases over time, which makes sense given there’s no ongoing penalty for surviving and only a -10 hit on death. Episode reward also trends upward, but the agent is not yet strong. The conclusion is straightforward: the environment conversion and training loop are functioning, yet the current feature engineering and reward shaping don’t provide enough guidance for the snake to consistently reach apples. The next iteration is framed around improving feature engineering and reward design to accelerate learning and produce a better policy.

Cornell Notes

A custom Snake environment for Stable Baselines 3 is built by defining a Gym-style reset/step interface, a discrete action space of four moves, and a fixed-size observation vector. The observation uses engineered features (snake head position, apple relative position as deltas, snake length, and padded previous-action history) instead of raw images to reduce noise. Rewards heavily penalize death with -10 and give positive reward when apples are eaten (via score), with careful handling so reward resets correctly each episode. Validation uses check_env plus a random-action "double check" script to confirm observations, rewards, and done logic behave as intended. PPO training for ~41k steps improves episode length and reward trend, but performance remains limited, pointing to the need for better feature engineering and reward shaping.

Why does the observation design become the hardest part when building a custom RL environment?

Built-in Gym environments provide observation and reward structure automatically, but a custom environment must define them from scratch. The observation must be informative enough for the agent to act, yet fixed-size for standard RL algorithms. In this Snake setup, the observation can’t simply be “the whole image” without adding lots of irrelevant pixels; instead it uses engineered features (head x/y, apple delta x/y, snake length, and a fixed-length previous-action history padded with -1).

What specific observation features are used, and why are they chosen over raw images?

The observation vector includes: (1) snake head coordinates (head x, head y), (2) apple relative position as deltas (apple delta x, apple delta y), (3) snake length computed from the snake’s stored coordinate list, and (4) previous actions stored in a deque of fixed length (snake length goal) filled with -1 initially. Raw images are avoided because most pixels are uninformative background, creating noise and extra learning burden.
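The fixed-length action history can be sketched with a capped deque. The cap value below is an assumption standing in for the snake-length goal; the key behavior is that the structure starts full of -1 and stays the same length forever:

```python
from collections import deque

SNAKE_LEN_GOAL = 30  # assumed snake-length goal used as the history cap

# Pre-fill with -1 so the observation is full-size from the very first step.
prev_actions = deque([-1] * SNAKE_LEN_GOAL, maxlen=SNAKE_LEN_GOAL)

prev_actions.append(2)   # newest action pushes out the oldest -1
```

Because maxlen is set, appends never grow the deque, so no manual truncation is needed when building each observation.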

How does the environment define actions and episode flow?

Actions are discrete with four possible values, matching the four movement directions. The environment follows the Gym lifecycle: reset initializes episode state (done=False, score/reward initialization, and observation state), then step applies the chosen action, updates the game (movement, apple consumption, collisions), and returns (observation, reward, done, info).

What reward scheme is implemented, and what problem does it create?

Death (collision) yields a large negative reward of -10. Otherwise, reward is tied to score (which increases when apples are eaten), and reward is reset properly at the start of each episode. Even with this shaping, training shows limited improvement: the agent survives longer and reward trends upward, but it still doesn’t reliably reach apples quickly, implying the reward/feature signals aren’t guiding learning enough.

How are environment bugs caught before training starts?

Two layers of checks are used. First, Stable Baselines 3's check_env validates that the environment conforms to the expected Gym interface (spaces, return types, shapes). Second, a random-action "double check" script runs many episodes, printing actions/rewards/done outcomes so issues like incorrect reward resetting or observation formatting show up before PPO training.

What training outcome indicates the environment works but learning is still constrained?

After roughly 41,000 PPO steps, episode length increases and episode reward shows a slow upward trend. That indicates the environment loop and reward plumbing are functioning. However, the agent is described as not impressive yet, and the tutorial attributes the limitation to insufficient feature engineering and reward shaping—especially the need for better signals to reach apples faster.

Review Questions

  1. What constraints force the observation to be fixed-size, and how does the chosen Snake observation satisfy them?
  2. How do the reset and step methods interact to produce correct episode termination (done) and reward behavior?
  3. Which parts of the reward design likely influence the observed increase in episode length during training?

Key Points

  1. Custom RL environments require explicit definitions of both observation space and reward; these choices often determine whether learning progresses.

  2. Wrapping a game into a Gym-style environment means implementing reset and step so the agent can repeatedly interact with consistent state transitions.

  3. Using engineered features (head position, apple deltas, snake length, previous actions) can reduce noise compared with feeding raw images.

  4. Reward shaping with a strong death penalty (-10) and score-based positives can make training trends improve, but it may still fail to teach efficient apple-seeking.

  5. Validation tools like check_env and random-action episode runs help catch interface and logic bugs before PPO training.

  6. Even when training improves metrics like episode length, limited guidance in observations/rewards can keep the policy from becoming reliably effective.

Highlights

The observation vector is built from engineered state features—head x/y, apple delta x/y, snake length, and padded previous actions—rather than raw images to avoid overwhelming noise.
A large terminal penalty (-10) for collisions and score-based rewards create measurable learning progress, but not enough to produce a strong apple-seeking agent within ~41k PPO steps.
check_env plus a separate random-action "double check" run is used to verify observation shapes, reward resets, and done logic before committing to long training runs.

Topics

  • Custom Environments
  • Gym Conversion
  • Observation Engineering
  • Reward Shaping
  • Stable Baselines 3 PPO
