Creating A Reinforcement Learning (RL) Environment - Reinforcement Learning p.4

sentdex · 5 min read

Based on sentdex's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

The agent’s state is relative geometry: (player−food) and (player−enemy) deltas are combined into a tuple that indexes the Q-table.

Briefing

A simple grid-world built from scratch lets a Q-learning agent learn to reach a “food” blob while avoiding an “enemy” blob—despite having no explicit knowledge of walls or boundaries. The environment uses relative positions (player-to-food and player-to-enemy deltas) as the observation, a small discrete action set for movement, and a tabular Q-table updated with standard Q-learning. The striking result: the agent can still succeed at high rates, often by using the walls as an accidental navigation aid, even though it never “learns” wall geometry directly.

The setup starts with a 10×10 grid. Three blobs are initialized at random coordinates: the player (blue), the food (green), and the enemy (red). In the early training mode, only the player moves; the food and enemy stay stationary. The reward structure is tuned to shape behavior: hitting the enemy ends the episode with a large negative reward (an enemy penalty of 300, applied as −300), reaching the food ends the episode with a positive reward (a food reward of 25), and every other move incurs a small step cost (−1). Exploration is controlled by an epsilon-greedy policy: epsilon begins at 0.9 and decays over 25,000 episodes.
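
Collected as constants, the setup described above might look like the following sketch (the names are illustrative rather than the video's exact identifiers, and the decay factor is an assumed value consistent with decaying epsilon over 25,000 episodes):

```python
SIZE = 10              # 10x10 grid
HM_EPISODES = 25_000   # episodes over which epsilon decays
MOVE_PENALTY = 1       # every non-terminal step costs -1
ENEMY_PENALTY = 300    # hitting the enemy yields reward -300
FOOD_REWARD = 25       # reaching the food yields reward +25
EPSILON = 0.9          # initial exploration rate
EPS_DECAY = 0.9998     # assumed per-episode multiplicative decay
LEARNING_RATE = 0.1
DISCOUNT = 0.95
```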

Instead of feeding absolute coordinates, the observation is built from relative geometry. A custom blob class supports operator overloading so the code can compute (player − food) and (player − enemy), producing a tuple of coordinate deltas. That tuple becomes the key into a dictionary-based Q-table. Because there are two relative position pairs, the Q-table effectively stores Q-values for every combination of those deltas, with four discrete actions corresponding to diagonal moves (choice 0–3). The move method also clamps positions to the grid edges: if the player attempts to step outside the 0–9 range, it gets snapped back to the boundary.
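
A sketch of such a blob class, reconstructing the described behavior (random spawn, overloaded subtraction for the observation deltas, diagonal-only actions, and clamping in move()); this is not the video's verbatim code:

```python
import numpy as np

class Blob:
    def __init__(self, size):
        self.size = size
        self.x = np.random.randint(0, size)  # random spawn in 0..size-1
        self.y = np.random.randint(0, size)

    def __sub__(self, other):
        # Operator overloading: player - food yields a (dx, dy) delta.
        return (self.x - other.x, self.y - other.y)

    def action(self, choice):
        # Four discrete choices, all diagonal moves.
        if choice == 0:
            self.move(dx=1, dy=1)
        elif choice == 1:
            self.move(dx=-1, dy=-1)
        elif choice == 2:
            self.move(dx=-1, dy=1)
        elif choice == 3:
            self.move(dx=1, dy=-1)

    def move(self, dx, dy):
        self.x += dx
        self.y += dy
        # Clamp to the grid: a step outside 0..size-1 snaps to the edge.
        self.x = max(0, min(self.x, self.size - 1))
        self.y = max(0, min(self.y, self.size - 1))
```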

Training iterates over episodes and steps (200 steps per episode). For each step, the agent chooses either a random action (with probability epsilon) or the action with the highest Q-value for the current observation. After moving, it recomputes the new observation, calculates the reward (food, enemy, or move penalty), and updates the Q-table using the Q-learning update rule with a learning rate of 0.1 and a discount factor of 0.95. When food or enemy is reached, the episode terminates; otherwise it continues.
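
Put together, the loop might look like this minimal sketch, assuming the constants and Blob class above and a q_table dict pre-filled with a four-entry list of Q-values per observation (an initialization sketch appears in the scaling discussion below):

```python
import numpy as np

for episode in range(HM_EPISODES):
    player, food, enemy = Blob(SIZE), Blob(SIZE), Blob(SIZE)
    for step in range(200):
        obs = (player - food, player - enemy)
        # Epsilon-greedy: random action with probability epsilon.
        if np.random.random() < EPSILON:
            action = np.random.randint(0, 4)
        else:
            action = int(np.argmax(q_table[obs]))
        player.action(action)

        # Reward: terminal on enemy or food contact, else a step cost.
        if (player.x, player.y) == (enemy.x, enemy.y):
            reward = -ENEMY_PENALTY
        elif (player.x, player.y) == (food.x, food.y):
            reward = FOOD_REWARD
        else:
            reward = -MOVE_PENALTY

        # Standard Q-learning update; terminal states take the raw reward.
        new_obs = (player - food, player - enemy)
        max_future_q = max(q_table[new_obs])
        current_q = q_table[obs][action]
        if reward == FOOD_REWARD or reward == -ENEMY_PENALTY:
            new_q = reward
        else:
            new_q = (1 - LEARNING_RATE) * current_q + LEARNING_RATE * (
                reward + DISCOUNT * max_future_q
            )
        q_table[obs][action] = new_q

        if reward == FOOD_REWARD or reward == -ENEMY_PENALTY:
            break  # episode ends on food or enemy
    EPSILON *= EPS_DECAY  # shift from exploration toward exploitation
```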

Visualization is done with OpenCV: the grid is rendered as an RGB image, scaled up for display, and refreshed each step. After training, the learned Q-table is saved via pickle and can be reloaded for evaluation with epsilon set to 0 (pure exploitation). The agent’s behavior becomes visibly competent: it zigzags toward the food and, in many cases, “bounces” off walls to reach otherwise hard-to-access positions.
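
A sketch of the rendering and persistence flow (the BGR color tuples, window name, and display size here are illustrative choices, not necessarily the video's):

```python
import cv2
import numpy as np
import pickle

# Render one frame: paint each blob's cell, then scale up for display.
env = np.zeros((SIZE, SIZE, 3), dtype=np.uint8)
env[food.y][food.x] = (0, 255, 0)      # food: green (BGR)
env[player.y][player.x] = (255, 0, 0)  # player: blue
env[enemy.y][enemy.x] = (0, 0, 255)    # enemy: red
frame = cv2.resize(env, (300, 300), interpolation=cv2.INTER_NEAREST)
cv2.imshow("env", frame)
cv2.waitKey(1)

# After training, persist the table; reload it with EPSILON = 0 to evaluate.
with open("qtable.pickle", "wb") as f:
    pickle.dump(q_table, f)
```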

Scaling experiments highlight a practical cost of tabular methods. Increasing the grid from 10×10 to 20×20 causes the Q-table size to balloon—from roughly 15 MB to about 250 MB. Yet the agent still learns in the larger environment, including when movement is enabled for both food and enemy, sometimes handling more complex avoidance and pursuit dynamics. The overall takeaway is that even a constrained, wall-agnostic tabular setup can produce surprisingly effective navigation—while also demonstrating why state/action explosion quickly makes Q-tables unwieldy.
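
The blow-up follows directly from enumerating every delta combination up front. A sketch of that initialization (the uniform(-5, 0) starting values are an assumption in the spirit of the series, not stated in the summary), plus the back-of-envelope count:

```python
import numpy as np

q_table = {}
# Each delta coordinate ranges over -(SIZE-1)..SIZE-1: 2*SIZE - 1 values.
for dx1 in range(-SIZE + 1, SIZE):
    for dy1 in range(-SIZE + 1, SIZE):
        for dx2 in range(-SIZE + 1, SIZE):
            for dy2 in range(-SIZE + 1, SIZE):
                q_table[((dx1, dy1), (dx2, dy2))] = [
                    np.random.uniform(-5, 0) for _ in range(4)
                ]

# (2*SIZE - 1)**4 observations, 4 actions each:
# SIZE=10 -> 19**4 = 130,321 states; SIZE=20 -> 39**4 = 2,313,441,
# roughly an 18x jump, consistent with ~15 MB growing to ~250 MB.
```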

Cornell Notes

A tabular Q-learning agent learns in a custom 10×10 grid-world with three blobs: player (blue), food (green), and enemy (red). The agent’s observation is not absolute position; it’s the relative deltas (player−food and player−enemy), encoded as a tuple used as the key into a dictionary Q-table. Rewards are sparse and decisive: reaching food ends the episode with +25, hitting the enemy ends with −300, and every other move costs −1. Even though the agent only moves diagonally and has no explicit wall-awareness, it still reaches the food at high rates—often by using wall clamping/bounces as an implicit navigation mechanism. Scaling to 20×20 dramatically increases Q-table size (about 15 MB to ~250 MB), showing the limits of tabular approaches.

How does the environment define the agent’s state (observation) for Q-learning?

The state is built from relative positions using operator overloading in a blob class. After the player moves, the code computes two deltas: (player − food) and (player − enemy). Each delta is a pair of coordinates (dx, dy), so the observation becomes a tuple of tuples: ((dx_food, dy_food), (dx_enemy, dy_enemy)). This tuple directly indexes the dictionary-based Q-table.
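
Concretely, using the overloaded subtraction sketched earlier, the observation is a single expression:

```python
# ((dx_food, dy_food), (dx_enemy, dy_enemy)): hashable, so it can key a dict
obs = (player - food, player - enemy)
action_values = q_table[obs]  # four Q-values, one per diagonal action
```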

What actions can the agent take, and why does that matter?

The action space is four discrete choices, and movement is restricted to diagonals: choice 0 moves (+1, +1), choice 1 moves (−1, −1), choice 2 moves (−1, +1), and choice 3 moves (+1, −1). The video notes that a from-scratch redesign would likely include up/down/left/right actions as well; instead, the agent learns despite the diagonal-only constraint, often relying on wall clamping to change its effective trajectory.
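
For reference, the dispatch method from the blob class sketched earlier:

```python
def action(self, choice):
    # Four discrete choices, all diagonal moves.
    if choice == 0:
        self.move(dx=1, dy=1)
    elif choice == 1:
        self.move(dx=-1, dy=-1)
    elif choice == 2:
        self.move(dx=-1, dy=1)
    elif choice == 3:
        self.move(dx=1, dy=-1)
```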

How are rewards assigned, and how do they shape learning?

Rewards are tied to terminal events and step costs. If the player’s coordinates match the enemy’s coordinates, the episode ends with reward = −enemy_penalty (enemy_penalty = 300). If the player matches the food’s coordinates, the episode ends with reward = food_reward (food_reward = 25). Otherwise, each move gets reward = −move_penalty (move_penalty = 1). This structure makes reaching food the only positive terminal outcome while strongly discouraging enemy contact.
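
The corresponding branch from the training loop sketched earlier:

```python
if (player.x, player.y) == (enemy.x, enemy.y):
    reward = -ENEMY_PENALTY   # -300, episode ends
elif (player.x, player.y) == (food.x, food.y):
    reward = FOOD_REWARD      # +25, episode ends
else:
    reward = -MOVE_PENALTY    # -1 for every other step
```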

What is the Q-table update rule used after each move?

After taking an action and receiving a reward, the code computes the new observation and then the maximum future Q-value for that new state: max_future_q = max(Q[new_observation]). It then updates the current state-action value using: new_q = (1 − learning_rate) * current_q + learning_rate * (reward + discount * max_future_q). The learning_rate is 0.1 and discount is 0.95. Special cases shortcut the update when reward equals food_reward or −enemy_penalty by setting new_q directly to those terminal values.
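
The corresponding lines from the training loop sketched earlier:

```python
new_obs = (player - food, player - enemy)
max_future_q = max(q_table[new_obs])
current_q = q_table[obs][action]

if reward == FOOD_REWARD or reward == -ENEMY_PENALTY:
    new_q = reward  # terminal shortcut: write the terminal value directly
else:
    new_q = (1 - LEARNING_RATE) * current_q + LEARNING_RATE * (
        reward + DISCOUNT * max_future_q
    )
q_table[obs][action] = new_q
```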

Why does the agent sometimes succeed even though it has no explicit wall knowledge?

The move method clamps positions to the grid boundaries (0 to size−1). So when diagonal movement would push the player outside the grid, the player gets snapped back to the edge, effectively creating a bounce-like behavior. The agent never encodes “walls” in its state, but the environment dynamics still make wall interactions part of the transition outcomes. Over many episodes, Q-learning can exploit those transition effects to reach food positions that would otherwise be unreachable under diagonal-only movement.
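
The clamp in question is the final step of move() in the blob class sketched earlier:

```python
def move(self, dx, dy):
    self.x += dx
    self.y += dy
    # Snap back inside 0..size-1; at a wall this cancels one axis of the
    # diagonal step, producing the "bounce" the agent learns to exploit.
    self.x = max(0, min(self.x, self.size - 1))
    self.y = max(0, min(self.y, self.size - 1))
```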

What happens to the approach when the grid size increases?

Tabular state coverage becomes expensive. The transcript reports that moving from 10×10 to 20×20 increases Q-table size from roughly 15 MB to about 250 MB. Training still works, including with movement enabled for food and enemy, but the memory and compute costs rise sharply—illustrating why deep Q-networks (DQNs) are often used later when state spaces grow.

Review Questions

  1. What exact tuple structure is used as the Q-table key, and how is it computed from blob positions?
  2. How do the diagonal-only actions and boundary clamping interact to produce “wall usage” behavior?
  3. Why does increasing the grid from 10×10 to 20×20 cause such a large jump in Q-table size?

Key Points

  1. The agent’s state is relative geometry: (player−food) and (player−enemy) deltas are combined into a tuple that indexes the Q-table.
  2. Food (+25) and enemy (−300) are terminal events; every non-terminal step costs −1 to encourage efficient paths.
  3. Epsilon-greedy exploration starts high (epsilon = 0.9) and decays across 25,000 episodes to shift from exploration to exploitation.
  4. Diagonal-only movement (four actions) plus boundary clamping can still yield high success rates, because wall interactions become implicit transition dynamics.
  5. Tabular Q-learning with a dictionary Q-table scales poorly: 10×10 is ~15 MB, while 20×20 is ~250 MB.
  6. Saving and reloading the learned Q-table via pickle enables evaluation with epsilon set to 0 for deterministic play.
  7. Enabling movement for food and enemy increases task complexity, and the agent can still learn avoidance/pursuit patterns without changing the core Q-learning loop.

Highlights

  • The agent learns to reach food at high rates even though it never encodes wall boundaries in its state—success emerges from how boundary clamping alters transitions.
  • Using relative-position deltas as observations keeps the state compact enough for a tabular approach, but it still explodes as grid size grows.
  • A reward design with strong terminal penalties (−300 for enemy) and small step costs (−1) drives clear, goal-directed behavior.
  • Scaling from 10×10 to 20×20 turns a manageable Q-table (~15 MB) into a much larger one (~250 MB), underscoring tabular limits.

Topics

  • Tabular Q-Learning
  • Custom Grid Environment
  • State Representation
  • Epsilon-Greedy Exploration
  • Reward Shaping