Tweaking Custom Environment Rewards - Reinforcement Learning with Stable Baselines 3 (P.4)
Based on sentdex's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Sparse or short-lived reward signals can prevent learning even when training runs without errors.
Briefing
Reward design—not the learning algorithm—was the deciding factor in whether the snake agent learned anything useful. After an initial Doom-to-snake conversion, training produced only slow, weak improvement: episode length rose and rewards crept upward, but not fast enough to indicate real learning. The next iterations focused on changing how the environment computes reward, because “slap it in and hope” fails when the reward signal is sparse or poorly shaped.
The first tweak made rewards extremely rare and short-lived. Instead of continuing to reward progress after an apple was found, the setup effectively granted reward only on the frame where an apple was collected; every other step carried a fixed penalty, including a -10 punishment. In practice, TensorBoard showed rewards that never improved meaningfully over time. The agent failed to learn a strategy because it had almost no informative feedback—only occasional spikes.
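A minimal sketch of that sparse scheme makes the problem visible: almost every step returns the same penalty, so the gradient signal carries nearly no information. The function name and the exact apple/per-step magnitudes are assumptions; only the apple-frame-only reward and the -10 punishment come from the description above.

```python
def sparse_reward(ate_apple: bool, punished: bool) -> float:
    """Reward only on the frame an apple is collected; otherwise a flat penalty."""
    if ate_apple:
        return 10.0   # hypothetical magnitude; only this frame is rewarded
    if punished:
        return -10.0  # the -10 punishment mentioned in the text
    return -0.1       # fixed per-step penalty (value assumed)

# Over a long episode, nearly every step looks identical to the agent:
episode = [sparse_reward(False, False) for _ in range(200)]
```

Because 200 consecutive steps all return -0.1, the agent cannot tell a good trajectory from a bad one until it stumbles onto an apple by chance—exactly the "occasional spikes" TensorBoard showed.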
A second attempt used Euclidean distance to the apple as a punishment: the farther the snake was from the apple, the more negative the reward. This sounded like a sensible shaping idea, but it created an unintended incentive. With the distance-based punishment strong enough, the agent discovered it could end the episode immediately (for example by moving into itself) and avoid accumulating any more negative reward. Episode lengths collapsed, and rewards plateaued at low negative values. The behavior wasn’t “random”; it was a logical exploitation of the reward function.
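A back-of-the-envelope comparison shows why ending the episode becomes optimal under pure distance punishment. This is a sketch, assuming the per-step reward is simply the negative Euclidean distance to the apple; the coordinates and episode length are illustrative.

```python
import math

def distance_punishment(head, apple):
    # Per-step reward: more negative the farther the snake is from the apple.
    return -math.dist(head, apple)

# Surviving 50 steps at distance 5 from the apple accumulates -5 * 50 = -250,
# while crashing into yourself on step 1 accumulates only -5.
survive_return = sum(distance_punishment((0, 5), (0, 0)) for _ in range(50))
die_fast_return = distance_punishment((0, 5), (0, 0))
```

Since `die_fast_return` (-5) is far greater than `survive_return` (-250), the return-maximizing policy is to terminate immediately—the collapse in episode length the text describes.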
To fix that, the reward was rebalanced to make survival and apple-chasing more attractive than terminating early. The environment added a large positive baseline (250) and then subtracted Euclidean distance, so moving closer to the apple increased reward. It also introduced a huge apple collection bonus (10,000) to ensure that actually eating apples outweighed strategies like circling at a distance. Because these numbers became very large, the total reward was scaled down (divided by 100) to keep training stable.
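The rebalanced computation can be sketched directly from the constants stated above (250 baseline, Euclidean distance subtracted, 10,000 apple bonus, total divided by 100). The function name and argument shapes are assumptions, not the environment's actual code.

```python
import math

def shaped_reward(head, apple, ate_apple: bool) -> float:
    reward = 250.0 - math.dist(head, apple)  # positive baseline minus distance
    if ate_apple:
        reward += 10_000.0                   # huge bonus so eating beats circling nearby
    return reward / 100.0                    # scale down to keep training stable
```

With this shape, moving closer strictly increases per-step reward (e.g., distance 1 yields 2.49 versus 2.45 at distance 5), surviving is always worth more than terminating, and collecting an apple dwarfs any loitering strategy.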
After rescaling, TensorBoard showed rewards in a more appropriate range, and the agent could be played. The snake did learn: it spent time near the apple and often reached it, though it still sometimes ran into itself and tended to make large, arcing moves. Those remaining issues were attributed less to the algorithm and more to environment design—especially observations—since the agent appeared not to reliably account for the snake’s own body positions.
Overall, the work underscored a core lesson for custom reinforcement learning environments: reward shaping can make learning succeed or fail, and even “reasonable” signals like distance can backfire if they create incentives the agent can exploit. Further gains likely require improving observations in addition to reward tuning.
Cornell Notes
The agent’s learning success hinged on how the environment computed rewards. A sparse, short-lived reward scheme produced almost no improvement because the agent rarely received informative feedback. Switching to Euclidean-distance punishment made training collapse: the snake learned to end episodes immediately (often by running into itself) to avoid accumulating negative reward. Adding a positive baseline, subtracting distance, and giving a very large apple-eating bonus (then scaling down the totals) restored learning and produced a snake that could reliably chase apples, though it still sometimes collided with itself. The takeaway is that reward shaping must be designed to prevent unintended incentives and to align with the behavior you actually want.
Why did the initial reward design fail to produce learning?
How did Euclidean-distance punishment create an unintended strategy?
What changes prevented the agent from exploiting early termination?
Why was scaling applied after introducing large reward constants?
What remaining behaviors suggested observation issues rather than reward issues?
Review Questions
- What specific reward property made the first approach “informationally sparse,” and how did that show up in training metrics?
- Describe the incentive loop created by Euclidean-distance punishment and explain why ending the episode became optimal.
- Which reward components (baseline, distance penalty, apple bonus, scaling) most directly changed the agent’s behavior, and what behavior did each component encourage?
Key Points
1. Sparse or short-lived reward signals can prevent learning even when training runs without errors.
2. Distance-based penalties can backfire by creating incentives to terminate early to avoid accumulating negative reward.
3. Reward functions must be checked for unintended exploit strategies, especially in environments where the agent can end episodes quickly.
4. A positive baseline plus a distance term can encourage approach behavior without making survival inherently costly.
5. A large terminal/goal bonus (e.g., for apple collection) can ensure the agent prioritizes the intended objective over local motion patterns.
6. Reward scaling (e.g., dividing totals) helps keep training stable when constants become large.
7. Even with good reward shaping, poor observations can still lead to collisions and inefficient movement patterns.