
Tweaking Custom Environment Rewards - Reinforcement Learning with Stable Baselines 3 (P.4)

sentdex · 4 min read

Based on sentdex's video on YouTube. If you like this content, support the original creator by watching, liking, and subscribing.

TL;DR

Sparse or short-lived reward signals can prevent learning even when training runs without errors.

Briefing

Reward design—not the learning algorithm—was the deciding factor in whether the snake agent learned anything useful. After an initial Doom-to-snake conversion, training produced only slow, weak improvement: episode length rose and rewards crept upward, but not fast enough to indicate real learning. The next iterations focused on changing how the environment computes reward, because “slap it in and hope” fails when the reward signal is sparse or poorly shaped.

The first tweak made rewards extremely rare and short-lived. Instead of continuing to reward progress after an apple was found, the setup effectively granted reward only on the frame where an apple was collected, while every other step received a fixed penalty (including a -10 punishment). In practice, TensorBoard showed rewards that never improved meaningfully over time. The agent failed to learn a strategy because it had almost no informative feedback—only occasional spikes.
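
As a minimal sketch (assuming a per-step reward helper inside the environment; the flag names and exact constants here are illustrative, not taken verbatim from the video), the scheme looked roughly like this:

    # Hedged sketch of the sparse reward scheme: a spike only on the apple
    # frame, a flat penalty everywhere else. Names and constants are
    # illustrative placeholders.
    def compute_reward(apple_collected: bool, snake_died: bool) -> float:
        if apple_collected:
            return 10.0    # informative feedback arrives only on this one frame
        if snake_died:
            return -10.0   # fixed punishment when the episode ends
        return -1.0        # every other step looks identical to the agent

Almost every step returns the same constant, so the feedback carries almost no information about whether the snake is getting closer to the apple.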

A second attempt used Euclidean distance to the apple as a punishment: the farther the snake was from the apple, the more negative the reward. This sounded like a sensible shaping idea, but it created an unintended incentive. With the distance-based punishment accruing on every step, the agent discovered it could end the episode immediately (for example, by moving into itself) and avoid the ongoing negative reward. Episode lengths collapsed, and rewards stayed pinned at low negative values. The behavior wasn’t “random”; it was a logical exploitation of the reward function.
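
A sketch of that shaping, assuming 2D head and apple coordinates and using np.linalg.norm for the Euclidean distance (argument names are illustrative), makes the exploit easy to see:

    import numpy as np

    # Hedged sketch of distance-as-punishment shaping: every surviving step
    # charges the agent the distance to the apple.
    def compute_reward(head, apple, snake_died: bool) -> float:
        if snake_died:
            return 0.0  # terminating stops the per-step penalties from accruing
        return -float(np.linalg.norm(np.asarray(head) - np.asarray(apple)))

A long episode spent far from the apple accumulates a large negative return, while dying on the first step costs almost nothing, so “end the episode now” becomes the optimal policy.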

To fix that, the reward was rebalanced to make survival and apple-chasing more attractive than terminating early. The environment added a large positive baseline (250) and then subtracted Euclidean distance, so moving closer to the apple increased reward. It also introduced a huge apple collection bonus (10,000) to ensure that actually eating apples outweighed strategies like circling at a distance. Because these numbers became very large, the total reward was scaled down (divided by 100) to keep training stable.
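
A hedged reconstruction of that reward (the 250 baseline, 10,000 apple bonus, and division by 100 are the values described above; the function shape and argument names are illustrative):

    import numpy as np

    # Positive baseline minus distance, a dominant bonus for eating the apple,
    # then scaled down to keep magnitudes manageable for the learner.
    def compute_reward(head, apple, apple_collected: bool) -> float:
        reward = 250.0 - float(np.linalg.norm(np.asarray(head) - np.asarray(apple)))
        if apple_collected:
            reward += 10_000.0  # eating must outweigh hovering near the apple
        return reward / 100.0

Because the baseline keeps every surviving step positive, staying alive is no longer worse than dying, while the distance term still pulls the snake toward the apple.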

After rescaling, TensorBoard showed rewards in a more appropriate range, and the trained agent could be watched playing. The snake did learn: it spent time near the apple and often reached it, though it still sometimes ran into itself and tended to make large, arcing moves. Those remaining issues were attributed less to the algorithm and more to environment design (especially the observations), since the agent appeared not to reliably account for the snake’s own body positions.

Overall, the work underscored a core lesson for custom reinforcement learning environments: reward shaping can make learning succeed or fail, and even “reasonable” signals like distance can backfire if they create incentives the agent can exploit. Further gains likely require improving observations in addition to reward tuning.

Cornell Notes

The agent’s learning success hinged on how the environment computed rewards. A sparse, short-lived reward scheme produced almost no improvement because the agent rarely received informative feedback. Switching to Euclidean-distance punishment made training collapse: the snake learned to end episodes immediately (often by running into itself) to avoid accumulating negative reward. Adding a positive baseline, subtracting distance, and giving a very large apple-eating bonus (then scaling down the totals) restored learning and produced a snake that could reliably chase apples, though it still sometimes collided with itself. The takeaway is that reward shaping must be designed to prevent unintended incentives and to align with the behavior you actually want.

Why did the initial reward design fail to produce learning?

Rewards were effectively sparse and short-lived. The agent got meaningful positive feedback only when it collected an apple; otherwise it received a fixed negative punishment (including -10). Because step-by-step feedback didn’t reflect progress toward the goal, TensorBoard showed reward staying essentially flat over time.

How did Euclidean-distance punishment create an unintended strategy?

Using Euclidean distance as a penalty meant the agent accumulated negative reward the longer it stayed alive while far from the apple. In this snake setup, the agent could terminate quickly by moving into itself. That made “end immediately” a rational way to minimize ongoing punishment, so episode lengths dropped sharply and rewards remained very low.

What changes prevented the agent from exploiting early termination?

The reward was rebalanced with a positive baseline (250) and then reduced by Euclidean distance, so moving closer increased reward without making survival inherently bad. A large apple bonus (10,000) ensured that eating apples dominated the reward tradeoff, making apple-chasing better than circling or ending the episode.

Why was scaling applied after introducing large reward constants?

The apple bonus and baseline made reward magnitudes very large, which can destabilize training. Dividing the total reward by 100 kept the reward scale more manageable while preserving the relative incentive structure (apple collection still worth far more than proximity alone).

What remaining behaviors suggested observation issues rather than reward issues?

Even after reward tuning, the snake sometimes ran into itself and made big arcing moves. Those patterns suggest the agent wasn’t accurately representing the positions of its own body in the observation space, so it couldn’t reliably avoid collisions or plan tighter paths.
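
One hedged way to address that is to include the body segment coordinates directly in the observation; everything below (the fixed segment cap, the -1 padding value, and the layout) is an assumption for illustration, not the video’s actual observation space:

    import numpy as np

    MAX_SEGMENTS = 20  # illustrative cap so the observation keeps a fixed shape

    def build_observation(head, apple, body_segments):
        # Pad unused slots with -1 so "no segment here" is distinguishable
        # from real coordinates, then flatten everything into one vector.
        body = np.full((MAX_SEGMENTS, 2), -1.0, dtype=np.float32)
        segs = np.asarray(body_segments, dtype=np.float32)[:MAX_SEGMENTS]
        if len(segs):
            body[: len(segs)] = segs
        return np.concatenate([
            np.asarray(head, dtype=np.float32),
            np.asarray(apple, dtype=np.float32),
            body.ravel(),
        ])

Fixed-size padding keeps the observation shape constant, which the usual fixed-shape observation spaces require, at the cost of capping how much of its own body the agent can "see."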

Review Questions

  1. What specific reward property made the first approach “informationally sparse,” and how did that show up in training metrics?
  2. Describe the incentive loop created by Euclidean-distance punishment and explain why ending the episode became optimal.
  3. Which reward components (baseline, distance penalty, apple bonus, scaling) most directly changed the agent’s behavior, and what behavior did each component encourage?

Key Points

  1. Sparse or short-lived reward signals can prevent learning even when training runs without errors.

  2. Distance-based penalties can backfire by creating incentives to terminate early to avoid accumulating negative reward.

  3. Reward functions must be checked for unintended exploit strategies, especially in environments where the agent can end episodes quickly.

  4. A positive baseline plus a distance term can encourage approach behavior without making survival inherently costly.

  5. A large terminal/goal bonus (e.g., for apple collection) can ensure the agent prioritizes the intended objective over local motion patterns.

  6. Reward scaling (e.g., dividing totals) helps keep training stable when constants become large.

  7. Even with good reward shaping, poor observations can still lead to collisions and inefficient movement patterns.

Highlights

  • When rewards were only granted on apple collection and otherwise punished, training produced essentially no improvement: feedback was too sparse to guide behavior.
  • Euclidean-distance punishment caused episode lengths to collapse because the agent learned that ending the game quickly avoided ongoing penalties.
  • Adding a 250 baseline, subtracting Euclidean distance, and granting a 10,000 apple bonus (then scaling down) restored learning and produced a snake that could chase apples.
  • After the reward fixes, remaining failures (running into itself, big arcing moves) pointed toward observation limitations rather than algorithmic issues.

Topics

Mentioned

  • PPO