Q Learning Algorithm and Agent - Reinforcement Learning p.2

sentdex · 5 min read

Based on sentdex's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Convert continuous MountainCar observations into discrete (position, velocity) tuples so a tabular Q-table can be indexed reliably.

Briefing

The core breakthrough is turning a Q-learning agent whose table starts as random Q-values into one with a working policy by discretizing MountainCar’s continuous state, then updating a Q-table with the standard temporal-difference rule that propagates future value backward through the state-action chain. With a learning rate of 0.1 and a discount factor of 0.95, the agent repeatedly steps through the environment, converts each new continuous observation into a discrete (position, velocity) tuple, and updates exactly one Q-table entry per step: the value for the previous discrete state and the action taken, using the reward plus the discounted best future Q from the new discrete state.

That discretization step is the hinge. MountainCar provides continuous observations, but the Q-table needs discrete indices. A helper function computes a discrete state by scaling each component of the observation relative to the environment’s observation-space bounds and the chosen window size, then returns the result as a NumPy integer tuple. The agent can then index the Q-table directly with that tuple, retrieve the best action via argmax over Q-values, and compute the “current Q” and “max future Q” needed for the update.
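
A minimal sketch of that helper, assuming Gym's MountainCar-v0 and a 20-bin grid per observation component (the bin count is an assumption, not something fixed by the video):

```python
import gym
import numpy as np

env = gym.make("MountainCar-v0")

# Assumed resolution: 20 bins per observation component (position, velocity).
DISCRETE_OS_SIZE = [20] * len(env.observation_space.high)
discrete_os_win_size = (env.observation_space.high - env.observation_space.low) / DISCRETE_OS_SIZE

def get_discrete_state(state):
    # Shift by the lower bound, divide by the window size, and truncate to
    # integer bin indices so the result can index the Q-table directly.
    discrete_state = (state - env.observation_space.low) / discrete_os_win_size
    return tuple(discrete_state.astype(int))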

Once the loop runs, the update rule is the engine of learning: new_Q = (1 - learning_rate) * current_Q + learning_rate * (reward + discount * max_future_Q). Because the environment’s reward is mostly negative until the goal is reached, meaningful positive outcomes arrive late; the discount factor determines how strongly those late successes influence earlier states. In practice, the agent’s success rate rises in a “snowball” pattern: after it stumbles into a successful trajectory once, the Q-values along that path become more favorable, making future attempts more likely to repeat and improve.
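
The same rule as a self-contained sketch; the table size, state tuples, action, and reward below are illustrative stand-ins rather than values taken from the video:

```python
import numpy as np

LEARNING_RATE = 0.1
DISCOUNT = 0.95

# Toy Q-table: 20x20 discrete states x 3 actions, initialised with small negative
# values to roughly match MountainCar's -1 per-step reward (an assumption).
q_table = np.random.uniform(low=-2, high=0, size=(20, 20, 3))

# Illustrative values for one environment step.
discrete_state = (7, 10)      # discrete state before the step
action = 2                    # action taken
reward = -1.0                 # reward received
new_discrete_state = (8, 11)  # discrete state after the step

# Exactly one table entry is updated: the (previous state, action) pair.
max_future_q = np.max(q_table[new_discrete_state])
current_q = q_table[discrete_state + (action,)]
new_q = (1 - LEARNING_RATE) * current_q + LEARNING_RATE * (reward + DISCOUNT * max_future_q)
q_table[discrete_state + (action,)] = new_q
```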

The training setup then scales from a single run to thousands of episodes (25,000). Rendering is throttled to every 2,000 episodes to keep training fast, and the code prints when the car reaches the goal. A key practical issue emerges: even if the agent can solve the task quickly, it may do so inefficiently or get stuck optimizing one discovered route.
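
A sketch of that outer loop, reusing q_table, get_discrete_state, LEARNING_RATE, and DISCOUNT from the sketches above and assuming the classic Gym API (env.reset() returns the observation and env.step() returns four values; newer Gymnasium releases differ):

```python
EPISODES = 25_000
SHOW_EVERY = 2_000

for episode in range(EPISODES):
    render = episode % SHOW_EVERY == 0   # throttle rendering to keep training fast
    discrete_state = get_discrete_state(env.reset())
    done = False
    while not done:
        action = np.argmax(q_table[discrete_state])   # greedy choice; exploration is added later
        new_state, reward, done, _ = env.step(action)
        new_discrete_state = get_discrete_state(new_state)
        if render:
            env.render()
        if not done:
            # Q-learning update for the state we were in before the step.
            max_future_q = np.max(q_table[new_discrete_state])
            current_q = q_table[discrete_state + (action,)]
            q_table[discrete_state + (action,)] = (1 - LEARNING_RATE) * current_q \
                + LEARNING_RATE * (reward + DISCOUNT * max_future_q)
        elif new_state[0] >= env.goal_position:
            # goal_position is the flag's x-coordinate (may need env.unwrapped in newer Gym).
            print(f"Reached the goal on episode {episode}")
            q_table[discrete_state + (action,)] = 0
        discrete_state = new_discrete_state
```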

That’s where exploration enters. The transcript introduces epsilon-greedy exploration: start with epsilon = 0.5 so random actions sometimes override the greedy argmax choice, then decay epsilon over time so the agent gradually shifts from exploration to exploitation. A bug/oversight is corrected: epsilon must actually be used when selecting actions, not just computed. After fixing that, learning becomes slower at first (e.g., success around episode ~1200 instead of ~200), but the agent continues exploring long enough to find better strategies rather than locking onto the first workable path.
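
A sketch of the epsilon-greedy selection and decay, reusing names from the training-loop sketch above; the decay window (linear decay until the halfway episode) is an assumed schedule:

```python
# Exploration settings (the decay window below is an assumed schedule).
epsilon = 0.5
START_EPSILON_DECAYING = 1
END_EPSILON_DECAYING = EPISODES // 2
epsilon_decay_value = epsilon / (END_EPSILON_DECAYING - START_EPSILON_DECAYING)

# Inside the step loop, this replaces the plain argmax: epsilon must actually
# gate the choice, otherwise behavior stays purely greedy (the bug the video fixes).
if np.random.random() > epsilon:
    action = np.argmax(q_table[discrete_state])           # exploit the best-known action
else:
    action = np.random.randint(0, env.action_space.n)     # explore with a random action

# At the end of each episode, shift gradually from exploration toward exploitation.
if END_EPSILON_DECAYING >= episode >= START_EPSILON_DECAYING:
    epsilon -= epsilon_decay_value
```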

By the end, the agent is reliably learning to climb the mountain, and the remaining work is tuning epsilon decay and other hyperparameters—plus adding better metrics—to measure time-to-completion and efficiency more systematically. The central lesson is that Q-learning only becomes effective once continuous states are discretized and the Q-table update is driven by both reward and discounted future value, with exploration managed via epsilon-greedy action selection.

Cornell Notes

Q-learning becomes effective once continuous MountainCar observations are converted into discrete state tuples so a Q-table can be indexed. Each step computes the “current Q” for the previous discrete state and action, the “max future Q” from the new discrete state, then updates the table using: new_Q = (1−α)·current_Q + α·(reward + γ·max_future_Q), with learning rate α = 0.1 and discount γ = 0.95. Because rewards are mostly negative until the goal is reached, success propagates backward slowly through the chain, producing a snowball effect after the first successful episode. Training runs across many episodes (25,000) with rendering throttled for speed. Finally, epsilon-greedy exploration (epsilon starting at 0.5 and decaying) is required to avoid locking onto inefficient solutions and to keep searching for better routes.

Why is discretizing the continuous MountainCar state necessary for a Q-table?

Q-learning with a tabular Q-table requires discrete indices. MountainCar’s observation includes continuous position and velocity, so the transcript adds a helper like get_discrete_state(state) that scales each component using the environment’s observation-space bounds (observation_space.low) and a chosen window size (discrete observation space window size). The result is returned as a NumPy integer tuple, such as (7, 10), which can directly index Q_table[(pos_bin, vel_bin)].
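
For reference, the table that tuple indexes can be built from the bin counts plus the action count, reusing DISCRETE_OS_SIZE from the discretization sketch above; the uniform negative initialisation is a common choice for MountainCar's -1 step reward, not the only option:

```python
# Q-table shaped (20, 20, 3): one row of three action-values per (position, velocity) bin.
q_table = np.random.uniform(low=-2, high=0,
                            size=(DISCRETE_OS_SIZE + [env.action_space.n]))

# The discrete tuple indexes the table directly, e.g. the (7, 10) example above.
best_action = np.argmax(q_table[(7, 10)])
```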

How does the Q update rule propagate future success backward?

Each update uses the temporal-difference form: new_Q = (1−learning_rate)*current_Q + learning_rate*(reward + discount*max_future_Q). The agent computes max_future_Q = max_a Q_table[new_discrete_state, a] and multiplies it by discount (0.95). Since the agent repeatedly applies this rule across steps, the value of reaching the goal influences earlier states through repeated discounting, effectively backing up value along the trajectory chain.

What do learning rate (α) and discount (γ) do in practice?

Learning rate α = 0.1 controls how strongly new experience overwrites old Q estimates. Discount γ = 0.95 controls how much future rewards matter compared with immediate reward. With γ close to 1, value from later success can influence many earlier steps (e.g., repeated multiplication like 0.95^k), which is crucial because MountainCar’s reward is typically negative until the goal is reached.
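
A toy illustration of that compounding: a unit of value at the goal is worth roughly γ^k once it has been backed up k steps earlier (the step counts below are purely illustrative):

```python
DISCOUNT = 0.95

# How much a unit of "goal value" is worth k steps before the goal,
# once repeated updates have propagated it back along the chain.
for k in (1, 10, 50, 100, 200):
    print(f"{k:3d} steps before the goal: {DISCOUNT ** k:.3f}")
# 1 -> 0.950, 10 -> 0.599, 50 -> 0.077, 100 -> 0.006, 200 -> ~0.000
```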

Why does the agent’s success often appear suddenly after many failures?

Most steps yield negative reward, so Q-values don’t improve much until the agent experiences a full successful run. Once a successful trajectory occurs, the Q-values along that path get updated toward higher values via reward + γ·max_future_Q. Those improved values make the next episodes more likely to reproduce the successful path, creating a snowball effect where success becomes more frequent.

What problem does epsilon-greedy exploration solve, and what happens if epsilon isn’t used correctly?

Epsilon-greedy exploration prevents the agent from committing too early to the first route it finds. With epsilon starting at 0.5, the agent sometimes chooses random actions instead of argmax, allowing it to discover alternative strategies that may be more efficient. The transcript notes a critical fix: epsilon must actually be used during action selection; computing/decaying epsilon without applying it leaves behavior effectively greedy, which can lead to slower learning or suboptimal, inefficient solutions.

How does epsilon decay change learning behavior over time?

Epsilon decay gradually reduces randomness. Early on, higher epsilon encourages exploration, which can delay initial success (e.g., learning around episode ~1200 after the fix). As epsilon decreases, the agent increasingly exploits the best-known actions, rapidly optimizing the strategy it has found—often improving efficiency once a good path is discovered.

Review Questions

  1. How would changing the discount factor from 0.95 to a smaller value likely affect how quickly value from reaching the goal propagates backward through earlier states?
  2. What exact Q-table entry gets updated each step, and why does the update use the discrete state from before the environment step rather than the new discrete state?
  3. In epsilon-greedy action selection, what is the decision rule for choosing between argmax and a random action, and how does epsilon decay alter that rule over episodes?

Key Points

  1. Convert continuous MountainCar observations into discrete (position, velocity) tuples so a tabular Q-table can be indexed reliably.
  2. Use the standard temporal-difference Q-learning update: new_Q = (1−α)·current_Q + α·(reward + γ·max_future_Q).
  3. Set learning rate α (0.1) to control how quickly Q-values change, and set discount γ (0.95) to determine how strongly future value influences earlier decisions.
  4. Train across many episodes (25,000) and throttle rendering to keep learning fast while still monitoring progress.
  5. Recognize that mostly negative rewards make success propagate slowly; the first successful trajectory often triggers a snowball improvement pattern.
  6. Apply epsilon-greedy exploration by actually using epsilon in action selection; otherwise exploration never happens and learning can stall or converge to inefficient routes.
  7. Decay epsilon over episodes to shift from exploration to exploitation once the agent has found at least one workable strategy.

Highlights

  • Discretizing continuous state into an integer tuple is the practical prerequisite for tabular Q-learning on MountainCar.
  • The update uses max_future_Q from the new discrete state, so discounted future value backs up through the chain of earlier states.
  • Because rewards are mostly negative until the goal, Q-values improve dramatically only after the agent experiences a full successful run.
  • Epsilon-greedy exploration must be wired into action selection; computing epsilon without using it leaves behavior effectively greedy.
  • With epsilon decay, initial learning can slow down, but the agent can later optimize a better (more efficient) route instead of the first one it finds.

Topics

  • Q-Learning
  • Epsilon-Greedy
  • State Discretization
  • Temporal-Difference Update
  • MountainCar
