Deep Q Learning w/ DQN - Reinforcement Learning p.5

sentdex · 5 min read

Based on sentdex's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Deep Q-learning predicts Q-values for all actions using a neural network, replacing the memory-heavy Q-table approach.

Briefing

Deep Q-learning replaces the classic Q-table with a deep neural network that outputs Q-values for every possible action, letting reinforcement learning handle far more complex environments than discrete tables can manage. Instead of storing and updating a value for each state-action pair, the network takes an observation (often an image, but it can also be feature values like Δx/Δy) and produces a vector of Q-values—one per action. The agent then chooses the action with the highest predicted Q-value (via argmax), while exploration is handled with an epsilon-greedy strategy that sometimes picks a random action.
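A minimal sketch of that selection rule, assuming a Keras-style model; the names choose_action, action_space_size, and epsilon are illustrative rather than taken verbatim from the transcript:

```python
import numpy as np

def choose_action(model, state, epsilon, action_space_size):
    # Explore: with probability epsilon, take a random action.
    if np.random.random() < epsilon:
        return np.random.randint(0, action_space_size)
    # Exploit: the network outputs one Q-value per action; pick the argmax.
    q_values = model.predict(np.array([state]))[0]
    return int(np.argmax(q_values))
```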

The core motivation is scalability. Q-learning relies on a discrete “observation space” and a discrete “action space,” and the memory required for a Q-table grows explosively as either space expands. Even small increases in the number of possible states or actions can make the table impractically large. Deep Q-networks avoid that by learning a function approximation: they can generalize across similar-but-not-identical situations. Where a Q-table might fail on a scenario outside its previously seen discrete combinations (forcing random behavior), a neural network can still produce meaningful outputs by recognizing patterns that resemble earlier experiences.

That generalization comes with tradeoffs. Training deep Q-learning is slower and more finicky than filling a Q-table: the brute-force "populate the table" approach can take minutes for small problems, while deep Q-learning can take hours for comparable setups. In exchange, deep Q-learning can tackle environments where a Q-table would demand absurd amounts of memory, in some cases on the order of petabytes.

To stabilize learning, the method introduces two key engineering ideas. First, it uses a separate target network. One model is trained continuously, while a second “target” model is used to generate more consistent predictions; periodically, the target network’s weights are updated to match the main model. This reduces the feedback loop instability that can happen when the same network both predicts and learns from rapidly changing targets.
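A minimal sketch of the two-model pattern. Here create_model() is a stand-in factory (the real convolutional architecture is described below), and the sync comments reflect standard DQN practice rather than the transcript verbatim:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

def create_model():
    # Stand-in for the convolutional network built later in the section.
    m = Sequential([Dense(4, input_shape=(8,), activation='linear')])
    m.compile(loss='mse', optimizer='adam')
    return m

model = create_model()          # trained frequently
target_model = create_model()   # used only to generate predictions
target_model.set_weights(model.get_weights())  # start from identical weights

# Periodically (e.g., every few episodes), re-sync so the targets the main
# model learns from stay consistent between updates:
target_model.set_weights(model.get_weights())
```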

Second, it uses experience replay via a replay memory buffer (implemented as a fixed-length deque). Each agent step stores a transition containing the current observation, chosen action, received reward, next observation, and a terminal flag. Rather than training on the most recent single transition, the algorithm later samples random minibatches from this buffer. That breaks correlations between consecutive experiences and improves training stability, even though the model still gets updated frequently.
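A sketch of that buffer, assuming the 50,000-transition capacity from the transcript; MINIBATCH_SIZE and the helper names are illustrative:

```python
import random
from collections import deque

REPLAY_MEMORY_SIZE = 50_000  # capacity mentioned in the transcript
MINIBATCH_SIZE = 64          # illustrative; the training loop comes later

replay_memory = deque(maxlen=REPLAY_MEMORY_SIZE)

def store_transition(current_state, action, reward, new_state, done):
    # Oldest transitions fall off automatically once maxlen is reached.
    replay_memory.append((current_state, action, reward, new_state, done))

def sample_minibatch():
    # Random sampling breaks the correlation between consecutive steps.
    if len(replay_memory) < MINIBATCH_SIZE:
        return None
    return random.sample(replay_memory, MINIBATCH_SIZE)
```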

The transcript then moves into code structure for a DQN agent. It defines a convolutional neural network using Keras/TensorFlow components: Conv2D layers, ReLU activations, MaxPooling2D, Dropout, a Flatten layer, and Dense layers that end with a linear output sized to the action space. The network is compiled with mean squared error loss and an Adam optimizer with a learning rate of 0.001. The agent class creates both the main model and the target model (initially copying weights), sets up replay memory with a capacity of 50,000 transitions, and configures TensorBoard logging using a modified TensorBoard callback to avoid creating a new log file on every fit call.
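Piecing together the layers named above, a sketch of that model might look like the following; OBSERVATION_SHAPE, ACTION_SPACE_SIZE, and the exact number of convolutional blocks are assumptions, not details specified here:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Conv2D, Activation, MaxPooling2D,
                                     Dropout, Flatten, Dense)
from tensorflow.keras.optimizers import Adam

OBSERVATION_SHAPE = (10, 10, 3)  # assumed image-like input
ACTION_SPACE_SIZE = 9            # assumed number of discrete actions

model = Sequential([
    Conv2D(256, (3, 3), input_shape=OBSERVATION_SHAPE),
    Activation('relu'),
    MaxPooling2D(pool_size=(2, 2)),
    Dropout(0.2),

    Conv2D(256, (3, 3)),
    Activation('relu'),
    MaxPooling2D(pool_size=(2, 2)),
    Dropout(0.2),

    Flatten(),
    Dense(64),

    # Linear output: one scalar Q-value per action.
    Dense(ACTION_SPACE_SIZE, activation='linear'),
])
model.compile(loss='mse', optimizer=Adam(learning_rate=0.001))
```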

Finally, it sketches helper methods: one to append transitions into replay memory and another to compute predicted Q-values for a given state, including input normalization by dividing by 255. Training itself is deferred to the next installment, but the groundwork—model, target network, replay buffer, and Q-value inference—is laid for implementing the DQN update rule.
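A sketch of those two helpers as agent methods, consistent with the description (the reshape adds a batch dimension of one, and [0] unwraps the single-row prediction):

```python
import numpy as np

class DQNAgent:
    # __init__ would build self.model, self.target_model, and
    # self.replay_memory as outlined above.

    def update_replay_memory(self, transition):
        # transition = (current_state, action, reward, new_state, done)
        self.replay_memory.append(transition)

    def get_qs(self, state):
        # Normalize pixel values to [0, 1] and add a batch dimension,
        # then unwrap the single prediction with [0].
        state = np.array(state)
        return self.model.predict(state.reshape(-1, *state.shape) / 255)[0]
```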

Cornell Notes

Deep Q-learning swaps a discrete Q-table for a neural network that outputs Q-values for all actions at once. Given an observation, the network produces a vector of predicted action values; the agent typically selects the action with the highest Q-value (argmax) while using epsilon-greedy exploration. This approach avoids the memory blow-up of Q-tables and can generalize to similar states it has never seen exactly. Stability is improved with two mechanisms: a target network (predictions come from a periodically updated copy) and experience replay (transitions are stored and later sampled randomly for training). The transcript then builds the DQN model in Keras with Conv2D/MaxPooling/Dropout/Flatten/Dense layers and sets up replay memory and TensorBoard logging.

Why does replacing a Q-table with a deep neural network help in reinforcement learning?

A Q-table requires explicit storage and updates for every discrete state-action pair, so memory grows rapidly as the observation space and action space expand. A deep Q-network instead learns a function that maps observations to a vector of Q-values (one per action). Because the network generalizes across similar inputs, it can handle situations outside the exact discrete combinations the table would have covered, rather than falling back to random actions.

How does the agent choose an action from the network’s outputs?

The network’s final layer uses a linear activation to output one scalar Q-value per possible action. The agent then selects the action whose index has the maximum predicted value using argmax. Exploration is introduced via epsilon-greedy: with probability epsilon, the agent takes a random action instead of the argmax choice.

What problem does the target network solve, and how is it implemented?

Training instability can occur when the same network both predicts Q-values and is updated every step, causing targets to shift constantly. The transcript uses two models: a main model that is trained frequently and a target model used for predictions. The target model’s weights are periodically synchronized to match the main model (via set_weights/get_weights), making predictions more consistent.

What is experience replay, and why sample randomly from it?

Experience replay stores transitions in a fixed-size buffer (a deque with max length, e.g., 50,000). Instead of training on only the most recent transition (which is highly correlated with the next one), the algorithm later samples random minibatches from the buffer. Random sampling reduces correlation between training samples and improves learning stability, especially when updates happen every step.

What does the DQN model architecture look like in the code sketch?

The model is a Keras Sequential network with Conv2D layers (e.g., 256 filters with 3x3 kernels), ReLU activations, MaxPooling2D (2x2), Dropout (20%), a Flatten layer, and Dense layers (e.g., a 64-unit dense layer). The output layer is a Dense layer with units equal to the action space size and a linear activation, producing Q-values for each action.

How are inputs normalized before predicting Q-values?

Before calling model.predict, the state is reshaped and then divided by 255 to normalize image pixel values. The transcript also notes that model.predict returns a batch of results even for a single input, so indexing with [0] extracts the predicted Q-values for that one state.

Review Questions

  1. How do target networks and experience replay each reduce instability in deep Q-learning, and what would likely go wrong if either mechanism were removed?
  2. Why does the DQN output a vector of Q-values (one per action) instead of training separate models per action?
  3. What role does epsilon-greedy exploration play relative to argmax action selection, and how does it affect early learning?

Key Points

  1. Deep Q-learning predicts Q-values for all actions using a neural network, replacing the memory-heavy Q-table approach.

  2. Generalization is a major advantage: the network can act sensibly in states not seen in the original discrete table.

  3. Q-tables scale poorly because memory requirements explode as observation and action spaces grow.

  4. Training deep Q-learning is slower and more finicky than table-based Q-learning, but it enables learning in environments where Q-tables would be impractical.

  5. A target network stabilizes learning by providing less volatile prediction targets, updated periodically from the main model.

  6. Experience replay stores transitions in a fixed buffer and trains on random minibatches to reduce correlation between consecutive experiences.

  7. The provided code sketch builds a convolutional DQN model (Conv2D/MaxPooling/Dropout/Flatten/Dense) and sets up replay memory, target model syncing, and TensorBoard logging.

Highlights

Deep Q-networks output a full vector of Q-values—one per action—so action choice becomes argmax over predicted values.
The memory explosion of Q-tables motivates function approximation: discrete state-action coverage becomes infeasible as spaces grow.
Stability comes from two engineering choices: a periodically updated target network and random minibatch training from a replay buffer.
The DQN model ends with a linear output layer sized to the action space, producing scalar Q-values directly.
Replay memory is implemented as a fixed-length deque (e.g., 50,000 transitions), enabling training on random samples instead of correlated consecutive steps.

Topics

Mentioned

  • DQN
  • Δx
  • Δy
  • MSE