Deep Q Learning w/ DQN - Reinforcement Learning p.5
Based on sentdex's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Deep Q-learning predicts Q-values for all actions using a neural network, replacing the memory-heavy Q-table approach.
Briefing
Deep Q-learning replaces the classic Q-table with a deep neural network that outputs Q-values for every possible action, letting reinforcement learning handle far more complex environments than discrete tables can manage. Instead of storing and updating a value for each state-action pair, the network takes an observation (often an image, but it can also be feature values like Δx/Δy) and produces a vector of Q-values—one per action. The agent then chooses the action with the highest predicted Q-value (via argmax), while exploration is handled with an epsilon-greedy strategy that sometimes picks a random action.
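The argmax-plus-epsilon-greedy selection described above can be sketched in a few lines. This is a minimal illustration, not the video's exact code: `choose_action` and the simulated `q_values` vector are hypothetical names standing in for the network's per-action output.

```python
import numpy as np

rng = np.random.default_rng(0)

def choose_action(q_values, epsilon, n_actions):
    """Epsilon-greedy: explore with probability epsilon, else exploit."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))  # random exploratory action
    return int(np.argmax(q_values))          # greedy: highest predicted Q-value

q_values = np.array([0.1, 0.7, 0.2])  # simulated network output, one Q per action
action = choose_action(q_values, epsilon=0.0, n_actions=3)
print(action)  # with epsilon=0, always the argmax: 1
```

With epsilon near 1 the agent mostly explores; decaying epsilon over episodes shifts it toward exploiting the learned Q-values.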
The core motivation is scalability. Q-learning relies on a discrete “observation space” and a discrete “action space,” and the memory required for a Q-table grows explosively as either space expands. Even small increases in the number of possible states or actions can make the table impractically large. Deep Q-networks avoid that by learning a function approximation: they can generalize across similar-but-not-identical situations. Where a Q-table might fail on a scenario outside its previously seen discrete combinations (forcing random behavior), a neural network can still produce meaningful outputs by recognizing patterns that resemble earlier experiences.
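The memory blow-up is easy to verify with back-of-envelope arithmetic. The numbers below are hypothetical (20 bins per feature, 3 actions, 8-byte floats), chosen only to show how `bins ** features` dominates.

```python
BYTES_PER_FLOAT = 8

def q_table_bytes(bins, features, actions):
    # One stored float per (state, action) pair; states = bins ** features.
    return (bins ** features) * actions * BYTES_PER_FLOAT

# A 2-feature observation at 20 bins per feature stays tiny...
small = q_table_bytes(bins=20, features=2, actions=3)   # 9,600 bytes
# ...but 10 features at the same resolution is already petabyte-scale.
large = q_table_bytes(bins=20, features=10, actions=3)
print(small, large / 1e15)  # ~0.25 PB
```

A neural network sidesteps this entirely: its parameter count is fixed by the architecture, not by the number of distinct states.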
That generalization comes with tradeoffs. Training a deep Q-network is slower and more finicky than filling a Q-table: the brute-force "populate the table" approach can finish in minutes for small problems, while deep Q-learning can take hours on comparable setups. In exchange, deep Q-learning can tackle environments where a full Q-table would demand absurd amounts of memory, potentially petabytes.

To stabilize learning, the method introduces two key engineering ideas. First, it uses a separate target network. One model is trained continuously, while a second “target” model is used to generate more consistent predictions; periodically, the target network’s weights are updated to match the main model. This reduces the feedback loop instability that can happen when the same network both predicts and learns from rapidly changing targets.
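The main/target split can be sketched without any ML library. Here a single float stands in for the model weights, and `UPDATE_TARGET_EVERY` is a hypothetical hyperparameter; the point is only the periodic sync pattern.

```python
import copy

UPDATE_TARGET_EVERY = 5  # hypothetical: sync interval in episodes

main_weights = [0.0]     # trained continuously (stand-in for real layer weights)
target_weights = copy.deepcopy(main_weights)
target_update_counter = 0

for episode in range(12):
    main_weights[0] += 1.0        # pretend a gradient step changed the main model
    target_update_counter += 1
    if target_update_counter >= UPDATE_TARGET_EVERY:
        target_weights = copy.deepcopy(main_weights)  # sync: target <- main
        target_update_counter = 0

# Between syncs the target lags the main model, giving stabler prediction targets.
print(main_weights, target_weights)  # [12.0] [10.0]
```

Predictions used as training targets come from `target_weights`, so the targets change only every few episodes instead of on every fit.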
Second, it uses experience replay via a replay memory buffer (implemented as a fixed-length deque). Each agent step stores a transition containing the current observation, chosen action, received reward, next observation, and a terminal flag. Rather than training on the single most recent transition, the algorithm later samples random minibatches from this buffer. Sampling at random breaks the strong correlation between consecutive experiences and improves stability, even though the main model is still fit on every step.
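The buffer itself is just a `deque` plus `random.sample`. The sketch below uses placeholder string states and a small `MINIBATCH_SIZE` for illustration; the 50,000 capacity matches the transcript.

```python
import random
from collections import deque

REPLAY_MEMORY_SIZE = 50_000  # capacity from the transcript
MINIBATCH_SIZE = 4           # hypothetical; real batches are larger

replay_memory = deque(maxlen=REPLAY_MEMORY_SIZE)

# Each transition: (current_state, action, reward, new_state, done).
for step in range(100):
    replay_memory.append((f"s{step}", step % 3, -1.0, f"s{step + 1}", False))

# Later: draw a random minibatch, decorrelating consecutive steps.
minibatch = random.sample(replay_memory, MINIBATCH_SIZE)
print(len(minibatch))
```

Because the deque has a fixed `maxlen`, the oldest transitions are silently dropped once the buffer fills, keeping memory bounded.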
The transcript then moves into code structure for a DQN agent. It defines a convolutional neural network using Keras/TensorFlow components: Conv2D layers, ReLU activations, MaxPooling2D, Dropout, a Flatten layer, and Dense layers that end with a linear output sized to the action space. The network is compiled with mean squared error loss and an Adam optimizer with a learning rate of 0.001. The agent class creates both the main model and the target model (initially copying weights), sets up replay memory with a capacity of 50,000 transitions, and configures TensorBoard logging using a modified TensorBoard callback to avoid creating a new log file on every fit call.
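A minimal Keras sketch of that architecture might look as follows. The observation shape and layer sizes here are illustrative assumptions, not the video's exact values; only the layer types, the linear output sized to the action space, and the MSE/Adam(0.001) compile settings come from the description above.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

N_ACTIONS = 3            # illustrative action-space size
OBS_SHAPE = (10, 10, 3)  # hypothetical small image observation

def create_model():
    model = keras.Sequential([
        layers.Input(shape=OBS_SHAPE),
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Dropout(0.2),
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Dropout(0.2),
        layers.Flatten(),
        layers.Dense(64),
        layers.Dense(N_ACTIONS, activation="linear"),  # one Q-value per action
    ])
    model.compile(loss="mse",
                  optimizer=keras.optimizers.Adam(learning_rate=0.001),
                  metrics=["accuracy"])
    return model

model = create_model()
qs = model.predict(np.random.rand(1, *OBS_SHAPE), verbose=0)
print(qs.shape)  # (1, 3): a Q-value vector for one observation
```

The linear output is important: Q-values are unbounded regression targets, so no softmax or sigmoid is applied to the final layer.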
Finally, it sketches helper methods: one to append transitions into replay memory and another to compute predicted Q-values for a given state, including input normalization by dividing by 255. Training itself is deferred to the next installment, but the groundwork—model, target network, replay buffer, and Q-value inference—is laid for implementing the DQN update rule.
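Those two helpers can be sketched as below. `StubModel` is a stand-in so the example runs without TensorFlow; in the real agent, `model` is the Keras network and the 255 divisor rescales pixel values to [0, 1].

```python
import numpy as np

class StubModel:
    """Placeholder for the Keras model; returns fixed fake Q-values."""
    def predict(self, x):
        return np.tile([0.1, 0.7, 0.2], (len(x), 1))

replay_memory = []
model = StubModel()

def update_replay_memory(transition):
    """Append a (state, action, reward, new_state, done) tuple to the buffer."""
    replay_memory.append(transition)

def get_qs(state):
    """Normalize pixels from [0, 255] to [0, 1], then predict Q-values."""
    normalized = np.array(state, dtype=np.float32) / 255
    return model.predict(normalized[np.newaxis, ...])[0]  # batch of 1 -> row 0

state = np.full((4, 4, 3), 255, dtype=np.uint8)  # dummy all-white observation
qs = get_qs(state)
print(qs)  # one Q-value per action
```

Note the `[np.newaxis, ...]`: Keras `predict` expects a batch dimension, so a single observation is wrapped into a batch of one and the first (only) row of the output is returned.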
Cornell Notes
Deep Q-learning swaps a discrete Q-table for a neural network that outputs Q-values for all actions at once. Given an observation, the network produces a vector of predicted action values; the agent typically selects the action with the highest Q-value (argmax) while using epsilon-greedy exploration. This approach avoids the memory blow-up of Q-tables and can generalize to similar states it has never seen exactly. Stability is improved with two mechanisms: a target network (predictions come from a periodically updated copy) and experience replay (transitions are stored and later sampled randomly for training). The transcript then builds the DQN model in Keras with Conv2D/MaxPooling/Dropout/Flatten/Dense layers and sets up replay memory and TensorBoard logging.
Why does replacing a Q-table with a deep neural network help in reinforcement learning?
How does the agent choose an action from the network’s outputs?
What problem does the target network solve, and how is it implemented?
What is experience replay, and why sample randomly from it?
What does the DQN model architecture look like in the code sketch?
How are inputs normalized before predicting Q-values?
Review Questions
- How do target networks and experience replay each reduce instability in deep Q-learning, and what would likely go wrong if either mechanism were removed?
- Why does the DQN output a vector of Q-values (one per action) instead of training separate models per action?
- What role does epsilon-greedy exploration play relative to argmax action selection, and how does it affect early learning?
Key Points
1. Deep Q-learning predicts Q-values for all actions using a neural network, replacing the memory-heavy Q-table approach.
2. Generalization is a major advantage: the network can act sensibly in states not seen in the original discrete table.
3. Q-tables scale poorly because memory requirements explode as observation and action spaces grow.
4. Training deep Q-learning is slower and more finicky than table-based Q-learning, but it enables learning in environments where Q-tables would be impractical.
5. A target network stabilizes learning by providing less volatile prediction targets, updated periodically from the main model.
6. Experience replay stores transitions in a fixed buffer and trains on random minibatches to reduce correlation between consecutive experiences.
7. The provided code sketch builds a convolutional DQN model (Conv2D/MaxPooling/Dropout/Flatten/Dense) and sets up replay memory, target model syncing, and TensorBoard logging.