Deep Reinforcement Learning - Markov Decision Process (MDP) - Explained (5)
Based on Alex, PhD AI's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
Briefing
Deep reinforcement learning is positioned as the fix for a core mismatch in finance: supervised learning struggles in ultra-high-frequency trading because the market environment is highly dynamic, so feature importance and model behavior shift from day to day. Reinforcement learning instead learns through active interaction, updating decisions as the environment changes, rather than freezing a model after training. That adaptability is framed as the reason reinforcement learning can outperform “retrain-and-replace” approaches when the same feature set can’t reliably predict future conditions.
The foundation for that adaptive decision-making is the Markov Decision Process (MDP), a mathematical framework for sequential choices under uncertainty. In an MDP, an agent repeatedly observes a state, selects an action from a fixed set, and receives a reward that evaluates the action’s outcome. The environment then transitions to a new state, and the agent continues, aiming to maximize the total (summed) reward over time. Uncertainty is captured by a transition function: taking action a in state s usually leads to the intended next state, but with some smaller probability the environment lands in an unexpected state due to external randomness.
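To make this loop concrete, below is a minimal Python sketch of a single step, assuming a toy transition table and reward table; the state names, probabilities, and function names (`transition`, `reward`, `step`) are illustrative and not taken from the transcript.

```python
import random

# Hypothetical transition model: (state, action) -> list of (next_state, probability).
# Numbers are illustrative only.
transition = {
    ("s0", "go"): [("s1", 0.9), ("s2", 0.1)],  # usually the intended state, sometimes not
}

# Hypothetical reward model: (state, action, next_state) -> reward.
reward = {
    ("s0", "go", "s1"): 1.0,
    ("s0", "go", "s2"): -1.0,
}

def step(state, action):
    """Sample the next state from the transition probabilities and return it with the reward."""
    next_states, probs = zip(*transition[(state, action)])
    next_state = random.choices(next_states, weights=probs, k=1)[0]
    return next_state, reward[(state, action, next_state)]

state = "s0"                       # the agent observes the current state
next_state, r = step(state, "go")  # acts, receives a reward, and sees the new state
```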
An MDP is built from four components: a set of states, a set of actions, a transition function (often described as probabilities over next states), and a reward function (how much feedback the agent gets for moving between states). For episodic tasks, there is also a start state where interaction begins and a terminal state where the episode ends. Chess is offered as a clean example of an MDP without randomness: the board configuration is the state, legal moves are actions, and the game ends when a terminal condition is reached.
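One way to make the four components (plus the episodic start and terminal states) explicit is a small container like the sketch below; the field names are one possible layout, not anything the transcript prescribes.

```python
from dataclasses import dataclass, field

@dataclass
class MDP:
    """Container for the four MDP components plus episodic start/terminal states."""
    states: set          # S: the set of states the environment can be in
    actions: set         # A: the set of actions available to the agent
    transitions: dict    # P: (state, action) -> {next_state: probability}
    rewards: dict        # R: (state, action, next_state) -> reward
    start_state: object = None                          # where an episode begins
    terminal_states: set = field(default_factory=set)   # where an episode ends
```

In the chess analogy, `states` would be board configurations, `actions` the legal moves from each configuration, and `terminal_states` the checkmate and draw positions; since the transcript treats chess as an MDP without randomness, every transition probability would be 1.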
A key simplifying assumption is the “Markov” property: the future depends only on the current state, not on the full history. In other words, once the agent knows the present state, earlier states don’t add extra predictive power for rewards or transitions. This makes the problem tractable and turns decision-making into a state-based optimization.
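In standard notation (the transcript stays informal), the Markov property can be written as a conditional-independence statement: the distribution over the next state given the whole history equals the distribution given only the current state and action.

```latex
P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \dots, s_0, a_0) = P(s_{t+1} \mid s_t, a_t)
```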
The transcript grounds these ideas with a car overheating example. The environment has three states—cool, warm, and overheated—with actions “move slow” and “move fast.” Moving slow from cool yields +1 reward and keeps the car cool with 100% probability, while moving fast from cool gives +2 reward but carries a 50% chance of transitioning to warm. In warm, moving fast guarantees overheating and a -10 penalty, while moving slow has a 50% chance to cool back down. The optimal policy follows a simple rule: in cool, move fast; in warm, move slow; once back in cool, move fast again. When the car reaches the overheated terminal state, the episode ends, and a new episode begins—allowing reinforcement learning to improve behavior over repeated trials.
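Here is a sketch of the car example as Python data, with a rollout under the stated policy. The +1/+2/-10 rewards and the 50% transitions come from the transcript; the reward for moving slow in the warm state and the chance of staying warm are assumptions, since the transcript does not state them.

```python
import random

# Transcript's car example: (state, action) -> (reward, {next_state: probability}).
# The warm/slow reward of +1 and the 0.5 chance of staying warm are assumptions.
car_mdp = {
    ("cool", "slow"): (+1, {"cool": 1.0}),
    ("cool", "fast"): (+2, {"cool": 0.5, "warm": 0.5}),
    ("warm", "slow"): (+1, {"cool": 0.5, "warm": 0.5}),
    ("warm", "fast"): (-10, {"overheated": 1.0}),
}

policy = {"cool": "fast", "warm": "slow"}  # the optimal policy described above

def run_episode(start="cool", max_steps=20):
    """Roll out one episode under the policy until the terminal state or a step limit."""
    state, total = start, 0
    for _ in range(max_steps):
        if state == "overheated":   # terminal state: the episode ends
            break
        action = policy[state]
        reward, dist = car_mdp[(state, action)]
        total += reward
        state = random.choices(list(dist), weights=list(dist.values()), k=1)[0]
    return total

# Under this policy the car never picks "fast" in warm, so it never overheats;
# the step limit stands in for the end of an episode.
print(run_episode())
```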
Overall, the MDP formalism provides the structure that lets reinforcement learning learn adaptive policies through trial-and-error interaction—an approach presented as especially relevant when static supervised models can’t keep up with shifting environments like financial markets or other real-world systems.
Cornell Notes
Reinforcement learning is framed as a way to handle environments that change over time—especially in finance, where supervised models can’t reliably maintain feature importance or performance. The core formal tool is the Markov Decision Process (MDP), which models an agent interacting with an environment through states, actions, transition probabilities, and rewards. Outcomes are partly random and partly controlled by the agent, captured by a transition function that assigns probabilities to next states. The Markov property assumes future rewards and transitions depend only on the current state, not on past history. In episodic tasks, start and terminal states define when an interaction ends, allowing repeated episodes for learning an optimal policy.
Why does supervised learning often underperform in ultra-high-frequency trading, according to the transcript?
What are the four core components of an MDP, and what does each represent?
How does the MDP transition function represent uncertainty?
What does the Markov property mean in practical terms?
How does the car overheating example illustrate an optimal policy?
What role do terminal states play in episodic reinforcement learning?
Review Questions
- In an MDP, how do the transition function and reward function work together to determine what an agent should do next?
- Explain the Markov property and give an example of what would break it in a decision process.
- Using the car example, why is moving fast in the warm state suboptimal even though it might seem to offer higher immediate reward?
Key Points
1. Reinforcement learning is presented as a response to environments that shift over time, where supervised models can’t reliably maintain feature importance or performance.
2. An MDP models sequential decision-making by combining states, actions, transition probabilities, and rewards.
3. Uncertainty in outcomes is encoded in the transition function, which assigns probabilities to next states after an action.
4. The Markov property assumes future outcomes depend only on the current state, not on past history.
5. Episodic MDPs use start and terminal states to define when interactions end, enabling learning across repeated episodes.
6. The car overheating example demonstrates how an optimal policy can be state-dependent: move fast in cool, move slow in warm, and avoid the terminal overheating state.