
Deep Reinforcement Learning - Markov Decision Process (MDP) - Explained (5)

Alex, PhD AI · 5 min read

Based on Alex, PhD AI's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Reinforcement learning is presented as a response to environments that shift over time, where supervised models can’t reliably maintain feature importance or performance.

Briefing

Deep reinforcement learning is positioned as the fix for a core mismatch in finance: supervised learning struggles in ultra-high-frequency trading because the market environment is highly dynamic, so feature importance and model behavior shift from day to day. Reinforcement learning instead learns through active interaction, updating decisions as the environment changes, rather than freezing a model after training. That adaptability is framed as the reason reinforcement learning can outperform "retrain-and-replace" approaches when the same feature set can't reliably predict future conditions.

The foundation for that adaptive decision-making is the Markov Decision Process (MDP), a mathematical framework for sequential choices under uncertainty. In an MDP, an agent repeatedly observes a state, selects an action from a fixed set, and receives a reward that evaluates the action’s outcome. The environment then transitions to a new state, and the agent continues, aiming to maximize the total (summed) reward over time. Uncertainty is captured by a transition function: taking action a in state s leads to the expected next state with high probability, but there is also a smaller chance of ending up in an unexpected state due to external randomness.
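The observe, act, reward loop described above can be sketched in a few lines of Python. This is a minimal illustration: the states `s0`/`s1`, the single action `a`, and the reward values are hypothetical placeholders, not figures from the transcript.

```python
import random

# Hypothetical two-state environment. P(s'|s,a) is stored as
# {(state, action): [(next_state, probability), ...]}.
TRANSITIONS = {
    ("s0", "a"): [("s0", 0.9), ("s1", 0.1)],  # mostly stays put, sometimes drifts
    ("s1", "a"): [("s0", 1.0)],
}
REWARDS = {("s0", "a"): 1.0, ("s1", "a"): -1.0}  # feedback for each (state, action)

def step(state, action, rng):
    """One MDP step: sample s' from P(s'|s,a) and return (s', reward)."""
    next_states, probs = zip(*TRANSITIONS[(state, action)])
    return rng.choices(next_states, weights=probs)[0], REWARDS[(state, action)]

rng = random.Random(0)
state, total_reward = "s0", 0.0
for _ in range(100):               # the agent repeatedly observes, acts, gets reward
    state, reward = step(state, "a", rng)
    total_reward += reward         # objective: maximize the summed reward over time
```

The loop makes the division of labor concrete: the agent only chooses actions, while the environment owns both the transition probabilities and the rewards.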

An MDP is built from four components: a set of states, a set of actions, a transition function (often described as probabilities over next states), and a reward function (how much feedback the agent gets for moving between states). For episodic tasks, there is also a start state where interaction begins and a terminal state where the episode ends. Chess is offered as a clean example of an MDP without randomness: the board configuration is the state, legal moves are actions, and the game ends when a terminal condition is reached.

A key simplifying assumption is the “Markov” property: the future depends only on the current state, not on the full history. In other words, once the agent knows the present state, earlier states don’t add extra predictive power for rewards or transitions. This makes the problem tractable and turns decision-making into a state-based optimization.

The transcript grounds these ideas with a car overheating example. The environment has three states—cool, warm, and overheated—with actions “move slow” and “move fast.” Moving slow from cool yields +1 reward and keeps the car cool with 100% probability, while moving fast from cool gives +2 reward but carries a 50% chance of transitioning to warm. In warm, moving fast guarantees overheating and a -10 penalty, while moving slow has a 50% chance to cool back down. The optimal policy follows a simple rule: in cool, move fast; in warm, move slow; once back in cool, move fast again. When the car reaches the overheated terminal state, the episode ends, and a new episode begins—allowing reinforcement learning to improve behavior over repeated trials.
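Under a couple of extra assumptions, the example can be checked numerically. The sketch below encodes the transitions quoted above and runs value iteration; note that the +1 reward for moving slow in the warm state and the 0.9 discount factor are assumptions of this sketch, not figures from the summary.

```python
# Car MDP from the example, stored as
# {(state, action): [(next_state, probability, reward), ...]}.
# Assumption: "slow" in the warm state also pays +1 (the summary only
# gives its 50% chance of cooling back down).
T = {
    ("cool", "slow"): [("cool", 1.0, 1)],
    ("cool", "fast"): [("cool", 0.5, 2), ("warm", 0.5, 2)],
    ("warm", "slow"): [("cool", 0.5, 1), ("warm", 0.5, 1)],
    ("warm", "fast"): [("overheated", 1.0, -10)],  # guaranteed overheating penalty
}
ACTIONS = ("slow", "fast")

def q_value(s, a, V, gamma):
    """Expected immediate reward plus discounted value of the next state."""
    return sum(p * (r + gamma * V[s2]) for s2, p, r in T[(s, a)])

def value_iteration(gamma=0.9, iters=200):  # gamma is an assumed discount factor
    V = {"cool": 0.0, "warm": 0.0, "overheated": 0.0}  # terminal state stays at 0
    for _ in range(iters):
        V = {s: max(q_value(s, a, V, gamma) for a in ACTIONS)
             for s in ("cool", "warm")} | {"overheated": 0.0}
    policy = {s: max(ACTIONS, key=lambda a: q_value(s, a, V, gamma))
              for s in ("cool", "warm")}
    return V, policy

V, policy = value_iteration()
```

With these assumptions, value iteration recovers exactly the rule stated in the transcript: move fast in cool, move slow in warm.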

Overall, the MDP formalism provides the structure that lets reinforcement learning learn adaptive policies through trial-and-error interaction—an approach presented as especially relevant when static supervised models can’t keep up with shifting environments like financial markets or other real-world systems.

Cornell Notes

Reinforcement learning is framed as a way to handle environments that change over time—especially in finance, where supervised models can’t reliably maintain feature importance or performance. The core formal tool is the Markov Decision Process (MDP), which models an agent interacting with an environment through states, actions, transition probabilities, and rewards. Outcomes are partly random and partly controlled by the agent, captured by a transition function that assigns probabilities to next states. The Markov property assumes future rewards and transitions depend only on the current state, not on past history. In episodic tasks, start and terminal states define when an interaction ends, allowing repeated episodes for learning an optimal policy.

Why does supervised learning often underperform in ultra-high-frequency trading, according to the transcript?

Because the market environment is highly dynamic: feature importance and model results can change day to day. That makes it hard to craft one stable feature set and expect a trained model to generalize. Reinforcement learning is presented as a better fit because it adapts dynamically by interacting with the environment and updating decisions as conditions change, rather than relying on a fixed model trained offline.

What are the four core components of an MDP, and what does each represent?

An MDP includes: (1) a set of states (the situations the agent can be in), (2) a set of actions (what the agent can do), (3) a transition function P(s'|s,a) (probabilities that action a in state s leads to next state s'), and (4) a reward function R (feedback tied to state transitions or outcomes). Together, these define how decisions lead to consequences over time.

How does the MDP transition function represent uncertainty?

It assigns probabilities to next states. For example, taking action a in state s might lead to the expected next state s1 with 90% probability, but with 10% probability the system lands in an unexpected state. This models randomness from external forces the agent can’t control.
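That 90/10 split can be sampled directly. A small sketch, with placeholder state names:

```python
import random

def sample_next_state(rng):
    """Draw s' from P(s'|s,a): the expected state s1 with 90% probability,
    an unexpected state s2 with 10% (randomness the agent can't control)."""
    return rng.choices(["s1", "s2"], weights=[0.9, 0.1])[0]

rng = random.Random(42)
draws = [sample_next_state(rng) for _ in range(10_000)]
frac_expected = draws.count("s1") / len(draws)  # close to 0.9 over many draws
```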

What does the Markov property mean in practical terms?

It means the agent’s future depends only on the current state, not on the sequence of past states. If the agent is in s0, the reward and next-state distribution are determined by s0 alone; earlier history is ignored. This turns decision-making into a state-based problem rather than a history-based one.

How does the car overheating example illustrate an optimal policy?

The environment has states cool, warm, and overheated (terminal). Actions are move slow or move fast. In cool, moving fast maximizes expected reward (+2) even though it risks moving to warm; in warm, moving slow avoids guaranteed overheating and the -10 penalty. The optimal policy is therefore: move fast in cool until warm is reached, then move slow in warm until returning to cool, and repeat.

What role do terminal states play in episodic reinforcement learning?

A terminal state ends an episode—no further actions occur until a new episode starts. The transcript uses overheating as the terminal state: once reached, the episode stops. Learning happens across many episodes, with the agent gradually improving its policy based on experience.
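Episodes can be simulated with the same car example. A sketch, with the per-transition figures quoted above (the +1 reward for slow in the warm state is an assumption of this sketch):

```python
import random

# Car MDP transitions: {(state, action): [(next_state, probability, reward), ...]}
T = {
    ("cool", "slow"): [("cool", 1.0, 1)],
    ("cool", "fast"): [("cool", 0.5, 2), ("warm", 0.5, 2)],
    ("warm", "slow"): [("cool", 0.5, 1), ("warm", 0.5, 1)],  # +1 is assumed
    ("warm", "fast"): [("overheated", 1.0, -10)],
}

def run_episode(policy, rng, max_steps=100):
    """Run one episode from the start state until the terminal state (or a step cap)."""
    state, total = "cool", 0
    for _ in range(max_steps):
        if state == "overheated":          # terminal: the episode ends here
            break
        outcomes = T[(state, policy(state))]
        idx = rng.choices(range(len(outcomes)),
                          weights=[p for _, p, _ in outcomes])[0]
        next_state, _, reward = outcomes[idx]
        total += reward
        state = next_state
    return total, state

rng = random.Random(7)
# A reckless always-fast policy eventually hits the terminal state, ending
# each episode; learning would happen across many such episodes.
endings = [run_episode(lambda s: "fast", rng)[1] for _ in range(100)]
```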

Review Questions

  1. In an MDP, how do the transition function and reward function work together to determine what an agent should do next?
  2. Explain the Markov property and give an example of what would break it in a decision process.
  3. Using the car example, why is moving fast in the warm state suboptimal even though it might seem to offer higher immediate reward?

Key Points

  1. Reinforcement learning is presented as a response to environments that shift over time, where supervised models can’t reliably maintain feature importance or performance.

  2. An MDP models sequential decision-making by combining states, actions, transition probabilities, and rewards.

  3. Uncertainty in outcomes is encoded in the transition function, which assigns probabilities to next states after an action.

  4. The Markov property assumes future outcomes depend only on the current state, not on past history.

  5. Episodic MDPs use start and terminal states to define when interactions end, enabling learning across repeated episodes.

  6. The car overheating example demonstrates how an optimal policy can be state-dependent: move fast in cool, move slow in warm, and avoid the terminal overheating state.

Highlights

Finance is framed as a setting where static supervised learning breaks down because the environment changes, making feature importance and results unstable.
An MDP formalizes decision-making under uncertainty using a transition function P(s'|s,a) and a reward function tied to outcomes.
The Markov property reduces complexity by making the present state sufficient for predicting future transitions and rewards.
In the overheating example, the optimal policy is explicitly state-based: fast in cool, slow in warm, to avoid the -10 terminal penalty.

Topics

Mentioned

  • MDP
  • RL
  • PPO
  • DQN