High-Frequency Trading Data and Features - Explained (3)

Alex, PhD AI · 5 min read

Based on Alex, PhD AI's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

High-frequency trading data is modeled as a dynamic limit order book updated on nanosecond-to-microsecond timescales, not hour/minute intervals.

Briefing

High-frequency trading data is built around a limit order book that updates on extremely short timescales (nanoseconds to microseconds), so the dataset's core job is to capture how buy and sell orders stack up at different price levels and how those stacks evolve. Rather than evolving over hours, minutes, or seconds, the market state changes so quickly that models must treat the order book as a dynamic, time-ordered sequence of snapshots. Each snapshot records the best bid and best ask (the closest buy and sell price levels) plus the depth of liquidity across multiple levels, enabling features like the spread and level-wise volumes that reflect real trading conditions.

A limit order book is an electronic system in which all traders submit orders through a network and the exchange matches them. Orders arrive as buy orders (from traders seeking to purchase) and sell orders (from traders seeking to sell). When a buy order is placed at a price that matches an existing sell order, a transaction occurs and the matched quantities are removed from the relevant levels, so the book changes immediately. The book is represented as price "levels" on both sides: each level corresponds to a price and aggregates the total quantity resting at that price. The "best bid" is the highest price any resting buy order offers; the "best ask" is the lowest price any resting sell order asks. The spread, often the most informative feature, is computed between these closest levels (best ask minus best bid), though spreads between other level pairs can also be derived.
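As a minimal sketch, assuming a simple (price, quantity) layout per level rather than the dataset's actual file format, these features can be computed directly from one snapshot:

```python
# Minimal sketch of spread and depth features from one LOB snapshot.
# The (price, total_resting_quantity) layout is an illustrative
# assumption, not the dataset's actual column format.
bids = [(100.00, 50), (99.99, 120), (99.98, 75)]   # buy side
asks = [(100.02, 31), (100.03, 90), (100.04, 40)]  # sell side

best_bid = max(p for p, _ in bids)   # highest resting buy price
best_ask = min(p for p, _ in asks)   # lowest resting sell price
spread = best_ask - best_bid         # closest-level spread

# Level-wise volumes summed into total depth per side.
bid_depth = sum(q for _, q in bids)
ask_depth = sum(q for _, q in asks)

print(f"best bid={best_bid}, best ask={best_ask}, spread={spread:.2f}")
print(f"bid depth={bid_depth}, ask depth={ask_depth}")
```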

Because the order book is stochastic and "spiky," the dataset is normalized, but the choice of normalization strongly affects downstream results. Min-max scaling, z-score scaling, and other methods are available, and selecting an appropriate scaler matters: if rare extremes dominate the scaler's range, the rest of the distribution is compressed and becomes less informative. Outlier handling is also recommended, since the distributional quirks of limit order books can distort learning.
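A small synthetic illustration (the values are made up, not from the dataset) of why the scaler choice matters, comparing min-max against z-score scaling when one spike is present:

```python
import numpy as np

# One extreme outlier among otherwise typical values.
x = np.array([1.0, 1.1, 0.9, 1.2, 50.0])

minmax = (x - x.min()) / (x.max() - x.min())
zscore = (x - x.mean()) / x.std()

print(minmax)  # typical values compressed near 0, outlier pinned at 1
print(zscore)  # typical values still distinguishable, around -0.5
```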

The dataset is organized as post-processed event streams for five Nasdaq instruments (one file per stock per day), reconstructed from exchange messages such as order submissions, cancellations, and executions. Network effects can cause events to arrive out of order, so the pipeline reorders them and then generates future targets from the reconstructed market state. Snapshots form a time series whose spacing is irregular: the market only records a state when an event occurs, so consecutive snapshots can be separated by anything from milliseconds to minutes. The data spans 10 days with nearly 4 million consecutive samples, and it includes stratified, walk-forward splits (nine folds) for the supervised learning targets.
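A minimal sketch of the two event-stream properties described above, using synthetic nanosecond timestamps rather than actual dataset records: restore event order by sorting on the timestamp, then measure the irregular gaps between consecutive events:

```python
import numpy as np

# Synthetic event packets; the second one arrived late over the network.
events = np.array([
    (1_000_000, "submission"),
    (5_000,     "submission"),   # out-of-order arrival
    (2_500_000, "execution"),
], dtype=[("ts_ns", "i8"), ("kind", "U16")])

events = np.sort(events, order="ts_ns")   # restore true event order
deltas_ns = np.diff(events["ts_ns"])      # irregular inter-event gaps
print(deltas_ns)  # [ 995000 1500000] -- no fixed sampling interval
```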

Targets are provided for five classification tasks, while reinforcement learning can use the raw data directly. For modeling, the book is recorded with 10 levels per snapshot (five bid levels and five ask levels), even though real books contain more depth. Finally, the dataset includes trading-session metadata and recommends trimming anomalous periods (roughly the first half hour after the open and the last half hour before the close, in Helsinki time) so models learn from typical regimes rather than edge-case behavior. Prices are discretized to the tick size (e.g., one cent) and scaled (multiplied by 10,000) so they can be stored as integers rather than floats, and each message includes a nanosecond timestamp, price level, quantity, and event type (submission, cancellation, or execution).
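A sketch of both preprocessing steps, assuming a one-cent tick; the helper names (`encode_price`, `in_core_session`) are illustrative, not from the dataset:

```python
import datetime as dt

TICK = 0.01           # assumed one-cent tick size
PRICE_SCALE = 10_000  # store 12.34 as 123400 to avoid float storage issues

def encode_price(price: float) -> int:
    # Snap to the tick grid, then scale to an integer representation.
    return round(round(price / TICK) * TICK * PRICE_SCALE)

SESSION_OPEN = dt.time(10, 0)    # Helsinki local time
SESSION_CLOSE = dt.time(18, 30)
TRIM = dt.timedelta(minutes=30)

def in_core_session(t: dt.time) -> bool:
    # Keep snapshots at least 30 min after the open and before the close.
    day = dt.date(2010, 6, 1)  # arbitrary date, only for time arithmetic
    ts = dt.datetime.combine(day, t)
    start = dt.datetime.combine(day, SESSION_OPEN) + TRIM
    end = dt.datetime.combine(day, SESSION_CLOSE) - TRIM
    return start <= ts <= end

print(encode_price(12.34))               # 123400
print(in_core_session(dt.time(10, 15)))  # False: within first half hour
```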

Cornell Notes

The dataset centers on a dynamic limit order book (LOB) that updates on nanosecond-to-microsecond timescales, producing a time series of irregularly spaced snapshots. Each snapshot aggregates resting buy and sell orders into price levels, letting models compute liquidity depth (level volumes) and key microstructure features like the spread between best ask and best bid. Data is reconstructed from exchange messages (submissions, cancellations, executions), reordered to fix out-of-order delivery, and then used to generate future targets. Normalization choices matter because LOB data is highly stochastic and can be spiky; poor scaling can make extremes dominate and wash out signal. The dataset spans 10 days for five Nasdaq instruments, includes stratified walk-forward splits, and provides five classification targets for supervised learning.

How does a limit order book snapshot translate into features like spread and volume?

A snapshot records price levels on both sides of the market. The best bid is the highest buy price level currently requested; the best ask is the lowest sell price level requested. The spread is typically computed as best ask minus best bid (closest levels). Volume is the aggregated quantity resting at each price level; it can be used per level (e.g., ask level at 101 has size 31) and also summed across levels for total bid/ask depth.
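Extending that answer, per-level spreads (derivable between other level pairs, as noted earlier) and cumulative depth fall out of simple array operations; the layout below, with the best level first, is an assumption for illustration:

```python
import numpy as np

bid_prices = np.array([100.00, 99.99, 99.98])
ask_prices = np.array([100.02, 100.03, 100.04])
bid_sizes  = np.array([50, 120, 75])
ask_sizes  = np.array([31, 90, 40])

level_spreads = ask_prices - bid_prices  # level-k ask minus level-k bid
cum_bid_depth = np.cumsum(bid_sizes)     # liquidity within the k best levels
cum_ask_depth = np.cumsum(ask_sizes)

print(level_spreads)  # [0.02 0.04 0.06]
print(cum_bid_depth)  # [ 50 170 245]
```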

Why is the time series “irregular,” and why does that matter for modeling?

Snapshots are recorded only when market events occur, so the time delta between consecutive snapshots is not constant. The gap can range from milliseconds to several minutes. Models must treat the data as event-driven rather than assuming fixed intervals, and the dataset provides exact timestamps for each snapshot state.

What’s the difference between raw exchange messages and the dataset’s reconstructed inputs?

Raw messages arrive as event packets such as order submissions, cancellations, and executions, often with nanosecond timestamps. Because network delivery can cause events to arrive out of order, the pipeline reorders them (windowing) and reconstructs the order book state. The final inputs include reconstructed LOB features and a reconstructed message list aligned to those states.

Why do normalization and outlier handling have outsized impact on LOB learning?

LOB dynamics are highly stochastic and can produce spiky extremes. If min-max or z-score scaling is applied naively, the scaler's parameters are set by rare extremes, compressing the rest of the distribution and making the remaining structure less informative. The guidance is to choose normalization carefully and to drop or mitigate outliers.
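One common mitigation, shown here as an illustrative sketch on synthetic data rather than a step prescribed by the dataset, is winsorizing (percentile clipping) before scaling:

```python
import numpy as np

def winsorize(x: np.ndarray, lo: float = 1.0, hi: float = 99.0) -> np.ndarray:
    # Clip values outside the chosen percentiles; thresholds are assumptions.
    low, high = np.percentile(x, [lo, hi])
    return np.clip(x, low, high)

rng = np.random.default_rng(0)
x = rng.normal(1.0, 0.1, size=1000)  # typical LOB-like feature values
x[::200] = 50.0                      # inject a few spiky extremes

clipped = winsorize(x)
print(x.max(), clipped.max())  # ~50.0 before vs ~1.2 after clipping
```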

How are targets and dataset splits set up for supervised learning?

The dataset includes labels for five classification targets derived from future outcomes based on the reconstructed market state. Splits use stratified cross-validation in a walk-forward manner, producing nine folds. All data across days and stocks are pre-split, and the approach is designed to respect temporal ordering.
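A minimal sketch of an anchored walk-forward scheme consistent with that description, producing nine folds from ten days; the exact fold boundaries are an assumption, not the dataset's published splits:

```python
days = list(range(1, 11))  # 10 trading days

# Each fold trains on all days up to a cutoff and tests on the next day,
# so temporal ordering is respected.
folds = []
for k in range(1, 10):            # nine folds
    train_days = days[:k]         # days 1..k
    test_days = days[k:k + 1]     # the following day
    folds.append((train_days, test_days))

for train, test in folds:
    print(f"train on days {train}, test on day {test[0]}")
```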

What trading-session filtering is recommended, and what problem does it prevent?

Trading happens only during official market hours (10:00 to 18:30 in Helsinki local time). The recommendation is to cut about the first and last half hour of each day because anomalous events cluster near the open and close, creating atypical regimes that can skew model training and evaluation.

Review Questions

  1. What are best bid and best ask in a limit order book, and how is the spread between them computed?
  2. Why can consecutive LOB snapshots be separated by very different time deltas, and how should that influence feature engineering or model assumptions?
  3. How does out-of-order message delivery get handled before reconstructing the order book state?

Key Points

  1. High-frequency trading data is modeled as a dynamic limit order book updated on nanosecond-to-microsecond timescales, not hour/minute intervals.

  2. Each LOB snapshot aggregates resting orders into price levels; best bid and best ask define the closest liquidity boundary and drive the spread feature.

  3. Spread is most informative when computed between the closest levels (best ask minus best bid), though other level-pair spreads are possible.

  4. The dataset is reconstructed from exchange messages (submission, cancellation, execution) and reorders events to correct for network-induced out-of-order delivery.

  5. Normalization choice strongly affects results because LOB data is stochastic and can be spiky; scaling can let extremes dominate and obscure signal.

  6. Snapshots form an event-driven time series with irregular time deltas between consecutive states, ranging from milliseconds to minutes.

  7. Supervised learning uses five classification targets with stratified walk-forward splits into nine folds; reinforcement learning can use the raw data directly.

Highlights

A limit order book snapshot is essentially a depth-of-liquidity map: price levels on both sides with aggregated quantities, enabling microstructure features like spread and level volumes.
The spread between best ask and best bid is singled out as the most informative spread definition because it reflects the nearest executable prices.
Event-driven sampling makes time deltas irregular; consecutive snapshots can be milliseconds apart or minutes apart even though timestamps are exact.
Normalization can make or break learning on LOB data: spiky extremes can distort scaling and reduce the usefulness of the rest of the distribution.
Market open/close periods are treated as anomalous; trimming roughly the first and last half hour helps models avoid regime shifts.
