
High-Frequency Trading - Hard-Rule & Supervised Traders - Implemented (4)

Alex, PhD AI · 5 min read

Based on Alex, PhD AI's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

Drop weekends (Saturday and Sunday) to keep the dataset aligned with consistent trading behavior.

Briefing

The core takeaway is how the trading dataset is engineered for both supervised learning and “hard-rule” trading: each day’s limit-order-book (LOB) stream is converted into a walk-forward, nine-fold training/test structure with 144 carefully designed features and five future prediction horizons. That setup matters because it turns irregular, event-driven market data into consistent supervised targets—so models can learn when the mid price is likely to rise, fall, or stay flat after accounting for transaction costs.

Data preparation starts by dropping weekends: Sunday and Saturday are removed, leaving trading days that follow a consistent pattern. Each remaining day is split into nine consecutive folds arranged in a walk-forward manner. Training grows iteratively: a model is trained on earlier folds, tested on the next subset, then retrained as additional folds become available. Metrics are computed by averaging performance across all test subsets, using a dataset size of 400,000 samples with 144 features.
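
The expanding walk-forward split described above can be sketched as follows. This is a minimal illustration: the nine folds and 400,000-sample size come from the summary, but the exact placement of fold boundaries is an assumption.

```python
import numpy as np

def walk_forward_folds(n_samples: int, n_folds: int = 9):
    """Yield (train_idx, test_idx) pairs for an expanding walk-forward split.

    Fold boundaries are contiguous in time: each model trains on all folds
    before the current one and is tested on the current fold, so test data
    always lies strictly in the future of the training data.
    """
    bounds = np.linspace(0, n_samples, n_folds + 1, dtype=int)
    for k in range(1, n_folds):
        train_idx = np.arange(0, bounds[k])              # folds 0 .. k-1
        test_idx = np.arange(bounds[k], bounds[k + 1])   # fold k
        yield train_idx, test_idx

# Nine folds over 400,000 samples give eight train/test pairs,
# each test window strictly after its training window.
pairs = list(walk_forward_folds(400_000))
```

Averaging a metric over the eight test windows then gives the cross-subset performance figure the summary describes.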

The feature set is built from raw 10-level LOB snapshots—ask and bid prices and volumes across 10 depth levels—plus derived indicators that summarize both “state” and “movement.” State features include spread and mid price (best ask minus best bid, and their average), total and level volumes, and average prices. For each depth level, the pipeline computes normalized price differences relative to the best ask/bid and normalizes volumes. It also adds time-sensitive measures: differences between previous and current prices or volumes divided by the elapsed time between events, plus “intensity” indicators that capture how quickly price changes. The result is a mix of continuous and categorical-style signals (including thresholded indicators).
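
A few of the described state features can be sketched from a single 10-level snapshot. This is an illustrative subset only: `lob_state_features` is a hypothetical helper, and the full 144-feature recipe (including the time-normalized and intensity features) is not fully specified in the summary.

```python
import numpy as np

def lob_state_features(ask_p, ask_v, bid_p, bid_v):
    """Compute a handful of the described state features from one 10-level
    LOB snapshot. Each argument is a length-10 array with level 0 = best quote.
    """
    spread = ask_p[0] - bid_p[0]          # best ask minus best bid
    mid = (ask_p[0] + bid_p[0]) / 2.0     # mid price
    total_ask_vol = ask_v.sum()
    total_bid_vol = bid_v.sum()
    # per-level price differences normalized by the best quote on each side
    ask_diff = (ask_p - ask_p[0]) / ask_p[0]
    bid_diff = (bid_p[0] - bid_p) / bid_p[0]
    # per-level volumes normalized by total depth on that side
    ask_vol_norm = ask_v / total_ask_vol
    bid_vol_norm = bid_v / total_bid_vol
    return np.concatenate(([spread, mid, total_ask_vol, total_bid_vol],
                           ask_diff, bid_diff, ask_vol_norm, bid_vol_norm))
```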

Targets are defined around the mid price at future horizons. Five separate horizons are generated, ranging from near-term to more distant future points (described as the next one, two, three, five, or ten events, with the exact construction tied to event timing in milliseconds). Each target label is a categorical direction rather than a raw price value: the dataset computes the percentage change of the mid price relative to the current mid price, then assigns labels using a threshold of 0.002. Changes ≥ +0.002 become “up” (label 1), changes strictly between −0.002 and +0.002 become “flat” (label 2), and changes ≤ −0.002 become “down” (label 3).
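
The thresholding rule maps directly to a small labeling function. This is a sketch: `direction_label` is a hypothetical name, but the label values and the 0.002 threshold follow the text.

```python
def direction_label(mid_now: float, mid_future: float,
                    threshold: float = 0.002) -> int:
    """Map a future mid-price move to the categorical direction labels:
    1 = up, 2 = flat, 3 = down, using a percentage-change threshold."""
    pct_change = (mid_future - mid_now) / mid_now
    if pct_change >= threshold:
        return 1   # up
    if pct_change <= -threshold:
        return 3   # down
    return 2       # flat
```

Applying this function at each of the five horizons yields five categorical target columns per sample.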

The trading logic connects these predictions to execution and fees. A strategy buys or goes short only if the predicted mid-price move is large enough to cover transaction costs and still deliver profit. The hard-rule baseline is simple: buy one share at 10:30 each day and sell at 6 p.m. the same day (then repeat daily). For the supervised approach, the plan is to adapt a random forest classifier (using scikit-learn) trained on the prepared features and five horizon-specific targets, then evaluate results by computing a “heat ratio” and saving outputs to CSV.
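
The supervised plan, one random forest per horizon with predictions written to CSV, might look like the following sketch on synthetic stand-in data. The heat-ratio computation is not specified in the summary, so only training, prediction, and CSV output are shown; sizes and hyperparameters are placeholder assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 144))          # stand-in for the 144 engineered features
horizons = [1, 2, 3, 5, 10]
y = {h: rng.integers(1, 4, size=1000) for h in horizons}  # labels 1/2/3 per horizon

split = 800                               # time-ordered split: past -> train, future -> test
preds = {}
for h in horizons:
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X[:split], y[h][:split])
    preds[f"horizon_{h}"] = clf.predict(X[split:])

# one prediction column per horizon, saved for downstream evaluation
pd.DataFrame(preds).to_csv("predictions.csv", index=False)
```

In the real pipeline, the train/test split would come from the walk-forward folds rather than a single cut.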

Overall, the dataset engineering—walk-forward splits, 10-level LOB feature construction, five horizon labels, and fee-aware decision framing—turns raw LOB streams into a supervised learning problem that can be implemented with a straightforward random forest pipeline and compared against a deterministic hard-rule trader.

Cornell Notes

The dataset is engineered for supervised trading by converting event-driven limit-order-book (LOB) data into a walk-forward learning problem. Each trading day is split into nine consecutive folds, with training expanding iteratively and metrics averaged across all test subsets. From 10-level LOB snapshots, the pipeline builds 144 features capturing both book state (spread, mid price, normalized price/volume by depth) and book dynamics (time-normalized differences and intensity indicators). Targets are five future horizons predicting the direction of mid-price movement using a percentage-change threshold of 0.002: up (≥ +0.002), flat (strictly between −0.002 and +0.002), and down (≤ −0.002). This matters because it aligns model outputs with fee-aware trading decisions and enables horizon-by-horizon evaluation using a random forest classifier.

Why are weekends dropped and what does the “walk-forward nine-fold” structure accomplish?

Sunday and Saturday are removed so only consistent trading sessions remain. Each remaining day is split into nine consecutive folds. Training grows iteratively: a model is trained on earlier folds, predicts on the next test subset, then retrains as additional folds become available. Metrics are computed by averaging performance across all test subsets, which reduces look-ahead bias compared with random shuffling.

What exactly goes into the 144 features derived from the limit order book?

The base inputs are 10-level LOB snapshots: for each depth level, ask price, bid price, ask volume, and bid volume. On top of that, the pipeline adds state features like spread and mid price, plus aggregated volume and average-price measures. It also computes normalized price differences relative to the best ask/bid for each level and normalizes volumes. Time-sensitive features include price/volume differences from previous events divided by elapsed time, and “intensity” indicators that reflect how fast price changes.

How are the five supervised targets constructed, and why are they categorical instead of continuous?

Targets are derived from the mid price at multiple future horizons (near-term to more distant, described as the next 1, 2, 3, 5, or 10 events, with timing tied to the event stream). The dataset computes the percentage change of the mid price from the current timestamp to that future horizon, then converts it into categories using a threshold of 0.002. Up (label 1) is ≥ +0.002, flat (label 2) is strictly between −0.002 and +0.002, and down (label 3) is ≤ −0.002. This turns a continuous movement into a classification problem for each horizon.

How do transaction fees influence whether a predicted move leads to an actual trade?

A trade is only profitable if the predicted mid-price increase (or decrease for shorting) is large enough to cover exchange/transaction fees plus the desired profit margin. That means the model’s predicted direction and magnitude must clear a fee-aware threshold; otherwise, the strategy should avoid acting even if the direction is favorable.
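
The fee-aware gate can be expressed as a one-line check. This is a hypothetical helper: the doubled fee assumes one fee on entry and one on exit, which is an assumption not stated in the text.

```python
def should_trade(predicted_pct_move: float, fee_pct: float,
                 min_profit_pct: float) -> bool:
    """Act only when the predicted mid-price move clears round-trip fees plus
    the desired profit margin, in either direction (long or short).
    All quantities are fractional returns, e.g. 0.001 = 0.1%.
    """
    hurdle = 2 * fee_pct + min_profit_pct   # fee on entry plus fee on exit (assumed)
    return abs(predicted_pct_move) > hurdle
```

With a 0.1% fee per side and a 0.1% target margin, a predicted 0.5% move triggers a trade while a 0.2% move does not, even though its direction is favorable.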

What is the hard-rule baseline and how does it compare to the supervised random forest approach?

The hard-rule trader buys exactly one share at 10:30 each day and sells at 6 p.m. the same day, repeating daily. The supervised approach instead trains a random forest classifier (scikit-learn) on the prepared 144 features and five horizon-specific categorical targets, then uses the predictions to drive trading decisions. Evaluation then compares outcomes via a heat ratio and writes results to CSV.

Review Questions

  1. How does the walk-forward nine-fold split prevent leakage compared with random train/test splitting?
  2. What thresholding rule maps percentage mid-price change into labels 1, 2, and 3?
  3. Why must the prediction horizon be at least as large as the model’s end-to-end delay (inference, exchange latency, and order placement time)?

Key Points

  1. Drop weekends (Saturday and Sunday) to keep the dataset aligned with consistent trading behavior.

  2. Split each day into nine consecutive folds and use walk-forward training so models never see future test periods during training.

  3. Build 144 features from 10-level LOB snapshots, combining book state (spread, mid price, normalized prices/volumes) with time-normalized dynamics (differences divided by elapsed time, intensity indicators).

  4. Create five horizon-specific targets by computing future mid-price percentage change and converting it into categorical direction labels using a 0.002 threshold.

  5. Tie trading decisions to fee-aware profitability: only act when predicted mid-price movement is large enough to cover transaction costs and desired gains.

  6. Use a simple hard-rule baseline (buy 1 share at 10:30, sell at 6 p.m.) to benchmark against supervised models.

  7. Train a scikit-learn random forest classifier for each of the five targets, then evaluate using a heat ratio and save results to CSV.

Highlights

Each day becomes a nine-fold walk-forward sequence, with metrics averaged over all test subsets to reduce look-ahead bias.
The 144-feature design blends 10-level book structure with time-sensitive dynamics like time-normalized price/volume differences.
Targets are five future horizons, but labels are categorical direction derived from a 0.002 mid-price percentage-change threshold.
Fee-aware execution is central: predictions only translate into trades when expected movement clears transaction costs.
A deterministic baseline—buy at 10:30 and sell at 6 p.m.—provides a straightforward benchmark for the random-forest strategy.

Topics

  • Limit Order Book Features
  • Walk-Forward Validation
  • Mid-Price Horizon Labels
  • Random Forest Trading
  • Transaction Fee Logic

Mentioned

  • scikit-learn