XGBoost For Classification | How XGBoost works on Classification Problems | CampusX

CampusX · 5 min read

Based on CampusX's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.

TL;DR

XGBoost classification uses log-odds as the working prediction space and converts to probabilities only when computing residuals.

Briefing

XGBoost classification works by repeatedly training decision trees to fix the mistakes of the current model, using log-odds (not raw probabilities) as the working space. The core workflow stays the same as gradient boosting—build an additive model in stages—but XGBoost swaps in a classification-specific tree-building criterion (“similarity score”) and a leaf output formula designed to push residuals toward zero.

The walkthrough starts with a toy dataset: each student has a single feature, CGPA, and a binary label indicating placement (0 = no placement, 1 = placement). The goal is to predict placement for a new CGPA. Because classification outputs are handled through log-odds, stage one begins with a constant base prediction in log-odds space. Using the log-odds definition log(p/(1−p)), where p is the overall placement rate P(y=1) in the training data, the initial model outputs the same log-odds value for every data point—meaning it cannot yet distinguish between different CGPAs.
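A minimal sketch of that first step is below. The summary does not reproduce the raw CGPA table, so the placement labels here are assumed for illustration:

```python
import numpy as np

# Hypothetical placement labels for five students (assumed; the transcript's
# exact table is not listed in this summary).
y = np.array([0, 1, 0, 1, 1])

p = y.mean()                          # P(y = 1) estimated from the labels
base_log_odds = np.log(p / (1 - p))   # constant stage-one prediction

# Every student gets this same log-odds value, regardless of CGPA.
print(base_log_odds)                  # ~0.405 for 3 positives out of 5
```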

To measure how wrong that constant guess is, the method computes “pseudo-residuals” for each observation. These residuals are derived by converting the stage-one log-odds to probabilities via p = e^(log-odds) / (1 + e^(log-odds)), then subtracting those probabilities from the true labels (residual = label − probability). With residuals in hand, stage two trains a decision tree whose job is to reduce those residual errors. The tree is built by sorting observations by CGPA and then evaluating candidate split points between consecutive CGPA values (the transcript lists split thresholds like 5.97, 6.67, 7.62, and 8.87).
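Here is a hedged sketch of the residual computation and the candidate thresholds. The CGPA values are assumptions, chosen only so that their midpoints reproduce the thresholds quoted above; the labels are the same assumed ones as before:

```python
import numpy as np

# Assumed CGPA values and labels; midpoints of consecutive sorted CGPAs
# reproduce the thresholds quoted from the transcript.
cgpa = np.array([5.44, 6.50, 6.84, 8.40, 9.34])
y    = np.array([0, 1, 0, 1, 1])

# Stage-one constant prediction in log-odds, converted back to a probability.
base_log_odds = np.log(y.mean() / (1 - y.mean()))
p_prev = np.exp(base_log_odds) / (1 + np.exp(base_log_odds))

# Pseudo-residuals: true label minus current predicted probability.
residuals = y - p_prev                 # e.g. -0.6 for y = 0, +0.4 for y = 1

# Candidate splits: midpoints between consecutive sorted CGPA values.
s = np.sort(cgpa)
thresholds = (s[:-1] + s[1:]) / 2      # 5.97, 6.67, 7.62, 8.87
print(residuals, thresholds)
```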

For each potential split, XGBoost computes a “similarity score” for the resulting leaf nodes. That score is the square of the leaf’s summed residuals, normalized by a term involving the previous probabilities (and a regularization parameter λ, set to zero in the example). The algorithm chooses the split that maximizes the gain, calculated as the similarity scores of the left and right leaves minus the similarity score of the parent node. After comparing gains across all candidate thresholds, the best split becomes the root decision in the stage-two tree.
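A sketch of that split search, continuing the assumed five-student example (λ = 0 as in the transcript):

```python
import numpy as np

# Assumed values carried over from the sketch above.
cgpa       = np.array([5.44, 6.50, 6.84, 8.40, 9.34])
residuals  = np.array([-0.6, 0.4, -0.6, 0.4, 0.4])   # y - p_prev
p_prev     = np.full(5, 0.6)                          # stage-one probability
thresholds = [5.97, 6.67, 7.62, 8.87]
lam = 0.0                                             # λ, zero in the example

def similarity(res, p):
    # (sum of residuals)^2 / (sum of p * (1 - p) + λ)
    return res.sum() ** 2 / ((p * (1 - p)).sum() + lam)

def gain(threshold):
    left = cgpa <= threshold
    return (similarity(residuals[left],  p_prev[left])
            + similarity(residuals[~left], p_prev[~left])
            - similarity(residuals, p_prev))

# The threshold with the highest gain becomes the root decision of the tree.
best = max(thresholds, key=gain)
print(best, gain(best))   # 7.62 wins with this assumed data
```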

Once the tree structure is fixed, each leaf receives an output value computed from the residuals and previous probabilities using a classification-specific formula (again involving the same normalization term used in similarity score, but without the square). The stage-two model then adds the learning-rate-scaled leaf outputs to the stage-one log-odds predictions. Predictions are still produced in log-odds space, then converted back to probabilities when residuals are recomputed.
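A small sketch of the leaf output formula and the stage-two update, still on the assumed data. The learning rate η = 0.3 is an additional assumption; the summary does not state its value:

```python
import numpy as np

lam, eta = 0.0, 0.3        # λ from the example; η is assumed here

def leaf_output(res, p):
    # sum of residuals / (sum of p * (1 - p) + λ) -- no square this time
    return res.sum() / ((p * (1 - p)).sum() + lam)

residuals = np.array([-0.6, 0.4, -0.6, 0.4, 0.4])
p_prev    = np.full(5, 0.6)
left      = np.array([True, True, True, False, False])    # CGPA <= 7.62

out_left  = leaf_output(residuals[left],  p_prev[left])    # about -1.11
out_right = leaf_output(residuals[~left], p_prev[~left])   # about  1.67

# Stage-two log-odds = stage-one log-odds + η * leaf output for that student.
base_log_odds = np.log(0.6 / 0.4)
new_log_odds  = base_log_odds + eta * np.where(left, out_left, out_right)
new_p = np.exp(new_log_odds) / (1 + np.exp(new_log_odds))  # used for next residuals
```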

The process repeats: stage three adds another tree trained on the updated residuals, and so on until residuals get very close to zero. The transcript emphasizes that the only real conceptual difference from earlier gradient boosting is how XGBoost constructs trees—through the similarity score/gain machinery and the leaf output formula—while the overall additive, stage-wise correction loop remains the same. The next step, it notes, is to derive the mathematical formulation behind those formulas in a follow-up.
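Putting the pieces together, a compact end-to-end sketch of that loop could look like this. It uses the same assumed toy data, depth-1 trees, and an assumed learning rate; real XGBoost grows deeper trees and adds further machinery (pruning, subsampling, and so on) not covered here:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def fit_stump(cgpa, residuals, p_prev, lam=0.0):
    """Pick the best single split by gain; return (threshold, left_out, right_out)."""
    sim = lambda r, p: r.sum() ** 2 / ((p * (1 - p)).sum() + lam)
    out = lambda r, p: r.sum() / ((p * (1 - p)).sum() + lam)
    s = np.sort(cgpa)
    best = None
    for t in (s[:-1] + s[1:]) / 2:                     # candidate thresholds
        left = cgpa <= t
        g = (sim(residuals[left], p_prev[left])
             + sim(residuals[~left], p_prev[~left])
             - sim(residuals, p_prev))
        if best is None or g > best[0]:
            best = (g, t,
                    out(residuals[left], p_prev[left]),
                    out(residuals[~left], p_prev[~left]))
    return best[1:]

# Assumed toy data (see the sketches above) and an assumed learning rate.
cgpa = np.array([5.44, 6.50, 6.84, 8.40, 9.34])
y    = np.array([0, 1, 0, 1, 1])
eta  = 0.3

log_odds = np.full(5, np.log(y.mean() / (1 - y.mean())))   # stage one
for stage in range(5):                                      # stages two, three, ...
    p_prev = sigmoid(log_odds)
    residuals = y - p_prev                                  # shrink toward zero
    t, lo, ro = fit_stump(cgpa, residuals, p_prev)
    log_odds = log_odds + eta * np.where(cgpa <= t, lo, ro)

print(sigmoid(log_odds))   # probabilities move toward the true labels
```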

Cornell Notes

XGBoost classification builds an additive model in stages, where each new decision tree is trained to reduce the current model’s errors. Stage one starts with a constant base prediction in log-odds space, then computes pseudo-residuals by converting log-odds to probabilities. Stage two trains a CGPA-splitting decision tree by evaluating candidate split thresholds and choosing the one that maximizes gain, where gain is driven by a “similarity score” computed from residuals and previous probabilities. After the tree structure is chosen, each leaf gets an output value derived from residuals and previous probabilities, scaled by the learning rate. Repeating this residual-fixing loop yields progressively better predictions until residuals approach zero.

Why does the classification workflow operate in log-odds space instead of probabilities directly?

Stage one outputs a constant log-odds value for every data point, computed from p = P(y=1). The transcript uses log-odds = log(p/(1−p)) as the base quantity. When predictions are needed in probability form (for residual computation), log-odds are converted back using p = e^(log-odds) / (1 + e^(log-odds)). This keeps the boosting updates consistent across stages while still allowing probabilities to be recovered when calculating errors.

How are pseudo-residuals used to train the next tree?

After stage one produces log-odds, the method converts those to probabilities and then computes an error term per observation (pseudo-residual). These residuals represent how far the current predictions are from the true binary labels. Stage two then trains a decision tree to predict these residuals indirectly by choosing splits that maximize gain based on residual-driven similarity scores.

What determines the best split in the stage-two decision tree?

The transcript sorts observations by CGPA and considers candidate split thresholds between consecutive values (e.g., 5.97, 6.67, 7.62, 8.87). For each candidate split, it computes similarity scores for the left and right leaf nodes using a formula that normalizes the squared sum of residuals by a term involving the previous probabilities (and λ, set to 0 in the example). Gain is then computed as the left-leaf similarity plus the right-leaf similarity minus the parent node’s similarity, and the split with the highest gain becomes the tree’s decision rule.

How is a leaf output value computed once the tree structure is fixed?

For each leaf, the output value is computed using a classification-specific formula: (sum of residuals) divided by (sum of previous probabilities * (1 − previous probabilities) + λ). In the transcript, λ is set to 0, and the calculations yield leaf outputs like −1.11 and 1.66 for the two leaves created by the chosen split. These leaf outputs are then multiplied by the learning rate (η) and added to the stage-one log-odds predictions.

How do stage-wise updates combine into final predictions?

Stage two combines stage-one log-odds (the constant base, e.g., 0.405 in the example) with η times the decision tree output (leaf value depending on whether CGPA is below or above the split threshold). The resulting combined log-odds are converted to probabilities when computing the next residuals. Stage three repeats the same pattern: add another η-scaled tree trained on the new residuals, until residuals are near zero.
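As a tiny illustration of that combination, using the base log-odds and the leaf outputs quoted above. The learning rate η = 0.3 and the winning threshold of 7.62 are assumptions here, not values stated in the summary:

```python
import math

BASE_LOG_ODDS = 0.405   # stage-one constant, assuming three of five students placed
ETA   = 0.3             # assumed learning rate
SPLIT = 7.62            # assumed winning threshold from the candidate list

def stage_two_log_odds(cgpa: float) -> float:
    leaf = -1.11 if cgpa <= SPLIT else 1.66   # leaf outputs quoted above
    return BASE_LOG_ODDS + ETA * leaf

def predict_proba(cgpa: float) -> float:
    z = stage_two_log_odds(cgpa)
    return math.exp(z) / (1 + math.exp(z))    # back to probability space

print(predict_proba(6.0), predict_proba(9.0))  # low vs. high CGPA student
```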

Review Questions

  1. In the transcript’s setup, what quantity is constant in stage one, and how is it converted into probabilities for residual computation?
  2. What role do similarity score and gain play in choosing a split threshold during stage two?
  3. After the stage-two tree is built, how is each leaf’s output value computed and incorporated into the model’s log-odds predictions?

Key Points

  1. XGBoost classification uses log-odds as the working prediction space and converts to probabilities only when computing residuals.

  2. Stage one starts with a constant base log-odds prediction derived from the positive-class probability p.

  3. Pseudo-residuals are computed per observation from the difference between true labels and the current probability estimates.

  4. Stage two trains a decision tree by sorting by CGPA, testing candidate split thresholds, and selecting the split that maximizes gain computed from similarity scores.

  5. The similarity score divides the square of a leaf’s summed residuals by a term built from the previous probabilities (plus the regularization parameter λ, set to zero in the example).

  6. Leaf output values are computed from residuals normalized by the previous probabilities’ variance term plus λ, then scaled by the learning rate η.

  7. The process repeats stage by stage, each new tree targeting the updated residuals, until residuals approach zero.

Highlights

Stage one outputs the same log-odds for every CGPA, so it can’t separate students until residual-driven trees are added.
The tree-building decision hinges on maximizing gain, where gain comes from similarity scores computed using residuals and previous probabilities.
Leaf outputs are not arbitrary: each leaf’s value is derived from residual sums divided by a probability-variance-like term (plus λ).
The overall loop stays gradient-boosting-like: additive models in stages, each tree correcting the last model’s errors.

Topics

  • XGBoost Classification
  • Gradient Boosting
  • Log-Odds
  • Decision Tree Splits
  • Similarity Score

Mentioned

  • XGBoost