XGBoost For Classification | How XGBoost works on Classification Problems | CampusX
Based on CampusX's video on YouTube. If you like this content, support the original creators by watching, liking and subscribing to their content.
XGBoost classification uses log-odds as the working prediction space and converts to probabilities only when computing residuals.
Briefing
XGBoost classification works by repeatedly training decision trees to fix the mistakes of the current model, using log-odds (not raw probabilities) as the working space. The core workflow stays the same as gradient boosting—build an additive model in stages—but XGBoost swaps in a classification-specific tree-building criterion (“similarity score”) and a leaf output formula designed to push residuals toward zero.
The walkthrough starts with a toy dataset: each student has a single feature, CGPA, and a binary label indicating placement (0 = no placement, 1 = placement). The goal is to predict placement for a new CGPA. Because classification outputs are handled through log-odds, stage one begins with a constant base prediction in log-odds space. Using the log-odds definition log(p/(1−p)), where p is the fraction of positive (placed) examples in the training data, the initial model outputs the same log-odds value for every data point, meaning it cannot yet distinguish between different CGPAs.
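As a minimal sketch of this stage-one step, the snippet below uses made-up CGPA values and labels (the transcript's actual numbers are not reproduced here):

```python
import numpy as np

# Hypothetical toy data standing in for the transcript's table:
# one CGPA feature and a binary placement label (1 = placed).
cgpa = np.array([5.1, 6.8, 7.5, 8.4, 9.3])
placed = np.array([0, 0, 1, 1, 1])

# Stage one: one constant prediction in log-odds space,
# log(p / (1 - p)) with p = fraction of positive labels.
p = placed.mean()
base_log_odds = np.log(p / (1 - p))
print(base_log_odds)  # ~0.405, the same value for every student
```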
To measure how wrong that constant guess is, the method computes “pseudo-residuals” for each observation: the stage-one log-odds is converted back to a probability via p = e^(log-odds) / (1 + e^(log-odds)), and each residual is the true label minus that probability. With residuals in hand, stage two trains a decision tree whose job is to reduce those residual errors. The tree is built by sorting observations by CGPA and then evaluating candidate split points between consecutive CGPA values (the transcript lists split thresholds like 5.97, 6.67, 7.62, and 8.87).
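Continuing that sketch, the residuals and candidate thresholds would be computed along these lines (the thresholds differ from the transcript's because the data is made up):

```python
# Sigmoid: back from log-odds to a probability, p = e^z / (1 + e^z).
prev_prob = np.exp(base_log_odds) / (1 + np.exp(base_log_odds))

# Pseudo-residual per observation: true label minus current probability.
residuals = placed - prev_prob  # here [-0.6, -0.6, 0.4, 0.4, 0.4]

# Candidate split thresholds: midpoints between consecutive sorted CGPAs
# (the transcript's 5.97, 6.67, ... come from its own CGPA values).
sorted_cgpa = np.sort(cgpa)
thresholds = (sorted_cgpa[:-1] + sorted_cgpa[1:]) / 2  # [5.95, 7.15, 7.95, 8.85]
```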
For each potential split, XGBoost computes a “similarity score” for the resulting leaf nodes. That score is the squared sum of the node's residuals divided by a term involving the previous probabilities, the sum of p_prev(1 − p_prev) over the node's observations, plus a regularization parameter λ (set to zero in the example). The algorithm chooses the split that maximizes the gain, where gain is the sum of the left and right leaves' similarity scores minus the parent node's similarity score. After comparing gains across all candidate thresholds, the best split becomes the root decision in the stage-two tree.
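A sketch of the similarity-score and gain search, continuing the snippet above (λ stays at zero to match the example):

```python
def similarity(res, probs, lam=0.0):
    # Squared sum of residuals over sum of p_prev * (1 - p_prev), plus lambda.
    return res.sum() ** 2 / ((probs * (1 - probs)).sum() + lam)

prev_probs = np.full_like(residuals, prev_prob)  # previous probability per row
parent_sim = similarity(residuals, prev_probs)

best_gain, best_t = -np.inf, None
for t in thresholds:
    left = cgpa <= t
    gain = (similarity(residuals[left], prev_probs[left])
            + similarity(residuals[~left], prev_probs[~left])
            - parent_sim)
    if gain > best_gain:
        best_gain, best_t = gain, t
print(best_t, best_gain)  # the winning threshold becomes the root split
```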
Once the tree structure is fixed, each leaf receives an output value computed from the residuals and previous probabilities using a classification-specific formula: the sum of the leaf's residuals divided by the same Σ p_prev(1 − p_prev) + λ denominator used in the similarity score, but with the numerator left unsquared. The stage-two model then adds the learning-rate-scaled leaf outputs to the stage-one log-odds predictions. Predictions are still produced in log-odds space, then converted back to probabilities when residuals are recomputed.
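And the leaf outputs plus the log-odds update, continuing the same sketch; the learning rate of 0.3 is an assumption (the library default), since the transcript does not pin down a value:

```python
eta = 0.3  # learning rate; 0.3 is the library default, assumed here

def leaf_output(res, probs, lam=0.0):
    # Same denominator as the similarity score, numerator not squared.
    return res.sum() / ((probs * (1 - probs)).sum() + lam)

left = cgpa <= best_t
outputs = np.where(left,
                   leaf_output(residuals[left], prev_probs[left]),
                   leaf_output(residuals[~left], prev_probs[~left]))

# Stage-two model: previous log-odds plus learning-rate-scaled leaf outputs.
log_odds = base_log_odds + eta * outputs
new_probs = 1 / (1 + np.exp(-log_odds))  # used when residuals are recomputed
```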
The process repeats: stage three adds another tree trained on the updated residuals, and so on until the residuals get very close to zero. The transcript emphasizes that the only real conceptual difference from earlier gradient boosting is how XGBoost constructs trees, through the similarity score/gain machinery and the leaf output formula, while the overall additive, stage-wise correction loop remains the same. The next step, it notes, is to derive those formulas mathematically in a follow-up.
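Putting the stages together, here is a self-contained toy version of the whole loop, one feature and one split per tree, on hypothetical data. It is an illustration of the procedure described above, not the real xgboost library:

```python
import numpy as np

def toy_xgb_classifier(x, y, n_rounds=10, eta=0.3, lam=0.0):
    """Illustrative loop: one feature, one split per tree. Not the real library."""
    p = y.mean()
    log_odds = np.full(len(y), np.log(p / (1 - p)))  # stage-one constant
    for _ in range(n_rounds):
        probs = 1 / (1 + np.exp(-log_odds))          # back to probabilities
        res = y - probs                              # pseudo-residuals
        hess = probs * (1 - probs)                   # denominator terms

        def sim(mask):                               # similarity score
            return res[mask].sum() ** 2 / (hess[mask].sum() + lam)

        def out(mask):                               # leaf output value
            return res[mask].sum() / (hess[mask].sum() + lam)

        xs = np.sort(x)
        thresholds = (xs[:-1] + xs[1:]) / 2
        # Parent similarity is identical for every candidate, so maximizing
        # left + right similarity is the same as maximizing gain.
        best = max(thresholds, key=lambda t: sim(x <= t) + sim(x > t))
        left = x <= best
        log_odds += eta * np.where(left, out(left), out(~left))
    return 1 / (1 + np.exp(-log_odds))

cgpa = np.array([5.1, 6.8, 7.5, 8.4, 9.3])
placed = np.array([0, 0, 1, 1, 1])
print(toy_xgb_classifier(cgpa, placed))  # probabilities drift toward 0 and 1
```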
Cornell Notes
XGBoost classification builds an additive model in stages, where each new decision tree is trained to reduce the current model’s errors. Stage one starts with a constant base prediction in log-odds space, then computes pseudo-residuals by converting the log-odds to probabilities and subtracting them from the true labels. Stage two trains a decision tree that splits on CGPA by evaluating candidate thresholds and choosing the one that maximizes gain, where gain is driven by a “similarity score” computed from residuals and previous probabilities. After the tree structure is chosen, each leaf gets an output value derived from residuals and previous probabilities, scaled by the learning rate. Repeating this residual-fixing loop yields progressively better predictions until residuals approach zero.
Why does the classification workflow operate in log-odds space instead of probabilities directly?
How are pseudo-residuals used to train the next tree?
What determines the best split in the stage-two decision tree?
How is a leaf output value computed once the tree structure is fixed?
How do stage-wise updates combine into final predictions?
Review Questions
- In the transcript’s setup, what quantity is constant in stage one, and how is it converted into probabilities for residual computation?
- What role do similarity score and gain play in choosing a split threshold during stage two?
- After the stage-two tree is built, how is each leaf’s output value computed and incorporated into the model’s log-odds predictions?
Key Points
1. XGBoost classification uses log-odds as the working prediction space and converts to probabilities only when computing residuals.
2. Stage one starts with a constant base log-odds prediction derived from the positive-class probability p.
3. Pseudo-residuals are computed per observation as the difference between the true label and the current probability estimate.
4. Stage two trains a decision tree by sorting by CGPA, testing candidate split thresholds, and selecting the split that maximizes gain computed from similarity scores.
5. The similarity score is the squared sum of a node’s residuals divided by the sum of the previous probabilities’ p(1 − p) terms plus λ (set to zero in the example).
6. Leaf output values are the sum of a leaf’s residuals divided by the same Σ p(1 − p) + λ denominator, then scaled by the learning rate η.
7. The process repeats stage by stage, each new tree targeting the updated residuals, until residuals approach zero.