## MST0052 -- Lecture 12

### Gradient Boosting

Fall 2026

---

## Where we are

| Lectures | Phase |
|----------|-------|
| 1--3 | Foundations |
| 4--7 | Core methods |
| **9--14** | **Going further -- you are here** |
| 15--16 | Wrapping up |

---

## Today's plan

- **Boosting vs bagging** -- two answers to the same question
- The **gradient boosting** algorithm, in one line
- The three knobs that matter: **learning rate**, **n_estimators**, **max_depth**
- Production implementations: **HistGBM**, **XGBoost**, **LightGBM**
- Worked example: gradient boosting vs random forest vs SVM on breast cancer

---

## Two strategies for "many weak models"

Both take a **weak base learner** (typically a small tree) and combine many of them. They differ in *how*.

- **Bagging (L10):** fit each tree on a different bootstrap sample, **independently**, then average → **variance down**, bias unchanged
- **Boosting (today):** fit trees **sequentially**, each focused on what the current ensemble still gets wrong → **bias and variance both drop**, at the cost of fit time and overfitting risk

---

## When boosting beats bagging

- Bagging **plateaus** once the variance has been averaged away -- it cannot reduce whatever bias the base learner has
- Boosting keeps reducing error by going after the **residuals** -- the part the ensemble has not learned yet
- On **clean, low-noise** tabular data: boosting usually wins
- On **noisy** data where residuals are mostly noise: boosting can chase the noise; bagging is steadier

---

## Why this contrast matters for your project

For a tabular project:

- **Try both, under the same CV protocol.** Report both numbers honestly. The winner is the winner.
- For the oral exam: be able to explain *why* a forest or a boosting model won on **your specific dataset**.

---

## A naive first model

Fit a small decision tree (depth 3, say) on the training data. The model is OK -- it captures the strongest signal -- but gets a lot wrong.
> **Key observation:** the *residuals* (what the model missed) are themselves a target that another model could try to predict.

---

## Fit the next model to the residuals

Compute the residuals from the first tree:

$$r\_i^{(1)} = y\_i - \hat{y}\_i^{(1)}$$

Fit a second small tree to the residuals (not to $y$). Add it to the ensemble with a small weight $\eta$ (the learning rate):

$$F\_2(x) = F\_1(x) + \eta \cdot h\_2(x)$$

Repeat. Each new tree focuses on what the **current** ensemble still gets wrong.

---

## From residuals to gradients

Residuals are the negative gradient of squared-error loss with respect to the current prediction:

$$y\_i - \hat{y}\_i \;=\; -\frac{\partial}{\partial \hat{y}\_i} \tfrac{1}{2}(y\_i - \hat{y}\_i)^2$$

For other losses (logistic for classification, absolute error, etc.), the "thing the next tree should target" is the **negative gradient** of the loss with respect to the current prediction. That is the generalisation.

The algorithm is the same; only the **target of each new tree** changes with the loss.

---

## The gradient boosting update, in one line

At each step $m$:

1. Compute the **negative gradient** of the loss at the current predictions (the "pseudo-residuals")
2. Fit a new weak learner $h\_m$ to those pseudo-residuals
3. Update: $$F\_m(x) = F\_{m-1}(x) + \eta \cdot h\_m(x)$$

That is the **entire algorithm.** Everything else (regularisation, subsampling, histogram splits) is engineering on top.

---

## Why the learning rate matters

$\eta$ controls *how much* each new tree is allowed to nudge the ensemble.

| `learning_rate` | Per-tree effect | What it buys |
|-----------------|-----------------|--------------|
| **Small** (0.01--0.1) | Tiny correction each step | Robust, slow, needs many trees |
| **Large** (0.5--1.0) | Big correction each step | Fast, easy to overfit |

> Use a **small $\eta$** and let CV (or early stopping) choose `n_estimators`.
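The update rule and the effect of $\eta$ can be sketched from scratch for squared-error loss. This is a minimal illustration, not how the production libraries implement it: the helper name `boost_mse`, the synthetic sine data, and the two learning rates are assumptions chosen for the demo.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data: a noisy sine curve.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)


def boost_mse(X, y, n_trees, eta, max_depth=3):
    """Squared-error gradient boosting; returns training MSE after each tree."""
    pred = np.full(len(y), y.mean())   # F_0: constant prediction
    history = []
    for _ in range(n_trees):
        residuals = y - pred           # negative gradient of 1/2 * (y - F)^2
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residuals)         # h_m is fitted to the residuals, not to y
        pred = pred + eta * tree.predict(X)   # F_m = F_{m-1} + eta * h_m
        history.append(np.mean((y - pred) ** 2))
    return history


mse_small = boost_mse(X, y, n_trees=50, eta=0.05)
mse_large = boost_mse(X, y, n_trees=50, eta=1.0)

print(f"eta=0.05: MSE after 1 tree {mse_small[0]:.4f}, after 50 trees {mse_small[-1]:.4f}")
print(f"eta=1.0 : MSE after 1 tree {mse_large[0]:.4f}, after 50 trees {mse_large[-1]:.4f}")
```

The larger step removes more training error per tree, which is exactly why it is both fast and easy to overfit. Note that this sketch tracks training error only; judging overfitting needs a validation set.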
---

## Bias and variance, revisited

- Each weak learner has **low capacity** (depth 3 trees) → individually high bias
- Boosting **reduces bias** by stacking many corrections
- It also reduces variance through the shrinkage (averaging) effect of a small learning rate
- But: too many trees + too large $\eta$ → fits noise → variance up, generalisation down

The bias-variance dial here is **`n_estimators × learning_rate`** together, not either one alone.

---

## Boosting on a 2D problem

- **Iteration 1:** roughly a single split, very wrong in places
- **Iteration 10:** more nuanced, capturing the broad shape
- **Iteration 100:** a precise boundary hugging the class structure

---

## The three primary hyperparameters

| Parameter | What it controls | Default starting point |
|-----------|------------------|-------------------------|
| `learning_rate` ($\eta$) | Step size per tree | 0.05--0.1 |
| `n_estimators` (`max_iter` in HistGBM) | Total number of trees | 200--1000 (use early stopping) |
| `max_depth` | Capacity of each individual tree | 3--6 |

These three **interact**:

- Lower learning rate → more trees needed
- Deeper trees → smaller learning rate to compensate

---

## Early stopping

Instead of fitting all `n_estimators` and reading CV later:

- Monitor a **validation score** during training
- **Stop** when it stops improving for some patience window

Saves compute and avoids the manual "how many trees?" question.

```python
from sklearn.ensemble import HistGradientBoostingClassifier

HistGradientBoostingClassifier(
    learning_rate=0.05,
    max_depth=4,
    max_iter=2000,            # upper bound on the number of trees
    early_stopping=True,      # the default 'auto' only enables it when n_samples > 10000
    validation_fraction=0.1,  # held out from the training data for monitoring
    n_iter_no_change=20,      # patience window
)
```

---

## Stochastic boosting (`subsample`)

Fit each tree on a random **subsample** (say 80%) of the training rows.

- Adds randomness → reduces overfitting, regularises the ensemble
- Cheap variance-reduction trick

The default is 1.0 (no subsampling) in scikit-learn's `GradientBoostingClassifier`, XGBoost and LightGBM; **0.5--0.8** is often slightly better than 1.0. Note that `HistGradientBoostingClassifier` does not expose a `subsample` parameter.

---

## L1 / L2 regularisation on leaf weights

Modern implementations regularise the *weights assigned to each leaf* of every tree.

| Library | L2 knob | L1 knob |
|---------|---------|---------|
| sklearn `HistGradientBoosting` | `l2_regularization` | -- |
| XGBoost | `reg_lambda` | `reg_alpha` |
| LightGBM | `reg_lambda` | `reg_alpha` |

Useful when the dataset is noisy and trees are tempted to make extreme predictions.

---

## A sensible default protocol

1. Start with `learning_rate=0.05`, `max_depth=4`, `n_estimators=2000` (`max_iter` in HistGBM) + **early stopping**
2. CV `learning_rate × max_depth` on a small grid; let early stopping pick the number of trees inside each fold
3. Tune `subsample` and regularisation **only if** the basics don't reach satisfying CV scores

This is what the strong project entries in past semesters have looked like.

---

## Three implementations to know

- **`sklearn.ensemble.HistGradientBoostingClassifier`** -- scikit-learn's modern histogram-based implementation. **Built-in**, fast, good defaults, supports early stopping and missing values natively.
- **XGBoost (`xgboost.XGBClassifier`)** -- battle-tested, widely used in industry and competitions, rich regularisation and tooling.
- **LightGBM (`lightgbm.LGBMClassifier`)** -- usually the fastest, especially on large datasets; leaf-wise growth instead of level-wise.
- **CatBoost** -- a fourth option worth knowing: strong on **categorical features** without explicit encoding.

Same algorithm, different engineering choices. **None is "the right" answer.**

---

## Histogram-based gradient boosting

Classical gradient boosting evaluates splits on **raw continuous values** -- slow.

Histogram boosting:

- Bins each feature into ~256 buckets up front
- Evaluates splits on **bins**
- **Much faster** on large datasets, often by orders of magnitude; almost no loss in accuracy

All three modern libraries use this.
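To see why binning makes split search cheaper, here is a rough NumPy sketch that counts candidate thresholds for one feature before and after binning. The quantile-based binning and the synthetic feature are assumptions for illustration; the real libraries differ in detail (for example, in how they reserve a bin for missing values).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)   # one continuous feature, hypothetical data

# Exact split search: one candidate threshold between each pair of distinct values.
n_exact = len(np.unique(x)) - 1

# Histogram split search: quantile bin edges are computed once, up front.
n_bins = 256
edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])   # 255 interior edges
binned = np.searchsorted(edges, x).astype(np.uint8)           # bin index 0..255 per row
n_hist = len(edges)

print(f"Candidate thresholds, exact search:     {n_exact}")
print(f"Candidate thresholds, histogram search: {n_hist}")
print(f"Storage per value: {x.dtype} -> {binned.dtype}")
```

After binning, every tree in the ensemble searches the same small, fixed set of thresholds, and each feature value fits in a single byte.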
---

## When to use which

| Library | Reach for it when |
|---------|------------------|
| **`HistGradientBoostingClassifier`** | Zero extra dependencies; sklearn workflows; small-to-medium tabular data |
| **XGBoost** | Maximum regularisation knobs, robust defaults, broad community support |
| **LightGBM** | Largest datasets, fastest training; you don't mind being slightly more careful about overfitting |
| **CatBoost** | Datasets with many high-cardinality categorical features |

> Any of these is defensible. Pick one and document it.

---

## Installing the others

`HistGradientBoostingClassifier` is in scikit-learn -- **no install needed**.

```bash
pip install xgboost
pip install lightgbm
```

Both work out of the box on modern Python.

> If you go beyond scikit-learn, **commit your `requirements.txt`** (L2 reminder).

---

## Pitfall 1: tuning everything at once

A grid over six knobs explodes combinatorially and hides the signal in noise.

**Defence:**

- Tune `learning_rate × max_depth` first
- Early stopping handles `n_estimators`
- Touch `subsample` and regularisation **only if needed**

---

## Pitfall 2: comparing GBM to RF with different protocols

Easy to give boosting an advantage by accident:

- Different CV folds
- Different random seeds
- More compute budget

**Fair comparison:** same train/test split, same CV splitter, same metric. (L7's selection rule.)

---

## Pitfall 3: ignoring overfitting after CV

Boosting can fit the **CV folds** while still missing the test set. Always look at:

- **CV mean** -- is it good?
- **CV std** -- is it stable?
- **Test score** -- does it match CV?

If CV is excellent but test is much worse → overfitting your tuning to the folds.

**Defence:** smaller grid, simpler models, repeated CV.

---

## What to report in the project

For your boosting model:

1. The **library** you used and its version
2. The **CV protocol**, including the early-stopping setup
3. 
The values of `learning_rate`, `max_depth`, and the number of trees chosen by early stopping
4. **CV mean ± std**, and the **test score**, computed once
5. A **direct comparison** to your random-forest baseline (and ideally your linear baseline)

Item 5 is the point. A boosting model that *barely* beats a random forest is a finding worth reporting honestly.

---

## The setup

- **Dataset:** `load_breast_cancer()` -- same as L10 and L11
- **Protocol:**
  - 80/20 split, `random_state=42`, stratified
  - Stratified 5-fold CV on the training set
  - Metric: `f1`
- **Today's contender:** `HistGradientBoostingClassifier` with early stopping
- **Baselines on the same table:** the L10 random forest and the L11 RBF SVM

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# 80/20 stratified split; the test set is touched once, at the very end
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
```

---

## A first gradient-boosting fit

```python
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import cross_val_score

gb = HistGradientBoostingClassifier(
    learning_rate=0.05,
    max_depth=4,
    max_iter=2000,
    early_stopping=True,
    random_state=42,
)

# cv=5 uses stratified folds by default for classifiers
scores = cross_val_score(gb, X_train, y_train, cv=5, scoring='f1')
print(f"HistGBM CV F1: {scores.mean():.3f} ± {scores.std():.3f}")
```

The starting values from the default protocol plus early stopping, with no tuning yet. Already competitive.

---

## Tuning `learning_rate` and `max_depth`

```python
from sklearn.model_selection import GridSearchCV

grid = {
    'learning_rate': [0.02, 0.05, 0.1, 0.2],
    'max_depth': [3, 4, 6],
}

search = GridSearchCV(gb, grid, cv=5, scoring='f1', n_jobs=-1)
search.fit(X_train, y_train)

print(f"Best: {search.best_params_}")
print(f"CV F1: {search.best_score_:.3f}")
```

Tune the two interacting knobs together. The number of trees (`max_iter` here, `n_estimators` elsewhere) is handled by early stopping inside each fold.
---

## XGBoost on the same dataset

```python
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Early stopping needs a validation set. Carve it out of the training data
# so the test set stays untouched until the final evaluation.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=42, stratify=y_train
)

xgb = XGBClassifier(
    learning_rate=0.05,
    max_depth=4,
    n_estimators=2000,
    eval_metric='logloss',
    early_stopping_rounds=50,
    random_state=42,
    n_jobs=-1,
)
xgb.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], verbose=False)
```

Same code structure for XGBoost. Note the early-stopping argument style is different: XGBoost needs an explicit `eval_set`, which must come from the training data, never from the test set. Expect very similar numbers to `HistGradientBoostingClassifier` on this dataset.

---

## The three-family comparison

| Family | CV F1 (mean ± std) | Test F1 | Notes |
|--------|---------------------|---------|-------|
| Random forest (L10) | -- | -- | Defaults, no scaling |
| RBF SVM (L11) | -- | -- | Tuned `C × gamma` grid |
| HistGBM (today) | -- | -- | Early stopping + small grid |

Walk through the table:

- Are the **gaps** bigger than the **std**?
- Which is **simplest**?
- Which can you **defend** most easily?

---

## Reading feature importance from boosting

```python
import pandas as pd
from sklearn.inspection import permutation_importance

gb.fit(X_train, y_train)

# HistGradientBoostingClassifier has no feature_importances_ attribute,
# so use permutation importance (computed here on the training data).
result = permutation_importance(
    gb, X_train, y_train, scoring='f1', n_repeats=10, random_state=42
)
importances = pd.Series(
    result.importances_mean,
    index=X_train.columns,
).sort_values(ascending=False)

print(importances.head(8))
```

- `HistGradientBoostingClassifier` does not expose `feature_importances_`, hence **permutation importance** (L10) above
- Impurity- or gain-based importances (`feature_importances_` in random forests and XGBoost) carry the **same caveats as L10**: biased toward high-cardinality and correlated features
- For honest interpretation: **permutation importance** (L10) or **SHAP values**
- SHAP is the next-level interpretation tool -- used in past A-grade projects

---

## Summary

- Boosting fits weak learners **sequentially**, each targeting the **negative gradient** of the loss
- The bias-variance dial is **`learning_rate × n_estimators`** together -- small steps, many of them
- Use **early stopping**. Tune `learning_rate × max_depth`. Touch other knobs only when needed.
- **HistGBM / XGBoost / LightGBM** are the three engineering choices; the algorithm is the same
- Boosting often wins tabular projects -- **when carefully tuned**. When it doesn't, random forests are still excellent.
---

## Before Lecture 13

- Run today's HistGBM comparison on your own machine. Add it to your project as a **third or fourth model family**.
- This is the **last new supervised family** in the course. From here on, you have everything you need to finish the modelling part of the project.
- Read ahead: **Lecture 13 is clustering** -- back to unsupervised methods, with k-means and hierarchical clustering.

---

## Questions

Open the floor.

--

## Backup slides

Use if questions arise or time allows.

--

## AdaBoost, the original

The first practical boosting algorithm:

- Initialise equal weights on training points
- At each round, fit a weak learner; up-weight the **misclassified** points; down-weight the others
- Combine learners with weights based on their accuracy

```python
from sklearn.ensemble import AdaBoostClassifier

ada = AdaBoostClassifier(n_estimators=200, random_state=0)
```

Pedagogically useful for showing that boosting **predates** gradient boosting. Rarely used today on tabular data -- gradient boosting dominates.

--

## Categorical handling in modern boosting

```python
HistGradientBoostingClassifier(categorical_features=[2, 5, 7])
```

- **scikit-learn HistGBM:** native categorical support (boolean mask or column indices)
- **LightGBM:** `categorical_feature=` parameter, native support
- **CatBoost:** the name -- "Cat" for categorical -- built around this

For project datasets with many string columns, **native handling often beats one-hot encoding** in both speed and accuracy.

--

## Missing-value handling

All modern boosting libraries handle `NaN` natively:

- At each split, the algorithm learns an **optimal direction** to send missing values
- No imputation needed at the boosting step
- Compare to L3 where you had to impute *before* the model

```python
gb.fit(X_with_nans, y)  # just works
```

Useful when missingness is informative (e.g., "missing" really means "not applicable").
--

## SHAP values for boosting

Model-agnostic but **boosting-friendly** interpretation tool:

```python
import shap

explainer = shap.TreeExplainer(gb)
shap_values = explainer(X_test)
shap.plots.beeswarm(shap_values)
```

- Per-prediction contributions of each feature
- Beeswarm plot shows distribution + direction of effects across the dataset
- **Defensible interpretation** for the oral exam -- past A-grade projects used SHAP

--

## Hyperparameter search budgets

When the grid gets large:

```python
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import loguniform, randint

param_dist = {
    'learning_rate': loguniform(0.01, 0.3),
    'max_depth': randint(3, 8),
    'l2_regularization': loguniform(1e-3, 10),
}

search = RandomizedSearchCV(gb, param_dist, n_iter=40, cv=5,
                            scoring='f1', n_jobs=-1)
```

Boosting + large grid = where randomised or halving search (L7 backup) shine.

---

## What's next

**Lecture 13:** Clustering

- K-means in depth
- Hierarchical clustering
- Choosing the number of clusters