Gradient Boosting

## MST0052 -- Lecture 12

### Gradient Boosting

Fall 2026

---

## Where we are

| Lectures | Phase |
|----------|-------|
| 1--3 | Foundations |
| 4--7 | Core methods |
| **9--14** | **Going further -- you are here** |
| 15--16 | Wrapping up |

---

## Today's plan

- **Boosting vs bagging** -- two answers to the same question
- The **gradient boosting** algorithm, in one line
- The three knobs that matter: **learning rate**, **n_estimators**, **max_depth**
- Production implementations: **HistGBM**, **XGBoost**, **LightGBM**
- Worked example: gradient boosting vs random forest vs SVM on breast cancer

---

## Two strategies for "many weak models"

Both take a **weak base learner** (typically a small tree) and combine many of them. They differ in *how*.

- **Bagging (L10):** fit each tree on a different bootstrap sample, **independently**, then average → **variance down**, bias unchanged
- **Boosting (today):** fit trees **sequentially**, each focused on what the current ensemble still gets wrong → **bias down** (variance can drop too, with a small learning rate and subsampling), at the cost of fit time and overfitting risk

---

## When boosting beats bagging

- Bagging **plateaus** once the variance has been averaged away -- it cannot remove whatever bias the base learner has
- Boosting keeps reducing error by going after the **residuals** -- the part the ensemble has not learned yet
- On **clean, low-noise** tabular data: boosting usually wins
- On **noisy** data where residuals are mostly noise: boosting can chase the noise; bagging is steadier

---

## Why this contrast matters for your project

For a tabular project:

- **Try both, under the same CV protocol.** Report both numbers honestly. The winner is the winner.
- For the oral exam: be able to explain *why* a forest or a boosting model won on **your specific dataset**.

---

## A naive first model

Fit a small decision tree (depth 3, say) on the training data.

The model is OK -- it captures the strongest signal -- but gets a lot wrong.

> **Key observation:** the *residuals* (what the model missed) are themselves a target that another model could try to predict.

---

## Fit the next model to the residuals

Compute the residuals from the first tree:

$$r_i^{(1)} = y_i - \hat{y}_i^{(1)}$$

Fit a second small tree to the residuals (not to $y$). Add it to the ensemble with a small weight $\eta$ (the learning rate):

$$F_2(x) = F_1(x) + \eta \cdot h_2(x)$$

Repeat. Each new tree focuses on what the **current** ensemble still gets wrong.
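
The two stages above can be sketched in a few lines. This is a minimal illustration, assuming a synthetic noisy-sine dataset and scikit-learn's `DecisionTreeRegressor` as the small tree; the data and the names `tree1`, `tree2`, `F1`, `F2` are made up for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(200, 1))             # synthetic 1-D inputs (assumption)
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 200)    # noisy sine target (assumption)

eta = 0.1  # learning rate

# Stage 1: a small tree fitted to y itself.
tree1 = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
F1 = tree1.predict(X)

# Stage 2: a second small tree fitted to the residuals, not to y.
residuals = y - F1
tree2 = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, residuals)
F2 = F1 + eta * tree2.predict(X)

mse1 = np.mean((y - F1) ** 2)
mse2 = np.mean((y - F2) ** 2)
print(f"train MSE after tree 1: {mse1:.4f}")
print(f"train MSE after tree 2: {mse2:.4f}")
```

The second number is only slightly smaller than the first: with a small $\eta$, each tree contributes a small correction, which is why many trees are needed.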

---

## From residuals to gradients

Residuals are the negative gradient of squared-error loss:

$$y_i - \hat{y}_i \;=\; -\frac{\partial}{\partial \hat{y}_i} \tfrac{1}{2}(y_i - \hat{y}_i)^2$$

For other losses (logistic for classification, absolute error, etc.), the "thing the next tree should target" is the **negative gradient** of the loss with respect to the current prediction.

That is the generalisation. The algorithm is the same; only the **target of each new tree** changes with the loss.
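
To make "the target changes with the loss" concrete, here is a small sketch of the pseudo-residuals for three common losses. The function names are hypothetical; for log loss the sketch assumes labels in {0, 1} and that the model predicts a raw score `F` that is passed through a sigmoid.

```python
import numpy as np

def neg_gradient_squared_error(y, F):
    """Negative gradient of 0.5 * (y - F)^2 with respect to F."""
    return y - F

def neg_gradient_log_loss(y, F):
    """Negative gradient of binary log loss with respect to the raw score F.

    y is in {0, 1}; the predicted probability is sigmoid(F).
    """
    p = 1.0 / (1.0 + np.exp(-F))
    return y - p

def neg_gradient_absolute_error(y, F):
    """Negative (sub)gradient of |y - F| with respect to F."""
    return np.sign(y - F)

y = np.array([1.0, 0.0, 1.0])
F = np.array([0.0, 0.0, 2.0])   # current raw predictions
print(neg_gradient_squared_error(y, F))
print(neg_gradient_log_loss(y, F))
print(neg_gradient_absolute_error(y, F))
```

For log loss the target is "label minus predicted probability", which looks like a residual even though no squared error is involved.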

---

## The gradient boosting update, in one line

At each step $m$:

1. Compute the **negative gradient** of the loss at the current predictions (the "pseudo-residuals")
2. Fit a new weak learner $h_m$ to those pseudo-residuals
3. Update:

$$F_m(x) = F_{m-1}(x) + \eta \cdot h_m(x)$$

That is the **entire algorithm.** Everything else (regularisation, subsampling, histogram splits) is engineering on top.
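
The three steps can be written out as a toy loop. This is a sketch for binary log loss, not how any production library implements it: it assumes a synthetic dataset from `make_classification`, starts from a constant model $F_0$ (the log-odds of the positive class), and uses the raw tree output as $h_m$. Real libraries typically also re-optimise the leaf values. The function names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeRegressor

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y, F):
    p = np.clip(sigmoid(F), 1e-12, 1 - 1e-12)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def fit_gradient_boosting(X, y, n_trees=50, eta=0.1, max_depth=3):
    """Toy gradient boosting for binary log loss (y in {0, 1})."""
    # F_0: a constant model, here the log-odds of the positive class.
    pos_rate = y.mean()
    F0 = np.log(pos_rate / (1 - pos_rate))
    F = np.full(len(y), F0)
    trees, losses = [], [log_loss(y, F)]
    for _ in range(n_trees):
        pseudo_residuals = y - sigmoid(F)             # step 1: negative gradient
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, pseudo_residuals)                 # step 2: fit weak learner
        F = F + eta * tree.predict(X)                 # step 3: update
        trees.append(tree)
        losses.append(log_loss(y, F))
    return F0, trees, losses

def predict_proba(X, F0, trees, eta=0.1):
    """Probability of the positive class; eta must match the value used in fitting."""
    F = F0 + eta * sum(tree.predict(X) for tree in trees)
    return sigmoid(F)

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
F0, trees, losses = fit_gradient_boosting(X, y)
print(f"training log loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

The training loss falls at every iteration here; whether the validation loss does too is exactly what the learning rate and early stopping are for.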

---

## Why the learning rate matters

$\eta$ controls *how much* each new tree is allowed to nudge the ensemble.

| `learning_rate` | Per-tree effect | What it buys |
|-----------------|-----------------|--------------|
| **Small** (0.01--0.1) | Tiny correction each step | Robust, slow, need many trees |
| **Large** (0.5--1.0) | Big correction each step | Fast, easy to overfit |

> Use a **small $\eta$** and let CV (or early stopping) choose `n_estimators`.

---

## Bias and variance, revisited

- Each weak learner has **low capacity** (depth 3 trees) → individually high bias
- Boosting **reduces bias** by stacking many corrections
- It can also **reduce variance**: with a small learning rate, the ensemble behaves like an average of many small corrections
- But: too many trees + too large $\eta$ → fits noise → variance up, generalisation down

The bias-variance dial here is **`n_estimators × learning_rate`** together, not either one alone.
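
One way to see the two knobs acting together is to trace the validation loss tree by tree for a small and a large learning rate. The sketch below assumes a synthetic dataset with some label noise and uses scikit-learn's classic `GradientBoostingClassifier`, whose `staged_predict_proba` yields the predictions after each added tree. The exact numbers depend on the data and seed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic data with 10% flipped labels (an assumption for illustration).
X, y = make_classification(n_samples=600, n_features=10, flip_y=0.1,
                           random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

curves = {}
for eta in (0.05, 1.0):
    model = GradientBoostingClassifier(
        learning_rate=eta, n_estimators=200, max_depth=3, random_state=0
    ).fit(X_tr, y_tr)
    # One validation loss per ensemble size: 1 tree, 2 trees, ..., 200 trees.
    curves[eta] = [
        log_loss(y_val, proba) for proba in model.staged_predict_proba(X_val)
    ]

for eta, curve in curves.items():
    best = min(range(len(curve)), key=curve.__getitem__) + 1
    print(f"eta={eta}: best validation log loss {min(curve):.3f} at {best} trees")
```

Plotting the two curves is a useful exercise: the interesting quantity is where each curve bottoms out, not the loss at the last tree.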

---

## Boosting on a 2D problem

![Gradient boosting decision boundary at iterations 1, 10, and 100](/figures/boosting-iterations.svg)

- **Iteration 1:** roughly a single split, very wrong in places
- **Iteration 10:** more nuanced, capturing the broad shape
- **Iteration 100:** a precise boundary hugging the class structure

---

## The three primary hyperparameters

| Parameter | What it controls | Default starting point |
|-----------|------------------|-------------------------|
| `learning_rate` ($\eta$) | Step size per tree | 0.05--0.1 |
| `n_estimators` (`max_iter` in HistGBM) | Total number of trees | 200--1000 (use early stopping) |
| `max_depth` | Capacity of each individual tree | 3--6 |

These three **interact**:

- Lower learning rate → more trees needed
- Deeper trees → smaller learning rate to compensate

---

## Early stopping

Instead of fitting all `n_estimators` and reading CV later:

- Monitor a **validation score** during training
- **Stop** when it stops improving for some patience window

Saves compute and avoids the manual "how many trees?" question.

```python
HistGradientBoostingClassifier(
    learning_rate=0.05, max_depth=4,
    max_iter=2000,
    early_stopping=True,   # the default 'auto' only enables it above 10,000 samples
    validation_fraction=0.1,
    n_iter_no_change=20,
)
```

---

## Stochastic boosting (`subsample`)

Fit each tree on a random **subsample** (say 80%) of the training rows.

- Adds randomness → reduces overfitting, regularises the ensemble
- Cheap variance-reduction trick

The default is 1.0 (no subsampling) in XGBoost, LightGBM, and scikit-learn's `GradientBoostingClassifier`; **0.5--0.8** is often slightly better than 1.0.
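
A quick way to check whether row subsampling helps on a given dataset is to compare a few values under the same CV. This sketch assumes the breast cancer data and uses the classic `GradientBoostingClassifier` because it exposes a `subsample` parameter; XGBoost and LightGBM have a parameter of the same name. The scores are whatever your run produces, so no winner is claimed here.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

for subsample in (1.0, 0.8, 0.5):
    model = GradientBoostingClassifier(
        learning_rate=0.05, n_estimators=200, max_depth=3,
        subsample=subsample,      # fraction of rows drawn for each tree
        random_state=42,
    )
    scores = cross_val_score(model, X, y, cv=5, scoring='f1')
    print(f"subsample={subsample}: F1 {scores.mean():.3f} ± {scores.std():.3f}")
```

Read the differences against the std: gaps smaller than the fold-to-fold spread are not evidence either way.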

---

## L1 / L2 regularisation on leaf weights

Modern implementations regularise the *weights assigned to each leaf* of every tree.

| Library | L2 knob | L1 knob |
|---------|---------|---------|
| sklearn `HistGradientBoosting` | `l2_regularization` | -- |
| XGBoost | `reg_lambda` | `reg_alpha` |
| LightGBM | `reg_lambda` | `reg_alpha` |

Useful when the dataset is noisy and trees are tempted to make extreme predictions.
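
To see what an L2 penalty does to a single leaf, consider the squared-error case, where the penalised leaf value has a closed form. This is a worked sketch under that assumption; the libraries above apply the same idea through gradient and Hessian sums, and the function name `leaf_value` is hypothetical.

```python
def leaf_value(residuals, l2=0.0):
    """Leaf value for squared-error loss with an L2 penalty on the leaf weight.

    Minimises 0.5 * sum((r - w)^2) + 0.5 * l2 * w^2 over w,
    which gives w = sum(r) / (n + l2).
    """
    return sum(residuals) / (len(residuals) + l2)

residuals_in_leaf = [2.0, 2.0, 2.0, 2.0]
print(leaf_value(residuals_in_leaf, l2=0.0))   # 2.0, the plain mean
print(leaf_value(residuals_in_leaf, l2=4.0))   # 1.0, shrunk towards zero
```

The penalty matters most for leaves with few samples: the denominator `n + l2` is dominated by `l2` when `n` is small, so extreme predictions from tiny leaves are damped.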

---

## A sensible default protocol

1. Start with `learning_rate=0.05`, `max_depth=4`, `n_estimators=2000` + **early stopping**
2. CV `learning_rate × max_depth` on a small grid; let early stopping pick the number of trees inside each fold
3. Tune `subsample` and regularisation **only if** the basics don't reach satisfying CV scores

This is what the strong project entries in past semesters have looked like.

---

## Three implementations to know

- **`sklearn.ensemble.HistGradientBoostingClassifier`** -- scikit-learn's modern histogram-based implementation. **Built-in**, fast, good defaults, supports early stopping and missing values natively.
- **XGBoost (`xgboost.XGBClassifier`)** -- battle-tested, widely used in industry and competitions, rich regularisation and tooling.
- **LightGBM (`lightgbm.LGBMClassifier`)** -- usually the fastest, especially on large datasets; leaf-wise growth instead of level-wise.
- Also worth knowing, as a fourth option: **CatBoost** -- strong on **categorical features** without explicit encoding.

Same algorithm, three competitive engineering choices. **None is "the right" answer.**

---

## Histogram-based gradient boosting

Classical gradient boosting evaluates splits on **raw continuous values** -- slow.

Histogram boosting:

- Bins each feature into ~256 buckets up front
- Evaluates splits on **bins**
- Often **orders of magnitude faster** on large datasets (tens of thousands of rows and up); almost no loss in accuracy

All three modern libraries use this.
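
The binning step can be illustrated with NumPy alone. This is a sketch of the idea, assuming quantile-based bin edges on one synthetic feature; the libraries' actual binning code differs in detail.

```python
import numpy as np

rng = np.random.default_rng(0)
feature = rng.normal(size=100_000)      # one continuous feature (synthetic)

n_bins = 256
# Interior bin edges at evenly spaced quantiles, computed once before training.
edges = np.quantile(feature, np.linspace(0, 1, n_bins + 1)[1:-1])
# Each value is replaced by the index of its bin (0 .. n_bins - 1).
binned = np.searchsorted(edges, feature).astype(np.uint8)

print(f"distinct raw values (exact-split candidates): {np.unique(feature).size}")
print(f"distinct bins (histogram-split candidates):   {np.unique(binned).size}")
```

A split search over raw values has to consider a threshold between every pair of neighbouring values; after binning it only considers the bin boundaries, and the data fits in one byte per entry.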

---

## When to use which

| Library | Reach for it when |
|---------|------------------|
| **`HistGradientBoostingClassifier`** | Zero extra dependencies; sklearn workflows; small-to-medium tabular data |
| **XGBoost** | Maximum regularisation knobs, robust defaults, broad community support |
| **LightGBM** | Largest datasets, fastest training; you don't mind being slightly more careful about overfitting |
| **CatBoost** | Datasets with many high-cardinality categorical features |

> Any of these is defensible. Pick one and document it.

---

## Installing the others

`HistGradientBoostingClassifier` is in scikit-learn -- **no install needed**.

```bash
pip install xgboost
pip install lightgbm
```

Both work out of the box on modern Python.

> If you go beyond scikit-learn, **commit your `requirements.txt`** (L2 reminder).

---

## Pitfall 1: tuning everything at once

A grid over six knobs explodes combinatorially and hides the signal in noise.

**Defence:**

- Tune `learning_rate × max_depth` first
- Early stopping handles `n_estimators`
- Touch `subsample` and regularisation **only if needed**

---

## Pitfall 2: comparing GBM to RF with different protocols

Easy to give boosting an advantage by accident:

- Different CV folds
- Different random seeds
- More compute budget

**Fair comparison:** same train/test split, same CV splitter, same metric. (L7's selection rule.)

---

## Pitfall 3: ignoring overfitting after CV

Boosting can fit the **CV folds** while still missing the test set.

Always look at:

- **CV mean** -- is it good?
- **CV std** -- is it stable?
- **Test score** -- does it match CV?

If CV is excellent but test is much worse → overfitting your tuning to the folds.

**Defence:** smaller grid, simpler models, repeated CV.

---

## What to report in the project

For your boosting model:

1. The **library** you used and its version
2. The **CV protocol**, including the early-stopping setup
3. The values of `learning_rate`, `max_depth`, and the number of trees chosen by early stopping
4. **CV mean ± std**, and the **test score**, once
5. A **direct comparison** to your random-forest baseline (and ideally your linear baseline)

Step 5 is the point. A boosting model that *barely* beats a random forest is a finding worth reporting honestly.

---

## The setup

- **Dataset:** `load_breast_cancer()` -- same as L10 and L11
- **Protocol:**
  - 80/20 split, `random_state=42`, stratified
  - Stratified 5-fold CV on the training set
  - Metric: `f1`
- **Today's contender:** `HistGradientBoostingClassifier` with early stopping
- **Baselines on the same table:** the L10 random forest and the L11 RBF SVM

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
```

---

## A first gradient-boosting fit

```python
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import cross_val_score

gb = HistGradientBoostingClassifier(
    learning_rate=0.05, max_depth=4,
    max_iter=2000, early_stopping=True, random_state=42,
)
scores = cross_val_score(gb, X_train, y_train, cv=5, scoring='f1')
print(f"HistGBM CV F1: {scores.mean():.3f} ± {scores.std():.3f}")
```

A sensible starting configuration plus early stopping. Already competitive.

---

## Tuning `learning_rate` and `max_depth`

```python
from sklearn.model_selection import GridSearchCV

grid = {
    'learning_rate': [0.02, 0.05, 0.1, 0.2],
    'max_depth':     [3, 4, 6],
}

search = GridSearchCV(gb, grid, cv=5, scoring='f1', n_jobs=-1)
search.fit(X_train, y_train)
print(f"Best: {search.best_params_}")
print(f"CV F1: {search.best_score_:.3f}")
```

![CV F1 across the learning_rate × max_depth grid](/figures/boosting-cv-heatmap.svg)

Tune the two interacting knobs together. The number of trees (`max_iter` for HistGBM) is handled by early stopping inside each fold.

---

## XGBoost on the same dataset

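```python
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Carve a validation set out of the training data for early stopping;
# the test set stays untouched until the final evaluation.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=42, stratify=y_train
)

xgb = XGBClassifier(
    learning_rate=0.05, max_depth=4, n_estimators=2000,
    eval_metric='logloss', early_stopping_rounds=50,
    random_state=42, n_jobs=-1,
)
xgb.fit(X_tr, y_tr,
        eval_set=[(X_val, y_val)], verbose=False)
```

Same code structure for XGBoost. Note that the early-stopping argument style differs: XGBoost needs an explicit `eval_set`, which should come from the training data, never from the test set.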

Expect very similar numbers to `HistGradientBoostingClassifier` on this dataset.

---

## The three-family comparison

| Family | CV F1 (mean ± std) | Test F1 | Notes |
|--------|---------------------|---------|-------|
| Random forest (L10) | -- | -- | Defaults, no scaling |
| RBF SVM (L11) | -- | -- | Tuned `C × gamma` grid |
| HistGBM (today) | -- | -- | Early stopping + small grid |

Walk through the table:

- Are the **gaps** bigger than the **std**?
- Which is **simplest**?
- Which can you **defend** most easily?

---

## Reading feature importance from boosting

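```python
import pandas as pd
from sklearn.inspection import permutation_importance

gb.fit(X_train, y_train)
# HistGradientBoostingClassifier has no feature_importances_ attribute,
# so use permutation importance (computed on the training set here).
result = permutation_importance(
    gb, X_train, y_train, scoring='f1', n_repeats=10, random_state=42,
)
importances = pd.Series(
    result.importances_mean, index=X_train.columns,
).sort_values(ascending=False)
print(importances.head(8))
```

- `HistGradientBoostingClassifier` does not expose impurity-based `feature_importances_`, so the code uses **permutation importance** (L10)
- Built-in importances in other boosting libraries carry the **same caveats as L10**: biased toward high-cardinality and correlated features
- **SHAP values** are the next-level interpretation tool -- used in past A-grade projects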

---

## Summary

- Boosting fits weak learners **sequentially**, each targeting the **negative gradient** of the loss
- The bias-variance dial is **`learning_rate × n_estimators`** together -- small steps, many of them
- Use **early stopping**. Tune `learning_rate × max_depth`. Touch other knobs only when needed.
- **HistGBM / XGBoost / LightGBM** are the three engineering choices; the algorithm is the same
- Boosting often wins tabular projects -- **when carefully tuned**. When it doesn't, random forests are still excellent.

---

## Before Lecture 13

- Run today's HistGBM comparison on your own machine. Add it to your project as a **third or fourth model family**.
- This is the **last new supervised family** in the course. From here on, you have everything you need to finish the modelling part of the project.
- Read ahead: **Lecture 13 is clustering** -- back to unsupervised methods, with k-means and hierarchical clustering.

---

## Questions

Open the floor.

## Backup slides

Use if questions arise or time allows.

## AdaBoost, the original

The first practical boosting algorithm (Freund and Schapire, 1997):

- Initialise equal weights on training points
- At each round, fit a weak learner; up-weight the **misclassified** points; down-weight the others
- Combine learners with weights based on their accuracy

```python
from sklearn.ensemble import AdaBoostClassifier
ada = AdaBoostClassifier(n_estimators=200, random_state=0)
```

Pedagogically useful for showing that boosting **predates** gradient boosting. Rarely used today on tabular data -- gradient boosting dominates.
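
One round of the weight update can be worked through by hand. The sketch below uses the classic discrete AdaBoost formulas with labels in {-1, +1}; the five labels and predictions are made up for illustration.

```python
import numpy as np

y_true = np.array([1, 1, -1, -1, 1])     # labels in {-1, +1}
y_pred = np.array([1, 1, -1, -1, -1])    # weak learner misses the last point

w = np.full(len(y_true), 1 / len(y_true))    # equal initial weights
err = np.sum(w * (y_pred != y_true))         # weighted error rate
alpha = 0.5 * np.log((1 - err) / err)        # this learner's vote weight

# Misclassified points are scaled by exp(+alpha), correct ones by exp(-alpha).
w = w * np.exp(-alpha * y_true * y_pred)
w = w / w.sum()                              # renormalise to sum to 1

print(f"weighted error: {err:.2f}, alpha: {alpha:.3f}")
print("new weights:", np.round(w, 3))
```

After one round, the single misclassified point carries half of the total weight, so the next weak learner is pushed hard to get it right.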

## Categorical handling in modern boosting

```python
HistGradientBoostingClassifier(categorical_features=[2, 5, 7])
```

- **scikit-learn HistGBM:** native categorical support (boolean mask or column indices)
- **LightGBM:** `categorical_feature=` parameter, native support
- **CatBoost:** the name -- "Cat" for categorical -- built around this

For project datasets with many string columns, **native handling is often faster than one-hot encoding** and usually at least as accurate.

## Missing-value handling

All modern boosting libraries handle `NaN` natively:

- At each split, the algorithm learns an **optimal direction** to send missing values
- No imputation needed at the boosting step
- Compare to L3 where you had to impute *before* the model

```python
gb.fit(X_with_nans, y)   # just works
```

Useful when missingness is informative (e.g., "missing" really means "not applicable").

## SHAP values for boosting

Model-agnostic but **boosting-friendly** interpretation tool:

```python
import shap

explainer = shap.TreeExplainer(gb)
shap_values = explainer(X_test)

shap.plots.beeswarm(shap_values)
```

- Per-prediction contributions of each feature
- Beeswarm plot shows distribution + direction of effects across the dataset
- **Defensible interpretation** for the oral exam -- past A-grade projects used SHAP

## Hyperparameter search budgets

When the grid gets large:

```python
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import loguniform, randint

param_dist = {
    'learning_rate': loguniform(0.01, 0.3),
    'max_depth':     randint(3, 8),
    'l2_regularization': loguniform(1e-3, 10),
}

search = RandomizedSearchCV(gb, param_dist, n_iter=40,
                            cv=5, scoring='f1', n_jobs=-1,
                            random_state=42)  # fixed seed: reproducible draws
```

Boosting with a large search space is where randomised or halving search (L7 backup) shines.

---

## What's next

**Lecture 13:** Clustering

- K-means in depth
- Hierarchical clustering
- Choosing the number of clusters