Ensemble Methods

## MST0052 -- Lecture 10

### Ensemble Methods

Fall 2026

---

## Where we are

| Lectures | Phase |
|----------|-------|
| 1--3 | Foundations |
| 4--7 | Core methods |
| **9--14** | **Going further -- you are here** |
| 15--16 | Wrapping up |

---

## Today's plan

- The **decision tree** -- one rule at a time
- Why a single tree is the textbook **high-variance** learner
- **Bagging** -- averaging trees fitted on bootstrap samples
- **Random forests** -- bagging plus random feature subsets
- Tuning, feature importance, and what to report in the project

---

## A tree is a sequence of questions

![A small decision tree](/figures/decision-tree-example.svg)

The model: follow the answers down the tree, predict whatever leaf you land in.

No equations, no kernels, no scaling -- radically different machinery from L4-L7.

---

## How splits are chosen

At each node, pick the (feature, threshold) pair that maximises **node purity**, i.e. minimises the size-weighted impurity of the two children.

Two standard impurity measures (classification):

- **Gini:** $\text{Gini}(S) = 1 - \sum_c p_c^2$
- **Entropy:** $\text{Entropy}(S) = -\sum_c p_c \log_2 p_c$

The tree grows **greedily** -- best split at each node, no look-ahead.
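
The two impurity formulas and one greedy split search can be sketched in a few lines of NumPy. This is a minimal illustration, not scikit-learn's implementation: the helper names (`gini`, `entropy`, `weighted_gini`) and the toy data are made up here.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum_c p_c^2."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Entropy in bits: -sum_c p_c log2 p_c."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def weighted_gini(x, y, threshold):
    """Size-weighted Gini of the two children of the split x <= threshold."""
    left, right = y[x <= threshold], y[x > threshold]
    n = len(y)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Toy data (hypothetical): one feature, binary labels
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0, 0, 0, 1, 1, 1])

print(gini(y), entropy(y))   # a 50/50 node: 0.5 and 1.0

# Greedy search on one feature: try midpoints between sorted values
candidates = (x[:-1] + x[1:]) / 2
best = min(candidates, key=lambda t: weighted_gini(x, y, t))
print(best, weighted_gini(x, y, best))   # 3.5 separates the classes perfectly
```

A real tree repeats this search over every feature at every node, then recurses into the two children.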

---

## Trees for regression

Same algorithm, different impurity: minimise the **within-node variance** (or sum of squared residuals) of the target.

Leaf prediction:

- **Regression:** mean of the training targets in that leaf
- **Classification:** majority class in that leaf

One model family, two task types -- same tuning knobs.

---

## Why trees overfit

An unpruned tree keeps splitting until every leaf is **pure** (classification) or has one observation (regression).

- Training error: **zero**
- Test error: usually **terrible**

This is the textbook **high-variance** learner from L6 -- refit on a different sample and you get a wildly different tree.
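
A quick way to see the gap, sketched on synthetic data rather than the lecture's dataset. The dataset parameters, including the label noise set via `flip_y`, are assumptions chosen so there is noise for the tree to chase.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with label noise (flip_y), so a deep tree can memorise it
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# No depth limit, no pruning: the tree splits until every leaf is pure
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
train_acc = tree.score(X_tr, y_tr)
test_acc = tree.score(X_te, y_te)
print(f"train accuracy: {train_acc:.3f}")
print(f"test accuracy:  {test_acc:.3f}")
```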

---

## Controlling complexity in a single tree

| Knob | What it does |
|------|--------------|
| `max_depth` | Hard cap on tree depth |
| `min_samples_split` | Minimum samples a node must have to be split |
| `min_samples_leaf` | Minimum samples in each leaf |
| `ccp_alpha` | Cost-complexity pruning -- penalises tree size |

All trade variance for bias -- the same L6 dial in different clothing.
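
To see what each knob does to tree size, one option is to fit the same data four ways and compare depth and leaf count. A rough sketch on synthetic data; the specific values (`max_depth=3`, `min_samples_leaf=20`, `ccp_alpha=0.01`) are illustrative assumptions, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           flip_y=0.1, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X, y)
capped = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
floored = DecisionTreeClassifier(min_samples_leaf=20, random_state=0).fit(X, y)
pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X, y)

for name, model in [("unpruned", full), ("max_depth=3", capped),
                    ("min_samples_leaf=20", floored),
                    ("ccp_alpha=0.01", pruned)]:
    print(f"{name:22s} depth={model.get_depth():2d} "
          f"leaves={model.get_n_leaves():3d}")
```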

---

## The averaging idea

From L6: averaging many noisy unbiased estimates **reduces variance**, while bias stays the same.

If we had many **independent** trees fitted to **independent** samples, averaging their predictions would:

- inherit each tree's **bias** (unchanged)
- slash each tree's **variance**

We don't have many independent samples. **Bootstrap** is the trick we use to fake them.
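
The averaging claim is easy to check by simulation. The sketch below uses plain Gaussian noise as a stand-in for "one tree's prediction"; `sigma`, `B`, and `n_trials` are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, B, n_trials = 2.0, 25, 20_000

# Each row: B independent noisy estimates of the same quantity (true value 0)
estimates = rng.normal(loc=0.0, scale=sigma, size=(n_trials, B))

var_single = estimates[:, 0].var()          # variance of one estimate
var_average = estimates.mean(axis=1).var()  # variance of the B-average

print(f"single estimate: {var_single:.3f}  (theory {sigma**2:.3f})")
print(f"average of {B}:   {var_average:.3f}  (theory {sigma**2 / B:.3f})")
```

Both versions are centred on the true value; only the spread changes.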

---

## The bootstrap

A **bootstrap sample**: a random sample of size $n$ drawn **with replacement** from the original $n$ training rows.

- Each bootstrap sample contains roughly **63%** of the original rows ($1 - 1/e \approx 0.632$); the remaining draws are repeats
- The ~37% of rows never drawn are that sample's **out-of-bag (OOB)** rows

Draw $B$ bootstrap samples → fit $B$ trees → ensemble.
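
Where the 63% comes from: a given row is missed by one draw with probability $1 - 1/n$, so it is missed by all $n$ draws with probability $(1 - 1/n)^n \approx 1/e \approx 0.37$. A small NumPy check, with `n` chosen arbitrarily:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# One bootstrap sample: n row indices drawn with replacement
idx = rng.integers(0, n, size=n)
in_bag = np.unique(idx)                       # rows drawn at least once
oob = np.setdiff1d(np.arange(n), in_bag)      # rows never drawn

frac_unique = in_bag.size / n
print(f"unique rows in bag: {frac_unique:.3f}")
print(f"out-of-bag rows:    {oob.size / n:.3f}")
print(f"1 - 1/e           = {1 - np.exp(-1):.3f}")
```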

---

## Bagging in one slide

**Bootstrap AGGregating:**

1. Draw $B$ bootstrap samples
2. Fit a tree to each (unpruned, so each is low-bias)
3. Aggregate: **average** (regression) or **majority vote** (classification)

- Ensemble bias ≈ a single tree's bias
- Ensemble variance ≈ a single tree's variance ÷ $B$ (in the unrealistic limit of perfect independence)
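
The three steps can be written out by hand. This is a teaching sketch on scikit-learn's breast cancer data, not how `BaggingClassifier` is implemented internally. The choice of 101 trees (odd, to avoid tied votes) is an assumption made here.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

rng = np.random.default_rng(0)
B, n = 101, len(X_tr)
votes = np.zeros((B, len(X_te)), dtype=int)

for b in range(B):
    idx = rng.integers(0, n, size=n)              # 1. bootstrap sample
    tree = DecisionTreeClassifier(random_state=b)
    tree.fit(X_tr[idx], y_tr[idx])                # 2. unpruned tree
    votes[b] = tree.predict(X_te)

# 3. majority vote over the B trees (labels are 0/1)
y_pred = (votes.mean(axis=0) >= 0.5).astype(int)
bagged_acc = (y_pred == y_te).mean()
print(f"bagged accuracy ({B} trees): {bagged_acc:.3f}")
```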

---

## Out-of-bag error: free CV

Each tree's OOB rows were not seen during its training.

For every training row $i$, the trees that did not include $i$ in their bootstrap sample give an honest prediction.

Averaging those → an OOB estimate of generalisation, **without** refitting.

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=500, oob_score=True,
                            random_state=0)
rf.fit(X_train, y_train)
print(f"OOB accuracy: {rf.oob_score_:.3f}")
```

---

## Bagging vs a single tree, visually

![Decision boundary: single tree vs bagged ensemble](/figures/bagging-variance-reduction.svg)

- **Left:** a single deep tree -- jagged, axis-aligned, chasing noise
- **Right:** the bagged ensemble -- smoother, same general shape

This is variance reduction made visible.

---

## Bagging's weakness: correlated trees

Even with different bootstrap samples, trees tend to use the **same strong features** at the top splits.

→ Trees are **correlated** → correlated errors do not average away.

**Random forests** add one more layer of randomness to fix this.
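
The standard result behind this, stated with notation introduced here rather than taken from the slides: for $B$ tree predictions $T_1, \dots, T_B$, each with variance $\sigma^2$ and pairwise correlation $\rho$,

$$
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} T_b\right)
= \frac{1}{B^2}\left[B\sigma^2 + B(B-1)\rho\sigma^2\right]
= \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
$$

With $\rho = 0$ this is the $\sigma^2 / B$ of the independent case. With $\rho > 0$, adding trees only shrinks the second term, so the variance never falls below $\rho\,\sigma^2$. Lowering $\rho$ is the only way past that floor.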

---

## The random forest recipe

Bagging + **random feature subsets at each split**:

- At every candidate split, sample $m$ features at random from the $p$ available
- Best split is chosen only among those $m$

Typical defaults:

- **Classification:** $m = \sqrt{p}$
- **Regression:** $m = p/3$ (the textbook rule; scikit-learn's `RandomForestRegressor` defaults to all $p$ features)

---

## Why it works

Forcing each tree to look at **different feature subsets** makes the trees **less correlated**.

- Less correlation → averaging is more effective → variance drops further
- Price: each individual tree is slightly worse (fewer features to choose from)
- Net: the ensemble is meaningfully better

---

## Random forests in scikit-learn

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=500,
    max_features='sqrt',
    min_samples_leaf=1,
    n_jobs=-1,
    random_state=42,
)
rf.fit(X_train, y_train)
print(f"Test accuracy: {rf.score(X_test, y_test):.3f}")
```

- **No scaling needed** -- trees don't care about feature units
- `n_jobs=-1` parallelises tree fitting across cores

---

## Key hyperparameters

| Parameter | Effect | Typical range |
|-----------|--------|---------------|
| `n_estimators` | More trees → lower variance (diminishing past a few hundred) | 100--1000 |
| `max_features` | Lower → more decorrelation, higher per-tree bias | `'sqrt'`, `'log2'`, fractions |
| `max_depth` | Cap individual tree complexity | `None` or 5--30 |
| `min_samples_leaf` | Floor on leaf size; larger = more regularisation | 1--20 |

Tune with `GridSearchCV` (L7). `n_estimators` is almost never the decisive parameter.
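
A tuning run might look like the sketch below. The grid values are illustrative assumptions, and `n_estimators` is held fixed on purpose, since it is rarely the parameter that decides the outcome.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Illustrative grid only; the values are assumptions, not recommendations
param_grid = {
    "max_features": ["sqrt", "log2", 0.5],
    "min_samples_leaf": [1, 5, 10],
}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=100, random_state=42),
    param_grid, cv=5, scoring="f1",
)
search.fit(X_train, y_train)   # the test set is never touched here
print(search.best_params_)
print(f"best CV F1: {search.best_score_:.3f}")
```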

---

## Feature importance (impurity-based)

`rf.feature_importances_` averages, across all trees and splits, the impurity reduction attributable to each feature.

```python
import pandas as pd
importances = pd.Series(rf.feature_importances_,
                        index=feature_names)
print(importances.sort_values(ascending=False).head(10))
```

- Useful first pass for "which features matter"
- **But:** biased toward high-cardinality features and toward features chosen first when others are correlated

---

## Permutation importance (the honest answer)

Permute one feature's values across rows; refit nothing; measure the drop in the score on held-out data.

```python
from sklearn.inspection import permutation_importance

result = permutation_importance(
    rf, X_test, y_test,
    n_repeats=20, random_state=0, scoring='accuracy'
)
```

A feature with high permutation importance **actively** matters at prediction time. High impurity importance only means it was picked often while the trees were being grown.

Slower, more honest. **Use it before reporting feature importance in your project.**

---

## Partial dependence (brief)

For a feature of interest, average the model's prediction across the training data while varying that feature alone.

The result is a curve: how does the prediction change as that feature changes, holding everything else "average"?

```python
from sklearn.inspection import PartialDependenceDisplay
# Column names in load_breast_cancer(as_frame=True) contain spaces
PartialDependenceDisplay.from_estimator(rf, X_train, features=['mean radius'])
```

Cheap to compute. Useful for "what does this feature do?" answers.
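
What the curve actually computes can be sketched by hand: overwrite one column with a grid value, predict for every row, average, repeat. This is a simplified illustration on the breast cancer data. The 5%-95% quantile grid and the 20 grid points are choices made here, and for brevity the forest is fitted on all rows.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

feature = "mean radius"
grid = np.linspace(X[feature].quantile(0.05), X[feature].quantile(0.95), 20)

pd_curve = []
for value in grid:
    X_mod = X.copy()
    X_mod[feature] = value       # force every row to this value
    # average predicted probability of class 1 (benign) over all rows
    pd_curve.append(rf.predict_proba(X_mod)[:, 1].mean())
pd_curve = np.array(pd_curve)

for value, avg in zip(grid[::5], pd_curve[::5]):
    print(f"{feature} = {value:6.2f} -> mean P(benign) = {avg:.3f}")
```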

---

## What random forests don't give you

- A **coefficient table** you can sign-and-interpret like a linear model
- **Smooth** functional forms -- boundaries are staircases of axis-aligned splits
- A way to **extrapolate** beyond the range of the training data -- predictions are flat outside
- **Calibrated probabilities** -- `predict_proba` just averages the trees' leaf class fractions (with fully grown trees, effectively "fraction of trees voting yes")
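
The extrapolation point is worth seeing once. A toy sketch with made-up data: a noiseless line $y = 2x$ observed only on $[0, 10]$.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy data (hypothetical): a noiseless line y = 2x on the range [0, 10]
X_train = np.linspace(0, 10, 200).reshape(-1, 1)
y_train = 2 * X_train.ravel()

rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)

X_new = np.array([[5.0], [10.0], [20.0], [100.0]])
preds = rf.predict(X_new)
for x, pred in zip(X_new.ravel(), preds):
    print(f"x = {x:6.1f}  true 2x = {2 * x:6.1f}  forest = {pred:6.2f}")
```

Every point to the right of the training range lands in each tree's right-most leaf, so `x = 20` and `x = 100` get the same prediction, and that prediction cannot exceed the largest training target.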

---

## Pitfall 1: shipping OOB as the headline

OOB is a great **sanity check**. It is **not** a substitute for the test set.

- Test set rule from L7: touched exactly once
- Report OOB **alongside** the CV and test scores in the project, not in place of either

---

## Pitfall 2: cranking `n_estimators` to make a point

Past a few hundred trees, gains are negligible and compute grows linearly.

- More trees do **not** overfit -- but they cost time
- Sensible starting point: **200--500** (scikit-learn's own default is 100)
- Push higher only if a learning curve says it helps

---

## Pitfall 3: trusting impurity importance with correlated features

Two perfectly correlated features will **split the importance** between them -- each looks half as important as the underlying signal.

- The model is unaffected
- The *story* the importance plot tells is misleading

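**Defence:** use permutation importance (bearing in mind that it can understate *both* of two strongly correlated features, since permuting one leaves the other to carry the signal), and check importance against domain knowledge.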
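
The splitting effect can be reproduced with a deliberately extreme toy setup: one feature that fully determines the label, an irrelevant feature, and then an exact copy of the first. Everything here (the data, the variable names, `max_features=None`) is an assumption made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 1000
signal = rng.normal(size=n)          # the only feature that drives y
noise = rng.normal(size=n)           # irrelevant feature
y = (signal > 0).astype(int)

X_single = np.column_stack([signal, noise])
X_dup = np.column_stack([signal, signal, noise])   # exact copy added

# max_features=None: every split may choose among all features
kw = dict(n_estimators=200, max_features=None, random_state=0)
imp_single = RandomForestClassifier(**kw).fit(X_single, y).feature_importances_
imp_dup = RandomForestClassifier(**kw).fit(X_dup, y).feature_importances_

print("without copy [signal, noise]:      ", imp_single.round(2))
print("with copy    [signal, copy, noise]:", imp_dup.round(2))
```

The underlying signal is the same in both fits; only its share of the importance plot changes.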

---

## The project workflow for ensembles

1. **Baseline** with logistic regression / ridge from L4-L5
2. **Random forest** with sensible defaults (`n_estimators=300`, `max_features='sqrt'`)
3. **CV** the forest under the same protocol (L7)
4. **Compare** baseline and forest under the same metric
5. **Interpret** with permutation importance and (optionally) partial dependence

A forest that doesn't beat the baseline is a **finding**, not a failure.
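
Steps 1 to 4 might be sketched as follows, using scikit-learn's breast cancer data purely as a stand-in for a project dataset. The logistic regression settings (`max_iter=1000`, default regularisation) are assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

models = {
    # Step 1: baseline; the scaler sits inside the pipeline so each CV
    # fold fits it on that fold's training part only
    "logistic": make_pipeline(StandardScaler(),
                              LogisticRegression(max_iter=1000)),
    # Step 2: forest with sensible defaults
    "forest": RandomForestClassifier(
        n_estimators=300, max_features="sqrt", random_state=42
    ),
}

# Steps 3-4: same folds, same metric for both families
results = {}
for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring="f1")
    results[name] = (scores.mean(), scores.std())
    print(f"{name:9s} CV F1: {scores.mean():.3f} ± {scores.std():.3f}")
```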

---

## The setup

- **Dataset:** `load_breast_cancer()` -- binary, 569 rows, 30 numeric features
- **Protocol:**
  - 80/20 train/test split, `random_state=42`, stratified
  - Stratified 5-fold CV on the training set
  - Metric: `f1` (slightly imbalanced, ~63% benign)
- **Three workflows:** single decision tree, bagging classifier, random forest

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
```

---

## Workflow 1: a single decision tree

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

tree = DecisionTreeClassifier(random_state=42)
scores = cross_val_score(tree, X_train, y_train, cv=5, scoring='f1')
print(f"Tree    CV F1: {scores.mean():.3f} ± {scores.std():.3f}")
```

Expected: solid mean, **high standard deviation** across folds.

---

## Workflow 2: bagging

```python
from sklearn.ensemble import BaggingClassifier

bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(random_state=42),
    n_estimators=300, n_jobs=-1, random_state=42,
)
scores = cross_val_score(bag, X_train, y_train, cv=5, scoring='f1')
print(f"Bagging CV F1: {scores.mean():.3f} ± {scores.std():.3f}")
```

Same base learner, averaged over 300 bootstrap samples.

---

## Workflow 3: random forest

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=300, max_features='sqrt',
    n_jobs=-1, random_state=42,
)
scores = cross_val_score(rf, X_train, y_train, cv=5, scoring='f1')
print(f"RF      CV F1: {scores.mean():.3f} ± {scores.std():.3f}")
```

Adds the random feature subset at each split.

---

## Read the three results together

| Workflow | CV F1 (mean ± std) | Test F1 |
|----------|---------------------|---------|
| Single tree | high std | -- |
| Bagging | lower std | -- |
| Random forest | lowest std | -- |

Two L7 questions to ask:

- Are the **gaps** between models bigger than the **std** across folds?
- Did the **variance** drop as we moved from tree → bagging → forest?

---

## Interpretation pass

```python
from sklearn.inspection import permutation_importance
import pandas as pd

rf.fit(X_train, y_train)
result = permutation_importance(
    rf, X_test, y_test, n_repeats=20, random_state=0, scoring='f1'
)
importances = pd.Series(result.importances_mean, index=X_train.columns)
print(importances.sort_values(ascending=False).head(8))
```

Compare to `rf.feature_importances_` on the same features. Note any disagreement.
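
One way to line the two measures up is a small table of scores and ranks. A self-contained sketch; it uses fewer trees and repeats than the slide above (`n_estimators=100`, `n_repeats=10`) only to keep it quick, and the column names are made up here.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
result = permutation_importance(
    rf, X_test, y_test, n_repeats=10, random_state=0, scoring="f1"
)

table = pd.DataFrame(
    {"impurity": rf.feature_importances_,
     "permutation": result.importances_mean},
    index=X_train.columns,
)
# Rank 1 = most important under each method; ties share the best rank
for col in ["impurity", "permutation"]:
    table[f"{col}_rank"] = (
        table[col].rank(ascending=False, method="min").astype(int)
    )
print(table.sort_values("impurity_rank").head(8))
```

Features whose two ranks differ sharply are the ones to look at first.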

---

## Summary

- A single decision tree is the textbook **high-variance** learner
- **Bagging** averages bootstrap-sample trees → variance down, bias unchanged
- **Random forests** add random feature subsets → further variance reduction by decorrelating trees
- **No scaling** needed. **OOB** gives a free generalisation estimate.
- **Permutation importance** is the honest interpretation tool

---

## Before Lecture 11

- **Run** today's tree-bagging-forest comparison on your own machine
- For your project: add a **random forest as a second-family comparison** under the same CV protocol
- Read ahead: **Lecture 11 is SVMs** -- a very different way to draw a decision boundary

---

## Questions

Open the floor.

## Backup slides

Use if questions arise or time allows.

## Extra trees (`ExtraTreesClassifier`)

Even more randomness -- split **thresholds** picked at random, not optimised.

```python
from sklearn.ensemble import ExtraTreesClassifier

et = ExtraTreesClassifier(n_estimators=500, n_jobs=-1, random_state=0)
```

- Faster than RF (no threshold search)
- Sometimes competitive, occasionally noticeably better
- Less stable across runs

Useful as a comparison family alongside RF.

## Calibration of random forest probabilities

`rf.predict_proba` averages the trees' leaf class fractions -- with fully grown trees, effectively the fraction of trees voting yes. Rarely well-calibrated.

```python
from sklearn.calibration import CalibratedClassifierCV

calibrated = CalibratedClassifierCV(rf, method='isotonic', cv=5)
calibrated.fit(X_train, y_train)
y_prob = calibrated.predict_proba(X_test)[:, 1]
```

Use when downstream decisions depend on the probability value, not just the rank.

Connect to the L5 calibration backup.

## Class imbalance with `class_weight='balanced_subsample'`

```python
rf = RandomForestClassifier(
    n_estimators=300,
    class_weight='balanced_subsample',
    random_state=0,
)
```

The forest recomputes class weights **per bootstrap sample**, so within each tree the classes carry roughly equal total weight.

A simpler alternative to SMOTE for many imbalanced problems.

## Multi-output regression

One forest predicts multiple targets jointly:

```python
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=300, random_state=0)
rf.fit(X_train, Y_train)   # Y_train shape: (n, k)
```

Useful when:

- Targets are correlated (joint fit captures structure)
- You want to avoid maintaining $k$ separate models

Splits still pick a single feature/threshold; impurity is computed across all targets.

## Why not just one massive tree?

A deep tree is unstable -- change one row and a split near the top can change, shifting the whole structure below it.

Many shallow-ish trees, each slightly different, average out that instability:

- Each tree's bias is bounded by its depth limit
- The ensemble's variance is bounded by (correlated) averaging

The ensemble wins because trees are cheap, parallel, and individually unstable -- exactly the conditions where averaging pays.

---

## What's next

**Lecture 11:** Support vector machines

- Maximum margin classification
- The kernel trick
- When to use SVMs vs forests