MST0052
## MST0052 -- Lecture 10

### Ensemble Methods

Fall 2026

---

## Where we are

| Lectures | Phase |
|----------|-------|
| 1--3 | Foundations |
| 4--7 | Core methods |
| **9--14** | **Going further -- you are here** |
| 15--16 | Wrapping up |

---

## Today's plan

- The **decision tree** -- one rule at a time
- Why a single tree is the textbook **high-variance** learner
- **Bagging** -- averaging trees fitted on bootstrap samples
- **Random forests** -- bagging plus random feature subsets
- Tuning, feature importance, and what to report in the project

---

## A tree is a sequence of questions

The model: follow the answers down the tree, predict whatever leaf you land in.

No equations, no kernels, no scaling -- radically different machinery from L4-L7.

---

## How splits are chosen

At each node, pick the (feature, threshold) pair that maximises **node purity** in the two children.

Two standard impurity measures (classification):

- **Gini:** $\text{Gini}(S) = 1 - \sum_c p_c^2$
- **Entropy:** $\text{Entropy}(S) = -\sum_c p_c \log_2 p_c$

The tree grows **greedily** -- best split at each node, no look-ahead.

---

## Trees for regression

Same algorithm, different impurity: minimise the **within-node variance** (or sum of squared residuals) of the target.

Leaf prediction:

- **Regression:** mean of the training targets in that leaf
- **Classification:** majority class in that leaf

One model family, two task types -- same tuning knobs.

---

## Why trees overfit

An unpruned tree keeps splitting until every leaf is **pure** (classification) or has one observation (regression).

- Training error: **zero**
- Test error: usually **terrible**

This is the textbook **high-variance** learner from L6 -- refit on a different sample and you get a wildly different tree.
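To make the overfitting claim concrete, here is a minimal sketch on synthetic data. The dataset, the noise level (`flip_y=0.1`), and the half-and-half refit are illustrative assumptions, not part of the lecture's worked example. It shows an unpruned tree memorising the training set, then refits on two disjoint halves of the training rows to see how often the two trees disagree.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumption for illustration): about 10% of labels are
# assigned at random, so a perfect score on new data is not achievable.
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5, flip_y=0.1, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# No depth or leaf-size limit: the tree keeps splitting until every leaf is pure.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
train_acc = tree.score(X_train, y_train)
test_acc = tree.score(X_test, y_test)
print(f"Train accuracy: {train_acc:.3f}")
print(f"Test accuracy:  {test_acc:.3f}")

# High variance: fit the same model on two disjoint halves of the training
# rows and measure how often the two trees disagree on the test rows.
half = len(X_train) // 2
tree_a = DecisionTreeClassifier(random_state=0).fit(X_train[:half], y_train[:half])
tree_b = DecisionTreeClassifier(random_state=0).fit(X_train[half:], y_train[half:])
disagreement = np.mean(tree_a.predict(X_test) != tree_b.predict(X_test))
print(f"Trees disagree on {disagreement:.1%} of test rows")
```

The gap between the two accuracies is the overfitting; the disagreement rate is a rough, informal view of the variance that bagging is meant to average away.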
---

## Controlling complexity in a single tree

| Knob | What it does |
|------|--------------|
| `max_depth` | Hard cap on tree depth |
| `min_samples_split` | Minimum samples a node must have to be split |
| `min_samples_leaf` | Minimum samples in each leaf |
| `ccp_alpha` | Cost-complexity pruning -- penalises tree size |

All trade variance for bias -- the same L6 dial in different clothing.

---

## The averaging idea

From L6: averaging many noisy unbiased estimates **reduces variance**, while bias stays the same.

If we had many **independent** trees fitted to **independent** samples, averaging their predictions would:

- inherit each tree's **bias** (unchanged)
- slash each tree's **variance**

We don't have many independent samples. **Bootstrap** is the trick we use to fake them.

---

## The bootstrap

A **bootstrap sample**: a random sample of size $n$ drawn **with replacement** from the original $n$ training rows.

- Each bootstrap sample contains roughly **63%** of the original rows ($1 - 1/e \approx 0.632$); the remaining draws are repeats
- The ~37% of rows that were never drawn are called **out-of-bag (OOB)**

Draw $B$ bootstrap samples → fit $B$ trees → ensemble.

---

## Bagging in one slide

**Bootstrap AGGregating:**

1. Draw $B$ bootstrap samples
2. Fit a tree to each (unpruned, so each is low-bias)
3. Aggregate: **average** (regression) or **majority vote** (classification)

- Ensemble bias ≈ a single tree's bias
- Ensemble variance ≈ a single tree's variance ÷ $B$ (in the unrealistic limit of perfect independence)

---

## Out-of-bag error: free CV

Each tree's OOB rows were not seen during its training.

For every training row $i$, the trees that did not include $i$ in their bootstrap sample give an honest prediction. Averaging those → an OOB estimate of generalisation, **without** refitting.
```python
from sklearn.ensemble import RandomForestClassifier

# oob_score=True asks the forest to score each row using only the trees that never saw it
rf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0)
rf.fit(X_train, y_train)
print(f"OOB accuracy: {rf.oob_score_:.3f}")
```

---

## Bagging vs a single tree, visually

- **Left:** a single deep tree -- jagged, axis-aligned, chasing noise
- **Right:** the bagged ensemble -- smoother, same general shape

This is variance reduction made visible.

---

## Bagging's weakness: correlated trees

Even with different bootstrap samples, trees tend to use the **same strong features** at the top splits.

→ Trees are **correlated** → correlated errors do not average away.

**Random forests** add one more layer of randomness to fix this.

---

## The random forest recipe

Bagging + **random feature subsets at each split**:

- At every candidate split, sample $m$ features at random from the $p$ available
- Best split is chosen only among those $m$

Typical defaults:

- **Classification:** $m = \sqrt{p}$
- **Regression:** $m = p/3$

---

## Why it works

Forcing each tree to look at **different feature subsets** makes the trees **less correlated**.
- Less correlation → averaging is more effective → variance drops further
- Price: each individual tree is slightly worse (fewer features to choose from)
- Net: the ensemble usually improves, because the variance reduction outweighs the small per-tree loss

---

## Random forests in scikit-learn

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=500,
    max_features='sqrt',
    min_samples_leaf=1,
    n_jobs=-1,
    random_state=42,
)
rf.fit(X_train, y_train)
print(f"Test accuracy: {rf.score(X_test, y_test):.3f}")
```

- **No scaling needed** -- trees don't care about feature units
- `n_jobs=-1` parallelises tree fitting across cores

---

## Key hyperparameters

| Parameter | Effect | Typical range |
|-----------|--------|---------------|
| `n_estimators` | More trees → lower variance (diminishing past a few hundred) | 100--1000 |
| `max_features` | Lower → more decorrelation, higher per-tree bias | `'sqrt'`, `'log2'`, fractions |
| `max_depth` | Cap individual tree complexity | `None` or 5--30 |
| `min_samples_leaf` | Floor on leaf size; larger = more regularisation | 1--20 |

Tune with `GridSearchCV` (L7). `n_estimators` is almost never the decisive parameter.

---

## Feature importance (impurity-based)

`rf.feature_importances_` averages, across all trees and splits, the impurity reduction attributable to each feature.

```python
import pandas as pd

importances = pd.Series(rf.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False).head(10))
```

- Useful first pass for "which features matter"
- **But:** biased toward high-cardinality features and toward features chosen first when others are correlated

---

## Permutation importance (the honest answer)

Permute one feature's values across rows; refit nothing; measure the drop in the model's score on held-out data.
```python
from sklearn.inspection import permutation_importance

result = permutation_importance(
    rf, X_test, y_test, n_repeats=20, random_state=0, scoring='accuracy'
)
```

A feature with high permutation importance **actively** matters at prediction time. High impurity importance only means the feature was picked often while the trees were being grown.

Slower, more honest. **Use it before reporting feature importance in your project.**

---

## Partial dependence (brief)

For a feature of interest, average the model's prediction across the training data while varying that feature alone.

The result is a curve: how does the prediction change as that feature changes, holding everything else "average"?

```python
from sklearn.inspection import PartialDependenceDisplay

# The feature name must match a column of X_train exactly;
# the breast-cancer frame spells it 'mean radius' (with a space).
PartialDependenceDisplay.from_estimator(rf, X_train, features=['mean radius'])
```

Cheap to compute. Useful for "what does this feature do?" answers.

---

## What random forests don't give you

- A **coefficient table** you can sign-and-interpret like a linear model
- **Smooth** functional forms -- boundaries are staircases of axis-aligned splits
- A way to **extrapolate** beyond the range of the training data -- predictions are flat outside
- **Calibrated probabilities** -- `predict_proba` is just "fraction of trees voting yes"

---

## Pitfall 1: shipping OOB as the headline

OOB is a great **sanity check**. It is **not** a substitute for the test set.

- Test set rule from L7: touched exactly once
- Report OOB **alongside** the CV and test scores in the project, not in place of either

---

## Pitfall 2: cranking `n_estimators` to make a point

Past a few hundred trees, gains are negligible and compute grows linearly.

- More trees do **not** overfit -- but they cost time
- Default: **200-500**
- Push higher only if a learning curve says it helps

---

## Pitfall 3: trusting impurity importance with correlated features

Two perfectly correlated features will **split the importance** between them -- each looks half as important as the underlying signal.
- The model is unaffected
- The *story* the importance plot tells is misleading

**Defence:** use permutation importance, and check importance against domain knowledge.

---

## The project workflow for ensembles

1. **Baseline** with logistic regression / ridge from L4-L5
2. **Random forest** with sensible defaults (`n_estimators=300`, `max_features='sqrt'`)
3. **CV** the forest under the same protocol (L7)
4. **Compare** baseline and forest under the same metric
5. **Interpret** with permutation importance and (optionally) partial dependence

A forest that doesn't beat the baseline is a **finding**, not a failure.

---

## The setup

- **Dataset:** `load_breast_cancer()` -- binary, 569 rows, 30 numeric features
- **Protocol:**
  - 80/20 train/test split, `random_state=42`, stratified
  - Stratified 5-fold CV on the training set
  - Metric: `f1` (slightly imbalanced, ~63% benign)
- **Three workflows:** single decision tree, bagging classifier, random forest

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
```

---

## Workflow 1: a single decision tree

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

tree = DecisionTreeClassifier(random_state=42)
scores = cross_val_score(tree, X_train, y_train, cv=5, scoring='f1')
print(f"Tree CV F1: {scores.mean():.3f} ± {scores.std():.3f}")
```

Expected: solid mean, **high standard deviation** across folds.
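As a companion to the unpruned tree, the single-tree complexity knobs from earlier (`max_depth`, `min_samples_leaf`) can be tried under the same split and CV protocol. This is a sketch only: the values `3` and `10` are arbitrary illustrative choices, not tuned or recommended settings, and no particular ranking of the three trees is implied.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Same data and split as the setup slide.
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Knob values below are illustrative assumptions, not tuned settings.
candidates = {
    "unpruned": DecisionTreeClassifier(random_state=42),
    "max_depth=3": DecisionTreeClassifier(max_depth=3, random_state=42),
    "min_samples_leaf=10": DecisionTreeClassifier(min_samples_leaf=10, random_state=42),
}

results = {}
for name, model in candidates.items():
    # Stratified 5-fold CV on the training set, scored with F1.
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring="f1")
    results[name] = (scores.mean(), scores.std())
    print(f"{name:>20}: {scores.mean():.3f} ± {scores.std():.3f}")
```

Reading the means and standard deviations side by side gives a first feel for how much variance a single tree can shed through regularisation alone, before any averaging is involved.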
---

## Workflow 2: bagging

```python
from sklearn.ensemble import BaggingClassifier

bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(random_state=42),
    n_estimators=300,
    n_jobs=-1,
    random_state=42,
)
scores = cross_val_score(bag, X_train, y_train, cv=5, scoring='f1')
print(f"Bagging CV F1: {scores.mean():.3f} ± {scores.std():.3f}")
```

Same base learner, averaged over 300 bootstrap samples.

---

## Workflow 3: random forest

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=300,
    max_features='sqrt',
    n_jobs=-1,
    random_state=42,
)
scores = cross_val_score(rf, X_train, y_train, cv=5, scoring='f1')
print(f"RF CV F1: {scores.mean():.3f} ± {scores.std():.3f}")
```

Adds the random feature subset at each split.

---

## Read the three results together

| Workflow | CV F1 (mean ± std) | Test F1 |
|----------|---------------------|---------|
| Single tree | high std | -- |
| Bagging | lower std | -- |
| Random forest | lowest std | -- |

Two L7 questions to ask:

- Are the **gaps** between models bigger than the **std** across folds?
- Did the **variance** drop as we moved from tree → bagging → forest?

---

## Interpretation pass

```python
from sklearn.inspection import permutation_importance
import pandas as pd

rf.fit(X_train, y_train)
result = permutation_importance(
    rf, X_test, y_test, n_repeats=20, random_state=0, scoring='f1'
)
importances = pd.Series(result.importances_mean, index=X_train.columns)
print(importances.sort_values(ascending=False).head(8))
```

Compare to `rf.feature_importances_` on the same features. Note any disagreement.

---

## Summary

- A single decision tree is the textbook **high-variance** learner
- **Bagging** averages bootstrap-sample trees → variance down, bias unchanged
- **Random forests** add random feature subsets → further variance reduction by decorrelating trees
- **No scaling** needed. **OOB** gives a free generalisation estimate.
- **Permutation importance** is the honest interpretation tool

---

## Before Lecture 11

- **Run** today's tree-bagging-forest comparison on your own machine
- For your project: add a **random forest as a second-family comparison** under the same CV protocol
- Read ahead: **Lecture 11 is SVMs** -- a very different way to draw a decision boundary

---

## Questions

Open the floor.

--

## Backup slides

Use if questions arise or time allows.

--

## Extra trees (`ExtraTreesClassifier`)

Even more randomness -- split **thresholds** picked at random, not optimised.

```python
from sklearn.ensemble import ExtraTreesClassifier

et = ExtraTreesClassifier(n_estimators=500, n_jobs=-1, random_state=0)
```

- Faster than RF (no threshold search)
- Sometimes competitive, occasionally noticeably better
- Less stable across runs

Useful as a comparison family alongside RF.

--

## Calibration of random forest probabilities

`rf.predict_proba` = fraction of trees voting yes. Rarely well-calibrated.

```python
from sklearn.calibration import CalibratedClassifierCV

calibrated = CalibratedClassifierCV(rf, method='isotonic', cv=5)
calibrated.fit(X_train, y_train)
y_prob = calibrated.predict_proba(X_test)[:, 1]
```

Use when downstream decisions depend on the probability value, not just the rank. Connect to the L5 calibration backup.

--

## Class imbalance with `class_weight='balanced_subsample'`

```python
rf = RandomForestClassifier(
    n_estimators=300,
    class_weight='balanced_subsample',
    random_state=0,
)
```

The forest recomputes class weights **per bootstrap sample**, so each tree trains on a roughly class-balanced weighting of its own rows. A simpler alternative to SMOTE for many imbalanced problems.
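To see what the "balanced" weighting amounts to, the sketch below computes it by hand on toy labels. The labels, the helper `balanced_weights`, and the single simulated bootstrap draw are illustrative assumptions; the sketch mirrors the documented formula `n_samples / (n_classes * count_c)` rather than reproducing the forest's internals.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels (assumption for illustration): 160 of class 0, 40 of class 1.
y = np.array([0] * 160 + [1] * 40)

def balanced_weights(labels, n_classes=2):
    """Hypothetical helper: weight_c = n_samples / (n_classes * count_c).

    Assumes every class appears at least once in `labels`.
    """
    counts = np.bincount(labels, minlength=n_classes)
    return len(labels) / (n_classes * counts)

# 'balanced' weights on the full label vector, via scikit-learn and by hand.
full = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
manual = balanced_weights(y)
print("full-sample weights:", full)

# 'balanced_subsample' applies the same formula to each tree's bootstrap sample.
# One simulated bootstrap draw stands in for a single tree here.
rng = np.random.default_rng(0)
boot_idx = rng.integers(0, len(y), size=len(y))
boot_counts = np.bincount(y[boot_idx], minlength=2)
boot = balanced_weights(y[boot_idx])
print("bootstrap class counts:", boot_counts)
print("bootstrap weights:     ", boot)

# After weighting, both classes carry the same total weight in this sample.
print("weighted class totals: ", boot * boot_counts)
```

The weights differ slightly from one bootstrap sample to the next because the class counts do, which is the only difference between `'balanced'` and `'balanced_subsample'`.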
--

## Multi-output regression

One forest predicts multiple targets jointly:

```python
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=300, random_state=0)
rf.fit(X_train, Y_train)  # Y_train shape: (n, k)
```

Useful when:

- Targets are correlated (joint fit captures structure)
- You want to avoid maintaining $k$ separate models

Splits still pick a single feature/threshold; impurity is computed across all targets.

--

## Why not just one massive tree?

A deep tree is unstable -- one row changes the splits at the top, the whole structure shifts.

Many shallow-ish trees, each slightly different, average out that instability:

- Each tree's bias is bounded by its depth limit
- The ensemble's variance is bounded by (correlated) averaging

The ensemble wins because trees are cheap, parallel, and individually unstable -- exactly the conditions where averaging pays.

---

## What's next

**Lecture 11:** Support vector machines

- Maximum margin classification
- The kernel trick
- When to use SVMs vs forests