MST0052
## MST0052 -- Lecture 7

### Model Selection and Cross-Validation

Fall 2026

---

## Where we are

| Lectures | Phase |
|----------|-------|
| 1--3 | Foundations |
| **4--7** | **Core methods -- you are here** |
| 9--14 | Going further |
| 15--16 | Wrapping up |

---

## Today's plan

- **Cross-validation** in depth -- k-fold, stratified, time-series
- **Hyperparameter tuning** with `GridSearchCV`
- Choosing a **metric** that matches the problem
- The most common selection **mistakes**
- Worked example: **comparing three model families** on one dataset

---

## L6 in one sentence

> Out-of-sample error has a sweet spot in model complexity. The job is to find it without cheating.

Today: *how* you find it, in code, on real data.

---

## Model selection is *workflow* selection

A "model" in your project is not just an algorithm. It is:

> **preprocessing + features + algorithm + hyperparameters**

When you compare ridge vs random forest, you are comparing two *workflows*. The winner is a workflow, not just an algorithm.

This matters: mixing pieces from different workflows is how leakage and unfair comparisons creep in.

---

## The three sets

| Set | Used for | Touched how often |
|-----|----------|-------------------|
| **Training** | Fit the model | Every time |
| **Validation** | Tune hyperparameters, compare workflows | Many times -- but never the final score |
| **Test** | One final, honest evaluation | **Exactly once** |

The test set is sacred. Tune against it and it stops being a test set -- you have no honest evaluation left.

---

## Why a single train/validation split isn't enough

A single 80/20 split gives you *one* number. That number is **noisy**:

- The validation set is small
- Which 20% of rows you happened to draw matters
- Run it twice with different seeds and the "best model" can change

We need an estimate that **averages over the luck of the split.**

---

## k-fold cross-validation

1. Split training data into $k$ equal folds
2. For each fold: fit on the other $k-1$, score on the held-out one
3. Average the $k$ scores (and report the standard deviation)

Typical choices: $k = 5$ (default) or $k = 10$ (lower bias, more compute).

---

## What CV is estimating

CV gives you an estimate of the **out-of-sample error of the workflow** -- not of any particular fitted model.

- Each of the $k$ folds produces a *different* fitted model
- The **mean** of their scores is the generalisation estimate
- After CV picks the workflow, you **refit on the full training set** to get the model you ship

---

## Stratified k-fold for classification

Plain k-fold randomises rows into folds. For **imbalanced** classification, that can produce folds with very different class ratios.

**Stratified k-fold** preserves the class ratio in every fold.

In scikit-learn:

- Classifiers passed to `cross_val_score(..., cv=5)` use stratified k-fold **by default**
- Regressors do not

---

## `cross_val_score` and `cross_validate`

```python
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X_train, y_train, cv=cv, scoring='f1')
print(f"CV F1: {scores.mean():.3f} ± {scores.std():.3f}")
```

- Pass a splitter object for full control -- `shuffle=True` matters when data is ordered
- `cross_validate` returns more: train scores, fit times, multiple metrics

---

## When k-fold is wrong: time series

If your rows have a **temporal order**, plain shuffled k-fold trains on the future to predict the past. That is a leak -- CV scores will be unrealistically good.
Use `TimeSeriesSplit` instead:

```
Fold 1: [train: 0..n1] [val: n1..n2]
Fold 2: [train: 0..n2] [val: n2..n3]
Fold 3: [train: 0..n3] [val: n3..n4]
```

Each fold trains on data **up to** a cut point and validates on the next block.

---

## Other resampling variants

| Variant | When to reach for it |
|---------|---------------------|
| `KFold` | Default for regression on i.i.d. data |
| `StratifiedKFold` | Classification -- preserves class ratios |
| `ShuffleSplit` | Many random train/val splits |
| `GroupKFold` | Rows have a grouping (patients, users) that must not cross folds |
| `TimeSeriesSplit` | Temporal data |

Pick the splitter that matches your data's structure. **Defaults are safe only if your data is unstructured i.i.d.**

---

## The search problem

Every model family has knobs ($\lambda$ in ridge, $k$ in k-NN, depth in trees). Different knob values produce **different workflows**. Each workflow has its own CV score. We want the knob value that maximises CV performance.

Hand-tuning ("I tried 0.1 and it looked OK") is not reproducible.
**We need a search.**

---

## `GridSearchCV`

```python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

pipe = make_pipeline(StandardScaler(), RandomForestClassifier(random_state=42))
param_grid = {
    'randomforestclassifier__n_estimators': [100, 300],
    'randomforestclassifier__max_depth': [None, 5, 10],
    'randomforestclassifier__min_samples_leaf': [1, 5],
}
search = GridSearchCV(pipe, param_grid, cv=5, scoring='f1',
                      n_jobs=-1, return_train_score=True)
search.fit(X_train, y_train)

print(f"Best params: {search.best_params_}")
print(f"Best CV F1: {search.best_score_:.3f}")
print(f"Test F1: {search.score(X_test, y_test):.3f}")
```

---

## Read the full results, not just `.best_params_`

```python
import pandas as pd

results = pd.DataFrame(search.cv_results_)
cols = ['mean_train_score', 'mean_test_score', 'std_test_score',
        'param_randomforestclassifier__max_depth',
        'param_randomforestclassifier__n_estimators']
print(results[cols].sort_values('mean_test_score', ascending=False).head())
```

- The mean is one number; the **std** tells you whether the result is stable
- Many configurations score similarly → not very sensitive → pick the **simplest** (per L6)
- Best config sits at the **edge of your grid** → grid is too small, extend it

---

## Grid vs random vs halving

| Strategy | When it wins |
|----------|-------------|
| **Grid** | Small parameter space, want full coverage |
| **Randomized** (`RandomizedSearchCV`) | Large or continuous space -- sample $n$ random configs |
| **Halving** (`HalvingGridSearchCV`) | Many configs; eliminate poor ones on small data first |

For your project, **Grid** is fine for 1-2 hyperparameters; **Randomized** past 3-4.

---

## The pipeline-CV pattern, end to end

1. **Split** off the test set. Lock it.
2. **Build** a `Pipeline` (preprocessing + model).
3. **Define** a parameter grid using `stepname__param` keys.
4. **GridSearchCV** on the training set with the right splitter and metric.
5. **Inspect** `cv_results_` -- not just `best_params_`.
6. **Score** `best_estimator_` on the test set, once.

This pattern does not change per model family -- only the pipeline and the grid do.

---

## The metric is a modelling choice

Whatever you pass to `scoring=` is the thing `GridSearchCV` optimises. **Pick wisely.**

The wrong metric is silent: the code runs, the search "finishes," the chosen workflow is wrong for your problem.

---

## Common `scoring=` strings

| Task | Useful strings |
|------|----------------|
| Regression | `'r2'`, `'neg_mean_squared_error'`, `'neg_root_mean_squared_error'`, `'neg_mean_absolute_error'` |
| Binary classification | `'accuracy'`, `'f1'`, `'roc_auc'`, `'average_precision'` |
| Multi-class classification | `'f1_macro'`, `'f1_weighted'`, `'balanced_accuracy'` |
| Custom | Build with `make_scorer` |

The `neg_` prefix is a convention: scikit-learn **maximises by default**, so loss-style metrics are negated.

---

## Multi-metric evaluation

```python
search = GridSearchCV(
    pipe, param_grid, cv=5,
    scoring={'f1': 'f1', 'roc_auc': 'roc_auc'},
    refit='f1',
)
```

- Optimise on F1, but **also report** ROC-AUC for each configuration
- Useful when you want to *report* multiple numbers without changing the selection rule

---

## Pitfall 1: leakage inside CV

The L3 leakage rule applies *within every fold of CV*.

- **Wrong:** scale the whole training set, then pass to `cross_val_score`
- The scaler has seen the validation folds → CV scores are optimistic

**Fix:** put the scaler **inside the pipeline**. `GridSearchCV` refits it on each fold's training data.

---

## Pitfall 2: optimism from the best CV score

`search.best_score_` is the CV score of the **best** configuration you tried.

That number is biased **upward** -- by the act of picking the maximum over many tries, you partly fit to validation noise.

> The honest generalisation estimate is `search.score(X_test, y_test)` -- the **test score** of the refit best model.
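
The upward bias can be seen on data with no signal at all. The sketch below is illustrative only: it assumes synthetic pure-noise features and random labels (not the lecture's dataset), so every configuration has a true accuracy of 0.5. The gap between the selected CV score and the average CV score is then pure selection optimism.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, illustrative data: labels are independent of the features,
# so no configuration can truly beat 0.5 accuracy.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = rng.integers(0, 2, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

# Try 20 values of k; all are equally useless on this data.
search = GridSearchCV(KNeighborsClassifier(),
                      {'n_neighbors': list(range(1, 40, 2))},
                      cv=5, scoring='accuracy')
search.fit(X_tr, y_tr)

all_means = search.cv_results_['mean_test_score']
print(f"Mean CV accuracy over configs: {all_means.mean():.3f}")
print(f"Best CV accuracy (selected):   {search.best_score_:.3f}")
print(f"Test accuracy of refit model:  {search.score(X_te, y_te):.3f}")
```

By construction the selected score is at least as large as the average over configurations; on noise data the test accuracy will typically sit nearer 0.5 than the selected CV score does.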

---

## Pitfall 3: too many comparisons

Test 50 workflows and one will look great by chance, even if all are equally weak.

**Symptoms:**

- "Winning" hyperparameter is at one end of the grid
- Std of the best CV score is as large as (or larger than) the gaps between configs
- You can't reproduce the winner with a different random seed

**Defence:** keep the grid focused, prefer the simpler model when scores tie, run with a couple of different `random_state` values to check stability.

---

## Nested cross-validation: the concept

```
Outer loop (estimates generalisation)
  └ Inner loop (tunes hyperparameters with GridSearchCV)
```

- **Outer loop:** honest estimate of the *whole pipeline including the hyperparameter tuning*
- **Inner loop:** picks hyperparameters for each outer fold

This is what you need when you want a CV score that includes the tuning cost.

---

## Nested CV in scikit-learn

```python
from sklearn.model_selection import GridSearchCV, cross_val_score, StratifiedKFold

inner = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# The search itself is the estimator: it is re-tuned inside every outer fold
search = GridSearchCV(pipe, param_grid, cv=inner, scoring='f1')
nested = cross_val_score(search, X_train, y_train, cv=outer, scoring='f1')
print(f"Nested CV F1: {nested.mean():.3f} ± {nested.std():.3f}")
```

Costs roughly $k_\text{outer} \times k_\text{inner} \times |\text{grid}|$ fits, plus one refit of the best configuration per outer fold.

> For most coursework, **GridSearchCV + a held-out test set is enough.** Nested CV is the rigorous upgrade.

---

## The task and the protocol

**Dataset:** scikit-learn's `load_wine()` -- 3-class classification, 178 samples, 13 chemical features.

**Protocol:**

- 80/20 train/test split, `random_state=42`
- Stratified 5-fold CV on the training set
- Metric: `f1_macro` (multi-class, treats classes equally)
- Three model families: **logistic regression, k-NN, random forest**

All three families compete under the **same rule**.
```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
```

---

## Family 1: logistic regression

```python
from sklearn.linear_model import LogisticRegression

pipe_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))
grid_lr = {'logisticregression__C': [0.01, 0.1, 1.0, 10.0]}
search_lr = GridSearchCV(pipe_lr, grid_lr, cv=5, scoring='f1_macro')
search_lr.fit(X_train, y_train)

print(f"LR best C: {search_lr.best_params_}")
print(f"LR CV f1_macro: {search_lr.best_score_:.3f}")
```

Read the trajectory of CV scores across `C`. Which side of the U-curve are we on?

---

## Family 2: k-NN

```python
from sklearn.neighbors import KNeighborsClassifier

pipe_knn = make_pipeline(StandardScaler(), KNeighborsClassifier())
grid_knn = {'kneighborsclassifier__n_neighbors': [3, 5, 11, 21, 41]}
search_knn = GridSearchCV(pipe_knn, grid_knn, cv=5, scoring='f1_macro')
search_knn.fit(X_train, y_train)

print(f"k-NN best k: {search_knn.best_params_}")
print(f"k-NN CV f1_macro: {search_knn.best_score_:.3f}")
```

The L5 "scale before k-NN" lesson is enforced here by the pipeline.

---

## Family 3: random forest

```python
from sklearn.ensemble import RandomForestClassifier

pipe_rf = make_pipeline(RandomForestClassifier(random_state=42))
grid_rf = {
    'randomforestclassifier__n_estimators': [100, 300],
    'randomforestclassifier__max_depth': [None, 5, 10],
}
search_rf = GridSearchCV(pipe_rf, grid_rf, cv=5, scoring='f1_macro', n_jobs=-1)
search_rf.fit(X_train, y_train)

print(f"RF best params: {search_rf.best_params_}")
print(f"RF CV f1_macro: {search_rf.best_score_:.3f}")
```

Trees don't need scaling -- the pipeline is shorter.
**Same selection protocol.**

---

## Honest comparison

```python
for name, s in [('Logistic', search_lr), ('k-NN', search_knn), ('RF', search_rf)]:
    cv = s.best_score_
    cv_std = s.cv_results_['std_test_score'][s.best_index_]
    test = s.score(X_test, y_test)
    print(f"{name:9s} CV f1_macro = {cv:.3f} ± {cv_std:.3f} "
          f"Test f1_macro = {test:.3f}")
```

Two questions to ask of the output:

- Do the **CV winners** agree with the **test winners**?
- Are the gaps between families **larger than the std** across folds?

If not, the comparison is less decisive than a single number suggests.

---

## Which one would you ship?

- Two families tie within their CV std → **ship the simpler one** (per L6)
- Test score for the chosen winner is much worse than its CV score → suspect an **unlucky test split** or **hyperparameter overfitting** (pitfall 2)
- Winning hyperparameter sits at the **edge of its grid** → extend the grid and re-run

---

## What to report in your project

For each modelling family you tried:

1. The **pipeline** (preprocessing + model)
2. The **CV protocol** (splitter, $k$, random state, stratification)
3. The **metric** and why it matches the problem
4. The **grid** searched
5. The **chosen configuration** and its CV score (mean ± std)
6. The **test score** of the chosen model, reported **once**

---

## A reproducible selection rule beats a "best model" claim

> "Among configurations within one CV std of the best, we picked the simplest."

That sentence is more defensible than:

> "We got 0.91 on the test set."

The selection rule **survives** if your data is updated. The "best model" claim does not.
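
The quoted rule can be written down as code, which is what makes it reproducible. This is a minimal sketch, not a scikit-learn feature: the function name `pick_simplest_within_one_std` and the `complexity` argument (one number per configuration, smaller meaning simpler) are hypothetical, and how you rank complexity is your own modelling decision. The toy scores are invented for illustration.

```python
import numpy as np

def pick_simplest_within_one_std(cv_results, complexity):
    """Index of the simplest config whose mean CV score is within
    one std (of the best config) of the best mean CV score."""
    mean = np.asarray(cv_results['mean_test_score'], dtype=float)
    std = np.asarray(cv_results['std_test_score'], dtype=float)
    complexity = np.asarray(complexity, dtype=float)

    best = int(np.argmax(mean))               # highest mean CV score
    threshold = mean[best] - std[best]        # one std below the best
    candidates = np.flatnonzero(mean >= threshold)
    # Among the candidates, take the one with the lowest complexity
    return int(candidates[np.argmin(complexity[candidates])])

# Toy numbers shaped like GridSearchCV.cv_results_ (made up for illustration)
toy_results = {
    'mean_test_score': [0.80, 0.90, 0.91, 0.85],
    'std_test_score':  [0.03, 0.02, 0.02, 0.04],
}
max_depth = [2, 4, 8, 16]                      # deeper tree = more complex
idx = pick_simplest_within_one_std(toy_results, complexity=max_depth)
print(f"Chosen config index: {idx} (max_depth={max_depth[idx]})")
```

Here the best mean is 0.91 (depth 8), but depth 4 scores 0.90, which is within one std, so the rule picks depth 4. `GridSearchCV` also accepts a callable for `refit` that receives `cv_results_` and returns the chosen index, which is one way to wire a rule like this into the search.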

---

## Summary

- Cross-validation is the **operational** answer to L6's tradeoff
- **Stratify for classification**, `TimeSeriesSplit` for temporal data
- **GridSearchCV** does the search; **read `cv_results_`**, not just `best_params_`
- The **metric** is a modelling choice -- pick it deliberately
- The **test set** is touched exactly once
- A reproducible **selection rule** beats a "best model" claim

---

## Before Lecture 8

- **Run** today's wine workflow on your own machine
- **Apply** the same protocol to your own dataset and at least two model families
- **L8 is the optional project showcase** -- sign up for a 5-minute slot if you want feedback in front of the class. **Hard deadline** -- link on the course site.

---

## Questions

Open the floor.

--

## Backup slides

Use if questions arise or time allows.

--

## Leave-one-out CV (LOOCV)

$k = n$: each fold leaves out a single observation.

- Lowest bias (you train on $n-1$ points each time)
- Highest variance across folds
- Slowest -- $n$ model fits

Useful only for **small samples** or as a theoretical limit. For everything else, $k = 5$ or $k = 10$ is the right answer.

--

## Repeated k-fold

```python
from sklearn.model_selection import RepeatedStratifiedKFold

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
scores = cross_val_score(pipe, X_train, y_train, cv=cv, scoring='f1')
```

Averages over multiple **shuffles** of the folds. Useful when the std across folds is uncomfortably large -- you get a tighter estimate at $n_\text{repeats}$× the cost.
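
A self-contained sketch of how the repeated scores might be summarised. It assumes scikit-learn's bundled breast-cancer dataset as a stand-in binary problem (not the lecture's data) and, for brevity, runs CV on the whole dataset rather than a training split.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)   # stand-in binary dataset
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))

n_splits, n_repeats = 5, 10
cv = RepeatedStratifiedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=0)
scores = cross_val_score(pipe, X, y, cv=cv, scoring='f1')   # one score per fold

# Scores arrive repeat by repeat, so each row below is one full 5-fold run
per_repeat = scores.reshape(n_repeats, n_splits).mean(axis=1)
print(f"{len(scores)} fold scores, overall mean F1 = {scores.mean():.3f}")
print(f"Spread of the {n_repeats} per-repeat means: {per_repeat.std():.3f}")
```

The spread of the per-repeat means shows how much a single 5-fold estimate moves with the shuffle alone.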

--

## Calibration in model selection

When picking a model whose **probabilities** matter (not just labels):

- Evaluate **calibration** (Brier score, reliability diagram) alongside discrimination (AUC, F1)
- A well-discriminating but poorly calibrated model can still mislead downstream decisions

```python
from sklearn.metrics import brier_score_loss

# Binary case: column 1 is the predicted probability of the positive class
y_prob = search.best_estimator_.predict_proba(X_test)[:, 1]
print(f"Brier: {brier_score_loss(y_test, y_prob):.3f}")
```

Connect to the L5 calibration backup (`CalibratedClassifierCV`).

--

## Why `neg_mean_squared_error`?

scikit-learn's selection logic **maximises** the scoring function. Loss-style metrics (MSE, MAE) get **negated** so larger = better:

```python
search = GridSearchCV(pipe, grid, cv=5, scoring='neg_mean_squared_error')
# Report it back with the sign flipped:
print(f"Best CV MSE: {-search.best_score_:.3f}")
```

Annoying convention, but consistent across all `neg_*` metrics.

--

## Permutation importance after model selection

Once a workflow is chosen, you may want to **interpret** it:

```python
from sklearn.inspection import permutation_importance

result = permutation_importance(
    search.best_estimator_, X_test, y_test,
    n_repeats=20, random_state=0, scoring='f1_macro'
)
```

- Model-agnostic -- works on any fitted estimator
- Measures the drop in score when each feature is randomly shuffled
- A more honest importance signal than `.feature_importances_` for tree models

Skip in lecture; useful in office hours.

---

## What's next

**Lecture 8:** Project showcase (optional)

- Volunteers present their project to the class
- Feedback from me, ideas from your classmates
- Not presenting? Come anyway -- half the value is in the audience