MST0052
## MST0052 — Lecture 4 ### Linear Models for Prediction Fall 2026 --- ## Where we are | Lectures | Phase | |----------|-------| | 1–3 | Foundations | | **4–7** | **Core methods — you are here** | | 8–13 | Going further | | 14–16 | Wrapping up | --- ## Today's plan - Linear regression as your **first baseline** - What OLS is optimising — and when it breaks - Ridge and lasso — why we penalise - Tuning the penalty with cross-validation - Worked example: California housing prices --- ## Why start with linear regression? - **Transparent** — every coefficient is interpretable - **Fast** — fits in milliseconds - **Fully inspectable** — you can see exactly what the model learned A credible baseline gives you something **defensible to beat.** If a complex model can't improve on a well-specified linear model, the extra complexity is hard to justify. --- ## The model, in one equation $$\hat{y} = \beta\_0 + \beta\_1 x\_1 + \beta\_2 x\_2 + \cdots + \beta\_p x\_p$$ - Each $\beta\_j$: how much does $\hat{y}$ change per unit change in $x\_j$? - This is a **prediction** equation, not a causal claim --- ## What OLS is optimising Ordinary least squares minimises the sum of squared residuals: $$\hat{\beta} = \arg\min\_{\beta} \sum\_{i=1}^{n}(y\_i - x\_i^\top \beta)^2$$ - Large errors are penalised more than small ones (squared) - Closed-form solution — no iterative optimisation - **No hyperparameters** — a strength and a weakness --- ## OLS in scikit-learn ```python from sklearn.linear_model import LinearRegression from sklearn.pipeline import make_pipeline from sklearn.preprocessing import StandardScaler pipe = make_pipeline(StandardScaler(), LinearRegression()) pipe.fit(X_train, y_train) print(f"Train R²: {pipe.score(X_train, y_train):.3f}") print(f"Test R²: {pipe.score(X_test, y_test):.3f}") ``` Scaling is optional for OLS but makes coefficients comparable. --- ## Evaluation metrics for regression | Metric | What it measures | |--------|-----------------| | **MSE** | Average squared error — penalises large errors | | **RMSE** | $\sqrt{\text{MSE}}$ — same units as $y$ | | **MAE** | Average absolute error — robust to outliers | | **R²** | Proportion of variance explained (1.0 = perfect) | Always report **both** train and test metrics. --- ## When OLS breaks down - **Too many features** relative to observations — overfitting - **Correlated features** (multicollinearity) — unstable coefficients - **Irrelevant features** — noise gets fitted, generalisation suffers OLS has no built-in mechanism to handle any of these. --- ## The idea behind regularisation Instead of *only* minimising prediction error, also **penalise large coefficients.** - Trade a small increase in bias for a large reduction in variance - Coefficients shrink toward zero — less extreme, more stable - One hyperparameter $\lambda$ controls the penalty strength > Make the model a little worse on training data so it generalises better to new data. --- ## Ridge regression (L2 penalty) $$\hat{\beta}\_{\text{ridge}} = \arg\min\_{\beta} \sum\_{i}(y\_i - x\_i^\top \beta)^2 + \lambda \sum\_{j}\beta\_j^2$$ - Shrinks all coefficients toward zero, but **never exactly to zero** - Good when you have many correlated features — distributes weight among them - One hyperparameter: $\lambda$ (called `alpha` in scikit-learn) --- ## Lasso regression (L1 penalty) $$\hat{\beta}\_{\text{lasso}} = \arg\min\_{\beta} \sum\_{i}(y\_i - x\_i^\top \beta)^2 + \lambda \sum\_{j}|\beta\_j|$$ - Shrinks coefficients toward zero — can set some **exactly to zero** - Built-in **feature selection**: irrelevant features get removed - Good when you suspect many features are noise --- ## Ridge vs lasso | Property | Ridge | Lasso | |----------|-------|-------| | Shrinks coefficients | Yes, toward zero | Yes, some exactly to zero | | Feature selection | No | Automatic | | Correlated predictors | Shrinks together | Picks one, drops others | | When to try first | Many useful features | Many irrelevant features | **Elastic net** combines both penalties — a good default when unsure. --- ## The regularisation path  As $\lambda$ increases: coefficients shrink, some (lasso) hit zero, model gets simpler. The best $\lambda$ is chosen by **cross-validation**, not visual inspection. --- ## Why scaling matters for regularisation The penalty treats all coefficients equally: $\lambda \sum\_j \beta\_j^2$ If features are on different scales, the penalty hits them **unequally**. **Always standardise features before ridge or lasso.** In a pipeline: `StandardScaler()` before the model handles this automatically. --- ## Elastic net Combines L1 and L2 penalties: $$\lambda \left[ \alpha \sum\_j |\beta\_j| + (1-\alpha) \sum\_j \beta\_j^2 \right]$$ - Two hyperparameters: $\lambda$ (overall strength) and $\alpha$ (L1/L2 mix) - In scikit-learn: `ElasticNet(alpha=..., l1_ratio=...)` - A safe default when you're unsure whether ridge or lasso is better --- ## The tuning problem Ridge and lasso each have a hyperparameter ($\lambda$) that controls penalty strength: - **Too small** → overfitting (like OLS) - **Too large** → underfitting (all coefficients near zero) We need a principled way to choose. --- ## GridSearchCV ```python from sklearn.linear_model import Ridge from sklearn.model_selection import GridSearchCV from sklearn.pipeline import make_pipeline from sklearn.preprocessing import StandardScaler pipe = make_pipeline(StandardScaler(), Ridge()) param_grid = {'ridge__alpha': [0.01, 0.1, 1.0, 10.0, 100.0]} search = GridSearchCV(pipe, param_grid, cv=5, scoring='r2') search.fit(X_train, y_train) print(f"Best alpha: {search.best_params_['ridge__alpha']}") print(f"Best CV R²: {search.best_score_:.3f}") print(f"Test R²: {search.score(X_test, y_test):.3f}") ``` --- ## Reading GridSearchCV results ```python import pandas as pd results = pd.DataFrame(search.cv_results_) print(results[['param_ridge__alpha', 'mean_test_score', 'std_test_score']]) ``` - Look at mean CV score **and** standard deviation across folds - Many similar scores → model is not sensitive to alpha (good) - Best alpha at the edge of your grid → extend the grid --- ## The workflow for penalised models 1. **Split** data (train / test) 2. **Build** a pipeline: scaler + penalised model 3. **Define** a grid of alpha values 4. **GridSearchCV** on training data 5. **Inspect** CV results 6. **Evaluate** best model on held-out test set This is the L3 pipeline pattern with one extra step: the grid search. --- ## Reading the coefficients ```python coefs = pipe.named_steps['ridge'].coef_ ``` - With standardised features, **magnitude = relative importance** - Positive coefficient → feature increases prediction - Negative → decreases prediction **Be careful:** this is correlation, not causation. --- ## Residual analysis Residuals = actual − predicted. Plot them vs predicted values: - **Pattern** → model is missing something (nonlinearity, interactions) - **Fan shape** → variance changes with prediction level - **Random scatter** → the model is capturing the structure it can Residual plots are the **cheapest diagnostic** you have. --- ## When linear models aren't enough - **Nonlinear relationships** (curves, thresholds, interactions) - **Heterogeneous effects** across subgroups - **Complex interactions** among many features In these cases, tree-based models may help. But: **always start with the linear baseline.** --- ## Feature engineering for linear models Linear models can only learn linear relationships — unless you give them nonlinear features: - **Polynomial features:** $x$, $x^2$, $x^3$ - **Log transforms:** $\log(\text{income})$ - **Interaction terms:** $x\_1 \times x\_2$ ```python from sklearn.preprocessing import PolynomialFeatures # degree=2 adds x², x₁·x₂, etc. ``` Combine with regularisation to control the complexity. --- ## California housing dataset - Built into scikit-learn: `fetch_california_housing()` - **Target:** median house value (in $100k) for California districts - **8 features:** median income, house age, average rooms, average bedrooms, population, average occupancy, latitude, longitude - ~20,000 observations. Regression problem. --- ## Step 1: Load and split ```python from sklearn.datasets import fetch_california_housing from sklearn.model_selection import train_test_split X, y = fetch_california_housing(return_X_y=True, as_frame=True) print(X.describe()) X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=42 ) ``` --- ## Step 2: OLS baseline ```python from sklearn.linear_model import LinearRegression from sklearn.pipeline import make_pipeline from sklearn.preprocessing import StandardScaler ols = make_pipeline(StandardScaler(), LinearRegression()) ols.fit(X_train, y_train) print(f"OLS — Train R²: {ols.score(X_train, y_train):.3f}") print(f"OLS — Test R²: {ols.score(X_test, y_test):.3f}") ``` Expected: R² around **0.60**. Decent but not great — room for improvement. --- ## Step 3: Ridge with GridSearchCV ```python from sklearn.linear_model import Ridge from sklearn.model_selection import GridSearchCV pipe = make_pipeline(StandardScaler(), Ridge()) param_grid = {'ridge__alpha': [0.01, 0.1, 1.0, 10.0, 100.0]} search = GridSearchCV(pipe, param_grid, cv=5, scoring='r2') search.fit(X_train, y_train) print(f"Best alpha: {search.best_params_['ridge__alpha']}") print(f"Ridge — Test R²: {search.score(X_test, y_test):.3f}") ``` --- ## Step 4: Lasso ```python from sklearn.linear_model import Lasso pipe_lasso = make_pipeline(StandardScaler(), Lasso(max_iter=5000)) param_grid_lasso = {'lasso__alpha': [0.001, 0.01, 0.1, 1.0]} search_lasso = GridSearchCV(pipe_lasso, param_grid_lasso, cv=5, scoring='r2') search_lasso.fit(X_train, y_train) print(f"Best alpha: {search_lasso.best_params_['lasso__alpha']}") print(f"Lasso — Test R²: {search_lasso.score(X_test, y_test):.3f}") ``` Check which features lasso set to zero: `search_lasso.best_estimator_.named_steps['lasso'].coef_` --- ## Step 5: Compare all three | Model | Best alpha | Test R² | |-------|-----------|---------| | OLS | — | ~0.60 | | Ridge | (from CV) | ~0.60 | | Lasso | (from CV) | ~0.60 | For this dataset, all three perform similarly — OLS is well-conditioned with 8 features. **When the simple model works, use the simple model.** --- ## Step 6: Inspect coefficients and residuals - Bar chart of standardised coefficients → **median income** dominates - Residual plot → pattern at high predicted values suggests **nonlinearity** - Location features (lat/long) matter — a tree-based model might capture this better --- ## Summary - Linear regression is the **first credible baseline** - Ridge and lasso add a penalty to control overfitting - **Always standardise** before regularisation - Tune $\lambda$ with **cross-validation** - The simplest model that works is the right model --- ## What to report in your project - Target and feature set - Train/test or resampling rule - **MSE/RMSE and R²** on both train and test - Baseline linear model result for comparison - Whether the penalty improved prediction, stability, or interpretability --- ## Before Lecture 5 - Run today's California housing example on your own machine - Try ridge and lasso on **your own dataset** - Read ahead: Lecture 5 is **classification** (logistic regression, k-NN, naive Bayes) --- ## Questions Open the floor. -- ## Backup slides Use if questions arise or time allows. -- ## Polynomial regression ```python from sklearn.preprocessing import PolynomialFeatures pipe = make_pipeline( PolynomialFeatures(degree=2, include_bias=False), StandardScaler(), Ridge(alpha=1.0) ) ``` - Degree 2 adds $x^2$ and interaction terms - Degree 10 overfits badly — regularisation is the remedy - Always combine polynomial features with a penalty -- ## ElasticNet in detail ```python from sklearn.linear_model import ElasticNet pipe = make_pipeline(StandardScaler(), ElasticNet(alpha=0.1, l1_ratio=0.5)) ``` - `l1_ratio=1.0` → pure lasso - `l1_ratio=0.0` → pure ridge - `l1_ratio=0.5` → equal mix Tune both `alpha` and `l1_ratio` with GridSearchCV. -- ## Cross-validation curves ```python import matplotlib.pyplot as plt import numpy as np results = search.cv_results_ alphas = np.array(param_grid['ridge__alpha']) plt.semilogx(alphas, results['mean_test_score'], marker='o') plt.fill_between(alphas, results['mean_test_score'] - results['std_test_score'], results['mean_test_score'] + results['std_test_score'], alpha=0.2) plt.axvline(search.best_params_['ridge__alpha'], color='red', linestyle='--', label='best α') plt.xlabel('α'); plt.ylabel('Mean CV R²'); plt.legend() ``` - **Low α** → high variance (overfits training) - **High α** → high bias (underfits) - Best α sits at the **knee** of the curve A visual version of the bias-variance tradeoff — easier to read than a results table. -- ## RidgeCV and LassoCV ```python from sklearn.linear_model import RidgeCV, LassoCV # Built-in efficient CV — faster than GridSearchCV ridge = make_pipeline(StandardScaler(), RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0, 100.0])) ridge.fit(X_train, y_train) print(f"Best alpha: {ridge.named_steps['ridgecv'].alpha_}") ``` Uses Leave-One-Out CV (ridge) or coordinate descent path (lasso) — much faster for these specific models. -- ## Coefficients are not causal effects - A large positive coefficient for "income" means income **predicts** higher house values - It does **not** mean giving people more money raises house prices - Omitted variables, confounders, and reverse causality are everywhere - For causal claims, you need a different course (and a different design) --- ## What's next **Lecture 5:** Classification methods - Logistic regression - k-nearest neighbours - Naive Bayes - Comparing classifiers