Linear Models for Prediction

## MST0052 — Lecture 4

### Linear Models for Prediction

Fall 2026

---

## Where we are

| Lectures | Phase |
|----------|-------|
| 1–3 | Foundations |
| **4–7** | **Core methods — you are here** |
| 8–13 | Going further |
| 14–16 | Wrapping up |

---

## Today's plan

- Linear regression as your **first baseline**
- What OLS is optimising — and when it breaks
- Ridge and lasso — why we penalise
- Tuning the penalty with cross-validation
- Worked example: California housing prices

---

## Why start with linear regression?

- **Transparent** — every coefficient is interpretable
- **Fast** — fits in milliseconds
- **Fully inspectable** — you can see exactly what the model learned

A credible baseline gives you something **defensible to beat.**

If a complex model can't improve on a well-specified linear model, the extra complexity is hard to justify.

---

## The model, in one equation

$$\hat{y} = \beta\_0 + \beta\_1 x\_1 + \beta\_2 x\_2 + \cdots + \beta\_p x\_p$$

- Each $\beta\_j$: how much does $\hat{y}$ change per unit change in $x\_j$?
- This is a **prediction** equation, not a causal claim

---

## What OLS is optimising

Ordinary least squares minimises the sum of squared residuals:

$$\hat{\beta} = \arg\min\_{\beta} \sum\_{i=1}^{n}(y\_i - x\_i^\top \beta)^2$$

- Large errors are penalised more than small ones (squared)
- Closed-form solution — no iterative optimisation
- **No hyperparameters** — a strength and a weakness

---

## OLS in scikit-learn

```python
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_train, y_train)
print(f"Train R²: {pipe.score(X_train, y_train):.3f}")
print(f"Test R²:  {pipe.score(X_test, y_test):.3f}")
```

Scaling is optional for OLS but makes coefficients comparable.

---

## Evaluation metrics for regression

| Metric | What it measures |
|--------|-----------------|
| **MSE** | Average squared error — penalises large errors |
| **RMSE** | $\sqrt{\text{MSE}}$ — same units as $y$ |
| **MAE** | Average absolute error — robust to outliers |
| **R²** | Proportion of variance explained (1.0 = perfect) |

Always report **both** train and test metrics.

---

## When OLS breaks down

- **Too many features** relative to observations — overfitting
- **Correlated features** (multicollinearity) — unstable coefficients
- **Irrelevant features** — noise gets fitted, generalisation suffers

OLS has no built-in mechanism to handle any of these.

---

## The idea behind regularisation

Instead of *only* minimising prediction error, also **penalise large coefficients.**

- Trade a small increase in bias for a large reduction in variance
- Coefficients shrink toward zero — less extreme, more stable
- One hyperparameter $\lambda$ controls the penalty strength

> Make the model a little worse on training data so it generalises better to new data.

---

## Ridge regression (L2 penalty)

$$\hat{\beta}\_{\text{ridge}} = \arg\min\_{\beta} \sum\_{i}(y\_i - x\_i^\top \beta)^2 + \lambda \sum\_{j}\beta\_j^2$$

- Shrinks all coefficients toward zero, but **never exactly to zero**
- Good when you have many correlated features — distributes weight among them
- One hyperparameter: $\lambda$ (called `alpha` in scikit-learn)

---

## Lasso regression (L1 penalty)

$$\hat{\beta}\_{\text{lasso}} = \arg\min\_{\beta} \sum\_{i}(y\_i - x\_i^\top \beta)^2 + \lambda \sum\_{j}|\beta\_j|$$

- Shrinks coefficients toward zero — can set some **exactly to zero**
- Built-in **feature selection**: irrelevant features get removed
- Good when you suspect many features are noise

---

## Ridge vs lasso

| Property | Ridge | Lasso |
|----------|-------|-------|
| Shrinks coefficients | Yes, toward zero | Yes, some exactly to zero |
| Feature selection | No | Automatic |
| Correlated predictors | Shrinks together | Picks one, drops others |
| When to try first | Many useful features | Many irrelevant features |

**Elastic net** combines both penalties — a good default when unsure.

---

## The regularisation path

![Regularisation path](/figures/regularisation-path.svg)

As $\lambda$ increases: coefficients shrink, some (lasso) hit zero, model gets simpler.

The best $\lambda$ is chosen by **cross-validation**, not visual inspection.

---

## Why scaling matters for regularisation

The penalty treats all coefficients equally: $\lambda \sum\_j \beta\_j^2$

If features are on different scales, the penalty hits them **unequally**.

**Always standardise features before ridge or lasso.**

In a pipeline: `StandardScaler()` before the model handles this automatically.

---

## Elastic net

Combines L1 and L2 penalties:

$$\lambda \left[ \alpha \sum\_j |\beta\_j| + (1-\alpha) \sum\_j \beta\_j^2 \right]$$

- Two hyperparameters: $\lambda$ (overall strength) and $\alpha$ (L1/L2 mix)
- In scikit-learn: `ElasticNet(alpha=..., l1_ratio=...)`
- A safe default when you're unsure whether ridge or lasso is better

---

## The tuning problem

Ridge and lasso each have a hyperparameter ($\lambda$) that controls penalty strength:

- **Too small** → overfitting (like OLS)
- **Too large** → underfitting (all coefficients near zero)

We need a principled way to choose.

---

## GridSearchCV

```python
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(StandardScaler(), Ridge())
param_grid = {'ridge__alpha': [0.01, 0.1, 1.0, 10.0, 100.0]}
search = GridSearchCV(pipe, param_grid, cv=5, scoring='r2')
search.fit(X_train, y_train)

print(f"Best alpha: {search.best_params_['ridge__alpha']}")
print(f"Best CV R²: {search.best_score_:.3f}")
print(f"Test R²:    {search.score(X_test, y_test):.3f}")
```

---

## Reading GridSearchCV results

```python
import pandas as pd
results = pd.DataFrame(search.cv_results_)
print(results[['param_ridge__alpha',
               'mean_test_score', 'std_test_score']])
```

- Look at mean CV score **and** standard deviation across folds
- Many similar scores → model is not sensitive to alpha (good)
- Best alpha at the edge of your grid → extend the grid

---

## The workflow for penalised models

1. **Split** data (train / test)
2. **Build** a pipeline: scaler + penalised model
3. **Define** a grid of alpha values
4. **GridSearchCV** on training data
5. **Inspect** CV results
6. **Evaluate** best model on held-out test set

This is the L3 pipeline pattern with one extra step: the grid search.

---

## Reading the coefficients

```python
coefs = pipe.named_steps['ridge'].coef_
```

- With standardised features, **magnitude = relative importance**
- Positive coefficient → feature increases prediction
- Negative → decreases prediction

**Be careful:** this is correlation, not causation.

---

## Residual analysis

Residuals = actual − predicted. Plot them vs predicted values:

- **Pattern** → model is missing something (nonlinearity, interactions)
- **Fan shape** → variance changes with prediction level
- **Random scatter** → the model is capturing the structure it can

Residual plots are the **cheapest diagnostic** you have.

---

## When linear models aren't enough

- **Nonlinear relationships** (curves, thresholds, interactions)
- **Heterogeneous effects** across subgroups
- **Complex interactions** among many features

In these cases, tree-based models may help. But: **always start with the linear baseline.**

---

## Feature engineering for linear models

Linear models can only learn linear relationships — unless you give them nonlinear features:

- **Polynomial features:** $x$, $x^2$, $x^3$
- **Log transforms:** $\log(\text{income})$
- **Interaction terms:** $x\_1 \times x\_2$

```python
from sklearn.preprocessing import PolynomialFeatures
# degree=2 adds x², x₁·x₂, etc.
```

Combine with regularisation to control the complexity.

---

## California housing dataset

- Built into scikit-learn: `fetch_california_housing()`
- **Target:** median house value (in $100k) for California districts
- **8 features:** median income, house age, average rooms, average bedrooms, population, average occupancy, latitude, longitude
- ~20,000 observations. Regression problem.

---

## Step 1: Load and split

```python
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

X, y = fetch_california_housing(return_X_y=True, as_frame=True)
print(X.describe())

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```

---

## Step 2: OLS baseline

```python
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

ols = make_pipeline(StandardScaler(), LinearRegression())
ols.fit(X_train, y_train)
print(f"OLS — Train R²: {ols.score(X_train, y_train):.3f}")
print(f"OLS — Test R²:  {ols.score(X_test, y_test):.3f}")
```

Expected: R² around **0.60**. Decent but not great — room for improvement.

---

## Step 3: Ridge with GridSearchCV

```python
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

pipe = make_pipeline(StandardScaler(), Ridge())
param_grid = {'ridge__alpha': [0.01, 0.1, 1.0, 10.0, 100.0]}
search = GridSearchCV(pipe, param_grid, cv=5, scoring='r2')
search.fit(X_train, y_train)

print(f"Best alpha: {search.best_params_['ridge__alpha']}")
print(f"Ridge — Test R²: {search.score(X_test, y_test):.3f}")
```

---

## Step 4: Lasso

```python
from sklearn.linear_model import Lasso

pipe_lasso = make_pipeline(StandardScaler(), Lasso(max_iter=5000))
param_grid_lasso = {'lasso__alpha': [0.001, 0.01, 0.1, 1.0]}
search_lasso = GridSearchCV(pipe_lasso, param_grid_lasso,
                            cv=5, scoring='r2')
search_lasso.fit(X_train, y_train)

print(f"Best alpha: {search_lasso.best_params_['lasso__alpha']}")
print(f"Lasso — Test R²: {search_lasso.score(X_test, y_test):.3f}")
```

Check which features lasso set to zero:
`search_lasso.best_estimator_.named_steps['lasso'].coef_`

---

## Step 5: Compare all three

| Model | Best alpha | Test R² |
|-------|-----------|---------|
| OLS | — | ~0.60 |
| Ridge | (from CV) | ~0.60 |
| Lasso | (from CV) | ~0.60 |

For this dataset, all three perform similarly — OLS is well-conditioned with 8 features.

**When the simple model works, use the simple model.**

---

## Step 6: Inspect coefficients and residuals

- Bar chart of standardised coefficients → **median income** dominates
- Residual plot → pattern at high predicted values suggests **nonlinearity**
- Location features (lat/long) matter — a tree-based model might capture this better

---

## Summary

- Linear regression is the **first credible baseline**
- Ridge and lasso add a penalty to control overfitting
- **Always standardise** before regularisation
- Tune $\lambda$ with **cross-validation**
- The simplest model that works is the right model

---

## What to report in your project

- Target and feature set
- Train/test or resampling rule
- **MSE/RMSE and R²** on both train and test
- Baseline linear model result for comparison
- Whether the penalty improved prediction, stability, or interpretability

---

## Before Lecture 5

- Run today's California housing example on your own machine
- Try ridge and lasso on **your own dataset**
- Read ahead: Lecture 5 is **classification** (logistic regression, k-NN, naive Bayes)

---

## Questions

Open the floor.

## Backup slides

Use if questions arise or time allows.

## Polynomial regression

```python
from sklearn.preprocessing import PolynomialFeatures

pipe = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    Ridge(alpha=1.0)
)
```

- Degree 2 adds $x^2$ and interaction terms
- Degree 10 overfits badly — regularisation is the remedy
- Always combine polynomial features with a penalty

## ElasticNet in detail

```python
from sklearn.linear_model import ElasticNet

pipe = make_pipeline(StandardScaler(),
                     ElasticNet(alpha=0.1, l1_ratio=0.5))
```

- `l1_ratio=1.0` → pure lasso
- `l1_ratio=0.0` → pure ridge
- `l1_ratio=0.5` → equal mix

Tune both `alpha` and `l1_ratio` with GridSearchCV.

## Cross-validation curves

```python
import matplotlib.pyplot as plt
import numpy as np

results = search.cv_results_
alphas = np.array(param_grid['ridge__alpha'])

plt.semilogx(alphas, results['mean_test_score'], marker='o')
plt.fill_between(alphas,
                 results['mean_test_score'] - results['std_test_score'],
                 results['mean_test_score'] + results['std_test_score'],
                 alpha=0.2)
plt.axvline(search.best_params_['ridge__alpha'],
            color='red', linestyle='--', label='best α')
plt.xlabel('α'); plt.ylabel('Mean CV R²'); plt.legend()
```

- **Low α** → high variance (overfits training)
- **High α** → high bias (underfits)
- Best α sits at the **knee** of the curve

A visual version of the bias-variance tradeoff — easier to read than a results table.

## RidgeCV and LassoCV

```python
from sklearn.linear_model import RidgeCV, LassoCV

# Built-in efficient CV — faster than GridSearchCV
ridge = make_pipeline(StandardScaler(),
    RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0, 100.0]))
ridge.fit(X_train, y_train)
print(f"Best alpha: {ridge.named_steps['ridgecv'].alpha_}")
```

Uses Leave-One-Out CV (ridge) or coordinate descent path (lasso) — much faster for these specific models.

## Coefficients are not causal effects

- A large positive coefficient for "income" means income **predicts** higher house values
- It does **not** mean giving people more money raises house prices
- Omitted variables, confounders, and reverse causality are everywhere
- For causal claims, you need a different course (and a different design)

---

## What's next

**Lecture 5:** Classification methods

- Logistic regression
- k-nearest neighbours
- Naive Bayes
- Comparing classifiers