Neural Networks – MST0052

## MST0052 -- Lecture 14

### Neural Networks: A Bridge from Classical ML

Fall 2026

---

## Where we are

| Lectures | Phase |
|----------|-------|
| 1--3 | Foundations |
| 4--7 | Core methods |
| **9--14** | **Going further -- you are here** |
| 15--16 | Wrapping up |

---

## Today's plan

- Why this lecture exists -- and **why the project ban stays**
- **From logistic regression to a neuron** to a network
- **How neural networks learn** -- gradient descent and backprop, in plain English
- What makes deep learning work, when it does
- When classical ML still wins -- and the project advice that follows

---

## A conceptual bridge, not a recipe

This is **not** a deep-learning course.

- We will not train ImageNet
- We will not derive backprop
- We will not write PyTorch

> The goal: place neural networks in your **mental model** -- what they generalise, what they cost, what they buy you.

Strong students should be able to **read** a deep-learning paper at the end of this lecture, even if they could not write one.

---

## The project ban, revisited

The L1 rule: no deep learning as the headline model in the project. **This is a positive design choice.**

The course exists because **classical ML practice** -- validation, leakage, comparison, calibration, interpretation -- is what makes projects **defensible at the oral**.

Deep learning often hides those decisions behind library APIs.

> Take the deep-learning course **afterwards**. You will have the prerequisites to make good use of it precisely because of what you learned here.

---

## What you should leave knowing

- A clear story for **what a neural network is** -- in terms of things you already understand
- Enough **vocabulary** (epoch, batch, optimiser, dropout, attention, transformer) to read further on your own
- A **defensible position** on when classical ML still wins -- useful at the oral, useful in industry for the next decade

---

## Logistic regression, restated

From L5:

$$\Pr(y = 1 \mid x) = \sigma(\beta_0 + x^\top \beta), \quad \sigma(z) = \frac{1}{1 + e^{-z}}$$

- A **linear combination** of features
- **Squashed** through a nonlinearity (sigmoid)
- Fitted to **maximise log-likelihood**

That is also a neural network. **It just has one neuron.**

---

## A single neuron *is* logistic regression renamed

![Same equation, different vocabulary](/figures/neuron-vs-logistic.svg)

A neuron computes $\sigma(w^\top x + b)$. Same equation as L5, different vocabulary:

- $\beta \to w$ (weights)
- $\beta_0 \to b$ (bias)
- $\sigma \to$ "activation function"

> The deep-learning literature uses different words for the same objects. **Knowing the translation is half the battle.**
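
The translation can be checked numerically. A minimal sketch, assuming scikit-learn and NumPy are available; the synthetic data and the names `neuron`, `w`, `b` are illustrative, not course code:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                      # synthetic features (illustrative)
y = (X[:, 0] - 2 * X[:, 1] + rng.normal(size=200) > 0).astype(int)

# Classical vocabulary: fit beta_0 and beta
# (scikit-learn adds a mild L2 penalty by default).
logreg = LogisticRegression().fit(X, y)

# Deep-learning vocabulary: the same numbers, renamed.
w = logreg.coef_.ravel()      # beta   -> weights
b = logreg.intercept_[0]      # beta_0 -> bias

def neuron(X, w, b):
    """One neuron: linear combination, then sigmoid activation."""
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))

p_neuron = neuron(X, w, b)
p_logreg = logreg.predict_proba(X)[:, 1]
print(np.allclose(p_neuron, p_logreg))
```

The hand-written neuron and the fitted logistic regression should return the same probabilities, because they are the same function.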

---

## Multiple neurons make a layer

A **layer** = a stack of neurons that all see the same input but learn different weights.

For input $x \in \mathbb{R}^p$ and a layer with $h$ neurons:

$$\text{layer}(x) = \sigma(W x + b), \quad W \in \mathbb{R}^{h \times p}, \; b \in \mathbb{R}^h$$

The output is a new vector in $\mathbb{R}^h$ -- a learned, nonlinear **representation** of the input.
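
A layer is short enough to write out. A minimal NumPy sketch with random, untrained weights; the sizes `p = 4` and `h = 3` are arbitrary choices for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer(x, W, b):
    """h neurons sharing one input: sigma(W x + b), applied elementwise."""
    return sigmoid(W @ x + b)

rng = np.random.default_rng(0)
p, h = 4, 3                      # input dimension, number of neurons
x = rng.normal(size=p)           # one input vector in R^p
W = rng.normal(size=(h, p))      # one row of weights per neuron
b = np.zeros(h)

out = layer(x, W, b)
print(out.shape)

# Row 0 of W is neuron 0: the layer is h logistic-regression units side by side.
first_neuron = sigmoid(W[0] @ x + b[0])
print(np.isclose(out[0], first_neuron))
```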

---

## Stacking layers makes a network

![A small feedforward network with two hidden layers](/figures/network-stacked-layers.svg)

Feed the output of one layer into the next:

$$h_1 = \sigma(W_1 x + b_1), \quad h_2 = \sigma(W_2 h_1 + b_2), \quad \hat{y} = W_3 h_2 + b_3$$

Output activation depends on the task:

- **Regression** → identity (no activation)
- **Binary classification** → sigmoid
- **Multi-class** → softmax
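
The stacked equations and the three output choices, as a sketch. This assumes NumPy and random, untrained weights; `forward`, `init`, and the layer sizes are hypothetical names and values chosen for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())      # subtract the max for numerical stability
    return e / e.sum()

def forward(x, params, task):
    """Two sigmoid hidden layers, then a task-specific output activation."""
    (W1, b1), (W2, b2), (W3, b3) = params
    h1 = sigmoid(W1 @ x + b1)
    h2 = sigmoid(W2 @ h1 + b2)
    z = W3 @ h2 + b3             # raw output scores
    if task == "regression":
        return z                 # identity
    if task == "binary":
        return sigmoid(z)        # probability of class 1
    if task == "multiclass":
        return softmax(z)        # probabilities over classes, sum to 1
    raise ValueError(f"unknown task: {task}")

def init(sizes, rng):
    """Random (W, b) pairs for consecutive layer sizes, e.g. [5, 4, 3, 1]."""
    return [(rng.normal(size=(n_out, n_in)), np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

rng = np.random.default_rng(0)
x = rng.normal(size=5)
y_reg = forward(x, init([5, 4, 3, 1], rng), "regression")
p_bin = forward(x, init([5, 4, 3, 1], rng), "binary")
p_multi = forward(x, init([5, 4, 3, 3], rng), "multiclass")
print(y_reg.shape, p_bin.shape, p_multi.sum())
```

Only the last line of `forward` changes between tasks; the hidden layers are identical.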

---

## Activation functions

![ReLU, sigmoid, and tanh](/figures/activation-functions.svg)

| Function | Formula | Where used |
|----------|---------|------------|
| **ReLU** | $\max(0, z)$ | Default for hidden layers |
| **Sigmoid** | $1 / (1 + e^{-z})$ | Output (binary classification) |
| **Softmax** | $e^{z_c} / \sum_j e^{z_j}$ | Output (multiclass) |
| **Tanh** | $\tanh(z)$ | Older; rarely used now |

ReLU is the default: cheap to compute, does not saturate for large positive inputs, and trains well in deep stacks.
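
"Saturation" becomes concrete when you look at derivatives, since the derivative is what gradient descent sees. A small NumPy sketch; the input values are arbitrary:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu_grad(z):
    return (z > 0).astype(float)          # 1 for positive inputs, 0 otherwise

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)                  # peaks at 0.25, vanishes for large |z|

def tanh_grad(z):
    return 1.0 - np.tanh(z) ** 2          # peaks at 1, vanishes for large |z|

z = np.array([-10.0, -1.0, 0.5, 10.0])
print("relu'   :", relu_grad(z))
print("sigmoid':", sigmoid_grad(z))
print("tanh'   :", tanh_grad(z))
```

At `z = 10` the sigmoid and tanh derivatives are nearly zero, so almost no learning signal passes through; the ReLU derivative is still 1.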

---

## The loss function

Given training data, define a loss $L(\hat{y}, y)$:

- **Regression:** squared error
- **Classification:** cross-entropy (negative log-likelihood -- same as logistic regression's loss)

> Training is **finding the weights that minimise the average loss** on the training set.

Same idea as ridge or logistic regression. Just more parameters and a harder optimisation surface.
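
Both losses fit in a few lines. A sketch assuming NumPy, with made-up labels and predictions, showing that binary cross-entropy is the average negative log-likelihood:

```python
import numpy as np

def squared_error(y_hat, y):
    return np.mean((y_hat - y) ** 2)

def cross_entropy(p_hat, y):
    """Binary cross-entropy = average negative log-likelihood."""
    return -np.mean(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

y = np.array([1, 0, 1, 1])
confident_right = np.array([0.9, 0.1, 0.8, 0.95])
confident_wrong = np.array([0.1, 0.9, 0.2, 0.05])

print(cross_entropy(confident_right, y))   # small loss
print(cross_entropy(confident_wrong, y))   # much larger loss

# The same number, computed as a likelihood: probability assigned to the true label.
likelihood = np.where(y == 1, confident_right, 1 - confident_right)
nll = -np.mean(np.log(likelihood))
print(np.isclose(nll, cross_entropy(confident_right, y)))
```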

---

## Gradient descent, in one slide

The loss is a function of the weights. We can compute its **gradient** with respect to each weight.

- The gradient points in the direction of steepest **increase**
- We take a small step in the **opposite** direction

Repeat. Hundreds of thousands of times. That is gradient descent.

$$w \leftarrow w - \eta \cdot \nabla_w L$$

with $\eta$ the **learning rate**.
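
The update rule on a toy problem. A sketch assuming NumPy and a hand-picked quadratic loss with its minimum at `(3, -1)`; the loss, `eta = 0.1`, and the step count are illustrative choices:

```python
import numpy as np

target = np.array([3.0, -1.0])

def loss(w):
    return np.sum((w - target) ** 2)

def grad(w):
    return 2.0 * (w - target)         # points towards steepest increase

w = np.zeros(2)
eta = 0.1                             # learning rate
for step in range(100):
    w = w - eta * grad(w)             # small step in the opposite direction

print(w, loss(w))
```

On this loss each step shrinks the distance to the minimum by a factor of `1 - 2 * eta`. With `eta` above 1.0 the same loop diverges, which is why the learning rate is the first thing to tune.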

---

## Backpropagation, in one paragraph

Backpropagation is the algorithm that computes the gradient of the loss with respect to **every** weight in a deep network, efficiently.

> It is the **chain rule from calculus**, applied layer by layer, from the output back to the input.

Modern frameworks (PyTorch, TensorFlow, JAX) compute backprop **automatically** -- you never write it.
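
For the curious, the chain rule on the smallest network that has a hidden layer. This is a sketch, not a derivation: it assumes NumPy, one sigmoid hidden layer, a linear output, squared-error loss, and no bias terms. It checks the hand-computed gradient against a finite-difference estimate:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_fn(W1, w2, x, y):
    h = sigmoid(W1 @ x)                  # hidden layer
    y_hat = w2 @ h                       # linear output
    return 0.5 * (y_hat - y) ** 2

def backprop(W1, w2, x, y):
    # Forward pass, keeping intermediate values.
    h = sigmoid(W1 @ x)
    y_hat = w2 @ h
    # Backward pass: chain rule, output first.
    d_yhat = y_hat - y                   # dL/dy_hat
    d_w2 = d_yhat * h                    # dL/dw2
    d_h = d_yhat * w2                    # dL/dh
    d_z = d_h * h * (1.0 - h)            # back through the sigmoid
    d_W1 = np.outer(d_z, x)              # dL/dW1
    return d_W1, d_w2

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), 1.0
W1, w2 = rng.normal(size=(4, 3)), rng.normal(size=4)

d_W1, d_w2 = backprop(W1, w2, x, y)

# Nudge one weight up and down and measure how the loss changes.
eps = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[2, 1] += eps
W1_minus[2, 1] -= eps
numeric = (loss_fn(W1_plus, w2, x, y) - loss_fn(W1_minus, w2, x, y)) / (2 * eps)
print(d_W1[2, 1], numeric)
```

The finite-difference check needs two forward passes per weight; backprop gets every gradient from one forward and one backward pass. That is the "efficiently" in the definition.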

---

## Minibatches and stochastic gradient descent

Computing the gradient on the full training set is slow.

**Minibatch SGD:**

- At each step, compute the gradient on a small random subset of the training data (typically 32--256 rows)
- Trade a little noise per step for **many more steps per epoch**
- Almost always converges faster in wall-clock time
- The noise itself acts as a mild **regulariser**
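
The "little noise per step" can be measured. A sketch assuming NumPy, synthetic regression data, and a batch size of 64; all of these are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 10_000, 5
X = rng.normal(size=(n, p))
y = X @ np.arange(1.0, p + 1) + rng.normal(size=n)
w = np.zeros(p)                              # current weights

def mse_grad(X, y, w):
    """Gradient of mean squared error with respect to w."""
    return 2.0 * X.T @ (X @ w - y) / len(y)

full = mse_grad(X, y, w)                     # uses all 10,000 rows

batch = rng.choice(n, size=64, replace=False)
noisy = mse_grad(X[batch], y[batch], w)      # uses 64 rows

# Averaging many minibatch gradients recovers the full gradient.
estimates = []
for _ in range(500):
    idx = rng.choice(n, size=64, replace=False)
    estimates.append(mse_grad(X[idx], y[idx], w))

single_err = np.abs(noisy - full).max()
avg_err = np.abs(np.mean(estimates, axis=0) - full).max()
print(f"one minibatch: {single_err:.3f}   average of 500: {avg_err:.3f}")
```

Each minibatch gradient is a noisy but unbiased estimate of the full one, at a small fraction of the cost.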

---

## A training loop, in pseudocode

```text
for each epoch:
    shuffle the training set
    for each minibatch in the training set:
        compute predictions
        compute loss
        compute gradient via backprop
        update weights with learning rate
    evaluate on validation set
    optionally: save best model, decay learning rate, stop early
```
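
The same loop, runnable. A sketch assuming NumPy, synthetic data, and a single neuron (so the gradient has a closed form and no framework is needed); the learning rate, batch size, and epoch count are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
true_w = np.array([2.0, -1.0, 0.5, 0.0])
y = (X @ true_w + rng.normal(size=1000) > 0).astype(float)
X_tr, y_tr, X_val, y_val = X[:800], y[:800], X[800:], y[800:]

def predict(X, w, b):
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))

def log_loss(p, y):
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

w, b, eta, batch_size = np.zeros(4), 0.0, 0.1, 32
best_val, best_params = np.inf, (w.copy(), b)
history = []

for epoch in range(20):
    order = rng.permutation(len(y_tr))                 # shuffle the training set
    for start in range(0, len(order), batch_size):     # each minibatch
        idx = order[start:start + batch_size]
        p = predict(X_tr[idx], w, b)                   # compute predictions
        batch_loss = log_loss(p, y_tr[idx])            # compute loss (monitoring only)
        # Gradient of the cross-entropy loss; for one neuron,
        # backprop reduces to this closed form.
        grad_w = X_tr[idx].T @ (p - y_tr[idx]) / len(idx)
        grad_b = np.mean(p - y_tr[idx])
        w, b = w - eta * grad_w, b - eta * grad_b      # update weights
    val_loss = log_loss(predict(X_val, w, b), y_val)   # evaluate on validation set
    history.append(val_loss)
    if val_loss < best_val:                            # save best model
        best_val, best_params = val_loss, (w.copy(), b)

print(f"best validation loss: {best_val:.3f}")
```

A framework replaces the two `grad_` lines with automatic backprop; the rest of the loop looks the same.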

---

## Universal approximation, in one sentence

A neural network with one hidden layer of sufficient width can approximate **any continuous function** on a bounded domain to arbitrary accuracy.

This sounds magical and is mostly useless:

- "Sufficient width" can mean **astronomically wide**
- The theorem says nothing about how **easy** it is to find those weights

> The real reason deep networks work is **depth** -- composing many simple transformations is more efficient than one extremely wide one.
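
The width effect can be seen on a toy target. A sketch under a strong simplification: the hidden ReLU weights are random and fixed, and only the output weights are fitted by least squares. That is not how networks are trained; it only illustrates how approximation error falls as width grows. Assumes NumPy; the target `sin(x)` and the widths are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-np.pi, np.pi, 400)
target = np.sin(x)

# One hidden ReLU layer with random, fixed weights (not trained).
max_width = 200
w_in = rng.normal(size=max_width)
b_in = rng.normal(size=max_width)
hidden = np.maximum(0.0, np.outer(x, w_in) + b_in)    # shape (400, 200)

def fit_error(width):
    """RMSE on the grid when only the first `width` hidden units are used."""
    H = hidden[:, :width]
    w_out, *_ = np.linalg.lstsq(H, target, rcond=None)
    return np.sqrt(np.mean((H @ w_out - target) ** 2))

errors = {width: fit_error(width) for width in (2, 10, 50, 200)}
for width, err in errors.items():
    print(f"width {width:>3}: RMSE {err:.4f}")
```

This is a one-dimensional target. The width needed grows quickly with input dimension, which is the "astronomically wide" caveat.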

---

## Why depth helps

A deep network learns **hierarchical representations**:

- Early layers learn **simple patterns**
- Middle layers learn **motifs**
- Late layers learn **complex objects or concepts**

In vision: early layers = edges; middle = textures; late = objects. Same architecture, learned end-to-end.

> "Feature engineering, automated" -- at the cost of an enormous amount of data and compute.

---

## The data hunger

- **Classical ML is sample-efficient** because we encoded a lot of structure in the model (linearity, sparsity, tree splits)
- **Deep learning is sample-hungry** because it removes that structure and learns it from data

| Data scale | Default choice |
|------------|----------------|
| Thousands of rows | Classical ML |
| Millions of rows | Deep learning starts to win |
| Perceptual structure (images, audio, text) | Deep learning |

---

## Architectures with inductive bias

Plain feedforward networks (MLPs) work, but rarely win in practice. The wins come from architectures that **bake in problem structure**:

- **CNNs** (convolutional networks): translation invariance for **images**
- **RNNs** / **LSTMs**: sequential structure (older; mostly replaced by transformers)
- **Transformers**: attention-based -- for sequences, increasingly for everything

The architecture itself is a prior. **The right one shrinks the data hunger.**

---

## The 2026 landscape, in one slide

- **Foundation models** -- extremely large models trained on internet-scale data once, then **fine-tuned** or **prompted** for specific tasks
- **LLMs** are the most visible example, but the same recipe now runs vision, audio, robotics, code, and proteins
- The frontier in 2026 is **not training from scratch** -- it is **using foundation models well**

---

## Tabular data: classical still wins

For tabular data (rows = observations, columns = features, no spatial / sequential structure):

> **Gradient boosting is still state of the art.**

Many empirical comparisons agree: gradient boosting machines (GBMs) beat neural networks on most tabular tasks, often by a meaningful margin, with a **fraction of the compute**.

> The classical-ML toolkit is **not** "what people did before deep learning." It is the right tool for tabular problems, full stop.

---

## Small samples

Deep learning needs lots of data.

- A few hundred rows: **not enough**
- A few thousand rows: still not enough for from-scratch deep learning
- A linear model or a tuned gradient-boosting model will usually **outperform** a from-scratch neural network on small data -- and do so more reliably across seeds

The course datasets fit comfortably in the **classical regime**.

---

## Interpretability and trust

- **Linear models** → coefficient table
- **Tree ensembles** → permutation importance, SHAP
- **Neural networks** → a black box plus post-hoc tools that may or may not be faithful

For projects, regulated industries, or anything that needs to be **defended**: classical methods come with interpretability built in.

This is also why the oral exam can probe deeply into classical methods -- there is **something concrete to defend**.

---

## The practical boundary

> "If your data fits in a spreadsheet and your features are meaningful, start with gradient boosting. If your data is images, text, or sequences, consider neural networks -- and consider an existing pretrained model first."

This is a heuristic, not a law. Edge cases exist.

Default to the heuristic; **deviate when you can defend the deviation.**

---

## The setup

- **Dataset:** `load_breast_cancer()` -- same as L10, L11, L12. Three lectures of comparison numbers already on the board.
- **Protocol:** stratified 80/20, stratified 5-fold CV, `f1` metric.
- **Today's contender:** scikit-learn's `MLPClassifier` -- a small multilayer perceptron.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
```

---

## A small MLP in scikit-learn

```python
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(
        hidden_layer_sizes=(32, 16),
        activation='relu',
        max_iter=300,
        early_stopping=True,
        random_state=42,
    ),
)
scores = cross_val_score(mlp, X_train, y_train, cv=5, scoring='f1')
print(f"MLP CV F1: {scores.mean():.3f} ± {scores.std():.3f}")
```

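Two hidden layers, 32 then 16 ReLU units. `StandardScaler` is effectively mandatory: gradient-based training converges poorly when features sit on very different scales. `early_stopping=True` holds out 10% of each training fold (the default `validation_fraction`) to decide when to stop.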

---

## The four-family comparison

| Family | Test F1 | Trained in | Defensible? |
|--------|---------|------------|-------------|
| Random forest (L10) | ~0.97 | seconds | yes |
| RBF SVM (L11) | ~0.98 | seconds | yes |
| HistGBM (L12) | ~0.98 | seconds | yes |
| MLP (today) | similar | seconds to minutes | **less obviously** |

The MLP is **competitive**. It is also:

- Harder to defend at the oral
- Slower to tune
- More sensitive to seed
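
Seed sensitivity is easy to measure yourself. A sketch assuming scikit-learn, reusing the protocol from the setup slide; the choice of five seeds is arbitrary and no particular spread is claimed here:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

seed_means = []
for seed in range(5):                       # 5 seeds is an arbitrary choice
    mlp = make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=300,
                      early_stopping=True, random_state=seed),
    )
    # cv=5 gives the same unshuffled stratified folds every time,
    # so only the network's seed changes between runs.
    scores = cross_val_score(mlp, X_train, y_train, cv=5, scoring="f1")
    seed_means.append(scores.mean())

print(f"CV F1 across seeds: min {min(seed_means):.3f}, max {max(seed_means):.3f}")
```

Run the same loop with a gradient-boosting model and compare the spread before drawing conclusions.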

---

## What this experiment shows

- A neural network can **match** classical methods on small tabular data -- but does not **exceed** them
- The marginal cost (tuning, interpretability, seed sensitivity) is real

> The project advice **does not change**: for tabular problems, gradient boosting is the right default. Use the time you would spend tuning an MLP on tightening your validation and report.

---

## Where neural networks would actually win on this dataset

**They wouldn't.**

If the breast-cancer task were "predict the diagnosis from the **raw biopsy image**" rather than from 30 hand-engineered numerical features, a CNN trained on a large image dataset would beat any tabular model.

> The win comes from the **data type and scale**, not from the algorithm in isolation.

---

## Summary

- A single neuron **is** logistic regression renamed. A network is many of them, stacked.
- Training = **loss + gradient descent + backprop**, automated by the framework.
- Deep learning **wins** with abundant data and perceptual structure (images, text, audio).
- Classical ML **wins** on tabular data, small samples, and anything that has to be *defended*.
- This was the **last new method** in the course. The toolbox is now closed.

---

## Before Lecture 15

- L15 is the **rehearsal project workshop** -- bring your project to present in front of the class
- This week: **finalise your model selection**. Pick the workflow you'll defend, write the selection rule, and run the test-set evaluation **once**
- Want feedback before the rehearsal? Office hours or email me.

---

## Questions

Open the floor.

## Backup slides

Use if questions arise or time allows.

## A taste of PyTorch

```python
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(30, 32)   # 30 input features -> 32 hidden units
        self.fc2 = nn.Linear(32, 1)    # 32 hidden units -> 1 output logit

    def forward(self, x):
        h = torch.relu(self.fc1(x))
        return self.fc2(h)             # raw logit; the loss applies the sigmoid

model = Net()
loss_fn = nn.BCEWithLogitsLoss()       # sigmoid + binary cross-entropy in one
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)
```

A few lines define the architecture; the rest of the work is the training loop. **Pedagogical, not project advice.**

## Convolutional networks in one slide

**Convolution** = a small filter slides across the input, computing weighted sums at every position.

Why it works for images:

- **Translation invariance** -- a cat is a cat whether it's in the top-left or bottom-right
- **Local connectivity** -- nearby pixels are strongly correlated; distant ones much less so
- **Parameter sharing** -- the same filter applies everywhere → far fewer weights than a fully-connected layer

CNNs dominate vision because they encode these priors **architecturally**, not via data alone.
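
The sliding filter in one dimension. A sketch assuming NumPy; the signal, the three-weight kernel, and the name `conv1d` are illustrative. Strictly, the convolution itself is translation *equivariant* (shift the input, the output shifts with it); taking a max over positions afterwards gives the invariance:

```python
import numpy as np

def conv1d(signal, kernel):
    """Slide the kernel across the signal; weighted sum at every position."""
    k = len(kernel)
    return np.array([signal[i:i + k] @ kernel
                     for i in range(len(signal) - k + 1)])

kernel = np.array([-1.0, 0.0, 1.0])     # 3 shared weights: a crude edge detector
signal = np.zeros(12)
signal[3:5] = 1.0                        # a small "object" at positions 3-4
shifted = np.roll(signal, 3)             # the same object, moved 3 steps right

out = conv1d(signal, kernel)
out_shifted = conv1d(shifted, kernel)

# Moving the object moves the response by the same amount, unchanged in shape.
print(np.array_equal(out_shifted, np.roll(out, 3)))
# A max over positions then ignores where the object was.
print(out.max() == out_shifted.max())

# Parameter sharing: 3 weights here, versus 12 * 10 for a dense layer
# mapping the same 12 inputs to the same 10 outputs.
n_conv, n_dense = kernel.size, signal.size * out.size
print(n_conv, n_dense)
```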

## Transformers in one slide

The core operation: **attention** -- every token looks at every other token, weighted by similarity.

```text
attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V
```

- $Q, K, V$ are learned projections of the input; $d$ is the dimension of the keys
- The softmax over similarities tells each token **what to pay attention to**

Why it works:

- No fixed receptive field -- captures long-range dependencies
- Parallelisable (unlike RNNs)
- Scales beautifully with data + compute

Now used for sequences (text, audio, code, proteins) and increasingly for images.
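
The attention formula translates to NumPy almost symbol for symbol. A single-head sketch with random, untrained projection matrices; the sizes and the names `W_q`, `W_k`, `W_v` are illustrative:

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)        # similarity of every token to every other
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
n_tokens, d_model, d = 5, 8, 4
X = rng.normal(size=(n_tokens, d_model))             # token embeddings
W_q, W_k, W_v = (rng.normal(size=(d_model, d)) for _ in range(3))

out, weights = attention(X @ W_q, X @ W_k, X @ W_v)
print(out.shape, weights.sum(axis=1))
```

Row `i` of `weights` is what token `i` pays attention to. The whole operation is three matrix products, which is why it parallelises so well.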

## Regularisation in deep learning

The L6 dial in deep-learning clothing:

| Technique | What it does |
|-----------|--------------|
| **Dropout** | Randomly zero out neurons during training |
| **Weight decay** | L2 penalty on the weights (= ridge) |
| **Batch normalisation** | Normalise activations per-batch within the network |
| **Early stopping** | Stop training when validation loss stops improving |
| **Data augmentation** | Add transformed copies of training examples (rotations, crops, noise) |

All address the same bias-variance tradeoff -- different mechanisms.
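
Two rows of the table, written out. A sketch assuming NumPy; it uses the common "inverted" dropout variant (rescale at training time so nothing changes at prediction time), and the function names are hypothetical:

```python
import numpy as np

def dropout(h, p, rng, training):
    """Inverted dropout: zero each unit with probability p, rescale the rest."""
    if not training:
        return h                              # no-op at prediction time
    mask = rng.random(h.shape) >= p
    return h * mask / (1.0 - p)

def sgd_step(w, grad, eta, weight_decay):
    """Gradient step on loss + (weight_decay / 2) * ||w||^2, i.e. a ridge penalty."""
    return w - eta * (grad + weight_decay * w)

rng = np.random.default_rng(0)
h = np.ones(10_000)
dropped = dropout(h, p=0.5, rng=rng, training=True)
print((dropped == 0).mean(), dropped.mean())   # about half zeroed, mean preserved

# With a zero loss gradient, weight decay alone shrinks the weights towards 0.
w = np.array([1.0, -2.0])
decayed = sgd_step(w, grad=np.zeros(2), eta=0.1, weight_decay=0.5)
print(decayed)
```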

## Fine-tuning vs from-scratch

When foundation models help:

```text
Do you have ≥ millions of labelled examples
in your task's exact domain?

  Yes → train from scratch (rare in practice)
  No  → fine-tune a foundation model
        or use it via embeddings + classical model
```

For most real-world tasks in 2026, the answer is **fine-tune**. Foundation models are pre-trained on far more data than you'll ever collect for your specific task.

The deep-learning course will teach you how to do this well.

---

## What's next

**Lecture 15:** Project workshop

- Present your project to peers
- Give and receive feedback
- Final preparation before submission