## MST0052 -- Lecture 14

### Neural Networks: A Bridge from Classical ML

Fall 2026

---

## Where we are

| Lectures | Phase |
|----------|-------|
| 1--3 | Foundations |
| 4--7 | Core methods |
| **9--14** | **Going further -- you are here** |
| 15--16 | Wrapping up |

---

## Today's plan

- Why this lecture exists -- and **why the project ban stays**
- **From logistic regression to a neuron** to a network
- **How neural networks learn** -- gradient descent and backprop, in plain English
- What makes deep learning work, when it does
- When classical ML still wins -- and the project advice that follows

---

## A conceptual bridge, not a recipe

This is **not** a deep-learning course.

- We will not train ImageNet
- We will not derive backprop
- We will not write PyTorch

> The goal: place neural networks in your **mental model** -- what they generalise, what they cost, what they buy you.

Strong students should be able to **read** a deep-learning paper at the end of this lecture, even if they could not write one.

---

## The project ban, revisited

The L1 rule: no deep learning as the headline model in the project.

**This is a positive design choice.** The course exists because **classical ML practice** -- validation, leakage, comparison, calibration, interpretation -- is what makes projects **defensible at the oral**. Deep learning often hides those decisions behind library APIs.

> Take the deep-learning course **afterwards**. You will have the prerequisites to make good use of it precisely because of what you learned here.
---

## What you should leave knowing

- A clear story for **what a neural network is** -- in terms of things you already understand
- Enough **vocabulary** (epoch, batch, optimiser, dropout, attention, transformer) to read further on your own
- A **defensible position** on when classical ML still wins -- useful at the oral, useful in industry for the next decade

---

## Logistic regression, restated

From L5:

$$\Pr(y = 1 \mid x) = \sigma(\beta_0 + x^\top \beta), \quad \sigma(z) = \frac{1}{1 + e^{-z}}$$

- A **linear combination** of features
- **Squashed** through a nonlinearity (sigmoid)
- Fitted to **maximise log-likelihood**

That is also a neural network. **It just has one neuron.**

---

## A single neuron *is* logistic regression renamed

A neuron computes $\sigma(w^\top x + b)$. Same equation as L5, different vocabulary:

- $\beta \to w$ (weights)
- $\beta_0 \to b$ (bias)
- $\sigma \to$ "activation function"

> The deep-learning literature uses different words for the same objects. **Knowing the translation is half the battle.**

---

## Multiple neurons make a layer

A **layer** = a stack of neurons that all see the same input but learn different weights.

For input $x \in \mathbb{R}^p$ and a layer with $h$ neurons:

$$\text{layer}(x) = \sigma(W x + b), \quad W \in \mathbb{R}^{h \times p}, \; b \in \mathbb{R}^h$$

The output is a new vector in $\mathbb{R}^h$ -- a learned, nonlinear **representation** of the input.
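---

## A layer, sketched in Python

The layer formula can be spelled out in a few lines. This is a minimal sketch in plain Python (no NumPy, to keep it dependency-free); the weights, bias, and input values are made up for illustration, and the names `layer`, `W`, `b` simply mirror the symbols in the formula.

```python
import math


def sigmoid(z):
    """Logistic function, the activation used in the layer formula."""
    return 1.0 / (1.0 + math.exp(-z))


def layer(x, W, b):
    """Compute sigma(W x + b) for one input vector x.

    W is a list of h rows, each of length p; b has length h.
    Each row of W is one neuron: a logistic regression on the same input.
    """
    return [
        sigmoid(sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i)
        for row, b_i in zip(W, b)
    ]


# Illustrative values (assumptions): p = 3 input features, h = 2 neurons.
x = [1.0, -2.0, 0.5]
W = [[0.2, -0.1, 0.4],
     [-0.3, 0.8, 0.0]]
b = [0.0, 0.1]

representation = layer(x, W, b)
print(len(representation))  # one output per neuron
```

Each entry of `representation` lies strictly between 0 and 1 because of the sigmoid; a real library would do the same computation as one matrix-vector product.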
---

## Stacking layers makes a network

Feed the output of one layer into the next:

$$h_1 = \sigma(W_1 x + b_1), \quad h_2 = \sigma(W_2 h_1 + b_2), \quad \hat{y} = W_3 h_2 + b_3$$

Output activation depends on the task:

- **Regression** → identity (no activation)
- **Binary classification** → sigmoid
- **Multi-class** → softmax

---

## Activation functions

| Function | Formula | Where used |
|----------|---------|------------|
| **ReLU** | $\max(0, z)$ | Default for hidden layers |
| **Sigmoid** | $1 / (1 + e^{-z})$ | Output (binary classification) |
| **Softmax** | $e^{z_c} / \sum_j e^{z_j}$ | Output (multiclass) |
| **Tanh** | $\tanh(z)$ | Older; rarely used now |

ReLU is the default: cheap, doesn't saturate for large positive inputs, trains well with deep stacks.

---

## The loss function

Given training data, define a loss $L(\hat{y}, y)$:

- **Regression:** squared error
- **Classification:** cross-entropy (negative log-likelihood -- same as logistic regression's loss)

> Training is **finding the weights that minimise the average loss** on the training set.

Same idea as ridge or logistic regression. Just more parameters and a harder optimisation surface.

---

## Gradient descent, in one slide

The loss is a function of the weights. We can compute its **gradient** with respect to each weight.

- The gradient points in the direction of steepest **increase**
- We take a small step in the **opposite** direction

Repeat. Hundreds of thousands of times. That is gradient descent.

$$w \leftarrow w - \eta \cdot \nabla_w L$$

with $\eta$ the **learning rate**.

---

## Backpropagation, in one paragraph

Backpropagation is the algorithm that computes the gradient of the loss with respect to **every** weight in a deep network, efficiently.

> It is the **chain rule from calculus**, applied layer by layer, from the output back to the input.

Modern frameworks (PyTorch, TensorFlow, JAX) compute backprop **automatically** -- you never write it.
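---

## Gradient descent on one neuron, sketched

For a single sigmoid neuron with cross-entropy loss, the chain rule collapses to a simple expression: the gradient for one example is $(p - y)\,x$ for the weight and $(p - y)$ for the bias, where $p$ is the predicted probability. The sketch below applies the update rule $w \leftarrow w - \eta \cdot \nabla_w L$ to that case. The toy data, the learning rate, and the step count are assumptions chosen for illustration, not course material.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def average_loss(w, b, xs, ys):
    """Mean cross-entropy of a single sigmoid neuron on 1-D inputs."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(xs)


def gradient(w, b, xs, ys):
    """Gradient of the mean cross-entropy with respect to w and b."""
    n = len(xs)
    residuals = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]  # p - y
    grad_w = sum(r * x for r, x in zip(residuals, xs)) / n
    grad_b = sum(residuals) / n
    return grad_w, grad_b


# Toy data (an assumption): labels tend to be 1 for larger x, with overlap.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 1, 0, 1, 1]

w, b = 0.0, 0.0
eta = 0.5  # learning rate
initial_loss = average_loss(w, b, xs, ys)

for step in range(200):
    grad_w, grad_b = gradient(w, b, xs, ys)
    w -= eta * grad_w  # step against the gradient
    b -= eta * grad_b

final_loss = average_loss(w, b, xs, ys)
print(f"loss: {initial_loss:.3f} -> {final_loss:.3f}")
```

In a deeper network the update rule is identical; backprop is what supplies the gradient for every layer's weights instead of this hand-written formula.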
---

## Minibatches and stochastic gradient descent

Computing the gradient on the full training set is slow.

**Minibatch SGD:**

- At each step, compute the gradient on a small random subset of training data (32--256 rows)
- Trade a little noise per step for **many more steps per epoch**
- Almost always faster to convergence
- The noise itself acts as a mild **regulariser**

---

## A training loop, in pseudocode

```text
for each epoch:
    shuffle the training set
    for each minibatch in the training set:
        compute predictions
        compute loss
        compute gradient via backprop
        update weights with learning rate
    evaluate on validation set
    optionally: save best model, decay learning rate, stop early
```

---

## Universal approximation, in one sentence

A neural network with one hidden layer of sufficient width can approximate **any continuous function** on a bounded domain to arbitrary accuracy.

This sounds magical and is mostly useless:

- "Sufficient width" can mean **astronomically wide**
- The theorem says nothing about how **easy** it is to find those weights

> The real reason deep networks work is **depth** -- composing many simple transformations is more efficient than one extremely wide one.

---

## Why depth helps

A deep network learns **hierarchical representations**:

- Early layers learn **simple patterns**
- Middle layers learn **motifs**
- Late layers learn **complex objects or concepts**

In vision: early layers = edges; middle = textures; late = objects. Same architecture, learned end-to-end.

> "Feature engineering, automated" -- at the cost of an enormous amount of data and compute.
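---

## The training loop, sketched in Python

The training-loop pseudocode from earlier in the lecture can be made concrete for the smallest possible network: one sigmoid neuron. This is a sketch under stated assumptions, not a recipe. The synthetic data, the batch size of 32, the learning rate, and the `patience` rule for early stopping are all illustrative choices, and the gradient is written by hand because a one-neuron model does not need backprop.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def mean_loss(w, b, data):
    """Mean cross-entropy of a single sigmoid neuron on (x, y) pairs."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for x, y in data:
        p = sigmoid(w * x + b)
        total += -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))
    return total / len(data)


rng = random.Random(0)  # fixed seed so the run is repeatable

# Synthetic data (an assumption): noisy labels that favour 1 when x > 0.
points = [rng.uniform(-3, 3) for _ in range(500)]
data = [(x, 1 if x + rng.gauss(0, 1) > 0 else 0) for x in points]
train, valid = data[:400], data[400:]

w, b = 0.0, 0.0
eta, batch_size, patience = 0.1, 32, 3
best_loss, best_params, epochs_without_gain = float("inf"), (w, b), 0

for epoch in range(50):
    rng.shuffle(train)                                  # shuffle the training set
    for start in range(0, len(train), batch_size):      # one minibatch at a time
        batch = train[start:start + batch_size]
        residuals = [(sigmoid(w * x + b) - y, x) for x, y in batch]  # prediction - label
        grad_w = sum(r * x for r, x in residuals) / len(batch)
        grad_b = sum(r for r, _ in residuals) / len(batch)
        w -= eta * grad_w                               # update weights
        b -= eta * grad_b
    valid_loss = mean_loss(w, b, valid)                 # evaluate on validation set
    if valid_loss < best_loss:                          # save best model
        best_loss, best_params, epochs_without_gain = valid_loss, (w, b), 0
    else:
        epochs_without_gain += 1
        if epochs_without_gain >= patience:             # stop early
            break

print(f"stopped after epoch {epoch + 1}, best validation loss {best_loss:.3f}")
```

The validation set never touches the weight updates; it is used only to decide which parameters to keep and when to stop.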
---

## The data hunger

- **Classical ML is sample-efficient** because we encoded a lot of structure in the model (linearity, sparsity, tree splits)
- **Deep learning is sample-hungry** because it removes that structure and learns it from data

| Data scale | Default choice |
|------------|----------------|
| Thousands of rows | Classical ML |
| Millions of rows | Deep learning starts to win |
| Perceptual structure (images, audio, text) | Deep learning |

---

## Architectures with inductive bias

Plain feedforward networks (MLPs) work, but rarely win in practice. The wins come from architectures that **bake in problem structure**:

- **CNNs** (convolutional networks): translation invariance for **images**
- **RNNs** / **LSTMs**: sequential structure (older; mostly replaced now)
- **Transformers**: attention-based -- for sequences, increasingly for everything

The architecture itself is a prior. **The right one shrinks the data hunger.**

---

## The 2026 landscape, in one slide

- **Foundation models** -- extremely large models trained on internet-scale data once, then **fine-tuned** or **prompted** for specific tasks
- **LLMs** are the most visible example, but the same recipe now runs vision, audio, robotics, code, and proteins
- The frontier in 2026 is **not training from scratch** -- it is **using foundation models well**

---

## Tabular data: classical still wins

For tabular data (rows = observations, columns = features, no spatial / sequential structure):

> **Gradient boosting is still state of the art.**

Many empirical comparisons confirm: GBM beats neural networks on most tabular tasks, often by a meaningful margin, with a **fraction of the compute**.

> The classical-ML toolkit is **not** "what people did before deep learning." It is the right tool for tabular problems, full stop.

---

## Small samples

Deep learning needs lots of data.
- A few hundred rows: **not enough**
- A few thousand rows: still not enough for from-scratch deep learning
- A linear model or a tuned gradient-boosting model will **outperform** a from-scratch neural network on small data -- and do so more reliably across seeds

The course datasets fit comfortably in the **classical regime**.

---

## Interpretability and trust

- **Linear models** → coefficient table
- **Tree ensembles** → permutation importance, SHAP
- **Neural networks** → a black box plus post-hoc tools that may or may not be faithful

For projects, regulated industries, or anything that needs to be **defended**: classical methods come with interpretability built in.

This is also why the oral exam can probe deeply into classical methods -- there is **something concrete to defend**.

---

## The practical boundary

> "If your data fits in a spreadsheet and your features are meaningful, start with gradient boosting. If your data is images, text, or sequences, consider neural networks -- and consider an existing pretrained model first."

This is a heuristic, not a law. Edge cases exist. Default to the heuristic; **deviate when you can defend the deviation.**

---

## The setup

- **Dataset:** `load_breast_cancer()` -- same as L10, L11, L12. Three lectures of comparison numbers already on the board.
- **Protocol:** stratified 80/20, stratified 5-fold CV, `f1` metric.
- **Today's contender:** scikit-learn's `MLPClassifier` -- a small multilayer perceptron.
```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# 30 numeric features, binary target
X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Stratified 80/20 split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
```

---

## A small MLP in scikit-learn

```python
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

mlp = make_pipeline(
    StandardScaler(),  # fitted inside each CV fold, so no leakage
    MLPClassifier(
        hidden_layer_sizes=(32, 16),
        activation='relu',
        max_iter=300,
        early_stopping=True,  # holds out part of the training data to decide when to stop
        random_state=42,
    ),
)

# For a classifier, cv=5 uses stratified folds
scores = cross_val_score(mlp, X_train, y_train, cv=5, scoring='f1')
print(f"MLP CV F1: {scores.mean():.3f} ± {scores.std():.3f}")
```

Two hidden layers, 32 then 16 ReLU units. `StandardScaler` is mandatory: MLP training is sensitive to feature scale.

---

## The four-family comparison

| Family | Test F1 | Trained in | Defensible? |
|--------|---------|------------|-------------|
| Random forest (L10) | ~0.97 | seconds | yes |
| RBF SVM (L11) | ~0.98 | seconds | yes |
| HistGBM (L12) | ~0.98 | seconds | yes |
| MLP (today) | similar | seconds to minutes | **less obviously** |

The MLP is **competitive**. It is also:

- Harder to defend at the oral
- Slower to tune
- More sensitive to seed

---

## What this experiment shows

- A neural network can **match** classical methods on small tabular data -- but does not **exceed** them
- The marginal cost (tuning, interpretability, seed sensitivity) is real

> The project advice **does not change**: for tabular problems, gradient boosting is the right default. Use the time you would spend tuning an MLP on tightening your validation and report.
---

## Where neural networks would actually win on this dataset

**They wouldn't.**

If the breast-cancer task were "predict the diagnosis from the **raw biopsy image**" rather than from 30 hand-engineered numerical features, a CNN trained on a large image dataset would beat any tabular model.

> The win comes from the **data type and scale**, not from the algorithm in isolation.

---

## Summary

- A single neuron **is** logistic regression renamed. A network is many of them, stacked.
- Training = **loss + gradient descent + backprop**, automated by the framework.
- Deep learning **wins** with abundant data and perceptual structure (images, text, audio).
- Classical ML **wins** on tabular data, small samples, and anything that has to be *defended*.
- This was the **last new method** in the course. The toolbox is now closed.

---

## Before Lecture 15

- L15 is the **rehearsal project workshop** -- bring your project to present in front of the class
- This week: **finalise your model selection**. Pick the workflow you'll defend, write the selection rule, and run the test-set evaluation **once**
- Want feedback before the rehearsal? Office hours or email me.

---

## Questions

Open the floor.

--

## Backup slides

Use if questions arise or time allows.

--

## A taste of PyTorch

```python
import torch
import torch.nn as nn


class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(30, 32)  # 30 input features -> 32 hidden units
        self.fc2 = nn.Linear(32, 1)   # 32 hidden units -> 1 output logit

    def forward(self, x):
        h = torch.relu(self.fc1(x))
        return self.fc2(h)  # raw logit; the loss applies the sigmoid


model = Net()
loss_fn = nn.BCEWithLogitsLoss()  # sigmoid + binary cross-entropy in one step
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)
```

A few lines define the architecture; everything else you would write is the training loop. **Pedagogical, not project advice.**

--

## Convolutional networks in one slide

**Convolution** = a small filter slides across the input, computing weighted sums at every position.
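A one-dimensional version makes the sliding-filter idea concrete. This is a minimal sketch in plain Python, assuming no padding and a stride of 1; `convolve_1d`, the two-tap `edge_filter`, and the example signals are illustrative names and values, not taken from any library.

```python
def convolve_1d(signal, kernel):
    """Slide `kernel` across `signal`, taking a weighted sum at each position.

    No padding and stride 1, so the output has len(signal) - len(kernel) + 1
    entries. As in deep-learning libraries, the kernel is not flipped
    (strictly speaking this is cross-correlation).
    """
    k = len(kernel)
    return [
        sum(kernel[j] * signal[i + j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


# The same upward step placed at two different positions (illustrative values).
edge_filter = [-1.0, 1.0]  # responds where the signal jumps upward
early_step = [0, 0, 1, 1, 1, 1, 1, 1]
late_step = [0, 0, 0, 0, 0, 1, 1, 1]

print(convolve_1d(early_step, edge_filter))
print(convolve_1d(late_step, edge_filter))
```

The filter has only two weights, yet it fires on the step wherever the step occurs; the response simply moves with the pattern.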
Why it works for images:

- **Translation invariance** -- a cat is a cat whether it's in the top-left or bottom-right
- **Local connectivity** -- nearby pixels are correlated; distant ones aren't
- **Parameter sharing** -- the same filter applies everywhere → far fewer weights than a fully-connected layer

CNNs dominate vision because they encode these priors **architecturally**, not via data alone.

--

## Transformers in one slide

The core operation: **attention** -- every token looks at every other token, weighted by similarity.

```text
attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V
```

- $Q, K, V$ are learned projections of the input
- The softmax over similarities tells each token **what to pay attention to**

Why it works:

- No fixed receptive field -- captures long-range dependencies
- Parallelisable (unlike RNNs)
- Scales beautifully with data + compute

Now used for sequences (text, audio, code, proteins) and increasingly for images.

--

## Regularisation in deep learning

The L6 dial in deep-learning clothing:

| Technique | What it does |
|-----------|--------------|
| **Dropout** | Randomly zero out neurons during training |
| **Weight decay** | L2 penalty on the weights (= ridge) |
| **Batch normalisation** | Normalise activations per-batch within the network |
| **Early stopping** | Stop training when validation loss stops improving |
| **Data augmentation** | Add transformed copies of training examples (rotations, crops, noise) |

All address the same bias-variance tradeoff -- different mechanisms.

--

## Fine-tuning vs from-scratch

When foundation models help:

```text
Do you have ≥ millions of labelled examples in your task's exact domain?
    Yes → train from scratch (rare in practice)
    No  → fine-tune a foundation model or use it via embeddings + classical model
```

For most real-world tasks in 2026, the answer is **fine-tune**. Foundation models are pre-trained on far more data than you'll ever collect for your specific task.
The deep-learning course will teach you how to do this well.

---

## What's next

**Lecture 15:** Project workshop

- Present your project to peers
- Give and receive feedback
- Final preparation before submission