Summing Up and Q&A

## MST0052 -- Lecture 16

### Summing Up and Q&A

Fall 2026

---

## Where we are

| Lectures | Phase |
|----------|-------|
| 1--3 | Foundations |
| 4--7 | Core methods |
| 9--14 | Going further |
| **15--16** | **Wrapping up -- you are here** |

L15 was yesterday's rehearsal workshop. Submission deadline is next week. Oral exams in December.

---

## Today's plan

- A **quick recap** of the course as a single arc -- not a list of separate methods
- The **modelling workflow** that ran through every lecture
- **Oral exam:** what is asked, what a good answer looks like
- **Submission-week reminders**, then questions

---

## The 16-lecture arc

![Semester timeline with four phases](/figures/semester-timeline.svg)

- **Foundations** (L1--3): course intro, AI tools, preprocessing
- **Core methods** (L4--7): linear, classification, bias-variance, model selection
- **Going further** (L9--14): PCA, ensembles, SVM, boosting, clustering, neural nets
- **Wrap-up** (L15--16): rehearsal, summing up

---

## What you can now do

The three L1 promises, delivered:

- **Frame** a real problem as a predictive modelling task -- choose a target, choose a metric, defend both
- **Build** a reproducible pipeline from raw data to evaluated predictions -- preprocessing, splits, CV, honest test
- **Compare** model families under a single protocol and explain the tradeoffs

These are also the three things the report and the oral defence are graded on.

---

## Method comparison, in one table

| Family | Type | Strength | Weakness | Lecture |
|--------|------|----------|----------|---------|
| Linear (ridge, lasso) | Regression | Interpretable, defensible | Misses nonlinearity | L4 |
| Logistic regression | Classification | Probabilistic, regularised | Linear boundary | L5 |
| k-NN | Both | Simple, non-parametric | Distance-driven; slow at scale | L5 |
| Naive Bayes | Classification | Cheap, strong on text | Independence assumption | L5 |
| Random forest | Both | Robust, no scaling | Less interpretable | L10 |
| SVM (linear / RBF) | Classification | Principled boundary | Slow on large data | L11 |
| Gradient boosting | Both | Best on tabular when tuned | More tuning, less interpretable | L12 |
| PCA | Unsupervised | Compress correlated features | Variance ≠ signal | L9 |
| Clustering (k-means) | Unsupervised | Cheap segmentation | No ground truth | L13 |
| Neural networks | Both | State of the art on perceptual data | Not for tabular projects | L14 |

---

## The selection rule, not the model

The course's actual content is not "here are 10 methods."

> It is **"here is how to pick between methods honestly."**

A reproducible **selection rule** beats a "best model" claim every time.

Whatever family you ended up with in your project, the **defensibility** comes from how you picked it -- not from the family itself.

---

## The pipeline pattern, end to end

The throughline of every lecture from L3 onwards:

1. **Split** off the test set, once
2. **Pipeline** = preprocessing + model, kept together
3. **CV** the workflow under an honest splitter (stratified / time-series / grouped)
4. **Tune** with `GridSearchCV` -- read `cv_results_`, not just `best_params_`
5. **Evaluate** on the held-out test set, **once**
6. **Report** the CV mean ± std, plus the test number, plus what you tried

Six steps, same pattern, every model family.
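
A minimal sketch of the six steps in scikit-learn, assuming a binary classification task. The synthetic dataset, the logistic-regression model, the `C` grid and the balanced-accuracy metric are illustrative assumptions, not course choices; swap in your own data, family and metric.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a project dataset (an assumption, not course data).
X, y = make_classification(n_samples=600, n_features=12, n_informative=5,
                           weights=[0.7, 0.3], random_state=0)

# 1. Split off the test set, once.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# 2. Pipeline: preprocessing and model travel together.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# 3. An honest splitter for this data: stratified folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# 4. Tune, then read the whole grid, not just the winner.
search = GridSearchCV(pipe, {"clf__C": [0.01, 0.1, 1.0, 10.0]},
                      scoring="balanced_accuracy", cv=cv)
search.fit(X_train, y_train)
res = search.cv_results_
for params, mean, std in zip(res["params"], res["mean_test_score"],
                             res["std_test_score"]):
    print(f"C={params['clf__C']}: CV {mean:.3f} ± {std:.3f}")

# 5. Evaluate on the held-out test set, once.
test_score = balanced_accuracy_score(y_test, search.predict(X_test))

# 6. Report CV mean ± std for the chosen setting, plus the test number.
cv_mean = res["mean_test_score"][search.best_index_]
cv_std = res["std_test_score"][search.best_index_]
print(f"Chosen: {search.best_params_}")
print(f"CV balanced accuracy:   {cv_mean:.3f} ± {cv_std:.3f}")
print(f"Test balanced accuracy: {test_score:.3f}")
```

Only the splitter and the estimator inside the pipeline should need to change between model families; the rest of the skeleton stays the same.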

---

## The common pitfalls

| Pitfall | Where it usually shows up |
|---------|---------------------------|
| **Leakage** in preprocessing | Scaler / imputer / PCA fitted outside the pipeline |
| **Wrong metric** | Accuracy on imbalanced classes; R² reported without an error in the target's units |
| **No real baseline** | Jumping to complex models with no linear or majority comparison |
| **Test set used for tuning** | "I tried this and the test score went up" |
| **Reporting `best_score_` as the headline** | `best_score_` is an optimistic CV estimate; the headline should be the **test** score |
| **Naming clusters without checking** | Stories told *after* a clustering result |

A clean project audits itself against this list once before submission.
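
To make the leakage row concrete, here is a hedged sketch on pure-noise data. It swaps in supervised feature selection (`SelectKBest`) as the leaking step, because its effect is easy to see; the same inside-versus-outside pattern applies to scalers, imputers and PCA, where the leak is usually smaller. The data sizes and `k=20` are arbitrary assumptions.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2000))   # pure-noise features
y = rng.integers(0, 2, size=100)   # labels drawn independently of X
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Leaky: the selector sees every label, including those of future
# validation folds, before cross-validation starts.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky = cross_val_score(LogisticRegression(max_iter=1000),
                        X_leaky, y, cv=cv).mean()

# Honest: the selector sits inside the pipeline, so it is refit on the
# training part of each fold only.
honest_pipe = make_pipeline(SelectKBest(f_classif, k=20),
                            LogisticRegression(max_iter=1000))
honest = cross_val_score(honest_pipe, X, y, cv=cv).mean()

print(f"Leaky CV accuracy:  {leaky:.3f}")
print(f"Honest CV accuracy: {honest:.3f}")
```

There is no signal in this data, so any score well above chance in the leaky variant comes from the selection step having seen the validation labels.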

---

## What an honest report looks like

Four lines:

1. States the **problem** and the **metric**, with a sentence on why
2. Reports a **baseline** *and* at least one stronger family, compared under the same CV
3. Includes **CV mean ± std** *and* the **test score**, reported once
4. Acknowledges **limitations** -- leakage risks, missing data, things you didn't try and why

> Quality beats length. The examiner is looking for honest reasoning, not page count.
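
Lines 2 and 3 can be sketched as one loop: every candidate, including the baseline, goes through the same splitter and the same metric. The synthetic imbalanced dataset, the three candidates and the balanced-accuracy metric are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data (assumption); the test set is split off first
# and is not touched by this comparison.
X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           weights=[0.8, 0.2], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
candidates = {
    "majority baseline": DummyClassifier(strategy="most_frequent"),
    "logistic regression": make_pipeline(StandardScaler(),
                                         LogisticRegression(max_iter=1000)),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=1),
}

# Same folds, same metric, for every candidate.
summary = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_train, y_train, cv=cv,
                             scoring="balanced_accuracy")
    summary[name] = (scores.mean(), scores.std())
    print(f"{name:<20} CV balanced accuracy {scores.mean():.3f} ± {scores.std():.3f}")
```

Balanced accuracy is used here because plain accuracy would flatter the majority baseline on imbalanced classes. The test score is then computed once, for the family you select.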

---

## What the oral exam tests

**Closed book.** No notes, no laptop, no AI tools, no printed report. You bring yourself.

Two parts, in this order:

1. **Project defence** -- explain every choice in your report
2. **General syllabus questions** -- conceptual, not deep

> The oral is where AI-assisted learning gets **verified.** If you cannot explain it without help, you cannot defend it.

---

## Part A: project defence -- what to expect

The examiner has **read your report**. Expect questions like:

- "Why **this dataset**, this problem, this metric?"
- "Why did you choose method X **over** method Y?"
- "Where in your pipeline does **leakage risk** live, and how did you handle it?"
- "What does this number in Table 3 mean? How was it computed?"
- "What's the **biggest weakness** of your analysis?"
- "If you had **more time**, what would you do next?"

The pattern: explain a choice → justify the tradeoff → acknowledge the limit.

---

## Part B: general syllabus questions

A smaller part of the exam. **Conceptual, not deep.** Examples:

- "What does **regularisation** do, and why might we want it?"
- "What is the **bias-variance tradeoff**, in two sentences?"
- "Why is **cross-validation** more honest than a single train/validation split?"
- "When would you use **unsupervised learning** rather than supervised?"
- "What changes when you move from a **random forest** to **gradient boosting**?"

You do **not** need to memorise formulas or derive anything on the board.

---

## What makes a good answer

A four-part structure:

1. **Explain the concept in your own words.** (Not the textbook's.)
2. **Connect it to your project.** Give a concrete example from your own pipeline.
3. **Discuss the tradeoff.** Nothing is always best -- say what is gained and given up.
4. **Be honest about what you don't know.** A precise "I'd need to look this up" beats a confident wrong answer.

> The exam rewards **understanding**, not memorisation.

---

## What an honest "I don't know" sounds like

**Passes the oral:**

- "I can think of the shape of the answer -- it has to do with X -- but I don't remember the specific name. Want me to reason through it?"
- "We didn't use that in our project; the closest thing we did was Y, and the difference would be Z."
- "I'm not certain. Let me think about it for a moment."

**Doesn't pass:**

- Making up plausible-sounding terms
- Repeating the question back as the answer
- Defending a choice you can't explain by saying *"the AI suggested it"*

---

## Final reminders for submission

A short checklist for the next ~10 days:

- **Project report** as a single PDF (figures embedded)
- **Code** in a runnable form -- notebook or script -- with `requirements.txt` and a README
- **AI usage statement** in the report -- brief, honest, not bureaucratic (L2)
- **Reproducibility check:** can a classmate (or the examiner) run your code from scratch and reproduce the headline numbers?
- **Deadline:** [DATE -- TBA] at [TIME -- TBA]
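
For the reproducibility check, one possible helper, offered as a sketch rather than a course requirement. The names `SEED`, `set_seeds` and `pinned_requirements` are hypothetical, and the package list is an assumption about what a typical project imports.

```python
import random
from importlib import metadata

import numpy as np

SEED = 0  # hypothetical value; any fixed integer works


def set_seeds(seed: int) -> None:
    """Fix the global Python and NumPy random number generators."""
    random.seed(seed)
    np.random.seed(seed)


def pinned_requirements(packages):
    """Return 'name==version' lines for the installed packages."""
    lines = []
    for name in packages:
        try:
            lines.append(f"{name}=={metadata.version(name)}")
        except metadata.PackageNotFoundError:
            lines.append(f"# {name}: not installed in this environment")
    return lines


set_seeds(SEED)
# Paste the printed lines into requirements.txt.
print("\n".join(pinned_requirements(["numpy", "scikit-learn"])))
```

Global seeds are not enough on their own: pass `random_state` explicitly to scikit-learn splitters and estimators as well, so that a rerun reproduces the headline numbers.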

---

## Oral exam logistics

- **When:** [WEEK -- TBA] in December 2026
- **How to book:** [PROCESS -- TBA, link on course site]
- **What to bring:** yourself. Nothing else.
- **What I bring:** your report. I will have read it.

A typical exam runs ~20 minutes -- about two-thirds on your project, one-third on general questions.

---

## After the course

If you want to keep going:

- **Deep learning** -- there is a follow-up course on this in the programme. You now have the **prerequisites** to make good use of it.
- **Causal inference** -- predictive modelling and causal inference are different problems with different rules. Worth a separate course if your work needs causal answers.
- **Reading list** -- the course resources page has further references organised by area.

---

## Thank you

It has been a genuinely good semester. Good luck with your projects and the exam.

Questions?

## Backup slides

Use if questions arise or time allows.

## Grade descriptors at a glance

| Grade | What it looks like |
|-------|--------------------|
| **A** | Ambitious, rigorous, well defended. Independent thinking. |
| **B** | Solid work with at least one aspect beyond the basics. |
| **C** | Standard methods, correctly applied. Competent but not deep. |
| **D** | Works end-to-end but noticeable gaps in reasoning or validation. |
| **E** | Minimal viable submission. Significant methodology flaws. |
| **F** | Cannot defend the work or missing major components. |

Useful for "what would push my project from a C to a B?" questions.

## Sample "good answer" walkthrough

**Question:** "What is the bias-variance tradeoff?"

**Concept:** For squared-error loss, expected out-of-sample error decomposes into squared bias, variance, and irreducible noise. More flexible models tend to have lower bias but higher variance, which produces the U-shaped test-error curve as flexibility increases.

**Project example:** In my random forest, I tuned `max_depth` by cross-validation. Shallow trees underfit (high bias); deep trees overfit (high variance). The depth with the best CV score sat at the bottom of that U-shaped curve.

**Tradeoff:** I could have used a single tree -- simpler, more interpretable -- but it would have had higher variance on this dataset. The ensemble traded a bit of interpretability for more stable predictions.

**Honest limit:** I didn't tune `min_samples_leaf` separately; on a single tree it might have reduced variance enough that I wouldn't have needed a forest. That's something I'd explore with more time.
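
The project example in this walkthrough could be backed by a curve like the one sketched below. The synthetic data, the depth grid and the forest size are assumptions; on real data, and for forests in particular, the U-shape may be shallow or absent, so read the numbers rather than expecting a textbook picture.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, validation_curve

# Synthetic, noisy training data (assumption); the test set stays untouched.
X_train, y_train = make_classification(n_samples=400, n_features=20,
                                       n_informative=5, flip_y=0.1,
                                       random_state=0)

depths = [1, 2, 4, 8, 16]
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# One row per depth, one column per CV fold.
train_scores, cv_scores = validation_curve(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X_train, y_train,
    param_name="max_depth", param_range=depths,
    cv=cv, scoring="accuracy")

train_mean = train_scores.mean(axis=1)
cv_mean = cv_scores.mean(axis=1)
cv_std = cv_scores.std(axis=1)

for depth, tr, m, s in zip(depths, train_mean, cv_mean, cv_std):
    print(f"max_depth={depth:>2}  train {tr:.3f}  CV {m:.3f} ± {s:.3f}")
```

A widening gap between the train and CV columns as depth grows is the variance side of the tradeoff; low scores in both columns at small depths are the bias side.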

## Further reading by topic

| Topic | Where to go next |
|-------|------------------|
| **Statistical foundations** | *An Introduction to Statistical Learning* (ISL, free online); *The Elements of Statistical Learning* (ESL) for the rigorous version |
| **Tabular ML in practice** | Géron's *Hands-On Machine Learning* |
| **Boosting deep dive** | The XGBoost paper (Chen & Guestrin 2016); the LightGBM paper (Ke et al. 2017) |
| **Deep learning** | Goodfellow et al., *Deep Learning*; the follow-up course in the programme |
| **Causal inference** | Pearl, Glymour & Jewell, *Causal Inference in Statistics: A Primer*; Angrist & Pischke, *Mostly Harmless Econometrics* |
| **Interpretability** | Molnar, *Interpretable Machine Learning* (free online) |

All available either free online or in the BI library.