MST0052
## MST0052 -- Lecture 16

### Summing Up and Q&A

Fall 2026

---

## Where we are

| Lectures | Phase |
|----------|-------|
| 1--3 | Foundations |
| 4--7 | Core methods |
| 9--14 | Going further |
| **15--16** | **Wrapping up -- you are here** |

L15 was yesterday's rehearsal workshop. Submission deadline is next week. Oral exams in December.

---

## Today's plan

- A **quick recap** of the course as a single arc -- not 14 methods
- The **modelling workflow** that ran through every lecture
- **Oral exam:** what is asked, what a good answer looks like
- **Submission-week reminders**, then questions

---

## The 16-lecture arc

- **Foundations** (L1-3): course intro, AI tools, preprocessing
- **Core methods** (L4-7): linear, classification, bias-variance, model selection
- **Going further** (L9-14): PCA, ensembles, SVM, boosting, clustering, neural nets
- **Wrap-up** (L15-16): rehearsal, summing up

---

## What you can now do

The three L1 promises, delivered:

- **Frame** a real problem as a predictive modelling task -- choose a target, choose a metric, defend both
- **Build** a reproducible pipeline from raw data to evaluated predictions -- pipelines, splits, CV, honest test
- **Compare** model families under a single protocol and explain the tradeoffs

These are also the three things the report and the oral defence are graded on.

---

## Method comparison, in one table

| Family | Type | Strength | Weakness | Lecture |
|--------|------|----------|----------|---------|
| Linear (ridge, lasso) | Regression | Interpretable, defensible | Misses nonlinearity | L4 |
| Logistic regression | Classification | Probabilistic, regularised | Linear boundary | L5 |
| k-NN | Both | Simple, non-parametric | Distance-driven; slow at scale | L5 |
| Naive Bayes | Classification | Cheap, strong on text | Independence assumption | L5 |
| Random forest | Both | Robust, no scaling | Less interpretable | L10 |
| SVM (linear / RBF) | Classification | Principled boundary | Slow on large data | L11 |
| Gradient boosting | Both | Best on tabular when tuned | More tuning, less interpretable | L12 |
| PCA | Unsupervised | Compress correlated features | Variance ≠ signal | L9 |
| Clustering (k-means) | Unsupervised | Cheap segmentation | No ground truth | L13 |
| Neural networks | Both | State of the art on perceptual data | Not for tabular projects | L14 |

---

## The selection rule, not the model

The course's actual content is not "here are 10 methods."

> It is **"here is how to pick between methods honestly."**

A reproducible **selection rule** beats a "best model" claim every time.

Whatever family you ended up with in your project, the **defensibility** comes from how you picked it -- not from the family itself.

---

## The pipeline pattern, end to end

The throughline of every lecture from L3 onwards:

1. **Split** off the test set, once
2. **Pipeline** = preprocessing + model, kept together
3. **CV** the workflow under an honest splitter (stratified / time-series / grouped)
4. **Tune** with `GridSearchCV` -- read `cv_results_`, not just `.best_params_`
5. **Evaluate** on the held-out test set, **once**
6. **Report** mean ± std, plus the test number, plus what you tried

Six steps, same pattern, every model family.

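The six steps above can be sketched in scikit-learn roughly as follows. This is a minimal illustration, not the course's reference solution: the synthetic dataset, the logistic-regression model, the `C` grid, and the 5-fold stratified splitter are all assumptions chosen to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a project dataset (assumption: binary classification).
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=5, random_state=0
)

# 1. Split off the test set, once.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# 2. Pipeline = preprocessing + model, kept together, so the scaler is
#    refitted on the training part of every CV fold.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# 3. CV under an honest splitter (stratified, since this is classification).
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# 4. Tune with GridSearchCV and read cv_results_, not just best_params_.
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}  # illustrative grid
grid = GridSearchCV(pipe, param_grid, cv=cv, scoring="accuracy")
grid.fit(X_train, y_train)

results = grid.cv_results_
for params, mean, std in zip(
    results["params"], results["mean_test_score"], results["std_test_score"]
):
    print(f"{params}: {mean:.3f} +/- {std:.3f}")

# 5. Evaluate on the held-out test set, once.
test_score = grid.score(X_test, y_test)

# 6. Report CV mean +/- std, the test number, and what was tried.
best = grid.best_index_
cv_mean = results["mean_test_score"][best]
cv_std = results["std_test_score"][best]
print(
    f"CV: {cv_mean:.3f} +/- {cv_std:.3f} | test: {test_score:.3f} | "
    f"settings tried: {len(results['params'])}"
)
```

For time-ordered or grouped data, the same sketch would swap `StratifiedKFold` for `TimeSeriesSplit` or `GroupKFold`; the other five steps stay as they are.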
---

## The common pitfalls

| Pitfall | Where it usually shows up |
|---------|---------------------------|
| **Leakage** in preprocessing | Scaler / imputer / PCA fitted outside the pipeline |
| **Wrong metric** | Accuracy on imbalanced classes; R² with no error scale |
| **No real baseline** | Jumping to complex models with no linear or majority comparison |
| **Test set used for tuning** | "I tried this and the test score went up" |
| **Reporting `best_score_` as the headline** | CV optimism; should be the **test** score |
| **Naming clusters without checking** | Stories told *after* a clustering result |

A clean project audits itself against this list once before submission.

---

## What an honest report looks like

Four lines:

1. States the **problem** and the **metric**, with a sentence on why
2. Reports a **baseline** *and* at least one stronger family, compared under the same CV
3. Includes **CV mean ± std** *and* the **test score**, reported once
4. Acknowledges **limitations** -- leakage risks, missing data, things you didn't try and why

> Quality beats length. The examiner is looking for honest reasoning, not page count.

---

## What the oral exam tests

**Closed book.** No notes, no laptop, no AI tools, no printed report. You bring yourself.

Two parts, in this order:

1. **Project defence** -- explain every choice in your report
2. **General syllabus questions** -- conceptual, not deep

> The oral is where AI-assisted learning gets **verified.** If you cannot explain it without help, you cannot defend it.

---

## Part A: project defence -- what to expect

The examiner has **read your report**. Expect questions like:

- "Why **this dataset**, this problem, this metric?"
- "Why did you choose method X **over** method Y?"
- "Where in your pipeline does **leakage risk** live, and how did you handle it?"
- "What does this number in Table 3 mean? How was it computed?"
- "What's the **biggest weakness** of your analysis?"
- "If you had **more time**, what would you do next?"

The pattern: explain a choice → justify the tradeoff → acknowledge the limit.

---

## Part B: general syllabus questions

A smaller part of the exam. **Conceptual, not deep.** Examples:

- "What does **regularisation** do, and why might we want it?"
- "What is the **bias-variance tradeoff**, in two sentences?"
- "Why is **cross-validation** more honest than a single train/validation split?"
- "When would you use **unsupervised learning** rather than supervised?"
- "What changes when you move from a **random forest** to **gradient boosting**?"

You do **not** need to memorise formulas or derive anything on the board.

---

## What makes a good answer

A four-part structure:

1. **Explain the concept in your own words.** (Not the textbook's.)
2. **Connect it to your project.** Give a concrete example from your own pipeline.
3. **Discuss the tradeoff.** Nothing is always best -- say what is gained and given up.
4. **Be honest about what you don't know.** A precise "I'd need to look this up" beats a confident wrong answer.

> The exam rewards **understanding**, not memorisation.

---

## What an honest "I don't know" sounds like

**Passes the oral:**

- "I can think of the shape of the answer -- it has to do with X -- but I don't remember the specific name. Want me to reason through it?"
- "We didn't use that in our project; the closest thing we did was Y, and the difference would be Z."
- "I'm not certain. Let me think about it for a moment."


**Doesn't pass:**

- Making up plausible-sounding terms
- Repeating the question back as the answer
- Defending a choice you can't explain by saying *"the AI suggested it"*

---

## Final reminders for submission

A short checklist for the next ~10 days:

- **Project report** as a single PDF (figures embedded)
- **Code** in a runnable form -- notebook or script -- with `requirements.txt` and a README
- **AI usage statement** in the report -- brief, honest, not bureaucratic (L2)
- **Reproducibility check:** can a classmate (or the examiner) run your code from scratch and reproduce the headline numbers?
- **Deadline:** [DATE -- TBA] at [TIME -- TBA]

---

## Oral exam logistics

- **When:** [WEEK -- TBA] in December 2026
- **How to book:** [PROCESS -- TBA, link on course site]
- **What to bring:** yourself. Nothing else.
- **What I bring:** your report. I will have read it.

A typical exam runs ~20 minutes -- about two-thirds on your project, one-third on general questions.

---

## After the course

If you want to keep going:

- **Deep learning** -- there is a follow-up course on this in the programme. You now have the **prerequisites** to make good use of it.
- **Causal inference** -- predictive modelling and causal inference are different problems with different rules. Worth a separate course if your work needs causal answers.
- **Reading list** -- the course resources page has further references organised by area.

---

## Thank you

A genuine semester. Good luck with your projects and the exam.

Questions?

--

## Backup slides

Use if questions arise or time allows.

--

## Grade descriptors at a glance

| Grade | What it looks like |
|-------|--------------------|
| **A** | Ambitious, rigorous, well defended. Independent thinking. |
| **B** | Solid work with at least one aspect beyond the basics. |
| **C** | Standard methods, correctly applied. Competent but not deep. |
| **D** | Works end-to-end but noticeable gaps in reasoning or validation. |
| **E** | Minimal viable submission. Significant methodology flaws. |
| **F** | Cannot defend the work or missing major components. |

Useful for "what would push my project from a C to a B?" questions.

--

## Sample "good answer" walkthrough

**Question:** "What is the bias-variance tradeoff?"

**Concept:** Out-of-sample error decomposes into bias squared, variance, and irreducible noise. More flexible models have lower bias but higher variance. The U-shaped test error curve is the result.

**Project example:** In my random forest, I CVed `max_depth`. Shallow trees underfit (high bias); deep trees overfit (high variance). The CV-best depth was the U-shape's minimum.

**Tradeoff:** I could have used a single tree -- simpler, more interpretable -- but it would have had higher variance on this dataset. The ensemble traded a bit of interpretability for more stable predictions.

**Honest limit:** I didn't try `min_samples_leaf` separately; it might have reduced variance further without going to a forest. That's something I'd explore with more time.

--

## Further reading by topic

| Topic | Where to go next |
|-------|------------------|
| **Statistical foundations** | ISL (free); ESL for the rigorous version |
| **Tabular ML in practice** | Géron's *Hands-On Machine Learning* |
| **Boosting deep dive** | The XGBoost paper (Chen & Guestrin 2016); LightGBM paper |
| **Deep learning** | Goodfellow et al.; the follow-up course in the programme |
| **Causal inference** | Pearl, *Causal Inference in Statistics: A Primer*; Mostly Harmless Econometrics |
| **Interpretability** | Molnar, *Interpretable Machine Learning* (free online) |

All available either free online or in the BI library.
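
--

## Bias-variance decomposition, written out

The "Concept" line in the sample walkthrough can be written as a formula. This is a sketch under standard assumptions that the slides do not state: squared-error loss, data generated as y = f(x) + ε with zero-mean noise of variance σ², and the expectation taken over training sets and noise at a fixed input x. As Part B says, it is here for reference, not for memorising.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathrm{Var}\big[\hat{f}(x)\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

More flexible models shrink the first term and inflate the second, which is what produces the U-shaped test error curve.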