AI Tools in a Machine-Learning Workflow

## MST0052 — Lecture 2

### AI Tools in a Machine-Learning Workflow

Fall 2026

---

## Where we are

Lecture 1 set up the course. Today is about **how** you'll work for the rest of the semester.

| Lectures | Phase |
|----------|-------|
| **1–3** | **Foundations — you are here** |
| 4–7 | Core methods |
| 8–13 | Going further |
| 14–16 | Wrapping up |

---

## Today's plan

- Where AI tools genuinely help in an ML workflow
- Where they **fail silently**
- **Live demo:** Claude Code on a real ML task
- **Live demo:** scikit-learn pipeline from scratch
- How to document AI use in your project

---

## The state of AI tools in 2026

- AI coding assistants are part of the default toolchain
- Pretending otherwise is dishonest
- This course's stance: **allowed, encouraged**, with one hard constraint

> Everything you submit, you must be able to defend at the oral exam.

---

## Why naive use is dangerous in ML

- AI is good at **code**
- ML is partly code and partly **statistical reasoning**
- The dangerous failures live in the reasoning layer, not the syntax layer

That's why this course spends a whole lecture on it.

---

## What I'll ask you to do differently

- Use AI for what it's good at — don't outsource thinking
- **Verify** everything you can't explain
- **Document** what you used and how

---

## Mental model: junior collaborator

Treat the AI like a **fast, well-read junior** who has never seen your data.

![AI-human workflow loop](../figures/ai-workflow-loop.svg)

---

## Where AI tools help

- **Boilerplate and scaffolding** — project skeletons, `requirements.txt`, pipeline scaffolds
- **Debugging** — paste a traceback + code, get diagnostic suggestions
- **Explaining unfamiliar code** — "What does `StratifiedKFold` do?"
- **Drafting prose** — rough drafts, tightening paragraphs
- **Learning new methods** — go beyond the syllabus, then verify

---

## Explaining unfamiliar code

Good questions to ask an AI:

- "Explain what `StratifiedKFold` does and when I'd use it."
- "What does `class_weight='balanced'` change in logistic regression?"
- "What is the difference between `fit_transform` and `transform`?"

A faster route to documentation — **when you verify against the real docs.**

---

## Learning beyond the syllabus

- Want to try gradient boosting with monotonic constraints? Ask, read, try it.
- AI lets you go further than you could on your own
- **But:** the oral exam will still ask what it does and why you used it

---

## The category that matters most

![Silent failures iceberg](../figures/silent-failures-iceberg.svg)

---

## Failure 1: statistical reasoning

AI will happily write code that:

- Picks the "best" model without correcting for multiple comparisons
- Reports test-set accuracy after using the test set to pick a hyperparameter
- Uses RMSE on highly skewed targets without comment

Each piece of code is **syntactically correct** and **statistically wrong.**
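
A minimal sketch of the second bullet done correctly, assuming synthetic stand-in data and a small grid over `C` (both are illustrative choices, not course requirements). The hyperparameter is chosen by cross-validation on the training set, and the test set is scored once at the end.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; swap in your own X, y).
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# The hyperparameter choice happens inside CV on the training set only.
search = GridSearchCV(
    pipe,
    param_grid={"logisticregression__C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

# The test set is scored exactly once, after the choice is frozen.
test_accuracy = search.score(X_test, y_test)
print(f"best C: {search.best_params_['logisticregression__C']}")
print(f"CV accuracy: {search.best_score_:.3f}, test accuracy: {test_accuracy:.3f}")
```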

---

## Failure 2: data-specific judgment

AI doesn't know your data.

- It assumes i.i.d. when your data is temporal
- It assumes the target is what you say it is
- It will not notice that `customer_lifetime_value` was computed using future information
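
To make the first bullet concrete, here is a small sketch on twelve hypothetical time-ordered rows. It compares a shuffled `KFold` with `TimeSeriesSplit`; the data and fold counts are assumptions chosen only to keep the output short.

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

# Twelve observations in time order (hypothetical; row index = time step).
X = np.arange(12).reshape(-1, 1)

# Shuffled K-fold treats rows as exchangeable, so a fold can train on rows
# that come after the rows it validates on.
shuffled = KFold(n_splits=3, shuffle=True, random_state=0)
leaky_folds = sum(
    train_idx.max() > val_idx.min() for train_idx, val_idx in shuffled.split(X)
)
print(f"shuffled folds that train on the future: {leaky_folds} of 3")

# TimeSeriesSplit only ever trains on the past.
for train_idx, val_idx in TimeSeriesSplit(n_splits=3).split(X):
    print(f"train up to t={train_idx.max()}, "
          f"validate on t={val_idx.min()}..{val_idx.max()}")
```

An assistant that has not been told the rows are ordered in time has no reason to pick the second splitter.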

---

## The cross-validation leakage pattern

**Wrong:**

```python
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)      # leakage!
scores = cross_val_score(model, X_scaled, y, cv=5)
```

**Right:**

```python
pipe = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(pipe, X, y, cv=5)
```

The scaler in the first version is fitted on the **entire dataset**, so its mean and variance include every validation fold.
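
One way to see the difference is to count how many rows each scaler was fitted on. The sketch below assumes synthetic data with 100 rows; `cross_validate(..., return_estimator=True)` is used here only to get at the fitted per-fold pipelines.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption): 100 rows, so each of 5 folds holds out 20.
X, y = make_classification(n_samples=100, n_features=5, random_state=0)

# Leaky version: one scaler fitted on all 100 rows before any splitting.
leaky_scaler = StandardScaler().fit(X)

# Pipeline version: cross_validate clones and refits the scaler for every fold.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
result = cross_validate(pipe, X, y, cv=5, return_estimator=True)
fold_rows = [
    int(est.named_steps["standardscaler"].n_samples_seen_)
    for est in result["estimator"]
]

print("rows seen by the leaky scaler:", int(leaky_scaler.n_samples_seen_))
print("rows seen by each fold's scaler:", fold_rows)
```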

---

## Failure 3: confidently wrong references

- Made-up function arguments
- Plausible-looking but non-existent sklearn methods
- Hallucinated paper citations

Less common in 2026 than in 2023, but not gone. **Always verify against real documentation.**

---

## Red flags you should never ignore

- The model accuracy that looks **too good**
- The CV score that matches the test score to three decimals
- The result that contradicts a sanity check
- The code you **can't explain** line-by-line
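
One cheap sanity check is a majority-class baseline. This is a sketch on synthetic imbalanced data (an assumption), using scikit-learn's `DummyClassifier`; a headline accuracy only means something relative to this number.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Imbalanced synthetic data (an assumption): roughly 90% of rows in class 0.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# A model that always predicts the most frequent class.
baseline = cross_val_score(DummyClassifier(strategy="most_frequent"), X, y, cv=5)
model = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)

print(f"majority-class baseline accuracy: {baseline.mean():.3f}")
print(f"logistic regression accuracy:     {model.mean():.3f}")
```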

---

## Why the terminal matters

- The terminal is where you **run code**, manage files, and control your environment
- Far less hidden state than a notebook: every command is explicit and reproducible
- Scripts you run in the terminal can be versioned, shared, and re-run
- Most professional ML work happens here, not in GUIs

If you're not comfortable in the terminal yet — this course will get you there.

---

## The terminal as an LLM interface

- Terminal-based AI tools (Claude Code, Codex CLI, Gemini CLI) give you **more control** than chat apps built on top of LLMs
- The agent can see your files, run your code, and iterate — all in your actual project
- No copy-pasting between a chat window and your editor
- You decide what the agent can read, write, and execute

> The closer the AI is to your actual workflow, the more useful it becomes.

---

## Demo time: Claude Code

**What it is:** an AI agent that runs in the terminal — reads your files, writes code, runs commands.

**What we'll do:**

1. Start a small ML project from scratch
2. Ask Claude Code to explore the data
3. Build a scikit-learn pipeline
4. Check whether it gets the validation right

---

## What just happened

- Claude Code was fast at **scaffolding**, **boilerplate**, and **running things**
- But you still needed to check:
  - Was the pipeline correct?
  - Did the validation strategy make sense?
  - Would you trust those numbers?

The tool accelerated the work. **The judgment was still yours.**

---

## The scikit-learn workflow

![scikit-learn pipeline flow](../figures/sklearn-pipeline-flow.svg)

---

## Step 1: Load and split

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("penguins.csv").dropna()
X = df.drop(columns=["species"])
y = df["species"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
```

- `stratify=y` — preserves class balance in both splits
- `random_state=42` — reproducibility
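
To check what `stratify=y` buys you, compare class shares before and after the split. The labels below are hypothetical stand-ins for `df["species"]`, so the sketch runs without `penguins.csv`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels standing in for df["species"].
y = pd.Series(["Adelie"] * 60 + ["Gentoo"] * 30 + ["Chinstrap"] * 10, name="species")
X = pd.DataFrame({"row_id": range(len(y))})

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Class shares in the full data, the training split, and the test split.
shares = pd.DataFrame({
    "all": y.value_counts(normalize=True),
    "train": y_train.value_counts(normalize=True),
    "test": y_test.value_counts(normalize=True),
})
print(shares)
```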

---

## Step 2: Build a pipeline

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression

num_cols = ["bill_length_mm", "bill_depth_mm",
            "flipper_length_mm", "body_mass_g"]
cat_cols = ["island", "sex"]

preprocess = make_column_transformer(
    (StandardScaler(), num_cols),
    (OneHotEncoder(handle_unknown="ignore"), cat_cols),  # tolerate categories unseen in a fold
)

pipe = make_pipeline(preprocess, LogisticRegression())
```

---

## Step 3: Fit, predict, evaluate

```python
from sklearn.metrics import classification_report

pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
print(classification_report(y_test, y_pred))
```

---

## Step 4: Cross-validation

```python
from sklearn.model_selection import cross_val_score

scores = cross_val_score(pipe, X_train, y_train,
                         cv=5, scoring="accuracy")
print(f"CV accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
```

- We cross-validate on the **training set**
- The test set stays untouched until the very end
- This is the correct pattern from the leakage slide

---

## What you just saw

- The five-step pattern: **load → split → pipeline → fit → evaluate**
- Preprocessing inside the pipeline prevents leakage
- Cross-validation on training data; test set held out until the end
- You will use this exact pattern in your project

---

## A weak prompt vs a strong prompt

**Weak:**

> "Help me classify this data."

**Strong:**

> "I have a binary churn target on 12k rows of tabular data with 30 mixed-type features. I want a baseline logistic regression with preprocessing inside a sklearn pipeline, evaluated with 5-fold stratified CV, reporting ROC-AUC ± std."

The strong prompt is the question you'd ask a thoughtful colleague.
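
For reference, this is roughly what the strong prompt pins down. It is a sketch, not the answer any particular tool will give: the data is synthetic and smaller than the 12k rows in the prompt, and every column name is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the churn table (an assumption; column names are hypothetical).
X_num, y = make_classification(n_samples=2000, n_features=5, random_state=0)
num_cols = [f"num_{i}" for i in range(5)]
rng = np.random.default_rng(0)
X = pd.DataFrame(X_num, columns=num_cols)
X["plan"] = rng.choice(["basic", "plus", "pro"], size=len(X))

# Preprocessing lives inside the pipeline, as the prompt asks.
preprocess = make_column_transformer(
    (StandardScaler(), num_cols),
    (OneHotEncoder(handle_unknown="ignore"), ["plan"]),
)
pipe = make_pipeline(preprocess, LogisticRegression(max_iter=1000))

# 5-fold stratified CV, reporting ROC-AUC mean and standard deviation.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
auc = cross_val_score(pipe, X, y, cv=cv, scoring="roc_auc")
print(f"ROC-AUC: {auc.mean():.3f} ± {auc.std():.3f}")
```

Every line maps back to a phrase in the prompt, which makes the output easy to review.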

---

## Verification habits

- **Run the code.** Don't just read it.
- **Check shapes, ranges, NaNs** after every step
- For statistical claims, ask: *"What would I see if this were wrong?"*
- Cross-check with the **official docs** — 5 seconds of verification saves a week of confusion
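
A small sketch of the second habit. The frame and the helper `check_frame` are hypothetical, and the rule that numeric columns must be non-negative is an assumption that fits physical measurements; adapt the checks to your own data.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two planted problems: a missing value and an impossible mass.
df = pd.DataFrame({
    "bill_length_mm": [39.1, 46.5, np.nan, 50.0],
    "body_mass_g": [3750.0, 4200.0, 3800.0, -1.0],
})

def check_frame(frame: pd.DataFrame) -> list[str]:
    """Return a list of human-readable problems found in the frame."""
    problems = []
    # Missing values, reported per column.
    nan_counts = frame.isna().sum()
    for col, n_missing in nan_counts[nan_counts > 0].items():
        problems.append(f"{col}: {n_missing} missing value(s)")
    # Range check (assumption: numeric columns here should never be negative).
    for col in frame.select_dtypes("number"):
        if (frame[col] < 0).any():
            problems.append(f"{col}: negative value(s)")
    return problems

print("shape:", df.shape)
for problem in check_frame(df):
    print("-", problem)
```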

---

## When to push back on the AI

- Ask: "What could be wrong with this analysis?"
- Ask: "What assumptions am I making?"
- Treat AI answers as **starting points for a debate**, not final answers

This is also a good way to *learn.*

---

## The main families in 2026

| Type | Examples |
|------|----------|
| **Editor-integrated** | GitHub Copilot, Cursor |
| **Terminal agents** | Claude Code, Codex CLI, Gemini CLI |
| **Chat-only** | ChatGPT, Claude, Gemini |

They overlap heavily. Differences shrink each release.

---

## Practical picks for this course

- **VS Code / JetBrains?** GitHub Copilot (free for students) or Cursor
- **Terminal user?** Claude Code, Codex CLI, or Gemini CLI
- **No strong preference?** Start with the free options and experiment

See the **AI page** on the course site for full detail.

---

## Access and cost

- **BI students:** free Gemini access via student email
- **GitHub Student Developer Pack:** Copilot Pro free for verified students
- Most other tools have free tiers; ~$20/month unlocks the full experience

---

## Setup checklist

- Pick **one or two tools** and stick with them
- Sign up for the **GitHub Student Developer Pack** if you haven't
- Make sure your AI integration works **before Lecture 3**

---

## The one rule

> Anything in your report must be something you can defend at the oral exam.

Everything else is a means to that end.

---

## The AI usage statement

A required short section in your project report:

```text
## AI usage statement
I used the following AI tools during this project:
- [Tool]: for [purpose — e.g., debugging, drafting Section 3]
- [Tool]: for [purpose]
Optional: one or two sentences on anything noteworthy.
```

Brief, not bureaucratic.

---

## Good vs bad AI use

**Good:** Used AI to learn SHAP. Applied it to my dataset. Can explain what each plot means and why I picked the features I did.

**Bad:** Pasted AI-generated analysis into the report. Can't explain what `feature_perturbation="tree_path_dependent"` does in the SHAP call.

The first path will likely **outperform** working without AI. The second path fails the oral.

---

## Before Lecture 3

- Get an AI tool **set up and working**
- Re-read the **AI page** on the course site
- Have a **candidate dataset** by Lecture 3 (preprocessing)
- Try running today's scikit-learn pipeline on your own machine

---

## Three habits for the semester

- **Read** what the AI writes
- **Verify** what you can't explain
- **Document** what you used

---

## Questions

Open the floor.

---

## Backup slides

Use if questions arise or time allows.

---

## Second leakage example: target encoding

```python
from category_encoders import TargetEncoder

# Wrong: fit on full data, then cross-validate
X["city_encoded"] = TargetEncoder().fit_transform(X["city"], y)
scores = cross_val_score(model, X, y, cv=5)
```

The encoder saw the **target values** from the validation fold during fit.

Fix: make the encoder a step in the pipeline and pass that pipeline to `cross_val_score`, so the encoder is refit on the training rows of each fold.

---

## A worked debugging session

**Error:**
```text
ValueError: Input contains NaN, infinity
or a value too large for dtype('float64').
```

**Good prompt:** "I'm getting this error when calling `pipe.fit(X_train, y_train)`. Here is `X_train.info()`: [paste]. Here is the pipeline: [paste]. What's most likely going wrong?"

**Why it works:** it gives full context: the error, the column dtypes and non-null counts, and the code.
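
Whatever the assistant suggests, you can confirm the diagnosis yourself. The sketch below uses a hypothetical six-row training set; median imputation is one possible fix, not the only one.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical training data with one missing measurement.
X_train = pd.DataFrame({
    "bill_length_mm": [39.1, 46.5, np.nan, 50.0, 38.2, 47.3],
    "body_mass_g": [3750.0, 4200.0, 3800.0, 5100.0, 3600.0, 4900.0],
})
y_train = pd.Series([0, 1, 0, 1, 0, 1])

# Step 1: locate the missing values per column.
missing = X_train.isna().sum()
print(missing)

# Step 2: one possible fix is to impute inside the pipeline, so the fill value
# is learned from training rows only.
pipe = make_pipeline(
    SimpleImputer(strategy="median"), StandardScaler(), LogisticRegression()
)
pipe.fit(X_train, y_train)
print(pipe.predict(X_train))
```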

---

## Comparing two tools on the same task

Same prompt to two assistants:

> "Add 5-fold CV with ROC-AUC to my logistic regression pipeline on this CSV."

What to notice: terminal agents tend to make fewer data-specific mistakes because they can read the actual CSV rather than guess its columns. **Neither replaces your review.**

---

## Hallucinated sklearn API

AI-generated code:

```python
from sklearn.ensemble import GradientBoostingClassifier
model = GradientBoostingClassifier(early_stopping=True)
```

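`GradientBoostingClassifier` has **no `early_stopping` parameter.** Its early stopping is controlled by `n_iter_no_change`.

(The assistant is most likely thinking of `HistGradientBoostingClassifier`, which does have `early_stopping`.)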

Always check the **official API docs.**

---

## AI for writing the report itself

**Fair:**

- Drafting structure or paragraphs that you then edit and verify
- Tightening prose you already wrote
- Translating your bullet-point notes into a section

**Not fair:**

- Pasting AI-generated analysis or numbers you can't reproduce from your code
- Submitting AI-drafted text verbatim without disclosing it
- Asking AI to "interpret" results you haven't read yourself

**BI's academic misconduct rules apply** to unattributed use and to text presented as your own when you cannot account for it.

When in doubt: disclose it in the AI usage statement and be ready to explain it at the oral.

---

## Notebooks vs scripts with AI agents

- **Notebooks:** good for exploration, bad for AI agents (cell state is invisible to the agent)
- **Scripts:** AI agents can read, edit, and run them end-to-end
- **Recommendation:** use notebooks for exploration, scripts for the final pipeline
- Export the final analysis to a clean script before submission

---

## What's next

**Lecture 3:** Foundations and preprocessing pipelines

- Why preprocessing is part of the model
- scikit-learn Pipelines and ColumnTransformers
- The split-first rule