Foundations and Preprocessing Pipelines

Preprocessing is part of the model, not a separate cleanup step. This lecture covers the key preprocessing tasks (imputation, encoding, scaling), why splitting the data before fitting any transformer (the split-first rule) prevents data leakage, and how scikit-learn’s Pipeline and ColumnTransformer keep the workflow reproducible. We also discuss how preprocessing requirements differ across model families: tree-based methods, distance-based methods, and linear models.
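As a rough illustration of the split-first rule, the sketch below splits the data before anything is fitted, then wraps imputation, encoding, and scaling in a Pipeline with a ColumnTransformer so that every statistic is learned from the training rows only. The toy data, the column names (`age`, `income`, `city`), and the choice of LogisticRegression are assumptions made for this example, not part of the lecture material.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: column names and values are made up for illustration.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.normal(40, 10, n),
    "income": rng.normal(50_000, 15_000, n),
    "city": rng.choice(["A", "B", "C"], n),
})
# Noisy binary target loosely related to income (also an assumption).
y = ((df["income"] + rng.normal(0, 10_000, n)) > 50_000).astype(int)

# Inject some missing values so the imputers have work to do.
df.loc[rng.choice(n, 20, replace=False), "age"] = np.nan
df.loc[rng.choice(n, 20, replace=False), "city"] = np.nan

# Split first: nothing below sees the test rows until scoring.
X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.25, random_state=0
)

# Numeric columns: median imputation, then standardization.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical columns: most-frequent imputation, then one-hot encoding.
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer([
    ("num", numeric, ["age", "income"]),
    ("cat", categorical, ["city"]),
])

# Preprocessing and estimator form a single model object.
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

# fit() learns medians, means, variances, and categories from X_train only.
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
```

Because the transformers live inside the pipeline, calling `model.predict` or `model.score` on new data reuses the training-set statistics, and the same object can be passed to cross-validation utilities so that each fold refits the preprocessing on its own training portion. For a tree-based estimator the scaling step could likely be dropped, while distance-based and linear models usually benefit from it.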
