Foundations and preprocessing pipelines
Data cleaning, feature engineering, and scikit-learn pipelines
Preprocessing is part of the model, not a separate cleanup step. This lecture covers the key preprocessing tasks (imputation, encoding, scaling), explains why the split-first rule (split the data before fitting any preprocessing step) prevents data leakage, and shows how scikit-learn’s Pipeline and ColumnTransformer keep the workflow reproducible. We also discuss how preprocessing requirements differ across model families: tree-based methods, distance-based methods, and linear models.
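As a rough illustration of these ideas, the sketch below splits first, then fits imputation, scaling, and encoding inside a single Pipeline so that their statistics are learned from the training rows only. The toy dataset, the column names (`age`, `income`, `city`), and the choice of logistic regression are assumptions made for this example, not part of the lecture material.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: two numeric columns and one categorical column.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "age": rng.normal(40, 10, n),
    "income": rng.normal(50_000, 15_000, n),
    "city": rng.choice(["A", "B", "C"], n),
})
# Knock out some ages so the imputer has something to do.
X.loc[rng.choice(n, 20, replace=False), "age"] = np.nan
y = (X["income"] > 50_000).astype(int)

# Split FIRST: nothing below sees the test rows until scoring.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Numeric columns: fill missing values with the median, then standardize.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Route each column group to its own preprocessing.
preprocess = ColumnTransformer([
    ("num", numeric, ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])

# Preprocessing and estimator travel together as one model object.
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

model.fit(X_train, y_train)           # medians, means, categories come from train only
accuracy = model.score(X_test, y_test)  # test rows are transformed, never fitted on
print(f"held-out accuracy: {accuracy:.3f}")
```

Because the transformers live inside the pipeline, the same object can be passed to cross-validation and each fold refits the preprocessing on its own training portion. For a tree-based estimator, the `scale` step could reasonably be dropped, since trees split on thresholds and are not sensitive to feature scale.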