The ML Pipeline - Automating the Workflow

Why pipelines matter

Pipelines prevent a common ML failure:

  • preprocessing done differently in train vs test vs production

They also prevent leakage by ensuring:

  • preprocessors are fit only on training folds during CV
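A minimal sketch of why this works (synthetic data via `make_classification`; the scaler and model are illustrative): `cross_val_score` refits the entire pipeline on each training fold, so the scaler never sees the held-out fold.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=0)

# The scaler is a pipeline step, so CV refits it per training fold --
# the held-out fold's statistics never leak into scaling.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X, y, cv=5)
```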

The pattern

```mermaid
flowchart LR
  X[Raw X] --> P["Preprocess (fit on train)"] --> M[Model] --> Y[Predictions]
```

Pipeline components

  • `Pipeline`: chain steps
  • `ColumnTransformer`: apply different transforms to numeric vs categorical columns

Example: end-to-end tabular pipeline

Preprocess + model pipeline

```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

numeric_features = ["age", "income"]
categorical_features = ["city", "plan"]

numeric_pipe = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]
)

categorical_pipe = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]
)

preprocess = ColumnTransformer(
    transformers=[
        ("num", numeric_pipe, numeric_features),
        ("cat", categorical_pipe, categorical_features),
    ]
)

model = Pipeline(
    steps=[
        ("preprocess", preprocess),
        ("clf", LogisticRegression(max_iter=1000)),
    ]
)
```
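To show the pipeline end to end, here is a self-contained usage sketch: the same construction as above, fit on a made-up toy DataFrame (the column names match the example; the rows, including the missing values, are invented for illustration).

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Same pipeline as in the example above, built compactly.
numeric_pipe = Pipeline([("imputer", SimpleImputer(strategy="median")),
                         ("scaler", StandardScaler())])
categorical_pipe = Pipeline([("imputer", SimpleImputer(strategy="most_frequent")),
                             ("onehot", OneHotEncoder(handle_unknown="ignore"))])
preprocess = ColumnTransformer([("num", numeric_pipe, ["age", "income"]),
                                ("cat", categorical_pipe, ["city", "plan"])])
model = Pipeline([("preprocess", preprocess),
                  ("clf", LogisticRegression(max_iter=1000))])

# Toy data with missing values in both numeric and categorical columns.
X = pd.DataFrame({
    "age": [25, 40, np.nan, 33, 51, 29],
    "income": [40_000, 90_000, 55_000, np.nan, 120_000, 48_000],
    "city": ["NYC", "SF", "NYC", "SF", np.nan, "NYC"],
    "plan": ["basic", "pro", "pro", "basic", "pro", "basic"],
})
y = [0, 1, 1, 0, 1, 0]

# One call fits imputers, scaler, encoder, and classifier in order.
model.fit(X, y)
preds = model.predict(X)
```

The same `model.predict` call then applies the already-fitted transforms to new data, so train and production preprocessing can never drift apart.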

Pipelines + hyperparameter tuning

You can tune model params inside the pipeline using the step name:

  • `clf__C`
  • `clf__penalty`
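A sketch of the `step__param` convention with `GridSearchCV` (synthetic data; the grid values are just examples): the prefix before `__` is the pipeline step name, so the whole pipeline, preprocessing included, is refit per fold and per candidate.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=0)

pipe = Pipeline([("scaler", StandardScaler()),
                 ("clf", LogisticRegression(max_iter=1000))])

# "clf" is the step name; "C" is the LogisticRegression parameter.
param_grid = {"clf__C": [0.1, 1.0, 10.0]}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)
```

After fitting, `search.best_params_` reports the winning value under the same `clf__C` key, and `search.best_estimator_` is a fully refit pipeline.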

Mini-checkpoint

If you're still scaling outside a pipeline:

  • you're one step away from leakage.
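To make the checkpoint concrete, here is a hypothetical side-by-side sketch (random data, illustrative model): the leaky version fits the scaler before splitting, so test-set statistics influence training; the safe version lets the pipeline fit the scaler on training rows only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(0).normal(size=(100, 3))
y = (X[:, 0] > 0).astype(int)

# Leaky: the scaler sees every row, including future test rows.
X_scaled = StandardScaler().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X_scaled, y, random_state=0)

# Safe: split first; the pipeline fits the scaler on training data only.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
pipe = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_tr, y_tr)
score = pipe.score(X_te, y_te)
```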
