The ML Pipeline - Automating the Workflow

Why pipelines matter

Pipelines prevent a common ML failure:

  • preprocessing done differently in train vs test vs production

They also prevent leakage by ensuring:

  • preprocessors are fit only on training folds during CV
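A minimal sketch of why this works (synthetic data via `make_classification`; the scaler and model are illustrative): `cross_val_score` refits the entire pipeline on each training fold, so the scaler never sees the held-out fold.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=0)

# The scaler is a pipeline step, so CV refits it per training fold --
# the held-out fold's statistics never leak into scaling.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X, y, cv=5)
```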

The pattern

```mermaid
flowchart LR
  X[Raw X] --> P["Preprocess (fit on train)"] --> M[Model] --> Y[Predictions]
```

Pipeline components

  • `Pipeline`: chain steps
  • `ColumnTransformer`: apply different transforms to numeric vs categorical columns

Example: end-to-end tabular pipeline

Preprocess + model pipeline

```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

numeric_features = ["age", "income"]
categorical_features = ["city", "plan"]

numeric_pipe = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]
)

categorical_pipe = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]
)

preprocess = ColumnTransformer(
    transformers=[
        ("num", numeric_pipe, numeric_features),
        ("cat", categorical_pipe, categorical_features),
    ]
)

model = Pipeline(
    steps=[
        ("preprocess", preprocess),
        ("clf", LogisticRegression(max_iter=1000)),
    ]
)
```
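To show the pipeline end to end, here is a self-contained usage sketch: the same construction as above, fit on a made-up toy DataFrame (the column names match the example; the rows, including the missing values, are invented for illustration).

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Same pipeline as in the example above, built compactly.
numeric_pipe = Pipeline([("imputer", SimpleImputer(strategy="median")),
                         ("scaler", StandardScaler())])
categorical_pipe = Pipeline([("imputer", SimpleImputer(strategy="most_frequent")),
                             ("onehot", OneHotEncoder(handle_unknown="ignore"))])
preprocess = ColumnTransformer([("num", numeric_pipe, ["age", "income"]),
                                ("cat", categorical_pipe, ["city", "plan"])])
model = Pipeline([("preprocess", preprocess),
                  ("clf", LogisticRegression(max_iter=1000))])

# Toy data with missing values in both numeric and categorical columns.
X = pd.DataFrame({
    "age": [25, 40, np.nan, 33, 51, 29],
    "income": [40_000, 90_000, 55_000, np.nan, 120_000, 48_000],
    "city": ["NYC", "SF", "NYC", "SF", np.nan, "NYC"],
    "plan": ["basic", "pro", "pro", "basic", "pro", "basic"],
})
y = [0, 1, 1, 0, 1, 0]

# One call fits imputers, scaler, encoder, and classifier in order.
model.fit(X, y)
preds = model.predict(X)
```

The same `model.predict` call then applies the already-fitted transforms to new data, so train and production preprocessing can never drift apart.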

Pipelines + hyperparameter tuning

You can tune model params inside the pipeline using the step name:

  • `clf__C`
  • `clf__penalty`
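A sketch of the `step__param` convention with `GridSearchCV` (synthetic data; the grid values are just examples): the prefix before `__` is the pipeline step name, so the whole pipeline, preprocessing included, is refit per fold and per candidate.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=0)

pipe = Pipeline([("scaler", StandardScaler()),
                 ("clf", LogisticRegression(max_iter=1000))])

# "clf" is the step name; "C" is the LogisticRegression parameter.
param_grid = {"clf__C": [0.1, 1.0, 10.0]}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)
```

After fitting, `search.best_params_` reports the winning value under the same `clf__C` key, and `search.best_estimator_` is a fully refit pipeline.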

Mini-checkpoint

If you're still scaling outside a pipeline:

  • you're one step away from leakage.
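To make the checkpoint concrete, here is a hypothetical side-by-side sketch (random data, illustrative model): the leaky version fits the scaler before splitting, so test-set statistics influence training; the safe version lets the pipeline fit the scaler on training rows only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(0).normal(size=(100, 3))
y = (X[:, 0] > 0).astype(int)

# Leaky: the scaler sees every row, including future test rows.
X_scaled = StandardScaler().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X_scaled, y, random_state=0)

# Safe: split first; the pipeline fits the scaler on training data only.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
pipe = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_tr, y_tr)
score = pipe.score(X_te, y_te)
```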
