The ML Pipeline - Automating the Workflow
Why pipelines matter
Pipelines prevent two common ML failures:
- preprocessing applied differently in train vs. test vs. production
- leakage, because preprocessors are fit only on the training folds during cross-validation
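As a quick illustration (synthetic data, illustrative names), passing a pipeline to `cross_val_score` means the scaler is re-fit inside every fold:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + rng.normal(size=200) > 0).astype(int)

# cross_val_score clones and re-fits the whole pipeline per fold,
# so StandardScaler only ever sees that fold's training rows.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X, y, cv=5)
```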
The pattern
```mermaid
flowchart LR
    X[Raw X] --> P["Preprocess (fit on train)"] --> M[Model] --> Y[Predictions]
```
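In code, the pattern boils down to fitting the preprocessor on training data only and reusing its statistics everywhere else; a minimal sketch with `StandardScaler` standing in for the preprocessing step:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[10.0]])

scaler = StandardScaler().fit(X_train)    # statistics come from train only
X_test_scaled = scaler.transform(X_test)  # train-time mean/std reused at test time
```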
Pipeline components
- Pipeline: chains steps into a single estimator, so each transformer's output feeds the next step
- ColumnTransformer: applies different transforms to numeric vs categorical columns
Example: end-to-end tabular pipeline
Preprocess + model pipeline
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
numeric_features = ["age", "income"]
categorical_features = ["city", "plan"]
numeric_pipe = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]
)
categorical_pipe = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]
)
preprocess = ColumnTransformer(
    transformers=[
        ("num", numeric_pipe, numeric_features),
        ("cat", categorical_pipe, categorical_features),
    ]
)
model = Pipeline(
    steps=[
        ("preprocess", preprocess),
        ("clf", LogisticRegression(max_iter=1000)),
    ]
)
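Once assembled, the whole thing behaves like a single estimator: one `fit` call fits imputers, scaler, encoder, and classifier in order. A sketch with a small synthetic DataFrame (made-up values, same column names as the example):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows; np.nan marks missing values the imputers will fill.
df = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 58, 23, 37, 49],
    "income": [30_000, 50_000, 42_000, np.nan, 90_000, 28_000, 61_000, 75_000],
    "city": ["oslo", "bergen", "oslo", np.nan, "bergen", "oslo", "bergen", "oslo"],
    "plan": ["free", "pro", "pro", "free", "pro", "free", "free", "pro"],
})
y = [0, 1, 0, 0, 1, 0, 1, 1]

preprocess = ColumnTransformer(transformers=[
    ("num", Pipeline([("imputer", SimpleImputer(strategy="median")),
                      ("scaler", StandardScaler())]), ["age", "income"]),
    ("cat", Pipeline([("imputer", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), ["city", "plan"]),
])
model = Pipeline([("preprocess", preprocess),
                  ("clf", LogisticRegression(max_iter=1000))])

model.fit(df, y)           # fits imputers, scaler, encoder, then the classifier
preds = model.predict(df)  # the same preprocessing is replayed automatically
```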
Pipelines + hyperparameter tuning
You can tune model params inside the pipeline by prefixing the parameter with the step name and a double underscore:
- clf__C
- clf__penalty
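A minimal sketch of what that looks like with GridSearchCV (synthetic data; a simplified pipeline stands in for the full one above):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] - X[:, 1] + rng.normal(size=200) > 0).astype(int)

pipe = Pipeline([("scaler", StandardScaler()),
                 ("clf", LogisticRegression(max_iter=1000))])

# "<step name>__<parameter>" reaches inside the pipeline; the whole
# pipeline is re-fit per fold, so the search itself is leakage-free.
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)
```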
Mini-checkpoint
If you're still scaling outside a pipeline:
- you're one step away from leakage.
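To make the contrast concrete, here is the leaky version next to the safe one (a sketch on synthetic data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X.sum(axis=1) > 0).astype(int)

# Leaky: the scaler is fit on ALL rows, so each fold's "unseen"
# data has already influenced the scaling statistics.
X_leaky = StandardScaler().fit_transform(X)
leaky = cross_val_score(LogisticRegression(), X_leaky, y, cv=5)

# Safe: wrapped in a pipeline, the scaler is re-fit on each
# training fold inside cross_val_score.
safe = cross_val_score(make_pipeline(StandardScaler(), LogisticRegression()), X, y, cv=5)
```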