Train-Test Split Concepts

Why split data?

When you evaluate a model, you want to test on data it hasn’t seen.

Train set: used to learn patterns
Test set: used only for final evaluation

This approximates how the model will perform on new, real-world data.

The biggest danger: data leakage

Leakage happens when information from the test set influences training.

Examples:

Scaling using mean/std computed on the full dataset
Filling missing values using overall mean (including test)
Feature engineering that uses future information
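
The first two examples can be avoided by fitting preprocessing on the training rows only. Here is a minimal sketch of that idea. The synthetic data, the variable names, and the choice of SimpleImputer and StandardScaler are assumptions for illustration, not the only way to do it.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumed for illustration): 100 rows, 2 features, some missing values.
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 2))
X[::10, 0] = np.nan  # every 10th row is missing its first feature
y = rng.integers(0, 2, size=100)

# Split first, before computing any statistics.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the imputer and the scaler on the training rows only.
imputer = SimpleImputer(strategy="mean").fit(X_train)
scaler = StandardScaler().fit(imputer.transform(X_train))

# Apply the already-fitted transformers to both sets.
# The test rows never contribute to the mean or standard deviation.
X_train_prepared = scaler.transform(imputer.transform(X_train))
X_test_prepared = scaler.transform(imputer.transform(X_test))
```

The leaky version would call fit on X before splitting, so the test rows would shape the statistics the model is trained with.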

Basic split with scikit-learn

train_test_split

import pandas as pd
from sklearn.model_selection import train_test_split
 
X = pd.DataFrame({"age": [20, 21, 22, 23, 24], "score": [80, 85, 78, 90, 88]})
y = pd.Series([0, 0, 0, 1, 1])
 
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.4,  # 2 of 5 rows: with stratify=y the test set needs at least as many rows as classes
    random_state=42,
    stratify=y,
)
 
print(X_train)
print(X_test)

Stratification

If your target classes are imbalanced, pass stratify=y so the train and test sets keep a similar class distribution.
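
To see the effect, you can compare the positive rate in the test set with and without stratify=y. This is a small sketch on a made-up target with 90 negatives and 10 positives; the data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 90 negatives, 10 positives.
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(-1, 1)  # placeholder feature

# Plain random split: the positive rate in the test set depends on the shuffle.
_, _, _, y_test_plain = train_test_split(X, y, test_size=0.2, random_state=0)

# Stratified split: the test set keeps the 10% positive rate of the full data.
_, _, _, y_test_strat = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

print("positive rate overall:    ", y.mean())
print("positive rate, plain:     ", y_test_plain.mean())
print("positive rate, stratified:", y_test_strat.mean())
```

With 20 test rows, the stratified split holds exactly 2 positives, while the plain split can drift above or below that depending on the random seed.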

Time-series splits

For time series, you often do not shuffle. You train on past and test on future.
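
A rough sketch of both options in scikit-learn: a single chronological holdout with shuffle=False, and expanding-window folds with TimeSeriesSplit. The ten-row series is a made-up example and is assumed to be sorted by time already.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit, train_test_split

# Hypothetical series: 10 observations, already sorted from oldest to newest.
X = np.arange(10).reshape(-1, 1)
y = np.arange(10)

# Single holdout: shuffle=False keeps the last 20% of rows as the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False
)
print(X_test.ravel())  # the two most recent observations: [8 9]

# Expanding-window folds: each fold trains on earlier rows and tests on later ones.
tscv = TimeSeriesSplit(n_splits=3)
folds = list(tscv.split(X))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

In every fold the largest training index is smaller than the smallest test index, so the model never sees the future during training.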

Good practice

Keep a final test set untouched.
Use cross-validation on training data for tuning.
Put preprocessing inside a pipeline.
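
These three habits fit together in a few lines. The sketch below is one possible setup, not a recipe: the synthetic dataset from make_classification, the LogisticRegression model, and the grid of C values are all assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (assumed for illustration).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# 1. Set the final test set aside before any tuning.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 2. Put preprocessing inside the pipeline so the scaler is refit
#    on the training part of every cross-validation fold.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 3. Tune with cross-validation on the training data only.
search = GridSearchCV(
    pipeline,
    param_grid={"logisticregression__C": [0.1, 1.0, 10.0]},
    cv=5,
)
search.fit(X_train, y_train)

print("best C:", search.best_params_["logisticregression__C"])
print("cross-validated accuracy:", search.best_score_)

# 4. Touch the test set once, at the very end.
print("held-out test accuracy:", search.score(X_test, y_test))
```

Because the scaler lives inside the pipeline, each fold computes its mean and standard deviation from that fold's training rows, which avoids the leakage described earlier.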
