Train-Test Split Concepts

Why split data?

When you evaluate a model, you want to test on data it hasn’t seen.

  • Train set: used to learn patterns
  • Test set: used only for final evaluation

Performance on the test set then approximates how the model will behave on new, real-world data.

The biggest danger: data leakage

Leakage happens when information from the test set influences training.

Examples:

  • Scaling using mean/std computed on the full dataset
  • Filling missing values using overall mean (including test)
  • Feature engineering that uses future information
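
To make the first example concrete, here is a minimal sketch of scaling with and without leakage. The data is synthetic and made up for illustration, and StandardScaler is just one possible scaler; the point is only where fit is called.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data, made up for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 2))
y = rng.integers(0, 2, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Leaky: mean/std are computed on all rows, including the test rows.
leaky_scaler = StandardScaler().fit(X)

# Leak-free: mean/std are computed on the training rows only.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # reuses the training mean/std

print("mean from full data: ", leaky_scaler.mean_)
print("mean from train only:", scaler.mean_)
```

The same pattern applies to imputing missing values: compute the fill value on the training rows, then apply it to both sets.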

Basic split with scikit-learn

train_test_split
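import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny toy dataset: two features and a binary target.
X = pd.DataFrame({"age": [20, 21, 22, 23, 24], "score": [80, 85, 78, 90, 88]})
y = pd.Series([0, 0, 0, 1, 1])

# With stratify=y the test set needs at least as many rows as there are classes.
# test_size=0.2 would leave a single test row here and raise a ValueError,
# so this toy example holds out 2 of the 5 rows instead.
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.4,
    random_state=42,  # fixed seed so the split is reproducible
    stratify=y,       # keep class proportions similar in both sets
)

print(X_train)
print(X_test)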

Stratification

If your target classes are imbalanced, pass stratify=y so the train and test sets keep a similar class distribution.
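
A small sketch of the effect, assuming a made-up target with 90 negatives and 10 positives. With stratification, the 20-row test set receives its proportional share of positives (2 of 10); without it, the count depends on the random seed.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced target: 90 negatives, 10 positives (illustration only).
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(-1, 1)

# Plain random split: the share of positives in the test set can drift.
_, _, y_train_plain, y_test_plain = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Stratified split: class proportions are preserved in both sets.
_, _, y_train_strat, y_test_strat = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

print("positive rate, plain test set:     ", y_test_plain.mean())
print("positive rate, stratified test set:", y_test_strat.mean())
```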

Time-series splits

For time series, you usually should not shuffle. Train on the past and test on the future, so the model is never evaluated on a period that comes before data it was trained on.
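
One way to sketch this with scikit-learn, assuming the rows are already sorted by time: pass shuffle=False for a single chronological holdout, or use TimeSeriesSplit for several expanding-window folds. The twelve-row dataset is a placeholder.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit, train_test_split

# Twelve observations already sorted by time (index 0 is the oldest).
X = np.arange(12).reshape(-1, 1)
y = np.arange(12)

# Single chronological holdout: the last 25% of rows become the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, shuffle=False
)
print("holdout test rows:", X_test.ravel())

# Expanding-window folds: each fold trains on earlier rows, tests on later ones.
tscv = TimeSeriesSplit(n_splits=3)
folds = list(tscv.split(X))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

In every fold the largest training index is smaller than the smallest test index, which is the property that keeps future information out of training.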

Good practice

  • Keep a final test set untouched.
  • Use cross-validation on training data for tuning.
  • Put preprocessing inside a pipeline.
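
The three points above can be combined in a short sketch. The dataset comes from make_classification and the model choice (StandardScaler followed by LogisticRegression) is an assumption for illustration, not a recommendation for any particular problem.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic classification data (illustration only).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Hold out a final test set and do not touch it while tuning.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Preprocessing lives inside the pipeline, so in each cross-validation fold
# the scaler is fitted on that fold's training part only.
model = make_pipeline(StandardScaler(), LogisticRegression())

# Cross-validate on the training data for model selection and tuning.
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
print("cross-validation accuracy:", cv_scores.mean())

# Final step: refit on all training data, then evaluate once on the test set.
model.fit(X_train, y_train)
test_score = model.score(X_test, y_test)
print("held-out test accuracy:", test_score)
```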
