Cleaning Data (astype, duplicates, string cleaning)
What "dirty data" looks like
Common problems:
- Numbers stored as strings ("1,200", " 42 ")
- Inconsistent categories ("Delhi", "delhi", " DELHI ")
- Duplicate rows
- Weird whitespace and punctuation
Example dataset
Dirty data
import pandas as pd
df = pd.DataFrame({
"name": [" Asha ", "Ravi", "ravi", "Meera"],
"city": ["delhi", "Delhi ", "DELHI", " Pune"],
"amount": ["1,200", "500", "500", " 700 "],
})
print(df)

1) Fix text: str.strip, str.lower, str.replace
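All three methods work elementwise on a Series and chain naturally. As a quick sketch of str.replace (which the main example below doesn't use), here is a throwaway Series with invented punctuation:

```python
import pandas as pd

s = pd.Series([" Asha! ", "RAVI,", "meera"])

# strip whitespace, lowercase, then remove stray punctuation
cleaned = (
    s.str.strip()
     .str.lower()
     .str.replace("!", "", regex=False)
     .str.replace(",", "", regex=False)
)
print(cleaned.tolist())  # ['asha', 'ravi', 'meera']
```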
Normalize name/city
clean = df.copy()
clean["name"] = clean["name"].str.strip()
clean["city"] = clean["city"].str.strip().str.lower()
print(clean)

2) Convert string numbers to numeric: to_numeric
Remove commas and spaces first
Clean amount and convert
clean = df.copy()
clean["amount"] = clean["amount"].astype(str).str.replace(",", "", regex=False).str.strip()
clean["amount"] = pd.to_numeric(clean["amount"], errors="coerce")
print(clean)
print(clean.dtypes)

errors="coerce" turns invalid values into NaN (so you can handle them with dropna/fillna).
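To see the coercion in action, here is a small sketch (the "n/a" value is invented for illustration):

```python
import pandas as pd

raw = pd.Series(["1,200", "500", "n/a"])

# "n/a" cannot be parsed, so errors="coerce" turns it into NaN
nums = pd.to_numeric(raw.str.replace(",", "", regex=False), errors="coerce")
print(nums)

# then handle the NaN explicitly, e.g. fill with 0 (or use nums.dropna())
print(nums.fillna(0).tolist())  # [1200.0, 500.0, 0.0]
```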
3) Change column types safely: astypeastype
astype example
clean = clean.astype({"amount": "int64"})
print(clean.dtypes)

If conversion fails, Pandas raises an error. Prefer to_numeric(..., errors="coerce") for messy data.
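A quick sketch of the difference, using an invented bad value:

```python
import pandas as pd

s = pd.Series(["500", "oops"])

# astype stops at the first unparseable value
try:
    s.astype("int64")
except ValueError as e:
    print("astype failed:", e)

# to_numeric with errors="coerce" never raises; "oops" becomes NaN
print(pd.to_numeric(s, errors="coerce"))
```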
4) Remove duplicates: drop_duplicatesdrop_duplicates
Exact duplicates
Drop exact duplicates
dedup = clean.drop_duplicates()
print(dedup)

Duplicates based on a subset of columns
Duplicates by normalized name+city
dedup = clean.drop_duplicates(subset=["name", "city"], keep="first")
print(dedup)

Tip: validate your cleaning
After cleaning, quickly check:
- df.dtypes
- df.isna().sum()
- df["col"].value_counts() for categories
Cleaning is not a single step. It's a loop: inspect → clean → validate.
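Putting the whole loop together on a fresh copy of the example data (only two columns, to keep the sketch short):

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["delhi", "Delhi ", "DELHI", " Pune"],
    "amount": ["1,200", "500", "500", " 700 "],
})

# clean
df["city"] = df["city"].str.strip().str.lower()
df["amount"] = pd.to_numeric(
    df["amount"].str.replace(",", "", regex=False).str.strip()
)

# validate
print(df.dtypes)                  # amount should now be numeric
print(df.isna().sum())            # any values that failed to convert?
print(df["city"].value_counts())  # categories collapsed to one spelling each
print(df.duplicated().sum())      # exact duplicate rows still present
```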
🧪 Try It Yourself
Exercise 1 β Create a DataFrame
Exercise 2 β Select a Column
Exercise 3 β Filter Rows
