Cleaning Data (astype, duplicates, string cleaning)
What "dirty data" looks like
Common problems:
- Numbers stored as strings ("1,200", " 42 ")
- Inconsistent categories ("Delhi", "delhi", " DELHI ")
- Duplicate rows
- Weird whitespace and punctuation
Example dataset
Dirty data
import pandas as pd
df = pd.DataFrame({
"name": [" Asha ", "Ravi", "ravi", "Meera"],
"city": ["delhi", "Delhi ", "DELHI", " Pune"],
"amount": ["1,200", "500", "500", " 700 "],
})
print(df)

1) Fix text: str.strip, str.lower, str.replace
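All three methods work elementwise on a Series and chain naturally. As a quick sketch of str.replace (which the main example below doesn't use), here is a throwaway Series with invented punctuation:

```python
import pandas as pd

s = pd.Series([" Asha! ", "RAVI,", "meera"])

# strip whitespace, lowercase, then remove stray punctuation
cleaned = (
    s.str.strip()
     .str.lower()
     .str.replace("!", "", regex=False)
     .str.replace(",", "", regex=False)
)
print(cleaned.tolist())  # ['asha', 'ravi', 'meera']
```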
Normalize name/city
clean = df.copy()
clean["name"] = clean["name"].str.strip()
clean["city"] = clean["city"].str.strip().str.lower()
print(clean)

2) Convert string numbers to numeric: to_numeric
Remove commas and spaces first
Clean amount and convert
clean = df.copy()
clean["amount"] = clean["amount"].astype(str).str.replace(",", "", regex=False).str.strip()
clean["amount"] = pd.to_numeric(clean["amount"], errors="coerce")
print(clean)
print(clean.dtypes)

errors="coerce" turns invalid values into NaN (so you can handle them with dropna/fillna).
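To see the coercion in action, here is a small sketch (the "n/a" value is invented for illustration):

```python
import pandas as pd

raw = pd.Series(["1,200", "500", "n/a"])

# "n/a" cannot be parsed, so errors="coerce" turns it into NaN
nums = pd.to_numeric(raw.str.replace(",", "", regex=False), errors="coerce")
print(nums)

# then handle the NaN explicitly, e.g. fill with 0 (or use nums.dropna())
print(nums.fillna(0).tolist())  # [1200.0, 500.0, 0.0]
```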
3) Change column types safely: astypeastype
astype example
clean = clean.astype({"amount": "int64"})
print(clean.dtypes)

If conversion fails, Pandas raises an error. Prefer to_numeric(..., errors="coerce") for messy data.
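A quick sketch of the difference, using an invented bad value:

```python
import pandas as pd

s = pd.Series(["500", "oops"])

# astype stops at the first unparseable value
try:
    s.astype("int64")
except ValueError as e:
    print("astype failed:", e)

# to_numeric with errors="coerce" never raises; "oops" becomes NaN
print(pd.to_numeric(s, errors="coerce"))
```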
4) Remove duplicates: drop_duplicatesdrop_duplicates
Exact duplicates
Drop exact duplicates
dedup = clean.drop_duplicates()
print(dedup)

Duplicates based on a subset of columns
Duplicates by normalized name+city
dedup = clean.drop_duplicates(subset=["name", "city"], keep="first")
print(dedup)

Tip: validate your cleaning
After cleaning, quickly check:
- df.dtypes
- df.isna().sum()
- df["col"].value_counts() for categories
Cleaning is not a single step. It's a loop: inspect → clean → validate.
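Putting the whole loop together on a fresh copy of the example data (only two columns, to keep the sketch short):

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["delhi", "Delhi ", "DELHI", " Pune"],
    "amount": ["1,200", "500", "500", " 700 "],
})

# clean
df["city"] = df["city"].str.strip().str.lower()
df["amount"] = pd.to_numeric(
    df["amount"].str.replace(",", "", regex=False).str.strip()
)

# validate
print(df.dtypes)                  # amount should now be numeric
print(df.isna().sum())            # any values that failed to convert?
print(df["city"].value_counts())  # categories collapsed to one spelling each
print(df.duplicated().sum())      # exact duplicate rows still present
```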
🧪 Try It Yourself
Exercise 1 β Create a DataFrame
Exercise 2 β Select a Column
Exercise 3 β Filter Rows
