
Cleaning Data (astype, duplicates, string cleaning)

What "dirty data" looks like

Common problems:

  • Numbers stored as strings ("1,200", " 42 ")
  • Inconsistent categories ("Delhi", "delhi", " DELHI ")
  • Duplicate rows
  • Weird whitespace and punctuation

Example dataset

Dirty data
import pandas as pd
 
df = pd.DataFrame({
    "name": [" Asha ", "Ravi", "ravi", "Meera"],
    "city": ["delhi", "Delhi ", "DELHI", " Pune"],
    "amount": ["1,200", "500", "500", " 700 "],
})
 
print(df)

1) Fix text: str.strip, str.lower, str.replace

Normalize name/city
clean = df.copy()
clean["name"] = clean["name"].str.strip()
clean["city"] = clean["city"].str.strip().str.lower()
print(clean)
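The heading also mentions str.replace; here is a minimal sketch (the values are made up) of using it to strip stray punctuation before comparing strings:

```python
import pandas as pd

# Hypothetical codes written with inconsistent punctuation
s = pd.Series(["98-765", "98 765", "98.765"])

# Plain (non-regex) replacements remove dashes, dots, and spaces
cleaned = (s.str.replace("-", "", regex=False)
            .str.replace(".", "", regex=False)
            .str.replace(" ", "", regex=False))
print(cleaned.tolist())  # ['98765', '98765', '98765']
```

With regex=True a single pattern like r"[-. ]" would do the same job in one call.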

2) Convert string numbers to numeric: to_numeric

Remove commas and spaces first

Clean amount and convert
clean = df.copy()
clean["amount"] = clean["amount"].astype(str).str.replace(",", "", regex=False).str.strip()
clean["amount"] = pd.to_numeric(clean["amount"], errors="coerce")
 
print(clean)
print(clean.dtypes)

errors="coerce" turns invalid values into NaN (so you can handle them with dropna/fillna).
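A quick sketch of that behavior with a deliberately invalid value (the data here is made up):

```python
import pandas as pd

s = pd.Series(["1200", "500", "n/a"])

# "n/a" cannot be parsed, so it becomes NaN instead of raising
nums = pd.to_numeric(s, errors="coerce")
print(nums)

# Then decide: fill the gap or drop the row
print(nums.fillna(0))  # replace NaN with 0
```

nums.dropna() would remove the bad row instead of filling it.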

3) Change column types safely: astype

astype example
clean = clean.astype({"amount": "int64"})
print(clean.dtypes)

If conversion fails, pandas raises an error. Prefer to_numeric(..., errors="coerce") for messy data.
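One wrinkle worth knowing: if coercion left NaN in the column, a plain astype("int64") raises, because NaN is not representable as a regular integer. pandas's nullable "Int64" dtype handles this case. A small sketch:

```python
import pandas as pd

# Coercion turns the unparseable value into NaN
s = pd.to_numeric(pd.Series(["1200", "bad"]), errors="coerce")

# s.astype("int64") would raise here because of the NaN;
# the nullable extension dtype stores it as <NA> instead
out = s.astype("Int64")
print(out.dtype)  # Int64
print(out)
```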

4) Remove duplicates: drop_duplicates

Exact duplicates

Drop exact duplicates
dedup = clean.drop_duplicates()
print(dedup)

Duplicates based on a subset of columns

Duplicates by normalized name+city
dedup = clean.drop_duplicates(subset=["name", "city"], keep="first")
print(dedup)
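Before dropping rows, it often pays to look at them first. duplicated() returns a boolean mask, and keep=False flags every row in a duplicate group (this example uses a small made-up frame):

```python
import pandas as pd

rows = pd.DataFrame({
    "name": ["Ravi", "Ravi", "Meera"],
    "city": ["delhi", "delhi", "pune"],
})

# keep=False marks ALL members of a duplicate group, not just the repeats
mask = rows.duplicated(subset=["name", "city"], keep=False)
print(rows[mask])
```

Inspecting rows[mask] before calling drop_duplicates helps confirm you are not deleting rows that only look similar.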

Tip: validate your cleaning

After cleaning, quickly check:

  • df.dtypes
  • df.isna().sum()
  • df["col"].value_counts() for categories

Cleaning is not a single step. It's a loop: inspect → clean → validate.
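Those checks can be bundled into a small helper you rerun after each cleaning pass (the function name and sample frame here are illustrative):

```python
import pandas as pd

def quick_report(df: pd.DataFrame) -> None:
    """Print the basic post-cleaning checks: dtypes, missing counts, duplicates."""
    print(df.dtypes)
    print(df.isna().sum())
    print("duplicate rows:", df.duplicated().sum())

demo = pd.DataFrame({"amount": [1200, 500, 500]})
quick_report(demo)
```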

🧪 Try It Yourself

Exercise 1 – Create a DataFrame

Exercise 2 – Select a Column

Exercise 3 – Filter Rows
