Before building machine learning models, your data must be clean and consistent. This lesson helps you learn how to find, handle, and fix **missing values** and **outliers** using Pandas.
import pandas as pd
df = pd.read_csv("data.csv")
print(df.isnull().sum())
Use isnull().sum() to quickly see how many missing values each column has.
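To see what that output looks like without a CSV file, here is a minimal sketch on a small made-up DataFrame. The column names and values are invented for illustration and are not from the lesson's data.csv. It also prints the percentage of missing values per column, which is often easier to judge than raw counts.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, invented for illustration only
df = pd.DataFrame({
    "Age": [25, np.nan, 31, 40, np.nan],
    "City": ["Pune", "Delhi", None, "Mumbai", "Delhi"],
})

missing_counts = df.isnull().sum()        # number of missing values per column
missing_share = df.isnull().mean() * 100  # percentage of missing values per column

print(missing_counts)
print(missing_share)
```

Here Age has two missing entries out of five and City has one, so the counts and percentages differ by column.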
# Drop rows with missing values (returns a new DataFrame; df itself is unchanged)
df_dropped = df.dropna()
# Or fill with the mean (median and mode work the same way)
df['Age'] = df['Age'].fillna(df['Age'].mean())
dropna() removes rows, while fillna() fills in the gaps. Choose based on how much data is missing and what the column represents.
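As a rough illustration of that choice, the sketch below applies both approaches to a small made-up DataFrame (the data is hypothetical). It assumes a numeric column is filled with the median and a text column with the mode, which is one common convention rather than the only option.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, invented for illustration only
df = pd.DataFrame({
    "Age": [22, 25, np.nan, 30, 95],
    "City": ["Delhi", None, "Delhi", "Pune", "Delhi"],
})

# Option 1: drop only the rows where 'Age' is missing
dropped = df.dropna(subset=["Age"])

# Option 2: fill instead of dropping (work on a copy to keep df intact)
filled = df.copy()
# The median is less affected by the extreme value 95 than the mean would be
filled["Age"] = filled["Age"].fillna(filled["Age"].median())
# mode() returns a Series, so take its first entry (the most frequent value)
filled["City"] = filled["City"].fillna(filled["City"].mode()[0])

print(len(dropped))
print(filled)
```

Dropping keeps four of the five rows, while filling keeps all five and replaces the gaps with typical values.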
import matplotlib.pyplot as plt
import seaborn as sns
sns.boxplot(x=df["Age"])
plt.show()  # display the plot when running as a script
Use boxplots to spot outliers visually: they appear as individual points beyond the whiskers.
# Remove outliers using IQR
Q1 = df['Age'].quantile(0.25)
Q3 = df['Age'].quantile(0.75)
IQR = Q3 - Q1
df = df[(df['Age'] >= Q1 - 1.5 * IQR) & (df['Age'] <= Q3 + 1.5 * IQR)]
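This method filters out extreme values using the **interquartile range (IQR)**: only rows within 1.5 × IQR of the quartiles are kept. Rows where Age is missing are dropped as well, because comparisons with NaN evaluate to False.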
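To make the bounds concrete, here is a small worked sketch on made-up ages (the values are hypothetical, with 120 planted as the outlier). It flags outliers with a boolean mask instead of deleting them right away, so you can inspect what would be removed first.

```python
import pandas as pd

# Hypothetical ages, invented for illustration; 120 is the planted outlier
ages = pd.Series([22, 25, 27, 30, 31, 33, 35, 120], name="Age")

q1 = ages.quantile(0.25)
q3 = ages.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr  # values below this are treated as outliers
upper = q3 + 1.5 * iqr  # values above this are treated as outliers

is_outlier = (ages < lower) | (ages > upper)

print(lower, upper)
print(ages[is_outlier])   # values that would be removed
print(ages[~is_outlier])  # values that are kept
```

With these numbers the quartiles are 26.5 and 33.5, so the accepted range is 16 to 44 and only 120 falls outside it.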
# Re-check for remaining missing values after cleaning
print(df.isnull().sum())
⏭️ Next Lesson: Encoding Categorical Variables (One-Hot & Label Encoding)