Lesson 1 – Handling Missing Data & Outliers

🧼 Lesson 1: Handling Missing Data & Outliers in Python


Before you build machine learning models, your data must be clean and consistent. This lesson shows how to find and handle **missing values** and **outliers** using Pandas and Seaborn.

🔹 Step 1: Finding Missing Values

import pandas as pd

df = pd.read_csv("data.csv")
print(df.isnull().sum())

Use isnull().sum() to quickly see how many missing values each column has.
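Raw counts are easier to judge next to the share of rows they represent. The sketch below adds a percentage column to the summary; the small DataFrame and its column names are made up for illustration and stand in for whatever data.csv contains.

```python
import numpy as np
import pandas as pd

# Illustrative data only; these values are assumptions, not from data.csv
df = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Fare": [7.25, 71.28, np.nan, 8.05],
    "Sex": ["male", "female", "female", "male"],
})

missing_count = df.isnull().sum()        # number of missing values per column
missing_pct = df.isnull().mean() * 100   # share of rows missing, in percent

summary = pd.DataFrame({"missing": missing_count, "percent": missing_pct})
print(summary)
```

A column that is mostly empty is often a candidate for dropping entirely, while a column with only a few gaps is usually worth filling.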

🧪 Step 2: Fixing Missing Data

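# Drop rows with missing values (dropna() returns a new DataFrame, so keep the result)
df_dropped = df.dropna()

# Or fill the gaps with the mean (use .median() or .mode()[0] for the other options)
df['Age'] = df['Age'].fillna(df['Age'].mean())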

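dropna() removes rows that contain missing values, while fillna() replaces the gaps with a value you choose. Which one fits depends on how much data is missing and what the column represents.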
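The mean is pulled toward extreme values, so the median is often a safer fill for skewed numeric columns, and the mode is the usual choice for categorical ones. Here is a minimal sketch of both; the "Age" and "Embarked" columns and their values are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative data only
df = pd.DataFrame({
    "Age": [22.0, 25.0, np.nan, 30.0, 95.0],
    "Embarked": ["S", "C", "S", None, "S"],
})

# Median is less sensitive to extreme values (such as 95) than the mean
df["Age"] = df["Age"].fillna(df["Age"].median())

# mode() returns a Series because ties are possible, so take the first entry
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

print(df)
```

Assigning the result back to the column, rather than calling fillna() with inplace=True on a column selection, avoids chained-assignment warnings in recent pandas versions.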

📉 Step 3: Detecting Outliers

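import matplotlib.pyplot as plt
import seaborn as sns
sns.boxplot(x=df["Age"])
plt.show()  # needed when running as a script rather than in a notebook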

Use boxplots to spot outliers visually: they appear as individual points beyond the whiskers, which by default extend to the most extreme values within 1.5 × IQR of the box.
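The same check can be done numerically when you want a list of the flagged values rather than a picture. The sketch below assumes the default 1.5 × IQR whisker rule and uses a made-up series of ages.

```python
import pandas as pd

# Illustrative values only; 95 is the obvious extreme
ages = pd.Series([22, 25, 27, 30, 31, 33, 35, 95], name="Age")

q1 = ages.quantile(0.25)
q3 = ages.quantile(0.75)
iqr = q3 - q1

# Fences that correspond to the boxplot's default whisker limits
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Boolean mask: True for values that a boxplot would draw as separate points
is_outlier = (ages < lower) | (ages > upper)
print(ages[is_outlier])
```

Keeping the mask separate from the data lets you count or inspect the flagged rows before deciding what to do with them.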

🔧 Step 4: Handling Outliers

# Remove outliers using IQR
Q1 = df['Age'].quantile(0.25)
Q3 = df['Age'].quantile(0.75)
IQR = Q3 - Q1

df = df[(df['Age'] >= Q1 - 1.5 * IQR) & (df['Age'] <= Q3 + 1.5 * IQR)]

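This filter keeps only rows whose Age lies within 1.5 × IQR of the quartiles, where the **interquartile range (IQR)** is Q3 − Q1. Rows with a missing Age are dropped too, because comparisons with NaN evaluate to False.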
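Removing rows is not the only option. If you would rather keep every row, one common alternative is to cap extreme values at the IQR fences instead. The sketch below shows both side by side; iqr_bounds is a hypothetical helper written for this example, and the data is made up.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Illustrative data only
df = pd.DataFrame({"Age": [22, 25, 27, 30, 31, 33, 35, 95]})
lower, upper = iqr_bounds(df["Age"])

# Option 1: drop rows outside the fences (between() is inclusive by default)
df_filtered = df[df["Age"].between(lower, upper)]

# Option 2: keep every row but cap extreme values at the fences
df_capped = df.assign(Age=df["Age"].clip(lower=lower, upper=upper))

print(len(df), len(df_filtered))
print(df_capped["Age"].max())
```

Dropping is simpler but shrinks the dataset; capping preserves the row count at the cost of altering the original values, so the better choice depends on whether the extreme values are errors or genuine observations.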

🧠 Try This:

Load any dataset with missing values (e.g., Titanic or custom CSV).
Check for null values using df.isnull().sum().
Visualize outliers for “Fare” or “Age” columns using Seaborn.

⏭️ Next Lesson: Encoding Categorical Variables (One-Hot & Label Encoding)

Machine Learning with Python: From Basics to Capstone
