If you are starting your journey in data cleaning with Python, the first step is understanding what Pandas really is. In this section, we explain the basics of Pandas and the DataFrame (df), and why Pandas is the most popular tool for data manipulation and data cleansing in Python.
Answer: Pandas is an open-source Python library used for data analysis, cleaning, and manipulation. It provides flexible data structures like Series (1-D) and DataFrame (2-D) to work with structured data (tables, CSVs, Excel sheets, SQL query results).
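A quick illustration of the two core structures (the names and values below are invented sample data):

```python
import pandas as pd

# A Series is a single labeled column of values (1-D)
ages = pd.Series([23, 25, 22], name="Age")
print(ages)

# A DataFrame is a table of rows and columns (2-D)
df = pd.DataFrame({"Name": ["Amit", "Priya", "Ravi"], "Age": [23, 25, 22]})
print(df)
print(df.shape)  # (rows, columns)
```

A DataFrame is essentially a collection of Series that share a common index, which is why column access like `df["Age"]` returns a Series.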
Answer: Pandas is mainly used for loading, cleaning, transforming, and analyzing tabular data. In simple terms, Pandas is a powerful Python tool that helps turn raw data into clean, structured tables so analysts can focus on insights instead of messy datasets.
Answer: In Pandas, df is a common variable name used for a DataFrame. A DataFrame is a 2-D table of rows and columns, just like an Excel sheet or SQL table.
```python
import pandas as pd

df = pd.read_csv("data.csv")
print(df.head())  # displays first 5 rows
```
Answer: Pandas is the library, while df (a DataFrame) is the data structure created using Pandas to store and manipulate tabular data.
📘 Learn more about Pandas Data Cleaning in the next section: Handling Missing Values with Pandas.
A common step in data cleaning with Python is importing data from external files. With Pandas, you can easily load CSV and Excel files into a DataFrame (df) for further analysis.
Answer: Use the read_csv() function to load CSV data directly into a DataFrame.
```python
import pandas as pd

# Load CSV into a DataFrame
df = pd.read_csv("data.csv")

# Preview first 5 rows
print(df.head())
👉 This is the most common way to import a CSV file into Pandas as a DataFrame.
Answer: Use the read_excel() function to load Excel sheets into Pandas.
```python
# Load Excel file
df_excel = pd.read_excel("data.xlsx")

# Display first 5 rows
print(df_excel.head())
```
Answer: After importing, use head() to preview rows and describe() to get statistical summaries.
```python
# Preview first 10 rows
print(df.head(10))

# Get data summary
print(df.describe())
```
📊 Now that you can load files into Pandas, let’s move to data cleaning methods such as handling missing values.
head() Method in Python
The head() function in Pandas is used to quickly preview the first few rows of a DataFrame (df).
```python
import pandas as pd

# Sample DataFrame
data = {'Name': ['Amit', 'Priya', 'Ravi', 'Sneha', 'Karan', 'Anjali'],
        'Age': [23, 25, 22, 24, 28, 21]}
df = pd.DataFrame(data)

# Display first 5 rows
print(df.head())

# Display first 3 rows
print(df.head(3))
```
```
    Name  Age
0   Amit   23
1  Priya   25
2   Ravi   22
3  Sneha   24
4  Karan   28
```
iloc – Integer Indexing
Use iloc to select rows and columns by integer position.
```python
import pandas as pd

# Sample DataFrame
data = {'Name': ['Amit', 'Priya', 'Ravi', 'Sneha', 'Karan'],
        'Age': [23, 25, 22, 24, 28]}
df = pd.DataFrame(data)

# Select a single row (first row)
print(df.iloc[0])

# Select multiple rows (first 3 rows)
print(df.iloc[0:3])

# Select a specific cell (row 1, column 0 -> 'Name')
print(df.iloc[1, 0])
```
```
Name    Amit
Age       23
Name: 0, dtype: object
    Name  Age
0   Amit   23
1  Priya   25
2   Ravi   22
Priya
```
sort_values() & sort_index()
Use sort_values() to sort by column values and sort_index() to sort by index.
```python
import pandas as pd

# Sample DataFrame
data = {'Name': ['Amit', 'Priya', 'Ravi', 'Sneha'],
        'Monthly Income': [50000, 70000, 45000, 60000]}
df = pd.DataFrame(data)

# Sort by index (descending)
print(df.sort_index(axis=0, ascending=False))

# Sort by column values (Monthly Income, descending)
descending_order = df.sort_values(by='Monthly Income', ascending=False)
print(descending_order)
```
```
    Name  Monthly Income
3  Sneha           60000
2   Ravi           45000
1  Priya           70000
0   Amit           50000
    Name  Monthly Income
1  Priya           70000
3  Sneha           60000
0   Amit           50000
2   Ravi           45000
```
isnull() in Pandas
Use isnull() to check for missing values and isnull().sum() to count them in each column.
```python
import pandas as pd

# Sample DataFrame with missing values
data = {'Name': ['Amit', 'Priya', 'Ravi', 'Sneha'],
        'Age': [25, None, 30, None]}
df = pd.DataFrame(data)

# Detect missing values (Boolean DataFrame)
print(df.isnull())

# Count missing values column-wise
print(df.isnull().sum())
```
```
    Name    Age
0  False  False
1  False   True
2  False  False
3  False   True
Name    0
Age     2
dtype: int64
```
Use isnull() with any() and sum() to find columns with missing values, and to filter rows containing null values.
```python
import pandas as pd

# Sample DataFrame
data = {'Name': ['Amit', 'Priya', 'Ravi', 'Sneha'],
        'Age': [25, None, 30, None],
        'City': ['Delhi', 'Mumbai', None, 'Pune']}
df = pd.DataFrame(data)

# Columns with missing values
print(df.isnull().any(axis=0))

# Count missing values per column
print(df.isnull().sum())

# Rows with missing values
missing_rows = df[df.isnull().any(axis=1)]
print(missing_rows)
```
```
Name    False
Age      True
City     True
dtype: bool
Name    0
Age     2
City    1
dtype: int64
    Name   Age    City
1  Priya   NaN  Mumbai
2   Ravi  30.0     NaN
3  Sneha   NaN    Pune
```
The describe() method summarizes numerical columns: count, mean, std, min, 25%, 50% (median), 75%, and max values.
```python
import pandas as pd

# Sample DataFrame
data = {'Age': [25, 30, 35, 40, 28],
        'Income': [40000, 50000, 60000, 75000, 48000]}
df = pd.DataFrame(data)

# Descriptive statistics
print(df.describe())
```
```
             Age        Income
count   5.000000      5.000000
mean   31.600000  54600.000000
std     6.429101  14031.787366
min    25.000000  40000.000000
25%    28.000000  48000.000000
50%    30.000000  50000.000000
75%    35.000000  60000.000000
max    40.000000  75000.000000
```
The info() method displays column names, data types, non-null counts, and memory usage of a DataFrame.
```python
import pandas as pd

# Sample DataFrame
data = {'Name': ['Amit', 'Priya', 'Ravi'],
        'Age': [25, 30, None],
        'Country': ['India', 'USA', 'UK']}
df = pd.DataFrame(data)

# Get DataFrame info
df.info()
```
```
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   Name     3 non-null      object
 1   Age      2 non-null      float64
 2   Country  3 non-null      object
dtypes: float64(1), object(2)
memory usage: 200.0+ bytes
```
Use fillna() to replace NaN values in different ways: constant values, forward/backward fill, interpolation, mean/median, or custom logic.
```python
# Fill missing values with a constant
df.fillna(0)

# Fill with the previous value (forward fill;
# df.ffill() replaces the deprecated fillna(method='ffill'))
df.ffill()

# Fill with interpolated values
df.interpolate()

# Fill missing values with each column's mean
df.fillna(value=df.mean(numeric_only=True))

# Fill missing values with each column's median
df.fillna(value=df.median(numeric_only=True))

# Fill using a custom function's return value
def my_func():
    return 99

df.fillna(value=my_func())

# Fill a specific column
df['Education'] = df['Education'].fillna(value='Partial College')
```
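The snippets above assume df already exists. Here is a self-contained sketch of the median-fill approach, using invented sample columns:

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [25.0, None, 30.0, None],
    "Income": [40000.0, 50000.0, None, 70000.0],
})

# Median fill is robust to outliers, which is why it is often
# preferred over mean fill for skewed columns like income
filled = df.fillna(df.median(numeric_only=True))
print(filled)
```

Here the Age gaps are filled with 27.5 (median of 25 and 30) and the Income gap with 50000 (median of the three known values).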
Use dropna() to remove rows or columns with missing values in a DataFrame.
```python
# Drop rows with any NaN values
df.dropna()

# Drop rows where all values are NaN
df.dropna(how='all')

# Drop columns with NaN values
df.dropna(axis=1)

# Keep rows with at least 2 non-NaN values
df.dropna(thresh=2)

# Drop rows where 'Age' is NaN
df.dropna(subset=['Age'])
```
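To see how these options differ, here is a small self-contained example (sample data invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Amit", "Priya", None],
    "Age": [25.0, None, None],
})

print(df.dropna())           # keeps only row 0 (no NaN at all)
print(df.dropna(how='all'))  # drops only row 2 (entirely NaN)
print(df.dropna(thresh=1))   # keeps rows with at least 1 non-NaN value
```

Note that dropna() returns a new DataFrame by default; assign the result (e.g. `df = df.dropna()`) to keep it.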
The replace() method is used to substitute specific values in a DataFrame or Series.
```python
import numpy as np

# Replace value 0 with NaN
df.replace(0, np.nan)

# Replace multiple placeholder values at once
df.replace(['?', 'NA', 'missing'], np.nan)

# Replace using a column-specific mapping
df.replace({'Gender': {'M': 'Male', 'F': 'Female'}})

# Replace text patterns using regex
df.replace(to_replace=r'^Unknown.*', value='Other', regex=True)

# Replace 0 in 'Income' column with the column median
# (assignment is preferred over the deprecated chained inplace=True)
df['Income'] = df['Income'].replace(0, df['Income'].median())
```
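A runnable sketch combining two of these patterns, with invented sample data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Gender": ["M", "F", "NA"],
                   "Income": [50000, 0, 60000]})

# First convert placeholder strings into real NaN,
# then expand the coded Gender values
df = df.replace(["NA", "?", "missing"], np.nan)
df = df.replace({"Gender": {"M": "Male", "F": "Female"}})
print(df)
```

Doing the placeholder-to-NaN conversion first matters: once "NA" is a real NaN, functions like isnull() and fillna() can see and handle it.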
Use duplicated() to detect duplicate rows and drop_duplicates() to remove them.
```python
# Find duplicate rows (Boolean Series)
df.duplicated()

# Count duplicate rows
df.duplicated().sum()

# Drop duplicate rows
df.drop_duplicates()

# Drop duplicates based on the 'Name' column
df.drop_duplicates(subset=['Name'])

# Keep the last occurrence, drop the others
df.drop_duplicates(keep='last')
```
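A minimal self-contained example (sample rows invented; the third row repeats the first):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Amit", "Priya", "Amit"],
    "Age": [23, 25, 23],
})

print(df.duplicated())        # True only for row 2 (repeat of row 0)
print(df.duplicated().sum())  # number of duplicate rows
deduped = df.drop_duplicates()
print(deduped)
```

By default the first occurrence is kept and later repeats are dropped; keep='last' reverses that.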
The apply() method lets you apply a custom function (or a lambda) to each element, row, or column in a DataFrame.
```python
# Convert all names to lowercase
df['Name'] = df['Name'].apply(lambda x: x.lower())

# Remove leading/trailing spaces
df['City'] = df['City'].apply(lambda x: x.strip())

# Replace negative values with 0
df['Income'] = df['Income'].apply(lambda x: max(x, 0))

# Create Full_Name by combining two columns
df['Full_Name'] = df.apply(lambda row: row['First'] + ' ' + row['Last'], axis=1)

# Define a custom function
def clean_age(x):
    return 0 if pd.isnull(x) else int(x)

df['Age'] = df['Age'].apply(clean_age)
```
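As a runnable sketch, here is apply() cleaning two invented sample columns in one pass:

```python
import pandas as pd

df = pd.DataFrame({"Name": [" AMIT ", "Priya"],
                   "Income": [-500, 60000]})

# Chain string cleanup steps inside one lambda
df["Name"] = df["Name"].apply(lambda x: x.strip().lower())

# Clip negative incomes to 0
df["Income"] = df["Income"].apply(lambda x: max(x, 0))
print(df)
```

For pure string cleanup, the vectorized `df["Name"].str.strip().str.lower()` does the same job and is usually faster; apply() shines when the logic is genuinely custom.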
Data cleaning, also known as data cleansing or data scrubbing, is the process of identifying and correcting or removing errors and inconsistencies in datasets to improve data quality and ensure the data is suitable for analysis.
Common data cleaning tasks include handling missing values, removing duplicates, correcting data types, standardizing formats, dealing with outliers, and handling inconsistencies in data.
Imputation is the process of filling in missing values with estimated or calculated values. You can use Pandas functions like fillna() or libraries like Scikit-Learn’s SimpleImputer for imputation.
You can handle missing values in Pandas using dropna() to remove rows with missing values, fillna() to fill them with a specified value, or techniques like interpolation and imputation.
You can remove duplicate rows in a Pandas DataFrame using the drop_duplicates() method.
Outliers are data points that deviate significantly from the rest of the data. You can detect and handle outliers using techniques such as Z-score, IQR (Interquartile Range), or visualization methods. Libraries like Scikit-Learn and Matplotlib can be helpful.
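The IQR rule mentioned above can be sketched in Pandas like this; the 1.5 multiplier is the conventional choice, and the sample values are invented:

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 100])  # 100 is an obvious outlier

# Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = s[(s < lower) | (s > upper)]
print(outliers)
```

Once flagged, outliers can be dropped, capped at the fence values, or investigated individually, depending on whether they are errors or genuine extreme observations.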