If you are starting your journey in data cleaning with Python, the first step is understanding what Pandas really is. In this section, we explain the basics of Pandas and the DataFrame (df), and why Pandas is the most popular tool for data manipulation and data cleansing in Python.
Answer: Pandas is an open-source Python library used for data analysis, cleaning, and manipulation. It provides flexible data structures like Series (1-D) and DataFrame (2-D) to work with structured data (tables, CSVs, Excel sheets, SQL query results).
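A quick illustration of the two core structures (the names and values below are invented sample data):

```python
import pandas as pd

# A Series is a single labeled column of values (1-D)
ages = pd.Series([23, 25, 22], name="Age")
print(ages)

# A DataFrame is a table of rows and columns (2-D)
df = pd.DataFrame({"Name": ["Amit", "Priya", "Ravi"], "Age": [23, 25, 22]})
print(df)
print(df.shape)  # (rows, columns)
```

A DataFrame is essentially a collection of Series that share a common index, which is why column access like `df["Age"]` returns a Series.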
Answer: Pandas is mainly used for loading, cleaning, transforming, and analyzing tabular data. In simple terms, Pandas is a powerful Python tool that helps turn raw data into clean, structured tables so analysts can focus on insights instead of messy datasets.
Answer: In Pandas, df is a common variable name used for a DataFrame. A DataFrame is a 2-D table of rows and columns, just like an Excel sheet or SQL table.
```python
import pandas as pd

df = pd.read_csv("data.csv")
print(df.head())  # displays first 5 rows
```
Answer: Pandas is the library, while df (a DataFrame) is the data structure created using Pandas to store and manipulate tabular data.
📘 Learn more about Pandas Data Cleaning in the next section: Handling Missing Values with Pandas.
A common step in data cleaning with Python is importing data from external files. With Pandas, you can easily load CSV and Excel files into a DataFrame (df) for further analysis.
Answer: Use the read_csv() function to load CSV data directly into a DataFrame.
```python
import pandas as pd

# Load CSV into a DataFrame
df = pd.read_csv("data.csv")

# Preview first 5 rows
print(df.head())
👉 This is the most common way to import a CSV file into Pandas as a DataFrame.
Answer: Use the read_excel() function to load Excel sheets into Pandas.
```python
# Load Excel file
df_excel = pd.read_excel("data.xlsx")

# Display first 5 rows
print(df_excel.head())
```
Answer: After importing, use head() to preview rows and describe() to get statistical summaries.
```python
# Preview first 10 rows
print(df.head(10))

# Get data summary
print(df.describe())
```
📊 Now that you can load files into Pandas, let’s move to data cleaning methods such as handling missing values.
head() Method in Python
The head() function in Pandas is used to quickly preview the first few rows of a DataFrame (df).
```python
import pandas as pd

# Sample DataFrame
data = {'Name': ['Amit', 'Priya', 'Ravi', 'Sneha', 'Karan', 'Anjali'],
        'Age': [23, 25, 22, 24, 28, 21]}
df = pd.DataFrame(data)

# Display first 5 rows
print(df.head())

# Display first 3 rows
print(df.head(3))
```
```
    Name  Age
0   Amit   23
1  Priya   25
2   Ravi   22
3  Sneha   24
4  Karan   28
```
iloc – Integer Indexing
Use iloc to select rows and columns by integer position.
```python
import pandas as pd

# Sample DataFrame
data = {'Name': ['Amit', 'Priya', 'Ravi', 'Sneha', 'Karan'],
        'Age': [23, 25, 22, 24, 28]}
df = pd.DataFrame(data)

# Select a single row (first row)
print(df.iloc[0])

# Select multiple rows (first 3 rows)
print(df.iloc[0:3])

# Select a specific cell (row 1, column 0 -> 'Name')
print(df.iloc[1, 0])
```
```
Name    Amit
Age       23
Name: 0, dtype: object
    Name  Age
0   Amit   23
1  Priya   25
2   Ravi   22
Priya
```
sort_values() & sort_index()
Use sort_values() to sort by column values and sort_index() to sort by index.
```python
import pandas as pd

# Sample DataFrame
data = {'Name': ['Amit', 'Priya', 'Ravi', 'Sneha'],
        'Monthly Income': [50000, 70000, 45000, 60000]}
df = pd.DataFrame(data)

# Sort by index (descending)
print(df.sort_index(axis=0, ascending=False))

# Sort by column values (Monthly Income, descending)
descending_order = df.sort_values(by='Monthly Income', ascending=False)
print(descending_order)
```
```
    Name  Monthly Income
3  Sneha           60000
2   Ravi           45000
1  Priya           70000
0   Amit           50000
    Name  Monthly Income
1  Priya           70000
3  Sneha           60000
0   Amit           50000
2   Ravi           45000
```
isnull() in Pandas
Use isnull() to check for missing values and isnull().sum() to count them in each column.
```python
import pandas as pd

# Sample DataFrame with missing values
data = {'Name': ['Amit', 'Priya', 'Ravi', 'Sneha'],
        'Age': [25, None, 30, None]}
df = pd.DataFrame(data)

# Detect missing values (Boolean DataFrame)
print(df.isnull())

# Count missing values column-wise
print(df.isnull().sum())
```
```
    Name    Age
0  False  False
1  False   True
2  False  False
3  False   True
Name    0
Age     2
dtype: int64
```
Use isnull() with any() and sum() to find columns with missing values, and to filter rows containing null values.
```python
import pandas as pd

# Sample DataFrame
data = {'Name': ['Amit', 'Priya', 'Ravi', 'Sneha'],
        'Age': [25, None, 30, None],
        'City': ['Delhi', 'Mumbai', None, 'Pune']}
df = pd.DataFrame(data)

# Columns with missing values
print(df.isnull().any(axis=0))

# Count missing values per column
print(df.isnull().sum())

# Rows with missing values
missing_rows = df[df.isnull().any(axis=1)]
print(missing_rows)
```
```
Name    False
Age      True
City     True
dtype: bool
Name    0
Age     2
City    1
dtype: int64
    Name   Age    City
1  Priya   NaN  Mumbai
2   Ravi  30.0     NaN
3  Sneha   NaN    Pune
```
The describe() method summarizes numerical columns: count, mean, std, min, 25%, 50% (median), 75%, and max values.
```python
import pandas as pd

# Sample DataFrame
data = {'Age': [25, 30, 35, 40, 28],
        'Income': [40000, 50000, 60000, 75000, 48000]}
df = pd.DataFrame(data)

# Descriptive statistics
print(df.describe())
```
```
             Age        Income
count   5.000000      5.000000
mean   31.600000  54600.000000
std     6.429101  14031.787366
min    25.000000  40000.000000
25%    28.000000  48000.000000
50%    30.000000  50000.000000
75%    35.000000  60000.000000
max    40.000000  75000.000000
```
The info() method displays column names, data types, non-null counts, and memory usage of a DataFrame.
```python
import pandas as pd

# Sample DataFrame
data = {'Name': ['Amit', 'Priya', 'Ravi'],
        'Age': [25, 30, None],
        'Country': ['India', 'USA', 'UK']}
df = pd.DataFrame(data)

# Get DataFrame info
df.info()
```
```
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   Name     3 non-null      object
 1   Age      2 non-null      float64
 2   Country  3 non-null      object
dtypes: float64(1), object(2)
memory usage: 200.0+ bytes
```
Use fillna() to replace NaN values in different ways: constant values, forward/backward fill, interpolation, mean/median, or custom logic.
```python
# Fill missing values with a constant
df.fillna(0)

# Fill with the previous value (forward fill;
# df.ffill() replaces the deprecated fillna(method='ffill'))
df.ffill()

# Fill with interpolated values
df.interpolate()

# Fill missing values with each column's mean
df.fillna(value=df.mean(numeric_only=True))

# Fill missing values with each column's median
df.fillna(value=df.median(numeric_only=True))

# Fill using a custom function's return value
def my_func():
    return 99

df.fillna(value=my_func())

# Fill a specific column
df['Education'] = df['Education'].fillna(value='Partial College')
```
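The snippets above assume df already exists. Here is a self-contained sketch of the median-fill approach, using invented sample columns:

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [25.0, None, 30.0, None],
    "Income": [40000.0, 50000.0, None, 70000.0],
})

# Median fill is robust to outliers, which is why it is often
# preferred over mean fill for skewed columns like income
filled = df.fillna(df.median(numeric_only=True))
print(filled)
```

Here the Age gaps are filled with 27.5 (median of 25 and 30) and the Income gap with 50000 (median of the three known values).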
Use dropna() to remove rows or columns with missing values in a DataFrame.
```python
# Drop rows with any NaN values
df.dropna()

# Drop rows where all values are NaN
df.dropna(how='all')

# Drop columns with NaN values
df.dropna(axis=1)

# Keep rows with at least 2 non-NaN values
df.dropna(thresh=2)

# Drop rows where 'Age' is NaN
df.dropna(subset=['Age'])
```
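To see how these options differ, here is a small self-contained example (sample data invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Amit", "Priya", None],
    "Age": [25.0, None, None],
})

print(df.dropna())           # keeps only row 0 (no NaN at all)
print(df.dropna(how='all'))  # drops only row 2 (entirely NaN)
print(df.dropna(thresh=1))   # keeps rows with at least 1 non-NaN value
```

Note that dropna() returns a new DataFrame by default; assign the result (e.g. `df = df.dropna()`) to keep it.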
The replace() method is used to substitute specific values in a DataFrame or Series.
```python
import numpy as np

# Replace value 0 with NaN
df.replace(0, np.nan)

# Replace multiple placeholder values at once
df.replace(['?', 'NA', 'missing'], np.nan)

# Replace using a column-specific mapping
df.replace({'Gender': {'M': 'Male', 'F': 'Female'}})

# Replace text patterns using regex
df.replace(to_replace=r'^Unknown.*', value='Other', regex=True)

# Replace 0 in 'Income' column with the column median
# (assignment is preferred over the deprecated chained inplace=True)
df['Income'] = df['Income'].replace(0, df['Income'].median())
```
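A runnable sketch combining two of these patterns, with invented sample data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Gender": ["M", "F", "NA"],
                   "Income": [50000, 0, 60000]})

# First convert placeholder strings into real NaN,
# then expand the coded Gender values
df = df.replace(["NA", "?", "missing"], np.nan)
df = df.replace({"Gender": {"M": "Male", "F": "Female"}})
print(df)
```

Doing the placeholder-to-NaN conversion first matters: once "NA" is a real NaN, functions like isnull() and fillna() can see and handle it.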
Use duplicated() to detect duplicate rows and drop_duplicates() to remove them.
```python
# Find duplicate rows (Boolean Series)
df.duplicated()

# Count duplicate rows
df.duplicated().sum()

# Drop duplicate rows
df.drop_duplicates()

# Drop duplicates based on the 'Name' column
df.drop_duplicates(subset=['Name'])

# Keep the last occurrence, drop the others
df.drop_duplicates(keep='last')
```
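A minimal self-contained example (sample rows invented; the third row repeats the first):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Amit", "Priya", "Amit"],
    "Age": [23, 25, 23],
})

print(df.duplicated())        # True only for row 2 (repeat of row 0)
print(df.duplicated().sum())  # number of duplicate rows
deduped = df.drop_duplicates()
print(deduped)
```

By default the first occurrence is kept and later repeats are dropped; keep='last' reverses that.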
The apply() method lets you apply a custom function (or a lambda) to each element, row, or column in a DataFrame.
```python
# Convert all names to lowercase
df['Name'] = df['Name'].apply(lambda x: x.lower())

# Remove leading/trailing spaces
df['City'] = df['City'].apply(lambda x: x.strip())

# Replace negative values with 0
df['Income'] = df['Income'].apply(lambda x: max(x, 0))

# Create Full_Name by combining two columns
df['Full_Name'] = df.apply(lambda row: row['First'] + ' ' + row['Last'], axis=1)

# Define a custom function
def clean_age(x):
    return 0 if pd.isnull(x) else int(x)

df['Age'] = df['Age'].apply(clean_age)
```
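As a runnable sketch, here is apply() cleaning two invented sample columns in one pass:

```python
import pandas as pd

df = pd.DataFrame({"Name": [" AMIT ", "Priya"],
                   "Income": [-500, 60000]})

# Chain string cleanup steps inside one lambda
df["Name"] = df["Name"].apply(lambda x: x.strip().lower())

# Clip negative incomes to 0
df["Income"] = df["Income"].apply(lambda x: max(x, 0))
print(df)
```

For pure string cleanup, the vectorized `df["Name"].str.strip().str.lower()` does the same job and is usually faster; apply() shines when the logic is genuinely custom.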
Data cleaning, also known as data cleansing or data scrubbing, is the process of identifying and correcting or removing errors and inconsistencies in datasets to improve data quality and ensure the data is suitable for analysis.
Common data cleaning tasks include handling missing values, removing duplicates, correcting data types, standardizing formats, dealing with outliers, and handling inconsistencies in data.
Imputation is the process of filling in missing values with estimated or calculated values. You can use Pandas functions like fillna() or libraries like Scikit-Learn’s SimpleImputer for imputation.
You can handle missing values in Pandas using dropna() to remove rows with missing values, fillna() to fill them with a specified value, or techniques like interpolation and imputation.
You can remove duplicate rows in a Pandas DataFrame using the drop_duplicates() method.
Outliers are data points that deviate significantly from the rest of the data. You can detect and handle outliers using techniques such as Z-score, IQR (Interquartile Range), or visualization methods. Libraries like Scikit-Learn and Matplotlib can be helpful.
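The IQR rule mentioned above can be sketched in Pandas like this; the 1.5 multiplier is the conventional choice, and the sample values are invented:

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 100])  # 100 is an obvious outlier

# Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = s[(s < lower) | (s > upper)]
print(outliers)
```

Once flagged, outliers can be dropped, capped at the fence values, or investigated individually, depending on whether they are errors or genuine extreme observations.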