If you are starting your journey in data cleaning with Python, the first step is understanding what Pandas in Python really is. In this section, we explain the basics of Pandas, DataFrame (df), and why it is the most popular tool for data manipulation and data cleansing in Python.
Answer: Pandas is an open-source Python library used for
data analysis, cleaning, and manipulation. It provides flexible data
structures like Series (1-D) and DataFrame (2-D) to work with
structured data (tables, CSVs, Excel, SQL queries).
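A quick sketch of those two structures (the column names here are illustrative, not from the dataset used later):

```python
import pandas as pd

# Series: a 1-D labeled array
s = pd.Series([10, 20, 30], name="Sales")

# DataFrame: a 2-D table of rows and columns
df = pd.DataFrame({
    "Product": ["Pen", "Book", "Bag"],
    "Sales": [10, 20, 30],
})

print(s.shape)   # (3,)
print(df.shape)  # (3, 2)
```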
Answer: Pandas is mainly used for loading, cleaning, transforming, and analyzing tabular data.
Answer: In simple terms, Pandas is a powerful Python library that helps turn raw data into clean, structured tables so analysts can focus on insights instead of messy datasets.
Answer: In Pandas, df is a common variable name used for
a DataFrame. A DataFrame is a 2D table of rows and columns, just like an
Excel sheet or SQL table.
import pandas as pd
df = pd.read_csv("data.csv")
print(df.head()) # displays first 5 rows
Answer: Pandas is the library, while df
(DataFrame) is the data structure created using Pandas to store and
manipulate tabular data.
Answer: Pandas is the industry standard for Python data cleaning because it offers powerful built-in functions for handling missing data, removing duplicates, data type conversion, and filtering outliers, all with simple, readable syntax that saves hours of manual coding.
Learn advanced techniques for handling missing values, duplicates, and outliers in the next section.
Learn data cleaning in Python with pandas read_csv and df.head(), using real datasets like Google Sheets and the Superstore dataset inside Google Colab.
import pandas as pd
👉 This is the first step in every Python pandas data cleaning project.
import pandas as pd
url = "https://docs.google.com/spreadsheets/d/1zZDBuiFCitgMU-D5XA2G7DVvoGmJ2G6KqXM9sOLpGz0/export?format=csv&gid=1140868527"
df = pd.read_csv(url)
df.head()
👉 Used for loading CSVs and data cleaning in pandas.
df.head(10)
👉 Shows the first rows, important for understanding df.head().
df.isnull().sum()
df.dropna()
df.describe()
👉 Core of data cleaning in pandas.
Practice data cleaning in python pandas using real datasets. Download or open directly in Google Colab.
👉 Now that you understand pandas read_csv and df.head(), it’s time to clean real data.
Next: Handling Missing Values (dropna, fillna) →
Learn how to filter data using a pandas DataFrame. Filtering is one of the most important steps in data cleaning in Python to select specific rows, remove unwanted data, and analyze patterns.
⚠️ Note: Dataset is already loaded as df. Continue using the same DataFrame.
df.head()
👉 Always check data before filtering
# Filter rows where Sales > 500
df[df['Sales'] > 500]
👉 Returns only rows where the condition is True
# Sales > 500 AND Profit > 50
df[(df['Sales'] > 500) & (df['Profit'] > 50)]
👉 & = AND condition
df[['Sales', 'Profit']]
👉 Select only required columns
df.query("Sales > 500 and Profit > 50")
👉 Cleaner and more readable filtering syntax
df['Sales'] > 500 → creates a True/False condition
df[condition] → filters rows
& → AND condition
| → OR condition
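The `|` (OR) operator works the same way as `&`; a minimal sketch on a toy DataFrame (the values here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "Sales": [100, 600, 1200],
    "Profit": [-5, 60, 10],
})

# Rows where Sales > 1000 OR Profit < 0.
# Each condition must be wrapped in parentheses.
result = df[(df["Sales"] > 1000) | (df["Profit"] < 0)]
print(result)  # two rows match: Sales 100 (loss) and Sales 1200
```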
# High value customers
high_sales = df[df['Sales'] > 1000]
print(high_sales.head())
👉 Real data analyst workflow
👉 Next: Learn how to sort data using sort_values() and sort_index()
Next: Sorting Data →
Sorting is a key step in data cleaning in Python. Using pandas sort_values() and sort_index(), you can organize your dataset to find top values, trends, and insights.
⚠️ Note: Dataset is already loaded as df in the previous section. Continue using the same DataFrame.
df.head()
👉 Check dataset before sorting
# Sort by Sales (low → high)
df.sort_values(by='Sales')
👉 Default sorting = ascending
# Sort by Sales (high → low)
df.sort_values(by='Sales', ascending=False)
👉 Find top values easily
# Sort by Category then Sales
df.sort_values(by=['Category','Sales'], ascending=[True, False])
👉 Multi-level sorting
df.sort_index()
👉 Sorts rows by index
sort_values() → Sort by column values
ascending=True → Low to high
ascending=False → High to low
sort_index() → Sort by row index
# Top 5 highest Sales
top_sales = df.sort_values(by='Sales', ascending=False).head(5)
print(top_sales)
👉 Real data analyst workflow
👉 Next: Learn how to detect and remove duplicate data
Next: Handling Duplicates →
Missing data is common in real datasets. In data cleaning in Python, we use fillna() and dropna() to handle missing values effectively in a Pandas DataFrame.
⚠️ Note: Dataset is already loaded as df. Continue using the same DataFrame.
df.isnull().sum()
👉 Shows the number of missing values in each column
# Fill all missing values with 0
df.fillna(0, inplace=True)
👉 Replace missing values when the data is important
# Fill Sales column with mean
df['Sales'] = df['Sales'].fillna(df['Sales'].mean())
👉 Better approach for numerical data (assign back instead of calling inplace on a single column, which is deprecated)
df.dropna()
👉 Remove rows with missing values
df.dropna(axis=1)
👉 axis=1 → drop columns instead of rows
isnull() → Detect missing values
fillna() → Replace missing values
dropna() → Remove missing data
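`dropna()` also takes `subset` and `thresh` parameters for finer control; a small sketch on a toy DataFrame (the column names are illustrative):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "Sales": [100, np.nan, 300],
    "Profit": [10, 20, np.nan],
})

# Drop rows only when 'Sales' is missing
by_subset = df.dropna(subset=["Sales"])

# Keep only rows that have at least 2 non-missing values
by_thresh = df.dropna(thresh=2)

print(len(by_subset))  # 2
print(len(by_thresh))  # 1
```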
# Check missing values
print(df.isnull().sum())
# Fill missing values
df.fillna(0, inplace=True)
# Verify
print(df.isnull().sum())
👉 Real data cleaning workflow
✅ Use fillna() → when the data is important
✅ Use dropna() → when too many values are missing
👉 Next: Learn how to remove duplicate rows in Pandas
Next: Remove Duplicates →
Before cleaning data, you must detect missing values. In data cleaning in Python, we use df.isnull(), sum(), and any() to find missing data in rows and columns.
⚠️ Note: Dataset is already loaded as df. Continue using the same DataFrame.
df.head()
👉 Always inspect data before checking missing values
df.isnull()
👉 Returns True/False values
df.isnull().sum()
👉 Shows missing values per column
df.columns[df.isnull().any()]
👉 Returns columns with missing values
df[df.isnull().any(axis=1)]
👉 Shows rows where values are missing
isnull() → Detect missing values
sum() → Count missing values
any() → Check if any value is missing
axis=1 → Row, axis=0 → Column
# Count missing values
print(df.isnull().sum())
# Show rows with missing values
print(df[df.isnull().any(axis=1)])
👉 Real data cleaning workflow
👉 Next: Learn how to remove duplicate rows in Pandas
Next: Remove Duplicates →
The pandas describe() method is used to summarize numerical data. It helps in data cleaning with pandas by showing the mean, standard deviation, min, max, and quartiles of your dataset.
import pandas as pd
url = "https://raw.githubusercontent.com/plotly/datasets/master/superstore.csv"
df = pd.read_csv(url)
df.head()
👉 Load real dataset for analysis
df.describe()
👉 Generates a statistical summary
df[['Sales','Profit']].describe()
👉 Analyze only required columns
df.describe(include='all')
👉 Includes categorical columns also
count → number of values
mean → average
std → standard deviation
min / max → smallest & largest value
25%, 50%, 75% → quartiles (data distribution)
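If you need cut points other than the default quartiles, `describe()` also accepts a `percentiles` parameter; a quick sketch on a toy column:

```python
import pandas as pd

df = pd.DataFrame({"Sales": [100, 200, 300, 400, 500]})

# Request the 10th and 90th percentiles instead of the defaults
summary = df["Sales"].describe(percentiles=[0.1, 0.9])
print(summary)  # includes count, mean, std, min, 10%, 50%, 90%, max
```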
import pandas as pd
url = "https://docs.google.com/spreadsheets/d/1zZDBuiFCitgMU-D5XA2G7DVvoGmJ2G6KqXM9sOLpGz0/export?format=csv&gid=1140868527"
df = pd.read_csv(url)
# Full summary
print(df.describe())
# Specific columns
print(df[['Sales','Profit']].describe())
👉 Used in real data analyst projects
👉 Next: Learn how to remove duplicate rows in Pandas
Next: Remove Duplicates →
Duplicate data is a common problem in real datasets. In data cleaning in Python, we use duplicated() and drop_duplicates() to identify and remove duplicate rows from a Pandas DataFrame.
⚠️ Note: Dataset is already loaded as df. Continue using the same DataFrame.
df.duplicated().sum()
👉 Counts the number of duplicate rows
df[df.duplicated()]
👉 Shows duplicate records
df.drop_duplicates(inplace=True)
👉 Removes duplicate rows permanently
df.drop_duplicates(subset='Customer Name', inplace=True)
👉 Remove duplicates using a specific column
df.drop_duplicates(keep='last', inplace=True)
👉 keep='first' (default)
duplicated() → Detect duplicates
drop_duplicates() → Remove duplicates
subset → Column used for checking
keep → Choose which record to keep
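To see every copy of a duplicated row (not just the repeats), `keep=False` flags them all; a sketch on a toy DataFrame (names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "Customer": ["Amit", "Riya", "Amit"],
    "Sales": [100, 200, 100],
})

# Default: only the second copy is flagged as a duplicate
print(df.duplicated().sum())  # 1

# keep=False: every copy of a duplicate group is flagged
all_copies = df[df.duplicated(keep=False)]
print(len(all_copies))  # 2
```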
# Count duplicates
print(df.duplicated().sum())
# Remove duplicates
df.drop_duplicates(inplace=True)
# Verify
print(df.duplicated().sum())
👉 Real data cleaning workflow
❓ What is duplicate data?
❓ Difference between duplicated() and drop_duplicates()?
❓ How to remove duplicates using one column?
❓ What does keep='last' do?
👉 Next: Learn how to change data types using astype()
Next: Data Type Conversion →
In real datasets, columns often have incorrect data types. In data cleaning in Python, we use astype(), to_datetime(), and to_numeric() to fix data types for proper analysis.
⚠️ Note: Dataset is already loaded as df. Continue using the same DataFrame.
df.dtypes
👉 Always check types before conversion
# Convert column safely
df['Age'] = df['Age'].astype('Int64')
👉 Use Int64 instead of int to handle missing values
df['Sales'] = pd.to_numeric(df['Sales'], errors='coerce')
👉 Converts text to numbers
df['Order Date'] = pd.to_datetime(df['Order Date'], errors='coerce')
👉 Converts text → datetime
cols = ['Sales', 'Profit']
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')
👉 Safer method than astype() for real data
df.dtypes → Check column types
astype() → Manual type conversion
to_numeric() → Convert safely to numbers
to_datetime() → Convert safely to dates
# Check types
print(df.dtypes)
# Fix numeric column
df['Sales'] = pd.to_numeric(df['Sales'], errors='coerce')
# Fix date column
df['Order Date'] = pd.to_datetime(df['Order Date'], errors='coerce')
# Verify changes
print(df.dtypes)
👉 Real data cleaning workflow
✅ Always check data before converting
✅ Use errors='coerce' to avoid crashes
✅ Use Int64 for columns with missing values
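The effect of `errors='coerce'` can be seen on a toy column that contains a bad value:

```python
import pandas as pd

s = pd.Series(["100", "250", "n/a", "300"])

# Without coercion, "n/a" would raise a ValueError;
# with errors='coerce' it becomes NaN instead
nums = pd.to_numeric(s, errors="coerce")

print(nums.isna().sum())  # 1
print(nums.sum())         # 650.0
```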
👉 Next: Learn how to detect and handle outliers in data
Next: Outlier Detection →
Outliers are extreme values that can distort analysis. In data cleaning in Python, we use the IQR (Interquartile Range) method to detect and remove outliers.
⚠️ Note: Dataset is already loaded as df. Continue using the same DataFrame.
df['Sales'].describe()
👉 Understand the distribution before detecting outliers
Q1 = df['Sales'].quantile(0.25)
Q3 = df['Sales'].quantile(0.75)
👉 Q1 = 25th percentile
IQR = Q3 - Q1
👉 IQR = range of the middle 50% of the data
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
👉 Values outside this range = outliers
outliers = df[(df['Sales'] < lower_bound) | (df['Sales'] > upper_bound)]
👉 Shows rows with extreme values
df = df[(df['Sales'] >= lower_bound) & (df['Sales'] <= upper_bound)]
👉 Keep only the valid data range
Q1 → 25% of the data
Q3 → 75% of the data
IQR → middle range (Q3 - Q1)
Outliers → values outside the normal range
# Calculate quartiles
Q1 = df['Sales'].quantile(0.25)
Q3 = df['Sales'].quantile(0.75)
# Calculate IQR
IQR = Q3 - Q1
# Define bounds
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
# Filter data
df_clean = df[(df['Sales'] >= lower) & (df['Sales'] <= upper)]
print(df_clean.head())
👉 Real data cleaning workflow
❓ What is an outlier?
❓ What is the IQR method?
❓ Why use 1.5 * IQR?
❓ When should you NOT remove outliers?
👉 Next: Learn how to clean text data (lowercase, strip, replace)
Next: Text Cleaning →
Real datasets often contain messy text like extra spaces, inconsistent casing, or wrong spellings. In data cleaning in Python, we use string methods (.str) to clean and standardize text data.
⚠️ Note: Dataset is already loaded as df. Continue using the same DataFrame.
df['City'] = df['City'].str.lower()
👉 "Delhi" → "delhi"
df['City'] = df['City'].str.strip()
👉 "delhi " → "delhi"
df['City'] = df['City'].replace({
    'delhi ': 'delhi',
    'new delhi': 'delhi'
})
👉 Standardize similar values
df['Name'] = df['Name'].str.replace('[^a-zA-Z ]', '', regex=True)
👉 Removes numbers & symbols
df['City'] = df['City'].str.title()
👉 "delhi" → "Delhi"
.str.lower() → convert to lowercase
.str.strip() → remove spaces
.replace() → fix wrong values
.str.replace() → remove patterns
.str.title() → capitalize text
# Clean City column
df['City'] = df['City'].str.lower().str.strip()
# Standardize values
df['City'] = df['City'].replace({
'new delhi': 'delhi'
})
# Format nicely
df['City'] = df['City'].str.title()
print(df['City'].unique())
👉 Real data cleaning workflow
❓ Why is text cleaning important?
❓ Difference between replace() and str.replace()?
❓ How do you remove extra spaces?
❓ How to standardize categories?
👉 Next: Learn how to create new features from existing data
Next: Feature Engineering →
Feature engineering is the process of creating new columns from existing data. It helps transform raw data into meaningful insights for analysis and machine learning.
⚠️ Note: Dataset is already loaded as df. Continue using the same DataFrame.
# Create Profit Ratio
df['Profit_Ratio'] = df['Profit'] / df['Sales']
👉 New feature from existing columns
df['Year'] = df['Order Date'].dt.year
👉 Extract year for trend analysis
df['Month'] = df['Order Date'].dt.month
👉 Useful for monthly reports
# High / Low Sales
df['Sales_Category'] = df['Sales'].apply(
lambda x: 'High' if x > 500 else 'Low'
)
👉 Convert numeric → categorical
# Combine City and State
df['Location'] = df['City'] + ", " + df['State']
👉 Create a meaningful combined feature
New Column → derived from existing data
dt.year → extract year from date
apply() → custom logic on a column
lambda → inline function
# Create new features
df['Profit_Ratio'] = df['Profit'] / df['Sales']
df['Year'] = df['Order Date'].dt.year
# Categorize data
df['Sales_Category'] = df['Sales'].apply(
lambda x: 'High' if x > 500 else 'Low'
)
print(df.head())
👉 Real data analyst workflow
❓ What is feature engineering?
❓ Why create new columns?
❓ What is a lambda function?
❓ How to extract the year from a date?
👉 Next: Learn how to group and aggregate data using groupby()
Next: GroupBy & Aggregation →
GroupBy is used to analyze data by categories. In data analysis with pandas, we use groupby() to summarize data like total sales, average profit, and customer insights.
⚠️ Note: Dataset is already loaded as df. Continue using the same DataFrame.
df.groupby('Category')['Sales'].sum()
👉 Total sales per category
df.groupby('Category')['Profit'].mean()
👉 Average profit per category
df.groupby('Category').agg({
'Sales': 'sum',
'Profit': 'mean'
})
👉 Perform multiple calculations together
df.groupby(['Category','Region'])['Sales'].sum()
👉 Multi-level grouping
df.groupby('Category')['Sales'].sum().reset_index()
👉 Convert the result into a DataFrame
groupby() → group data by category
sum() → total value
mean() → average value
agg() → multiple calculations
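`agg()` also supports named aggregation, which lets you name the output columns directly; a sketch on a toy DataFrame (column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "Category": ["A", "A", "B"],
    "Sales": [100, 200, 300],
    "Profit": [10, 20, 30],
})

# Named aggregation: new_column=(input_column, function)
summary = df.groupby("Category").agg(
    total_sales=("Sales", "sum"),
    avg_profit=("Profit", "mean"),
)
print(summary)
```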
# Total sales per category
sales_by_cat = df.groupby('Category')['Sales'].sum()
# Average profit
profit_avg = df.groupby('Category')['Profit'].mean()
print(sales_by_cat)
print(profit_avg)
👉 Real business analysis workflow
❓ What is groupby in pandas?
❓ Difference between sum() and mean()?
❓ How to group by multiple columns?
❓ What is agg() used for?
👉 Next: Learn how to export cleaned data to CSV or Excel
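A minimal sketch of that export step, using to_csv() (the filename is illustrative; to_excel() works the same way but needs the openpyxl package installed):

```python
import pandas as pd

df = pd.DataFrame({"Category": ["A", "B"], "Sales": [100, 200]})

# Export to CSV; index=False drops the row-index column
df.to_csv("cleaned_data.csv", index=False)

# df.to_excel("cleaned_data.xlsx", index=False)  # requires openpyxl

# Read it back to confirm the round trip
check = pd.read_csv("cleaned_data.csv")
print(check.shape)  # (2, 2)
```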
Next: Export Data →
Now it’s time to apply everything you learned. This project covers data cleaning, feature engineering, and data analysis using a real dataset in Google Colab.
👉 Click the button → Run all cells → Practice step by step
❓ Which category has the highest sales?
❓ Which region generates maximum profit?
❓ Who are the top customers?
❓ Monthly sales trends?
✅ Complete data cleaning workflow
✅ Real-world data analysis
✅ Industry-level Pandas skills
✅ Portfolio-ready project
🎉 Congratulations! You have completed the Data Cleaning & Analysis Course
Explore More Projects →
Get answers to common questions about data cleaning in Python and pandas.
