📊 Pandas in Python: Complete Course from Scratch with Real Dataset (2025)
Welcome to Vista Academy’s Pandas in Python Course — a step-by-step guide designed for beginners and aspiring data analysts in 2025. Here you will not just learn Pandas; you’ll practice with a real messy dataset that includes NULL values, duplicates, typos, and anomalies — just like in real-world business data.
💡 Why This Dataset?
- Contains NULL values, duplicates, and text inconsistencies for hands-on cleaning.
- Includes age anomalies, wrong discounts, typos — perfect for teaching real-world data wrangling.
- Ideal for projects: sales analysis, customer churn, fraud detection, and more.
📘 What You Will Learn (Course Blueprint)
This course is structured in 14 modules that take you from zero to advanced Pandas skills:
- Pandas Basics – `read_csv`, `head()`, `info()`
- Indexing & Selection – `.loc`, `.iloc`, `.query()`
- Cleaning & Missing Data – `fillna`, `dropna`, `replace`
- Data Types & Conversions – datetime, astype, categorical
- String Operations & Normalization – cleaning text, fixing typos
- Filtering, Sorting & Ranking – top customers, suspicious orders
- GroupBy & Aggregations – sales by category, customer KPIs
- Time-Series & Resampling – monthly sales, rolling averages
- Joins & Merges – customer & product tables
- Pivot Tables & Crosstab – city vs category analysis
- Advanced Transformations – apply, map, window functions
- Performance Optimization – memory, chunking
- Exporting & Reporting – CSV, Excel, JSON
- Mini Projects – Sales Dashboard, Churn Risk, Fraud Detection
Module 0 — About the Dataset
This module introduces the dataset you’ll use throughout the course. Read the schema, preview sample rows, and set the business questions we will solve using Pandas.
Dataset Summary
- Type: E-commerce orders (synthetic, realistic mess).
- Rows: 122 (includes intentional duplicates & anomalies).
- Columns: 24 fields (see data dictionary).
- Dirty-data features: NULL/NaN values, placeholder strings (“NULL”, “n/a”, “none”), inconsistent casing, typos, duplicate rows, numeric anomalies (e.g., age=150), and some zero-quantity orders.
Data Dictionary (Fields)
Column | Type | Description & Dirty-data notes |
---|---|---|
order_id | string | Unique order identifier. Duplicates intentionally inserted for dedup lessons. |
order_date | datetime | Order placement date. Some entries may be out of sequence or missing. |
customer_id | int | Numeric customer ID. Use joins to merge customer-level data. |
customer_name | string | May include ‘NULL’ placeholder or empty strings. |
gender | string | "Male"/"Female" but may include inconsistent capitalization. |
age | int | Contains realistic ages and anomalies (e.g., 150, 17) for validation exercises. |
city | string | Inconsistent casing: “Delhi”, “DELHI”, “dehradun” etc. Good for normalization lessons. |
product_id / product_name | string | Product names include typos (e → 3) for fuzzy matching tasks. |
category | string | Product category; used for aggregation and pivot tables. |
unit_price | float | Contains missing values (None) to practice imputation/checks. |
quantity | int | Occasional zero-quantity rows to detect invalid orders. |
discount | float / string | Mix of numeric discounts and ‘n/a’ strings — perfect for coercion lessons. |
total_amount | float | Calculated field but some rows are NaN (missing calculation) to be recomputed. |
payment_method | string | Variations like ‘COD’ vs ‘Cash on Delivery’. |
is_returned | bool | Marks returned orders. |
delivery_days / delivery_date | int / datetime | Delivery info; some delivery_date entries may be missing. |
rating / review | int / string | Ratings sometimes missing; reviews short text with blanks. |
loyalty_member | string | Mixed markers: 'Yes', 'Y', 'yes', 'No', 'N', etc. — standardize in cleaning lessons. |
signup_date / last_purchase_date | datetime | Use to compute recency and RFM features; some dates may be NaT. |
coupon_code | string | Mixed case and None values — teach grouping by coupon usage. |
Business Questions (What we’ll answer)
- Which products and categories generate the most revenue (even with dirty data)?
- Which cities have the highest Average Order Value (AOV)?
- Who are repeat customers? What is the repeat purchase rate?
- How long does delivery take on average — and which cities have delays?
- Which orders look suspicious (zero quantity, missing prices, extreme discounts)?
Learning Outcomes (After Module 0)
- Understand the dataset schema and common real-world data issues.
- Formulate analysis questions and translate them into Pandas tasks.
- Download and open the CSV in Jupyter / Colab and run initial EDA commands.
- Plan a cleaning checklist for the data (what to fix first and why).
Starter Exercises — Module 0
- Load the CSV and show `df.shape`, `df.info()`, and `df.head()`. Expected: 122 rows, 24 columns (some NaN counts).
- Count placeholders — how many times do the strings “NULL”, “n/a”, “none” appear? Expected: non-zero counts to demonstrate cleaning need.
- Find duplicates by `order_id` and drop them for a practice run (but keep a copy of the raw data). Expected: at least 2 duplicate rows found.
- Preview anomalies — show rows where `age > 100`, `quantity == 0`, or `unit_price` is null. (A solution sketch follows below.)
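If you get stuck, here is a minimal solution sketch. It assumes the file is saved locally as `orders_raw.csv` and the columns are named as in the data dictionary:

```python
import pandas as pd

df = pd.read_csv('orders_raw.csv')

# 1) Shape and structure
print(df.shape)  # expected: (122, 24)
df.info()

# 2) Count placeholder strings across the whole table
print(df.isin(['NULL', 'n/a', 'none']).sum().sum())

# 3) Duplicates by order_id (keep a copy of the raw data first)
raw = df.copy()
print(df.duplicated(subset=['order_id']).sum())
df_dedup = df.drop_duplicates(subset=['order_id'])

# 4) Preview anomalies
df[(df['age'] > 100) | (df['quantity'] == 0) | (df['unit_price'].isna())]
```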
Module 2 — Indexing & Selection
Think of a DataFrame like an Excel table. In this module we will learn 3 simple things: 👉 how to pick some columns 👉 how to pick some rows 👉 how to filter rows with conditions
Step 1 — Load the dataset
```python
import pandas as pd

df = pd.read_csv('orders_raw.csv')
df.head()
```
This opens the CSV file into Pandas and shows the first 5 rows. Simple start 🚀
Step 2 — Pick some columns
```python
df[['order_id', 'product_name', 'total_amount']].head()
```
This shows only 3 columns (order id, product name, and total). Think of it as hiding other columns you don’t need.
Step 3 — Pick some rows
```python
df.iloc[0:5]
```
This shows rows 0 to 4 (the first 5 rows). You can change numbers to see any block of rows.
Step 4 — Filter rows with a condition
```python
df[df['category'] == 'Clothing']
```
This shows only rows where the category is “Clothing”. Filters are the most powerful tool in Pandas 🎯
Step 5 — Filter with two conditions
```python
df[(df['category'] == 'Clothing') & (df['total_amount'] > 1000)]
```
This shows clothing orders where total is more than 1000.
Tip: Always use `&` (and) or `|` (or), and wrap each condition in parentheses.
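The course blueprint also lists `.query()` for this module. As a quick sketch, the same two-condition filter can be written as a string expression:

```python
# Same filter as step 5, using .query() string syntax
df.query("category == 'Clothing' and total_amount > 1000")
```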
🎯 Mini Exercises
- Show only the `order_id` and `city` columns.
- Display rows 10 to 15 using `iloc`.
- Filter all orders where `quantity` is greater than 3. (Solutions sketched below.)
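One possible set of answers, as a minimal sketch:

```python
# 1) Two columns only
df[['order_id', 'city']].head()

# 2) Rows 10 to 15 (the iloc end is exclusive, so use 16 to include row 15)
df.iloc[10:16]

# 3) Orders with quantity greater than 3
df[df['quantity'] > 3]
```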
Module 3 — Cleaning & Missing Data
Real-world data is never perfect. It has blanks, duplicates, or wrong text. In this module we learn 4 very easy cleaning steps.
1) Find blanks (missing values)
```python
df.isna().sum()
```
This shows how many blanks are in each column.
2) Replace ‘NULL’ / ‘n/a’ with blanks
```python
df = df.replace(['NULL', 'n/a', 'none'], pd.NA)
```
Now Pandas understands them as blanks.
3) Fix missing total_amount
```python
# Recompute only the missing totals from unit_price × quantity
df['total_amount'] = df['total_amount'].fillna(df['unit_price'] * df['quantity'])
```
If total is empty but price × quantity exist, we can calculate it again.
4) Remove duplicates
```python
df = df.drop_duplicates(subset=['order_id'])
```
This keeps only one row if the same order appears twice.
🎯 Mini Exercises
- Run `df.isna().sum()` and say which column has the most blanks.
- Replace ‘NULL’ and run again — what changed?
- Pick one row, calculate `unit_price × quantity` yourself, and compare with `total_amount`.
- Try `df.drop_duplicates()` and see if the row count changes. (Solutions sketched below.)
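A minimal sketch, assuming `df` is loaded and pandas is imported as `pd`:

```python
# 1) Which column has the most blanks?
print(df.isna().sum().idxmax())

# 2) Replace placeholders, then re-count blanks
df = df.replace(['NULL', 'n/a', 'none'], pd.NA)
print(df.isna().sum())

# 3) Check one row by hand
row = df.iloc[0]
print(row['unit_price'] * row['quantity'], row['total_amount'])

# 4) Does dropping duplicates change the row count?
print(len(df), len(df.drop_duplicates()))
```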
Module 4 — Data Types & Conversions (Very Easy)
Short and simple: make sure columns are the right type so math and dates work correctly.
1) See column types
```python
df.info()
```
This shows which columns are numbers, text, or dates. Start here.
2) Convert date columns
```python
df['order_date'] = pd.to_datetime(df['order_date'], errors='coerce')
df['delivery_date'] = pd.to_datetime(df['delivery_date'], errors='coerce')
```
Now Pandas understands these as dates and you can sort or resample by them.
3) Make number columns numeric
```python
df['unit_price'] = pd.to_numeric(df['unit_price'], errors='coerce')
df['total_amount'] = pd.to_numeric(df['total_amount'], errors='coerce')
```
If a cell can’t become a number, it becomes blank (so you can fix it later).
4) Convert text that repeats into a “category”
```python
df['category'] = df['category'].astype('category')
```
This makes grouping and counting faster and uses less memory.
5) Quick sanity check
```python
# make sure quantity is not negative
(df['quantity'] < 0).sum()
```
If result is not zero, we have bad data to fix (show students how to inspect those rows).
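One way to inspect those rows, as a small sketch:

```python
# Look at the offending rows before deciding how to fix them
df[df['quantity'] < 0][['order_id', 'product_name', 'quantity']]
```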
Mini exercises (5 min)
- Run `df.info()` and tell the class which columns are not numbers but should be.
- Convert `unit_price` using `pd.to_numeric` and show `df.info()` again.
- Convert `order_date` and show the first 3 dates sorted. (Solutions sketched below.)
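A minimal sketch for the two conversion tasks:

```python
# 2) Coerce unit_price to numeric, then confirm the dtype changed
df['unit_price'] = pd.to_numeric(df['unit_price'], errors='coerce')
df.info()

# 3) Convert order_date and show the first 3 dates in order
df['order_date'] = pd.to_datetime(df['order_date'], errors='coerce')
print(df['order_date'].sort_values().head(3))
```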
Module 5 — String Ops & Normalization (Very Easy)
Text columns often have spaces, wrong case, or small typos. These short steps make text clean and consistent — so counting and grouping work correctly.
1) Remove extra spaces
```python
# remove leading/trailing spaces in city and product_name
df['city'] = df['city'].str.strip()
df['product_name'] = df['product_name'].str.strip()
```
Why: " Delhi " → "Delhi". Helps avoid accidental duplicates.
2) Make text same case
```python
# lowercase for grouping, uppercase for codes
df['city_clean'] = df['city'].str.lower()
df['coupon_code'] = df['coupon_code'].str.upper()
```
Why: “Delhi”, “DELHI”, “delhi” become the same when using one case.
3) Replace common variants
```python
# make payment methods consistent
df['payment_method'] = df['payment_method'].replace({'COD': 'Cash on Delivery', 'cod': 'Cash on Delivery'})
```
Why: Different ways to say the same thing should be one value for easy counting.
4) Clean coupon codes (letters + numbers only)
```python
# remove non-alphanumeric characters
df['coupon_code'] = df['coupon_code'].str.replace('[^A-Za-z0-9]', '', regex=True)
```
Why: “SAVE-10!” → “SAVE10”, so duplicates are counted correctly.
5) Fix small typos (example)
```python
# if product_name has 'T-Shirt' and 'Tshirt', make both 'T-Shirt'
df['product_name'] = df['product_name'].replace({'Tshirt': 'T-Shirt', 'tshirt': 'T-Shirt'})
```
Why: Replace obvious misspellings with the correct name.
6) Check unique counts
```python
# count unique product names before cleaning, clean, then count again
before = df['product_name'].nunique()
df['product_name'] = df['product_name'].replace({'Tshirt': 'T-Shirt', 'tshirt': 'T-Shirt'})
after = df['product_name'].nunique()
print('unique before:', before, 'unique after:', after)
```
Why: Cleaning should reduce the number of unique messy names.
Mini exercises (5 min)
- Trim spaces: run `df['city'] = df['city'].str.strip()` and show the first 5 cities.
- Uppercase coupon codes and remove symbols, then show `df['coupon_code'].head()`.
- Replace one obvious typo in product names using `.replace()`. (Solutions sketched below.)
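A minimal sketch of all three tasks:

```python
# 1) Trim spaces and preview
df['city'] = df['city'].str.strip()
print(df['city'].head())

# 2) Uppercase coupon codes and strip symbols
df['coupon_code'] = df['coupon_code'].str.upper().str.replace('[^A-Z0-9]', '', regex=True)
print(df['coupon_code'].head())

# 3) Fix one known typo
df['product_name'] = df['product_name'].replace({'Tshirt': 'T-Shirt'})
```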
Module 6 — Filtering, Sorting & Ranking (Very Easy)
Quick hands-on: learn how to find rows you want, sort them, and show the top items. Use these three simple tricks in every analysis.
1) Show only one kind of rows
```python
# show only clothing orders
df[df['category'] == 'Clothing']
```
This keeps rows where category equals “Clothing”. Think: filter like Excel.
2) Filter with two rules
```python
# clothing AND total more than 1000
df[(df['category'] == 'Clothing') & (df['total_amount'] > 1000)]
```
Use `&` for AND and `|` for OR. Put each condition in parentheses.
3) Show biggest orders first
```python
# biggest total_amount first
df.sort_values('total_amount', ascending=False).head(10)
```
Sort by a column to see top records — very useful to find important rows fast.
4) Get the top N rows
```python
# top 5 orders by total_amount
df.nlargest(5, 'total_amount')
```
Fast way to get the largest values without sorting everything.
5) Top item in each group (easy way)
```python
# sort by category then total, keep first row per category
top_per_category = df.sort_values(['category', 'total_amount'], ascending=[True, False]).groupby('category').head(1)
top_per_category
```
Sort inside each group and pick the first row — simple method to find the winner per group.
6) Add a rank number per group
```python
# rank orders inside each category (1 = largest)
df['rank_in_category'] = df.groupby('category')['total_amount'].rank(method='dense', ascending=False)
df[['category', 'order_id', 'total_amount', 'rank_in_category']].head(8)
```
This gives a position number for each order inside its category (useful for leaderboards).
Mini exercises (5–10 min)
- Show orders where `total_amount > 1500`.
- Show the 5 most recent orders (sort by `order_date`).
- Find the top product (by total_amount) in the “Electronics” category.
- Add `rank_in_category` and show the top 2 orders for each category. (Solutions sketched below.)
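A minimal sketch, assuming `order_date` is already a datetime and “Electronics” appears in `category`:

```python
# 1) High-value orders
df[df['total_amount'] > 1500]

# 2) Five most recent orders
df.sort_values('order_date', ascending=False).head(5)

# 3) Top Electronics product by revenue
elec = df[df['category'] == 'Electronics']
print(elec.groupby('product_name')['total_amount'].sum().idxmax())

# 4) Top 2 orders per category, using the rank column from step 6
df[df['rank_in_category'] <= 2].sort_values(['category', 'rank_in_category'])
```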
Module 7 — GroupBy & Aggregations (Very Easy)
Grouping helps answer business questions like “which product sells most?” and “which city has highest revenue?” — we do this with a few simple commands.
1) Total revenue by product
```python
# add up revenue per product and show the top 10
revenue_by_product = df.groupby('product_name')['total_amount'].sum().reset_index()
revenue_by_product = revenue_by_product.sort_values('total_amount', ascending=False).head(10)
revenue_by_product
```
What it does: adds up `total_amount` for each product and shows the top 10 products by revenue.
2) Orders count and average order value (AOV)
```python
# one table with count and average
cat_stats = df.groupby('category').agg(
    orders_count=('order_id', 'nunique'),
    avg_order_value=('total_amount', 'mean')
).reset_index()
cat_stats
```
What it does: for each category, shows how many orders and the average order value — easy KPI table.
3) Show each row’s share of its category
```python
# give each row the category total and percent share
df['category_total'] = df.groupby('category')['total_amount'].transform('sum')
df['pct_of_category'] = df['total_amount'] / df['category_total']
df[['order_id', 'category', 'total_amount', 'pct_of_category']].head(6)
```
What it does: keeps row shape but adds how much that order contributes to its category.
4) Pivot: quick matrix of category × city revenue
```python
pivot = df.pivot_table(index='category', columns='city', values='total_amount', aggfunc='sum', fill_value=0)
pivot.head()
```
What it does: makes a table where rows are categories, columns are cities, and cells show total revenue — great for dashboards.
5) Simple rule for missing totals
```python
# Option A: ignore rows with missing total_amount (the default for sum)
# Option B: treat missing as zero:
df2 = df.copy()
df2['total_amount'] = df2['total_amount'].fillna(0)
revenue_by_product = df2.groupby('product_name')['total_amount'].sum().reset_index()
```
What it does: choose whether missing totals count as 0 (business decision). Show both to the class and compare totals.
Mini exercises (5–15 min)
- Show top 5 products by revenue (use code in step 1).
- Create `cat_stats` and point out which category has highest AOV.
- Add the `category_total` column and show how much the first 3 orders contribute (use `%` formatting for class demo).
- Build the pivot table and pick one city to compare category revenues (see the sketch below).
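A hedged sketch for exercises 2–4, assuming the objects from the steps above exist and “Delhi” appears in the cleaned `city` column:

```python
# 2) Category with the highest AOV
print(cat_stats.sort_values('avg_order_value', ascending=False).head(1))

# 3) Share of category for the first 3 orders, with % formatting
for _, r in df.head(3).iterrows():
    print(f"{r['order_id']}: {r['pct_of_category']:.1%} of {r['category']}")

# 4) Compare category revenues for one city
print(pivot['Delhi'].sort_values(ascending=False))
```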
Module 8 — Time-Series & Resampling (Very Easy)
Dates are everywhere: sales by month, customers by signup week, deliveries per day. Here you’ll learn how to handle dates, resample them, and compute rolling averages.
1) Convert text to datetime
```python
df['order_date'] = pd.to_datetime(df['order_date'], errors='coerce')
```
Why: makes `order_date` a real datetime so Pandas can use it for resampling.
2) Use date as index
```python
df = df.set_index('order_date')
df.head(3)
```
Why: setting the date as index makes resampling super easy.
3) Monthly sales totals
```python
monthly_sales = df['total_amount'].resample('M').sum()
monthly_sales.head()
```
Why: `resample('M')` means “group by month” — now you see monthly revenue.
4) Orders per day
```python
daily_orders = df['order_id'].resample('D').count()
daily_orders.head()
```
Why: counts how many orders happened each day.
5) Rolling 7-day average
```python
rolling_avg = daily_orders.rolling(7).mean()
rolling_avg.head(10)
```
Why: smooths daily noise by showing the 7-day average order count.
Mini exercises (5–10 min)
- Convert `delivery_date` to datetime and set as index.
- Find the monthly count of orders using `resample('M').count()`.
- Compute a 30-day rolling sum of `total_amount`. (Solutions sketched below.)
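A minimal sketch, assuming `order_date` is still the index from step 2:

```python
# 1) delivery_date as a datetime index (on a copy, so df keeps its order_date index)
d = df.copy()
d['delivery_date'] = pd.to_datetime(d['delivery_date'], errors='coerce')
d = d.reset_index().set_index('delivery_date')

# 2) Monthly order counts
print(df['order_id'].resample('M').count())

# 3) 30-day rolling sum of revenue
daily_rev = df['total_amount'].resample('D').sum()
print(daily_rev.rolling(30).sum().tail())
```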
Module 9 — Joins & Merges (Very Easy)
Learn how to combine tables — like customers, products, and orders — so you can answer real business questions. We’ll keep examples short and simple.
Start: load data (one-line)
```python
# If you use the Google Sheet (first sheet / default gid):
sheet_url = "https://docs.google.com/spreadsheets/d/1UgLeYK_U-DMKbJt66aWX_Tu7E9zmDHOMGFP9C8GSnbA/export?format=csv"
df = pd.read_csv(sheet_url)

# Or load your local orders_raw.csv
# df = pd.read_csv("orders_raw.csv")
```
What this does: loads your master table into `df` — the single table that contains orders, customer and product info.
Step 1 — Make customers, products & orders tables
```python
# customers: unique customers
customers = df[['customer_id', 'customer_name', 'city', 'gender', 'signup_date', 'loyalty_member']].drop_duplicates()

# products: unique products
products = df[['product_id', 'product_name', 'category', 'unit_price']].drop_duplicates()

# orders: trimmed orders table
orders = df[['order_id', 'customer_id', 'product_id', 'order_date', 'quantity', 'discount', 'total_amount', 'payment_method']].copy()
```
Why: Smaller tables make joins simpler and mirror how real databases are organized.
Step 2 — Easy joins (what they do)
- Left join: keep every row from left table, add matching info from right. (Good for: keep all orders, add customer info.)
- Inner join: keep only rows that appear in both tables. (Good for: analysis where both sides must exist.)
- Right join: keep every row from right table, add matching left info (opposite of left).
- Outer join: keep all rows from both tables, fill missing values with NaN. (Good for a full view.)
Step 3 — Code examples (one-liners)
```python
# left join: keep all orders, bring in customer info
orders_with_customers = orders.merge(customers, on='customer_id', how='left')
orders_with_customers.head()

# left join orders with products
orders_full = orders.merge(products, on='product_id', how='left')
orders_full.head()

# inner join (only orders that have a matching customer and product)
orders_inner = orders.merge(customers, on='customer_id', how='inner').merge(products, on='product_id', how='inner')
orders_inner.head()

# outer join example: full outer between products and orders (shows unmatched products or orders)
products_orders_outer = products.merge(orders, on='product_id', how='outer', indicator=True)
products_orders_outer.head()
```
Tip: use `indicator=True` to see where rows came from (`left_only`, `right_only`, `both`).
Step 4 — Simple business questions (use joins)
Q: Who are the top 5 customers by spend (name + city)?
```python
cust_revenue = orders.groupby('customer_id')['total_amount'].sum().reset_index()
cust_top = cust_revenue.merge(customers, on='customer_id', how='left').sort_values('total_amount', ascending=False).head(5)
cust_top[['customer_name', 'city', 'total_amount']]
```
Q: Show orders with product name and customer city:
```python
orders_full = orders.merge(customers[['customer_id', 'city', 'customer_name']], on='customer_id', how='left') \
    .merge(products[['product_id', 'product_name', 'category']], on='product_id', how='left')
orders_full[['order_id', 'customer_name', 'city', 'product_name', 'category', 'total_amount']].head(8)
```
Mini exercises (5–15 min)
- Create `customers`, `products`, and `orders` tables from your loaded `df`.
- Do a left join of `orders` with `customers` and show the first 5 rows.
- Find the top 3 products by revenue (join orders→products to show product names).
- Use `indicator=True` with an outer join between products and orders and count how many rows are `left_only`, `right_only`, and `both`.
- Explain in one line when you’d use an inner vs. a left join (class discussion). (Coding solutions are sketched below.)
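A minimal sketch for exercises 3 and 4, reusing the tables built above:

```python
# 3) Top 3 products by revenue, with names
prod_rev = orders.merge(products[['product_id', 'product_name']], on='product_id', how='left')
print(prod_rev.groupby('product_name')['total_amount'].sum().nlargest(3))

# 4) Count where each outer-join row came from
products_orders_outer = products.merge(orders, on='product_id', how='outer', indicator=True)
print(products_orders_outer['_merge'].value_counts())
```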
Module 10 — Pivot Tables & Crosstab (City × Category)
Learn how to make a quick matrix (pivot table) showing revenue by city and product category, and how to use `crosstab` to compare counts or other stats. Simple examples — perfect for beginners.
What is a pivot table?
A pivot table rearranges data into a 2D table: one variable for rows (e.g., category), another for columns (e.g., city), and an aggregation inside (e.g., sum of revenue or count of orders).
1) Pivot — total revenue (sum)
```python
# pivot: rows = category, columns = city, values = total_amount (sum)
pivot_rev = df.pivot_table(
    index='category',
    columns='city',
    values='total_amount',
    aggfunc='sum',
    fill_value=0
)
pivot_rev
```
What this does: builds a table where each cell shows total revenue for a category in a city. `fill_value=0` fills empty cells with 0.
2) Pivot with multiple statistics (sum & mean)
```python
# multi-agg pivot: show sum and average order value
pivot_multi = df.pivot_table(
    index='category',
    columns='city',
    values='total_amount',
    aggfunc=['sum', 'mean'],
    fill_value=0
)
pivot_multi
```
What this does: Shows both total revenue and average order amount for each category × city pair. The pivot has two layers of column labels (aggfunc and city).
3) Crosstab — count of orders by city & category
```python
# crosstab counts: how many orders per category in each city
ct_orders = pd.crosstab(df['category'], df['city'], values=df['order_id'], aggfunc='count', dropna=False).fillna(0)
ct_orders
```
What this does: Gives a simple counts matrix: how many orders belong to each category in each city.
4) Crosstab as percentages (row-wise)
```python
# show percent distribution per category across cities
ct_pct_row = pd.crosstab(df['category'], df['city'], normalize='index').round(3)
ct_pct_row
```
What this does: For each category row, shows the fraction of orders coming from each city (useful to compare where categories sell best).
5) Quick heatmap idea (optional)
```python
# (optional) plot a heatmap of pivot_rev using matplotlib + seaborn
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 6))
sns.heatmap(pivot_rev, annot=True, fmt=".0f")
plt.title("Revenue by Category and City")
plt.show()
```
Note: seaborn makes a nice heatmap. If you prefer plain matplotlib, use `plt.imshow()`. Visuals help students see patterns fast.
6) Which category makes the most money in each city?
```python
# find top category per city using groupby + unstack + idxmax
city_cat = df.groupby(['city', 'category'])['total_amount'].sum().unstack(fill_value=0)
top_per_city = city_cat.idxmax(axis=1).reset_index().rename(columns={0: 'top_category'})
top_per_city
```
What this does: Sums revenue for every city/category, then picks the category with max revenue per city.
Mini exercises (5–15 min)
- Create `pivot_rev` and show the top 3 categories by total revenue across all cities (sum the row).
- Build `ct_orders` and find which city has the most orders for “Electronics”.
- Create `ct_pct_row` and identify categories that are concentrated (>60%) in a single city.
- Plot a heatmap of `pivot_rev` (or export it to Excel to show color scales in Excel). (Solutions sketched below.)
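A minimal sketch for the first three tasks, assuming “Electronics” is one of the category labels:

```python
# 1) Top 3 categories by total revenue across all cities
print(pivot_rev.sum(axis=1).nlargest(3))

# 2) City with the most Electronics orders
print(ct_orders.loc['Electronics'].idxmax())

# 3) Categories where one city holds more than 60% of orders
print(ct_pct_row[ct_pct_row.max(axis=1) > 0.6])
```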
Module 11 — Transformations (Beginner Level)
Transformations are small changes to your data: replacing codes with labels, creating new columns with functions, or adding group-level totals. Let’s learn the 3 most common ones step by step.
1) `map()` — Replace values
```python
# Replace short payment codes with full names
pay_map = {'COD': 'Cash on Delivery', 'UPI': 'UPI Payment', 'Card': 'Card Payment'}
df['payment_method_full'] = df['payment_method'].map(pay_map)
df[['payment_method', 'payment_method_full']].head()
```
What this does: Changes small codes (COD, UPI, Card) into human-readable labels.
2) `apply()` — Create a new column using a function
```python
# Label orders as small, medium, or big
def order_size(x):
    if x >= 2000:
        return 'Big'
    elif x >= 500:
        return 'Medium'
    else:
        return 'Small'

df['order_size'] = df['total_amount'].apply(order_size)
df[['total_amount', 'order_size']].head()
```
What this does: Runs your custom function on each value of the column. Here we classify orders by size.
3) `transform()` — Add group totals
```python
# Add each product's total revenue to every row of that product
df['product_total'] = df.groupby('product_id')['total_amount'].transform('sum')
df[['product_id', 'total_amount', 'product_total']].head()
```
What this does: Keeps all rows but adds a new column showing the sum of revenue per product.
Quick exercises (5–10 min)
- Use `map()` to replace loyalty codes `Y/N` with `Yes/No`.
- Use `apply()` to mark orders as discounted if discount > 0.
- Use `transform()` to calculate each customer’s total spend and add it as a new column. (Solutions sketched below.)
❓ Frequently Asked Questions — Pandas in Python
Beginner-friendly answers to common Pandas questions. Helps students starting Data Analytics in Python.
1) What is Pandas in Python?
Pandas is a popular Python library used for data analysis. It helps you work with tables called `DataFrames` — like Excel inside Python.
2) Why should I learn Pandas?
Because almost every data science or analytics project starts with cleaning messy data. Pandas makes this easy.
3) Do I need Python basics before Pandas?
Yes, at least basics like variables, loops, and functions. If you know simple Python code, Pandas is easy to learn.
4) What dataset do we use in this course?
We use a real messy dataset with NULL values, duplicates, and typos. Download it here.
5) Can Pandas handle large datasets?
Yes, Pandas can handle millions of rows. For very large files, use chunked reading or tools like Dask.
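As a small sketch of chunked reading (the filename `big_orders.csv` is just a placeholder):

```python
# Read a huge CSV 100,000 rows at a time and aggregate as you go
total = 0
for chunk in pd.read_csv('big_orders.csv', chunksize=100_000):
    total += chunk['total_amount'].sum()
print(total)
```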
6) Where is Pandas used in real life?
Everywhere: analyzing sales, building dashboards, cleaning text for NLP, and preparing predictive analytics models.
7) Is Pandas enough for Data Analytics?
Pandas is the foundation. After Pandas, you can learn visualization, machine learning, and SQL. But Pandas is your first step.
📝 Pandas Beginner Quiz
Answer one question at a time. At the end, you’ll see the correct answers!
Q1. What is a DataFrame in Pandas?
Q2. Which function is used to read a CSV file?
Q3. Which method shows the first 5 rows?
Q4. How do you select rows where total_amount > 1000?
Q5. Which function removes duplicate rows?
✅ Correct Answers
- Q1 → B (A 2D table like Excel)
- Q2 → A (pd.read_csv())
- Q3 → C (df.head())
- Q4 → A (df[df['total_amount'] > 1000])
- Q5 → B (df.drop_duplicates())