📊 Pandas in Python: Complete Course from Scratch with Real Dataset (2025)

Welcome to Vista Academy’s Pandas in Python Course — a step-by-step guide designed for beginners and aspiring data analysts in 2025. Here you won’t just learn Pandas; you’ll practice with a real messy dataset that includes NULL values, duplicates, typos, and anomalies — just like real-world business data.

💡 Why This Dataset?

  • Contains NULL values, duplicates, and text inconsistencies for hands-on cleaning.
  • Includes age anomalies, wrong discounts, typos — perfect for teaching real-world data wrangling.
  • Ideal for projects: sales analysis, customer churn, fraud detection, and more.

📘 What You Will Learn (Course Blueprint)

This course is structured in 14 modules that take you from zero to advanced Pandas skills:

  1. Pandas Basics – read_csv, head(), info()
  2. Indexing & Selection – .loc, .iloc, .query()
  3. Cleaning & Missing Data – fillna, dropna, replace
  4. Data Types & Conversions – datetime, astype, categorical
  5. String Operations & Normalization – cleaning text, fixing typos
  6. Filtering, Sorting & Ranking – top customers, suspicious orders
  7. GroupBy & Aggregations – sales by category, customer KPIs
  8. Time-Series & Resampling – monthly sales, rolling averages
  9. Joins & Merges – customer & product tables
  10. Pivot Tables & Crosstab – city vs category analysis
  11. Advanced Transformations – apply, map, window functions
  12. Performance Optimization – memory, chunking
  13. Exporting & Reporting – CSV, Excel, JSON
  14. Mini Projects – Sales Dashboard, Churn Risk, Fraud Detection

Module 0 — About the Dataset

This module introduces the dataset you’ll use throughout the course. Read the schema, preview sample rows, and set the business questions we will solve using Pandas.

Dataset Summary

  • Type: E-commerce orders (synthetic, realistic mess).
  • Rows: 122 (includes intentional duplicates & anomalies).
  • Columns: 24 fields (see data dictionary).
  • Dirty-data features: NULL/NaN values, placeholder strings (“NULL”, “n/a”, “none”), inconsistent casing, typos, duplicate rows, numeric anomalies (e.g., age=150), and some zero-quantity orders.
⬇️ Download Raw Dataset (CSV) (Use this exact CSV for course exercises)

Data Dictionary (Fields)

Each entry below gives the column name, its type, and a description with dirty-data notes:

  • order_id (string): Unique order identifier. Duplicates intentionally inserted for dedup lessons.
  • order_date (datetime): Order placement date. Some entries may be out of sequence or missing.
  • customer_id (int): Numeric customer ID. Use joins to merge customer-level data.
  • customer_name (string): May include ‘NULL’ placeholders or empty strings.
  • gender (string): “Male”/“Female”, but may include inconsistent capitalization.
  • age (int): Contains realistic ages and anomalies (e.g., 150, 17) for validation exercises.
  • city (string): Inconsistent casing: “Delhi”, “DELHI”, “dehradun”, etc. Good for normalization lessons.
  • product_id / product_name (string): Product names include typos (e → 3) for fuzzy-matching tasks.
  • category (string): Product category; used for aggregation and pivot tables.
  • unit_price (float): Contains missing values (None) to practice imputation/checks.
  • quantity (int): Occasional zero-quantity rows to detect invalid orders.
  • discount (float / string): Mix of numeric discounts and ‘n/a’ strings — perfect for coercion lessons.
  • total_amount (float): Calculated field, but some rows are NaN (missing calculation) to be recomputed.
  • payment_method (string): Variations like ‘COD’ vs ‘Cash on Delivery’.
  • is_returned (bool): Marks returned orders.
  • delivery_days / delivery_date (int / datetime): Delivery info; some delivery_date entries may be missing.
  • rating / review (int / string): Ratings sometimes missing; reviews are short text with blanks.
  • loyalty_member (string): Mixed markers: ‘Yes’, ‘Y’, ‘yes’, ‘No’, ‘N’, etc. — standardize in cleaning lessons.
  • signup_date / last_purchase_date (datetime): Use to compute recency and RFM features; some dates may be NaT.
  • coupon_code (string): Mixed case and None values — teach grouping by coupon usage.

Business Questions (What we’ll answer)

  1. Which products and categories generate the most revenue (even with dirty data)?
  2. Which cities have the highest Average Order Value (AOV)?
  3. Who are repeat customers? What is the repeat purchase rate?
  4. How long does delivery take on average — and which cities have delays?
  5. Which orders look suspicious (zero quantity, missing prices, extreme discounts)?

Learning Outcomes (After Module 0)

  • Understand the dataset schema and common real-world data issues.
  • Formulate analysis questions and translate them into Pandas tasks.
  • Download and open the CSV in Jupyter / Colab and run initial EDA commands.
  • Plan a cleaning checklist for the data (what to fix first and why).

Starter Exercises — Module 0

  1. Load the CSV and show df.shape, df.info(), and df.head(). Expected: 122 rows, 24 columns (some NaN counts).
  2. Count placeholders — how many times do the strings “NULL”, “n/a”, “none” appear? Expected: non-zero counts to demonstrate cleaning need.
  3. Find duplicates by order_id and drop them for a practice run (but keep a copy of raw data). Expected: at least 2 duplicate rows found.
  4. Preview anomalies — show rows where age > 100 or quantity == 0 or unit_price is null.
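
A possible starter sketch for these exercises (assuming you saved the download as orders_raw.csv):

import pandas as pd

df = pd.read_csv('orders_raw.csv')

# 1) shape, summary, preview
print(df.shape)   # expect (122, 24)
df.info()
df.head()

# 2) count placeholder strings anywhere in the table
print(df.isin(['NULL', 'n/a', 'none']).sum().sum())

# 3) find duplicate order_ids, then practice dropping them on a copy
dupes = df[df.duplicated(subset=['order_id'], keep=False)]
print(len(dupes))
df_practice = df.drop_duplicates(subset=['order_id'])

# 4) preview anomalies
df[(df['age'] > 100) | (df['quantity'] == 0) | (df['unit_price'].isna())]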

Pro tip: Keep the original raw CSV unchanged. Save it once as orders_raw.csv and do your work on copies like orders_working.csv while you learn.

Module 2 — Indexing & Selection

Think of a DataFrame like an Excel table. In this module we will learn 3 simple things:

  👉 how to pick some columns
  👉 how to pick some rows
  👉 how to filter rows with conditions

Step 1 — Load the dataset

import pandas as pd
df = pd.read_csv('orders_raw.csv')
df.head()
    

This opens the CSV file into Pandas and shows the first 5 rows. Simple start 🚀

Step 2 — Pick some columns

df[['order_id', 'product_name', 'total_amount']].head()
    

This shows only 3 columns (order id, product name, and total). Think of it as hiding other columns you don’t need.

Step 3 — Pick some rows

df.iloc[0:5]
    

This shows rows 0 to 4 (the first 5 rows). You can change numbers to see any block of rows.

Step 4 — Filter rows with a condition

df[df['category'] == 'Clothing']
    

This shows only rows where the category is “Clothing”. Filters are the most powerful tool in Pandas 🎯

Step 5 — Filter with two conditions

df[(df['category'] == 'Clothing') & (df['total_amount'] > 1000)]
    

This shows clothing orders where total is more than 1000. Tip: Use & (and) or | (or), and put each condition in parentheses.

🎯 Mini Exercises

  1. Show only the order_id and city columns.
  2. Display rows 10 to 15 using iloc.
  3. Filter all orders where quantity is greater than 3.
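
Possible one-line answers (a sketch, assuming df is loaded as above):

# 1) only two columns
df[['order_id', 'city']]

# 2) rows 10 to 15 (iloc stops before the second number, so use 16)
df.iloc[10:16]

# 3) quantity greater than 3
df[df['quantity'] > 3]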

Module 3 — Cleaning & Missing Data

Real-world data is never perfect. It has blanks, duplicates, or wrong text. In this module we learn 4 very easy cleaning steps.

1) Find blanks (missing values)

df.isna().sum()
    

This shows how many blanks are in each column.

2) Replace ‘NULL’ / ‘n/a’ with blanks

df = df.replace(['NULL','n/a','none'], pd.NA)
    

Now Pandas understands them as blanks.

3) Fix missing total_amount

# fill only the missing totals with price × quantity
df['total_amount'] = df['total_amount'].fillna(df['unit_price'] * df['quantity'])
    

If total is empty but price × quantity exist, we can calculate it again.

4) Remove duplicates

df = df.drop_duplicates(subset=['order_id'])
    

This keeps only one row if the same order appears twice.

🎯 Mini Exercises

  1. Run df.isna().sum() and say which column has the most blanks.
  2. Replace ‘NULL’ and run again — what changed?
  3. Pick one row, calculate unit_price × quantity yourself, and compare with total_amount.
  4. Try df.drop_duplicates() and see if row count changes.
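
A sketch of these four checks (which column has the most blanks depends on your file):

# 1) column with the most blanks
print(df.isna().sum().idxmax())

# 2) replace placeholders, then count blanks again
df = df.replace(['NULL', 'n/a', 'none'], pd.NA)
print(df.isna().sum())

# 3) check one row by hand
row = df.iloc[0]
print(row['unit_price'], '*', row['quantity'], 'vs', row['total_amount'])

# 4) does the row count change?
print(len(df), '->', len(df.drop_duplicates()))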

Module 4 — Data Types & Conversions (Very Easy)

Short and simple: make sure columns are the right type so math and dates work correctly.

1) See column types

df.info()
    

This shows which columns are numbers, text, or dates. Start here.

2) Convert date columns

df['order_date'] = pd.to_datetime(df['order_date'], errors='coerce')
df['delivery_date'] = pd.to_datetime(df['delivery_date'], errors='coerce')
    

Now Pandas understands these as dates and you can sort or resample by them.

3) Make number columns numeric

df['unit_price'] = pd.to_numeric(df['unit_price'], errors='coerce')
df['total_amount'] = pd.to_numeric(df['total_amount'], errors='coerce')
    

If a cell can’t become a number, it becomes blank (so you can fix it later).

4) Convert text that repeats into a “category”

df['category'] = df['category'].astype('category')
    

This makes grouping and counting faster and uses less memory.

5) Quick sanity check

# make sure quantity is not negative
(df['quantity'] < 0).sum()
    

If result is not zero, we have bad data to fix (show students how to inspect those rows).
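
For example, to pull those rows up for inspection:

# show any rows with negative quantity
df[df['quantity'] < 0]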

Mini exercises (5 min)

  1. Run df.info() and tell the class which columns are not numbers but should be.
  2. Convert unit_price using pd.to_numeric and show df.info() again.
  3. Convert order_date and show the first 3 dates sorted.
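
A short sketch for exercises 2 and 3:

# 2) convert unit_price, then re-check the dtypes
df['unit_price'] = pd.to_numeric(df['unit_price'], errors='coerce')
df.info()

# 3) convert order_date and show the three earliest dates
df['order_date'] = pd.to_datetime(df['order_date'], errors='coerce')
print(df['order_date'].sort_values().head(3))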

Module 5 — String Ops & Normalization (Very Easy)

Text columns often have spaces, wrong case, or small typos. These short steps make text clean and consistent — so counting and grouping work correctly.

1) Remove extra spaces

# remove leading/trailing spaces in city and product_name
df['city'] = df['city'].str.strip()
df['product_name'] = df['product_name'].str.strip()
    

Why: " Delhi " → "Delhi". Helps avoid accidental duplicates.

2) Make text same case

# lowercase for grouping, uppercase for codes
df['city_clean'] = df['city'].str.lower()
df['coupon_code'] = df['coupon_code'].str.upper()
    

Why: “Delhi”, “DELHI”, “delhi” become the same when using one case.

3) Replace common variants

# make payment methods consistent
df['payment_method'] = df['payment_method'].replace({'COD':'Cash on Delivery', 'cod':'Cash on Delivery'})
    

Why: Different ways to say the same thing should be one value for easy counting.

4) Clean coupon codes (letters + numbers only)

# remove non-alphanumeric characters
df['coupon_code'] = df['coupon_code'].str.replace('[^A-Za-z0-9]', '', regex=True)
    

Why: “SAVE-10!” → “SAVE10”, so duplicates are counted correctly.

5) Fix small typos (example)

# if product_name has 'T-Shirt' and 'Tshirt', make both 'T-Shirt'
df['product_name'] = df['product_name'].replace({'Tshirt':'T-Shirt', 'tshirt':'T-Shirt'})
    

Why: Replace obvious misspellings with the correct name.

6) Check unique counts

# count unique product names before cleaning
before = df['product_name'].nunique()
df['product_name'] = df['product_name'].str.strip()  # example cleaning step
after = df['product_name'].nunique()  # count again after cleaning
print('unique before:', before, 'unique after:', after)
    

Why: Cleaning should reduce the number of unique messy names.

Mini exercises (5 min)

  1. Trim spaces: run df['city'] = df['city'].str.strip() and show first 5 cities.
  2. Uppercase coupon codes and remove symbols, then show df['coupon_code'].head().
  3. Replace one obvious typo in product names using .replace().

Module 6 — Filtering, Sorting & Ranking (Very Easy)

Quick hands-on: learn how to find rows you want, sort them, and show the top items. Use these three simple tricks in every analysis.

1) Show only one kind of rows

# show only clothing orders
df[df['category'] == 'Clothing']
    

This keeps rows where category equals “Clothing”. Think: filter like Excel.

2) Filter with two rules

# clothing AND total more than 1000
df[(df['category'] == 'Clothing') & (df['total_amount'] > 1000)]
    

Use & for AND and | for OR. Put each condition in parentheses.

3) Show biggest orders first

# biggest total_amount first
df.sort_values('total_amount', ascending=False).head(10)
    

Sort by a column to see top records — very useful to find important rows fast.

4) Get the top N rows

# top 5 orders by total_amount
df.nlargest(5, 'total_amount')
    

Fast way to get the largest values without sorting everything.

5) Top item in each group (easy way)

# sort by category then total, keep first row per category
top_per_category = df.sort_values(['category','total_amount'], ascending=[True, False]).groupby('category').head(1)
top_per_category
    

Sort inside each group and pick the first row — simple method to find the winner per group.

6) Add a rank number per group

# rank orders inside each category (1 = largest)
df['rank_in_category'] = df.groupby('category')['total_amount'].rank(method='dense', ascending=False)
df[['category','order_id','total_amount','rank_in_category']].head(8)
    

This gives a position number for each order inside its category (useful for leaderboards).

Mini exercises (5–10 min)

  1. Show orders where total_amount > 1500.
  2. Show the 5 most recent orders (sort by order_date).
  3. Find top product (by total_amount) in the “Electronics” category.
  4. Add rank_in_category and show the top 2 orders for each category.
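
Possible solutions, as a sketch (assuming order_date has been converted to datetime):

# 1) big orders
df[df['total_amount'] > 1500]

# 2) five most recent orders
df.sort_values('order_date', ascending=False).head(5)

# 3) top Electronics product by total_amount
df[df['category'] == 'Electronics'].nlargest(1, 'total_amount')[['product_name', 'total_amount']]

# 4) top 2 orders per category (same trick as step 5)
df.sort_values(['category', 'total_amount'], ascending=[True, False]).groupby('category').head(2)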

Module 7 — GroupBy & Aggregations (Very Easy)

Grouping helps answer business questions like “which product sells most?” and “which city has highest revenue?” — we do this with a few simple commands.

1) Total revenue by product

# add revenue per product and sort top 10
revenue_by_product = df.groupby('product_name')['total_amount'].sum().reset_index()
revenue_by_product = revenue_by_product.sort_values('total_amount', ascending=False).head(10)
revenue_by_product
    

What it does: adds up `total_amount` for each product and shows the top 10 products by revenue.

2) Orders count and average order value (AOV)

# one table with count and average
cat_stats = df.groupby('category').agg(
  orders_count = ('order_id','nunique'),
  avg_order_value = ('total_amount','mean')
).reset_index()
cat_stats
    

What it does: for each category, shows how many orders and the average order value — easy KPI table.

3) Show each row’s share of its category

# give each row the category total and percent share
df['category_total'] = df.groupby('category')['total_amount'].transform('sum')
df['pct_of_category'] = df['total_amount'] / df['category_total']
df[['order_id','category','total_amount','pct_of_category']].head(6)
    

What it does: keeps row shape but adds how much that order contributes to its category.

4) Pivot: quick matrix of category × city revenue

pivot = df.pivot_table(index='category', columns='city', values='total_amount', aggfunc='sum', fill_value=0)
pivot.head()
    

What it does: makes a table where rows are categories, columns are cities, and cells show total revenue — great for dashboards.

5) Simple rule for missing totals

# Option A: ignore rows with missing total_amount (default)
# Option B: treat missing as zero:
df2 = df.copy()
df2['total_amount'] = df2['total_amount'].fillna(0)
revenue_by_product = df2.groupby('product_name')['total_amount'].sum().reset_index()
    

What it does: choose whether missing totals count as 0 (business decision). Show both to the class and compare totals.

Mini exercises (5–15 min)

  1. Show top 5 products by revenue (use code in step 1).
  2. Create `cat_stats` and point out which category has highest AOV.
  3. Add the `category_total` column and show how much the first 3 orders contribute (use `%` formatting for class demo).
  4. Build the pivot table and pick one city to compare category revenues.
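
For exercise 3, one way to print the shares with % formatting (assuming the pct_of_category column from step 3 exists):

for pct in df['pct_of_category'].head(3):
    print(f"{pct:.1%}")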

Module 8 — Time-Series & Resampling (Very Easy)

Dates are everywhere: sales by month, customers by signup week, deliveries per day. Here you’ll learn how to handle dates, resample them, and compute rolling averages.

1) Convert text to datetime

df['order_date'] = pd.to_datetime(df['order_date'], errors='coerce')
    

Why: makes `order_date` a real datetime so Pandas can use it for resampling.

2) Use date as index

df = df.set_index('order_date')
df.head(3)
    

Why: setting the date as index makes resampling super easy.

3) Monthly sales totals

monthly_sales = df['total_amount'].resample('M').sum()
monthly_sales.head()
    

Why: `resample('M')` means “group by month” — now you see monthly revenue.

4) Orders per day

daily_orders = df['order_id'].resample('D').count()
daily_orders.head()
    

Why: counts how many orders happened each day.

5) Rolling 7-day average

rolling_avg = daily_orders.rolling(7).mean()
rolling_avg.head(10)
    

Why: smooths daily noise by showing the 7-day average order count.

Mini exercises (5–10 min)

  1. Convert `delivery_date` to datetime and set as index.
  2. Find monthly count of orders using `resample('M').count()`.
  3. Compute a 30-day rolling sum of `total_amount`.
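
A sketch for all three (the index must be a datetime for resample, and order_date is already our index from step 2):

# 1) delivery_date as a datetime index, on a separate copy
deliveries = df.reset_index()
deliveries['delivery_date'] = pd.to_datetime(deliveries['delivery_date'], errors='coerce')
deliveries = deliveries.set_index('delivery_date')

# 2) monthly count of orders
df['order_id'].resample('M').count()

# 3) 30-day rolling sum of revenue
daily_rev = df['total_amount'].resample('D').sum()
daily_rev.rolling(30).sum()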

Module 9 — Joins & Merges (Very Easy)

Learn how to combine tables — like customers, products, and orders — so you can answer real business questions. We’ll keep examples short and simple.

Start: load data (one-line)

# If you use the Google Sheet: (first sheet / default gid)
sheet_url = "https://docs.google.com/spreadsheets/d/1UgLeYK_U-DMKbJt66aWX_Tu7E9zmDHOMGFP9C8GSnbA/export?format=csv"
df = pd.read_csv(sheet_url)

# Or load your local orders_raw.csv
# df = pd.read_csv("orders_raw.csv")
    

What this does: loads your master table into `df` — the single table that contains orders, customer and product info.

Step 1 — Make customers, products & orders tables

# customers: unique customers
customers = df[['customer_id','customer_name','city','gender','signup_date','loyalty_member']].drop_duplicates()

# products: unique products
products  = df[['product_id','product_name','category','unit_price']].drop_duplicates()

# orders: trimmed orders table
orders    = df[['order_id','customer_id','product_id','order_date','quantity','discount','total_amount','payment_method']].copy()
    

Why: Smaller tables make joins simpler and mirror how real databases are organized.

Step 2 — Easy joins (what they do)

  • Left join: keep every row from left table, add matching info from right. (Good for: keep all orders, add customer info.)
  • Inner join: keep only rows that appear in both tables. (Good for: analysis where both sides must exist.)
  • Right join: keep every row from right table, add matching left info (opposite of left).
  • Outer join: keep all rows from both tables, fill missing values with NaN. (Good for a full view.)

Step 3 — Code examples (one-liners)

# left join: keep all orders, bring customer info
orders_with_customers = orders.merge(customers, on='customer_id', how='left')
orders_with_customers.head()

# left join orders with products
orders_full = orders.merge(products, on='product_id', how='left')
orders_full.head()

# inner join (only orders that have matching product and customer)
orders_inner = orders.merge(customers, on='customer_id', how='inner').merge(products, on='product_id', how='inner')
orders_inner.head()

# outer join example: full outer between products and orders (shows unmatched products or orders)
products_orders_outer = products.merge(orders, on='product_id', how='outer', indicator=True)
products_orders_outer.head()
    

Tip: use indicator=True to see where rows came from (`left_only`, `right_only`, `both`).

Step 4 — Simple business questions (use joins)

Q: Who are the top 5 customers by spend (name + city)?

cust_revenue = orders.groupby('customer_id')['total_amount'].sum().reset_index()
cust_top = cust_revenue.merge(customers, on='customer_id', how='left').sort_values('total_amount', ascending=False).head(5)
cust_top[['customer_name','city','total_amount']]
    

Q: Show orders with product name and customer city:

orders_full = orders.merge(customers[['customer_id','city','customer_name']], on='customer_id', how='left') \
                    .merge(products[['product_id','product_name','category']], on='product_id', how='left')
orders_full[['order_id','customer_name','city','product_name','category','total_amount']].head(8)
    

Mini exercises (5–15 min)

  1. Create customers, products, and orders tables from your loaded df.
  2. Do a left join of orders with customers and show the first 5 rows.
  3. Find top 3 products by revenue (join orders→products to show product names).
  4. Use indicator=True with an outer join between products and orders and count how many rows are left_only, right_only, both.
  5. Explain in one line when you’d use an inner vs left join (class discussion).
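
For exercise 4, the indicator column from the outer join can be counted directly:

# how many rows matched on both sides vs. one side only
products_orders_outer['_merge'].value_counts()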

Module 10 — Pivot Tables & Crosstab (City × Category)

Learn how to make a quick matrix (pivot table) showing revenue by city and product category, and how to use crosstab to compare counts or other stats. Simple examples — perfect for beginners.

What is a pivot table?

A pivot table rearranges data into a 2D table: one variable for rows (e.g., category), another for columns (e.g., city), and an aggregation inside (e.g., sum of revenue or count of orders).

1) Pivot — total revenue (sum)

# pivot: rows = category, columns = city, values = total_amount (sum)
pivot_rev = df.pivot_table(
    index='category',
    columns='city',
    values='total_amount',
    aggfunc='sum',
    fill_value=0
)
pivot_rev
    

What this does: Builds a table where each cell shows total revenue for a category in a city. fill_value=0 fills empty cells with 0.

2) Pivot with multiple statistics (sum & mean)

# multi-agg pivot: show sum and average order value
pivot_multi = df.pivot_table(
    index='category',
    columns='city',
    values='total_amount',
    aggfunc=['sum','mean'],
    fill_value=0
)
pivot_multi
    

What this does: Shows both total revenue and average order amount for each category × city pair. The pivot has two layers of column labels (aggfunc and city).

3) Crosstab — count of orders by city & category

# crosstab counts: how many orders per category in each city
ct_orders = pd.crosstab(df['category'], df['city'], values=df['order_id'], aggfunc='count', dropna=False).fillna(0)
ct_orders
    

What this does: Gives a simple counts matrix: how many orders belong to each category in each city.

4) Crosstab as percentages (row-wise)

# show percent distribution per category across cities
ct_pct_row = pd.crosstab(df['category'], df['city'], normalize='index').round(3)
ct_pct_row
    

What this does: For each category row, shows the fraction of orders coming from each city (useful to compare where categories sell best).

5) Quick heatmap idea (optional)

# (optional) plot a heatmap of pivot_rev using matplotlib + seaborn
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10,6))
sns.heatmap(pivot_rev, annot=True, fmt=".0f")
plt.title("Revenue by Category and City")
plt.show()
    

Note: seaborn makes a nice heatmap. If you prefer plain matplotlib, use plt.imshow(). Visuals help students see patterns fast.
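
A minimal matplotlib-only sketch, reusing pivot_rev from step 1:

import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))
plt.imshow(pivot_rev, aspect='auto')
plt.xticks(range(len(pivot_rev.columns)), pivot_rev.columns, rotation=45)
plt.yticks(range(len(pivot_rev.index)), pivot_rev.index)
plt.colorbar(label='Revenue')
plt.title('Revenue by Category and City')
plt.show()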

6) Which category makes the most money in each city?

# find top category per city using groupby + unstack + idxmax
city_cat = df.groupby(['city','category'])['total_amount'].sum().unstack(fill_value=0)
top_per_city = city_cat.idxmax(axis=1).reset_index().rename(columns={0:'top_category'})
top_per_city
    

What this does: Sums revenue for every city/category, then picks the category with max revenue per city.

Mini exercises (5–15 min)

  1. Create pivot_rev and show the top 3 categories by total revenue across all cities (sum the row).
  2. Build ct_orders and find which city has the most orders for “Electronics”.
  3. Create ct_pct_row and identify categories that are concentrated (>60%) in a single city.
  4. Plot a heatmap of pivot_rev (or export it to Excel to show color scales in Excel).
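
Possible sketches for exercises 1 and 2 (assuming pivot_rev and ct_orders from above, and that “Electronics” exists as a category):

# 1) top 3 categories by total revenue across all cities
pivot_rev.sum(axis=1).sort_values(ascending=False).head(3)

# 2) city with the most Electronics orders
ct_orders.loc['Electronics'].idxmax()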

Module 11 — Transformations (Beginner Level)

Transformations are small changes to your data: replacing codes with labels, creating new columns with functions, or adding group-level totals. Let’s learn the 3 most common ones step by step.

1) map() — Replace values

# Replace short payment codes with full names
pay_map = {'COD':'Cash on Delivery', 'UPI':'UPI Payment', 'Card':'Card Payment'}
df['payment_method_full'] = df['payment_method'].map(pay_map)
df[['payment_method','payment_method_full']].head()
    

What this does: Changes small codes (COD, UPI, Card) into human-readable labels. Note: values not in the dictionary become NaN, so list every code you expect.

2) apply() — Create new column using a function

# Label orders as small, medium, or big
def order_size(x):
    if x >= 2000:
        return 'Big'
    elif x >= 500:
        return 'Medium'
    else:
        return 'Small'

df['order_size'] = df['total_amount'].apply(order_size)
df[['total_amount','order_size']].head()
    

What this does: Runs your custom function for each row. Here we classify orders by size.

3) transform() — Add group totals

# Add each product's total revenue to every row of that product
df['product_total'] = df.groupby('product_id')['total_amount'].transform('sum')
df[['product_id','total_amount','product_total']].head()
    

What this does: Keeps all rows but adds a new column showing the sum of revenue per product.

Quick exercises (5–10 min)

  1. Use map() to replace loyalty codes Y/N with Yes/No.
  2. Use apply() to mark orders as discounted if discount > 0.
  3. Use transform() to calculate each customer’s total spend and add it as a new column.
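
One possible solution sketch (discount mixes numbers and ‘n/a’ strings, so we coerce it first):

# 1) map Y/N to Yes/No, keeping values that are already full words
df['loyalty_member'] = df['loyalty_member'].map({'Y': 'Yes', 'N': 'No'}).fillna(df['loyalty_member'])

# 2) mark discounted orders with apply()
disc = pd.to_numeric(df['discount'], errors='coerce')
df['is_discounted'] = disc.apply(lambda d: d > 0)

# 3) each customer's total spend on every row
df['customer_total'] = df.groupby('customer_id')['total_amount'].transform('sum')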

❓ Frequently Asked Questions — Pandas in Python

Beginner-friendly answers to common Pandas questions. Helps students starting Data Analytics in Python.

1) What is Pandas in Python?

Pandas is a popular Python library used for data analysis. It helps you work with tables called DataFrames — like Excel inside Python.

2) Why should I learn Pandas?

Because almost every data science or analytics project starts with cleaning messy data. Pandas makes this easy.

3) Do I need Python basics before Pandas?

Yes, at least basics like variables, loops, and functions. If you know simple Python code, Pandas is easy to learn.

4) What dataset do we use in this course?

We use a realistic messy dataset with NULL values, duplicates, and typos. Download it from the link in Module 0 above.

5) Can Pandas handle large datasets?

Yes, Pandas can handle millions of rows. For very large files, use chunked reading or tools like Dask.
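
A minimal sketch of chunked reading (the file name here is just an example):

import pandas as pd

total = 0.0
for chunk in pd.read_csv('big_orders.csv', chunksize=100_000):
    total += chunk['total_amount'].sum()
print(total)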

6) Where is Pandas used in real life?

Everywhere: analyzing sales, building dashboards, cleaning text for NLP, and preparing predictive analytics models.

7) Is Pandas enough for Data Analytics?

Pandas is the foundation. After Pandas, you can learn visualization, machine learning, and SQL. But Pandas is your first step.
