1. What is Pandas in Python?

Answer: Pandas is an open-source Python library used for data analysis, cleaning, and manipulation. It provides flexible data structures like Series (1-D) and DataFrame (2-D) to work with structured data (tables, CSVs, Excel, SQL queries).
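A quick illustration of the two structures (the column names and values below are made up):

```python
import pandas as pd

# Series: a one-dimensional labelled array
s = pd.Series([10, 20, 30], name="sales")

# DataFrame: a two-dimensional table of rows and columns
df = pd.DataFrame({"product": ["A", "B", "C"], "sales": [10, 20, 30]})

print(s.ndim, df.ndim)   # 1 2
print(df.shape)          # (3, 2)
```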

2. What is Pandas Python used for?

Answer: Pandas is mainly used for:

  • ✓ Data Cleaning (handling missing values, duplicates)
  • ✓ Data Transformation (filtering, grouping, sorting)
  • ✓ Data Analysis (statistics, summarization)
  • ✓ Reading/Writing files such as CSV, Excel, and SQL
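Each of these tasks takes only a few lines of pandas. A toy sketch (the city names and sales figures are invented):

```python
import pandas as pd

# Toy dataset with one missing value and one exact duplicate row (invented values)
df = pd.DataFrame({
    "city": ["Delhi", "Mumbai", "Mumbai", "Mumbai"],
    "sales": [100, None, 200, 200],
})

df = df.drop_duplicates()                    # cleaning: drop the exact duplicate row
df["sales"] = df["sales"].fillna(0)          # cleaning: fill the missing value
high = df[df["sales"] > 50]                  # transformation: filter rows
totals = df.groupby("city")["sales"].sum()   # analysis: summarize per city
df.to_csv("clean.csv", index=False)          # writing: export the result

print(totals)
```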
3. What are Pandas in Python?

Answer: In simple terms, Pandas is a powerful Python toolkit that turns raw data into clean, structured tables, so analysts can focus on insights instead of messy datasets.

4. What is df in Python?

Answer: In Pandas, df is a common variable name used for a DataFrame. A DataFrame is a 2D table of rows and columns, just like an Excel sheet or SQL table.

import pandas as pd
df = pd.read_csv("data.csv")
print(df.head())  # displays first 5 rows
5. What is the difference between Pandas and df?

Answer: Pandas is the library, while df (DataFrame) is the data structure created using Pandas to store and manipulate tabular data.

6. Why use Pandas for data cleaning?

Answer: Pandas is the industry standard for data cleaning in Python because it offers powerful built-in functions for handling missing data, removing duplicates, converting data types, and filtering outliers, all with simple, readable syntax that saves hours of manual coding.
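For instance, type conversion and outlier filtering each take a single readable line (the price column below is hypothetical):

```python
import pandas as pd

# Hypothetical raw column read in as text, with one bad entry
df = pd.DataFrame({"price": ["10", "12", "bad", "1000"]})

# Type conversion: invalid strings become NaN instead of raising an error
df["price"] = pd.to_numeric(df["price"], errors="coerce")

# Outlier filtering and missing-value removal, one line each
clean = df[df["price"] < 100].dropna()
print(clean)   # only the 10 and 12 rows survive
```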

Data Cleaning with Python

Import Pandas DataFrame from Google Sheets & Practice in Google Colab

Learn data cleaning in Python: pandas read_csv, df.head(), and practice with real datasets (a Google Sheet and the Superstore dataset) inside Google Colab.

Step 1: Open Google Colab
Open https://colab.research.google.com and create a new notebook to start coding.
Step 2: Import Pandas Library
import pandas as pd
👉 Importing pandas is the first step in every pandas data cleaning project.
Step 3: Load Google Sheet Dataset
import pandas as pd

url = "https://docs.google.com/spreadsheets/d/1zZDBuiFCitgMU-D5XA2G7DVvoGmJ2G6KqXM9sOLpGz0/export?format=csv&gid=1140868527"

df = pd.read_csv(url)

df.head()
👉 read_csv loads the shared sheet directly into a DataFrame.
Step 4: Superstore Dataset Practice
The same pd.read_csv(url) pattern from Step 3 loads the Superstore practice sheet.
👉 Real dataset used in data analyst projects.
Step 5: Inspect Data (df.head)
df.head(10)
👉 Shows the first 10 rows; with no argument, df.head() shows 5.
Step 6: Start Data Cleaning
df.isnull().sum()
df.dropna()
df.describe()
👉 The core of data cleaning in pandas: count missing values, drop incomplete rows, and summarize the data.
Data Cleaning with Pandas

Filtering Data in Pandas (Condition, Boolean Indexing, Query)

Learn how to filter data in a pandas DataFrame. Filtering is one of the most important steps in data cleaning in Python: it selects specific rows, removes unwanted data, and reveals patterns.

Step 1: Preview Data
df.head()
👉 Always check the data before filtering
Step 2: Filter Rows (Condition)
# Filter rows where Sales > 500
df[df['Sales'] > 500]
👉 Returns only the rows where the condition is True
Step 3: Multiple Conditions
# Sales > 500 AND Profit > 50
df[(df['Sales'] > 500) & (df['Profit'] > 50)]
👉 & = AND condition
👉 | = OR condition
👉 Wrap each condition in parentheses
Step 4: Select Specific Columns
df[['Sales', 'Profit']]
👉 Select only the required columns
Step 5: Using Query Method
df.query("Sales > 500 and Profit > 50")
👉 Cleaner, more readable filtering syntax
Practice (Real Scenario)
# High value customers
high_sales = df[df['Sales'] > 1000]

print(high_sales.head())
👉 Real data analyst workflow
Data Cleaning with Pandas

Sorting Data in Pandas (sort_values & sort_index Explained)

Sorting is a key step in data cleaning in Python. Using sort_values() and sort_index(), you can organize your dataset to find top values, trends, and insights.

Step 1: Preview Data
df.head()
👉 Check the dataset before sorting
Step 2: Sort by Column (Ascending)
# Sort by Sales (low → high)
df.sort_values(by='Sales')
👉 Default sort order is ascending
👉 sort_values returns a sorted copy; assign the result to keep it
Step 3: Sort by Column (Descending)
# Sort by Sales (high → low)
df.sort_values(by='Sales', ascending=False)
👉 Find top values easily
Step 4: Sort Multiple Columns
# Sort by Category then Sales
df.sort_values(by=['Category', 'Sales'], ascending=[True, False])
👉 Multi-level sorting
👉 Important for analysis
Step 5: Sort by Index
df.sort_index()
👉 Sorts rows by index
👉 Useful after filtering
Practice (Top Records)
# Top 5 highest Sales
top_sales = df.sort_values(by='Sales', ascending=False).head(5)

print(top_sales)
👉 Real data analyst workflow
Data Cleaning with Pandas

Handling Missing Values in Pandas (fillna & dropna Explained)

Missing data is common in real datasets. In data cleaning in python, we use fillna() and dropna() to handle missing values effectively using Pandas DataFrame.

Step 1: Check Missing Values
df.isnull().sum()
👉 Shows the number of missing values in each column
Step 2: Fill Missing Values
# Fill all missing values with 0
df.fillna(0, inplace=True)
👉 Replace missing values when the rows are important
Step 3: Fill with Mean
# Fill Sales column with its mean
df['Sales'] = df['Sales'].fillna(df['Sales'].mean())
👉 A better approach for numerical data
👉 Assigning back avoids pandas' chained-assignment warning
Step 4: Drop Missing Rows
df = df.dropna()
👉 Removes rows with missing values (assign the result to keep it)
Step 5: Drop Missing Columns
df = df.dropna(axis=1)
👉 axis=1 → columns
👉 axis=0 → rows
Practice (Continue Same DataFrame)
# Check missing values
print(df.isnull().sum())

# Fill missing values
df.fillna(0, inplace=True)

# Verify
print(df.isnull().sum())
👉 Real data cleaning workflow
Data Cleaning with Pandas

Identify Missing Values in Pandas (isnull, any, sum Explained)

Before cleaning data, you must detect missing values. In data cleaning in python, we use df.isnull(), sum(), and any() to find missing data in rows and columns.

Step 1: Preview Data
df.head()
👉 Always inspect the data before checking for missing values
Step 2: Check Missing Values
df.isnull()
👉 Returns True/False for every cell
👉 True = missing data
Step 3: Count Missing Values
df.isnull().sum()
👉 Shows missing values per column
Step 4: Find Columns with Missing Data
df.columns[df.isnull().any()]
👉 Returns the columns that contain missing values
Step 5: Find Rows with Missing Data
df[df.isnull().any(axis=1)]
👉 Shows the rows where any value is missing
Practice (Continue Same DataFrame)
# Count missing values
print(df.isnull().sum())

# Show rows with missing values
print(df[df.isnull().any(axis=1)])
👉 Real data cleaning workflow
Data Cleaning with Pandas

Descriptive Statistics in Pandas (describe() Explained)

The pandas describe() method summarizes numerical data. It helps in data cleaning with pandas by showing the mean, standard deviation, min, max, and quartiles of your dataset.

Step 1: Load Dataset (Colab)
import pandas as pd

url = "https://raw.githubusercontent.com/plotly/datasets/master/superstore.csv"

df = pd.read_csv(url)

df.head()
👉 Load a real dataset for analysis
Step 2: Apply describe()
df.describe()
👉 Generates a statistical summary (count, mean, std, min, quartiles, max)
Step 3: Select Specific Columns
df[['Sales', 'Profit']].describe()
👉 Analyze only the required columns
Step 4: Include All Columns
df.describe(include='all')
👉 Includes categorical columns as well
👉 Adds count, unique, top, and freq for text columns
Real Practice (Colab)
import pandas as pd

url = "https://docs.google.com/spreadsheets/d/1zZDBuiFCitgMU-D5XA2G7DVvoGmJ2G6KqXM9sOLpGz0/export?format=csv&gid=1140868527"

df = pd.read_csv(url)

# Full summary
print(df.describe())

# Specific columns
print(df[['Sales', 'Profit']].describe())
👉 Used in real data analyst projects
Data Cleaning with Pandas

Remove Duplicate Rows in Pandas (drop_duplicates Explained)

Duplicate data is a common problem in real datasets. In data cleaning in python, we use duplicated() and drop_duplicates() to identify and remove duplicate rows from a Pandas DataFrame.

Step 1: Check Duplicate Rows
df.duplicated().sum()
👉 Counts the number of duplicate rows
Step 2: View Duplicate Rows
df[df.duplicated()]
👉 Shows the duplicate records
Step 3: Remove Duplicates
df.drop_duplicates(inplace=True)
👉 Removes duplicate rows permanently
Step 4: Remove Based on Column
df.drop_duplicates(subset='Customer Name', inplace=True)
👉 Removes duplicates using a specific column
Step 5: Keep Last Record
df.drop_duplicates(keep='last', inplace=True)
👉 keep='first' (default) keeps the first occurrence
👉 keep='last' keeps the latest value
Practice (Continue Same DataFrame)
# Count duplicates
print(df.duplicated().sum())

# Remove duplicates
df.drop_duplicates(inplace=True)

# Verify
print(df.duplicated().sum())
👉 Real data cleaning workflow
Data Cleaning with Pandas

Data Type Conversion in Pandas (astype, to_datetime Explained)

In real datasets, columns often have incorrect data types. In data cleaning in python, we use astype(), to_datetime(), and to_numeric() to fix data types for proper analysis.

Step 1: Check Data Types
df.dtypes
👉 Always check types before conversion
Step 2: Convert Using astype()
# Convert column safely
df['Age'] = df['Age'].astype('Int64')
👉 Use the nullable Int64 dtype instead of int to handle missing values
Step 3: Convert to Numeric
df['Sales'] = pd.to_numeric(df['Sales'], errors='coerce')
👉 Converts text to numbers
👉 Invalid values → NaN
Step 4: Convert to Date
df['Order Date'] = pd.to_datetime(df['Order Date'], errors='coerce')
👉 Converts text → datetime
👉 Handles wrong date formats safely
Step 5: Convert Multiple Columns
cols = ['Sales', 'Profit']

df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')
👉 Safer than astype for messy real-world data
Practice (Continue Same DataFrame)
# Check types
print(df.dtypes)

# Fix numeric column
df['Sales'] = pd.to_numeric(df['Sales'], errors='coerce')

# Fix date column
df['Order Date'] = pd.to_datetime(df['Order Date'], errors='coerce')

# Verify changes
print(df.dtypes)
👉 Real data cleaning workflow
Data Cleaning with Pandas

Outlier Detection in Pandas (IQR Method Explained)

Outliers are extreme values that can distort analysis. In data cleaning in python, we use the IQR (Interquartile Range) method to detect and remove outliers.

Step 1: Select Numerical Column
df['Sales'].describe()
👉 Understand the distribution before detecting outliers
Step 2: Calculate Q1 and Q3
Q1 = df['Sales'].quantile(0.25)
Q3 = df['Sales'].quantile(0.75)
👉 Q1 = 25th percentile
👉 Q3 = 75th percentile
Step 3: Calculate IQR
IQR = Q3 - Q1
👉 IQR = the range of the middle 50% of the data
Step 4: Define Bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
👉 Values outside this range are treated as outliers
Step 5: Detect Outliers
outliers = df[(df['Sales'] < lower_bound) | (df['Sales'] > upper_bound)]
👉 Shows the rows with extreme values
Step 6: Remove Outliers
df = df[(df['Sales'] >= lower_bound) & (df['Sales'] <= upper_bound)]
👉 Keeps only the valid data range
Practice (Continue Same DataFrame)
# Calculate quartiles
Q1 = df['Sales'].quantile(0.25)
Q3 = df['Sales'].quantile(0.75)

# Calculate IQR
IQR = Q3 - Q1

# Define bounds
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

# Filter data
df_clean = df[(df['Sales'] >= lower) & (df['Sales'] <= upper)]

print(df_clean.head())
👉 Real data cleaning workflow
Data Cleaning with Pandas

Text Cleaning in Pandas (lower, strip, replace Explained)

Real datasets often contain messy text like extra spaces, inconsistent casing, or wrong spellings. In data cleaning in python, we use string methods (.str) to clean and standardize text data.

Step 1: Convert Text to Lowercase
df['City'] = df['City'].str.lower()
👉 "Delhi" → "delhi"
👉 Removes case inconsistency
Step 2: Remove Extra Spaces
df['City'] = df['City'].str.strip()
👉 "delhi " → "delhi"
👉 Fixes spacing issues
Step 3: Replace Incorrect Values
df['City'] = df['City'].replace({
  'delhi ': 'delhi',
  'new delhi': 'delhi'
})
👉 Standardizes similar values
👉 Important for grouping
Step 4: Remove Special Characters
df['Name'] = df['Name'].str.replace('[^a-zA-Z ]', '', regex=True)
👉 Removes numbers & symbols
👉 Keeps only clean text
Step 5: Capitalize Text
df['City'] = df['City'].str.title()
👉 "delhi" → "Delhi"
👉 Makes the data presentable
Practice (Continue Same DataFrame)
# Clean City column
df['City'] = df['City'].str.lower().str.strip()

# Standardize values
df['City'] = df['City'].replace({
  'new delhi': 'delhi'
})

# Format nicely
df['City'] = df['City'].str.title()

print(df['City'].unique())
👉 Real data cleaning workflow
Data Analysis with Pandas

Feature Engineering in Pandas (Create New Columns & Extract Data)

Feature engineering is the process of creating new columns from existing data. It helps transform raw data into meaningful insights for analysis and machine learning.

Step 1: Create New Column
# Create Profit Ratio
df['Profit_Ratio'] = df['Profit'] / df['Sales']
👉 A new feature built from existing columns
👉 Used in business analysis
Step 2: Extract Year from Date
df['Year'] = df['Order Date'].dt.year
👉 Extract the year for trend analysis
Step 3: Extract Month
df['Month'] = df['Order Date'].dt.month
👉 Useful for monthly reports
Step 4: Create Category (Condition)
# High / Low Sales
df['Sales_Category'] = df['Sales'].apply(
  lambda x: 'High' if x > 500 else 'Low'
)
👉 Converts numeric → categorical
👉 Important for dashboards
Step 5: Combine Columns
# Combine City and State
df['Location'] = df['City'] + ", " + df['State']
👉 Creates a meaningful combined feature
Practice (Continue Same DataFrame)
# Create new features
df['Profit_Ratio'] = df['Profit'] / df['Sales']
df['Year'] = df['Order Date'].dt.year

# Categorize data
df['Sales_Category'] = df['Sales'].apply(
  lambda x: 'High' if x > 500 else 'Low'
)

print(df.head())
👉 Real data analyst workflow
Data Analysis with Pandas

GroupBy in Pandas (Aggregation, Sum, Mean Explained)

GroupBy is used to analyze data by categories. In data analysis using pandas, we use groupby() to summarize data like total sales, average profit, and customer insights.

Step 1: Group by Category
df.groupby('Category')['Sales'].sum()
👉 Total sales per category
Step 2: Calculate Mean
df.groupby('Category')['Profit'].mean()
👉 Average profit per category
Step 3: Multiple Aggregations
df.groupby('Category').agg({
  'Sales': 'sum',
  'Profit': 'mean'
})
👉 Perform multiple calculations together
Step 4: Group by Multiple Columns
df.groupby(['Category', 'Region'])['Sales'].sum()
👉 Multi-level grouping
Step 5: Reset Index
df.groupby('Category')['Sales'].sum().reset_index()
👉 Converts the result back into a DataFrame
Practice (Continue Same DataFrame)
# Total sales per category
sales_by_cat = df.groupby('Category')['Sales'].sum()

# Average profit
profit_avg = df.groupby('Category')['Profit'].mean()

print(sales_by_cat)
print(profit_avg)
👉 Real business analysis workflow
Final Project

Real Data Analytics Project (Google Colab Practice)

Now it’s time to apply everything you learned. This project covers data cleaning, feature engineering, and data analysis using a real dataset in Google Colab.

Step 1: Load Dataset → import the real dataset from Google Sheets
Step 2: Data Cleaning → handle missing values, duplicates, data types
Step 3: Feature Engineering → create new columns and extract insights
Step 4: Data Analysis → use groupby to analyze sales and profit
Step 5: Final Output → export the cleaned dataset
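The five steps above can be sketched end to end. Since a notebook is needed to fetch the Google Sheet, this runnable version uses an inline stand-in (the rows below are invented Superstore-style data):

```python
import pandas as pd
from io import StringIO

# Step 1: inline stand-in for the Google Sheets CSV (invented rows)
csv = StringIO("""Order Date,Category,City,Sales,Profit
2023-01-05,Furniture,delhi ,500,50
2023-02-10,Technology,Mumbai,1200,300
2023-02-10,Technology,Mumbai,1200,300
2023-03-15,Furniture,pune,,20
""")
df = pd.read_csv(csv)

# Step 2: data cleaning - duplicates, missing values, types, text
df = df.drop_duplicates()
df["Sales"] = pd.to_numeric(df["Sales"], errors="coerce").fillna(0)
df["Order Date"] = pd.to_datetime(df["Order Date"], errors="coerce")
df["City"] = df["City"].str.strip().str.title()

# Step 3: feature engineering
df["Year"] = df["Order Date"].dt.year
df["Profit_Ratio"] = df["Profit"] / df["Sales"].replace(0, float("nan"))

# Step 4: analysis with groupby
summary = df.groupby("Category")["Sales"].sum().reset_index()
print(summary)

# Step 5: export the cleaned dataset
df.to_csv("cleaned_superstore.csv", index=False)
```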
FAQs

Data Cleaning & Pandas Course โ€“ Frequently Asked Questions

Get answers to common questions about data cleaning in python and pandas.

1. What is data cleaning?
Answer: Data cleaning is the process of making raw data clean and usable. It involves fixing missing values, duplicates, and incorrect data.

2. What is Pandas?
Answer: Pandas is a Python library used for data analysis and cleaning. It makes handling CSV files, Excel files, and large datasets easy.

3. Is this course suitable for beginners?
Answer: Yes, this course is designed for beginners. Basic Python knowledge is helpful, but step-by-step guidance is provided.

4. Which tools does a data analyst need?
Answer: Pandas, Excel, SQL, and visualization tools (Power BI / Tableau) are essential.

5. What is Google Colab?
Answer: Google Colab is a free platform where you can run Python code without any installation.

6. Will this course get me a job?
Answer: This course provides a strong foundation. For a job, practicing with projects and tools is essential.
Vista Academy – 316/336, Park Rd, Laxman Chowk, Dehradun – 248001
📞 +91 94117 78145 | 📧 thevistaacademy@gmail.com | 💬 WhatsApp