If you are starting your journey in data cleaning with Python, the first step is understanding what Pandas in Python really is. In this section, we explain the basics of Pandas, DataFrame (df), and why it is the most popular tool for data manipulation and data cleansing in Python.
Answer: Pandas is an open-source Python library used for
data analysis, cleaning, and manipulation. It provides flexible data
structures like Series (1-D) and DataFrame (2-D) to work with
structured data (tables, CSVs, Excel, SQL queries).
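A quick sketch of those two structures (the column names here are illustrative, not from the dataset used later):

```python
import pandas as pd

# Series: a 1-D labeled array
s = pd.Series([10, 20, 30], name="Sales")

# DataFrame: a 2-D table of rows and columns
df = pd.DataFrame({
    "Product": ["Pen", "Book", "Bag"],
    "Sales": [10, 20, 30],
})

print(s.shape)   # (3,)
print(df.shape)  # (3, 2)
```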
Answer: Pandas is mainly used for loading, cleaning, transforming, and analyzing tabular data.
Answer: In simple terms, Pandas is a powerful Python library that helps turn raw data into clean, structured tables so analysts can focus on insights instead of messy datasets.
Answer: In Pandas, df is a common variable name used for
a DataFrame. A DataFrame is a 2D table of rows and columns, just like an
Excel sheet or SQL table.
import pandas as pd
df = pd.read_csv("data.csv")
print(df.head()) # displays first 5 rows
Answer: Pandas is the library, while df
(DataFrame) is the data structure created using Pandas to store and
manipulate tabular data.
Answer: Pandas is the industry standard for Python data cleaning because it offers powerful built-in functions for handling missing data, removing duplicates, data type conversion, and filtering outliers, all with simple, readable syntax that saves hours of manual coding.
Learn advanced techniques for handling missing values, duplicates, and outliers in the next section.
Learn data cleaning in Python with pandas read_csv and df.head(), using real datasets like Google Sheets and the Superstore dataset inside Google Colab.
import pandas as pd
👉 This is the first step in every Python pandas data cleaning project.
import pandas as pd
url = "https://docs.google.com/spreadsheets/d/1zZDBuiFCitgMU-D5XA2G7DVvoGmJ2G6KqXM9sOLpGz0/export?format=csv&gid=1140868527"
df = pd.read_csv(url)
df.head()
👉 Used for loading CSVs and data cleaning in pandas.
df.head(10)
👉 Shows the first rows, important for understanding df.head().
df.isnull().sum()
df.dropna()
df.describe()
👉 Core of data cleaning in pandas.
Practice data cleaning in python pandas using real datasets. Download or open directly in Google Colab.
👉 Now that you understand pandas read_csv and df.head(), it’s time to clean real data.
Next: Handling Missing Values (dropna, fillna) →
Learn how to filter data using a pandas DataFrame. Filtering is one of the most important steps in data cleaning in Python to select specific rows, remove unwanted data, and analyze patterns.
⚠️ Note: Dataset is already loaded as df. Continue using the same DataFrame.
df.head()
👉 Always check data before filtering
# Filter rows where Sales > 500
df[df['Sales'] > 500]
👉 Returns only rows where the condition is True
# Sales > 500 AND Profit > 50
df[(df['Sales'] > 500) & (df['Profit'] > 50)]
👉 & = AND condition
df[['Sales', 'Profit']]
👉 Select only required columns
df.query("Sales > 500 and Profit > 50")
👉 Cleaner and more readable filtering syntax
df['Sales'] > 500 → creates a True/False condition
df[condition] → filters rows
& → AND condition
| → OR condition
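The `|` (OR) operator works the same way as `&`; a minimal sketch on a toy DataFrame (the values here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "Sales": [100, 600, 1200],
    "Profit": [-5, 60, 10],
})

# Rows where Sales > 1000 OR Profit < 0.
# Each condition must be wrapped in parentheses.
result = df[(df["Sales"] > 1000) | (df["Profit"] < 0)]
print(result)  # two rows match: Sales 100 (loss) and Sales 1200
```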
# High value customers
high_sales = df[df['Sales'] > 1000]
print(high_sales.head())
👉 Real data analyst workflow
👉 Next: Learn how to sort data using sort_values() and sort_index()
Next: Sorting Data →
Sorting is a key step in data cleaning in Python. Using pandas sort_values() and sort_index(), you can organize your dataset to find top values, trends, and insights.
⚠️ Note: Dataset is already loaded as df in the previous section. Continue using the same DataFrame.
df.head()
👉 Check dataset before sorting
# Sort by Sales (low → high)
df.sort_values(by='Sales')
👉 Default sorting = ascending
# Sort by Sales (high → low)
df.sort_values(by='Sales', ascending=False)
👉 Find top values easily
# Sort by Category then Sales
df.sort_values(by=['Category','Sales'], ascending=[True, False])
👉 Multi-level sorting
df.sort_index()
👉 Sorts rows by index
sort_values() → Sort by column values
ascending=True → Low to high
ascending=False → High to low
sort_index() → Sort by row index
# Top 5 highest Sales
top_sales = df.sort_values(by='Sales', ascending=False).head(5)
print(top_sales)
👉 Real data analyst workflow
👉 Next: Learn how to detect and remove duplicate data
Next: Handling Duplicates →
Missing data is common in real datasets. In data cleaning in Python, we use fillna() and dropna() to handle missing values effectively in a Pandas DataFrame.
⚠️ Note: Dataset is already loaded as df. Continue using the same DataFrame.
df.isnull().sum()
👉 Shows the number of missing values in each column
# Fill all missing values with 0
df.fillna(0, inplace=True)
👉 Replace missing values when the data is important
# Fill Sales column with mean
df['Sales'] = df['Sales'].fillna(df['Sales'].mean())
👉 Better approach for numerical data (assign back instead of calling inplace on a single column, which is deprecated)
df.dropna()
👉 Remove rows with missing values
df.dropna(axis=1)
👉 axis=1 → drop columns instead of rows
isnull() → Detect missing values
fillna() → Replace missing values
dropna() → Remove missing data
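`dropna()` also takes `subset` and `thresh` parameters for finer control; a small sketch on a toy DataFrame (the column names are illustrative):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "Sales": [100, np.nan, 300],
    "Profit": [10, 20, np.nan],
})

# Drop rows only when 'Sales' is missing
by_subset = df.dropna(subset=["Sales"])

# Keep only rows that have at least 2 non-missing values
by_thresh = df.dropna(thresh=2)

print(len(by_subset))  # 2
print(len(by_thresh))  # 1
```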
# Check missing values
print(df.isnull().sum())
# Fill missing values
df.fillna(0, inplace=True)
# Verify
print(df.isnull().sum())
👉 Real data cleaning workflow
✅ Use fillna() → when the data is important
✅ Use dropna() → when too many values are missing
👉 Next: Learn how to remove duplicate rows in Pandas
Next: Remove Duplicates →
Before cleaning data, you must detect missing values. In data cleaning in Python, we use df.isnull(), sum(), and any() to find missing data in rows and columns.
⚠️ Note: Dataset is already loaded as df. Continue using the same DataFrame.
df.head()
👉 Always inspect data before checking missing values
df.isnull()
👉 Returns True/False values
df.isnull().sum()
👉 Shows missing values per column
df.columns[df.isnull().any()]
👉 Returns columns with missing values
df[df.isnull().any(axis=1)]
👉 Shows rows where values are missing
isnull() → Detect missing values
sum() → Count missing values
any() → Check if any value is missing
axis=1 → Row, axis=0 → Column
# Count missing values
print(df.isnull().sum())
# Show rows with missing values
print(df[df.isnull().any(axis=1)])
👉 Real data cleaning workflow
👉 Next: Learn how to remove duplicate rows in Pandas
Next: Remove Duplicates →
The pandas describe() method is used to summarize numerical data. It helps in data cleaning with pandas by showing the mean, standard deviation, min, max, and quartiles of your dataset.
import pandas as pd
url = "https://raw.githubusercontent.com/plotly/datasets/master/superstore.csv"
df = pd.read_csv(url)
df.head()
👉 Load real dataset for analysis
df.describe()
👉 Generates a statistical summary
df[['Sales','Profit']].describe()
👉 Analyze only required columns
df.describe(include='all')
👉 Includes categorical columns also
count → number of values
mean → average
std → standard deviation
min / max → smallest & largest value
25%, 50%, 75% → quartiles (data distribution)
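If you need cut points other than the default quartiles, `describe()` also accepts a `percentiles` parameter; a quick sketch on a toy column:

```python
import pandas as pd

df = pd.DataFrame({"Sales": [100, 200, 300, 400, 500]})

# Request the 10th and 90th percentiles instead of the defaults
summary = df["Sales"].describe(percentiles=[0.1, 0.9])
print(summary)  # includes count, mean, std, min, 10%, 50%, 90%, max
```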
import pandas as pd
url = "https://docs.google.com/spreadsheets/d/1zZDBuiFCitgMU-D5XA2G7DVvoGmJ2G6KqXM9sOLpGz0/export?format=csv&gid=1140868527"
df = pd.read_csv(url)
# Full summary
print(df.describe())
# Specific columns
print(df[['Sales','Profit']].describe())
👉 Used in real data analyst projects
👉 Next: Learn how to remove duplicate rows in Pandas
Next: Remove Duplicates →
Duplicate data is a common problem in real datasets. In data cleaning in Python, we use duplicated() and drop_duplicates() to identify and remove duplicate rows from a Pandas DataFrame.
⚠️ Note: Dataset is already loaded as df. Continue using the same DataFrame.
df.duplicated().sum()
👉 Counts the number of duplicate rows
df[df.duplicated()]
👉 Shows duplicate records
df.drop_duplicates(inplace=True)
👉 Removes duplicate rows permanently
df.drop_duplicates(subset='Customer Name', inplace=True)
👉 Remove duplicates using a specific column
df.drop_duplicates(keep='last', inplace=True)
👉 keep='first' (default)
duplicated() → Detect duplicates
drop_duplicates() → Remove duplicates
subset → Column used for checking
keep → Choose which record to keep
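To see every copy of a duplicated row (not just the repeats), `keep=False` flags them all; a sketch on a toy DataFrame (names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "Customer": ["Amit", "Riya", "Amit"],
    "Sales": [100, 200, 100],
})

# Default: only the second copy is flagged as a duplicate
print(df.duplicated().sum())  # 1

# keep=False: every copy of a duplicate group is flagged
all_copies = df[df.duplicated(keep=False)]
print(len(all_copies))  # 2
```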
# Count duplicates
print(df.duplicated().sum())
# Remove duplicates
df.drop_duplicates(inplace=True)
# Verify
print(df.duplicated().sum())
👉 Real data cleaning workflow
❓ What is duplicate data?
❓ Difference between duplicated() and drop_duplicates()?
❓ How to remove duplicates using one column?
❓ What does keep='last' do?
👉 Next: Learn how to change data types using astype()
Next: Data Type Conversion →
In real datasets, columns often have incorrect data types. In data cleaning in Python, we use astype(), to_datetime(), and to_numeric() to fix data types for proper analysis.
⚠️ Note: Dataset is already loaded as df. Continue using the same DataFrame.
df.dtypes
👉 Always check types before conversion
# Convert column safely
df['Age'] = df['Age'].astype('Int64')
👉 Use Int64 instead of int to handle missing values
df['Sales'] = pd.to_numeric(df['Sales'], errors='coerce')
👉 Converts text to numbers
df['Order Date'] = pd.to_datetime(df['Order Date'], errors='coerce')
👉 Converts text → datetime
cols = ['Sales', 'Profit']
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')
👉 Safer method than astype() for real data
df.dtypes → Check column types
astype() → Manual type conversion
to_numeric() → Convert safely to numbers
to_datetime() → Convert safely to dates
# Check types
print(df.dtypes)
# Fix numeric column
df['Sales'] = pd.to_numeric(df['Sales'], errors='coerce')
# Fix date column
df['Order Date'] = pd.to_datetime(df['Order Date'], errors='coerce')
# Verify changes
print(df.dtypes)
👉 Real data cleaning workflow
✅ Always check data before converting
✅ Use errors='coerce' to avoid crashes
✅ Use Int64 for columns with missing values
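The effect of `errors='coerce'` can be seen on a toy column that contains a bad value:

```python
import pandas as pd

s = pd.Series(["100", "250", "n/a", "300"])

# Without coercion, "n/a" would raise a ValueError;
# with errors='coerce' it becomes NaN instead
nums = pd.to_numeric(s, errors="coerce")

print(nums.isna().sum())  # 1
print(nums.sum())         # 650.0
```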
👉 Next: Learn how to detect and handle outliers in data
Next: Outlier Detection →
Outliers are extreme values that can distort analysis. In data cleaning in Python, we use the IQR (Interquartile Range) method to detect and remove outliers.
⚠️ Note: Dataset is already loaded as df. Continue using the same DataFrame.
df['Sales'].describe()
👉 Understand the distribution before detecting outliers
Q1 = df['Sales'].quantile(0.25)
Q3 = df['Sales'].quantile(0.75)
👉 Q1 = 25th percentile
IQR = Q3 - Q1
👉 IQR = range of the middle 50% of the data
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
👉 Values outside this range = outliers
outliers = df[(df['Sales'] < lower_bound) | (df['Sales'] > upper_bound)]
👉 Shows rows with extreme values
df = df[(df['Sales'] >= lower_bound) & (df['Sales'] <= upper_bound)]
👉 Keep only the valid data range
Q1 → 25% of the data
Q3 → 75% of the data
IQR → middle range (Q3 - Q1)
Outliers → values outside the normal range
# Calculate quartiles
Q1 = df['Sales'].quantile(0.25)
Q3 = df['Sales'].quantile(0.75)
# Calculate IQR
IQR = Q3 - Q1
# Define bounds
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
# Filter data
df_clean = df[(df['Sales'] >= lower) & (df['Sales'] <= upper)]
print(df_clean.head())
👉 Real data cleaning workflow
❓ What is an outlier?
❓ What is the IQR method?
❓ Why use 1.5 * IQR?
❓ When should you NOT remove outliers?
👉 Next: Learn how to clean text data (lowercase, strip, replace)
Next: Text Cleaning →
Real datasets often contain messy text like extra spaces, inconsistent casing, or wrong spellings. In data cleaning in Python, we use string methods (.str) to clean and standardize text data.
⚠️ Note: Dataset is already loaded as df. Continue using the same DataFrame.
df['City'] = df['City'].str.lower()
👉 "Delhi" → "delhi"
df['City'] = df['City'].str.strip()
👉 "delhi " → "delhi"
df['City'] = df['City'].replace({
    'delhi ': 'delhi',
    'new delhi': 'delhi'
})
👉 Standardize similar values
df['Name'] = df['Name'].str.replace('[^a-zA-Z ]', '', regex=True)
👉 Removes numbers & symbols
df['City'] = df['City'].str.title()
👉 "delhi" → "Delhi"
.str.lower() → convert to lowercase
.str.strip() → remove spaces
.replace() → fix wrong values
.str.replace() → remove patterns
.str.title() → capitalize text
# Clean City column
df['City'] = df['City'].str.lower().str.strip()
# Standardize values
df['City'] = df['City'].replace({
'new delhi': 'delhi'
})
# Format nicely
df['City'] = df['City'].str.title()
print(df['City'].unique())
👉 Real data cleaning workflow
❓ Why is text cleaning important?
❓ Difference between replace() and str.replace()?
❓ How do you remove extra spaces?
❓ How to standardize categories?
👉 Next: Learn how to create new features from existing data
Next: Feature Engineering →
Feature engineering is the process of creating new columns from existing data. It helps transform raw data into meaningful insights for analysis and machine learning.
⚠️ Note: Dataset is already loaded as df. Continue using the same DataFrame.
# Create Profit Ratio
df['Profit_Ratio'] = df['Profit'] / df['Sales']
👉 New feature from existing columns
df['Year'] = df['Order Date'].dt.year
👉 Extract year for trend analysis
df['Month'] = df['Order Date'].dt.month
👉 Useful for monthly reports
# High / Low Sales
df['Sales_Category'] = df['Sales'].apply(
lambda x: 'High' if x > 500 else 'Low'
)
👉 Convert numeric → categorical
# Combine City and State
df['Location'] = df['City'] + ", " + df['State']
👉 Create a meaningful combined feature
New Column → derived from existing data
dt.year → extract year from date
apply() → custom logic on a column
lambda → inline function
# Create new features
df['Profit_Ratio'] = df['Profit'] / df['Sales']
df['Year'] = df['Order Date'].dt.year
# Categorize data
df['Sales_Category'] = df['Sales'].apply(
lambda x: 'High' if x > 500 else 'Low'
)
print(df.head())
👉 Real data analyst workflow
❓ What is feature engineering?
❓ Why create new columns?
❓ What is a lambda function?
❓ How to extract the year from a date?
👉 Next: Learn how to group and aggregate data using groupby()
Next: GroupBy & Aggregation →
GroupBy is used to analyze data by categories. In data analysis with pandas, we use groupby() to summarize data like total sales, average profit, and customer insights.
⚠️ Note: Dataset is already loaded as df. Continue using the same DataFrame.
df.groupby('Category')['Sales'].sum()
👉 Total sales per category
df.groupby('Category')['Profit'].mean()
👉 Average profit per category
df.groupby('Category').agg({
'Sales': 'sum',
'Profit': 'mean'
})
👉 Perform multiple calculations together
df.groupby(['Category','Region'])['Sales'].sum()
👉 Multi-level grouping
df.groupby('Category')['Sales'].sum().reset_index()
👉 Convert the result into a DataFrame
groupby() → group data by category
sum() → total value
mean() → average value
agg() → multiple calculations
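`agg()` also supports named aggregation, which lets you name the output columns directly; a sketch on a toy DataFrame (column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "Category": ["A", "A", "B"],
    "Sales": [100, 200, 300],
    "Profit": [10, 20, 30],
})

# Named aggregation: new_column=(input_column, function)
summary = df.groupby("Category").agg(
    total_sales=("Sales", "sum"),
    avg_profit=("Profit", "mean"),
)
print(summary)
```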
# Total sales per category
sales_by_cat = df.groupby('Category')['Sales'].sum()
# Average profit
profit_avg = df.groupby('Category')['Profit'].mean()
print(sales_by_cat)
print(profit_avg)
👉 Real business analysis workflow
❓ What is groupby in pandas?
❓ Difference between sum() and mean()?
❓ How to group by multiple columns?
❓ What is agg() used for?
👉 Next: Learn how to export cleaned data to CSV or Excel
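A minimal sketch of that export step, using to_csv() (the filename is illustrative; to_excel() works the same way but needs the openpyxl package installed):

```python
import pandas as pd

df = pd.DataFrame({"Category": ["A", "B"], "Sales": [100, 200]})

# Export to CSV; index=False drops the row-index column
df.to_csv("cleaned_data.csv", index=False)

# df.to_excel("cleaned_data.xlsx", index=False)  # requires openpyxl

# Read it back to confirm the round trip
check = pd.read_csv("cleaned_data.csv")
print(check.shape)  # (2, 2)
```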
Next: Export Data →
Now it’s time to apply everything you learned. This project covers data cleaning, feature engineering, and data analysis using a real dataset in Google Colab.
👉 Click the button → Run all cells → Practice step by step
❓ Which category has the highest sales?
❓ Which region generates maximum profit?
❓ Who are the top customers?
❓ Monthly sales trends?
✅ Complete data cleaning workflow
✅ Real-world data analysis
✅ Industry-level Pandas skills
✅ Portfolio-ready project
🎉 Congratulations! You have completed the Data Cleaning & Analysis Course
Explore More Projects →
Get answers to common questions about data cleaning in Python and pandas.
