🔍 Master Data Cleaning in Python and Charting with Matplotlib & Seaborn

Learn data cleaning in Python using powerful libraries like Pandas and NumPy. This beginner-friendly tutorial covers how to clean datasets, handle missing values, and prepare your data for in-depth analysis.

🧹 From filling missing values to correcting invalid entries, we’ll walk you through practical data cleaning techniques step by step, using the Python libraries that professionals rely on in real-world projects.

📊 Once your data is clean, learn how to build visualizations with Matplotlib and Seaborn. We’ll use both libraries to create bar charts, histograms, and pie charts, including autopct='%1.1f%%' for percentage labels on pie slices.

🎓 Whether you’re starting out or enhancing your skills, this complete Python data cleaning tutorial will help you visualize your cleaned dataset with ease. Let’s turn raw data into insightful stories!

🐼 Importing Pandas for Data Cleaning in Python

After installing the Pandas library, you must import it into your Python script or Jupyter Notebook before using its powerful data cleaning features. The standard convention is to import it as pd:

import pandas as pd

This step is essential for using Pandas’ methods to clean datasets in Python — like handling missing values, fixing data types, filtering rows, and much more. If you’re looking to learn how to clean datasets using Pandas, this is the very first line you’ll need!

📂 Reading Excel or CSV Files Using Pandas

One of the first steps in data cleaning using Python is to load your dataset. Pandas allows you to read data from multiple sources including CSV, Excel, SQL databases, and more.

In this example, we’ll load a messy dataset from an Excel file using pd.read_excel():

import pandas as pd

df = pd.read_excel(r'C:\Users\yogesh\Desktop\excel files for practice\messeydataset.xlsx')

# Display the DataFrame
print(df)

This command loads your dataset into a Pandas DataFrame, allowing you to begin the data preprocessing and cleaning process. For CSV files, simply use pd.read_csv('filename.csv') instead.
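For readers without the Excel file at hand, here is a minimal sketch of the same loading step with pd.read_csv(). The CSV text is a hypothetical three-row sample rather than the tutorial’s dataset, and io.StringIO stands in for a file on disk.

```python
import io
import pandas as pd

# Hypothetical sample standing in for a CSV file on disk.
csv_text = """Name,Age,Gender,City,Salary
John,30,Male,LA,50000
Alex,25,Female,San Fran,60000
Bob,,Male,LA,55000
"""

# read_csv accepts a file path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.head())        # first rows of the DataFrame
print(sample.shape)         # (rows, columns)
print(sample.isna().sum())  # number of missing values per column
```

Checking shape and isna().sum() right after loading gives a quick picture of how much cleaning the dataset needs.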

🏷️ How to View Column Names in a Pandas DataFrame

When working on data cleaning in Python, it’s important to understand the structure of your dataset. To retrieve all the column names of a DataFrame, use the df.columns attribute in Pandas.

print(df.columns)

This command returns a Pandas Index object containing the labels of each column in your dataset.

Index(['Name', 'Age', 'Gender', 'City', 'Salary'], dtype='object')

Use this technique to inspect, rename, or select columns during the data preprocessing or data cleaning phase in your Python project.
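As a small illustration of renaming columns during inspection, the sketch below tidies untidy headers. The DataFrame and its header names are hypothetical and are not taken from the tutorial’s dataset.

```python
import pandas as pd

# Hypothetical frame with inconsistent headers.
raw = pd.DataFrame({" name ": ["John"], "AGE": [30], "City ": ["LA"]})

# Strip surrounding whitespace and apply consistent capitalization.
raw.columns = raw.columns.str.strip().str.title()

print(list(raw.columns))  # ['Name', 'Age', 'City']
```

Cleaning headers first means later code can refer to df['Name'] or df['Age'] without worrying about stray spaces or mixed case.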

🚫 Handling Missing Data with `fillna()` in Pandas

During data cleaning in Python, missing values are a common issue. One of the most popular techniques to handle this is using the fillna() method provided by Pandas.

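df['Name'] = df['Name'].fillna('Unknown')

This line replaces all missing (NaN) values in the 'Name' column of the DataFrame df with the word ‘Unknown’. It is an efficient way to fill gaps when you want to retain rows without deleting them. Assigning the result back to the column is more reliable than calling fillna(..., inplace=True) on a selected column, which recent Pandas versions warn about and which may leave df unchanged.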

Use this method when performing data cleaning using Pandas to make your datasets ready for visualization or machine learning.

print(df.Name)

Output

0        John
1        Alex
2         Bob
3        Mary
4        Alex
5     Unknown
6         Tim
7     Unknown
8       Alice
9         Sam
10       navi
11      Sarah
12       roja
13       Mike
14     Olivia
15    Unknown
16       maya
17        Zoe
18    Unknown
Name: Name, dtype: object
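To confirm that a fill worked, it helps to count missing values before and after. The following is a minimal sketch on a hypothetical four-row column, not the tutorial’s data.

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing names.
names = pd.DataFrame({"Name": ["John", np.nan, "Tim", None]})

missing_before = names["Name"].isna().sum()

# Assign the filled column back to the DataFrame.
names["Name"] = names["Name"].fillna("Unknown")

missing_after = names["Name"].isna().sum()

print(missing_before, missing_after)  # 2 0
```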

🧮 Filling Missing Values with the Mean in Pandas

One of the most common techniques in data cleaning using Pandas is replacing missing numerical values with the mean of that column. This is especially useful when you don’t want to drop rows and prefer to retain data consistency.

mean_age = df['Age'].mean()
df['Age'] = df['Age'].fillna(mean_age)
print(df.Age)

🔹 df['Age'].mean() calculates the average age from the available (non-null) values.
🔹 df['Age'] = df['Age'].fillna(mean_age) fills the missing values and assigns the result back to the column in df.

✅ This is an efficient way to fill missing values with mean in Pandas while preserving the shape of your dataset.

📤 Output:

0     30.000000
1     25.000000
2     29.071429
3     32.000000
4     28.000000
5     29.071429
6     29.000000
7     31.000000
8     27.000000
9     29.071429
10    35.000000
11    33.000000
12    29.071429
13    26.000000
14    24.000000
15    28.000000
16    29.071429
17    30.000000
18    29.000000
Name: Age, dtype: float64
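The mean is sensitive to extreme values, so the median is sometimes used as the fill value instead. The sketch below compares the two on hypothetical ages that include one outlier (95); the numbers are illustrative and do not come from the tutorial’s dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one extreme value.
ages = pd.Series([30, 25, np.nan, 32, 28, np.nan, 95], name="Age")

mean_age = ages.mean()      # NaN is skipped: (30 + 25 + 32 + 28 + 95) / 5
median_age = ages.median()  # middle value of 25, 28, 30, 32, 95

print(mean_age)    # 42.0
print(median_age)  # 30.0

# Fill with the median so the outlier does not distort the imputed values.
filled = ages.fillna(median_age)
print(filled.isna().sum())  # 0
```

Which statistic to use depends on the data: for roughly symmetric columns the two are close, while for skewed columns the median is usually the safer choice.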

🧠 Filling Missing Categorical Data with Mode in Pandas

When cleaning categorical columns like Gender, it’s common to replace missing values with the mode — the most frequent category. Here’s how to do that using Pandas:

mode_gender = df['Gender'].mode().iloc[0]

# Fill missing 'Gender' values with the mode (most frequent) gender
df['Gender'] = df['Gender'].fillna(mode_gender)

# Print the column to see the updated values
print(df['Gender'])

📝 Step-by-step Explanation:

mode_gender = df['Gender'].mode().iloc[0] → Finds the most frequent value (mode) in the ‘Gender’ column. mode() returns a Series because several values can tie, so .iloc[0] picks the first one.
df['Gender'] = df['Gender'].fillna(mode_gender) → Replaces all missing values in the column with the mode and assigns the result back to df.
print(df['Gender']) → Displays the updated ‘Gender’ column with no null values.

📤 Output:

0       Male
1     Female
2       Male
3     Female
4     Female
5     Female
6       Male
7     Female
8     Female
9     Female
10      Male
11    Female
12    Female
13      Male
14    Female
15    Female
16      Male
17    Female
18    Female
Name: Gender, dtype: object

✅ Filling with the mode is a common way to handle missing categorical data in Pandas. It avoids dropping valuable rows, although it also makes the most frequent category slightly more dominant.
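One detail worth knowing is that mode() returns every value tied for most frequent, sorted, which is why the first element has to be selected explicitly. A minimal sketch on a hypothetical column with a tie:

```python
import pandas as pd

# Hypothetical column where 'Male' and 'Female' each appear twice.
gender = pd.Series(["Male", "Female", None, "Female", "Male"], name="Gender")

modes = gender.mode()  # missing values are ignored; ties are all returned
print(list(modes))     # ['Female', 'Male']

# Take the first mode as the fill value.
mode_gender = modes.iloc[0]
filled = gender.fillna(mode_gender)
print(filled.tolist())  # ['Male', 'Female', 'Female', 'Female', 'Male']
```

When there is a tie, the choice of fill value is arbitrary, so it is worth checking len(modes) before relying on it.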

🛠️ Replacing Incorrect Categorical Values in Pandas

During data cleaning in Python, you may encounter wrong values in categorical fields, such as 'Seattle' appearing in a 'Gender' column. Once you know what the correct value should be, you can fix the entry with the .replace() method in Pandas.

df['Gender'] = df['Gender'].replace({'Seattle': 'Female'})
print(df['Gender'])

✅ Output After Replacement:

0       Male
1     Female
2       Male
3     Female
4     Female
5     Female
6       Male
7     Female
8     Female
9     Female
10      Male
11    Female
12    Female
13      Male
14    Female
15    Female
16      Male
17    Female
18    Female
Name: Gender, dtype: object

🔁 Use replace() for fast and targeted replacements in Pandas DataFrames, especially when correcting mislabeled or dirty categorical data.
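Before replacing anything, it can help to list which values fall outside the expected categories. The sketch below assumes the valid set is {'Male', 'Female'} and uses a hypothetical four-row column; when the correct value is unknown, it marks the entry as missing instead of guessing.

```python
import pandas as pd

# Hypothetical column containing one misplaced value.
gender = pd.Series(["Male", "Female", "Seattle", "Female"], name="Gender")

valid = {"Male", "Female"}          # assumed set of allowed categories
invalid_mask = ~gender.isin(valid)  # True where the value is not allowed

print(gender[invalid_mask].unique())  # ['Seattle']

# Keep valid entries and turn invalid ones into missing values.
cleaned = gender.where(~invalid_mask)
print(cleaned.isna().sum())  # 1
```

The missing entries can then be handled with fillna() like any other gap in the column.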

🌆 Filling Missing City Names with Mode in Pandas

When working on data cleaning in Python, it’s common to encounter missing values in columns like 'City'. One effective way to handle this is by replacing nulls with the most frequent city using the mode() function in Pandas.

mode_city = df['City'].mode().iloc[0]

# Fill missing 'City' values with the mode (most frequent) city
df['City'] = df['City'].fillna(mode_city)

# Print the column to see the updated values
print(df['City'])

📝 Step-by-Step Explanation:

mode_city = df['City'].mode().iloc[0] → Finds the most frequent city in the column.
df['City'] = df['City'].fillna(mode_city) → Replaces all NaN values with that city and assigns the result back to df.
print(df['City']) → Displays the cleaned column.

📤 Output:

0           LA
1     San Fran
2           LA
3      Chicago
4      Houston
5      Chicago
6           LA
7      Houston
8       Boston
9        Miami
10       Miami
11     Houston
12    new york
13     Houston
14     Houston
15     Houston
16     Houston
17      Dallas
18      Dallas
Name: City, dtype: object

✅ This technique is great for cleaning categorical columns such as city names, ensuring your dataset is ready for accurate visualization or machine learning analysis.
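The output above still contains 'new york' in lowercase, and differences in case or spacing make Pandas treat the same city as separate categories. The following sketch shows one way to normalize such text; the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical city names with inconsistent case and spacing.
city = pd.Series(["Houston", "houston ", "new york", "New York"], name="City")

print(city.nunique())  # 4 distinct spellings

# Remove surrounding whitespace, then capitalize each word.
normalized = city.str.strip().str.title()

print(normalized.nunique())  # 2
# Note: str.title() would turn an abbreviation such as 'LA' into 'La',
# so abbreviations need a separate replace() mapping.
```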

💰 Filling Missing Salary Values with Mean in Pandas

Missing salary values can affect calculations and insights. A common technique in data cleaning using Python is to replace missing values in a numerical column like 'Salary' with the column’s mean. Here’s how to do it with Pandas:

mean_salary = df['Salary'].mean()

# Fill missing 'Salary' values with the mean salary
df['Salary'] = df['Salary'].fillna(mean_salary)

# Print the column to see the updated values
print(df['Salary'])

📤 Output:

0     50000.0
1     60000.0
2     55000.0
3     50375.0
4     52000.0
5     45000.0
6     50375.0
7     48000.0
8     52000.0
9     42000.0
10    49000.0
11    51000.0
12    54000.0
13    47000.0
14    50375.0
15    52000.0
16    49000.0
17    51000.0
18    49000.0
Name: Salary, dtype: float64

✅ This method is a common way to handle missing salary data in Pandas, especially when you want to keep every row for further analysis or modeling. Keep in mind that a few very high or very low salaries can pull the mean away from a typical value.
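The column-by-column fills shown so far can also be written as a single call, because DataFrame.fillna() accepts a dictionary that maps column names to fill values. A minimal sketch on a hypothetical three-row DataFrame:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one gap in each column.
people = pd.DataFrame({
    "Name": ["John", None, "Tim"],
    "Age": [30.0, np.nan, 29.0],
    "Salary": [50000.0, 60000.0, np.nan],
})

# One fill value per column, computed from the non-missing entries.
fill_values = {
    "Name": "Unknown",
    "Age": people["Age"].mean(),
    "Salary": people["Salary"].mean(),
}

people = people.fillna(fill_values)
print(people)
print(people.isna().sum().sum())  # 0 missing values remain
```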

📊 Creating a Gender Distribution Bar Chart in Python

Once your data is cleaned, you can use Seaborn and Matplotlib to create beautiful visualizations. Here’s how to build a simple bar chart showing the distribution of Gender in your dataset:

# Bar chart of Gender distribution:
import matplotlib.pyplot as plt
import seaborn as sns

sns.countplot(x='Gender', data=df)
plt.title('Gender Distribution')
plt.show()

📝 Step-by-Step Explanation:

import matplotlib.pyplot as plt → Imports the Matplotlib library for creating charts.
import seaborn as sns → Imports Seaborn, a high-level API built on Matplotlib for statistical graphics.
sns.countplot(x='Gender', data=df) → Plots a bar chart of frequency counts for the ‘Gender’ column.
plt.title('Gender Distribution') → Adds a title to the chart.
plt.show() → Renders and displays the final chart.

✅ Make sure your 'Gender' column is cleaned (no nulls or invalid values) before plotting. This chart is helpful for understanding class imbalance in datasets and is often used during exploratory data analysis (EDA).
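Because Seaborn builds on Matplotlib, a similar bar chart can be produced with Matplotlib alone by counting the categories first. The sketch below uses a hypothetical five-row column and is meant only to show what sns.countplot() saves you from writing.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical cleaned column.
gender = pd.Series(["Male", "Female", "Female", "Male", "Female"], name="Gender")

# countplot does this counting step internally.
counts = gender.value_counts()

fig, ax = plt.subplots()
bars = ax.bar(counts.index.tolist(), counts.values)
ax.set_xlabel("Gender")
ax.set_ylabel("Count")
ax.set_title("Gender Distribution")

# In a notebook or interactive session, display the chart with plt.show().
```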

📉 Plotting Salary Distribution Using Histogram in Matplotlib

A histogram is a great tool for visualizing the distribution of numerical data. Below is a simple way to plot a salary distribution chart using Matplotlib in Python:

plt.hist(df['Salary'], bins=10, edgecolor='black')
plt.xlabel('Salary')
plt.ylabel('Count')
plt.title('Salary Distribution')
plt.show()

📝 Step-by-Step Explanation:

plt.hist(df['Salary'], bins=10, edgecolor='black') → Creates the histogram of salaries. Divides data into 10 bins with visible black edges for clarity.
plt.xlabel('Salary') → Labels the x-axis as ‘Salary’.
plt.ylabel('Count') → Labels the y-axis as ‘Count’, representing the frequency in each bin.
plt.title('Salary Distribution') → Sets the chart title.
plt.show() → Renders and displays the plot.

✅ This method is commonly used during exploratory data analysis (EDA) to understand salary ranges, detect outliers, or examine data skewness.
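To see the numbers behind a histogram, NumPy’s histogram() function returns the bin counts and bin edges without drawing anything. The salaries below are hypothetical, and three bins are used only to keep the output short.

```python
import numpy as np

# Hypothetical salaries.
salaries = np.array([42000, 45000, 47000, 48500, 49000,
                     50000, 51000, 52000, 55000, 60000])

# Split the range 42000 to 60000 into 3 equal-width bins.
counts, edges = np.histogram(salaries, bins=3)

print(counts)  # [3 5 2]
print(edges)   # [42000. 48000. 54000. 60000.]
```

Each count is the height of one bar, and each pair of neighbouring edges marks where that bar starts and ends.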

📊 Age Distribution Histogram Using Matplotlib

Visualizing the age distribution in your dataset is a great way to understand demographic trends. Here’s how to plot a histogram of the Age column using Matplotlib:

plt.hist(df['Age'], bins=10, edgecolor='black')
plt.xlabel('Age')
plt.ylabel('Count')
plt.title('Age Distribution')
plt.show()

📝 Code Breakdown:

plt.hist(df['Age'], bins=10, edgecolor='black') → Plots the histogram for age values with 10 bins and black borders.
plt.xlabel('Age') → Labels the x-axis as “Age”.
plt.ylabel('Count') → Labels the y-axis to show how many entries fall into each age range.
plt.title('Age Distribution') → Adds the chart title.
plt.show() → Displays the plot.

✅ This histogram helps you quickly identify which age groups are most represented in your dataset — an essential step during exploratory data analysis (EDA).

🥧 Visualizing City Distribution with a Pie Chart in Python

Pie charts are a great way to visualize categorical data like the distribution of users across different cities. Below is an example of how to plot a City Distribution Pie Chart using Matplotlib and Pandas:

city_counts = df['City'].value_counts()
plt.pie(city_counts, labels=city_counts.index, autopct='%1.1f%%', startangle=90)
plt.title('City Distribution')
plt.axis('equal')
plt.show()

📝 Step-by-Step Explanation:

df['City'].value_counts() → Counts occurrences of each city.
plt.pie(...) → Plots the pie chart using these counts.
labels=city_counts.index → Uses city names as labels for each slice.
autopct='%1.1f%%' → Shows percentage values on each slice with 1 decimal point.
startangle=90 → Rotates the start of the first slice to 12 o’clock for aesthetic balance.
plt.axis('equal') → Ensures a perfect circle pie chart.
plt.show() → Displays the final visualization.

✅ This pie chart is especially useful for quickly identifying the city with the highest user base or customer concentration in your dataset — perfect for dashboards and reports.
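Pie charts get hard to read when many categories each have only a few rows. One option, sketched here with hypothetical data and an assumed threshold of 2, is to merge rare cities into an 'Other' slice before plotting.

```python
import pandas as pd

# Hypothetical city column.
city = pd.Series(
    ["Houston", "Houston", "Houston", "LA", "LA", "Boston", "Miami"],
    name="City",
)

counts = city.value_counts()
min_count = 2  # assumed threshold; cities below it are merged

# Keep frequent cities and add the combined total of the rare ones.
grouped = counts[counts >= min_count].copy()
grouped["Other"] = counts[counts < min_count].sum()

print(grouped.to_dict())  # {'Houston': 3, 'LA': 2, 'Other': 2}

# Percentage share of each slice.
shares = grouped / grouped.sum() * 100
print(shares.round(1).to_dict())
```

The grouped Series can be passed to plt.pie() in place of city_counts, with labels=grouped.index.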

🧠 Frequently Asked Questions (FAQs)

❓ What is the difference between data cleaning and data cleansing?

Both terms are often used interchangeably. However, data cleaning usually refers to removing or correcting inaccurate records, while data cleansing can include more holistic processes like standardization and enrichment of data.

❓ What are common tools for data cleaning in Python?

Popular Python libraries for data cleaning include Pandas (handling missing data, filtering) and NumPy (handling numerical arrays). OpenRefine is also widely used for data standardization, but it is a standalone application rather than a Python library.

❓ How to clean a messy dataset in Python?

You can use Pandas to identify missing values, fill them using fillna(), fix invalid entries with replace(), remove duplicates using drop_duplicates(), and standardize text with str.lower(). This tutorial focuses on fillna() and replace().
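As a brief sketch of two of these steps, the example below standardizes text and then removes duplicate rows. The records are hypothetical.

```python
import pandas as pd

# Hypothetical records where the same person appears with different spellings.
records = pd.DataFrame({
    "Name": ["Alex", "alex ", "Bob", "Alex"],
    "City": ["LA", "LA", "Miami", "LA"],
})

# Standardize the text first so that equal names compare as equal.
records["Name"] = records["Name"].str.strip().str.lower()

# Drop rows that are exact duplicates of an earlier row.
deduped = records.drop_duplicates().reset_index(drop=True)

print(deduped)
print(len(deduped))  # 2
```

The order matters here: without the standardization step, 'Alex' and 'alex ' would count as different values and only one duplicate would be removed.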

❓ What is the use of `autopct='%1.1f%%'` in pie charts?

This argument in Matplotlib’s plt.pie() displays percentage values on each slice. '%1.1f%%' shows percentages with one decimal place (e.g., 25.0%).
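The format string follows Python’s percent-style formatting, so it can be tried out without drawing a chart. In this small sketch the percentage values are arbitrary examples.

```python
# Arbitrary example percentages.
percentages = [25, 33.3333, 12.5]

# '%1.1f' formats a number with one decimal place; '%%' prints a literal '%'.
labels = ['%1.1f%%' % p for p in percentages]
print(labels)  # ['25.0%', '33.3%', '12.5%']

# '%.1f%%' produces the same labels.
same = ['%.1f%%' % p for p in percentages]
print(labels == same)  # True
```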

❓ What is the difference between Matplotlib and Seaborn?

Matplotlib is a low-level plotting library that gives you full control over visualizations. Seaborn is built on top of Matplotlib and provides a high-level API for more attractive and statistical plots with less code.

❓ How to visualize data using Matplotlib and Seaborn?

Use plt.plot(), plt.bar(), plt.hist() in Matplotlib. In Seaborn, use sns.countplot(), sns.boxplot(), or sns.heatmap() for powerful statistical plots.

❓ Which is better for data visualization — Matplotlib or Seaborn?

Use Seaborn for quick and beautiful plots. Use Matplotlib when you need complete control and customization. Often, they are used together in data science projects.

Vista Academy – 316/336, Park Rd, Laxman Chowk, Dehradun – 248001
📞 +91 94117 78145 | 📧 thevistaacademy@gmail.com | 💬 WhatsApp