
50 SQL Commands for Data Cleaning in Data Analytics [2025 Guide]
Data cleaning is the foundation of data analytics. Without clean, structured data, even the most advanced models fail to deliver accurate insights. In this guide, we bring you the Top 50 SQL commands every data analyst must master to handle duplicates, missing values, formatting errors, and more. Updated for 2025, this guide is your step-by-step SQL data cleaning handbook.
Why SQL for Data Cleaning?
SQL (Structured Query Language) is not just for querying data; it is a powerful tool for cleaning and preparing datasets. From removing duplicates to handling NULL values, fixing text, correcting dates, and restructuring tables, SQL provides a robust foundation for data analysts and business intelligence professionals.
Accuracy: Remove duplicates, correct errors, and ensure reliable analysis.
Efficiency: Clean millions of records in seconds using optimized SQL queries.
Standardization: Format text, dates, and numbers consistently across datasets.
Section 1: Removing Duplicates in SQL
In data analytics, duplicate records are a common issue. They can distort results, inflate metrics, and mislead decision-making. SQL provides powerful commands to detect and remove duplicates effectively. Let's clean a sample Sales table.
Step 1: Create Table & Insert Data
CREATE TABLE Sales (
    order_id INT,
    customer_id INT,
    order_date DATE
);
INSERT INTO Sales VALUES
    (1, 101, '2025-01-10'),
    (2, 101, '2025-01-10'), -- Duplicate
    (3, 102, '2025-01-12'),
    (4, 103, '2025-01-14'),
    (5, 103, '2025-01-14'); -- Duplicate
⚠️ The original Sales table contains duplicate customer_id values.
order_id | customer_id | order_date |
---|---|---|
1 | 101 | 2025-01-10 |
2 | 101 | 2025-01-10 |
3 | 102 | 2025-01-12 |
4 | 103 | 2025-01-14 |
5 | 103 | 2025-01-14 |
1️⃣ SELECT DISTINCT
SELECT DISTINCT customer_id FROM Sales;
customer_id |
---|
101 |
102 |
103 |
2️⃣ GROUP BY + HAVING
SELECT customer_id, COUNT(*) AS duplicate_count FROM Sales GROUP BY customer_id HAVING COUNT(*) > 1;
customer_id | duplicate_count |
---|---|
101 | 2 |
103 | 2 |
3️⃣ DELETE Duplicate Rows
DELETE FROM Sales WHERE rowid NOT IN ( SELECT MIN(rowid) FROM Sales GROUP BY customer_id );
Note: rowid is available in SQLite and Oracle; in SQL Server or MySQL, use the ROW_NUMBER() approach in 5️⃣ below.
Remaining Rows
order_id | customer_id | order_date |
---|---|---|
1 | 101 | 2025-01-10 |
3 | 102 | 2025-01-12 |
4 | 103 | 2025-01-14 |
4️⃣ ROW_NUMBER() Window
SELECT customer_id, ROW_NUMBER() OVER( PARTITION BY customer_id ORDER BY order_date ) AS row_num FROM Sales;
customer_id | row_num |
---|---|
101 | 1 |
101 | 2 |
103 | 1 |
103 | 2 |
5️⃣ CTE + ROW_NUMBER()
WITH duplicates AS (
    SELECT *, ROW_NUMBER() OVER(
        PARTITION BY customer_id ORDER BY order_date
    ) AS row_num
    FROM Sales
)
DELETE FROM duplicates WHERE row_num > 1;
Remaining Rows
order_id | customer_id | order_date |
---|---|---|
1 | 101 | 2025-01-10 |
3 | 102 | 2025-01-12 |
4 | 103 | 2025-01-14 |
Section 3: Standardizing Data (Formatting)
Messy data often contains extra spaces, inconsistent capitalization, and spelling mistakes. SQL provides handy functions to standardize data into a clean and consistent format. Let's clean our Products table step by step.
Step 1: Create Table & Insert Data
CREATE TABLE Products (
    product_id INT,
    product_name VARCHAR(50),
    category VARCHAR(50)
);
INSERT INTO Products VALUES
    (1, ' Laptop', 'ELECTRONICS '),
    (2, 'MOBILE ', 'electronics'),
    (3, 'HeadPhones', ' Electronics'),
    (4, 'tablet', 'ELECTRONICS'),
    (5, ' CAMERA ', 'Electronics');
⚠️ The Products table has extra spaces, mixed cases, and inconsistent category names.
product_id | product_name | category |
---|---|---|
1 | Laptop | ELECTRONICS |
2 | MOBILE | electronics |
3 | HeadPhones | Electronics |
4 | tablet | ELECTRONICS |
5 | CAMERA | Electronics |
1️⃣ TRIM() – Remove Extra Spaces
SELECT product_id, TRIM(product_name) AS clean_name FROM Products;
Before | After |
---|---|
" Laptop" | "Laptop" |
"MOBILE " | "MOBILE" |
2️⃣ UPPER() – Convert to Uppercase
SELECT product_id, UPPER(category) AS clean_category FROM Products;
Before | After |
---|---|
"electronics" | "ELECTRONICS" |
" Electronics" | " ELECTRONICS" |
Note: UPPER() does not remove spaces; combine it with TRIM() for fully clean values.
3️⃣ LOWER() – Convert to Lowercase
SELECT product_id, LOWER(product_name) AS clean_name FROM Products;
Before | After |
---|---|
"HeadPhones" | "headphones" |
" CAMERA " | " camera " |
4️⃣ INITCAP() – Capitalize Words
SELECT INITCAP(product_name) AS formatted_name FROM Products;
Note: INITCAP() exists in Oracle and PostgreSQL; SQL Server and MySQL need UPPER()/LOWER() combinations instead.
Before | After |
---|---|
"tablet" | "Tablet" |
"MOBILE " | "Mobile " |
5️⃣ REPLACE() – Fix Wrong Text
SELECT product_id, REPLACE(category, 'Electronics', 'ELECTRONICS') AS fixed_category FROM Products;
Before | After |
---|---|
"Electronics" | "ELECTRONICS" |
" Electronics" | " ELECTRONICS" |
Section 4: String Cleaning & Manipulation
String data often contains extra characters, inconsistent text, or wrong substrings. SQL offers powerful string functions to clean, extract, and format text values. If you've already practiced data cleaning with Pandas, you'll notice many functions are quite similar in SQL as well. Let's work on a sample Customers table.
Step 1: Create Table & Insert Data
CREATE TABLE Customers (
    cust_id INT,
    cust_name VARCHAR(50),
    phone VARCHAR(20),
    email VARCHAR(50)
);
INSERT INTO Customers VALUES
    (1, ' AMIT SHARMA ', ' 9876543210 ', 'amit.sharma@ gmail.com'),
    (2, 'rohit KUMAR', '91-9998887777', 'rohit.kumar@@yahoo.com'),
    (3, 'PRIYA ', '8887776666', ' priyaverma@gmail.com '),
    (4, 'sneha', '999-111-222', 'sneha123@outlook. com');
cust_id | cust_name | phone | email |
---|---|---|---|
1 | " AMIT SHARMA " | " 9876543210 " | "amit.sharma@ gmail.com" |
2 | "rohit KUMAR" | "91-9998887777" | "rohit.kumar@@yahoo.com" |
3 | "PRIYA " | "8887776666" | " priyaverma@gmail.com " |
4 | "sneha" | "999-111-222" | "sneha123@outlook. com" |
1️⃣ TRIM() – Remove Extra Spaces
SELECT cust_id, TRIM(cust_name) AS clean_name FROM Customers;
Before | After |
---|---|
" AMIT SHARMA " | "AMIT SHARMA" |
"PRIYA " | "PRIYA" |
2️⃣ UPPER() & LOWER() – Fix Case
SELECT cust_id, UPPER(cust_name) AS upper_name, LOWER(cust_name) AS lower_name FROM Customers;
Before | UPPER() | LOWER() |
---|---|---|
"rohit KUMAR" | "ROHIT KUMAR" | "rohit kumar" |
"sneha" | "SNEHA" | "sneha" |
3️⃣ SUBSTRING() – Extract Text
SELECT cust_id, SUBSTRING(phone, 1, 10) AS clean_phone FROM Customers;
Before | After |
---|---|
"91-9998887777" | "91-9998887" |
"999-111-222" | "999-111-22" |
Note: taking the first 10 characters only truncates the string; strip separators first with REPLACE() (see 5️⃣) if you need a clean 10-digit number.
4️⃣ REPLACE() – Fix Wrong Characters
SELECT cust_id, REPLACE(email, ' ', '') AS clean_email FROM Customers;
Before | After |
---|---|
"amit.sharma@ gmail.com" | "amit.sharma@gmail.com" |
"sneha123@outlook. com" | "sneha123@outlook.com" |
5️⃣ CONCAT() – Merge Fields
SELECT cust_id, CONCAT(cust_name, ' - ', phone) AS full_contact FROM Customers;
Before | After |
---|---|
"AMIT SHARMA" + "9876543210" | "AMIT SHARMA - 9876543210" |
"PRIYA" + "8887776666" | "PRIYA - 8887776666" |
Mastering these string manipulation commands in SQL is a must for every analyst. Once you are comfortable with them, you can apply the same principles in tools like Power BI or Python to achieve seamless data preparation.
Section 5: Numeric Cleaning & Transformation
Numbers are the backbone of analytics, but raw data often contains inconsistent decimal places, negative values, or rounding issues. SQL provides handy functions to round, convert, and validate numeric data. If you've explored our guide on linear regression in machine learning, you already know how important clean numbers are for accurate predictions.
Step 1: Create Table & Insert Data
CREATE TABLE SalesData (
    sale_id INT,
    amount DECIMAL(10,4),
    discount DECIMAL(5,2)
);
INSERT INTO SalesData VALUES
    (1, 2500.4567, 5.5),
    (2, -300.9876, 10.0),
    (3, 1999.9999, 0.0),
    (4, 150.75, NULL),
    (5, 100.1234, 2.25);
⚠️ The SalesData table contains extra decimal places, negative values, and NULL discounts.
sale_id | amount | discount |
---|---|---|
1 | 2500.4567 | 5.5 |
2 | -300.9876 | 10.0 |
3 | 1999.9999 | 0.0 |
4 | 150.75 | NULL |
5 | 100.1234 | 2.25 |
1️⃣ ROUND() – Fix Decimal Places
SELECT sale_id, ROUND(amount, 2) AS rounded_amount FROM SalesData;
Before | After |
---|---|
2500.4567 | 2500.46 |
1999.9999 | 2000.00 |
2️⃣ ABS() – Handle Negative Values
SELECT sale_id, ABS(amount) AS positive_amount FROM SalesData;
Before | After |
---|---|
-300.9876 | 300.9876 |
3️⃣ CEIL() & FLOOR()
SELECT sale_id, CEIL(amount) AS ceil_value, FLOOR(amount) AS floor_value FROM SalesData;
Before | CEIL() | FLOOR() |
---|---|---|
150.75 | 151 | 150 |
100.1234 | 101 | 100 |
4️⃣ NULLIF() – Handle Divide by Zero
SELECT sale_id, amount / NULLIF(discount, 0) AS safe_division FROM SalesData;
Before | After |
---|---|
1999.9999 ÷ 0.0 = Error | 1999.9999 ÷ NULL = NULL |
5️⃣ CAST() – Convert Data Type
SELECT sale_id, CAST(amount AS INT) AS int_amount FROM SalesData;
Before | After |
---|---|
2500.4567 | 2500 |
100.1234 | 100 |
With these functions, numeric columns become analysis-ready. Clean numbers are crucial not only for SQL queries but also for advanced models in neural networks and deep learning, where even small decimal inconsistencies can affect predictions.
Section 6: Date & Time Cleaning in SQL
Dates are one of the trickiest parts of data cleaning. Inconsistent date formats can break time-series analysis, reporting, and forecasting. SQL offers powerful functions to clean, convert, and extract date and time values. If you've read our blog on time series analysis, you'll see how crucial it is to prepare dates properly.
Step 1: Create Table & Insert Data
CREATE TABLE Orders (
    order_id INT,
    order_date VARCHAR(20),
    delivery_date VARCHAR(20)
);
INSERT INTO Orders VALUES
    (1, '2025-01-10', '2025/01/15'),
    (2, '10-02-2025', '2025-02-20'),
    (3, '2025/03/05', '2025-03-10'),
    (4, '15-04-2025', '2025/04/25'),
    (5, '2025-05-12', NULL);
⚠️ The Orders table has mixed date formats and missing delivery dates.
order_id | order_date | delivery_date |
---|---|---|
1 | 2025-01-10 | 2025/01/15 |
2 | 10-02-2025 | 2025-02-20 |
3 | 2025/03/05 | 2025-03-10 |
4 | 15-04-2025 | 2025/04/25 |
5 | 2025-05-12 | NULL |
1️⃣ CAST() / CONVERT() – Standardize Dates
SELECT order_id, CAST(order_date AS DATE) AS clean_order_date FROM Orders;
Note: CAST() relies on the server's default date format; for explicit formats use CONVERT(DATE, order_date, 105) in SQL Server or STR_TO_DATE() in MySQL.
Before | After |
---|---|
"10-02-2025" | "2025-02-10" |
"2025/03/05" | "2025-03-05" |
2️⃣ DATEPART() – Extract Components
SELECT order_id, DATEPART(year, CAST(order_date AS DATE)) AS order_year, DATEPART(month, CAST(order_date AS DATE)) AS order_month FROM Orders;
Before | After |
---|---|
"2025-01-10" | Year=2025, Month=1 |
3️⃣ DATEDIFF() – Calculate Duration
SELECT order_id, DATEDIFF(day, CAST(order_date AS DATE), CAST(delivery_date AS DATE)) AS delivery_days FROM Orders;
order_date | delivery_date | Days |
---|---|---|
2025-01-10 | 2025-01-15 | 5 |
2025-03-05 | 2025-03-10 | 5 |
4️⃣ GETDATE() / CURRENT_DATE
SELECT order_id, DATEDIFF(day, CAST(order_date AS DATE), GETDATE()) AS days_since_order FROM Orders;
order_date | Today | Days Since |
---|---|---|
2025-01-10 | 2025-09-02 | 235 |
5️⃣ FORMAT() – Custom Date Format
SELECT order_id, FORMAT(CAST(order_date AS DATE), 'dd-MMM-yyyy') AS formatted_date FROM Orders;
Before | After |
---|---|
"2025-01-10" | "10-Jan-2025" |
With clean date fields, you can now build accurate time-series forecasts, dashboards, and reporting pipelines. This is especially useful in advanced projects like Walmart's Inventory Prediction using AI.
Section 7: Outlier Detection & Handling in SQL
Outliers are extreme values that deviate from the rest of your dataset. They can inflate averages, distort regression models, and mislead dashboards. SQL allows analysts to detect, cap, or remove outliers efficiently. If you've read our guide on data mining & predictive analytics, you know outlier treatment is a key preprocessing step.
Step 1: Create Table & Insert Data
CREATE TABLE SalesAmount (
    sale_id INT,
    amount DECIMAL(10,2)
);
INSERT INTO SalesAmount VALUES
    (1, 2500.00), (2, 2600.00), (3, 2700.00), (4, 2800.00),
    (5, 100000.00), -- Outlier
    (6, 3000.00), (7, 3100.00),
    (8, 50000.00), -- Outlier
    (9, 3200.00), (10, 3300.00);
⚠️ The SalesAmount table has two extreme outliers (100000, 50000) compared to normal values.
1️⃣ MIN() & MAX() – Identify Extremes
SELECT MIN(amount) AS min_val, MAX(amount) AS max_val FROM SalesAmount;
Min | Max |
---|---|
2500.00 | 100000.00 |
2️⃣ AVG() + STDDEV() – Z-Score Detection
SELECT sale_id, amount FROM SalesAmount WHERE ABS(amount - (SELECT AVG(amount) FROM SalesAmount)) > 2 * (SELECT STDDEV(amount) FROM SalesAmount);
Detected Outliers |
---|
100000.00 |
Note: with this data, only 100000.00 lies more than two standard deviations from the mean. The outliers themselves inflate the standard deviation, which is why 50000.00 slips through; the IQR method below catches both.
3️⃣ PERCENTILE_CONT() – IQR Method
WITH percentiles AS (
    SELECT
        PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY amount) AS Q1,
        PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY amount) AS Q3
    FROM SalesAmount
)
SELECT Q1, Q3, (Q3 - Q1) AS IQR FROM percentiles;
✅ Outliers are values < Q1 - 1.5*IQR or > Q3 + 1.5*IQR.
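The query above only reports Q1, Q3, and the IQR. A minimal follow-up sketch that applies the 1.5*IQR fences to flag the actual outlier rows (same PERCENTILE_CONT aggregate syntax as above, so a PostgreSQL/Oracle-style engine is assumed):
WITH percentiles AS (
    SELECT
        PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY amount) AS Q1,
        PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY amount) AS Q3
    FROM SalesAmount
)
SELECT s.sale_id, s.amount
FROM SalesAmount s
CROSS JOIN percentiles p
WHERE s.amount < p.Q1 - 1.5 * (p.Q3 - p.Q1)
   OR s.amount > p.Q3 + 1.5 * (p.Q3 - p.Q1); -- flags 50000.00 and 100000.00 here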
4️⃣ CASE WHEN – Cap Outliers
SELECT sale_id, CASE WHEN amount > 10000 THEN 10000 ELSE amount END AS capped_amount FROM SalesAmount;
Before | After |
---|---|
50000.00 | 10000.00 |
100000.00 | 10000.00 |
5️⃣ DELETE – Remove Outliers
DELETE FROM SalesAmount WHERE amount > 10000;
✅ Removes extreme outliers to retain only realistic sales values.
By detecting and handling outliers, analysts can ensure more reliable insights. Outlier handling is especially important in business analytics and hands-on data projects where accurate metrics drive decisions.
Section 8: Handling Missing Data (NULLs) in SQL
Missing values (NULLs) are one of the most frequent challenges in data cleaning. If left untreated, they can distort averages, break joins, and lead to incomplete reports. SQL provides powerful commands to detect, replace, or ignore NULLs. Handling them properly is essential, especially in data analysis workflows.
Step 1: Create Table & Insert Data
CREATE TABLE CustomerOrders (
    order_id INT,
    customer_name VARCHAR(50),
    amount DECIMAL(10,2)
);
INSERT INTO CustomerOrders VALUES
    (1, 'Amit', 2500.00),
    (2, NULL, 3000.00),
    (3, 'Priya', NULL),
    (4, 'Sneha', 2000.00),
    (5, NULL, NULL);
⚠️ The CustomerOrders table has NULL names and NULL amounts.
order_id | customer_name | amount |
---|---|---|
1 | Amit | 2500.00 |
2 | NULL | 3000.00 |
3 | Priya | NULL |
4 | Sneha | 2000.00 |
5 | NULL | NULL |
1️⃣ IS NULL / IS NOT NULL
SELECT * FROM CustomerOrders WHERE customer_name IS NULL;
Detected Rows |
---|
order_id 2 → NULL name |
order_id 5 → NULL name |
2️⃣ COALESCE() – Replace with Default
SELECT order_id, COALESCE(customer_name, 'Unknown') AS clean_name FROM CustomerOrders;
Before | After |
---|---|
NULL | Unknown |
3️⃣ IFNULL() – MySQL Specific
SELECT order_id, IFNULL(amount, 0) AS clean_amount FROM CustomerOrders;
Before | After |
---|---|
NULL | 0 |
4️⃣ NULLIF() – Avoid Division Errors
SELECT order_id, amount / NULLIF(amount, 0) AS ratio FROM CustomerOrders;
✅ Prevents divide-by-zero errors by returning NULL instead of an error.
5️⃣ AVG() / COUNT() Ignore NULLs
SELECT AVG(amount) AS avg_amount, COUNT(amount) AS valid_records FROM CustomerOrders;
AVG | COUNT |
---|---|
2500.00 (ignores NULLs) | 3 (only non-NULL amounts) |
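A subtlety worth seeing side by side: COUNT(*) counts every row, while COUNT(column) skips NULLs, and AVG() divides only by the non-NULL count. A quick sketch against the same table:
SELECT COUNT(*) AS total_rows,         -- 5: counts every row
       COUNT(amount) AS valid_amounts, -- 3: NULL amounts are skipped
       AVG(amount) AS avg_amount       -- 2500.00: (2500 + 3000 + 2000) / 3, not / 5
FROM CustomerOrders;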
Handling NULL values ensures your analysis is accurate and consistent. In real-world datasets, missing values are everywhere, especially in finance, healthcare, and surveys. Once mastered, you can apply these techniques in larger projects like data analytics workflows where clean data is the foundation of insights.
Section 9: Data Type Conversion & Standardization in SQL
In real-world datasets, columns often store values in the wrong data type. For example: numbers stored as text, dates stored as strings, or booleans stored as integers. These inconsistencies can break calculations and cause inaccurate reports. SQL provides multiple functions to convert and standardize data types. This is especially critical in business analytics workflows.
Step 1: Create Table & Insert Data
CREATE TABLE MixedData (
    record_id INT,
    amount_text VARCHAR(20),
    date_text VARCHAR(20),
    status_flag INT
);
INSERT INTO MixedData VALUES
    (1, '2500.75', '2025-01-10', 1),
    (2, '3000.10', '10/02/2025', 0),
    (3, '4500.00', '2025/03/05', 1),
    (4, '1000', '15-04-2025', NULL),
    (5, 'invalid', 'invalid_date', 1);
⚠️ The MixedData table has numbers as text, dates in multiple formats, and invalid entries.
record_id | amount_text | date_text | status_flag |
---|---|---|---|
1 | "2500.75" | "2025-01-10" | 1 |
2 | "3000.10" | "10/02/2025" | 0 |
3 | "4500.00" | "2025/03/05" | 1 |
4 | "1000" | "15-04-2025" | NULL |
5 | "invalid" | "invalid_date" | 1 |
1️⃣ CAST() – Convert Text to Number
SELECT record_id, CAST(amount_text AS DECIMAL(10,2)) AS amount_num FROM MixedData;
Before | After |
---|---|
"2500.75" | 2500.75 |
"1000" | 1000.00 |
Note: CAST() raises an error on row 5 ('invalid'); see TRY_CAST() below for a safe version.
2️⃣ CONVERT() – Standardize Date Format
SELECT record_id, CONVERT(DATE, date_text, 105) AS clean_date FROM MixedData;
✅ Converts dates like 15-04-2025 into 2025-04-15. (Style 105 expects dd-mm-yyyy; rows in other formats will fail, which is where TRY_CAST() and PARSE() below come in.)
3️⃣ TRY_CAST() – Safe Conversion
SELECT record_id, TRY_CAST(amount_text AS DECIMAL(10,2)) AS safe_amount FROM MixedData;
✅ Invalid values (like "invalid") return NULL instead of an error.
4️⃣ PARSE() – Handle Flexible Dates
SELECT record_id, PARSE(date_text AS DATE USING 'en-GB') AS parsed_date FROM MixedData;
✅ Reads multiple formats like 10/02/2025 or 2025/03/05.
5️⃣ FORMAT() – Custom Output
SELECT record_id, FORMAT(CAST(amount_text AS DECIMAL(10,2)), 'C', 'en-IN') AS formatted_amount FROM MixedData;
✅ Displays the amount as ₹2,500.75 (Indian currency format).
Standardizing data types ensures that calculations, joins, and models run smoothly. This step is critical in advanced workflows like data science projects, where mixed types can break machine learning pipelines.
Section 10: Text Search & Pattern Cleaning in SQL
Text fields often contain typos, invalid formats, or hidden characters. SQL provides pattern-matching functions like LIKE, REGEXP, and PATINDEX to detect, validate, and clean text values. These techniques are also used in natural language processing (NLP) where string patterns matter a lot.
Step 1: Create Table & Insert Data
CREATE TABLE Users (
    user_id INT,
    email VARCHAR(50),
    phone VARCHAR(20)
);
INSERT INTO Users VALUES
    (1, 'amit.sharma@gmail', '98765-43210'),
    (2, 'priya@@yahoo.com', '91 99988 87777'),
    (3, 'rohit.kumar@gmail.com', '999-111-2222'),
    (4, 'sneha@outlook', '8887776666'),
    (5, 'test_user#123.com', '99999');
⚠️ The Users table has invalid emails, irregular phone formats, and short numbers.
user_id | email | phone |
---|---|---|
1 | "amit.sharma@gmail" | "98765-43210" |
2 | "priya@@yahoo.com" | "91 99988 87777" |
3 | "rohit.kumar@gmail.com" | "999-111-2222" |
4 | "sneha@outlook" | "8887776666" |
5 | "test_user#123.com" | "99999" |
1️⃣ LIKE – Find Invalid Emails
SELECT * FROM Users WHERE email NOT LIKE '%@%.%';
Invalid Emails |
---|
amit.sharma@gmail |
sneha@outlook |
test_user#123.com |
Note: test_user#123.com is flagged too, since it contains no '@' at all.
2️⃣ PATINDEX() – Detect Symbols (SQL Server)
SELECT user_id, email FROM Users WHERE PATINDEX('%[^a-zA-Z0-9@._-]%', email) > 0;
Emails with Invalid Characters |
---|
test_user#123.com |
3️⃣ REGEXP – Validate Phone Pattern (MySQL)
SELECT user_id, phone FROM Users WHERE phone NOT REGEXP '^[0-9]{10}$';
Invalid Phones |
---|
98765-43210 |
91 99988 87777 |
999-111-2222 |
99999 |
4️⃣ STUFF() – Mask Phone Numbers (SQL Server)
SELECT user_id, STUFF(phone, 4, 3, 'XXX') AS masked_phone FROM Users;
Before | After |
---|---|
8887776666 | 888XXX6666 |
5️⃣ REPLACE() – Clean Formatting
SELECT user_id, REPLACE(REPLACE(phone, '-', ''), ' ', '') AS clean_phone FROM Users;
Before | After |
---|---|
91 99988 87777 | 919998887777 |
999-111-2222 | 9991112222 |
Mastering pattern-based text cleaning is essential for data validation. These SQL commands are also used in areas like cybersecurity and ethical hacking, where detecting unusual patterns can prevent fraud and attacks.
Section 11: Data Aggregation & Summarization in SQL
Once data is cleaned, the next step is to aggregate and summarize it. Aggregation helps analysts turn raw transactions into insights like totals, averages, and counts. SQL provides functions like SUM(), COUNT(), GROUP BY, and HAVING to simplify this. If you've read our blog on the steps of data analysis, this is where insights really start to appear.
Step 1: Create Table & Insert Data
(Note: several sections in this guide reuse table names from earlier ones; if you are running the whole guide in one database, drop the earlier table first.)
CREATE TABLE Sales (
    order_id INT,
    customer_id INT,
    region VARCHAR(20),
    amount DECIMAL(10,2)
);
INSERT INTO Sales VALUES
    (1, 101, 'North', 2500.00),
    (2, 101, 'North', 3000.00),
    (3, 102, 'South', 1500.00),
    (4, 103, 'East', 2000.00),
    (5, 104, 'East', 2200.00),
    (6, 105, 'West', 1800.00),
    (7, 106, 'South', 1700.00),
    (8, 107, 'North', 5000.00);
⚠️ The Sales table contains multiple transactions per customer and region.
order_id | customer_id | region | amount |
---|---|---|---|
1 | 101 | North | 2500.00 |
2 | 101 | North | 3000.00 |
3 | 102 | South | 1500.00 |
4 | 103 | East | 2000.00 |
5 | 104 | East | 2200.00 |
6 | 105 | West | 1800.00 |
7 | 106 | South | 1700.00 |
8 | 107 | North | 5000.00 |
1️⃣ GROUP BY – Region-wise Totals
SELECT region, SUM(amount) AS total_sales FROM Sales GROUP BY region;
Region | Total Sales |
---|---|
North | 10500.00 |
South | 3200.00 |
East | 4200.00 |
West | 1800.00 |
2️⃣ HAVING – Filter Groups
SELECT region, SUM(amount) AS total_sales FROM Sales GROUP BY region HAVING SUM(amount) > 3000;
Region | Total Sales |
---|---|
North | 10500.00 |
South | 3200.00 |
East | 4200.00 |
3️⃣ COUNT(DISTINCT) – Unique Customers
SELECT region, COUNT(DISTINCT customer_id) AS unique_customers FROM Sales GROUP BY region;
Region | Unique Customers |
---|---|
North | 2 |
South | 2 |
East | 2 |
West | 1 |
4️⃣ AVG() – Average Sales
SELECT region, AVG(amount) AS avg_sales FROM Sales GROUP BY region;
Region | Avg Sales |
---|---|
North | 3500.00 |
South | 1600.00 |
East | 2100.00 |
West | 1800.00 |
5️⃣ ROLLUP – Subtotals + Grand Total
SELECT region, SUM(amount) AS total_sales FROM Sales GROUP BY ROLLUP(region);
Region | Total Sales |
---|---|
North | 10500.00 |
South | 3200.00 |
East | 4200.00 |
West | 1800.00 |
NULL (Grand Total) | 19700.00 |
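The NULL label on the grand-total row can confuse report readers. A hedged sketch using GROUPING() (available in SQL Server and PostgreSQL) to label that row explicitly:
SELECT CASE WHEN GROUPING(region) = 1 THEN 'Grand Total' ELSE region END AS region_label,
       SUM(amount) AS total_sales
FROM Sales
GROUP BY ROLLUP(region);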
Aggregation is where data becomes insight. By grouping, filtering, and summarizing, you create meaningful metrics for dashboards and reports. This process is core to data analyst careers where summarizing trends drives decision-making.
Section 12: Data Joins & Cleaning Across Tables in SQL
Most business data lives in multiple related tables. To prepare clean analytics datasets, we use joins to merge them. Joins not only connect data but also help detect missing records, mismatches, and duplicates. If you've practiced with our Pandas data cleaning guide, this section feels very similar, but with SQL.
Step 1: Create Tables & Insert Data
CREATE TABLE Customers (
    cust_id INT,
    cust_name VARCHAR(50)
);
INSERT INTO Customers VALUES (101, 'Amit'), (102, 'Priya'), (103, 'Rohit'), (104, 'Sneha');
CREATE TABLE Orders (
    order_id INT,
    cust_id INT,
    amount DECIMAL(10,2)
);
INSERT INTO Orders VALUES
    (1, 101, 2500.00),
    (2, 101, 3000.00),
    (3, 102, 1500.00),
    (4, 105, 2000.00); -- 105 not in Customers
⚠️ The Customers and Orders tables have one mismatch (cust_id 105 is not in Customers).
cust_id | cust_name |
---|---|
101 | Amit |
102 | Priya |
103 | Rohit |
104 | Sneha |
order_id | cust_id | amount |
---|---|---|
1 | 101 | 2500.00 |
2 | 101 | 3000.00 |
3 | 102 | 1500.00 |
4 | 105 | 2000.00 |
1️⃣ INNER JOIN – Matching Records
SELECT o.order_id, c.cust_name, o.amount FROM Orders o INNER JOIN Customers c ON o.cust_id = c.cust_id;
order_id | cust_name | amount |
---|---|---|
1 | Amit | 2500.00 |
2 | Amit | 3000.00 |
3 | Priya | 1500.00 |
2️⃣ LEFT JOIN – Keep All Orders, Spot Missing Customers
SELECT o.order_id, c.cust_name, o.amount FROM Orders o LEFT JOIN Customers c ON o.cust_id = c.cust_id;
order_id | cust_name | amount |
---|---|---|
1 | Amit | 2500.00 |
2 | Amit | 3000.00 |
3 | Priya | 1500.00 |
4 | NULL | 2000.00 |
✅ All orders are kept; order 4 shows NULL because cust_id 105 has no match in Customers.
3️⃣ RIGHT JOIN – Customers without Orders
SELECT c.cust_name, o.order_id FROM Orders o RIGHT JOIN Customers c ON o.cust_id = c.cust_id;
✅ Keeps all customers; Rohit and Sneha appear with a NULL order_id because they have no orders.
4️⃣ FULL OUTER JOIN – All Data with NULLs
SELECT c.cust_name, o.order_id, o.amount FROM Customers c FULL OUTER JOIN Orders o ON c.cust_id = o.cust_id;
✅ Useful for audit reports: shows everything, even mismatches.
5️⃣ JOIN + IS NULL – Detect Missing Foreign Keys
SELECT o.order_id, o.cust_id FROM Orders o LEFT JOIN Customers c ON o.cust_id = c.cust_id WHERE c.cust_id IS NULL;
order_id | cust_id |
---|---|
4 | 105 |
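Once identified, the orphan orders can be removed (or reassigned to a placeholder customer). A minimal cleanup sketch, safe here because Customers.cust_id contains no NULLs:
DELETE FROM Orders
WHERE cust_id NOT IN (SELECT cust_id FROM Customers); -- removes order 4 (cust_id 105)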
Joining and cleaning across tables ensures data consistency. Detecting mismatches helps prevent orphan records that can mislead reports. These join techniques are critical in data analytics workflows and in Python Pandas projects.
Section 13: Data Integrity & Constraint-Based Cleaning in SQL
Instead of cleaning dirty data after it arrives, why not stop bad data at the source? SQL constraints enforce integrity rules to prevent invalid, duplicate, or missing data. These techniques are crucial in data analyst workflows where quality assurance starts at the database design stage.
Step 1: Create Products Table (with Issues)
CREATE TABLE Products (
    product_id INT,
    product_name VARCHAR(50),
    price DECIMAL(10,2),
    category VARCHAR(20)
);
INSERT INTO Products VALUES
    (1, 'Laptop', 60000, 'Electronics'),
    (1, 'Laptop', 60000, 'Electronics'), -- Duplicate ID
    (2, NULL, 1200, 'Stationery'),       -- Missing name
    (3, 'Pen', -10, 'Stationery'),       -- Negative price
    (4, 'Shirt', 1500, 'InvalidCat');    -- Wrong category
⚠️ The Products table has duplicates, NULLs, negative values, and invalid categories.
1️⃣ PRIMARY KEY – Prevent Duplicates
ALTER TABLE Products ADD CONSTRAINT pk_product PRIMARY KEY(product_id);
✅ Stops duplicate product IDs from being inserted. (Adding the constraint fails while duplicate IDs already exist; remove them first.)
2️⃣ UNIQUE – Enforce Uniqueness
ALTER TABLE Products ADD CONSTRAINT unq_product_name UNIQUE(product_name);
✅ Prevents two products with the same name.
3️⃣ NOT NULL – No Missing Fields
ALTER TABLE Products ALTER COLUMN product_name VARCHAR(50) NOT NULL;
✅ Ensures every product has a name. (SQL Server syntax; MySQL uses ALTER TABLE ... MODIFY COLUMN.)
4️⃣ CHECK – Validate Ranges
ALTER TABLE Products ADD CONSTRAINT chk_price CHECK(price > 0);
✅ Blocks negative prices from being entered.
5️⃣ FOREIGN KEY – Maintain Relationships
ALTER TABLE Products ADD CONSTRAINT fk_category FOREIGN KEY(category) REFERENCES Categories(category_name);
✅ Ensures only valid categories (from the Categories table) are allowed.
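The foreign key assumes a Categories reference table whose category_name column is a primary key (or at least unique). A hypothetical sketch of that table:
CREATE TABLE Categories (
    category_name VARCHAR(20) PRIMARY KEY
);
INSERT INTO Categories VALUES ('Electronics'), ('Stationery'), ('Clothing');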
Constraints are like data bodyguards: they prevent invalid records before they even enter your system. Using them reduces manual cleaning effort and ensures reliable datasets. This approach is vital in business analytics where decisions rely on high-quality data.
Section 14: Indexing & Performance-Aware Cleaning in SQL
When datasets grow huge, even simple cleaning operations like DELETE, DISTINCT, or UPDATE can become slow and expensive. SQL indexing ensures faster lookups, duplicate detection, and cleaning operations. Think of indexes as shortcuts for your database. If you've read our predictive analytics guide, you already know performance and efficiency directly affect insights.
Step 1: Create Table with Duplicates
CREATE TABLE LargeSales (
    sale_id INT,
    customer_id INT,
    amount DECIMAL(10,2)
);
INSERT INTO LargeSales VALUES
    (1, 101, 2500.00),
    (2, 102, 3000.00),
    (3, 103, 1800.00),
    (4, 101, 2500.00), -- duplicate customer
    (5, 104, 2200.00);
⚠️ The LargeSales table has duplicate customer_ids and needs optimization.
1️⃣ CREATE INDEX – Faster Lookups
CREATE INDEX idx_customer ON LargeSales(customer_id);
✅ Speeds up duplicate-detection queries like SELECT DISTINCT or GROUP BY customer_id.
2️⃣ UNIQUE INDEX – Prevent Duplicates
CREATE UNIQUE INDEX unq_customer ON LargeSales(customer_id);
✅ Blocks insertion of duplicate customer IDs at the database level. (This fails while customer_id 101 is still duplicated; dedupe first, as sketched below.)
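Because customer_id 101 is already duplicated, a workable sequence is to dedupe first and then add the unique index. A minimal sketch (the DELETE form shown works in SQL Server; MySQL needs the subquery wrapped in a derived table):
DELETE FROM LargeSales
WHERE sale_id NOT IN (
    SELECT MIN(sale_id) FROM LargeSales GROUP BY customer_id
); -- keeps one row per customer
CREATE UNIQUE INDEX unq_customer ON LargeSales(customer_id);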
3️⃣ CLUSTERED INDEX – Organize Data
CREATE CLUSTERED INDEX cl_idx_sales ON LargeSales(sale_id);
✅ Stores rows physically ordered by sale_id, improving query speed for sequential cleaning tasks.
4️⃣ NONCLUSTERED INDEX – Flexible Filtering
CREATE NONCLUSTERED INDEX ncl_idx_amount ON LargeSales(amount);
✅ Helps when cleaning unusual numeric values like outliers in amount.
5️⃣ DROP INDEX – Optimize Bulk Cleaning
DROP INDEX idx_customer ON LargeSales;
✅ Dropping indexes before a bulk DELETE or UPDATE speeds up cleaning; recreate them afterwards.
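Putting 5️⃣ into practice, a sketch of the full drop, clean, recreate cycle (the DELETE condition is just an illustrative bulk-cleaning rule):
DROP INDEX idx_customer ON LargeSales;                      -- 1. drop the index
DELETE FROM LargeSales WHERE amount IS NULL OR amount <= 0; -- 2. bulk clean (hypothetical rule)
CREATE INDEX idx_customer ON LargeSales(customer_id);       -- 3. recreate the index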
Indexes are like shortcuts for your database. They enforce uniqueness, accelerate duplicate removal, and make cleaning large datasets practical. For data analysts working with big datasets, mastering indexing ensures both accuracy and speed.
Section 15: Data Normalization & Standardization in SQL
Raw datasets often contain inconsistent spellings, cases, and formats. For example, India, india, and IN may all appear in the same column. SQL provides functions to normalize and standardize values, ensuring data is clean, consistent, and ready for analysis. These techniques are widely used in data analytics projects.
Step 1: Create Table & Insert Data
CREATE TABLE CustomersNorm (
    cust_id INT,
    country VARCHAR(50),
    gender VARCHAR(10)
);
INSERT INTO CustomersNorm VALUES
    (1, ' India ', 'M'),
    (2, 'india', 'male'),
    (3, 'INDIA', 'F'),
    (4, 'USA', 'Female'),
    (5, 'U.S.A', 'f');
⚠️ The CustomersNorm table has inconsistent cases, spaces, abbreviations, and gender formats.
1️⃣ TRIM() – Remove Extra Spaces
SELECT cust_id, TRIM(country) AS clean_country FROM CustomersNorm;
Before | After |
---|---|
" India " | "India" |
2️⃣ UPPER()/LOWER() – Fix Case
SELECT cust_id, UPPER(country) AS norm_country FROM CustomersNorm;
Before | After |
---|---|
"india" | "INDIA" |
3️⃣ REPLACE() – Standardize Abbreviations
SELECT cust_id, REPLACE(country, 'U.S.A', 'USA') AS clean_country FROM CustomersNorm;
Before | After |
---|---|
"U.S.A" | "USA" |
4️⃣ CASE WHEN – Map Gender Values
SELECT cust_id, CASE WHEN gender IN ('M','m','male') THEN 'Male' WHEN gender IN ('F','f','female') THEN 'Female' END AS norm_gender FROM CustomersNorm;
Before | After |
---|---|
"M" | "Male" |
"f" | "Female" |
5️⃣ JOIN Reference Table – Enforce Standards
-- Assuming we have a CountryReference table
SELECT c.cust_id, r.country_name
FROM CustomersNorm c
JOIN CountryReference r ON UPPER(TRIM(c.country)) = UPPER(r.country_alias);
✅ Ensures only valid country names from a reference table are used; TRIM() keeps values like ' India ' from missing their match.
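The CountryReference table is assumed above; a hypothetical sketch, mapping each known alias to one canonical name:
CREATE TABLE CountryReference (
    country_alias VARCHAR(50),
    country_name  VARCHAR(50)
);
INSERT INTO CountryReference VALUES
    ('INDIA', 'India'),
    ('USA', 'USA'),
    ('U.S.A', 'USA');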
Normalization ensures that different spellings and formats mean the same thing. By standardizing values, analysts prevent duplicate categories and improve data accuracy. This step is vital in data analytics workflows where consistent values lead to more reliable insights.
Section 16: Date & Time Cleaning in SQL
Dates in datasets often come in mixed formats, with NULLs or invalid values. Cleaning them ensures accurate time-series analysis, forecasting, and reporting. SQL provides built-in functions to standardize dates and extract useful parts. If you've read our linear regression guide, you know how important clean dates are for trend analysis.
Step 1: Create Table & Insert Data
CREATE TABLE Transactions (
    trans_id INT,
    trans_date VARCHAR(20),
    amount DECIMAL(10,2)
);
INSERT INTO Transactions VALUES
    (1, '2025-01-05', 2500.00),
    (2, '05/02/2025', 3000.00),
    (3, 'March 10, 2025', 1800.00),
    (4, NULL, 2200.00),
    (5, 'invalid_date', 1500.00);
⚠️ The Transactions table has mixed date formats, NULLs, and invalid values.
1️⃣ CAST/CONVERT – Standardize Format
SELECT trans_id, CONVERT(DATE, trans_date, 103) AS clean_date FROM Transactions;
✅ Style 103 (dd/mm/yyyy) converts '05/02/2025' into '2025-02-05' (YYYY-MM-DD format). Rows like 'March 10, 2025' or 'invalid_date' still error here; see TRY_CONVERT below.
2️⃣ TRY_CONVERT – Safe Conversion
SELECT trans_id, TRY_CONVERT(DATE, trans_date) AS safe_date FROM Transactions;
✅ Invalid dates (like 'invalid_date') return NULL instead of an error.
3️⃣ DATEPART() – Extract Components
SELECT trans_id, DATEPART(YEAR, TRY_CONVERT(DATE, trans_date)) AS year, DATEPART(MONTH, TRY_CONVERT(DATE, trans_date)) AS month FROM Transactions;
✅ Extracts year and month for time-based grouping.
4️⃣ DATEDIFF() – Calculate Durations
SELECT trans_id, DATEDIFF(DAY, TRY_CONVERT(DATE, trans_date), GETDATE()) AS days_old FROM Transactions;
✅ Calculates how many days old each transaction is.
5️⃣ COALESCE – Replace NULL Dates
SELECT trans_id, COALESCE(TRY_CONVERT(DATE, trans_date), '2025-01-01') AS final_date FROM Transactions;
✅ Fills missing/invalid dates with a default date.
Cleaning dates and timestamps is essential for reliable time-series insights. These SQL techniques ensure consistent formats, safe conversions, and complete values. They are widely used in forecasting, reporting, and machine learning models where accurate dates drive predictions.
Section 17: Outlier Detection & Cleaning in SQL
Outliers can distort averages, create misleading insights, and break machine learning models. SQL offers multiple ways to detect and clean outliers using statistics and conditions. If you've read our data science guide, you know outlier handling is a must before model training.
Step 1: Create Table with Outliers
CREATE TABLE SalesOutliers (
    sale_id INT,
    amount DECIMAL(10,2)
);
INSERT INTO SalesOutliers VALUES
    (1, 2500.00), (2, 2600.00), (3, 2700.00),
    (4, 100000.00), -- Extreme high outlier
    (5, 150.00),    -- Extreme low outlier
    (6, 2800.00), (7, 2900.00);
⚠️ The SalesOutliers table has one extremely high and one extremely low value.
1️⃣ AVG + STDEV – Z-Score Method
SELECT sale_id, amount FROM SalesOutliers WHERE ABS(amount - (SELECT AVG(amount) FROM SalesOutliers)) < 3 * (SELECT STDEV(amount) FROM SalesOutliers);
✅ Keeps only values within 3 standard deviations of the mean. (STDEV is SQL Server syntax; MySQL and PostgreSQL use STDDEV. Note: in tiny samples the outliers inflate the standard deviation itself, so a 3-sigma rule may keep everything; the IQR method below is more robust here.)
2️⃣ IQR Filtering – Boxplot Method
WITH Quartiles AS (
    SELECT
        PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY amount) AS Q1,
        PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY amount) AS Q3
    FROM SalesOutliers
)
SELECT s.*
FROM SalesOutliers s, Quartiles q
WHERE s.amount BETWEEN (q.Q1 - 1.5*(q.Q3 - q.Q1)) AND (q.Q3 + 1.5*(q.Q3 - q.Q1));
✅ Filters out values beyond the IQR fences.
3️⃣ PERCENTILE_CONT – Winsorization
SELECT PERCENTILE_CONT(0.05) WITHIN GROUP (ORDER BY amount) AS p05, PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY amount) AS p95 FROM SalesOutliers;
✅ Identifies the 5th–95th percentile cutoffs for capping extreme values.
4️⃣ CASE WHEN – Cap Outliers
SELECT sale_id, CASE WHEN amount > 5000 THEN 5000 WHEN amount < 500 THEN 500 ELSE amount END AS capped_amount FROM SalesOutliers;
✅ Caps outliers to a safe range (500–5000).
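The SELECT above only displays capped values. To persist the capping (winsorization) in the table itself, a small sketch:
UPDATE SalesOutliers
SET amount = CASE
                 WHEN amount > 5000 THEN 5000
                 WHEN amount < 500 THEN 500
                 ELSE amount
             END
WHERE amount > 5000 OR amount < 500; -- touches only the two outlier rows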
5️⃣ DELETE – Remove Outliers Permanently
DELETE FROM SalesOutliers WHERE amount > 5000 OR amount < 500;
✅ Deletes extreme records completely (use with caution).
Handling outliers ensures more stable averages, cleaner visuals, and accurate models. Depending on the business case, you can remove, cap, or transform extreme values. This step is critical in data science workflows where outliers often make or break model performance.
Section 18: Data Deduplication Across Tables in SQL
When merging datasets from different sources, duplicate records across tables are very common. SQL provides set operations and joins to detect, remove, and consolidate duplicates. This ensures analysts work with unique, high-quality records. If you've read our 7 steps of data analysis, deduplication comes right after data collection.
Step 1: Create Tables with Overlap
CREATE TABLE Customers_A (
    cust_id INT,
    cust_name VARCHAR(50)
);
INSERT INTO Customers_A VALUES (1, 'Amit'), (2, 'Priya'), (3, 'Rohit');
CREATE TABLE Customers_B (
    cust_id INT,
    cust_name VARCHAR(50)
);
INSERT INTO Customers_B VALUES (2, 'Priya'), (3, 'Rohit'), (4, 'Sneha');
⚠️ Both tables contain overlapping records for Priya and Rohit.
1️⃣ UNION – Remove Duplicates Across Tables
SELECT cust_id, cust_name FROM Customers_A UNION SELECT cust_id, cust_name FROM Customers_B;
✅ Combines both tables but keeps unique records only.
2️⃣ UNION ALL + GROUP BY – Find True Duplicates
SELECT cust_id, cust_name, COUNT(*) AS duplicate_count
FROM (
    SELECT cust_id, cust_name FROM Customers_A
    UNION ALL
    SELECT cust_id, cust_name FROM Customers_B
) t
GROUP BY cust_id, cust_name
HAVING COUNT(*) > 1;
✅ Identifies which records appear in both tables.
3️⃣ EXCEPT/MINUS – Find Non-Matching Records
SELECT cust_id, cust_name FROM Customers_A EXCEPT SELECT cust_id, cust_name FROM Customers_B;
✅ Shows records that exist in Customers_A only. (EXCEPT works in SQL Server and PostgreSQL; Oracle uses MINUS.)
4️⃣ LEFT JOIN + IS NULL – Detect Missing Matches
SELECT a.cust_id, a.cust_name FROM Customers_A a LEFT JOIN Customers_B b ON a.cust_id = b.cust_id WHERE b.cust_id IS NULL;
✅ Finds records in A but not in B (non-overlaps).
5️⃣ INSERT DISTINCT – Merge Clean Data
INSERT INTO MasterCustomers (cust_id, cust_name)
SELECT DISTINCT cust_id, cust_name
FROM (
    SELECT * FROM Customers_A
    UNION
    SELECT * FROM Customers_B
) t;
✅ Inserts only unique records into the master table. (UNION already removes duplicates, so the DISTINCT is an extra safeguard.)
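The MasterCustomers target table is assumed to exist. A hypothetical definition, with a primary key so re-runs cannot silently re-insert duplicates:
CREATE TABLE MasterCustomers (
    cust_id INT PRIMARY KEY,
    cust_name VARCHAR(50)
);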
Deduplication ensures one record = one entity, even if data comes from multiple sources. It prevents double-counting and improves the accuracy of reports and dashboards. This step is a must in data analysis pipelines before joining data with other business tables.
Section 19: Data Type Conversion & Cleaning in SQL
Datasets often have values stored in the wrong data type: numbers saved as text, or dates written as strings. This causes errors in calculations and reporting. SQL provides conversion functions to clean and standardize data. If you've read our neural network blog, you'll know that even advanced models require correct data types to function properly.
Step 1: Create Table with Mixed Types
CREATE TABLE MixedData (
    record_id INT,
    amount_text VARCHAR(20),
    date_text VARCHAR(20)
);
INSERT INTO MixedData VALUES
    (1, '2500.50', '2025-01-05'),
    (2, '3000', '05/02/2025'),
    (3, 'invalid_num', 'March 10, 2025'),
    (4, NULL, 'invalid_date');
⚠️ The MixedData table has numbers stored as text and inconsistent date formats.
1️⃣ CAST – Basic Conversion
SELECT record_id, CAST(amount_text AS DECIMAL(10,2)) AS clean_amount FROM MixedData;
✅ Converts '2500.50' (text) into 2500.50 (decimal).
2️⃣ CONVERT – Date Conversion
SELECT record_id, CONVERT(DATE, date_text, 103) AS clean_date FROM MixedData;
✅ Style 103 (dd/mm/yyyy) converts '05/02/2025' into '2025-02-05' (standard format).
3️⃣ TRY_CAST – Safe Conversion
SELECT record_id, TRY_CAST(amount_text AS DECIMAL(10,2)) AS safe_amount FROM MixedData;
✅ Invalid numbers ('invalid_num') become NULL instead of causing errors.
4️⃣ TRY_CONVERT – Safe Date Conversion
SELECT record_id, TRY_CONVERT(DATE, date_text) AS safe_date FROM MixedData;
✅ Invalid dates ('invalid_date') safely return NULL.
5️⃣ PARSE – Complex Conversions
SELECT record_id, PARSE(date_text AS DATE USING 'en-US') AS parsed_date FROM MixedData;
✅ Converts 'March 10, 2025' into a proper DATE value.
With correct data types, analysts can perform calculations, joins, and models reliably. Type conversion is a critical step before analytics or machine learning, where incorrect formats can cause errors or wrong predictions.
Section 20: Data Validation Queries in SQL
Data validation ensures that records follow business rules before they are used for analytics. Using SQL, analysts can easily detect invalid, missing, or inconsistent values. This step is critical in business analysis, where data quality directly impacts decision-making.
Step 1: Create Table with Invalid Records
CREATE TABLE SalesValidation (
    sale_id INT,
    product_name VARCHAR(50),
    amount DECIMAL(10,2),
    category VARCHAR(20),
    sale_date DATE
);
INSERT INTO SalesValidation VALUES
    (1, 'Laptop', 50000, 'Electronics', '2025-01-15'),
    (2, NULL, 1500, 'Stationery', '2025-02-05'),   -- Missing product name
    (3, 'Pen', -10, 'Stationery', '2025-02-10'),   -- Negative amount
    (4, 'Shirt', 2000, 'InvalidCat', '2025-03-01'),-- Wrong category
    (5, 'Book', 800, 'Stationery', '2030-01-01');  -- Future date
⚠️ The SalesValidation table has NULLs, negatives, wrong categories, and future dates.
1️⃣ IS NULL – Detect Missing Fields
SELECT * FROM SalesValidation WHERE product_name IS NULL;
✅ Finds records where product_name is missing.
2️⃣ CASE WHEN – Check for Negative Values
SELECT sale_id, amount, CASE WHEN amount < 0 THEN 'Invalid' ELSE 'Valid' END AS status FROM SalesValidation;
✅ Flags sales with negative amounts.
3️⃣ NOT IN – Validate Allowed Categories
SELECT * FROM SalesValidation WHERE category NOT IN ('Electronics', 'Stationery', 'Clothing');
✅ Finds products with invalid categories.
4️⃣ BETWEEN – Detect Out-of-Range Values
SELECT * FROM SalesValidation WHERE amount NOT BETWEEN 1 AND 100000;
✅ Finds amounts outside the expected range (1–100000).
5️⃣ Date Validation – No Future Dates
SELECT * FROM SalesValidation WHERE sale_date > GETDATE();
✅ Detects future transaction dates.
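The five checks above can be combined into one data-quality report. A sketch that flags each row with its first failing rule (GETDATE() as above, so SQL Server syntax is assumed):
SELECT sale_id, product_name, amount, category, sale_date,
       CASE
           WHEN product_name IS NULL THEN 'Missing product name'
           WHEN amount < 0 THEN 'Negative amount'
           WHEN category NOT IN ('Electronics', 'Stationery', 'Clothing') THEN 'Invalid category'
           WHEN sale_date > GETDATE() THEN 'Future date'
           ELSE 'Valid'
       END AS validation_status
FROM SalesValidation;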
Validation queries act like data quality checkpoints. They ensure values match business expectations before being used in reports. For business analysts, mastering these checks is key to delivering trustworthy insights.
Section 21: Data Cleaning with Subqueries in SQL
Subqueries allow analysts to run a query inside another query. They are widely used in data cleaning for detecting duplicates, filtering invalid records, and keeping only the latest or valid entries. Similar to loops in Python, subqueries repeat logic efficiently within SQL itself.
Step 1: Create Table with Dirty Records
CREATE TABLE CustomerOrders (
    order_id INT,
    cust_id INT,
    order_date DATE,
    amount DECIMAL(10,2)
);
INSERT INTO CustomerOrders VALUES
    (1, 101, '2025-01-01', 2500.00),
    (2, 102, '2025-01-05', NULL),    -- Missing amount
    (3, 103, '2025-01-10', 1800.00),
    (4, 103, '2025-01-12', 2000.00), -- Duplicate customer with later date
    (5, 104, '2024-12-25', -500.00); -- Negative amount
⚠️ The CustomerOrders table has NULLs, duplicates, and negative values.
1️⃣ IN – Keep Only Valid Customers
SELECT * FROM CustomerOrders WHERE cust_id IN ( SELECT DISTINCT cust_id FROM CustomerOrders WHERE amount > 0 );
✅ Keeps only customers with valid positive amounts.
2️⃣ NOT IN – Remove Customers with NULLs
SELECT * FROM CustomerOrders WHERE cust_id NOT IN ( SELECT cust_id FROM CustomerOrders WHERE amount IS NULL );
✅ Excludes customers with NULL amounts.
3️⃣ EXISTS – Detect Duplicate Customers
SELECT * FROM CustomerOrders c1 WHERE EXISTS ( SELECT 1 FROM CustomerOrders c2 WHERE c1.cust_id = c2.cust_id AND c1.order_id <> c2.order_id );
✅ Flags customers that appear in multiple orders.
4️⃣ MAX() with Subquery – Keep Latest Order
SELECT * FROM CustomerOrders c WHERE order_date = ( SELECT MAX(order_date) FROM CustomerOrders c2 WHERE c.cust_id = c2.cust_id );
✅ Keeps only the latest order per customer.
5️⃣ DELETE with Subquery – Remove Negatives
DELETE FROM CustomerOrders WHERE amount < 0 AND cust_id IN ( SELECT cust_id FROM CustomerOrders );
✅ Deletes negative transactions; the IN subquery is illustrative here, since the amount filter alone already selects them.
Subqueries are powerful tools for smart filtering and cleaning. They help analysts remove invalid values, keep latest records, and detect duplicates. Just like Python loops, subqueries repeat logic without writing separate queries each time.
Section 22: CTEs for Stepwise Data Cleaning in SQL
Common Table Expressions (CTEs) allow you to write stepwise SQL queries. They make cleaning easier to read and manage, especially when handling duplicates, NULLs, and row numbering. Think of them as temporary views within a query. If you've explored our Pandas cleaning guide, you'll see how CTEs give SQL similar stepwise clarity.
Step 1: Create Sample Orders Table
CREATE TABLE OrdersCTE (
    order_id INT,
    cust_id INT,
    amount DECIMAL(10,2),
    order_date DATE
);
INSERT INTO OrdersCTE VALUES
    (1, 101, 2500.00, '2025-01-01'),
    (2, 102, NULL, '2025-01-05'),    -- NULL value
    (3, 103, 1800.00, '2025-01-10'),
    (4, 103, 1800.00, '2025-01-10'), -- Duplicate
    (5, 104, -500.00, '2025-01-15'); -- Negative value
⚠️ The OrdersCTE table has NULLs, duplicates, and negatives.
1️⃣ Remove Duplicates with ROW_NUMBER()
WITH Dedup AS (
    SELECT *, ROW_NUMBER() OVER (
        PARTITION BY cust_id, amount, order_date ORDER BY order_id
    ) AS rn
    FROM OrdersCTE
)
SELECT * FROM Dedup WHERE rn = 1;
✅ Assigns row numbers and keeps only the first occurrence.
2️⃣ Fill NULLs Stepwise
WITH FillNulls AS (
    SELECT order_id, cust_id, COALESCE(amount, 0) AS clean_amount, order_date
    FROM OrdersCTE
)
SELECT * FROM FillNulls;
✅ Replaces NULL amounts with 0.
3️⃣ Detect Negative Values
WITH Negatives AS (
    SELECT * FROM OrdersCTE WHERE amount < 0
)
SELECT * FROM Negatives;
✅ Finds negative transactions.
4️⃣ Multi-Step Cleaning
WITH Step1 AS (
    SELECT *, COALESCE(amount, 0) AS clean_amt FROM OrdersCTE
), Step2 AS (
    SELECT *, ABS(clean_amt) AS abs_amt FROM Step1
)
SELECT * FROM Step2;
✅ Chains multiple cleaning steps: fill NULLs, then fix negatives.
5️⃣ DELETE Duplicates with CTE
WITH Dedup AS (
    SELECT *, ROW_NUMBER() OVER(
        PARTITION BY cust_id, amount, order_date ORDER BY order_id
    ) AS rn
    FROM OrdersCTE
)
DELETE FROM Dedup WHERE rn > 1;
✅ Deletes duplicate rows while keeping the first one.
CTEs make complex cleaning step-by-step and readable. They are especially useful for handling duplicates, NULLs, and negatives systematically. Analysts can combine multiple cleaning tasks in a single structured query, similar to how we chain cleaning steps in Pandas.
Section 23: Views for Consistent Data Cleaning in SQL
A VIEW in SQL acts like a saved query: a virtual table that ensures consistent cleaning logic across teams. Instead of rewriting cleaning steps for every report, analysts can build a view once and reuse it. This is crucial in enterprise analytics and data analyst workflows.
Step 1: Create Employees Table with Issues
CREATE TABLE RawEmployees (
    emp_id INT,
    emp_name VARCHAR(50),
    salary VARCHAR(20),
    dept VARCHAR(20)
);
INSERT INTO RawEmployees VALUES
    (1, ' Amit ', '50000', 'HR'),
    (2, 'Priya', 'invalid', 'Finance'),
    (3, 'Rohit', '60000', 'Sales'),
    (4, 'Sneha', NULL, 'HR'),
    (5, 'Amit', '50000', 'HR'); -- duplicate
⚠️ The RawEmployees table has extra spaces, duplicates, invalid salaries, and NULLs.
1️⃣ CREATE VIEW – Clean Data Once
CREATE VIEW CleanEmployees AS
SELECT DISTINCT
    emp_id,
    TRIM(emp_name) AS clean_name,
    TRY_CAST(salary AS DECIMAL(10,2)) AS clean_salary,
    dept
FROM RawEmployees
WHERE salary IS NOT NULL;
✅ Standardizes salaries, trims names, and removes NULLs.
2️⃣ UPDATE Through a VIEW
UPDATE CleanEmployees SET clean_salary = 55000 WHERE clean_name = 'Rohit';
✅ Fixes invalid/incorrect values directly via the view. (A view is only updatable if it is simple enough; the DISTINCT in CleanEmployees would block this in most databases.)
3️⃣ JOIN in a VIEW – Add Context
CREATE VIEW EmployeeWithDept AS
SELECT e.emp_id, e.clean_name, e.clean_salary, d.manager
FROM CleanEmployees e
JOIN Departments d ON e.dept = d.dept_name;
✅ Enriches cleaned data with manager info from a Departments table (assumed to exist with dept_name and manager columns).
4️⃣ Indexed View – Faster Cleaning
CREATE UNIQUE CLUSTERED INDEX idx_clean_emp ON CleanEmployees(emp_id);
✅ Speeds up repetitive cleaning queries. (SQL Server requires the view to be created WITH SCHEMABINDING before it can be indexed.)
5️⃣ DROP VIEW – Reset Cleaning Logic
DROP VIEW CleanEmployees;
✅ Deletes the view when cleaning rules change.
Views ensure consistent cleaning rules across all analysts and dashboards. Instead of rewriting cleaning queries, you reuse one virtual table. This is essential in enterprise analytics where multiple teams depend on the same clean data.
Section 24: Stored Procedures for Automated Cleaning in SQL
Stored procedures bundle cleaning logic into a reusable block of SQL code. Instead of writing multiple queries, analysts just call the procedure. This makes cleaning faster, automated, and secure. In enterprise analytics, automation is a key step after data preparation.
Step 1: Create Dirty Products Table
CREATE TABLE DirtyProducts (
    prod_id INT,
    prod_name VARCHAR(50),
    price VARCHAR(20),
    stock INT
);
INSERT INTO DirtyProducts VALUES
    (1, ' Laptop ', '50000.99', 10),
    (2, 'Mobile', 'invalid', 5),
    (3, 'Tablet', '15000', NULL),
    (4, 'Mobile', '20000', 5); -- duplicate
⚠️ The DirtyProducts table has extra spaces, invalid prices, NULL stock, and duplicates.
1️⃣ CREATE PROCEDURE – Define Cleaning Logic
CREATE PROCEDURE CleanProducts AS
BEGIN
    UPDATE DirtyProducts
    SET prod_name = TRIM(prod_name),
        price = TRY_CAST(price AS DECIMAL(10,2)),
        stock = COALESCE(stock, 0);
END;
✅ Creates a procedure to clean names, fix prices, and fill NULL stock.
2️⃣ EXEC – Run the Procedure
EXEC CleanProducts;
✅ Executes the cleaning procedure in one step.
3️⃣ Procedure with Parameters
CREATE PROCEDURE UpdatePrice (@prod_id INT, @new_price DECIMAL(10,2)) AS
BEGIN
    UPDATE DirtyProducts SET price = @new_price WHERE prod_id = @prod_id;
END;
EXEC UpdatePrice 2, 18000;
✅ Updates the price for a specific product dynamically.
4️⃣ Procedure with Transaction
CREATE PROCEDURE SafeDeleteDuplicates AS
BEGIN
    BEGIN TRANSACTION;
    DELETE FROM DirtyProducts
    WHERE prod_id NOT IN (
        SELECT MIN(prod_id) FROM DirtyProducts
        GROUP BY prod_name, price, stock
    );
    COMMIT TRANSACTION;
END;
✅ Safely deletes duplicates using a transaction.
5️⃣ DROP PROCEDURE – Remove Cleaning Logic
DROP PROCEDURE CleanProducts;
✅ Deletes the procedure when no longer needed.
Stored procedures make cleaning repeatable and automated. Analysts can run multiple cleaning steps in one call, improving efficiency and consistency. This is a key skill in data analytics projects where automation reduces manual errors.
Section 25: Temporary Tables for Cleaning Sessions in SQL
Temporary tables let you store intermediate cleaning results without modifying the main dataset. They are useful in stepwise cleaning, staging data, and testing fixes. Just like staging tables in Power BI or ML projects, SQL temp tables provide a safe workspace.
Step 1: Create Raw Orders Table
CREATE TABLE RawOrders (
    order_id INT,
    cust_name VARCHAR(50),
    amount VARCHAR(20),
    status VARCHAR(20)
);
INSERT INTO RawOrders VALUES
    (1, ' Amit ', '2500.50', 'Completed'),
    (2, 'Priya', 'invalid', 'Pending'),
    (3, 'Rohit', '3000', NULL),
    (4, 'Sneha', '-500', 'Completed');
⚠️ The RawOrders table has spaces, invalid numbers, NULLs, and negatives.
1️⃣ CREATE TEMP TABLE – Scratchpad
CREATE TABLE #CleanOrders (
    order_id INT,
    cust_name VARCHAR(50),
    clean_amount DECIMAL(10,2),
    status VARCHAR(20)
);
✅ Creates a temporary table #CleanOrders for cleaned data. (The # prefix is SQL Server syntax; MySQL and PostgreSQL use CREATE TEMPORARY TABLE.)
2️⃣ INSERT Cleaned Data
INSERT INTO #CleanOrders
SELECT order_id,
       TRIM(cust_name),
       TRY_CAST(amount AS DECIMAL(10,2)) AS clean_amount,
       COALESCE(status, 'Pending') AS clean_status
FROM RawOrders;
✅ Loads cleaned records into the temp table.
3️⃣ JOIN Temp with Original
SELECT r.order_id, r.cust_name, c.clean_amount, c.status FROM RawOrders r JOIN #CleanOrders c ON r.order_id = c.order_id;
✅ Compares raw vs. clean records side by side.
4️⃣ UPDATE Temp Table
UPDATE #CleanOrders SET clean_amount = ABS(clean_amount) WHERE clean_amount < 0;
✅ Fixes negative amounts in the temp table.
5️⃣ DROP TEMP TABLE – End Session
DROP TABLE #CleanOrders;
✅ Removes the temp table after cleaning is complete.
Temporary tables provide a safe, session-based workspace for cleaning. Analysts can test fixes without harming original data. This is widely used in Power BI and ML pipelines where clean staging is crucial before loading into reports.
Section 26: Transactions & Rollback for Safe Cleaning in SQL
Transactions ensure cleaning steps happen safely. If an error occurs, you can ROLLBACK to undo changes. This prevents accidental data loss and ensures reliable analytics. Just like staging in predictive analytics, safe transactions protect your data before it's modeled.
Step 1: Create Payments Table
CREATE TABLE Payments (
    pay_id INT,
    cust_id INT,
    amount DECIMAL(10,2),
    status VARCHAR(20)
);
INSERT INTO Payments VALUES
    (1, 101, 2500.00, 'Completed'),
    (2, 102, -500.00, 'Completed'), -- invalid negative
    (3, 103, NULL, 'Pending'),      -- NULL amount
    (4, 104, 3000.00, 'Completed');
⚠️ The Payments table has negatives and NULL amounts.
1️⃣ BEGIN TRANSACTION – Start Cleaning
BEGIN TRANSACTION;
UPDATE Payments SET amount = ABS(amount) WHERE amount < 0;
✅ Starts a transaction and fixes negative values.
2️⃣ COMMIT – Save Changes
COMMIT;
✅ Confirms cleaning changes permanently.
3️⃣ ROLLBACK – Undo Mistakes
ROLLBACK;
✅ Cancels the transaction and restores original data.
4️⃣ SAVEPOINT – Partial Rollback
BEGIN TRANSACTION;
UPDATE Payments SET amount = 0 WHERE amount IS NULL;
SAVEPOINT FixNulls;
UPDATE Payments SET status = 'Checked' WHERE status = 'Pending';
ROLLBACK TO FixNulls;
COMMIT;
✅ Rolls back only the work done after the FixNulls savepoint. (SQL Server uses SAVE TRANSACTION FixNulls and ROLLBACK TRANSACTION FixNulls instead.)
5️⃣ TRY…CATCH – Handle Errors Safely
BEGIN TRY
    BEGIN TRANSACTION;
    UPDATE Payments SET amount = amount / 0; -- forces an error
    COMMIT;
END TRY
BEGIN CATCH
    ROLLBACK;
    PRINT 'Error occurred. Transaction rolled back.';
END CATCH;
✅ Ensures rollback on errors during cleaning.
Transactions make data cleaning safe and reversible. Analysts can fix negatives, NULLs, or duplicates with confidence, knowing mistakes can be undone. This is critical in predictive analytics where wrong cleaning steps can corrupt models.
Master SQL & Data Analytics with Vista Academy
You've just explored 50+ SQL commands for data cleaning. Ready to take your skills further? Join Vista Academy, Dehradun for hands-on Data Analytics & SQL Training with real-world projects. Build a job-ready portfolio and become an in-demand analyst in 2025.
Enroll Now – Start Your Data Analytics Journey | +91 94117 78145 | thevistaacademy@gmail.com | WhatsApp