Data Cleaning in SQL

50 SQL Commands for Data Cleaning in Data Analytics [2025 Guide]


Data cleaning is the foundation of Data Analytics. Without clean, structured data, even the most advanced models fail to deliver accurate insights. In this guide, we bring you the Top 50 SQL Commands every data analyst must master to handle duplicates, missing values, formatting errors, and more. 🚀 Updated for 2025, this guide is your step-by-step SQL data cleaning handbook.

Why SQL for Data Cleaning?

SQL (Structured Query Language) is not just for querying data; it is a powerful tool for cleaning and preparing datasets. From removing duplicates to handling NULL values, fixing text, correcting dates, and restructuring tables, SQL provides a robust foundation for data analysts and business intelligence professionals.

🔍 Accuracy

Remove duplicates, correct errors, and ensure reliable analysis.

⚡ Efficiency

Clean millions of records in seconds using optimized SQL queries.

📊 Standardization

Format text, dates, and numbers consistently across datasets.

πŸ† Section 1: Removing Duplicates in SQL

In Data Analytics, duplicate records are a common issue. They can distort results, inflate metrics, and mislead decision-making. SQL provides powerful commands to detect and remove duplicates effectively. Let’s clean a sample Sales table.

📊 Step 1: Create Table & Insert Data

CREATE TABLE Sales (
  order_id INT,
  customer_id INT,
  order_date DATE
);

INSERT INTO Sales VALUES
(1, 101, '2025-01-10'),
(2, 101, '2025-01-10'), -- Duplicate
(3, 102, '2025-01-12'),
(4, 103, '2025-01-14'),
(5, 103, '2025-01-14'); -- Duplicate
      

✅ Original Sales table contains duplicate customer_id values.

order_id | customer_id | order_date
1 | 101 | 2025-01-10
2 | 101 | 2025-01-10
3 | 102 | 2025-01-12
4 | 103 | 2025-01-14
5 | 103 | 2025-01-14

1️⃣ SELECT DISTINCT

SELECT DISTINCT customer_id 
FROM Sales;
        
customer_id
101
102
103

2️⃣ GROUP BY + HAVING

SELECT customer_id, COUNT(*) AS duplicate_count
FROM Sales
GROUP BY customer_id
HAVING COUNT(*) > 1;
        
customer_id | duplicate_count
101 | 2
103 | 2

3️⃣ DELETE Duplicate Rows

DELETE FROM Sales
WHERE rowid NOT IN (
  SELECT MIN(rowid) 
  FROM Sales 
  GROUP BY customer_id
);
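Note: rowid is available in SQLite and Oracle; SQL Server and MySQL have no rowid pseudo-column, so there you would use the ROW_NUMBER() approaches shown next.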
        
Remaining Rows
1 | 101 | 2025-01-10
3 | 102 | 2025-01-12
4 | 103 | 2025-01-14

4️⃣ ROW_NUMBER() Window

SELECT customer_id, 
ROW_NUMBER() OVER(
  PARTITION BY customer_id 
  ORDER BY order_date
) AS row_num
FROM Sales;
        
customer_id | row_num
101 | 1
101 | 2
103 | 1
103 | 2

5️⃣ CTE + ROW_NUMBER()

WITH duplicates AS (
  SELECT *, 
  ROW_NUMBER() OVER(
    PARTITION BY customer_id 
    ORDER BY order_date
  ) AS row_num
  FROM Sales
)
DELETE FROM duplicates WHERE row_num > 1;
        
Remaining Rows
1 | 101 | 2025-01-10
3 | 102 | 2025-01-12
4 | 103 | 2025-01-14
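
Deleting through a CTE as above is SQL Server syntax. On MySQL, which supports neither rowid nor deleting from a CTE, a sketch of the same dedupe using a multi-table DELETE against a self-join (assuming the Sales table above):

-- Keep the lowest order_id per customer, delete the rest (MySQL syntax)
DELETE s1
FROM Sales s1
JOIN Sales s2
  ON s1.customer_id = s2.customer_id
 AND s1.order_id > s2.order_id;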

✨ Section 3: Standardizing Data (Formatting)

Messy data often contains extra spaces, inconsistent capitalization, and spelling mistakes. SQL provides handy functions to standardize data into a clean and consistent format. Let’s clean our Products table step by step.

📊 Step 1: Create Table & Insert Data

CREATE TABLE Products (
  product_id INT,
  product_name VARCHAR(50),
  category VARCHAR(50)
);

INSERT INTO Products VALUES
(1, '  Laptop', 'ELECTRONICS '),
(2, 'MOBILE  ', 'electronics'),
(3, 'HeadPhones', ' Electronics'),
(4, 'tablet', 'ELECTRONICS'),
(5, '  CAMERA ', 'Electronics');
      

✅ The Products table has extra spaces, mixed cases, and inconsistent category names.

product_id | product_name | category
1 | '  Laptop' | 'ELECTRONICS '
2 | 'MOBILE  ' | 'electronics'
3 | 'HeadPhones' | ' Electronics'
4 | 'tablet' | 'ELECTRONICS'
5 | '  CAMERA ' | 'Electronics'

1️⃣ TRIM() – Remove Extra Spaces

SELECT product_id, 
       TRIM(product_name) AS clean_name
FROM Products;
        
Before | After
'  Laptop' | 'Laptop'
'MOBILE  ' | 'MOBILE'

2️⃣ UPPER() – Convert to Uppercase

SELECT product_id, 
       UPPER(category) AS clean_category
FROM Products;
        
Before | After
'electronics' | 'ELECTRONICS'
' Electronics' | ' ELECTRONICS' (UPPER() does not remove the leading space; combine with TRIM())

3️⃣ LOWER() – Convert to Lowercase

SELECT product_id, 
       LOWER(product_name) AS clean_name
FROM Products;
        
Before | After
'HeadPhones' | 'headphones'
'  CAMERA ' | '  camera '

4️⃣ INITCAP() – Capitalize Words

SELECT INITCAP(product_name) AS formatted_name
FROM Products;
        
Before | After
'tablet' | 'Tablet'
'MOBILE  ' | 'Mobile  '

Note: INITCAP() exists in Oracle and PostgreSQL; SQL Server and MySQL have no built-in equivalent.

5️⃣ REPLACE() – Fix Wrong Text

SELECT product_id, 
       REPLACE(category, 'Electronics', 'ELECTRONICS') AS fixed_category
FROM Products;
        
Before | After
'Electronics' | 'ELECTRONICS'
' Electronics' | ' ELECTRONICS' (the leading space stays; pair REPLACE() with TRIM() to remove it)

📝 Section 4: String Cleaning & Manipulation

String data often contains extra characters, inconsistent text, or wrong substrings. SQL offers powerful string functions to clean, extract, and format text values. If you’ve already practiced data cleaning with Pandas, you’ll notice many functions are quite similar in SQL as well. Let’s work on a sample Customers table.

📊 Step 1: Create Table & Insert Data

CREATE TABLE Customers (
  cust_id INT,
  cust_name VARCHAR(50),
  phone VARCHAR(20),
  email VARCHAR(50)
);

INSERT INTO Customers VALUES
(1, '  AMIT SHARMA  ', ' 9876543210 ', 'amit.sharma@ gmail.com'),
(2, 'rohit KUMAR', '91-9998887777', 'rohit.kumar@@yahoo.com'),
(3, 'PRIYA  ', '8887776666', ' priyaverma@gmail.com '),
(4, 'sneha', '999-111-222', 'sneha123@outlook. com');
      
cust_id | cust_name | phone | email
1 | '  AMIT SHARMA  ' | ' 9876543210 ' | 'amit.sharma@ gmail.com'
2 | 'rohit KUMAR' | '91-9998887777' | 'rohit.kumar@@yahoo.com'
3 | 'PRIYA  ' | '8887776666' | ' priyaverma@gmail.com '
4 | 'sneha' | '999-111-222' | 'sneha123@outlook. com'

1️⃣ TRIM() – Remove Extra Spaces

SELECT cust_id, TRIM(cust_name) AS clean_name
FROM Customers;
        
Before | After
'  AMIT SHARMA  ' | 'AMIT SHARMA'
'PRIYA  ' | 'PRIYA'

2️⃣ UPPER() & LOWER() – Fix Case

SELECT cust_id, UPPER(cust_name) AS upper_name,
       LOWER(cust_name) AS lower_name
FROM Customers;
        
Before | UPPER() | LOWER()
'rohit KUMAR' | 'ROHIT KUMAR' | 'rohit kumar'
'sneha' | 'SNEHA' | 'sneha'

3️⃣ SUBSTRING() – Extract Text

SELECT cust_id, SUBSTRING(phone, 1, 10) AS clean_phone
FROM Customers;
        
Before | After
'91-9998887777' | '91-9998887'
'999-111-222' | '999-111-22'

4️⃣ REPLACE() – Fix Wrong Characters

SELECT cust_id, REPLACE(email, ' ', '') AS clean_email
FROM Customers;
        
Before | After
'amit.sharma@ gmail.com' | 'amit.sharma@gmail.com'
'sneha123@outlook. com' | 'sneha123@outlook.com'

5️⃣ CONCAT() – Merge Fields

SELECT cust_id, CONCAT(cust_name, ' - ', phone) AS full_contact
FROM Customers;
        
Before | After
'AMIT SHARMA' + '9876543210' | 'AMIT SHARMA - 9876543210'
'PRIYA' + '8887776666' | 'PRIYA - 8887776666'

Mastering these string manipulation commands in SQL is a must for every analyst. Once you are comfortable with them, you can apply the same principles in tools like Power BI or Python to achieve seamless data preparation.

🔒 Section 5: Numeric Cleaning & Transformation

Numbers are the backbone of analytics, but raw data often contains inconsistent decimal places, negative values, or rounding issues. SQL provides handy functions to round, convert, and validate numeric data. If you’ve explored our guide on linear regression in machine learning, you already know how important clean numbers are for accurate predictions.

📊 Step 1: Create Table & Insert Data

CREATE TABLE SalesData (
  sale_id INT,
  amount DECIMAL(10,4),
  discount DECIMAL(5,2)
);

INSERT INTO SalesData VALUES
(1, 2500.4567, 5.5),
(2, -300.9876, 10.0),
(3, 1999.9999, 0.0),
(4, 150.75, NULL),
(5, 100.1234, 2.25);
      

✅ The SalesData table contains extra decimal places, negative values, and NULL discounts.

sale_id | amount | discount
1 | 2500.4567 | 5.5
2 | -300.9876 | 10.0
3 | 1999.9999 | 0.0
4 | 150.75 | NULL
5 | 100.1234 | 2.25

1️⃣ ROUND() – Fix Decimal Places

SELECT sale_id, 
       ROUND(amount, 2) AS rounded_amount
FROM SalesData;
        
Before | After
2500.4567 | 2500.46
1999.9999 | 2000.00

2️⃣ ABS() – Handle Negative Values

SELECT sale_id, 
       ABS(amount) AS positive_amount
FROM SalesData;
        
Before | After
-300.9876 | 300.9876

3️⃣ CEIL() & FLOOR()

SELECT sale_id, 
       CEIL(amount) AS ceil_value,
       FLOOR(amount) AS floor_value
FROM SalesData;
        
Before | CEIL() | FLOOR()
150.75 | 151 | 150
100.1234 | 101 | 100

Note: SQL Server spells it CEILING().

4️⃣ NULLIF() – Handle Divide by Zero

SELECT sale_id, 
       amount / NULLIF(discount, 0) AS safe_division
FROM SalesData;
        
Before | After
1999.9999 ÷ 0.0 = error | 1999.9999 ÷ NULL = NULL

5️⃣ CAST() – Convert Data Type

SELECT sale_id, 
       CAST(amount AS INT) AS int_amount
FROM SalesData;
        
Before | After
2500.4567 | 2500
100.1234 | 100

With these functions, numeric columns become analysis-ready. Clean numbers are crucial not only for SQL queries but also for advanced models in neural networks and deep learning, where even small decimal inconsistencies can affect predictions.

⏰ Section 6: Date & Time Cleaning in SQL

Dates are one of the trickiest parts of data cleaning. Inconsistent date formats can break time-series analysis, reporting, and forecasting. SQL offers powerful functions to clean, convert, and extract date and time values. If you’ve read our blog on time series analysis, you’ll see how crucial it is to prepare dates properly.

📊 Step 1: Create Table & Insert Data

CREATE TABLE Orders (
  order_id INT,
  order_date VARCHAR(20),
  delivery_date VARCHAR(20)
);

INSERT INTO Orders VALUES
(1, '2025-01-10', '2025/01/15'),
(2, '10-02-2025', '2025-02-20'),
(3, '2025/03/05', '2025-03-10'),
(4, '15-04-2025', '2025/04/25'),
(5, '2025-05-12', NULL);
      

✅ The Orders table has mixed date formats and missing delivery dates.

order_id | order_date | delivery_date
1 | 2025-01-10 | 2025/01/15
2 | 10-02-2025 | 2025-02-20
3 | 2025/03/05 | 2025-03-10
4 | 15-04-2025 | 2025/04/25
5 | 2025-05-12 | NULL

1️⃣ CAST() / CONVERT() – Standardize Dates

SELECT order_id,
       CAST(order_date AS DATE) AS clean_order_date
FROM Orders;
        
Before | After
'10-02-2025' | '2025-02-10'
'2025/03/05' | '2025-03-05'

Note: plain CAST relies on the server's language and date settings; for an explicit format code, use CONVERT with a style, as shown in later sections.

2️⃣ DATEPART() – Extract Components

SELECT order_id,
       DATEPART(year, CAST(order_date AS DATE)) AS order_year,
       DATEPART(month, CAST(order_date AS DATE)) AS order_month
FROM Orders;
        
Before | After
'2025-01-10' | Year = 2025, Month = 1

(DATEPART returns integers, so January is 1, not '01'.)

3️⃣ DATEDIFF() – Calculate Duration

SELECT order_id,
       DATEDIFF(day, CAST(order_date AS DATE), CAST(delivery_date AS DATE)) AS delivery_days
FROM Orders;
        
order_date | delivery_date | Days
2025-01-10 | 2025-01-15 | 5
2025-03-05 | 2025-03-10 | 5

4️⃣ GETDATE() / CURRENT_DATE

SELECT order_id,
       DATEDIFF(day, CAST(order_date AS DATE), GETDATE()) AS days_since_order
FROM Orders;
        
order_date | Today | Days Since
2025-01-10 | 2025-09-02 | 235

5️⃣ FORMAT() – Custom Date Format

SELECT order_id,
       FORMAT(CAST(order_date AS DATE), 'dd-MMM-yyyy') AS formatted_date
FROM Orders;
        
Before | After
'2025-01-10' | '10-Jan-2025'

With clean date fields, you can now build accurate time-series forecasts, dashboards, and reporting pipelines. This is especially useful in advanced projects like Walmart’s Inventory Prediction using AI.

📈 Section 7: Outlier Detection & Handling in SQL

Outliers are extreme values that deviate from the rest of your dataset. They can inflate averages, distort regression models, and mislead dashboards. SQL allows analysts to detect, cap, or remove outliers efficiently. If you’ve read our guide on data mining & predictive analytics, you know outlier treatment is a key preprocessing step.

📊 Step 1: Create Table & Insert Data

CREATE TABLE SalesAmount (
  sale_id INT,
  amount DECIMAL(10,2)
);

INSERT INTO SalesAmount VALUES
(1, 2500.00),
(2, 2600.00),
(3, 2700.00),
(4, 2800.00),
(5, 100000.00), -- Outlier
(6, 3000.00),
(7, 3100.00),
(8, 50000.00),  -- Outlier
(9, 3200.00),
(10, 3300.00);
      

✅ The SalesAmount table has two extreme outliers (100000, 50000) compared to normal values.

1️⃣ MIN() & MAX() – Identify Extremes

SELECT MIN(amount) AS min_val,
       MAX(amount) AS max_val
FROM SalesAmount;
        
Min | Max
2500.00 | 100000.00

2️⃣ AVG() + STDDEV() – Z-Score Detection

SELECT sale_id, amount
FROM SalesAmount
WHERE ABS(amount - (SELECT AVG(amount) FROM SalesAmount)) > 
      2 * (SELECT STDDEV(amount) FROM SalesAmount);
        
Detected Outliers
100000.00

Note: STDDEV is MySQL/PostgreSQL syntax (SQL Server uses STDEV). With this sample, 50000 actually stays within 2 standard deviations because the outliers themselves inflate the deviation, so only 100000 is flagged.

3️⃣ PERCENTILE_CONT() – IQR Method

WITH percentiles AS (
  SELECT 
    PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY amount) AS Q1,
    PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY amount) AS Q3
  FROM SalesAmount
)
SELECT Q1, Q3, (Q3-Q1) AS IQR
FROM percentiles;
        

✅ Outliers are values < Q1 - 1.5*IQR or > Q3 + 1.5*IQR.
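
Putting that rule to work, a minimal filtering sketch reusing the percentiles CTE (this ordered-set aggregate syntax runs on PostgreSQL/Oracle; SQL Server exposes PERCENTILE_CONT only as a window function):

WITH percentiles AS (
  SELECT
    PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY amount) AS Q1,
    PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY amount) AS Q3
  FROM SalesAmount
)
SELECT s.sale_id, s.amount
FROM SalesAmount s
CROSS JOIN percentiles p
WHERE s.amount < p.Q1 - 1.5 * (p.Q3 - p.Q1)
   OR s.amount > p.Q3 + 1.5 * (p.Q3 - p.Q1);  -- returns the two extreme sales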

4️⃣ CASE WHEN – Cap Outliers

SELECT sale_id,
       CASE WHEN amount > 10000 THEN 10000
            ELSE amount END AS capped_amount
FROM SalesAmount;
        
Before | After
50000.00 | 10000.00
100000.00 | 10000.00

5️⃣ DELETE – Remove Outliers

DELETE FROM SalesAmount
WHERE amount > 10000;
        

✅ Removes extreme outliers to retain only realistic sales values.

By detecting and handling outliers, analysts can ensure more reliable insights. Outlier handling is especially important in business analytics and hands-on data projects where accurate metrics drive decisions.

🚫 Section 8: Handling Missing Data (NULLs) in SQL

Missing values (NULLs) are one of the most frequent challenges in data cleaning. If left untreated, they can distort averages, break joins, and lead to incomplete reports. SQL provides powerful commands to detect, replace, or ignore NULLs. Handling them properly is essential, especially in data analysis workflows.

📊 Step 1: Create Table & Insert Data

CREATE TABLE CustomerOrders (
  order_id INT,
  customer_name VARCHAR(50),
  amount DECIMAL(10,2)
);

INSERT INTO CustomerOrders VALUES
(1, 'Amit', 2500.00),
(2, NULL, 3000.00),
(3, 'Priya', NULL),
(4, 'Sneha', 2000.00),
(5, NULL, NULL);
      

✅ The CustomerOrders table has NULL names and NULL amounts.

order_id | customer_name | amount
1 | Amit | 2500.00
2 | NULL | 3000.00
3 | Priya | NULL
4 | Sneha | 2000.00
5 | NULL | NULL

1️⃣ IS NULL / IS NOT NULL

SELECT * FROM CustomerOrders
WHERE customer_name IS NULL;
        
Detected Rows
order_id 2 → NULL name
order_id 5 → NULL name

2️⃣ COALESCE() – Replace with Default

SELECT order_id, 
       COALESCE(customer_name, 'Unknown') AS clean_name
FROM CustomerOrders;
        
Before | After
NULL | Unknown

3️⃣ IFNULL() – MySQL Specific

SELECT order_id, 
       IFNULL(amount, 0) AS clean_amount
FROM CustomerOrders;
        
Before | After
NULL | 0

4️⃣ NULLIF() – Avoid Division Errors

SELECT order_id,
       amount / NULLIF(amount, 0) AS ratio
FROM CustomerOrders;
        

✅ Prevents divide-by-zero errors by returning NULL instead of an error.

5️⃣ AVG() / COUNT() Ignore NULLs

SELECT AVG(amount) AS avg_amount,
       COUNT(amount) AS valid_records
FROM CustomerOrders;
        
AVG | COUNT
2500.00 (ignores NULLs) | 3 (only non-NULL amounts)
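
To see the NULL-skipping explicitly, compare COUNT(*) with COUNT(column) on the same table; COUNT(*) counts rows while COUNT(amount) counts only non-NULL values:

SELECT COUNT(*)      AS all_rows,          -- 5 rows in total
       COUNT(amount) AS non_null_amounts,  -- 3 non-NULL amounts
       AVG(amount)   AS avg_amount         -- 7500 / 3 = 2500.00
FROM CustomerOrders;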

Handling NULL values ensures your analysis is accurate and consistent. In real-world datasets, missing values are everywhere, especially in finance, healthcare, and surveys. Once mastered, you can apply these techniques in larger projects like data analytics workflows where clean data is the foundation of insights.

🔄 Section 9: Data Type Conversion & Standardization in SQL

In real-world datasets, columns often store values in the wrong data type. For example: numbers stored as text, dates stored as strings, or booleans stored as integers. These inconsistencies can break calculations and cause inaccurate reports. SQL provides multiple functions to convert and standardize data types. This is especially critical in business analytics workflows.

📊 Step 1: Create Table & Insert Data

CREATE TABLE MixedData (
  record_id INT,
  amount_text VARCHAR(20),
  date_text VARCHAR(20),
  status_flag INT
);

INSERT INTO MixedData VALUES
(1, '2500.75', '2025-01-10', 1),
(2, '3000.10', '10/02/2025', 0),
(3, '4500.00', '2025/03/05', 1),
(4, '1000', '15-04-2025', NULL),
(5, 'invalid', 'invalid_date', 1);
      

✅ The MixedData table has numbers stored as text, dates in multiple formats, and invalid entries.

record_id | amount_text | date_text | status_flag
1 | '2500.75' | '2025-01-10' | 1
2 | '3000.10' | '10/02/2025' | 0
3 | '4500.00' | '2025/03/05' | 1
4 | '1000' | '15-04-2025' | NULL
5 | 'invalid' | 'invalid_date' | 1

1️⃣ CAST() – Convert Text to Number

SELECT record_id, CAST(amount_text AS DECIMAL(10,2)) AS amount_num
FROM MixedData;
        
Before | After
'2500.75' | 2500.75
'1000' | 1000.00

Note: plain CAST raises an error on the 'invalid' row; see TRY_CAST() below for the safe version.

2️⃣ CONVERT() – Standardize Date Format

SELECT record_id, 
       CONVERT(DATE, date_text, 105) AS clean_date
FROM MixedData;
        

✅ Style 105 (dd-mm-yyyy) converts dates like 15-04-2025 into 2025-04-15; rows in other formats still need TRY_CONVERT or per-format handling.

3️⃣ TRY_CAST() – Safe Conversion

SELECT record_id,
       TRY_CAST(amount_text AS DECIMAL(10,2)) AS safe_amount
FROM MixedData;
        

✅ Invalid values (like 'invalid') return NULL instead of an error.

4️⃣ PARSE() – Handle Flexible Dates

SELECT record_id,
       PARSE(date_text AS DATE USING 'en-GB') AS parsed_date
FROM MixedData;
        

✅ Reads multiple formats like 10/02/2025 or 2025/03/05 using the given culture (PARSE is SQL Server-specific and follows .NET parsing rules).

5️⃣ FORMAT() – Custom Output

SELECT record_id,
       FORMAT(CAST(amount_text AS DECIMAL(10,2)), 'C', 'en-IN') AS formatted_amount
FROM MixedData;
        

✅ Displays the amount as ₹2,500.75 (Indian currency format).

Standardizing data types ensures that calculations, joins, and models run smoothly. This step is critical in advanced workflows like data science projects, where mixed types can break machine learning pipelines.

🔍 Section 10: Text Search & Pattern Cleaning in SQL

Text fields often contain typos, invalid formats, or hidden characters. SQL provides pattern-matching functions like LIKE, REGEXP, and PATINDEX to detect, validate, and clean text values. These techniques are also used in natural language processing (NLP) where string patterns matter a lot.

📊 Step 1: Create Table & Insert Data

CREATE TABLE Users (
  user_id INT,
  email VARCHAR(50),
  phone VARCHAR(20)
);

INSERT INTO Users VALUES
(1, 'amit.sharma@gmail', '98765-43210'),
(2, 'priya@@yahoo.com', '91 99988 87777'),
(3, 'rohit.kumar@gmail.com', '999-111-2222'),
(4, 'sneha@outlook', '8887776666'),
(5, 'test_user#123.com', '99999');
      

✅ The Users table has invalid emails, irregular phone formats, and short numbers.

user_id | email | phone
1 | 'amit.sharma@gmail' | '98765-43210'
2 | 'priya@@yahoo.com' | '91 99988 87777'
3 | 'rohit.kumar@gmail.com' | '999-111-2222'
4 | 'sneha@outlook' | '8887776666'
5 | 'test_user#123.com' | '99999'

1️⃣ LIKE – Find Invalid Emails

SELECT * FROM Users
WHERE email NOT LIKE '%@%.%';
        
Invalid Emails
amit.sharma@gmail
sneha@outlook
test_user#123.com

Note: 'priya@@yahoo.com' slips through this simple pattern; stricter checks need PATINDEX or REGEXP (below).

2️⃣ PATINDEX() – Detect Symbols

SELECT user_id, email
FROM Users
WHERE PATINDEX('%[^a-zA-Z0-9@._-]%', email) > 0;
        
Emails with Invalid Characters
test_user#123.com

3️⃣ REGEXP – Validate Phone Pattern

SELECT user_id, phone
FROM Users
WHERE phone NOT REGEXP '^[0-9]{10}$';
        
Invalid Phones
98765-43210
91 99988 87777
999-111-2222
99999
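
REGEXP is MySQL/MariaDB syntax; SQL Server has no REGEXP operator. A rough equivalent of the same 10-digit check using LIKE character classes:

-- T-SQL sketch: flag phones that are not exactly 10 digits
SELECT user_id, phone
FROM Users
WHERE LEN(phone) <> 10
   OR phone LIKE '%[^0-9]%';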

4️⃣ STUFF() – Mask Phone Numbers

SELECT user_id,
       STUFF(phone, 4, 3, 'XXX') AS masked_phone
FROM Users;
        
Before | After
9876543210 | 987XXX3210

5️⃣ REPLACE() – Clean Formatting

SELECT user_id,
       REPLACE(REPLACE(phone, '-', ''), ' ', '') AS clean_phone
FROM Users;
        
Before | After
91 99988 87777 | 919998887777
999-111-2222 | 9991112222

Mastering pattern-based text cleaning is essential for data validation. These SQL commands are also used in areas like cybersecurity and ethical hacking, where detecting unusual patterns can prevent fraud and attacks.

📊 Section 11: Data Aggregation & Summarization in SQL

Once data is cleaned, the next step is to aggregate and summarize it. Aggregation helps analysts turn raw transactions into insights like totals, averages, and counts. SQL provides functions like SUM(), COUNT(), GROUP BY, and HAVING to simplify this. If you’ve read our blog on steps of data analysis, this is where insights really start to appear.

📊 Step 1: Create Table & Insert Data

CREATE TABLE Sales (
  order_id INT,
  customer_id INT,
  region VARCHAR(20),
  amount DECIMAL(10,2)
);

INSERT INTO Sales VALUES
(1, 101, 'North', 2500.00),
(2, 101, 'North', 3000.00),
(3, 102, 'South', 1500.00),
(4, 103, 'East', 2000.00),
(5, 104, 'East', 2200.00),
(6, 105, 'West', 1800.00),
(7, 106, 'South', 1700.00),
(8, 107, 'North', 5000.00);
      

✅ The Sales table contains multiple transactions per customer and region.

order_id | customer_id | region | amount
1 | 101 | North | 2500.00
2 | 101 | North | 3000.00
3 | 102 | South | 1500.00
4 | 103 | East | 2000.00
5 | 104 | East | 2200.00
6 | 105 | West | 1800.00
7 | 106 | South | 1700.00
8 | 107 | North | 5000.00

1️⃣ GROUP BY – Region-wise Totals

SELECT region, SUM(amount) AS total_sales
FROM Sales
GROUP BY region;
        
Region | Total Sales
North | 10500.00
South | 3200.00
East | 4200.00
West | 1800.00

2️⃣ HAVING – Filter Groups

SELECT region, SUM(amount) AS total_sales
FROM Sales
GROUP BY region
HAVING SUM(amount) > 3000;
        
Region | Total Sales
North | 10500.00
East | 4200.00
East4200.00

3️⃣ COUNT(DISTINCT) – Unique Customers

SELECT region, COUNT(DISTINCT customer_id) AS unique_customers
FROM Sales
GROUP BY region;
        
Region | Unique Customers
North | 2
South | 2
East | 2
West | 1

4️⃣ AVG() – Average Sales

SELECT region, AVG(amount) AS avg_sales
FROM Sales
GROUP BY region;
        
Region | Avg Sales
North | 3500.00
South | 1600.00
East | 2100.00
West | 1800.00

5️⃣ ROLLUP – Subtotals + Grand Total

SELECT region, SUM(amount) AS total_sales
FROM Sales
GROUP BY ROLLUP(region);
        
Region | Total Sales
North | 10500.00
South | 3200.00
East | 4200.00
West | 1800.00
NULL (Grand Total) | 19700.00

Aggregation is where data becomes insight. By grouping, filtering, and summarizing, you create meaningful metrics for dashboards and reports. This process is core to data analyst careers where summarizing trends drives decision-making.

🔗 Section 12: Data Joins & Cleaning Across Tables in SQL

Most business data lives in multiple related tables. To prepare clean analytics datasets, we use joins to merge them. Joins not only connect data but also help detect missing records, mismatches, and duplicates. If you’ve practiced with our Pandas data cleaning guide, this section feels very similar β€” but with SQL.

📊 Step 1: Create Tables & Insert Data

CREATE TABLE Customers (
  cust_id INT,
  cust_name VARCHAR(50)
);

INSERT INTO Customers VALUES
(101, 'Amit'),
(102, 'Priya'),
(103, 'Rohit'),
(104, 'Sneha');

CREATE TABLE Orders (
  order_id INT,
  cust_id INT,
  amount DECIMAL(10,2)
);

INSERT INTO Orders VALUES
(1, 101, 2500.00),
(2, 101, 3000.00),
(3, 102, 1500.00),
(4, 105, 2000.00); -- 105 not in Customers
      

✅ The Customers table and Orders table have one mismatch (cust_id 105 is not in Customers).

cust_id | cust_name
101 | Amit
102 | Priya
103 | Rohit
104 | Sneha

order_id | cust_id | amount
1 | 101 | 2500.00
2 | 101 | 3000.00
3 | 102 | 1500.00
4 | 105 | 2000.00

1️⃣ INNER JOIN – Matching Records

SELECT o.order_id, c.cust_name, o.amount
FROM Orders o
INNER JOIN Customers c ON o.cust_id = c.cust_id;
        
order_id | cust_name | amount
1 | Amit | 2500.00
2 | Amit | 3000.00
3 | Priya | 1500.00

2️⃣ LEFT JOIN – Find Orders without Customers

SELECT o.order_id, c.cust_name, o.amount
FROM Orders o
LEFT JOIN Customers c ON o.cust_id = c.cust_id;
        
order_id | cust_name | amount
1 | Amit | 2500.00
2 | Amit | 3000.00
3 | Priya | 1500.00
4 | NULL | 2000.00

✅ Detects mismatches: cust_id 105 has no row in Customers, so its cust_name comes back NULL.

3️⃣ RIGHT JOIN – Customers without Orders

SELECT c.cust_name, o.order_id
FROM Orders o
RIGHT JOIN Customers c ON o.cust_id = c.cust_id;
        

✅ Keeps every customer, so those missing from Orders (Rohit, Sneha) show a NULL order_id.

4️⃣ FULL OUTER JOIN – All Data with NULLs

SELECT c.cust_name, o.order_id, o.amount
FROM Customers c
FULL OUTER JOIN Orders o ON c.cust_id = o.cust_id;
        

✅ Useful for audit reports – shows everything, even mismatches. (MySQL lacks FULL OUTER JOIN; emulate it with a UNION of LEFT and RIGHT joins.)

5️⃣ JOIN + IS NULL – Detect Missing Foreign Keys

SELECT o.order_id, o.cust_id
FROM Orders o
LEFT JOIN Customers c ON o.cust_id = c.cust_id
WHERE c.cust_id IS NULL;
        
order_id | cust_id
4 | 105

Joining and cleaning across tables ensures data consistency. Detecting mismatches helps prevent orphan records that can mislead reports. These join techniques are critical in data analytics workflows and in Python Pandas projects.

🛑 Section 13: Data Integrity & Constraint-Based Cleaning in SQL

Instead of cleaning dirty data after it arrives, why not stop bad data at the source? SQL constraints enforce integrity rules that prevent invalid, duplicate, or missing data. These techniques are crucial in data analyst workflows where quality assurance starts at the database design stage.

📊 Step 1: Create Products Table (with Issues)

CREATE TABLE Products (
  product_id INT,
  product_name VARCHAR(50),
  price DECIMAL(10,2),
  category VARCHAR(20)
);

INSERT INTO Products VALUES
(1, 'Laptop', 60000, 'Electronics'),
(1, 'Laptop', 60000, 'Electronics'), -- Duplicate ID
(2, NULL, 1200, 'Stationery'),       -- Missing name
(3, 'Pen', -10, 'Stationery'),       -- Negative price
(4, 'Shirt', 1500, 'InvalidCat');    -- Wrong category
      

✅ The Products table has duplicates, NULLs, negative values, and invalid categories.

1️⃣ PRIMARY KEY – Prevent Duplicates

ALTER TABLE Products
ADD CONSTRAINT pk_product PRIMARY KEY(product_id);
        

❌ Stops duplicate product IDs from being inserted. (Adding the constraint will fail while duplicate IDs already exist, so remove them first.)

2️⃣ UNIQUE – Enforce Uniqueness

ALTER TABLE Products
ADD CONSTRAINT unq_product_name UNIQUE(product_name);
        

❌ Prevents two products with the same name.

3️⃣ NOT NULL – No Missing Fields

ALTER TABLE Products
ALTER COLUMN product_name VARCHAR(50) NOT NULL;
        

❌ Ensures every product has a name.

4️⃣ CHECK – Validate Ranges

ALTER TABLE Products
ADD CONSTRAINT chk_price CHECK(price > 0);
        

❌ Blocks negative prices from being entered.

5️⃣ FOREIGN KEY – Maintain Relationships

ALTER TABLE Products
ADD CONSTRAINT fk_category FOREIGN KEY(category)
REFERENCES Categories(category_name);
        

❌ Ensures only valid categories (from Categories table) are allowed.
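
The FOREIGN KEY above presumes a Categories lookup table whose category_name column is unique; a minimal sketch of that assumed table:

CREATE TABLE Categories (
  category_name VARCHAR(20) PRIMARY KEY
);

INSERT INTO Categories VALUES
('Electronics'),
('Stationery'),
('Clothing');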

Constraints are like data bodyguards: they prevent invalid records before they even enter your system. Using them reduces manual cleaning effort and ensures reliable datasets. This approach is vital in business analytics where decisions rely on high-quality data.

⚡ Section 14: Indexing & Performance-Aware Cleaning in SQL

When datasets grow huge, even simple cleaning operations like DELETE, DISTINCT, or UPDATE can become slow and expensive. SQL indexing ensures faster lookups, duplicate detection, and cleaning operations. Think of it as shortcuts for your database. If you’ve read our predictive analytics guide, you already know performance and efficiency directly affect insights.

📊 Step 1: Create Table with Duplicates

CREATE TABLE LargeSales (
  sale_id INT,
  customer_id INT,
  amount DECIMAL(10,2)
);

INSERT INTO LargeSales VALUES
(1, 101, 2500.00),
(2, 102, 3000.00),
(3, 103, 1800.00),
(4, 101, 2500.00), -- duplicate customer
(5, 104, 2200.00);
      

✅ The LargeSales table has duplicate customer_ids and needs optimization.

1️⃣ CREATE INDEX – Faster Lookups

CREATE INDEX idx_customer
ON LargeSales(customer_id);
        

✅ Speeds up duplicate detection queries like SELECT DISTINCT or GROUP BY customer_id.

2️⃣ UNIQUE INDEX – Prevent Duplicates

CREATE UNIQUE INDEX unq_customer
ON LargeSales(customer_id);
        

❌ Blocks insertion of duplicate customer IDs at the database level. (Creating the unique index will fail while the existing duplicate rows remain, so dedupe first.)

3️⃣ CLUSTERED INDEX – Organize Data

CREATE CLUSTERED INDEX cl_idx_sales
ON LargeSales(sale_id);
        

✅ Stores rows physically by sale_id, improving query speed for sequential cleaning tasks.

4️⃣ NONCLUSTERED INDEX – Flexible Filtering

CREATE NONCLUSTERED INDEX ncl_idx_amount
ON LargeSales(amount);
        

✅ Helps when cleaning unusual numeric values like outliers in amount.

5️⃣ DROP INDEX – Optimize Bulk Cleaning

DROP INDEX idx_customer ON LargeSales;
        

✅ Dropping indexes before a bulk DELETE or UPDATE speeds up cleaning; recreate them afterwards.
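
As a sketch of the drop, bulk-clean, recreate pattern (reusing the LargeSales table and idx_customer index above; the DELETE filter is only an example):

DROP INDEX idx_customer ON LargeSales;

DELETE FROM LargeSales
WHERE amount <= 0;  -- bulk cleanup runs faster without per-row index maintenance

CREATE INDEX idx_customer ON LargeSales(customer_id);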

Indexes are like shortcuts for your database. They enforce uniqueness, accelerate duplicate removal, and make cleaning large datasets practical. For data analysts working with big datasets, mastering indexing ensures both accuracy and speed.

🔄 Section 15: Data Normalization & Standardization in SQL

Raw datasets often contain inconsistent spellings, cases, and formats. For example: India, india, IN may all appear in the same column. SQL provides functions to normalize and standardize values, ensuring data is clean, consistent, and ready for analysis. These techniques are widely used in data analytics projects.

📊 Step 1: Create Table & Insert Data

CREATE TABLE CustomersNorm (
  cust_id INT,
  country VARCHAR(50),
  gender VARCHAR(10)
);

INSERT INTO CustomersNorm VALUES
(1, ' India ', 'M'),
(2, 'india', 'male'),
(3, 'INDIA', 'F'),
(4, 'USA', 'Female'),
(5, 'U.S.A', 'f');
      

✅ The CustomersNorm table has inconsistent cases, spaces, abbreviations, and gender formats.

1️⃣ TRIM() – Remove Extra Spaces

SELECT cust_id, TRIM(country) AS clean_country
FROM CustomersNorm;
        
Before | After
' India ' | 'India'

2️⃣ UPPER()/LOWER() – Fix Case

SELECT cust_id, UPPER(country) AS norm_country
FROM CustomersNorm;
        
Before | After
'india' | 'INDIA'

3️⃣ REPLACE() – Standardize Abbreviations

SELECT cust_id,
       REPLACE(country, 'U.S.A', 'USA') AS clean_country
FROM CustomersNorm;
        
Before | After
'U.S.A' | 'USA'

4️⃣ CASE WHEN – Map Gender Values

SELECT cust_id,
       CASE 
         WHEN gender IN ('M','male') THEN 'Male'
         WHEN gender IN ('F','female') THEN 'Female'
       END AS norm_gender
FROM CustomersNorm;
        
Before | After
'M' | 'Male'
'f' | 'Female'

5️⃣ JOIN Reference Table – Enforce Standards

-- Assuming we have a CountryReference table
SELECT c.cust_id, r.country_name
FROM CustomersNorm c
JOIN CountryReference r
ON UPPER(c.country) = UPPER(r.country_alias);
        

✅ Ensures only valid country names from a reference table are used.
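
A minimal sketch of the assumed CountryReference table, mapping known aliases to one canonical name (extend the rows to cover whatever spellings your data contains):

CREATE TABLE CountryReference (
  country_alias VARCHAR(50),
  country_name  VARCHAR(50)
);

INSERT INTO CountryReference VALUES
('INDIA', 'India'),
('USA', 'United States'),
('U.S.A', 'United States');
-- combine with TRIM() on c.country so padded values like ' India ' still match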

Normalization ensures that different spellings and formats mean the same thing. By standardizing values, analysts prevent duplicate categories and improve data accuracy. This step is vital in data analytics workflows where consistent values lead to more reliable insights.

⏰ Section 16: Date & Time Cleaning in SQL

Dates in datasets often come in mixed formats, with NULLs or invalid values. Cleaning them ensures accurate time-series analysis, forecasting, and reporting. SQL provides built-in functions to standardize dates and extract useful parts. If you’ve read our linear regression guide, you know how important clean dates are for trend analysis.

📊 Step 1: Create Table & Insert Data

CREATE TABLE Transactions (
  trans_id INT,
  trans_date VARCHAR(20),
  amount DECIMAL(10,2)
);

INSERT INTO Transactions VALUES
(1, '2025-01-05', 2500.00),
(2, '05/02/2025', 3000.00),
(3, 'March 10, 2025', 1800.00),
(4, NULL, 2200.00),
(5, 'invalid_date', 1500.00);
      

✅ The Transactions table has mixed date formats, NULLs, and invalid values.

1️⃣ CAST/CONVERT – Standardize Format

SELECT trans_id,
       CONVERT(DATE, trans_date, 103) AS clean_date
FROM Transactions;


✅ Style 103 (dd/mm/yyyy) converts '05/02/2025' into '2025-02-05' (YYYY-MM-DD format). Rows in other formats still raise errors, which is why TRY_CONVERT comes next.

2️⃣ TRY_CONVERT – Safe Conversion

SELECT trans_id,
       TRY_CONVERT(DATE, trans_date) AS safe_date
FROM Transactions;
        

✅ Invalid dates (like 'invalid_date') return NULL instead of an error.

3️⃣ DATEPART() – Extract Components

SELECT trans_id,
       DATEPART(YEAR, TRY_CONVERT(DATE, trans_date)) AS year,
       DATEPART(MONTH, TRY_CONVERT(DATE, trans_date)) AS month
FROM Transactions;
        

✅ Extracts year and month for time-based grouping.

4️⃣ DATEDIFF() – Calculate Durations

SELECT trans_id,
       DATEDIFF(DAY, TRY_CONVERT(DATE, trans_date), GETDATE()) AS days_old
FROM Transactions;
        

✅ Calculates how many days old each transaction is.

5️⃣ COALESCE – Replace NULL Dates

SELECT trans_id,
       COALESCE(TRY_CONVERT(DATE, trans_date), '2025-01-01') AS final_date
FROM Transactions;
        

✅ Fills missing/invalid dates with a default date.

Cleaning dates and timestamps is essential for reliable time-series insights. These SQL techniques ensure consistent formats, safe conversions, and complete values. They are widely used in forecasting, reporting, and machine learning models where accurate dates drive predictions.

🚨 Section 17: Outlier Detection & Cleaning in SQL

Outliers can distort averages, create misleading insights, and break machine learning models. SQL offers multiple ways to detect and clean outliers using statistics and conditions. If you’ve read our data science guide, you know outlier handling is a must before model training.

📊 Step 1: Create Table with Outliers

CREATE TABLE SalesOutliers (
  sale_id INT,
  amount DECIMAL(10,2)
);

INSERT INTO SalesOutliers VALUES
(1, 2500.00),
(2, 2600.00),
(3, 2700.00),
(4, 100000.00), -- Extreme high outlier
(5, 150.00),    -- Extreme low outlier
(6, 2800.00),
(7, 2900.00);
      

✅ The SalesOutliers table has one extremely high and one extremely low value.

1️⃣ AVG + STDEV – Z-Score Method

SELECT sale_id, amount
FROM SalesOutliers
WHERE ABS(amount - (SELECT AVG(amount) FROM SalesOutliers)) 
      < 3 * (SELECT STDEV(amount) FROM SalesOutliers);
        

✅ Keeps only values within 3 standard deviations. Note: in a sample this small, the 100000 outlier inflates the standard deviation enough to stay within 3σ, so use a smaller multiplier or the IQR method below.

2️⃣ IQR Filtering – Boxplot Method

WITH Quartiles AS (
  SELECT 
    PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY amount) AS Q1,
    PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY amount) AS Q3
  FROM SalesOutliers
)
SELECT s.*
FROM SalesOutliers s, Quartiles q
WHERE s.amount BETWEEN (Q1 - 1.5*(Q3-Q1)) AND (Q3 + 1.5*(Q3-Q1));
        

✅ Filters out values beyond the IQR range.

3️⃣ PERCENTILE_CONT – Winsorization

SELECT PERCENTILE_CONT(0.05) WITHIN GROUP (ORDER BY amount) AS p05,
       PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY amount) AS p95
FROM SalesOutliers;
        

✅ Identifies the 5th–95th percentile cutoffs for capping extreme values.
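
To apply the winsorization, a minimal capping sketch built on those cutoffs (same PostgreSQL/Oracle ordered-set syntax as the query above):

WITH cutoffs AS (
  SELECT PERCENTILE_CONT(0.05) WITHIN GROUP (ORDER BY amount) AS p05,
         PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY amount) AS p95
  FROM SalesOutliers
)
SELECT s.sale_id,
       CASE WHEN s.amount < c.p05 THEN c.p05   -- raise extreme lows
            WHEN s.amount > c.p95 THEN c.p95   -- cap extreme highs
            ELSE s.amount
       END AS winsorized_amount
FROM SalesOutliers s
CROSS JOIN cutoffs c;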

4️⃣ CASE WHEN – Cap Outliers

SELECT sale_id,
       CASE 
         WHEN amount > 5000 THEN 5000
         WHEN amount < 500 THEN 500
         ELSE amount
       END AS capped_amount
FROM SalesOutliers;
        

✅ Caps outliers to a safe range (500–5000).

5️⃣ DELETE – Remove Outliers Permanently

DELETE FROM SalesOutliers
WHERE amount > 5000 OR amount < 500;
        

✅ Deletes extreme records completely (use with caution).

Handling outliers ensures more stable averages, cleaner visuals, and accurate models. Depending on the business case, you can remove, cap, or transform extreme values. This step is critical in data science workflows where outliers often make or break model performance.

🔍 Section 18: Data Deduplication Across Tables in SQL

When merging datasets from different sources, duplicate records across tables are very common. SQL provides set operations and joins to detect, remove, and consolidate duplicates. This ensures analysts work with unique, high-quality records. If you’ve read our 7 steps of data analysis, deduplication comes right after data collection.

📊 Step 1: Create Tables with Overlap

CREATE TABLE Customers_A (
  cust_id INT,
  cust_name VARCHAR(50)
);

INSERT INTO Customers_A VALUES
(1, 'Amit'),
(2, 'Priya'),
(3, 'Rohit');

CREATE TABLE Customers_B (
  cust_id INT,
  cust_name VARCHAR(50)
);

INSERT INTO Customers_B VALUES
(2, 'Priya'),
(3, 'Rohit'),
(4, 'Sneha');
      

✅ Both tables contain overlapping records for Priya and Rohit.

1️⃣ UNION – Remove Duplicates Across Tables

SELECT cust_id, cust_name FROM Customers_A
UNION
SELECT cust_id, cust_name FROM Customers_B;
        

✅ Combines both tables but keeps unique records only.

2️⃣ UNION ALL + GROUP BY – Find True Duplicates

SELECT cust_id, cust_name, COUNT(*) AS duplicate_count
FROM (
  SELECT cust_id, cust_name FROM Customers_A
  UNION ALL
  SELECT cust_id, cust_name FROM Customers_B
) t
GROUP BY cust_id, cust_name
HAVING COUNT(*) > 1;
        

✅ Identifies which records appear in both tables.

3️⃣ EXCEPT/MINUS – Find Non-Matching Records

SELECT cust_id, cust_name FROM Customers_A
EXCEPT
SELECT cust_id, cust_name FROM Customers_B;
        

✅ Shows records that exist in Customers_A only (EXCEPT in SQL Server/PostgreSQL, MINUS in Oracle).

4️⃣ LEFT JOIN + IS NULL – Detect Missing Matches

SELECT a.cust_id, a.cust_name
FROM Customers_A a
LEFT JOIN Customers_B b
ON a.cust_id = b.cust_id
WHERE b.cust_id IS NULL;
        

✅ Finds records in A but not in B (non-overlaps).

5️⃣ INSERT DISTINCT – Merge Clean Data

INSERT INTO MasterCustomers (cust_id, cust_name)
SELECT DISTINCT cust_id, cust_name
FROM (
  SELECT * FROM Customers_A
  UNION
  SELECT * FROM Customers_B
) t;
        

✅ Inserts only unique records into the master table.
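
The INSERT above assumes a MasterCustomers target table already exists; a minimal sketch of that assumed table:

CREATE TABLE MasterCustomers (
  cust_id INT PRIMARY KEY,
  cust_name VARCHAR(50)
);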

Deduplication ensures one record = one entity, even if data comes from multiple sources. It prevents double-counting and improves the accuracy of reports and dashboards. This step is a must in data analysis pipelines before joining data with other business tables.

🔀 Section 19: Data Type Conversion & Cleaning in SQL

Datasets often have values stored in the wrong data type: numbers saved as text, or dates written as strings. This causes errors in calculations and reporting. SQL provides conversion functions to clean and standardize data. If you’ve read our neural network blog, you’ll know that even advanced models require correct data types to function properly.

📊 Step 1: Create Table with Mixed Types

CREATE TABLE MixedData (
  record_id INT,
  amount_text VARCHAR(20),
  date_text VARCHAR(20)
);

INSERT INTO MixedData VALUES
(1, '2500.50', '2025-01-05'),
(2, '3000', '05/02/2025'),
(3, 'invalid_num', 'March 10, 2025'),
(4, NULL, 'invalid_date');
      

✅ The MixedData table has numbers stored as text and inconsistent date formats.

1️⃣ CAST – Basic Conversion

SELECT record_id,
       CAST(amount_text AS DECIMAL(10,2)) AS clean_amount
FROM MixedData;
        

✅ Converts '2500.50' (text) into 2500.50 (decimal); the 'invalid_num' row raises an error here, which TRY_CAST below avoids.

2️⃣ CONVERT – Date Conversion

SELECT record_id,
       CONVERT(DATE, date_text, 103) AS clean_date
FROM MixedData;


✅ Style 103 (dd/mm/yyyy) converts '05/02/2025' into '2025-02-05' (standard format).

3️⃣ TRY_CAST – Safe Conversion

SELECT record_id,
       TRY_CAST(amount_text AS DECIMAL(10,2)) AS safe_amount
FROM MixedData;
        

✅ Invalid numbers ('invalid_num') become NULL instead of causing errors.

4️⃣ TRY_CONVERT – Safe Date Conversion

SELECT record_id,
       TRY_CONVERT(DATE, date_text) AS safe_date
FROM MixedData;
        

✅ Invalid dates ('invalid_date') safely return NULL.

5️⃣ PARSE – Complex Conversions

SELECT record_id,
       PARSE(date_text AS DATE USING 'en-US') AS parsed_date
FROM MixedData;
        

✅ Converts 'March 10, 2025' into a proper DATE field.

With correct data types, analysts can perform calculations, joins, and models reliably. Type conversion is a critical step before analytics or machine learning, where incorrect formats can cause errors or wrong predictions.

✅ Section 20: Data Validation Queries in SQL

Data validation ensures that records follow business rules before they are used for analytics. Using SQL, analysts can easily detect invalid, missing, or inconsistent values. This step is critical in business analysis, where data quality directly impacts decision-making.

📊 Step 1: Create Table with Invalid Records

CREATE TABLE SalesValidation (
  sale_id INT,
  product_name VARCHAR(50),
  amount DECIMAL(10,2),
  category VARCHAR(20),
  sale_date DATE
);

INSERT INTO SalesValidation VALUES
(1, 'Laptop', 50000, 'Electronics', '2025-01-15'),
(2, NULL, 1500, 'Stationery', '2025-02-05'),   -- Missing product name
(3, 'Pen', -10, 'Stationery', '2025-02-10'),   -- Negative amount
(4, 'Shirt', 2000, 'InvalidCat', '2025-03-01'),-- Wrong category
(5, 'Book', 800, 'Stationery', '2030-01-01');  -- Future date
      

✅ The SalesValidation table has NULLs, negatives, wrong categories, and future dates.

1️⃣ IS NULL – Detect Missing Fields

SELECT * FROM SalesValidation
WHERE product_name IS NULL;
        

✅ Finds records where product_name is missing.

2️⃣ CASE WHEN – Check for Negative Values

SELECT sale_id, amount,
       CASE WHEN amount < 0 THEN 'Invalid' ELSE 'Valid' END AS status
FROM SalesValidation;
        

✅ Flags sales with negative amounts.

3️⃣ NOT IN – Validate Allowed Categories

SELECT * FROM SalesValidation
WHERE category NOT IN ('Electronics', 'Stationery', 'Clothing');
        

✅ Finds products with invalid categories.

4️⃣ BETWEEN – Detect Out-of-Range Values

SELECT * FROM SalesValidation
WHERE amount NOT BETWEEN 1 AND 100000;
        

✅ Ensures amount stays within the expected range.

5️⃣ Date Validation – No Future Dates

SELECT * FROM SalesValidation
WHERE sale_date > GETDATE();
        

✅ Detects future transaction dates (GETDATE() is SQL Server; use CURRENT_DATE in MySQL/PostgreSQL).

Validation queries act like data quality checkpoints. They ensure values match business expectations before being used in reports. For business analysts, mastering these checks is key to delivering trustworthy insights.

🔍 Section 21: Data Cleaning with Subqueries in SQL

Subqueries allow analysts to run a query inside another query. They are widely used in data cleaning for detecting duplicates, filtering invalid records, and keeping only the latest or valid entries. Similar to loops in Python, subqueries repeat logic efficiently within SQL itself.

📊 Step 1: Create Table with Dirty Records

CREATE TABLE CustomerOrders (
  order_id INT,
  cust_id INT,
  order_date DATE,
  amount DECIMAL(10,2)
);

INSERT INTO CustomerOrders VALUES
(1, 101, '2025-01-01', 2500.00),
(2, 102, '2025-01-05', NULL),       -- Missing amount
(3, 103, '2025-01-10', 1800.00),
(4, 103, '2025-01-12', 2000.00),    -- Duplicate customer with later date
(5, 104, '2024-12-25', -500.00);    -- Negative amount
      

✅ The CustomerOrders table has NULLs, duplicates, and negative values.

1️⃣ IN – Keep Only Valid Customers

SELECT * FROM CustomerOrders
WHERE cust_id IN (
  SELECT DISTINCT cust_id FROM CustomerOrders WHERE amount > 0
);
        

✅ Keeps only customers with valid positive amounts.

2️⃣ NOT IN – Remove Customers with NULLs

SELECT * FROM CustomerOrders
WHERE cust_id NOT IN (
  SELECT cust_id FROM CustomerOrders WHERE amount IS NULL
);
        

✅ Excludes customers with NULL amounts.

3️⃣ EXISTS – Detect Duplicate Customers

SELECT * FROM CustomerOrders c1
WHERE EXISTS (
  SELECT 1 FROM CustomerOrders c2
  WHERE c1.cust_id = c2.cust_id
  AND c1.order_id <> c2.order_id
);
        

✅ Flags customers that appear in multiple orders.

4️⃣ MAX() with Subquery – Keep Latest Order

SELECT * FROM CustomerOrders c
WHERE order_date = (
  SELECT MAX(order_date) FROM CustomerOrders c2
  WHERE c.cust_id = c2.cust_id
);
        

✅ Keeps only the latest order per customer.

5️⃣ DELETE with Subquery – Remove Negatives

DELETE FROM CustomerOrders
WHERE amount < 0
AND cust_id IN (
  SELECT cust_id FROM CustomerOrders
);
        

✅ Deletes negative transactions using a subquery filter (the IN subquery is illustrative; the amount condition alone would suffice here).

Subqueries are powerful tools for smart filtering and cleaning. They help analysts remove invalid values, keep latest records, and detect duplicates. Just like Python loops, subqueries repeat logic without writing separate queries each time.

🧩 Section 22: CTEs for Stepwise Data Cleaning in SQL

Common Table Expressions (CTEs) allow you to write stepwise SQL queries. They make cleaning easier to read and manage, especially when handling duplicates, NULLs, and row numbering. Think of them as temporary views within a query. If you’ve explored our Pandas cleaning guide, you’ll see how CTEs give SQL similar stepwise clarity.

📊 Step 1: Create Sample Orders Table

CREATE TABLE OrdersCTE (
  order_id INT,
  cust_id INT,
  amount DECIMAL(10,2),
  order_date DATE
);

INSERT INTO OrdersCTE VALUES
(1, 101, 2500.00, '2025-01-01'),
(2, 102, NULL, '2025-01-05'),     -- NULL value
(3, 103, 1800.00, '2025-01-10'),
(4, 103, 1800.00, '2025-01-10'), -- Duplicate
(5, 104, -500.00, '2025-01-15'); -- Negative value
      

✅ The OrdersCTE table has NULLs, duplicates, and negatives.

1️⃣ Remove Duplicates with ROW_NUMBER()

WITH Dedup AS (
  SELECT *,
         ROW_NUMBER() OVER (
           PARTITION BY cust_id, amount, order_date
           ORDER BY order_id
         ) AS rn
  FROM OrdersCTE
)
SELECT * FROM Dedup WHERE rn = 1;
        

✅ Assigns row numbers and keeps only the first occurrence.

2️⃣ Fill NULLs Stepwise

WITH FillNulls AS (
  SELECT order_id, cust_id,
         COALESCE(amount, 0) AS clean_amount,
         order_date
  FROM OrdersCTE
)
SELECT * FROM FillNulls;
        

✅ Replaces NULL amounts with 0.

3️⃣ Detect Negative Values

WITH Negatives AS (
  SELECT * FROM OrdersCTE WHERE amount < 0
)
SELECT * FROM Negatives;
        

✅ Finds negative transactions.

4️⃣ Multi-Step Cleaning

WITH Step1 AS (
  SELECT *, COALESCE(amount,0) AS clean_amt FROM OrdersCTE
),
Step2 AS (
  SELECT *, ABS(clean_amt) AS abs_amt FROM Step1
)
SELECT * FROM Step2;
        

✅ Chains multiple cleaning steps: fill NULLs → fix negatives.

5️⃣ DELETE Duplicates with CTE

WITH Dedup AS (
  SELECT *,
         ROW_NUMBER() OVER(
           PARTITION BY cust_id, amount, order_date
           ORDER BY order_id
         ) AS rn
  FROM OrdersCTE
)
DELETE FROM Dedup WHERE rn > 1;
        

✅ Deletes duplicate rows while keeping the first one.

CTEs make complex cleaning step-by-step and readable. They are especially useful for handling duplicates, NULLs, and negatives systematically. Analysts can combine multiple cleaning tasks in a single structured query, similar to how we chain cleaning steps in Pandas.

👓 Section 23: Views for Consistent Data Cleaning in SQL

A VIEW in SQL acts like a saved query: a virtual table that ensures consistent cleaning logic across teams. Instead of rewriting cleaning steps for every report, analysts can build a view once and reuse it. This is crucial in enterprise analytics and data analyst workflows.

📊 Step 1: Create Employees Table with Issues

CREATE TABLE RawEmployees (
  emp_id INT,
  emp_name VARCHAR(50),
  salary VARCHAR(20),
  dept VARCHAR(20)
);

INSERT INTO RawEmployees VALUES
(1, '  Amit  ', '50000', 'HR'),
(2, 'Priya', 'invalid', 'Finance'),
(3, 'Rohit', '60000', 'Sales'),
(4, 'Sneha', NULL, 'HR'),
(5, 'Amit', '50000', 'HR'); -- duplicate
      

✅ The RawEmployees table has extra spaces, duplicates, invalid salaries, and NULLs.

1️⃣ CREATE VIEW – Clean Data Once

CREATE VIEW CleanEmployees AS
SELECT DISTINCT 
       emp_id,
       TRIM(emp_name) AS clean_name,
       TRY_CAST(salary AS DECIMAL(10,2)) AS clean_salary,
       dept
FROM RawEmployees
WHERE salary IS NOT NULL;
        

✅ Standardizes salaries, trims names, and removes NULLs.

2️⃣ UPDATE Through a VIEW

UPDATE CleanEmployees
SET clean_salary = 55000
WHERE clean_name = 'Rohit';
        

✅ Fixes incorrect values via the view. Note: a view built with DISTINCT and computed columns (like CleanEmployees above) is not updatable in most engines, so in practice such fixes go against the base table.

3️⃣ JOIN in a VIEW – Add Context

CREATE VIEW EmployeeWithDept AS
SELECT e.emp_id, e.clean_name, e.clean_salary, d.manager
FROM CleanEmployees e
JOIN Departments d ON e.dept = d.dept_name;
        

✅ Enriches cleaned data with manager info from a Departments table.
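
A minimal sketch of the assumed Departments table (dept_name and manager as referenced in the view; the manager names are placeholders):

CREATE TABLE Departments (
  dept_name VARCHAR(20),
  manager VARCHAR(50)
);

INSERT INTO Departments VALUES
('HR', 'Anita'),       -- placeholder names
('Finance', 'Karan'),
('Sales', 'Vikram');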

4️⃣ Indexed View – Faster Cleaning

CREATE UNIQUE CLUSTERED INDEX idx_clean_emp
ON CleanEmployees(emp_id);
        

✅ Adds performance to repetitive cleaning queries. (SQL Server indexed views require SCHEMABINDING and deterministic expressions, so the CleanEmployees view above would need adjustments.)

5️⃣ DROP VIEW – Reset Cleaning Logic

DROP VIEW CleanEmployees;
        

✅ Deletes the view when cleaning rules change.

Views ensure consistent cleaning rules across all analysts and dashboards. Instead of rewriting cleaning queries, you reuse one virtual table. This is essential in enterprise analytics where multiple teams depend on the same clean data.

⚙️ Section 24: Stored Procedures for Automated Cleaning in SQL

Stored procedures bundle cleaning logic into a reusable block of SQL code. Instead of writing multiple queries, analysts just call the procedure. This makes cleaning faster, automated, and secure. In enterprise analytics, automation is a key step after data preparation.

📊 Step 1: Create Dirty Products Table

CREATE TABLE DirtyProducts (
  prod_id INT,
  prod_name VARCHAR(50),
  price VARCHAR(20),
  stock INT
);

INSERT INTO DirtyProducts VALUES
(1, ' Laptop ', '50000.99', 10),
(2, 'Mobile', 'invalid', 5),
(3, 'Tablet', '15000', NULL),
(4, 'Mobile', '20000', 5); -- duplicate
      

✅ The DirtyProducts table has extra spaces, invalid prices, NULL stock, and duplicates.

1️⃣ CREATE PROCEDURE – Define Cleaning Logic

CREATE PROCEDURE CleanProducts AS
BEGIN
  UPDATE DirtyProducts
  SET prod_name = TRIM(prod_name),
      price = TRY_CAST(price AS DECIMAL(10,2)),
      stock = COALESCE(stock, 0);
END;
        

✅ Creates a procedure to clean names, fix prices, and fill NULL stock.

2️⃣ EXEC – Run the Procedure

EXEC CleanProducts;
        

✅ Executes the cleaning procedure in one step.

3️⃣ Procedure with Parameters

CREATE PROCEDURE UpdatePrice (@prod_id INT, @new_price DECIMAL(10,2)) AS
BEGIN
  UPDATE DirtyProducts
  SET price = @new_price
  WHERE prod_id = @prod_id;
END;

EXEC UpdatePrice 2, 18000;
        

✅ Updates price for a specific product dynamically.

4️⃣ Procedure with Transaction

CREATE PROCEDURE SafeDeleteDuplicates AS
BEGIN
  BEGIN TRANSACTION;
  DELETE FROM DirtyProducts
  WHERE prod_id NOT IN (
    SELECT MIN(prod_id)
    FROM DirtyProducts
    GROUP BY prod_name, price, stock
  );
  COMMIT TRANSACTION;
END;
        

✅ Safely deletes duplicates using a transaction.

5️⃣ DROP PROCEDURE – Remove Cleaning Logic

DROP PROCEDURE CleanProducts;
        

✅ Deletes the procedure when no longer needed.

Stored procedures make cleaning repeatable and automated. Analysts can run multiple cleaning steps in one call, improving efficiency and consistency. This is a key skill in data analytics projects where automation reduces manual errors.

📝 Section 25: Temporary Tables for Cleaning Sessions in SQL

Temporary tables let you store intermediate cleaning results without modifying the main dataset. They are useful in stepwise cleaning, staging data, and testing fixes. Just like staging tables in Power BI or ML projects, SQL temp tables provide a safe workspace.

📊 Step 1: Create Raw Orders Table

CREATE TABLE RawOrders (
  order_id INT,
  cust_name VARCHAR(50),
  amount VARCHAR(20),
  status VARCHAR(20)
);

INSERT INTO RawOrders VALUES
(1, ' Amit ', '2500.50', 'Completed'),
(2, 'Priya', 'invalid', 'Pending'),
(3, 'Rohit', '3000', NULL),
(4, 'Sneha', '-500', 'Completed');
      

✅ The RawOrders table has spaces, invalid numbers, NULLs, and negatives.

1️⃣ CREATE TEMP TABLE – Scratchpad

CREATE TABLE #CleanOrders (
  order_id INT,
  cust_name VARCHAR(50),
  clean_amount DECIMAL(10,2),
  status VARCHAR(20)
);
        

✅ Creates a temporary table #CleanOrders for cleaned data (the # prefix gives SQL Server session scope; MySQL and PostgreSQL use CREATE TEMPORARY TABLE).

2️⃣ INSERT Cleaned Data

INSERT INTO #CleanOrders
SELECT order_id,
       TRIM(cust_name),
       TRY_CAST(amount AS DECIMAL(10,2)) AS clean_amount,
       COALESCE(status, 'Pending') AS clean_status
FROM RawOrders;
        

✅ Loads cleaned records into the temp table.

3️⃣ JOIN Temp with Original

SELECT r.order_id, r.cust_name, c.clean_amount, c.status
FROM RawOrders r
JOIN #CleanOrders c ON r.order_id = c.order_id;
        

✅ Compares raw vs clean records side by side.

4️⃣ UPDATE Temp Table

UPDATE #CleanOrders
SET clean_amount = ABS(clean_amount)
WHERE clean_amount < 0;
        

✅ Fixes negative amounts in the temp table.

5️⃣ DROP TEMP TABLE – End Session

DROP TABLE #CleanOrders;
        

✅ Removes the temp table after cleaning is complete.

Temporary tables provide a safe, session-based workspace for cleaning. Analysts can test fixes without harming original data. This is widely used in Power BI and ML pipelines where clean staging is crucial before loading into reports.

🔄 Section 26: Transactions & Rollback for Safe Cleaning in SQL

Transactions ensure cleaning steps happen safely. If an error occurs, you can ROLLBACK to undo changes. This prevents accidental data loss and ensures reliable analytics. Just like staging in predictive analytics, safe transactions protect your data before it’s modeled.

📊 Step 1: Create Payments Table

CREATE TABLE Payments (
  pay_id INT,
  cust_id INT,
  amount DECIMAL(10,2),
  status VARCHAR(20)
);

INSERT INTO Payments VALUES
(1, 101, 2500.00, 'Completed'),
(2, 102, -500.00, 'Completed'), -- invalid negative
(3, 103, NULL, 'Pending'),     -- NULL amount
(4, 104, 3000.00, 'Completed');
      

✅ The Payments table has negatives and NULL amounts.

1️⃣ BEGIN TRANSACTION – Start Cleaning

BEGIN TRANSACTION;
UPDATE Payments
SET amount = ABS(amount)
WHERE amount < 0;
        

✅ Starts a transaction and fixes negative values.

2️⃣ COMMIT – Save Changes

COMMIT;
        

✅ Confirms cleaning changes permanently.

3️⃣ ROLLBACK – Undo Mistakes

ROLLBACK;
        

✅ Cancels the transaction and restores original data.

4️⃣ SAVEPOINT – Partial Rollback

BEGIN TRANSACTION;
UPDATE Payments SET amount = 0 WHERE amount IS NULL;
SAVEPOINT FixNulls;
UPDATE Payments SET status = 'Checked' WHERE status = 'Pending';
ROLLBACK TO FixNulls;
COMMIT;
        

✅ Rolls back only the work done after the FixNulls savepoint. (SAVEPOINT / ROLLBACK TO is MySQL/PostgreSQL syntax; SQL Server uses SAVE TRANSACTION FixNulls and ROLLBACK TRANSACTION FixNulls.)

5️⃣ TRY…CATCH – Handle Errors Safely

BEGIN TRY
  BEGIN TRANSACTION;
  UPDATE Payments
  SET amount = amount / 0; -- error
  COMMIT;
END TRY
BEGIN CATCH
  ROLLBACK;
  PRINT 'Error occurred. Transaction rolled back.';
END CATCH;
        

✅ Ensures rollback on errors during cleaning.

Transactions make data cleaning safe and reversible. Analysts can fix negatives, NULLs, or duplicates with confidence, knowing mistakes can be undone. This is critical in predictive analytics where wrong cleaning steps can corrupt models.

📒 Master SQL & Data Analytics with Vista Academy

You’ve just explored 100+ SQL commands for data cleaning. Ready to take your skills further? Join Vista Academy, Dehradun for hands-on Data Analytics & SQL Training with real-world projects. Build a job-ready portfolio and become an in-demand analyst in 2025. 🚀

👉 Enroll Now – Start Your Data Analytics Journey
Vista Academy – 316/336, Park Rd, Laxman Chowk, Dehradun – 248001
📞 +91 94117 78145 | 📧 thevistaacademy@gmail.com | 💬 WhatsApp