🧹 10 Essential Data Cleansing Techniques Every Data Professional Should Know

As a Microsoft Fabric Data Engineer, I know that clean data is the foundation of reliable analytics. Bad data leads to poor decisions, wasted time, and flawed AI models.

Here are 10 must-know data cleansing techniques, with simple examples you can apply in Fabric Lakehouse, Spark, and Power BI. For a data engineer, clean data is the most critical deliverable: it powers accurate analytics, reliable AI models, and trustworthy business decisions.

Without proper data cleansing, even the most advanced pipelines and warehouses become "garbage in, garbage out" systems that erode organizational trust.

1. Standardization

Ensuring data follows a consistent format.

  • Problem: Inconsistent formats ("USA" vs. "U.S.A" vs. "United States").

  • Example: Converting dates to YYYY-MM-DD format across all records.

# Before standardization: various date formats
# "2023-05-15", "05/15/2023", "15-May-2023", "20230515"

Example (Power Query in Fabric):

// Standardize country names: map every variant to "US"
= Table.ReplaceValue(
    Source,
    each [Country],
    each if List.Contains({"USA", "U.S.A", "United States"}, [Country]) then "US" else [Country],
    Replacer.ReplaceValue,
    {"Country"}
)
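
For the mixed date formats noted in the comment above, here is a minimal PySpark sketch. The 'OrderDate' column and the format list are assumptions for illustration, and it relies on to_date returning null for non-matching patterns (i.e., ANSI mode off):

from pyspark.sql.functions import coalesce, to_date, col

# Try each known source format in turn; coalesce() keeps the first successful parse.
# 'OrderDate' and the format list are illustrative assumptions.
df_std = df.withColumn(
    "OrderDate",
    coalesce(
        to_date(col("OrderDate"), "yyyy-MM-dd"),
        to_date(col("OrderDate"), "MM/dd/yyyy"),
        to_date(col("OrderDate"), "dd-MMM-yyyy"),
        to_date(col("OrderDate"), "yyyyMMdd"),
    ),
)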

2. Normalization

Transforming data to a common scale without distorting differences in values. Often used in numerical data for machine learning models.

  • Problem: Features with different scales (e.g., Age 0-100 vs. Income 0-1,000,000).

  • Solution: Scale numerical data to a standard range (e.g., 0 to 1, -1 to 1, or z-scores).

## Why Normalization Matters

- Improves machine learning performance: algorithms like SVM, KNN, and neural networks work better with normalized data.

- Enables fair comparisons when features are on different scales (e.g., income vs. age).

- Reduces bias: prevents features with large ranges from dominating models.

- Speeds up training: optimization algorithms converge faster.

## Common Normalization Techniques

### 1. Min-Max Scaling

- Rescales values to a fixed range, typically 0 to 1, using (x - min) / (max - min).

- Use case: Image pixel values (0-255 → 0-1).

### 2. Z-Score Standardization

- Transforms data to have mean = 0 and standard deviation = 1.

- Use case: When data follows a Gaussian distribution.

### 3. Decimal Scaling

- Moves the decimal point to bring values below 1 (e.g., dividing by 10^4 turns 1500 into 0.15 when the maximum absolute value is 2000).

### 4. Log Transformation

- Compresses large ranges (e.g., for skewed data like income).

### 5. Unit Vector Scaling (L2 Normalization)

- Scales data to have a magnitude of 1 (useful for text/data with varying lengths).
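
The scikit-learn example below covers min-max and z-score scaling; for the log and L2 (unit vector) transforms above, here is a minimal sketch using numpy and scikit-learn (the sample values are made up for illustration):

import numpy as np
import pandas as pd
from sklearn.preprocessing import Normalizer

# Log transformation: log1p handles zeros safely and compresses a skewed range
income = pd.Series([40_000, 80_000, 55_000, 1_200_000], name="Income")
income_log = np.log1p(income)

# L2 (unit vector) normalization: each row is rescaled to magnitude 1
vectors = pd.DataFrame({"x": [3.0, 1.0], "y": [4.0, 0.0]})
vectors_l2 = Normalizer(norm="l2").fit_transform(vectors)

print(income_log)
print(vectors_l2)  # the first row [3, 4] becomes [0.6, 0.8]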

## Example in Python (Pandas & Scikit-Learn)

import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Sample data
data = {'Age': [25, 47, 32, 61], 'Income': [40000, 80000, 55000, 120000]}
df = pd.DataFrame(data)

# Min-Max Scaling (0 to 1)
scaler = MinMaxScaler()
df_minmax = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

# Z-Score Standardization (mean=0, std=1)
standardizer = StandardScaler()
df_zscore = pd.DataFrame(standardizer.fit_transform(df), columns=df.columns)

print("Min-Max Normalized:\n", df_minmax)
print("\nZ-Score Standardized:\n", df_zscore)

Example (Fabric Notebook - Scikit-Learn):

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
df[['Age', 'Income']] = scaler.fit_transform(df[['Age', 'Income']])

3. Deduplication (De-duplication)

Removing duplicate records from a dataset.

  • Problem: Duplicate records skew aggregations and ML models.

  • Example: Two customer entries with the same email or ID.

  • Example (Fabric SQL Endpoint):

-- Keep only the first occurrence per customer
WITH Deduped AS (
    SELECT *,
        ROW_NUMBER() OVER (PARTITION BY CustomerID ORDER BY OrderDate) AS RowNum
    FROM Sales
)
SELECT * FROM Deduped WHERE RowNum = 1;
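
The same de-duplication as a PySpark sketch, keeping the earliest order per customer ('sales_df' is assumed to hold the Sales table from the query above; for exact-duplicate rows, dropDuplicates() alone is enough):

from pyspark.sql import Window
from pyspark.sql.functions import row_number, col

# Keep only the first row per CustomerID, ordered by OrderDate
w = Window.partitionBy("CustomerID").orderBy("OrderDate")
deduped = (
    sales_df.withColumn("RowNum", row_number().over(w))
            .filter(col("RowNum") == 1)
            .drop("RowNum")
)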

4. Cross-Field Validation

Checking that related fields are logically consistent with each other.

  • Problem: Logical inconsistencies (e.g., "Age=10" but "EmploymentStatus=Employed").

Example (Fabric SQL Rule):

-- Flag invalid records
SELECT *,
    CASE WHEN Age < 18 AND IsEmployed = 1 THEN 'Invalid: Underage Employment'
         ELSE 'Valid'
    END AS ValidationStatus
FROM Users;
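
The same rule as a PySpark sketch ('users_df' and the column names mirror the SQL above):

from pyspark.sql.functions import when, col

# Flag logically inconsistent rows instead of silently dropping them
users_flagged = users_df.withColumn(
    "ValidationStatus",
    when((col("Age") < 18) & (col("IsEmployed") == 1), "Invalid: Underage Employment")
    .otherwise("Valid"),
)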

5. Validation / Text Cleaning (Noise Removal)

Checking that data meets certain rules or constraints.

  • Example: Ensuring emails have a proper format or ages are within logical bounds.

  • Problem: Extra spaces, special chars, or HTML tags in text.

Example (Fabric Notebook - Regex):

# Remove punctuation and special characters (keep word characters and whitespace)
df['CleanText'] = df['RawText'].str.replace(r'[^\w\s]', '', regex=True)
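
For the email-format check mentioned in the example bullet above, a minimal pandas sketch (the 'Email' column name is an assumption, and the regex is a deliberate simplification, not a full RFC 5322 validator):

# Flag rows whose email does not match a basic name@domain.tld pattern
email_pattern = r'^[\w.+-]+@[\w-]+\.[\w.-]+$'
df['EmailValid'] = df['Email'].str.match(email_pattern, na=False)
invalid_emails = df[~df['EmailValid']]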

6. Imputation / Filling Missing Values

Replacing missing data using a strategy (mean, median, default value, etc.).

Problem: Nulls, empty strings, or placeholders like "N/A" break calculations.

Solutions:

  • Drop rows/columns (if missing data is insignificant)

  • Fill with mean/median/mode (for numerical data)

  • Forward/backward fill (time-series data)

  • Example: Filling missing temperatures with the average of the day.

Example (PySpark in Fabric):

from pyspark.sql.functions import mean, col

# Fill missing 'Salary' values with the column mean
mean_salary = df.select(mean(col('Salary'))).collect()[0][0]
df_clean = df.na.fill({'Salary': mean_salary})
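
For the time-series case in the solutions list above (e.g., the missing-temperature example), forward/backward fill is a one-liner in pandas; 'df_pandas', 'Date', and 'Temperature' are illustrative assumptions:

# Sort by time, then carry the last known reading forward (and backfill any leading gap)
ts = df_pandas.sort_values("Date")
ts["Temperature"] = ts["Temperature"].ffill().bfill()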

7. Parsing & Splitting Columns

Breaking data into parts or converting it into a more usable structure.

  • Problem: Combined fields (e.g., "First Last" in one column).

  • Example: Splitting full address fields into street, city, and postal code.

Example (Fabric SQL Endpoint):

-- Split 'FullName' into 'FirstName' and 'LastName' at the first space
SELECT
    SUBSTRING(FullName, 1, CHARINDEX(' ', FullName) - 1) AS FirstName,
    SUBSTRING(FullName, CHARINDEX(' ', FullName) + 1, LEN(FullName)) AS LastName
FROM Customers;
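
The same split as a PySpark sketch (assumes names are simply "First Last"; 'customers_df' is an assumed DataFrame over the Customers table):

from pyspark.sql.functions import split, col

# Break 'FullName' on the first space only (limit=2)
parts = split(col("FullName"), " ", 2)
customers_split = (
    customers_df.withColumn("FirstName", parts.getItem(0))
                .withColumn("LastName", parts.getItem(1))
)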

8. Filtering / Outlier Removal

Identifying and removing data that falls outside of expected ranges.

  • Problem: Extreme values distort statistics.

Solutions:

  • Z-Score filtering (remove where |Z| > 3)

  • IQR method (remove points outside Q1-1.5IQR / Q3+1.5IQR)

Example: Removing negative ages or salaries over $10M in a retail dataset.

Example (Spark SQL):

-- Remove outliers using the IQR method
WITH Stats AS (
    SELECT
        PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY Revenue) AS Q1,
        PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY Revenue) AS Q3
    FROM Sales
)
SELECT s.*
FROM Sales AS s
CROSS JOIN Stats
WHERE s.Revenue BETWEEN Q1 - 1.5 * (Q3 - Q1) AND Q3 + 1.5 * (Q3 - Q1);
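
The Z-score variant from the solutions list, as a PySpark sketch ('sales_df' and 'Revenue' follow the SQL above):

from pyspark.sql.functions import mean, stddev, col

# Compute the column mean and standard deviation, then keep rows with |Z| <= 3
stats = sales_df.select(
    mean("Revenue").alias("mu"),
    stddev("Revenue").alias("sigma"),
).collect()[0]

sales_no_outliers = sales_df.filter(
    ((col("Revenue") - stats["mu"]) / stats["sigma"]).between(-3, 3)
)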

9. Data Transformation / Fixing Data Types

Converting data into a new structure or format.

  • Problem: Numbers stored as strings, dates in wrong formats.

  • Example: Changing “Yes/No” to Boolean True/False.

Example (Power Query - Type Conversion):

= Table.TransformColumnTypes(
    Source,
    {{"OrderID", Int64.Type}, {"OrderDate", type date}}
)
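
The "Yes/No" to Boolean case, plus explicit casts, as a PySpark sketch ('orders_df' and the column names are illustrative assumptions):

from pyspark.sql.functions import when, col

# Map Yes/No strings to Booleans and cast string columns to proper types
orders_typed = (
    orders_df.withColumn("IsActive", when(col("IsActive") == "Yes", True)
                                     .when(col("IsActive") == "No", False))
             .withColumn("OrderID", col("OrderID").cast("bigint"))
             .withColumn("OrderDate", col("OrderDate").cast("date"))
)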

10. Handling Inconsistent Categorical Data / Data Mapping

Aligning data from one format or schema to another.

  • Problem: Typos or multiple categories for the same thing ("M" / "Male" / "m").

  • Example: Mapping "M" and "Male" both to "Male" for consistency.

Example (Fabric Spark):

from pyspark.sql.functions import when

df = df.withColumn(
    "Gender",
    when(df.Gender.isin(["M", "Male", "m"]), "Male")
    .when(df.Gender.isin(["F", "Female", "f"]), "Female")
    .otherwise("Other")
)
