🧹 10 Essential Data Cleansing Techniques Every Data Professional Should Know
As a Microsoft Fabric Data Engineer, I know that clean data is the foundation of reliable analytics. Bad data leads to poor decisions, wasted time, and flawed AI models.
Here are 10 must-know data cleansing techniques, with simple examples you can apply in Fabric Lakehouse, Spark, and Power BI. For a Data Engineer, clean data is the most critical deliverable: it powers accurate analytics, reliable AI models, and trustworthy business decisions.
Without proper data cleansing, even the most advanced pipelines and warehouses become "garbage in, garbage out" systems that erode organizational trust.
1. Standardization
Ensuring data follows a consistent format.
Problem: Inconsistent formats ("USA" vs. "U.S.A" vs. "United States").
Example: Converting dates to YYYY-MM-DD format across all records (e.g., "2023-05-15", "05/15/2023", "15-May-2023", and "20230515" all become "2023-05-15").
Example (Power Query in Fabric):
// Standardize country names to a single value
= Table.ReplaceValue(
    Source, each [Country],
    each if List.Contains({"USA", "U.S.A", "United States"}, [Country]) then "US" else [Country],
    Replacer.ReplaceValue, {"Country"})
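For the date example, a minimal pandas sketch (assumes pandas 2.0+ for format='mixed'; the 'RawDate' column name is hypothetical):
import pandas as pd
df = pd.DataFrame({'RawDate': ['2023-05-15', '05/15/2023', '15-May-2023', '20230515']})
# Parse each value individually, then render everything as ISO YYYY-MM-DD;
# unparseable values become NaT instead of raising an error
df['OrderDate'] = (pd.to_datetime(df['RawDate'], format='mixed', errors='coerce')
                     .dt.strftime('%Y-%m-%d'))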
2. Normalization
Transforming data to a common scale without distorting differences in values. Often applied to numerical features for machine learning models.
Problem: Features with different scales (e.g., Age 0-100 vs. Income 0-1,000,000).
Solution: Scaling numerical data to a standard range (e.g., 0 to 1, -1 to 1, or z-scores).
## Why Normalization Matters
✅ Improves Machine Learning Performance (algorithms like SVM, KNN, and neural networks work better with normalized data).
✅ Enables Fair Comparisons (when features are on different scales, e.g., income vs. age).
✅ Reduces Bias (prevents features with large ranges from dominating models).
✅ Speeds Up Training (optimization algorithms converge faster).
## Common Normalization Techniques
### 1. Min-Max Scaling
- Rescales values to a fixed range, typically 0 to 1: x' = (x - min) / (max - min).
- Use case: Image pixel values (0-255 → 0-1).
### 2. Z-Score Standardization
- Transforms data to have mean = 0 and standard deviation = 1.
- Use case: When data follows a Gaussian distribution.
### 3. Decimal Scaling
- Moves the decimal point to scale values (e.g., 1500 → 1.5 if max is 2000).
### 4. Log Transformation
- Compresses large ranges (e.g., for skewed data like income).
### 5. Unit Vector Scaling (L2 Normalization)
- Scales data to have a magnitude of 1 (useful for text/data with varying lengths).
## Example in Python (Pandas & Scikit-Learn)
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler
# Sample data
data = {'Age': [25, 47, 32, 61], 'Income': [40000, 80000, 55000, 120000]}
df = pd.DataFrame(data)
# Min-Max Scaling (0 to 1)
scaler = MinMaxScaler()
df_minmax = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
# Z-Score Standardization (mean=0, std=1)
standardizer = StandardScaler()
df_zscore = pd.DataFrame(standardizer.fit_transform(df), columns=df.columns)
print("Min-Max Normalized:\n", df_minmax)
print("\nZ-Score Standardized:\n", df_zscore)
Example (Fabric Notebook - Scikit-Learn, in-place):
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
# Overwrite the original columns with their 0-1 scaled values
df[['Age', 'Income']] = scaler.fit_transform(df[['Age', 'Income']])
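The other techniques listed above follow the same pattern; here is a minimal sketch of a log transformation and L2 (unit vector) normalization, assuming the same Age/Income DataFrame (the new column names are illustrative):
import numpy as np
from sklearn.preprocessing import normalize
# Log transformation: compress the skewed Income range (log1p handles zeros safely)
df['LogIncome'] = np.log1p(df['Income'])
# L2 (unit vector) normalization: scale each row of [Age, Income] to magnitude 1
df[['Age_L2', 'Income_L2']] = normalize(df[['Age', 'Income']], norm='l2')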
3. Deduplication
Removing duplicate records from a dataset.
Problem: Duplicate records skew aggregations and ML models.
Example: Two customer entries with the same email or ID.
Example (Fabric SQL Endpoint):
-- Keep only the first occurrence
WITH Deduped AS (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY CustomerID ORDER BY OrderDate) AS RowNum
FROM Sales
)
SELECT * FROM Deduped WHERE RowNum = 1;
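The same pattern in a Fabric notebook, as a minimal PySpark sketch (assumes the Sales table is already loaded as a DataFrame df; mirrors the SQL ROW_NUMBER approach):
from pyspark.sql import Window
from pyspark.sql.functions import row_number, col
# Number rows per customer by OrderDate and keep only the earliest one
w = Window.partitionBy("CustomerID").orderBy(col("OrderDate"))
deduped = (df.withColumn("RowNum", row_number().over(w))
             .filter(col("RowNum") == 1)
             .drop("RowNum"))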
4. Cross-Field Validation
Checking that related fields within a record are logically consistent with each other.
Problem: Logical inconsistencies (e.g., "Age=10" but "EmploymentStatus=Employed").
Example (Fabric SQL Rule):
-- Flag invalid records
SELECT *, CASE WHEN Age < 18 AND IsEmployed = 1 THEN 'Invalid: Underage Employment'
ELSE 'Valid' END AS ValidationStatus
FROM Users;
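The same idea works for other field pairs; a minimal PySpark sketch below flags orders where a hypothetical ShipDate comes before OrderDate (column names are assumptions for illustration):
from pyspark.sql.functions import when, col
# A shipment can never precede its order; flag rows that violate this rule
df = df.withColumn(
    "ValidationStatus",
    when(col("ShipDate") < col("OrderDate"), "Invalid: Shipped before ordered")
    .otherwise("Valid")
)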
5. Validation / Text Cleaning (Noise Removal)
Checking that data meets defined rules or constraints, and stripping noise from free text.
Example: Ensuring emails have a proper format or ages fall within logical bounds.
Problem: Extra spaces, special characters, or HTML tags in text fields.
Example (Fabric Notebook - Regex):
# Remove punctuation and trim leading/trailing whitespace
df['CleanText'] = df['RawText'].str.replace(r'[^\w\s]', '', regex=True).str.strip()
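For the validation side, a minimal sketch of an email format check in pandas (assumes an 'Email' column; the regex is a deliberately simple pattern, not a full RFC-compliant validator):
# Flag rows whose Email value does not look like name@domain.tld
valid = df['Email'].str.match(r'^[\w.+-]+@[\w-]+\.[\w.-]+$', na=False)
df['EmailStatus'] = valid.map({True: 'Valid', False: 'Invalid'})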
6. Imputation / Filling Missing Values
Replacing missing data using a strategy (mean, median, default value, etc.).
Problem: Nulls, empty strings, or placeholders like "N/A" break calculations.
Solutions:
- Drop rows/columns (if missing data is insignificant)
- Fill with mean/median/mode (for numerical data)
- Forward/backward fill (time-series data; see the pandas sketch below)
Example: Filling missing temperatures with the average of the day.
Example (PySpark in Fabric):
from pyspark.sql.functions import mean, col
# Fill missing 'Salary' with the mean
mean_salary = df.select(mean(col('Salary'))).collect()[0][0]
df_clean = df.na.fill({'Salary': mean_salary})
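For the time-series case, a minimal pandas sketch of forward/backward fill (the 'ReadingTime' and 'Temperature' column names are assumptions):
# Sort by timestamp, then carry the last known reading forward;
# a backward fill covers any gap at the very start of the series
df = df.sort_values('ReadingTime')
df['Temperature'] = df['Temperature'].ffill().bfill()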
7. Parsing & Splitting Columns
Breaking data into parts or converting it into a more usable structure.
Problem: Combined fields (e.g., "First Last" in one column).
Example: Splitting full address fields into street, city, and postal code.
Example (Fabric SQL Endpoint):
-- Split 'FullName' into 'FirstName' and 'LastName' (assumes a single space separator)
SELECT
SUBSTRING(FullName, 1, CHARINDEX(' ', FullName) - 1) AS FirstName,
SUBSTRING(FullName, CHARINDEX(' ', FullName) + 1, LEN(FullName)) AS LastName
FROM Customers;
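The same split in a Fabric notebook, as a PySpark sketch (assumes 'FullName' has exactly one space between first and last name):
from pyspark.sql.functions import split, col
# Split on the first space only, so multi-part last names stay together
parts = split(col("FullName"), " ", 2)
df = (df.withColumn("FirstName", parts.getItem(0))
        .withColumn("LastName", parts.getItem(1)))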
8. Filtering / Outlier Removal
Identifying and removing data that falls outside of expected ranges.
Problem: Extreme values distort statistics.
Solutions:
- Z-Score filtering (remove rows where |Z| > 3; see the pandas sketch below)
- IQR method (remove points outside Q1 - 1.5*IQR to Q3 + 1.5*IQR)
Example: Removing negative ages or salaries over $10M in a retail dataset.
Example (Spark SQL):
-- Remove outliers using the IQR rule
WITH Stats AS (
    SELECT
        PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY Revenue) AS Q1,
        PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY Revenue) AS Q3
    FROM Sales
)
SELECT s.*
FROM Sales s CROSS JOIN Stats
WHERE s.Revenue BETWEEN Stats.Q1 - 1.5 * (Stats.Q3 - Stats.Q1)
                    AND Stats.Q3 + 1.5 * (Stats.Q3 - Stats.Q1);
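The Z-score variant takes only a couple of lines in pandas; a minimal sketch, assuming a numeric 'Revenue' column:
# Keep only rows within 3 standard deviations of the mean
z = (df['Revenue'] - df['Revenue'].mean()) / df['Revenue'].std()
df_filtered = df[z.abs() <= 3]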
9. Data Transformation / Fixing Data Types
Converting data into a new structure or format.
Problem: Numbers stored as strings, dates in wrong formats.
Example: Changing “Yes/No” to Boolean True/False.
Example (Power Query - Type Conversion):
= Table.TransformColumnTypes(
Source, {{"OrderID", Int64.Type}, {"OrderDate", type date}} )
10. Handling Inconsistent Categorical Data / Data Mapping
Aligning data from one format or schema to another.
Problem: Typos or multiple categories for the same thing ("M" / "Male" / "m").
Example: Mapping "M" and "Male" both to "Male" for consistency.
Example (Fabric Spark):
from pyspark.sql.functions import when, upper
# Normalize case first so "m", "M", and "Male" all map the same way
df = df.withColumn(
    "Gender",
    when(upper(df.Gender).isin("M", "MALE"), "Male")
    .when(upper(df.Gender).isin("F", "FEMALE"), "Female")
    .otherwise("Other")
)