🧹 10 Essential Data Cleansing Techniques Every Data Professional Should Know

As a Microsoft Fabric Data Engineer, I know that clean data is the foundation of reliable analytics. Bad data leads to poor decisions, wasted time, and flawed AI models.

Here are 10 must-know data cleansing techniques, with simple examples you can apply in Fabric Lakehouse, Spark, and Power BI. For a data engineer, clean data is the most critical deliverable: it powers accurate analytics, reliable AI models, and trustworthy business decisions.

Without proper data cleansing, even the most advanced pipelines and warehouses become "garbage in, garbage out" systems that erode organizational trust.

1. Standardization

Ensuring data follows a consistent format.

  • Problem: Inconsistent formats ("USA" vs. "U.S.A" vs. "United States").

  • Example: Converting dates to YYYY-MM-DD format across all records.

# Before standardization: various date formats
# "2023-05-15", "05/15/2023", "15-May-2023", "20230515"

Example (Power Query in Fabric):

// Standardize country names: map every variant to "US"
= Table.ReplaceValue(
    Source,
    each [Country],
    each if List.Contains({"USA", "U.S.A", "United States"}, [Country]) then "US" else [Country],
    Replacer.ReplaceValue,
    {"Country"}
)
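
For the mixed date formats noted in the comment above, here is a minimal PySpark sketch. The 'OrderDate' column and the format list are assumptions for illustration, and it relies on to_date returning null for non-matching patterns (i.e., ANSI mode off):

from pyspark.sql.functions import coalesce, to_date, col

# Try each known source format in turn; coalesce() keeps the first successful parse.
# 'OrderDate' and the format list are illustrative assumptions.
df_std = df.withColumn(
    "OrderDate",
    coalesce(
        to_date(col("OrderDate"), "yyyy-MM-dd"),
        to_date(col("OrderDate"), "MM/dd/yyyy"),
        to_date(col("OrderDate"), "dd-MMM-yyyy"),
        to_date(col("OrderDate"), "yyyyMMdd"),
    ),
)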

2. Normalization

Transforming data to a common scale without distorting differences in values. Often used in numerical data for machine learning models.

  • Problem: Features with different scales (e.g., Age 0-100 vs. Income 0-1,000,000).

  • Solution: Scale numerical data to a standard range (e.g., 0 to 1, -1 to 1, or z-scores).

## Why Normalization Matters

- Improves machine learning performance: algorithms like SVM, KNN, and neural networks work better with normalized data.

- Enables fair comparisons when features are on different scales (e.g., income vs. age).

- Reduces bias: prevents features with large ranges from dominating models.

- Speeds up training: optimization algorithms converge faster.

## Common Normalization Techniques

### 1. Min-Max Scaling

- Rescales values to a fixed range, typically 0 to 1, using (x - min) / (max - min).

- Use case: Image pixel values (0-255 → 0-1).

### 2. Z-Score Standardization

- Transforms data to have mean = 0 and standard deviation = 1.

- Use case: When data follows a Gaussian distribution.

### 3. Decimal Scaling

- Moves the decimal point to bring values below 1 (e.g., dividing by 10^4 turns 1500 into 0.15 when the maximum absolute value is 2000).

### 4. Log Transformation

- Compresses large ranges (e.g., for skewed data like income).

### 5. Unit Vector Scaling (L2 Normalization)

- Scales data to have a magnitude of 1 (useful for text/data with varying lengths).
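
The scikit-learn example below covers min-max and z-score scaling; for the log and L2 (unit vector) transforms above, here is a minimal sketch using numpy and scikit-learn (the sample values are made up for illustration):

import numpy as np
import pandas as pd
from sklearn.preprocessing import Normalizer

# Log transformation: log1p handles zeros safely and compresses a skewed range
income = pd.Series([40_000, 80_000, 55_000, 1_200_000], name="Income")
income_log = np.log1p(income)

# L2 (unit vector) normalization: each row is rescaled to magnitude 1
vectors = pd.DataFrame({"x": [3.0, 1.0], "y": [4.0, 0.0]})
vectors_l2 = Normalizer(norm="l2").fit_transform(vectors)

print(income_log)
print(vectors_l2)  # the first row [3, 4] becomes [0.6, 0.8]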

## Example in Python (Pandas & Scikit-Learn)

import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Sample data
data = {'Age': [25, 47, 32, 61], 'Income': [40000, 80000, 55000, 120000]}
df = pd.DataFrame(data)

# Min-Max Scaling (0 to 1)
scaler = MinMaxScaler()
df_minmax = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

# Z-Score Standardization (mean=0, std=1)
standardizer = StandardScaler()
df_zscore = pd.DataFrame(standardizer.fit_transform(df), columns=df.columns)

print("Min-Max Normalized:\n", df_minmax)
print("\nZ-Score Standardized:\n", df_zscore)

Example (Fabric Notebook - Scikit-Learn):

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
df[['Age', 'Income']] = scaler.fit_transform(df[['Age', 'Income']])

3. Deduplication (De-duplication)

Removing duplicate records from a dataset.

  • Problem: Duplicate records skew aggregations and ML models.

  • Example: Two customer entries with the same email or ID.

  • Example (Fabric SQL Endpoint):

-- Keep only the first occurrence per customer
WITH Deduped AS (
    SELECT *,
        ROW_NUMBER() OVER (PARTITION BY CustomerID ORDER BY OrderDate) AS RowNum
    FROM Sales
)
SELECT * FROM Deduped WHERE RowNum = 1;
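
The same de-duplication as a PySpark sketch, keeping the earliest order per customer ('sales_df' is assumed to hold the Sales table from the query above; for exact-duplicate rows, dropDuplicates() alone is enough):

from pyspark.sql import Window
from pyspark.sql.functions import row_number, col

# Keep only the first row per CustomerID, ordered by OrderDate
w = Window.partitionBy("CustomerID").orderBy("OrderDate")
deduped = (
    sales_df.withColumn("RowNum", row_number().over(w))
            .filter(col("RowNum") == 1)
            .drop("RowNum")
)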

4. Cross-Field Validation

Checking that related fields are logically consistent with each other.

  • Problem: Logical inconsistencies (e.g., "Age=10" but "EmploymentStatus=Employed").

Example (Fabric SQL Rule):

-- Flag invalid records
SELECT *,
    CASE WHEN Age < 18 AND IsEmployed = 1 THEN 'Invalid: Underage Employment'
         ELSE 'Valid'
    END AS ValidationStatus
FROM Users;
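
The same rule as a PySpark sketch ('users_df' and the column names mirror the SQL above):

from pyspark.sql.functions import when, col

# Flag logically inconsistent rows instead of silently dropping them
users_flagged = users_df.withColumn(
    "ValidationStatus",
    when((col("Age") < 18) & (col("IsEmployed") == 1), "Invalid: Underage Employment")
    .otherwise("Valid"),
)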

5. Validation / Text Cleaning (Noise Removal)

Checking that data meets certain rules or constraints.

  • Example: Ensuring emails have a proper format or ages are within logical bounds.

  • Problem: Extra spaces, special chars, or HTML tags in text.

Example (Fabric Notebook - Regex):

# Remove punctuation and special characters (keep word characters and whitespace)
df['CleanText'] = df['RawText'].str.replace(r'[^\w\s]', '', regex=True)
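
For the email-format check mentioned in the example bullet above, a minimal pandas sketch (the 'Email' column name is an assumption, and the regex is a deliberate simplification, not a full RFC 5322 validator):

# Flag rows whose email does not match a basic name@domain.tld pattern
email_pattern = r'^[\w.+-]+@[\w-]+\.[\w.-]+$'
df['EmailValid'] = df['Email'].str.match(email_pattern, na=False)
invalid_emails = df[~df['EmailValid']]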

6. Imputation / Filling Missing Values

Replacing missing data using a strategy (mean, median, default value, etc.).

Problem: Nulls, empty strings, or placeholders like "N/A" break calculations.

Solutions:

  • Drop rows/columns (if missing data is insignificant)

  • Fill with mean/median/mode (for numerical data)

  • Forward/backward fill (time-series data)

  • Example: Filling missing temperatures with the average of the day.

Example (PySpark in Fabric):

from pyspark.sql.functions import mean, col

# Fill missing 'Salary' values with the column mean
mean_salary = df.select(mean(col('Salary'))).collect()[0][0]
df_clean = df.na.fill({'Salary': mean_salary})
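
For the time-series case in the solutions list above (e.g., the missing-temperature example), forward/backward fill is a one-liner in pandas; 'df_pandas', 'Date', and 'Temperature' are illustrative assumptions:

# Sort by time, then carry the last known reading forward (and backfill any leading gap)
ts = df_pandas.sort_values("Date")
ts["Temperature"] = ts["Temperature"].ffill().bfill()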

7. Parsing & Splitting Columns

Breaking data into parts or converting it into a more usable structure.

  • Problem: Combined fields (e.g., "First Last" in one column).

  • Example: Splitting full address fields into street, city, and postal code.

Example (Fabric SQL Endpoint):

-- Split 'FullName' into 'FirstName' and 'LastName' at the first space
SELECT
    SUBSTRING(FullName, 1, CHARINDEX(' ', FullName) - 1) AS FirstName,
    SUBSTRING(FullName, CHARINDEX(' ', FullName) + 1, LEN(FullName)) AS LastName
FROM Customers;
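
The same split as a PySpark sketch (assumes names are simply "First Last"; 'customers_df' is an assumed DataFrame over the Customers table):

from pyspark.sql.functions import split, col

# Break 'FullName' on the first space only (limit=2)
parts = split(col("FullName"), " ", 2)
customers_split = (
    customers_df.withColumn("FirstName", parts.getItem(0))
                .withColumn("LastName", parts.getItem(1))
)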

8. Filtering / Outlier Removal

Identifying and removing data that falls outside of expected ranges.

  • Problem: Extreme values distort statistics.

Solutions:

  • Z-Score filtering (remove where |Z| > 3)

  • IQR method (remove points outside Q1-1.5IQR / Q3+1.5IQR)

Example: Removing negative ages or salaries over $10M in a retail dataset.

Example (Spark SQL):

-- Remove outliers using the IQR method
WITH Stats AS (
    SELECT
        PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY Revenue) AS Q1,
        PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY Revenue) AS Q3
    FROM Sales
)
SELECT s.*
FROM Sales AS s
CROSS JOIN Stats
WHERE s.Revenue BETWEEN Q1 - 1.5 * (Q3 - Q1) AND Q3 + 1.5 * (Q3 - Q1);
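
The Z-score variant from the solutions list, as a PySpark sketch ('sales_df' and 'Revenue' follow the SQL above):

from pyspark.sql.functions import mean, stddev, col

# Compute the column mean and standard deviation, then keep rows with |Z| <= 3
stats = sales_df.select(
    mean("Revenue").alias("mu"),
    stddev("Revenue").alias("sigma"),
).collect()[0]

sales_no_outliers = sales_df.filter(
    ((col("Revenue") - stats["mu"]) / stats["sigma"]).between(-3, 3)
)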

9. Data Transformation / Fixing Data Types

Converting data into a new structure or format.

  • Problem: Numbers stored as strings, dates in wrong formats.

  • Example: Changing “Yes/No” to Boolean True/False.

Example (Power Query - Type Conversion):

= Table.TransformColumnTypes(
    Source,
    {{"OrderID", Int64.Type}, {"OrderDate", type date}}
)
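
The "Yes/No" to Boolean case, plus explicit casts, as a PySpark sketch ('orders_df' and the column names are illustrative assumptions):

from pyspark.sql.functions import when, col

# Map Yes/No strings to Booleans and cast string columns to proper types
orders_typed = (
    orders_df.withColumn("IsActive", when(col("IsActive") == "Yes", True)
                                     .when(col("IsActive") == "No", False))
             .withColumn("OrderID", col("OrderID").cast("bigint"))
             .withColumn("OrderDate", col("OrderDate").cast("date"))
)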

10. Handling Inconsistent Categorical Data / Data Mapping

Aligning data from one format or schema to another.

  • Problem: Typos or multiple categories for the same thing ("M" / "Male" / "m").

  • Example: Mapping "M" and "Male" both to "Male" for consistency.

Example (Fabric Spark):

from pyspark.sql.functions import when

df = df.withColumn(
    "Gender",
    when(df.Gender.isin(["M", "Male", "m"]), "Male")
    .when(df.Gender.isin(["F", "Female", "f"]), "Female")
    .otherwise("Other")
)
