Data Preprocessing
Jun Du
The University of Western Ontario
jdu43@uwo.ca
Outline
• Data
• Data Preprocessing: An Overview
• Data Cleaning
• Data Transformation and Data Discretization
• Data Reduction
• Summary
What is Data?
• Collection of data objects and their attributes
• Data objects → rows
• Attributes → columns

Tid   Refund   Marital Status   Taxable Income   Cheat
1     Yes      Single           125K             No
2     No       Married          100K             No
3     No       Single           70K              No
4     Yes      Married          120K             No
5     No       Divorced         95K              Yes
6     No       Married          60K              No
7     Yes      Divorced         220K             No
8     No       Single           85K              Yes
9     No       Married          75K              No
10    No       Single           90K              Yes
Data Objects
• A data object represents an entity.
• Examples:
– Sales database: customers, store items, sales
– Medical database: patients, treatments
– University database: students, professors, courses
• Also called examples, instances, records, cases,
samples, data points, objects, etc.
• Data objects are described by attributes.
Attributes
• An attribute is a data field, representing a
characteristic or feature of a data object.
• Example:
– Customer Data: customer_ID, name, gender, age, address,
phone number, etc.
– Product data: product_ID, price, quantity, manufacturer,
etc.
• Also called features, variables, fields, dimensions, etc.
Attribute Types (1)
• Nominal (Discrete) Attribute
– Has only a finite set of values (such as categories, states, etc.)
– E.g., Hair_color = {black, blond, brown, grey, red, white, …}
– E.g., marital status, zip codes
• Numeric (Continuous) Attribute
– Has real numbers as attribute values
– E.g., temperature, height, or weight.
• Question: what about student ID, SIN, year of birth?
Attribute Types (2)
• Binary
– A special case of nominal attribute: with only 2 states (0
and 1)
– Gender = {male, female};
– Medical test = {positive, negative}
• Ordinal
– Usually a special case of nominal attribute: values have a
meaningful order (ranking)
– Size = {small, medium, large}
– Army rankings
Outline
• Data
• Data Preprocessing: An Overview
• Data Cleaning
• Data Transformation and Data Discretization
• Data Reduction
• Summary
Data Preprocessing
• Why preprocess the data?
– Data quality is poor in the real world.
– No quality data, no quality mining results!
• Measures for data quality
– Accuracy: noise, outliers, …
– Completeness: missing values, …
– Redundancy: duplicated data, irrelevant data, …
– Consistency: some modified but some not, …
– ……
Typical Tasks in Data Preprocessing
• Data Cleaning
– Handle missing values, noisy / outlier data, resolve
inconsistencies, …
• Data Transformation
– Aggregation
– Type Conversion
– Normalization
• Data Reduction
– Data Sampling
– Dimensionality Reduction
• ……
Outline
• Data
• Data Preprocessing: An Overview
• Data Cleaning
• Data Transformation and Data Discretization
• Data Reduction
• Summary
Data Cleaning
• Missing value: lacking attribute values
– E.g., Occupation = “ ”
• Noise (Error): modification of original values
– E.g., Salary = “−10”
• Outlier: considerably different from most of the
other data (not necessarily error)
– E.g., Salary = “2,100,000”
• Inconsistency: discrepancies in codes or names
– E.g., Age=“42”, Birthday=“03/07/2010”
– Was rating “1, 2, 3”, now rating “A, B, C”
• ……
Missing Values
• Reasons for missing values
– Information is not collected
• E.g., people decline to give their age and weight
– Attributes may not be applicable to all cases
• E.g., annual income is not applicable to children
– Human / Hardware / Software problems
• E.g., Birthdate information is accidentally deleted for all
people born in 1988.
– ……
How to Handle Missing Value?
• Eliminate → ignore the missing values
– Ignore the examples (rows) with missing values
– Ignore the features (columns) with missing values
– Simple; not applicable when data is scarce
• Estimate missing values
– Global constant: e.g., “unknown”
– Attribute mean (median, mode)
– Predict the value based on other features (data imputation)
• Estimate gender based on first name (name → gender)
• Estimate age based on first name (name popularity)
• Build a predictive model based on other features
– Missing value estimation depends on the missing reason!
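A minimal pandas sketch of both strategies, assuming a hypothetical DataFrame with a numeric "age" and a nominal "occupation" column:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25, None, 47, 31, None],
    "occupation": ["clerk", None, "engineer", "clerk", "nurse"],
})

# Eliminate: drop examples (rows) or features (columns) with missing values
rows_dropped = df.dropna(axis=0)
cols_dropped = df.dropna(axis=1)

# Estimate: global constant for the nominal attribute,
# attribute mean for the numeric attribute
df["occupation"] = df["occupation"].fillna("unknown")
df["age"] = df["age"].fillna(df["age"].mean())
```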
Demonstration
• ReplaceMissingValues
– Weka → Vote
– Replacing missing values for nominal and numeric
attributes
• More functions in RapidMiner
Noisy (Outlier) Data
• Noise: refers to modification of original values
• Incorrect attribute values may be due to
– faulty data collection instruments
– data entry problems
– data transmission problems
– technology limitation
– inconsistency in naming convention
How to Handle Noisy (Outlier) Data?
• Binning
– first sort data and partition into (equal-frequency) bins
– then one can smooth by bin means, smooth by bin median,
smooth by bin boundaries, etc.
• Regression
– smooth by fitting the data into regression functions
• Clustering
– detect and remove outliers
• Combined computer and human inspection
– detect suspicious values and check by human
Binning
Sort data in ascending order: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
• Partition into equal-frequency (equal-depth) bins:
– Bin 1: 4, 8, 9, 15
– Bin 2: 21, 21, 24, 25
– Bin 3: 26, 28, 29, 34
• Smoothing by bin means:
– Bin 1: 9, 9, 9, 9
– Bin 2: 23, 23, 23, 23
– Bin 3: 29, 29, 29, 29
• Smoothing by bin boundaries:
– Bin 1: 4, 4, 4, 15
– Bin 2: 21, 21, 25, 25
– Bin 3: 26, 26, 26, 34
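A short NumPy sketch of equal-frequency binning with both smoothing schemes, reproducing the numbers above:

```python
import numpy as np

data = np.sort(np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]))
bins = np.array_split(data, 3)      # equal-frequency (depth-4) bins

# Smooth by bin means: every value becomes its bin's (rounded) mean
by_means = [np.full(len(b), round(b.mean())) for b in bins]

# Smooth by bin boundaries: each value snaps to the nearer bin boundary
by_bounds = [np.where(b - b.min() <= b.max() - b, b.min(), b.max())
             for b in bins]
```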
Regression
[Figure: data points (x, y) smoothed by fitting the regression line y = x + 1; the observed value Y1 at X1 is replaced by the fitted value Y1′.]
Cluster Analysis
[Figure: data points grouped into clusters; points falling outside every cluster are flagged as outliers.]
Outline
• Data
• Data Preprocessing: An Overview
• Data Cleaning
• Data Transformation and Data Discretization
• Data Reduction
• Summary
Data Transformation
• Aggregation:
– Attribute / example summarization
• Feature type conversion:
– Nominal → Numeric, …
• Normalization:
– Scaled to fall within a small, specified range
• Attribute/feature construction:
– New attributes constructed from the given ones
Aggregation
• Combining two or more attributes (examples) into a single
attribute (example)
• Combining two or more attribute values into a single attribute
value
• Purpose
– Change of scale
• Cities aggregated into regions, states, countries, etc
– More “stable” data
• Aggregated data tends to have less variability
– More “predictive” data
• Aggregated data might have higher predictability
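A pandas sketch of the change-of-scale case, with hypothetical city-level sales aggregated into regions:

```python
import pandas as pd

sales = pd.DataFrame({
    "city":   ["Windsor", "Toronto", "Ottawa", "London"],
    "region": ["Southwest", "Central", "East", "Southwest"],
    "amount": [90, 340, 210, 120],
})

# City-level examples collapse into one aggregated example per region
by_region = sales.groupby("region", as_index=False)["amount"].sum()
```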
Demonstration
• MergeTwoValues
– Weka → contact-lenses
– Merge class values “soft” and “hard”
• Effective aggregation in real-world application
Feature Type Conversion
• Some algorithms can only handle numeric features; some can
only handle nominal features. Only a few can handle both.
• Features have to be converted to satisfy the requirement of
learning algorithms.
– Numeric → Nominal (Discretization)
• E.g., Age Discretization: Young 18-29; Career 30-40; Mid-Life 41-55;
Empty-Nester 56-69; Senior 70+
– Nominal → Numeric
• Introduce multiple numeric features for one nominal feature
• Nominal → Binary (Numeric)
• E.g., size = {L, M, S} → size_L: 0/1; size_M: 0/1; size_S: 0/1
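Both directions in a short pandas sketch (the bin edges below are assumptions matching the age ranges above):

```python
import pandas as pd

# Nominal -> Binary: one 0/1 indicator column per value (one-hot encoding)
sizes = pd.DataFrame({"size": ["L", "M", "S", "M"]})
onehot = pd.get_dummies(sizes["size"], prefix="size").astype(int)
# -> columns size_L, size_M, size_S

# Numeric -> Nominal: discretize age into labeled ranges
ages = pd.Series([23, 35, 47, 62, 75])
labels = ["Young", "Career", "Mid-Life", "Empty-Nester", "Senior"]
binned = pd.cut(ages, bins=[17, 29, 40, 55, 69, 120], labels=labels)
```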
Demonstration
• Discretize
– Weka → diabetes
– Discretize “age” (equal bins vs equal frequency)
• NumericToNominal
– Weka → diabetes
– Discretize “age” (vs “Discretize” method)
• NominalToBinary
– UCI → autos
– Convert “num-of-doors”
– Convert “drive-wheels”
Normalization
Scale the attribute values to a small specified range
• Min-max normalization: to [new_min_A, new_max_A]

v′ = (v − min_A) / (max_A − min_A) × (new_max_A − new_min_A) + new_min_A

– E.g., let income range from $12,000 to $98,000 be normalized to [0.0, 1.0].
Then $73,600 is mapped to (73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0) + 0 = 0.716
• Z-score normalization (μ: mean, σ: standard deviation):

v′ = (v − μ_A) / σ_A

• ……
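Both normalizations as a NumPy sketch, checking the $73,600 example:

```python
import numpy as np

income = np.array([12_000, 47_000, 73_600, 98_000], dtype=float)

# Min-max normalization to [0.0, 1.0]
lo, hi = income.min(), income.max()
minmax = (income - lo) / (hi - lo)               # 73,600 -> 0.716

# Z-score normalization
zscore = (income - income.mean()) / income.std()
```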
Demonstration
• Normalize
– Weka → diabetes
– Normalize “age”
• Standardize
– Weka → diabetes
– Standardize “age” (vs “Normalize” method)
Outline
• Data
• Data Preprocessing: An Overview
• Data Cleaning
• Data Transformation and Data Discretization
• Data Reduction
• Summary
Sampling
• Big data era: too expensive (or even infeasible) to
process the entire data set
• Sampling: obtaining a small sample to represent the
entire data set (undersampling)
• Oversampling is also required in some scenarios,
such as the class imbalance problem
– E.g., 1,000 HIV test results: 5 positive, 995 negative
Sampling Principle
Key principle for effective sampling:
• Using a sample will work almost as well as using the
entire data set, if the sample is representative
• A sample is representative if it has approximately the
same property (of interest) as the original set of data
Types of Sampling (1)
• Random sampling without replacement
– As each example is selected, it is removed from the population
• Random sampling with replacement
– Examples are not removed from the population after being selected
• The same example can be picked up more than once
Types of Sampling (2)
• Stratified sampling
– Split the data into several partitions; then draw random
samples from each partition
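The three sampling types in a pandas sketch, reusing the imbalanced HIV example (hypothetical data):

```python
import pandas as pd

df = pd.DataFrame({"result": ["pos"] * 5 + ["neg"] * 995,
                   "id": range(1000)})

# Random sampling without replacement: each example picked at most once
without = df.sample(n=100, replace=False, random_state=0)

# Random sampling with replacement: the same example can be picked twice
with_repl = df.sample(n=100, replace=True, random_state=0)

# Stratified sampling: draw 20% from each class partition
stratified = df.groupby("result", group_keys=False).apply(
    lambda part: part.sample(frac=0.2, random_state=0))
```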
Demonstration
• Resample
– UCI → waveform-5000
– Undersampling (with or without replacement)
Dimensionality Reduction
• Purpose:
– Reduce amount of time and memory required by data
mining algorithms
– Allow data to be more easily visualized
– May help to eliminate irrelevant features or reduce noise
• Techniques
– Feature Selection
– Feature Extraction
Feature Selection
• Redundant features
– Duplicated information contained in different features
– E.g., “Age”, “Year of Birth”; “Purchase price”, “Sales tax”
• Irrelevant features
– Containing no information that is useful for the task
– E.g., students' ID is irrelevant to predicting GPA
• Goal:
– A minimum set of features containing all (most)
information
Heuristic Search in Feature Selection
• Given d features, there are 2^d possible feature
combinations
– Exhaustive search won’t work
– Heuristics have to be applied
• Typical heuristic feature selection methods:
– Feature ranking
– Forward feature selection
– Backward feature elimination
– Bidirectional search (selection + elimination)
– Search based on evolutionary algorithms
– ……
Feature Ranking
• Steps:
1) Rank all the individual features according to certain criteria
(e.g., information gain, gain ratio, χ2)
2) Select / keep top N features
• Properties:
– Usually independent of the learning algorithm to be used
– Efficient (no search process)
– Hard to determine the threshold
– Unable to consider correlation between features
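A scikit-learn sketch of steps 1)–2), ranking by mutual information (one possible criterion) and keeping the top N = 5 features on a stand-in dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_breast_cancer(return_X_y=True)

# 1) Score every feature individually, 2) keep the top N
selector = SelectKBest(score_func=mutual_info_classif, k=5)
X_top = selector.fit_transform(X, y)
scores = selector.scores_        # one relevance score per original feature
```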
Forward Feature Selection
• Steps:
1) First select the best single-feature (according to the learning
algorithm)
2) Repeat (until some stop criterion is met):
Select the next best feature, given the already picked features
• Properties:
– Usually learning algorithm dependent
– Feature correlation is considered
– More reliable
– Inefficient
Backward Feature Elimination
• Steps:
1) First build a model based on all the features
2) Repeat (until some criterion is met):
Eliminate the feature that makes the least contribution.
• Properties:
– Usually learning algorithm dependent
– Feature correlation is considered
– More reliable
– Inefficient
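Both wrapper-style searches in one scikit-learn sketch; SequentialFeatureSelector wraps any estimator, and the k-NN classifier and stand-in dataset here are arbitrary choices:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

# Forward feature selection: greedily add the feature that helps most
ffs = SequentialFeatureSelector(KNeighborsClassifier(),
                                n_features_to_select=5,
                                direction="forward").fit(X, y)

# Backward feature elimination: start from all features, drop the least useful
bfe = SequentialFeatureSelector(KNeighborsClassifier(),
                                n_features_to_select=5,
                                direction="backward").fit(X, y)
print(ffs.get_support())         # boolean mask over the original features
```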
Filter vs Wrapper Model
• Filter model
– Separating feature selection from learning
– Relying on general characteristics of data (information, etc.)
– No bias toward any learning algorithm, fast
– Feature ranking usually falls into here
• Wrapper model
– Relying on a predetermined learning algorithm
– Using predictive accuracy as goodness measure
– High accuracy, computationally expensive
– FFS, BFE usually fall into here
Demonstration
• Feature ranking
– Weka → weather
– ChiSquared, InfoGain, GainRatio
• FFS & BFE
– Weka → diabetes
– ClassifierSubsetEval + GreedyStepwise
Feature Extraction
• Map original high-dimensional data onto a lower-
dimensional space
– Generate a (smaller) set of new features
– Preserve all (most) information from the original data
• Techniques
– Principal Component Analysis (PCA)
– Canonical Correlation Analysis (CCA)
– Linear Discriminant Analysis (LDA)
– Independent Component Analysis (ICA)
– Manifold Learning
– ……
Principal Component Analysis (PCA)
• Find a projection that captures the largest amount of variation
in data
• The original data are projected onto a much smaller space,
resulting in dimensionality reduction.
[Figure: 2-D data points (x1, x2) projected onto the first principal direction e.]
Principal Component Analysis (Steps)
• Given data from n-dimensions (n features), find k ≤ n new
features (principal components) that can best represent data
– Normalize input data: each feature falls within the same range
– Compute k principal components (details omitted)
– Each input data is projected in the new k-dimensional space
– The new features (principal components) are sorted in order of
decreasing “significance” or strength
– Eliminate weak components / features to reduce dimensionality.
• Works for numeric data only
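The steps above as a scikit-learn sketch, with random stand-in data and MinMaxScaler playing the normalization step:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))           # hypothetical numeric data, n = 10

# Normalize input data: each feature falls within the same range
X_scaled = MinMaxScaler().fit_transform(X)

# Compute k principal components and project the data onto them
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X_scaled)

# Components come sorted by decreasing "significance" (explained variance)
print(pca.explained_variance_ratio_)
```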
PCA Demonstration
• UCI → breast-w
– Accuracy with all features
– PrincipalComponents (data transformation)
– Visualize/save transformed data (first two features, last
two features)
– Accuracy with all transformed features
– Accuracy with top 1 or 2 feature(s)
Outline
• Data
• Data Preprocessing: An Overview
• Data Cleaning
• Data Transformation and Data Discretization
• Data Reduction
• Summary
Summary
• Data (features and instances)
• Data Cleaning: missing values, noise / outliers
• Data Transformation: aggregation, type conversion,
normalization
• Data Reduction
– Sampling: random sampling with replacement, random
sampling without replacement, stratified sampling
– Dimensionality reduction:
• Feature Selection: Feature ranking, FFS, BFE
• Feature Extraction: PCA
Notes
• In real-world applications, data preprocessing usually
occupies about 70% of the workload in a data mining task.
• Domain knowledge is usually required to do good
data preprocessing.
• To improve the predictive performance of a model:
– Improve the learning algorithm (different algorithms,
different parameters)
• Most data mining research focuses here
– Improve data quality (data preprocessing)
• This deserves more attention!