Data Mining: Concepts and Techniques (3rd ed.)
— Chapter 3 —
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign &
Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.
Chapter 3: Data Preprocessing
• Data Preprocessing: An Overview
• Data Quality
• Major Tasks in Data Preprocessing
• Data Cleaning
• Data Integration
• Data Reduction
• Data Transformation and Data Discretization
• Summary
Data Quality: Why Preprocess the Data?
• Measures of data quality: a multidimensional view
  • Accuracy: correct or wrong, accurate or not
  • Completeness: not recorded, unavailable, …
  • Consistency: some records modified but others not, dangling references, …
  • Timeliness: is the data updated in a timely way?
  • Believability: how far can the data be trusted to be correct?
  • Interpretability: how easily can the data be understood?
Major Tasks in Data Preprocessing
• Data cleaning
  • Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data integration
  • Integration of multiple databases, data cubes, or files
• Data reduction
  • Dimensionality reduction
  • Numerosity reduction
  • Data compression
• Data transformation and data discretization
  • Normalization
  • Concept hierarchy generation
Chapter 3: Data Preprocessing
• Data Preprocessing: An Overview
• Data Quality
• Major Tasks in Data Preprocessing
• Data Cleaning
• Data Integration
• Data Reduction
• Data Transformation and Data Discretization
• Summary
Data Cleaning
• Data in the real world is dirty: there is a lot of potentially incorrect data, e.g., from faulty instruments, human or computer error, or transmission error
  • incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
    • e.g., Occupation = “ ” (missing data)
  • noisy: containing noise, errors, or outliers
    • e.g., Salary = “−10” (an error)
  • inconsistent: containing discrepancies in codes or names, e.g.,
    • Age = “42”, Birthday = “03/07/2010”
    • was rating “1, 2, 3”, now rating “A, B, C”
    • discrepancy between duplicate records
  • intentional (e.g., disguised missing data)
    • Jan. 1 as everyone’s birthday?
Incomplete (Missing) Data
• Data is not always available
  • e.g., many tuples have no recorded value for several attributes, such as customer income in sales data
• Missing data may be due to
  • equipment malfunction
  • values inconsistent with other recorded data, and thus deleted
  • data not entered due to misunderstanding
  • certain data not considered important at the time of entry
  • history or changes of the data not registered
• Missing data may need to be inferred
How to Handle Missing Data?
• Ignore the tuple: usually done when the class label is missing (when doing classification); not effective when the % of missing values per attribute varies considerably
• Fill in the missing value manually: tedious + infeasible?
• Fill it in automatically with (see the sketch below)
  • a global constant, e.g., “unknown” (a new class?!)
  • the attribute mean
  • the attribute mean for all samples belonging to the same class: smarter
  • the most probable value: inference-based, such as a Bayesian formula or a decision tree
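The automatic fill-in strategies above take only a few lines with pandas. A minimal sketch, assuming pandas is installed; the column names and income values are made up for illustration:

```python
import pandas as pd

# Toy table with missing customer incomes (hypothetical values).
df = pd.DataFrame({
    "cls":    ["A", "A", "B", "B", "B"],
    "income": [50.0, None, 70.0, None, 90.0],
})

# Ignore the tuple: drop rows whose income is missing.
dropped = df.dropna(subset=["income"])

# Global constant: flag missing values with a sentinel.
const_filled = df["income"].fillna(-1.0)

# Attribute mean over all samples.
mean_filled = df["income"].fillna(df["income"].mean())

# Smarter: attribute mean within each class.
class_filled = df.groupby("cls")["income"].transform(
    lambda s: s.fillna(s.mean()))
print(class_filled.tolist())   # [50.0, 50.0, 70.0, 80.0, 90.0]
```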
Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
  • faulty data collection instruments
  • data entry problems
  • data transmission problems
  • technology limitations
  • inconsistent naming conventions
• Other data problems that require data cleaning
  • duplicate records
  • incomplete data
  • inconsistent data
How to Handle Noisy Data?
• Binning
  • first sort the data and partition it into (equal-frequency) bins
  • then smooth by bin means, bin medians, bin boundaries, etc. (see the sketch below)
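As an illustration, here is a minimal sketch of equal-frequency binning with smoothing by bin means and by bin boundaries; the small sorted price list is hypothetical input:

```python
# A minimal sketch of equal-frequency binning on a small sorted
# price list (hypothetical data).
data = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
n_bins = 3
size = len(data) // n_bins
bins = [data[i * size:(i + 1) * size] for i in range(n_bins)]
# bins: [[4, 8, 15], [21, 21, 24], [25, 28, 34]]

# Smoothing by bin means: every value becomes its bin's mean.
by_means = [[sum(b) / len(b)] * len(b) for b in bins]
# [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]

# Smoothing by bin boundaries: snap each value to the nearer boundary.
by_bounds = [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b]
             for b in bins]
# [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```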
How to Handle Noisy Data?
• Regression
  • Data smoothing can also be done by regression, a technique that conforms data values to a function. Linear regression finds the “best” line to fit two attributes (or variables) so that one attribute can be used to predict the other.
• Clustering
  • detect and remove outliers (a clustering-based sketch follows below)
• Combined computer and human inspection
  • detect suspicious values automatically and check them by human (e.g., deal with possible outliers)
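Clustering-based outlier removal can be sketched with DBSCAN, which labels points belonging to no dense cluster as noise. This is one possible realization, not the only one; it assumes scikit-learn is installed and uses synthetic data:

```python
import numpy as np
from sklearn.cluster import DBSCAN  # assumes scikit-learn is installed

rng = np.random.default_rng(0)
# Two tight synthetic clusters plus two far-away injected outliers.
X = np.vstack([
    rng.normal(0.0, 0.3, size=(50, 2)),
    rng.normal(5.0, 0.3, size=(50, 2)),
    [[10.0, 10.0], [-8.0, 9.0]],
])

# DBSCAN labels points in no dense region as -1 (noise).
labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)
cleaned = X[labels != -1]      # drop the detected outliers
print((labels == -1).sum())    # expected: 2 (the injected points)
```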
Data Cleaning as a Process
• Data discrepancy detection
  • Use metadata (e.g., domain, range, dependency, distribution)
  • Check for field overloading
  • Check the uniqueness rule, consecutive rule, and null rule (see the sketch below)
  • Use commercial tools
    • Data scrubbing: use simple domain knowledge (e.g., postal codes, spell-check) to detect errors and make corrections
    • Data auditing: analyze the data to discover rules and relationships, and to detect violators (e.g., correlation and clustering to find outliers)
• Data migration and integration
  • Data migration tools: allow transformations to be specified
  • ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations through a graphical user interface
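Simple discrepancy checks such as the uniqueness and null rules can be expressed directly as dataframe filters. A hedged sketch, assuming pandas; the column names and the five-digit postal-code rule are illustrative assumptions:

```python
import pandas as pd

# Hypothetical customer records to audit.
df = pd.DataFrame({
    "cust_id": [101, 102, 102, 104],
    "zip":     ["61801", "6180x", "98765", None],
})

# Uniqueness rule: cust_id must not repeat.
dup_ids = df[df["cust_id"].duplicated(keep=False)]

# Null rule: zip must be present.
missing_zip = df[df["zip"].isna()]

# Domain rule (simple data scrubbing): zip must be five digits.
bad_zip = df[~df["zip"].str.fullmatch(r"\d{5}", na=False)]

print(len(dup_ids), len(missing_zip), len(bad_zip))  # 2 1 2
```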
Chapter 3: Data Preprocessing
• Data Preprocessing: An Overview
• Data Quality
• Major Tasks in Data Preprocessing
• Data Cleaning
• Data Integration
• Data Reduction
• Data Transformation and Data Discretization
• Summary
Data Integration
• Data integration: combines data from multiple sources into a coherent store
• Schema integration: e.g., A.cust-id ≡ B.cust-#
  • Integrate metadata from different sources
  • How can equivalent real-world entities from multiple data sources be matched up?
• Entity identification problem:
  • Identify real-world entities across multiple data sources, e.g., Bill Clinton = William Clinton
  • Use the metadata of attributes (name, meaning, data type, etc.)
Data Integration
• Attribute functional dependencies and referential constraints may vary across sources: in one system a discount may be applied to the whole order, whereas in another it is applied to each individual line item within the order.
• For the same real-world entity, attribute values from different sources may differ. Possible reasons: different representations, different scales, e.g., metric vs. British units.
Handling Redundancy in Data Integration
• Redundant data occur often when integrating multiple databases
  • Object identification: the same attribute or object may have different names in different databases
  • Derivable data: one attribute may be a “derived” attribute in another table, e.g., annual revenue
• Redundant attributes may be detected by correlation analysis and covariance analysis.
• Careful integration of data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality.
Correlation Analysis (Nominal Data)
• χ² (chi-square) test:

  $\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$

• The larger the χ² value, the more likely the variables are related
• The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count
• Correlation does not imply causality
  • # of hospitals and # of car thefts in a city are correlated
  • Both are causally linked to a third variable: population
Chi-Square Calculation: An Example

                            Play chess   Not play chess   Sum (row)
Like science fiction        250 (90)     200 (360)        450
Not like science fiction     50 (210)    1000 (840)       1050
Sum (col.)                  300          1200             1500

• χ² (chi-square) calculation (numbers in parentheses are expected counts, calculated from the data distribution in the two categories):

  $\chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} = 507.93$

• It shows that like_science_fiction and play_chess are correlated in this group (a verification sketch follows below)
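The slide’s arithmetic is easy to reproduce: compute the expected counts from the row and column sums, then accumulate the χ² statistic. A minimal sketch in plain Python:

```python
# Rows: like / don't like science fiction; columns: play / don't play chess.
observed = [[250, 200],
            [50, 1000]]

row_sums = [sum(row) for row in observed]          # [450, 1050]
col_sums = [sum(col) for col in zip(*observed)]    # [300, 1200]
total = sum(row_sums)                              # 1500

chi2 = 0.0
for i in range(2):
    for j in range(2):
        expected = row_sums[i] * col_sums[j] / total
        chi2 += (observed[i][j] - expected) ** 2 / expected

print(round(chi2, 2))  # 507.94 (the slide truncates to 507.93)
```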
Correlation Analysis (Numeric Data)
• Correlation coefficient (also called Pearson’s product-moment coefficient):

  $r_{A,B} = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{n\,\sigma_A \sigma_B} = \frac{\sum_{i=1}^{n} a_i b_i - n\bar{A}\bar{B}}{n\,\sigma_A \sigma_B}$

  where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means of A and B, $\sigma_A$ and $\sigma_B$ are the respective standard deviations of A and B, and $\sum a_i b_i$ is the sum of the AB cross-product.
• If rA,B > 0, A and B are positively correlated (A’s values increase as B’s do); the higher the value, the stronger the correlation.
• rA,B = 0: uncorrelated (no linear relationship); rA,B < 0: negatively correlated.
• A high absolute value may indicate that A (or B) can be removed as a redundancy.
Correlation (Viewed as a Linear Relationship)
• Correlation measures the linear relationship between objects.
• To compute correlation, we standardize the data objects A and B, and then take their dot product (verified in the sketch below):

  $a'_k = \frac{a_k - \mathrm{mean}(A)}{\mathrm{std}(A)}, \qquad b'_k = \frac{b_k - \mathrm{mean}(B)}{\mathrm{std}(B)}$

  $\mathrm{correlation}(A, B) = A' \cdot B'$
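Both forms of the coefficient can be checked numerically. A sketch with NumPy, reusing the stock values from the covariance example below as hypothetical data; note that with population standard deviations the dot product of the standardized vectors must still be divided by n, a normalization the slide leaves implicit:

```python
import numpy as np

a = np.array([2.0, 3.0, 5.0, 4.0, 6.0])
b = np.array([5.0, 8.0, 10.0, 11.0, 14.0])
n = len(a)

# Definition: sum of cross-deviations over n * sigma_A * sigma_B
# (np.std defaults to the population standard deviation).
r_def = ((a - a.mean()) * (b - b.mean())).sum() / (n * a.std() * b.std())

# Standardize each attribute, then take the (scaled) dot product.
a_p = (a - a.mean()) / a.std()
b_p = (b - b.mean()) / b.std()
r_dot = (a_p @ b_p) / n

print(round(r_def, 4), round(r_dot, 4))  # both ≈ 0.9407: strong positive correlation
```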
Covariance (Numeric Data)
• Covariance is similar to correlation:

  $\mathrm{Cov}(A,B) = E\big[(A - \bar{A})(B - \bar{B})\big] = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{n}$

  Correlation coefficient: $r_{A,B} = \frac{\mathrm{Cov}(A,B)}{\sigma_A \sigma_B}$

  where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective mean or expected values of A and B, and $\sigma_A$ and $\sigma_B$ are the respective standard deviations of A and B.
• Positive covariance: if CovA,B > 0, then A and B both tend to be larger than their expected values.
• Negative covariance: if CovA,B < 0, then when A is larger than its expected value, B is likely to be smaller than its expected value.
• Independence: if A and B are independent, CovA,B = 0, but the converse is not true:
  • Some pairs of random variables may have a covariance of 0 but not be independent. Only under some additional assumptions (e.g., the data follow multivariate normal distributions) does a covariance of 0 imply independence.
Covariance: An Example
• The covariance can be simplified in computation as

  $\mathrm{Cov}(A,B) = E(A \cdot B) - \bar{A}\bar{B}$

• Suppose two stocks A and B have the following values in one week: (2, 5), (3, 8), (5, 10), (4, 11), (6, 14).
• Question: if the stocks are affected by the same industry trends, will their prices rise or fall together?
  • E(A) = (2 + 3 + 5 + 4 + 6) / 5 = 20/5 = 4
  • E(B) = (5 + 8 + 10 + 11 + 14) / 5 = 48/5 = 9.6
  • Cov(A,B) = (2×5 + 3×8 + 5×10 + 4×11 + 6×14)/5 − 4 × 9.6 = 42.4 − 38.4 = 4
• Thus, A and B rise together, since Cov(A, B) > 0 (checked in the sketch below).
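A quick check of the arithmetic above, using the simplified formula Cov(A, B) = E(A·B) − Ā·B̄ in plain Python:

```python
# Stock prices over one week, from the example above.
A = [2, 3, 5, 4, 6]
B = [5, 8, 10, 11, 14]
n = len(A)

mean_a = sum(A) / n                           # 4.0
mean_b = sum(B) / n                           # 9.6
e_ab = sum(a * b for a, b in zip(A, B)) / n   # 42.4

cov = e_ab - mean_a * mean_b
print(cov)  # 4.0 (up to float rounding); > 0, so A and B rise together
```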
Chapter 3: Data Preprocessing
• Data Preprocessing: An Overview
• Data Quality
• Major Tasks in Data Preprocessing
• Data Cleaning
• Data Integration
• Data Reduction
• Data Transformation and Data Discretization
• Summary
Data Reduction 1: Dimensionality Reduction
• Curse of dimensionality
  • When dimensionality increases, data becomes increasingly sparse
  • Density and distance between points, which are critical to clustering and outlier analysis, become less meaningful
  • The number of possible subspace combinations grows exponentially
• Dimensionality reduction
  • Avoids the curse of dimensionality
  • Helps eliminate irrelevant features and reduce noise
  • Reduces the time and space required in data mining
  • Allows easier visualization
• Dimensionality reduction techniques
  • Wavelet transforms
  • Principal Component Analysis (PCA; see the sketch below)
  • Supervised and nonlinear techniques (e.g., feature selection)
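As one concrete instance of these techniques, PCA reduces dimensionality by projecting the data onto the directions of greatest variance. A hedged sketch, assuming scikit-learn is installed, on synthetic data that is 10-dimensional but intrinsically 2-dimensional:

```python
import numpy as np
from sklearn.decomposition import PCA  # assumes scikit-learn is installed

rng = np.random.default_rng(42)
# 200 synthetic points: 2 latent factors linearly mixed into 10 features.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + rng.normal(scale=0.05, size=(200, 10))

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)            # 200 x 2 representation
print(X_reduced.shape)                      # (200, 2)
print(pca.explained_variance_ratio_.sum())  # close to 1.0: little variance lost
```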
Parametric Data Reduction: Regression and Log-Linear Models
• Linear regression
  • Data are modeled to fit a straight line
  • Often uses the least-squares method to fit the line
• Multiple regression
  • Allows a response variable Y to be modeled as a linear function of a multidimensional feature vector
Regression Analysis
• Regression analysis: a collective name for techniques for the modeling and analysis of numerical data consisting of values of a dependent variable (also called response variable or measurement) and of one or more independent variables (aka explanatory variables or predictors)
• The parameters are estimated so as to give a “best fit” of the data
• Most commonly the best fit is evaluated using the least-squares method, but other criteria have also been used
• Used for prediction (including forecasting of time-series data), inference, hypothesis testing, and modeling of causal relationships

[Figure: scatter plot of y vs. x with the fitted line y = x + 1; for a point (X1, Y1), the line predicts the value Y1′]
Regression Analysis and Log-Linear Models
• Linear regression:
  • Linear regression is an algorithm that models a linear relationship between an independent variable and a dependent variable to predict the outcome of future events:

  Y = wX + b

  • Two regression coefficients, w and b, specify the line and are to be estimated using the data at hand
  • Applying the least-squares criterion to the known values of Y1, Y2, …, X1, X2, … yields the estimates (see the sketch below)
• Multiple regression: Y = b0 + b1 X1 + b2 X2
  • Many nonlinear functions can be transformed into the above
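A minimal sketch of both models with NumPy, using made-up data; the closed-form slope and intercept formulas implement the least-squares criterion, and np.linalg.lstsq handles the multiple-regression case:

```python
import numpy as np

# Simple linear regression Y = w*X + b on hypothetical values.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

# Closed-form least-squares estimates of the two coefficients.
w = ((X - X.mean()) * (Y - Y.mean())).sum() / ((X - X.mean()) ** 2).sum()
b = Y.mean() - w * X.mean()
print(round(w, 2), round(b, 2))  # 0.99 1.05

# Multiple regression Y = b0 + b1*X1 + b2*X2 via least squares.
X2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
design = np.column_stack([np.ones_like(X), X, X2])
coef, *_ = np.linalg.lstsq(design, Y, rcond=None)
print(coef)  # [b0, b1, b2]
```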

Editor's Notes

  • Data Cleaning as a Process: Field overloading is another error source; it typically results when developers squeeze new attribute definitions into unused (bit) portions of already defined attributes (e.g., an unused bit of an attribute whose value range uses only, say, 31 out of 32 bits).
  • Correlation (viewed as a linear relationship): a linear relationship (or linear association) is a statistical term used to describe a straight-line relationship between two variables.
  • Covariance: correlation and covariance are two similar measures for assessing how much two attributes change together.