2. 2
Chapter 3: Data Preprocessing
Data Preprocessing: An Overview
Data Quality
Major Tasks in Data Preprocessing
Data Cleaning
Data Integration
Data Reduction
Data Transformation and Data Discretization
Summary
3. 3
Data Quality: Why Preprocess the Data?
Measures for data quality: A multidimensional view
Accuracy: correct or wrong, accurate or not
Completeness: not recorded, unavailable, …
Consistency: some modified but some not, dangling, …
Timeliness: timely update?
Believability: how much can the data be trusted to be correct?
Interpretability: how easily can the data be understood?
4. 4
Major Tasks in Data Preprocessing
Data cleaning
Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Data reduction
Dimensionality reduction
Numerosity reduction
Data compression
Data transformation and data discretization
Normalization
Concept hierarchy generation
5. 5
Chapter 3: Data Preprocessing
Data Preprocessing: An Overview
Data Quality
Major Tasks in Data Preprocessing
Data Cleaning
Data Integration
Data Reduction
Data Transformation and Data Discretization
Summary
6. 6
Data Cleaning
Data in the Real World Is Dirty: Lots of potentially incorrect data, e.g.,
faulty instruments, human or computer error, transmission error
incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
e.g., Occupation=“ ” (missing data)
noisy: containing noise, errors, or outliers
e.g., Salary=“−10” (an error)
inconsistent: containing discrepancies in codes or names, e.g.,
Age=“42”, Birthday=“03/07/2010”
Was rating “1, 2, 3”, now rating “A, B, C”
discrepancy between duplicate records
Intentional (e.g., disguised missing data)
Jan. 1 as everyone’s birthday?
7. 7
Incomplete (Missing) Data
Data is not always available
E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
Missing data may be due to
equipment malfunction
inconsistent with other recorded data and thus deleted
data not entered due to misunderstanding
certain data may not be considered important at the
time of entry
history or changes of the data were not registered
Missing data may need to be inferred
8. 8
How to Handle Missing Data?
Ignore the tuple: usually done when class label is missing
(when doing classification)—not effective when the % of
missing values per attribute varies considerably
Fill in the missing value manually: tedious + infeasible?
Fill it in automatically with
a global constant: e.g., “unknown”, a new class?!
the attribute mean
the attribute mean for all samples belonging to the same
class: smarter
the most probable value: inference-based such as
Bayesian formula or decision tree
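As a minimal sketch of these automatic strategies, assuming a small pandas DataFrame with hypothetical "class" and "income" columns:

```python
# A minimal sketch of automatic missing-value filling with pandas.
# The DataFrame and its column names ("class", "income") are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "class":  ["A", "A", "B", "B", "B"],
    "income": [30000, None, 52000, None, 48000],
})

# Global constant: flag the value as unknown
df["income_const"] = df["income"].fillna(-1)

# Attribute mean over all samples
df["income_mean"] = df["income"].fillna(df["income"].mean())

# Attribute mean per class (usually smarter)
df["income_class_mean"] = df["income"].fillna(
    df.groupby("class")["income"].transform("mean")
)
print(df)
```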
9. 9
Noisy Data
Noise: random error or variance in a measured variable
Incorrect attribute values may be due to
faulty data collection instruments
data entry problems
data transmission problems
technology limitation
inconsistency in naming convention
Other data problems which require data cleaning
duplicate records
incomplete data
inconsistent data
10. 10
How to Handle Noisy Data?
Binning
first sort data and partition into (equal-frequency) bins
then one can smooth by bin means, smooth by bin
median, smooth by bin boundaries, etc.
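A minimal sketch of this procedure, smoothing by bin means and by bin boundaries; the nine sorted values are hypothetical:

```python
# A minimal sketch of equal-frequency binning with smoothing by bin means
# and by bin boundaries; the data values are hypothetical.
import numpy as np

data = np.sort(np.array([4, 8, 15, 21, 21, 24, 25, 28, 34]))
bins = np.array_split(data, 3)          # 3 equal-frequency bins

smoothed_means, smoothed_bounds = [], []
for b in bins:
    # Smoothing by bin means: replace each value with the bin mean
    smoothed_means.extend([b.mean()] * len(b))
    # Smoothing by bin boundaries: snap each value to the nearer boundary
    lo, hi = b[0], b[-1]
    smoothed_bounds.extend([lo if v - lo <= hi - v else hi for v in b])

print(smoothed_means)   # [9, 9, 9, 22, 22, 22, 29, 29, 29]
print(smoothed_bounds)  # [4, 4, 15, 21, 21, 24, 25, 25, 34]
```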
11. 11
How to Handle Noisy Data?
Regression (see the sketch after this list)
Data smoothing can also be done by regression, a
technique that conforms data values to a function.
Linear regression involves finding the “best” line to
fit two attributes (or variables) so that one attribute
can be used to predict the other.
Clustering
detect and remove outliers
Combined computer and human inspection
detect suspicious values and check by human (e.g.,
deal with possible outliers)
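As referenced above, a minimal sketch of regression-based smoothing: fit a least-squares line to two hypothetical attributes and replace the noisy values with the fitted ones:

```python
# A minimal sketch of regression-based smoothing; x and y are hypothetical
# attributes, where y contains noisy measurements.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.4, 3.9, 5.1])      # noisy measurements

w, b = np.polyfit(x, y, deg=1)               # least-squares line y = w*x + b
y_smooth = w * x + b                          # smoothed values on the line
print(y_smooth)
```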
12. 12
Data Cleaning as a Process
Data discrepancy detection
Use metadata (e.g., domain, range, dependency, distribution)
Check field overloading
Check uniqueness rule, consecutive rule and null rule
Use commercial tools
Data scrubbing: use simple domain knowledge (e.g., postal
code, spell-check) to detect errors and make corrections
Data auditing: by analyzing data to discover rules and
relationships to detect violators (e.g., correlation and clustering
to find outliers)
Data migration and integration
Data migration tools: allow transformations to be specified
ETL (Extraction/Transformation/Loading) tools: allow users to
specify transformations through a graphical user interface
13. 13
Chapter 3: Data Preprocessing
Data Preprocessing: An Overview
Data Quality
Major Tasks in Data Preprocessing
Data Cleaning
Data Integration
Data Reduction
Data Transformation and Data Discretization
Summary
14. 14
Data Integration
Data integration:
Combines data from multiple sources into a coherent store
Schema integration: e.g., A.cust-id ≡ B.cust-#
Integrate metadata from different sources
How can equivalent real-world entities from multiple data sources
be matched up?
Entity identification problem:
Identify real-world entities from multiple data sources, e.g., Bill
Clinton = William Clinton
Metadata of attributes (name, meaning, data type, etc.) can help
15. 15
Data Integration
Data integration:
Attribute functional dependencies and referential constraints can vary
across sources: e.g., in one system a discount may be applied to the
whole order, whereas in another it is applied to each individual line
item within the order.
For the same real-world entity, attribute values from different sources
may differ. Possible reasons: different representations, different
scales, e.g., metric vs. British units
16. 16
Handling Redundancy in Data Integration
Redundant data often occur when integrating multiple databases
Object identification: The same attribute or object
may have different names in different databases
Derivable data: One attribute may be a “derived”
attribute in another table, e.g., annual revenue
Redundant attributes may be detected by correlation analysis and
covariance analysis.
Careful integration of the data from multiple sources may
help reduce/avoid redundancies and inconsistencies and
improve mining speed and quality.
17. 17
Correlation Analysis (Nominal Data)
χ² (chi-square) test
The larger the χ² value, the more likely the variables are related
The cells that contribute the most to the χ² value are those whose
actual count is very different from the expected count
Correlation does not imply causality
# of hospitals and # of car-theft in a city are correlated
Both are causally linked to the third variable: population
The statistic is computed as:
χ² = Σ (Observed − Expected)² / Expected
18. 18
Chi-Square Calculation: An Example
χ² (chi-square) calculation (numbers in parentheses are expected counts,
calculated based on the data distribution in the two categories):

                          Play chess   Not play chess   Sum (row)
Like science fiction      250 (90)       200 (360)         450
Not like science fiction   50 (210)     1000 (840)        1050
Sum (col.)                300           1200              1500

χ² = (250 − 90)²/90 + (50 − 210)²/210 + (200 − 360)²/360
   + (1000 − 840)²/840 = 507.93

It shows that like_science_fiction and play_chess are correlated in the
group.
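As a cross-check, a minimal sketch reproducing this χ² value with SciPy; correction=False disables Yates' continuity correction, which SciPy otherwise applies to 2×2 tables:

```python
# A minimal sketch that reproduces the chi-square value above with SciPy.
import numpy as np
from scipy.stats import chi2_contingency

#                  play chess   not play chess
observed = np.array([[250,         200],     # like science fiction
                     [ 50,        1000]])    # not like science fiction

# correction=False: use the plain chi-square formula, no Yates correction
chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(chi2)      # ~507.93
print(expected)  # [[90, 360], [210, 840]]
```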
19. 19
Correlation Analysis (Numeric Data)
Correlation coefficient (also called Pearson’s product-moment
coefficient):

rA,B = Σᵢ (aᵢ − Ā)(bᵢ − B̄) / (n σA σB) = (Σᵢ aᵢbᵢ − n·Ā·B̄) / (n σA σB)

where n is the number of tuples, Ā and B̄ are the respective means of A
and B, σA and σB are the respective standard deviations of A and B, and
Σᵢ aᵢbᵢ is the sum of the AB cross-product.
If rA,B > 0, A and B are positively correlated (A’s values increase as
B’s do). The higher the value, the stronger the correlation.
rA,B = 0: A and B are uncorrelated (no linear relationship); rA,B < 0:
negatively correlated.
A high correlation may indicate that A (or B) can be removed as a
redundancy.
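A minimal sketch computing rA,B for two hypothetical attributes, once via the formula above and once with NumPy's built-in corrcoef:

```python
# A minimal sketch of the Pearson correlation coefficient; the data
# values for A and B are hypothetical.
import numpy as np

A = np.array([2.0, 3.0, 5.0, 4.0, 6.0])
B = np.array([5.0, 8.0, 10.0, 11.0, 14.0])
n = len(A)

# Formula above; np.std defaults to the population standard deviation
r = ((A * B).sum() - n * A.mean() * B.mean()) / (n * A.std() * B.std())
print(r)                        # ~0.94
print(np.corrcoef(A, B)[0, 1])  # same value
```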
20. 20
Correlation (viewed as linear relationship)
Correlation measures the linear relationship between objects.
To compute correlation, we standardize the data objects A and B, and
then take their dot product:

a′ₖ = (aₖ − mean(A)) / std(A)
b′ₖ = (bₖ − mean(B)) / std(B)
correlation(A, B) = A′ · B′ / n
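A minimal sketch of this standardize-then-dot-product view, on the same hypothetical values as the previous sketch:

```python
# A minimal sketch of correlation as a standardized dot product; the
# data values are hypothetical.
import numpy as np

A = np.array([2.0, 3.0, 5.0, 4.0, 6.0])
B = np.array([5.0, 8.0, 10.0, 11.0, 14.0])

A_std = (A - A.mean()) / A.std()   # standardize: zero mean, unit std
B_std = (B - B.mean()) / B.std()
print(A_std @ B_std / len(A))      # equals the Pearson correlation (~0.94)
```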
21. 21
Covariance (Numeric Data)
Covariance is similar to correlation:

Cov(A, B) = E((A − Ā)(B − B̄)) = Σᵢ (aᵢ − Ā)(bᵢ − B̄) / n

Correlation coefficient: rA,B = Cov(A, B) / (σA σB)

where n is the number of tuples, Ā and B̄ are the respective means (or
expected values) of A and B, and σA and σB are the respective standard
deviations of A and B.
Positive covariance: If CovA,B > 0, then A and B both tend to be larger
than their expected values.
Negative covariance: If CovA,B < 0, then if A is larger than its expected
value, B is likely to be smaller than its expected value.
Independence: If A and B are independent, CovA,B = 0, but the converse is
not true: some pairs of random variables may have a covariance of 0 but
are not independent. Only under some additional assumptions (e.g., the
data follow multivariate normal distributions) does a covariance of 0
imply independence.
22. 22
Covariance: An Example
Covariance can be simplified in computation as: Cov(A, B) = E(A·B) − Ā·B̄
Suppose two stocks A and B have the following values in one week:
(2, 5), (3, 8), (5, 10), (4, 11), (6, 14).
Question: If the stocks are affected by the same industry trends, will
their prices rise or fall together?
E(A) = (2 + 3 + 5 + 4 + 6)/ 5 = 20/5 = 4
E(B) = (5 + 8 + 10 + 11 + 14) /5 = 48/5 = 9.6
Cov(A,B) = (2×5+3×8+5×10+4×11+6×14)/5 − 4 × 9.6 = 4
Thus, A and B rise together since Cov(A, B) > 0.
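A minimal sketch checking this result with NumPy; bias=True selects the population formula (division by n), as used above:

```python
# A minimal sketch verifying Cov(A, B) = 4 for the stock prices above.
import numpy as np

A = np.array([2.0, 3.0, 5.0, 4.0, 6.0])
B = np.array([5.0, 8.0, 10.0, 11.0, 14.0])

print((A * B).mean() - A.mean() * B.mean())  # simplified formula: 4.0
print(np.cov(A, B, bias=True)[0, 1])         # population covariance: 4.0
```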
23. 23
Chapter 3: Data Preprocessing
Data Preprocessing: An Overview
Data Quality
Major Tasks in Data Preprocessing
Data Cleaning
Data Integration
Data Reduction
Data Transformation and Data Discretization
Summary
24. 24
Data Reduction 1: Dimensionality Reduction
Curse of dimensionality
When dimensionality increases, data becomes increasingly sparse
Density and distance between points, which are critical to clustering and
outlier analysis, become less meaningful
The number of possible combinations of subspaces grows exponentially
Dimensionality reduction
Avoid the curse of dimensionality
Help eliminate irrelevant features and reduce noise
Reduce time and space required in data mining
Allow easier visualization
Dimensionality reduction techniques
Wavelet transforms
Principal Component Analysis (PCA; see the sketch after this list)
Supervised and nonlinear techniques (e.g., feature selection)
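As referenced above, a minimal sketch of PCA-based dimensionality reduction with scikit-learn, on a hypothetical 100 × 10 random data matrix:

```python
# A minimal sketch of dimensionality reduction with PCA; the random
# 100x10 data matrix is hypothetical.
import numpy as np
from sklearn.decomposition import PCA

X = np.random.default_rng(0).normal(size=(100, 10))  # 100 tuples, 10 attributes

pca = PCA(n_components=3)          # keep the 3 strongest components
X_reduced = pca.fit_transform(X)   # shape (100, 3)
print(X_reduced.shape)
print(pca.explained_variance_ratio_)
```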
25. 25
Parametric Data Reduction: Regression
and Log-Linear Models
Linear regression
Data modeled to fit a straight line
Often uses the least-squares method to fit the line
Multiple regression
Allows a response variable Y to be modeled as a
linear function of a multidimensional feature vector
26. 26
Regression Analysis
Regression analysis: A collective name for
techniques for the modeling and analysis
of numerical data consisting of values of a
dependent variable (also called response
variable or measurement) and of one or
more independent variables (aka.
explanatory variables or predictors)
The parameters are estimated so as to
give a "best fit" of the data
Most commonly the best fit is evaluated by
using the least squares method, but other
criteria have also been used
Used for prediction
(including forecasting of
time-series data), inference,
hypothesis testing, and
modeling of causal
relationships
(Figure: scatter plot of y vs. x with the fitted line y = x + 1; a data
point (X1, Y1) and its predicted value Y1′ on the line.)
27. 27
Regression Analysis and Log-Linear Models
Linear regression:
Linear regression models a linear relationship between an independent
variable and a dependent variable, which can be used to predict the
outcome of future events.
Y = w X + b
Two regression coefficients, w and b, specify the line and are
estimated by applying the least-squares criterion to the known values
Y1, Y2, …, X1, X2, ….
Multiple regression: Y = b0 + b1 X1 + b2 X2
Many nonlinear functions can be transformed into the above
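A minimal sketch estimating w and b by least squares, first in closed form and then with np.polyfit; the x and y values are hypothetical:

```python
# A minimal sketch of least-squares estimation of w and b for Y = wX + b;
# the x/y values are hypothetical.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 5.1, 5.8])

# Closed-form least squares: w = cov(x, y) / var(x), b = mean(y) - w*mean(x)
w = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b = y.mean() - w * x.mean()
print(w, b)                     # ~0.96, ~1.14

print(np.polyfit(x, y, deg=1))  # same coefficients
```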
Editor's Notes
#12: Field overloading is another error source that typically results when developers squeeze new attribute definitions into unused (bit) portions of already defined attributes (e.g.,
an unused bit of an attribute that has a value range that uses only, say, 31 out of 32 bits).
#20: A linear relationship (or linear association) is a statistical term used to describe a straight-line relationship between two variables.
#21: Correlation and covariance are two similar measures for assessing how much two attributes change together.