Datamining & Warehousing
Dr.VIDHYA B
ASSISTANT PROFESSOR & HEAD
Department of Computer Technology
Sri Ramakrishna College of Arts and Science
Coimbatore - 641 006
Tamil Nadu, India
Unit 2 - Preprocessing
Chapter 3: Data Preprocessing
• Data Preprocessing: An Overview
• Data Quality
• Major Tasks in Data Preprocessing
• Data Cleaning
• Data Integration
• Data Reduction
• Data Transformation and Data Discretization
• Summary
Data Integration
• Data integration:
  • Combines data from multiple sources into a coherent store
  • Integration helps to reduce and avoid redundancies and inconsistencies
• Schema integration: e.g., A.cust-id ≡ B.cust-#
  • Integrate metadata from different sources
Data Integration
Entity identification problem:
A data analysis task often involves data integration, which combines data from multiple sources into a coherent data store, as in data warehousing. These sources may include multiple databases, data cubes, or flat files.
Issues in data integration: schema integration and object matching
• Identify real-world entities from multiple data sources, e.g., Bill Clinton = William Clinton
Detecting and resolving data value conflicts
• For the same real-world entity, attribute values from different sources may differ
• Possible reasons: different representations or different scales, e.g., metric vs. British units
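The two issues above, schema integration and object matching, can be sketched in a few lines of Python. Everything specific here is an assumption made for illustration: the sample records, the shared field name `customer_id`, and the tiny hand-written nickname table. Real systems rely on much richer metadata and matching rules.

```python
# Hypothetical records from two sources; field names and values are invented.
source_a = [{"cust-id": 101, "name": "William Clinton"}]
source_b = [{"cust-#": 101, "name": "Bill Clinton"}]

# Schema integration: map each source's attribute names to one shared schema.
FIELD_MAP = {"cust-id": "customer_id", "cust-#": "customer_id", "name": "name"}

# Object matching: a hand-written nickname table (an assumption for this sketch).
NICKNAMES = {"bill": "william"}


def to_common_schema(record):
    """Rename a record's fields to the shared schema."""
    return {FIELD_MAP[key]: value for key, value in record.items()}


def canonical_name(name):
    """Lower-case a name and expand a known nickname in the first token."""
    first, *rest = name.lower().split()
    return " ".join([NICKNAMES.get(first, first), *rest])


def same_entity(rec_a, rec_b):
    """Treat two records as one entity if id and canonical name agree."""
    return (rec_a["customer_id"] == rec_b["customer_id"]
            and canonical_name(rec_a["name"]) == canonical_name(rec_b["name"]))


a = to_common_schema(source_a[0])
b = to_common_schema(source_b[0])
print(same_entity(a, b))
```

Renaming fields first and matching entities second keeps the two concerns separate, so the matching rule only ever sees the shared schema.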
Handling Redundancy in Data Integration
• Redundant data often occur when multiple databases are integrated
  • Object identification: the same attribute or object may have different names in different databases
  • Derivable data: one attribute may be a “derived” attribute in another table, e.g., annual revenue
• Redundant attributes can often be detected by correlation analysis and covariance analysis
• Careful integration of data from multiple sources may help reduce or avoid redundancies and inconsistencies, and improve mining speed and quality
Correlation Analysis (Nominal Data)
• χ² (chi-square) test:

$$\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$$

• The larger the χ² value, the more likely the variables are related
• The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count
• Correlation does not imply causality
  • # of hospitals and # of car thefts in a city are correlated
  • Both are causally linked to a third variable: population
Chi-Square Calculation: An Example
• χ² (chi-square) calculation (numbers in parentheses are expected counts, calculated from the data distribution in the two categories)

                           Play chess   Not play chess   Sum (row)
Like science fiction       250 (90)     200 (360)        450
Not like science fiction   50 (210)     1000 (840)       1050
Sum (col.)                 300          1200             1500

$$\chi^2 = \frac{(250 - 90)^2}{90} + \frac{(50 - 210)^2}{210} + \frac{(200 - 360)^2}{360} + \frac{(1000 - 840)^2}{840} \approx 507.93$$

• It shows that like_science_fiction and play_chess are correlated in the group
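The calculation can be checked with a short Python sketch. It takes only the observed counts from the table and derives each expected count as (row total × column total) / grand total. The variable names are chosen for this example.

```python
# Observed counts from the slide's contingency table.
# Rows: like / not like science fiction; columns: play / not play chess.
observed = [[250, 200],
            [50, 1000]]

row_sums = [sum(row) for row in observed]
col_sums = [sum(col) for col in zip(*observed)]
total = sum(row_sums)

# Expected count for a cell = (row total * column total) / grand total.
expected = [[r * c / total for c in col_sums] for r in row_sums]

# Chi-square statistic: sum over cells of (observed - expected)^2 / expected.
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

print(expected)
print(round(chi_square, 2))
```

The expected counts match the parenthesized values in the table. The statistic comes out at roughly 507.94 when rounded to two decimals; the 507.93 on the slide is the same value truncated.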
Correlation Analysis (Numeric Data)
• Correlation coefficient (also called Pearson’s product-moment coefficient):

$$r_{A,B} = \frac{\sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B})}{(n - 1)\,\sigma_A \sigma_B} = \frac{\sum_{i=1}^{n} (a_i b_i) - n \bar{A} \bar{B}}{(n - 1)\,\sigma_A \sigma_B}$$

  where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means of A and B, $\sigma_A$ and $\sigma_B$ are the respective standard deviations of A and B, and $\sum (a_i b_i)$ is the sum of the AB cross-product.
• If $r_{A,B} > 0$, A and B are positively correlated (A’s values increase as B’s do). The higher the value, the stronger the correlation.
• $r_{A,B} = 0$: no linear correlation (this alone does not establish independence); $r_{A,B} < 0$: negatively correlated
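The coefficient can be computed directly from its definition. The sketch below is a minimal illustration that uses the form with n - 1 in the denominator together with sample standard deviations. The function name `pearson_r` and the sample values are assumptions made for this example.

```python
import math


def pearson_r(a, b):
    """Pearson correlation using the (n - 1) form with sample std devs."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    # Sample standard deviations (divide by n - 1).
    std_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / (n - 1))
    std_b = math.sqrt(sum((y - mean_b) ** 2 for y in b) / (n - 1))
    # Sum of products of deviations from the means.
    cross = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    return cross / ((n - 1) * std_a * std_b)


# Illustrative values only (not taken from a real data set).
A = [2, 3, 5, 4, 6]
B = [5, 8, 10, 11, 14]
print(round(pearson_r(A, B), 4))
```

For these values the result is close to 1, which signals a strong positive linear relationship. A perfectly linear pair such as [1, 2, 3] and [2, 4, 6] gives exactly 1 up to floating-point rounding.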
Visually Evaluating Correlation
Scatter plots showing the similarity from -1 to 1.
Correlation (viewed as linear relationship)
• Correlation measures the linear relationship between objects
• To compute correlation, we standardize data objects, A and B, and then take their dot product

$$a'_k = \frac{a_k - \text{mean}(A)}{\text{std}(A)}$$

$$b'_k = \frac{b_k - \text{mean}(B)}{\text{std}(B)}$$

$$\text{correlation}(A, B) = A' \cdot B'$$
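The standardize-then-dot-product view can be sketched as follows. One detail the slide leaves implicit is the scaling: when std is the sample standard deviation, the raw dot product equals (n - 1) times Pearson's coefficient, so this sketch divides by n - 1 to land in the range -1 to 1. The function names and sample values are assumptions for illustration.

```python
import statistics


def standardize(values):
    """Subtract the mean and divide by the sample standard deviation."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample std dev (n - 1 denominator)
    return [(v - mean) / std for v in values]


def correlation(a, b):
    """Dot product of standardized vectors, scaled by 1 / (n - 1)."""
    a_std = standardize(a)
    b_std = standardize(b)
    dot = sum(x * y for x, y in zip(a_std, b_std))
    return dot / (len(a) - 1)


# Illustrative values only.
A = [1.0, 2.0, 3.0, 4.0]
B = [2.0, 4.0, 6.0, 8.5]
print(correlation(A, B))
```

Using the population standard deviation instead (`statistics.pstdev`) would call for dividing by n rather than n - 1; either pairing gives the same coefficient.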
Covariance (Numeric Data)
• Covariance is similar to correlation: both assess how two attributes change together.

$$\mathrm{Cov}(A, B) = E\big((A - \bar{A})(B - \bar{B})\big) = \frac{\sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B})}{n}$$

• Correlation coefficient:

$$r_{A,B} = \frac{\mathrm{Cov}(A, B)}{\sigma_A \sigma_B}$$

  where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means (expected values) of A and B, and $\sigma_A$ and $\sigma_B$ are the respective standard deviations of A and B.
• Positive covariance: if Cov(A, B) > 0, then A and B both tend to be larger than their expected values.
• Negative covariance: if Cov(A, B) < 0, then if A is larger than its expected value, B is likely to be smaller than its expected value.
• Independence: if A and B are independent, then Cov(A, B) = 0, but the converse is not true:
  • Some pairs of random variables may have a covariance of 0 but are not independent. Only under some additional assumptions (e.g., the data follow multivariate normal distributions) does a covariance of 0 imply independence.
Covariance: An Example
• The computation can be simplified as

$$\mathrm{Cov}(A, B) = E(A \cdot B) - \bar{A}\bar{B}$$

• Suppose two stocks A and B have the following values in one week: (2, 5), (3, 8), (5, 10), (4, 11), (6, 14).
• Question: If the stocks are affected by the same industry trends, will their prices rise or fall together?
  • E(A) = (2 + 3 + 5 + 4 + 6) / 5 = 20/5 = 4
  • E(B) = (5 + 8 + 10 + 11 + 14) / 5 = 48/5 = 9.6
  • Cov(A, B) = (2×5 + 3×8 + 5×10 + 4×11 + 6×14)/5 − 4 × 9.6 = 4
• Thus, A and B rise together since Cov(A, B) > 0.
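The stock example can be reproduced with a few lines of Python. This is a minimal sketch of the shortcut E(A·B) minus the product of the means, dividing by n as the example does. The function and variable names are chosen for this illustration.

```python
def covariance(a, b):
    """Population covariance via the shortcut E(A*B) - mean(A) * mean(B)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    mean_ab = sum(x * y for x, y in zip(a, b)) / n
    return mean_ab - mean_a * mean_b


# Stock values from the example: (A, B) pairs for one week.
pairs = [(2, 5), (3, 8), (5, 10), (4, 11), (6, 14)]
stock_a = [p[0] for p in pairs]
stock_b = [p[1] for p in pairs]

cov = covariance(stock_a, stock_b)
print(round(cov, 2))
```

The result agrees with the hand calculation of 4, up to floating-point rounding, and its positive sign is what supports the conclusion that the two stocks tend to move together.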
Tuple Duplication
• Apart from detecting redundancies between attributes, detecting duplicates at the tuple level is equally important.
• Denormalized tables are another source of data redundancy.
• Inconsistencies often arise from inaccurate data entry.
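A simple way to picture tuple-level duplicate detection is to normalize each tuple and keep only the first occurrence of each normalized form. The sketch below assumes invented customer tuples and a deliberately simple normalization (trimming whitespace and lower-casing); real data usually needs more careful matching.

```python
# Hypothetical customer tuples; the second and third differ only in
# letter case and surrounding whitespace.
rows = [
    ("C001", "Asha Rao", "Coimbatore"),
    ("C002", "Ravi Kumar", "Chennai"),
    ("C002", " ravi kumar ", "CHENNAI"),
]


def normalize(row):
    """Trim whitespace and lower-case every field so near-identical tuples match."""
    return tuple(field.strip().lower() for field in row)


def drop_duplicates(records):
    """Keep the first occurrence of each normalized tuple, preserving order."""
    seen = set()
    unique = []
    for row in records:
        key = normalize(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


deduped = drop_duplicates(rows)
print(len(rows), "->", len(deduped))
```

Keeping the first occurrence is a design choice for this sketch; an integration job might instead prefer the most recently updated or most complete tuple.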
Data Value Conflict Detection and Resolution
• Data integration involves detection and resolution of data value conflicts.
• For instance, a weight attribute may be stored in metric units in one system and British imperial units in another.
• For a hotel chain, the price of rooms in different cities may involve not only different currencies but also different services (e.g., free breakfast) and taxes.
• When exchanging information between schools, for example, each school may have its own curriculum and grading scheme.
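The weight example can be made concrete with a small conversion step applied before records are combined. The record layout (`id`, `weight`, `unit`) and the sample values below are assumptions for illustration; the pound-to-kilogram factor is the standard definition of the international pound.

```python
LB_TO_KG = 0.45359237  # exact definition of the international pound in kg

# Hypothetical weight records from two systems using different units.
system_metric = [{"id": 1, "weight": 70.0, "unit": "kg"}]
system_imperial = [{"id": 2, "weight": 154.0, "unit": "lb"}]


def to_kilograms(record):
    """Return a copy of the record with weight expressed in kilograms."""
    if record["unit"] == "kg":
        weight_kg = record["weight"]
    elif record["unit"] == "lb":
        weight_kg = record["weight"] * LB_TO_KG
    else:
        raise ValueError(f"unknown unit: {record['unit']}")
    return {"id": record["id"], "weight": round(weight_kg, 2), "unit": "kg"}


integrated = [to_kilograms(r) for r in system_metric + system_imperial]
print(integrated)
```

Raising an error on an unknown unit, rather than passing the value through, keeps silent unit mix-ups out of the integrated store.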
Data Value Conflict Detection and Resolution
• Attributes may also differ on the abstraction level, where an attribute in one system is recorded at, say, a lower abstraction level than the “same” attribute in another.
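One way to reconcile differing abstraction levels is to roll the finer-grained attribute up to the coarser level before comparing or merging. The sketch below assumes a hypothetical case where one system stores daily sales and another stores monthly totals; the data and names are invented for illustration.

```python
from collections import defaultdict

# Hypothetical data: one system stores daily sales, another monthly totals.
daily_sales = [
    ("2024-01-05", 120.0),
    ("2024-01-20", 80.0),
    ("2024-02-03", 50.0),
]
monthly_sales = {"2024-01": 200.0, "2024-02": 50.0}


def roll_up_to_month(daily):
    """Aggregate (ISO date, amount) pairs into totals keyed by 'YYYY-MM'."""
    totals = defaultdict(float)
    for date, amount in daily:
        totals[date[:7]] += amount
    return dict(totals)


rolled_up = roll_up_to_month(daily_sales)
print(rolled_up == monthly_sales)
```

Rolling up loses the daily detail, so it makes sense to keep the original fine-grained data alongside the aggregated view when storage allows.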
