Data preparation and processing
Mahmoud Rafeek Alfarra
http://mfarra.cst.ps
University College of Science & Technology- Khan yonis
Development of computer systems
2016
Chapter 2 – Lecture 2
Outline
 Introduction
 Domain Expert
 Goal identification and Data Understanding
 Data Cleaning
 Missing values
 Noisy Data
 Inconsistent Data
 Data Integration
 Data Transformation
 Data Reduction
 Feature Selection
 Sampling
 Discretization
Introduction
Noisy Data
 Noise is a random error or variance in a measured variable.
 Noisy data is meaningless data.
 Any data that has been received, stored, or changed
in such a manner that it cannot be read or used by the
program that originally created it can be described as
noisy.
Noisy Data
 Sources of noisy data:
1. Data entry problems.
2. Faulty data collection instruments.
3. Data transmission errors.
How to handle noisy data ?
 Binning method
 Clustering
 Combined computer and human inspections
 Regression
How to handle noisy data ?
 Binning method:
1. Sort the data.
2. Partition it into equal-frequency groups (bins).
3. Smooth each group by its mean, by its median, or by
its boundaries.
How to handle noisy data ?
Sorted data for price: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
Partition into (equal-frequency) groups:
-G1: 4, 8, 9, 15
-G2: 21, 21, 24, 25
-G3: 26, 28, 29, 34
Smoothing by bin means (rounded to the nearest integer):
-G1: 9, 9, 9, 9
-G2: 23, 23, 23, 23
-G3: 29, 29, 29, 29
Smoothing by bin boundaries:
-G1: 4, 4, 4, 15
-G2: 21, 21, 25, 25
-G3: 26, 26, 26, 34
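The binning steps above can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the function names are hypothetical, the data length is assumed to be divisible by the number of bins, and a value exactly halfway between two boundaries is assigned to the lower one. Bin means are left unrounded here, so G2 and G3 come out as 22.75 and 29.25.

```python
def equal_frequency_bins(values, n_bins):
    """Sort the values and split them into n_bins groups of equal size."""
    ordered = sorted(values)
    size = len(ordered) // n_bins  # assumes len(values) is divisible by n_bins
    return [ordered[i * size:(i + 1) * size] for i in range(n_bins)]


def smooth_by_means(bins):
    """Replace every value with the (unrounded) mean of its bin."""
    return [[sum(b) / len(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value with the closer of its bin's minimum or maximum."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]  # bins are sorted, so these are min and max
        # Ties go to the lower boundary (an assumption of this sketch).
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equal_frequency_bins(prices, 3)
by_means = smooth_by_means(bins)
by_bounds = smooth_by_boundaries(bins)
```

Smoothing by bin medians would follow the same pattern, with the median in place of the mean.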
How to handle noisy data ?
 Clustering: Outliers may be detected by clustering,
where similar values are organized into groups; values
that fall outside the set of clusters may be considered
outliers.
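One simple way to illustrate this idea for one-dimensional data is to start a new cluster whenever the gap between consecutive sorted values exceeds a threshold, then treat values in very small clusters as outliers. The sketch below makes several assumptions that are not in the slides: the function names, the `max_gap` and `min_size` thresholds, and the extra reading of 95 added to the price data are all hypothetical.

```python
def cluster_by_gap(values, max_gap):
    """Group sorted 1-D values, starting a new cluster at each gap > max_gap."""
    ordered = sorted(values)
    clusters = [[ordered[0]]]
    for v in ordered[1:]:
        if v - clusters[-1][-1] <= max_gap:
            clusters[-1].append(v)   # close enough: same cluster
        else:
            clusters.append([v])     # large gap: open a new cluster
    return clusters


def outliers_from_clusters(clusters, min_size):
    """Values in clusters smaller than min_size are treated as outliers."""
    return [v for c in clusters if len(c) < min_size for v in c]


# The price data from the binning example plus one hypothetical stray reading.
readings = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34, 95]
clusters = cluster_by_gap(readings, max_gap=10)
outliers = outliers_from_clusters(clusters, min_size=2)
```

With these thresholds the twelve prices form one cluster and 95 stands alone, so it is the only value reported as an outlier. Real projects would more likely use a general-purpose clustering algorithm; the gap rule is only the smallest example that shows the principle.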
How to handle noisy data ?
 Combined computer and human inspections: Outliers
may be identified by having the computer detect
suspicious values, which a human then checks.
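As a rough sketch of the computer half of this process, the code below flags values that lie far from the median and places them in a queue for a person to review. The rule used (distance from the median greater than `k` times the median absolute deviation) is an assumption chosen for illustration, as are the function name, the value of `k`, and the reading of 250; the slides do not prescribe a specific test.

```python
from statistics import median


def flag_for_review(values, k=5.0):
    """Return (index, value) pairs whose distance from the median exceeds
    k times the median absolute deviation (MAD)."""
    center = median(values)
    mad = median([abs(v - center) for v in values])
    return [(i, v) for i, v in enumerate(values) if abs(v - center) > k * mad]


# Price data from the binning example plus one hypothetical suspicious entry.
values = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34, 250]
review_queue = flag_for_review(values)

# The program only proposes candidates; a human decides what to do with them.
for index, value in review_queue:
    print(f"row {index}: value {value} looks suspicious, please check")
```

The program does not delete or correct anything itself, which matches the division of labor described above.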
How to handle noisy data ?
 Regression: Data can be smoothed by fitting the
data to a function.
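A minimal sketch of smoothing by regression, assuming the function is a straight line fitted by ordinary least squares. The sample `xs` and `ys` values and the function names are hypothetical; other function families could be fitted in the same spirit.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def smooth(xs, ys):
    """Replace each observed y with the value predicted by the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]


xs = [1, 2, 3, 4, 5]
ys = [2.2, 3.8, 6.1, 8.0, 9.9]   # hypothetical noisy measurements
smoothed = smooth(xs, ys)
```

The smoothed values lie exactly on the fitted line, so the random scatter in `ys` is removed while the overall trend is kept.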
Inconsistent Data
 Data that is inconsistent with our models should
be dealt with.
 Common sense can also be used to detect this kind
of inconsistency:
 The same name occurring differently in an application.
 Different names that appear to be the same (Dennis vs.
Denis).
 Inappropriate values (males being pregnant, or a
negative age).
 A changed coding scheme (was rating “1, 2, 3”, now rating “A, B, C”).
 Differences between duplicate records.
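Some of these common-sense checks can be automated. The sketch below is one possible approach under stated assumptions: the record layout (`name`, `sex`, `age`, `pregnant`), the sample records, the function names, and the 0.85 similarity threshold are all hypothetical. Near-identical spellings are found with `difflib.SequenceMatcher` from the Python standard library.

```python
from difflib import SequenceMatcher


def similar_names(names, threshold=0.85):
    """Return pairs of distinct spellings that are suspiciously alike."""
    unique = sorted(set(names))
    pairs = []
    for i, a in enumerate(unique):
        for b in unique[i + 1:]:
            ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
            if ratio >= threshold:
                pairs.append((a, b))
    return pairs


def rule_violations(record):
    """Apply simple common-sense rules to one record."""
    problems = []
    if record["age"] < 0:
        problems.append("negative age")
    if record["sex"] == "M" and record["pregnant"]:
        problems.append("male recorded as pregnant")
    return problems


records = [
    {"name": "Dennis", "sex": "M", "age": 34, "pregnant": False},
    {"name": "Denis", "sex": "M", "age": -5, "pregnant": False},
    {"name": "Maria", "sex": "M", "age": 29, "pregnant": True},
]
name_pairs = similar_names(r["name"] for r in records)
flagged = {r["name"]: rule_violations(r) for r in records if rule_violations(r)}
```

A similar pair is only a candidate: whether "Dennis" and "Denis" are the same person still has to be decided with domain knowledge.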
Inconsistent Data
 We want to transform all dates to the same format internally.
 Some systems accept dates in many formats,
 e.g. “Sep 24, 2003”, 9/24/03, 24.09.03, etc.
 Dates are transformed internally to a standard value.
 Frequently, just the year (YYYY) is sufficient.
 For more detail, we may need the month, the day, the hour,
etc.
 Representing a date as YYYYMM or YYYYMMDD can be OK.
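The date normalization described above might look like the following sketch, which tries a fixed list of input formats and emits YYYYMMDD. The list of formats and the function name are assumptions made for illustration. The `%b` month abbreviation depends on the locale (an English locale is assumed), and two-digit years are expanded by Python's own pivot rule.

```python
from datetime import datetime

# Input formats this sketch knows about (hypothetical list; extend as needed).
KNOWN_FORMATS = (
    "%b %d, %Y",   # Sep 24, 2003
    "%m/%d/%y",    # 9/24/03
    "%d.%m.%y",    # 24.09.03
)


def to_yyyymmdd(text):
    """Parse a date written in any known format and return it as YYYYMMDD."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).strftime("%Y%m%d")
        except ValueError:
            continue  # not this format, try the next one
    raise ValueError(f"unrecognized date format: {text!r}")


standardized = [to_yyyymmdd(d) for d in ("Sep 24, 2003", "9/24/03", "24.09.03")]
```

All three inputs map to the same standard value, 20030924. Formats are tried in order, so an ambiguous input such as 03/04/05 is read by whichever pattern matches first; a real system needs to know which convention each source uses.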
Data Integration
[Diagram: process stages: Goal identification & Data Understanding; Data Cleaning; Data Integration; Data Transformation; Data Reduction]
Data Integration
 Data integration combines data from multiple sources into a
coherent store.
 Increasingly, data mining projects require data
from more than one data source,
 such as multiple databases, data warehouses, flat
files, and historical data.
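As a small illustration of combining sources into one coherent store, the sketch below merges rows from two hypothetical sources on a shared key. The field names, sample rows, and function name are invented for the example; when both sources supply the same field, the later source overwrites the earlier one, which is a simplifying assumption rather than a recommended conflict policy.

```python
def integrate(primary, secondary, key):
    """Merge rows from two sources into one record per key value."""
    merged = {}
    for source in (primary, secondary):
        for row in source:
            # Create the record on first sight, then add or overwrite fields.
            merged.setdefault(row[key], {}).update(row)
    return [merged[k] for k in sorted(merged)]


# Hypothetical rows from a database table and from a flat file.
db_rows = [
    {"customer_id": 1, "name": "Dennis"},
    {"customer_id": 2, "name": "Maria"},
]
file_rows = [
    {"customer_id": 2, "city": "Cairo"},
    {"customer_id": 3, "city": "Amman"},
]
store = integrate(db_rows, file_rows, key="customer_id")
```

Customers that appear in only one source keep whatever fields that source provided, so the integrated store can contain missing values that later cleaning steps must handle.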
Data Integration
 Data is stored in many systems across the enterprise
and outside the enterprise.
The sources of data fall into two categories:
 Internal sources that are generated through enterprise
activities, such as databases, historical data, Web sites,
and warehouses.
 External sources, such as credit bureaus, phone
companies, and demographic information.
Data Integration
 Data warehouse: a structure that links information
from two or more databases.
 A data warehouse brings data from different data
sources into a central repository.
 It performs some data integration, clean-up, and
summarization, and distributes the information to data
marts.
Data Integration
Next:
Data Cleaning: Noisy Data
