Data preparation and processing
Mahmoud Rafeek Alfarra
http://mfarra.cst.ps
University College of Science & Technology- Khan yonis
Development of computer systems
2016
Chapter 2 – Lecture 1
Outline
• Introduction
• Domain Expertise
• Goal Identification and Data Understanding
• Data Cleaning
  ◦ Missing values
  ◦ Noisy data
  ◦ Inconsistent data
• Data Integration
• Data Transformation
• Data Reduction
  ◦ Feature selection
  ◦ Sampling
  ◦ Discretization
Introduction
• Real-world databases used in data mining may have millions of
records and thousands of variables. Such data is typically noisy
and contains missing and inconsistent values.
• Data quality is a key issue in data mining, so data preparation
is a necessary step for serious, effective, real-world data mining.
• To increase the accuracy of mining, data preprocessing has to be
performed. Otherwise: garbage in => garbage out.
• Data preparation is estimated to take 70-80% of the time and
effort.
Domain Expertise
• Data quality expert: “We found these strange records in your
database after running sophisticated algorithms!”
• Domain expert: “Oh, those apples - we put them in the same
baskets as oranges because there are too few apples to bother.
Not a big deal. We knew that already.”
Domain Expertise
Domain expertise is important for understanding the data and the
problem, and for interpreting the results. Examples of what a
domain expert knows:
• “The counter resets to 0 if the number of calls exceeds N.”
• “The missing values are represented by 0, but the default billed
amount is 0 too.”
Insufficient domain expertise is a primary cause of poor data
quality, which can leave the data unusable.
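The second quote can be made concrete with a small sketch. The billed amounts below are hypothetical values invented for illustration, and `None` stands for a value that was never recorded; the point is only that a stored 0 cannot be told apart from a genuine zero bill without domain knowledge.

```python
from statistics import mean

# Hypothetical billed amounts; None marks a value that was never recorded.
billed_true = [12.5, 0.0, None, 30.0, None, 7.5]

# The same column as stored by a system that writes 0 for "missing".
billed_stored = [0.0 if v is None else v for v in billed_true]

# Without domain knowledge, every stored 0 is treated as a real zero bill.
naive_mean = mean(billed_stored)

# With domain knowledge, the missing entries are excluded before averaging.
informed_mean = mean(v for v in billed_true if v is not None)

print(naive_mean, informed_mean)
```

The naive average is pulled toward zero by the two missing entries, which is the kind of distortion a domain expert would flag.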
Goal Identification
• To obtain the highest benefit from data mining, there must be a
clear statement of the business objectives.
• The first and most important step in any targeting model project
is to establish a clear goal and develop a process to achieve
that goal.
Goal Identification
• Examples of goals for a business:
  ◦ Attract new customers
  ◦ Avoid high-risk customers
  ◦ Understand the characteristics of current customers
  ◦ Make unprofitable customers more profitable
  ◦ Retain profitable customers
  ◦ Win back lost customers
  ◦ Improve customer satisfaction
  ◦ Increase sales
  ◦ Reduce expenses
Data Understanding
• Data understanding starts with an initial data collection and
proceeds with activities to get familiar with the data, to
identify data quality problems, and to discover first insights
into the data.
Data Understanding
Relevance:
• What data is available for the task?
• Is this data relevant?
• Is additional relevant data available?
• How much historical data is available?
• Who is the data expert?
Data Understanding
Quantity:
• Number of instances (records)
  ◦ Rule of thumb: 5,000 or more desired
  ◦ If fewer, results are less reliable
• Number of attributes (fields)
  ◦ Rule of thumb: 10 or more instances for each field
  ◦ If there are more fields, use feature reduction and selection
• Number of targets
  ◦ Rule of thumb: more than 100 instances for each class
  ◦ If very unbalanced, use stratified sampling
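The rules of thumb above can be turned into a quick check. This is a minimal sketch: the function names `check_quantity` and `stratified_sample` are illustrative, not from the lecture, and the stratified sampler simply draws the same fraction from every class so that class proportions are preserved.

```python
import random
from collections import defaultdict

def check_quantity(n_instances, n_attributes, class_counts):
    """Return a list of warnings for the three rules of thumb."""
    warnings = []
    if n_instances < 5000:
        warnings.append("fewer than 5,000 instances: results may be less reliable")
    if n_instances < 10 * n_attributes:
        warnings.append("fewer than 10 instances per attribute: consider feature selection")
    for label, count in class_counts.items():
        if count <= 100:
            warnings.append(f"class {label!r} has only {count} instances")
    return warnings

def stratified_sample(records, label_of, fraction, seed=0):
    """Draw the same fraction from every class (at least one record each)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for rec in records:
        by_class[label_of(rec)].append(rec)
    sample = []
    for group in by_class.values():
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample
```

For example, `check_quantity(1000, 200, {"yes": 100, "no": 900})` reports all three problems, while a data set with 6,000 instances, 20 attributes, and 3,000 instances per class reports none.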
Data Cleaning
Goal Identification & Data Understanding => Data Cleaning => Data Integration => Data Transformation => Data Reduction
Data Cleaning
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     ?               125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        -95K            Yes
6    No      Married         60K             No
7    Yes     ?               220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Columns are attributes; rows are objects. "?" marks an empty cell.
Data Cleaning
• Real-world data tends to be incomplete, noisy and inconsistent.
• Data cleaning steps:
  ◦ Missing values
  ◦ Noisy data
  ◦ Inconsistent data
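As a rough illustration, the example table from the Data Cleaning slide can be scanned for two of these problems. The rows are transcribed from that table with `None` for empty cells and income in thousands; the rule that a negative taxable income counts as noise is an assumption made for this sketch.

```python
COLUMNS = ("Tid", "Refund", "Marital Status", "Taxable Income", "Cheat")

# Rows transcribed from the example table; None marks an empty cell.
# Income is stored in thousands, so "125K" becomes 125.
rows = [
    (1, "Yes", None, 125, "No"),
    (2, "No", "Married", 100, "No"),
    (3, "No", "Single", 70, "No"),
    (4, "Yes", "Married", 120, "No"),
    (5, "No", "Divorced", -95, "Yes"),
    (6, "No", "Married", 60, "No"),
    (7, "Yes", None, 220, "No"),
    (8, "No", "Single", 85, "Yes"),
    (9, "No", "Married", 75, "No"),
    (10, "No", "Single", 90, "Yes"),
]

# Missing values: every (Tid, column name) pair whose cell is empty.
missing = [(row[0], COLUMNS[i])
           for row in rows
           for i, value in enumerate(row)
           if value is None]

# Noisy data (assumed rule): a taxable income below zero is suspicious.
noisy = [row[0] for row in rows if row[3] is not None and row[3] < 0]

print(missing)  # [(1, 'Marital Status'), (7, 'Marital Status')]
print(noisy)    # [5]
```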
Missing values
• A missing value (MV) is an empty cell in the table that
represents a dataset (rows are instances, columns are attributes).
Dealing with missing values
1. Ignore records with missing values:
• This is usually done when the class label is missing.
• This method is not very effective unless the record contains
several attributes with missing values.
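A minimal sketch of this first option, assuming each record is a dictionary and `None` marks a missing value. The helper name `drop_incomplete` and the sample records are hypothetical.

```python
def drop_incomplete(records, required=None):
    """Keep only records with no missing (None) value in the required fields.

    If `required` is None, every field of the record must be present.
    """
    kept = []
    for rec in records:
        fields = required if required is not None else rec.keys()
        if all(rec[f] is not None for f in fields):
            kept.append(rec)
    return kept

# Hypothetical records: "risk" plays the role of the class label.
data = [
    {"income": 50, "risk": "low"},
    {"income": None, "risk": "high"},
    {"income": 40, "risk": None},
]

labelled = drop_incomplete(data, required=["risk"])  # drops only the unlabelled record
complete = drop_incomplete(data)                     # drops every record with a gap
```

Restricting the check to the class label keeps more data than dropping every record with any gap, which matters when the data set is small.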
Dealing with missing values
2. Fill in the missing value manually:
In general, this approach is time-consuming and may not be
feasible given a large data set with many missing values.
3. Use a global constant to fill in the missing value:
Replace all missing values with the same constant, such as
“unknown”. Although this method is simple, it is not
recommended, because results based on “unknown” values are not
interesting.
Dealing with missing values
4. Use the attribute mean to fill missing values:
For example, if the mean of the attribute income is 28000,
use this value to replace the missing income values.
5. Use the attribute mean for all samples belonging to the
same class:
For example, if classifying customers according to credit risk,
replace the missing value with the mean income of the
customers in the same credit risk category as the given
record.
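Options 4 and 5 can be sketched as follows. The customer records are hypothetical and were chosen so that the overall mean income comes out at 28000, matching the figure in the example; the function names are illustrative, and `None` marks a missing value.

```python
from collections import defaultdict
from statistics import mean

def fill_with_mean(records, attr):
    """Option 4: replace missing values with the overall attribute mean."""
    overall = mean(r[attr] for r in records if r[attr] is not None)
    return [dict(r, **{attr: overall}) if r[attr] is None else dict(r)
            for r in records]

def fill_with_class_mean(records, attr, class_attr):
    """Option 5: replace missing values with the mean of the same class.

    Assumes every class has at least one known value for `attr`.
    """
    known = defaultdict(list)
    for r in records:
        if r[attr] is not None:
            known[r[class_attr]].append(r[attr])
    class_means = {c: mean(values) for c, values in known.items()}
    return [dict(r, **{attr: class_means[r[class_attr]]}) if r[attr] is None else dict(r)
            for r in records]

# Hypothetical customers; the known incomes average to 28000.
customers = [
    {"income": 20000, "risk": "high"},
    {"income": 24000, "risk": "high"},
    {"income": 40000, "risk": "low"},
    {"income": None, "risk": "high"},
    {"income": None, "risk": "low"},
]

by_mean = fill_with_mean(customers, "income")
by_class = fill_with_class_mean(customers, "income", "risk")
```

With the overall mean both gaps become 28000; with the class mean the high-risk gap becomes 22000 and the low-risk gap 40000, so the filled values follow the class they belong to.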
Dealing with missing values
6. Use an advanced method:
such as k-nearest neighbors or a decision tree, to predict
the missing value from the other attribute values.
Dealing with missing values
k-nearest neighbors approach
Find the k nearest neighbors of the record and assign a value
derived from them.
Dealing with missing values
k-nearest neighbors approach
• For nominal values, use the most common value among all
neighbors.
• For numerical values, use the average value.
• This requires a proximity measure between instances, such as
Euclidean distance.
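The k-nearest neighbors approach might be sketched like this. It is a simplified illustration under stated assumptions: the function name `knn_impute` and the sample records are hypothetical, the features used for the distance are numeric and complete, and no scaling is applied before computing the Euclidean distance.

```python
import math
from collections import Counter

def knn_impute(records, target, attr, features, k=3):
    """Estimate target[attr] from the k nearest records that have a value for attr."""
    donors = [r for r in records if r[attr] is not None]

    def distance(a, b):
        # Euclidean distance over the chosen numeric features.
        return math.sqrt(sum((a[f] - b[f]) ** 2 for f in features))

    neighbors = sorted(donors, key=lambda r: distance(r, target))[:k]
    values = [r[attr] for r in neighbors]
    if all(isinstance(v, (int, float)) for v in values):
        return sum(values) / len(values)         # numerical: average value
    return Counter(values).most_common(1)[0][0]  # nominal: most common value

# Hypothetical records (income in thousands).
people = [
    {"age": 25, "income": 30, "status": "Single"},
    {"age": 27, "income": 32, "status": "Single"},
    {"age": 45, "income": 80, "status": "Married"},
    {"age": 47, "income": 85, "status": "Married"},
    {"age": 26, "income": 31, "status": "Married"},
]

# Nominal attribute: the three nearest neighbors vote on the status.
status = knn_impute(people, {"age": 26, "income": 31, "status": None},
                    "status", ["age", "income"], k=3)

# Numerical attribute: the two nearest neighbors by age are averaged.
income = knn_impute(people, {"age": 46, "income": None, "status": "Married"},
                    "income", ["age"], k=2)
```

In practice the features would usually be normalized first, since an attribute with a large range would otherwise dominate the distance.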
Next:
Data Cleaning: Noisy Data
