4 Data preparation and processing

Data preparation and processing
Mahmoud Rafeek Alfarra
http://mfarra.cst.ps
University College of Science & Technology- Khan yonis
Development of computer systems
2016
Chapter 2 – Lecture 1

Outline
 Introduction
 Domain Expert
 Goal identification and Data Understanding
 Data Cleaning
 Missing values
 Noisy Data
 Inconsistent Data
 Data Integration
 Data Transformation
 Data Reduction
 Feature Selection
 Sampling
 Discretization

Introduction
 Real-world databases used in data mining may have
millions of records and thousands of variables. They are
often noisy and contain missing and inconsistent values.
 Data quality is a key issue in data mining, so data
preparation is a necessary step for serious, effective,
real-world data mining.

Introduction
To increase the accuracy of mining, data preprocessing
must be performed.
Otherwise: garbage in => garbage out.
Data preparation is estimated to take 70-80% of the
time and effort in a data mining project.

Domain Expertise
 Data quality expert: “We found these strange records
in your database after running sophisticated
algorithms!”
 Domain Experts: “Oh, those apples - we put them
in the same baskets as oranges because there are too
few apples to bother. Not a big deal. We knew that
already.”

Domain Expertise
Domain Expertise is important for understanding the
data, the problem and interpreting the results.
“The counter resets to 0 if the number of calls exceeds N”.
“The missing values are represented by 0, but the default billed
amount is 0 too.”
Insufficient domain expertise is a primary cause of
poor data quality: the data become unusable.

Goal Identification
 To obtain the highest benefit from data mining, there
must be a clear statement of the business objectives.
 The first and most important step in any targeting-
model project is to establish a clear goal and develop a
process to achieve that goal.

Goal Identification
 Examples of goals for a business:
 You want to attract new customers
 You want to avoid high-risk customers
 You want to understand the characteristics of your current customers
 You want to make your unprofitable customers more profitable
 You want to retain your profitable customers
 You want to win back your lost customers
 You want to improve customer satisfaction
 You want to increase sales
 You want to reduce expenses

Data Understanding
 Data understanding starts with an initial data collection
and proceeds with activities to get familiar with the data,
to identify data quality problems, and to discover first
insights into the data.

Data Understanding
Data Understanding: Relevance
 What data is available for the task?
 Is this data relevant?
 Is additional relevant data available?
 How much historical data is available?
 Who is the data expert?

Data Understanding
Data Understanding: Quantity
 Number of instances (records)
 Rule of thumb: 5,000 or more desired
 if less, results are less reliable;
 Number of attributes (fields)
 Rule of thumb: for each field, 10 or more instances
 If more fields, use feature reduction and selection
 Number of targets
 Rule of thumb: >100 for each class
 if very unbalanced, use stratified sampling
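
The rules of thumb above can be checked mechanically before any modelling. The following is a minimal sketch in plain Python, assuming the data set is a list of dictionaries with a hypothetical target field named `cheat`. The function names `check_quantity` and `stratified_sample` are illustrative, not from any library, and the sampler shows only one common variant of stratified sampling (an equal number of records drawn from each class).

```python
import random
from collections import Counter, defaultdict

def check_quantity(records, target, min_records=5000, per_field=10, per_class=100):
    """Return warnings for the three rules of thumb (thresholds from the slide)."""
    warnings = []
    n_records = len(records)
    n_fields = len(records[0]) - 1 if records else 0  # do not count the target
    if n_records < min_records:
        warnings.append(f"only {n_records} records; {min_records} or more desired")
    if n_records < per_field * n_fields:
        warnings.append(f"fewer than {per_field} records per field; consider feature selection")
    for label, count in Counter(r[target] for r in records).items():
        if count <= per_class:
            warnings.append(f"class {label!r} has only {count} records; more than {per_class} desired")
    return warnings

def stratified_sample(records, target, n_per_class, seed=0):
    """Draw up to n_per_class records from each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for record in records:
        by_class[record[target]].append(record)
    sample = []
    for group in by_class.values():
        sample.extend(rng.sample(group, min(n_per_class, len(group))))
    return sample

# Hypothetical unbalanced data: 100 "Yes" records and 900 "No" records.
records = [{"income": i, "cheat": "Yes" if i % 10 == 0 else "No"} for i in range(1000)]
for warning in check_quantity(records, "cheat"):
    print(warning)
balanced = stratified_sample(records, "cheat", n_per_class=50)
print(Counter(r["cheat"] for r in balanced))
```

Sampling each class separately keeps the rare class from being drowned out, which a simple random sample of the whole table would not guarantee.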

Data Cleaning
[Figure: data preparation steps: Goal identification & Data Understanding -> Data Cleaning -> Data Integration -> Data Transformation -> Data Reduction]

Data Cleaning
Example data set (rows are objects, columns are attributes):

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes                     125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        -95k            Yes
6    No      Married         60K             No
7    Yes                     220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Data Cleaning
 Real-world data tends to be incomplete, noisy and
inconsistent.
 Data Cleaning Steps
 Missing values
 Noisy Data
 Inconsistent Data

Missing values
 A missing value (Mv) is an empty cell in the table
that represents a dataset.
[Figure: a data table with instances as rows and attributes as columns; a missing value is marked with "?"]
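
Before choosing a way to handle missing values, it helps to know where they are. The following is a minimal sketch that counts empty cells per attribute, using four rows from the example table earlier in the lecture. Storing an empty cell as `None` is an assumption, and `count_missing` is an illustrative name, not a library function.

```python
# Four rows from the lecture's example table; None marks an empty cell (assumed encoding).
rows = [
    {"Tid": 1, "Refund": "Yes", "Marital Status": None, "Taxable Income": "125K", "Cheat": "No"},
    {"Tid": 2, "Refund": "No", "Marital Status": "Married", "Taxable Income": "100K", "Cheat": "No"},
    {"Tid": 7, "Refund": "Yes", "Marital Status": None, "Taxable Income": "220K", "Cheat": "No"},
    {"Tid": 8, "Refund": "No", "Marital Status": "Single", "Taxable Income": "85K", "Cheat": "Yes"},
]

def count_missing(records):
    """Return {attribute: number of records in which that attribute is None}."""
    counts = {}
    for record in records:
        for attribute, value in record.items():
            counts[attribute] = counts.get(attribute, 0) + (value is None)
    return counts

missing = count_missing(rows)
print(missing)
```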

Dealing with missing values
1. Ignore records with missing values:
 This is usually done when the class label is missing.
 This method is not effective, unless the record contains
several attributes with missing values.

2. Fill in the missing value manually:
In general, this approach is time-consuming and may not be
feasible for a large data set with many missing values.
3. Use a global constant to fill in the missing value:
Replace all missing values with the same constant, such as
“unknown”. Although this method is simple, it is not
recommended because results involving “unknown” values are not
“interesting”.

4. Use the attribute mean to fill in missing values:
For example, if the mean of the income attribute is 28000,
use this value to replace the missing income values.
5. Use the attribute mean for all samples belonging to the
same class:
For example, if classifying customers according to credit risk,
replace the missing value with the mean income value for
customers in the same credit risk category as that of the given
record.

6. Use an advanced method:
Use a technique such as k-nearest neighbors or a decision
tree to predict the missing value from the other attribute values.
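
The simpler strategies in this list (ignore, constant, mean, class mean) can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not a reference implementation: missing values are stored as `None`, the `income` and `risk` fields and the customer data are hypothetical, and all function names are illustrative.

```python
from collections import defaultdict
from statistics import mean

def drop_missing(records, field):
    """Method 1: ignore records whose `field` is missing."""
    return [r for r in records if r[field] is not None]

def fill_constant(records, field, constant="unknown"):
    """Method 3: replace every missing value with one global constant."""
    return [{**r, field: constant if r[field] is None else r[field]} for r in records]

def fill_mean(records, field):
    """Method 4: replace missing values with the mean of the observed values."""
    overall = mean(r[field] for r in records if r[field] is not None)
    return [{**r, field: overall if r[field] is None else r[field]} for r in records]

def fill_class_mean(records, field, class_field):
    """Method 5: replace missing values with the mean of the record's own class.

    Assumes every class has at least one observed value for `field`.
    """
    by_class = defaultdict(list)
    for r in records:
        if r[field] is not None:
            by_class[r[class_field]].append(r[field])
    class_means = {label: mean(values) for label, values in by_class.items()}
    return [
        {**r, field: class_means[r[class_field]] if r[field] is None else r[field]}
        for r in records
    ]

# Hypothetical customers; the observed incomes average 28000 overall.
customers = [
    {"income": 20000, "risk": "high"},
    {"income": 24000, "risk": "high"},
    {"income": None, "risk": "high"},
    {"income": 30000, "risk": "low"},
    {"income": 38000, "risk": "low"},
    {"income": None, "risk": "low"},
]
print(fill_mean(customers, "income")[2])
print(fill_class_mean(customers, "income", "risk")[2])
```

With this data, the overall mean gives both incomplete records the same income, while the class mean gives the high-risk and low-risk customers different values. That difference is the motivation for method 5.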

k nearest neighbors Approach
Compute the k nearest neighbors and assign a value
from them.

k nearest neighbors Approach
 For nominal values, use the most common value
among all neighbors.
 For numerical values, use the average value.
 A proximity measure between instances must be defined,
such as Euclidean distance.
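
The k-nearest-neighbors approach can be sketched as follows. This is a minimal illustration, assuming that records are dictionaries, that missing values are stored as `None`, and that the fields used for the distance are numeric and present in every record. The data, the field names and `knn_impute` are all hypothetical.

```python
import math
from collections import Counter

def euclidean(a, b, fields):
    """Euclidean distance between two records over the given numeric fields."""
    return math.sqrt(sum((a[f] - b[f]) ** 2 for f in fields))

def knn_impute(record, others, field, k, distance_fields):
    """Return a value for record[field] derived from its k nearest neighbors.

    Numeric values are averaged; nominal values take the most common value.
    """
    donors = [o for o in others if o[field] is not None]
    if not donors:
        raise ValueError("no neighbor has an observed value for " + field)
    nearest = sorted(donors, key=lambda o: euclidean(record, o, distance_fields))[:k]
    values = [o[field] for o in nearest]
    if all(isinstance(v, (int, float)) for v in values):
        return sum(values) / len(values)
    return Counter(values).most_common(1)[0][0]

# Hypothetical complete records and one record with two missing values.
people = [
    {"age": 25, "income": 30, "status": "Single"},
    {"age": 27, "income": 32, "status": "Single"},
    {"age": 45, "income": 80, "status": "Married"},
    {"age": 47, "income": 85, "status": "Married"},
    {"age": 50, "income": 90, "status": "Married"},
]
query = {"age": 26, "income": None, "status": None}

income = knn_impute(query, people, "income", k=2, distance_fields=["age"])
status = knn_impute(query, people, "status", k=3, distance_fields=["age"])
print(income, status)
```

When the distance uses several fields on different scales, the fields are usually normalized first so that no single attribute dominates the distance.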

Next:
Data Cleaning: Noisy Data
