Data Preprocessing
Likit Preeyanon, Ph.D.
Center of Data Mining and Biomedical Informatics
Faculty of Medical Technology
Mahidol University
Real World Data
• Noisy, incomplete, inconsistent
• Unstructured (Web pages, Tweets, Emails, etc.)
• Big
“Garbage in, garbage out”
Your analysis is only as good as your data.
Definition
• Data preprocessing is a process that improves the
quality of data and makes the data more suitable for
analysis
• Data preprocessing is quality control for data
Data Analysis Workflow
• Data Preprocessing 80%
• Data Analysis 5%
• Report/Presentation 5%
What To Do
• Make data suitable for the analysis (reformat)
• Remove irrelevant data, discrepancies, errors or outliers (data
cleaning)
• Remove, replace or impute missing data
• Reduce data size (data reduction)
• Normalize, transform, convert or discretize data
• Integrate data
• Anonymize data
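As an illustration of the "normalize, transform" item, here is a minimal sketch of two common rescaling steps, using the income values from the example table later in this deck. The function names `min_max_scale` and `log10_transform` are hypothetical helpers, and which rescaling (if any) is appropriate depends on the analysis.

```python
import math

# Income values copied from the example table later in the deck.
incomes = [100000, 13400, 20000, 430000, 23000, 5400]


def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def log10_transform(values):
    """Compress a right-skewed, strictly positive variable with a base-10 log."""
    return [math.log10(v) for v in values]


scaled = min_max_scale(incomes)
logged = log10_transform(incomes)
```

Min-max scaling is sensitive to extreme values (here 430000 pushes most other incomes close to 0), which is one reason a log transform is often tried first on income-like data.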
Data Preprocessing is Hard
• Data come in many types (text, sequencing, image, signal,
geographical, time series, etc.)
• Data are massive
• Data are from different sources
• Some data are of low quality
• Some data are not well documented (no metadata)
• Different tools require data in different formats
• Some judgment calls have to be made (e.g. how to handle missing values or outliers)
Examples
stuid  score1  score2  age  gender
1      3       76      16   M
2      33      56      15   F
3      45      88      75   M
4      90      32      -    -
5      45      37      17   M
5      45      37      17   M
6      55      -       17   female
7      45      99      16   male
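The problems in this table (a duplicated row, missing values, an implausible age, inconsistent gender codes) can be flagged automatically before anything is changed. The following is a minimal sketch; the plausible age range and the accepted gender codes are assumptions made for illustration, not rules stated in the slides, and `find_issues` is a hypothetical helper.

```python
from collections import Counter

# Rows copied from the slide; "-" marks a missing value.
COLUMNS = ["stuid", "score1", "score2", "age", "gender"]
RAW_ROWS = [
    ["1", "3", "76", "16", "M"],
    ["2", "33", "56", "15", "F"],
    ["3", "45", "88", "75", "M"],
    ["4", "90", "32", "-", "-"],
    ["5", "45", "37", "17", "M"],
    ["5", "45", "37", "17", "M"],
    ["6", "55", "-", "17", "female"],
    ["7", "45", "99", "16", "male"],
]


def find_issues(rows, age_range=(10, 25), gender_codes=("M", "F")):
    """Return (stuid, description) pairs for rows that need attention.

    age_range and gender_codes are illustrative assumptions.
    """
    issues = []

    # Exact duplicate rows are reported once per distinct row.
    counts = Counter(tuple(row) for row in rows)
    for row_tuple, n in counts.items():
        if n > 1:
            issues.append((row_tuple[0], "duplicate row"))

    for row in rows:
        rec = dict(zip(COLUMNS, row))
        for col, val in rec.items():
            if val == "-":
                issues.append((rec["stuid"], f"missing {col}"))
        age = rec["age"]
        if age != "-" and not age_range[0] <= int(age) <= age_range[1]:
            issues.append((rec["stuid"], f"implausible age {age}"))
        gender = rec["gender"]
        if gender != "-" and gender not in gender_codes:
            issues.append((rec["stuid"], f"non-standard gender code {gender}"))
    return issues


issues = find_issues(RAW_ROWS)
```

Flagging first and fixing later keeps the raw data intact and leaves a list of decisions that can be reviewed and documented.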
Examples
stuid  math  biology  age  gender
2      33    56       15   F
3      45    88       16   M
4      90    32       16   -
5      45    37       17   M
6      55    -        17   F
7      45    99       16   M
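Some of the differences between the raw table and this one are mechanical and can be sketched in code: dropping the exact duplicate, renaming the score columns, turning "-" into an explicit missing value, and harmonizing the gender codes. The slides do not say how the other changes (dropping student 1, replacing age 75, filling the missing age) were decided, so this sketch deliberately leaves them out. The mapping tables and the `clean` helper below are assumptions for illustration.

```python
COLUMNS = ["stuid", "score1", "score2", "age", "gender"]
RAW_ROWS = [
    ["1", "3", "76", "16", "M"],
    ["2", "33", "56", "15", "F"],
    ["3", "45", "88", "75", "M"],
    ["4", "90", "32", "-", "-"],
    ["5", "45", "37", "17", "M"],
    ["5", "45", "37", "17", "M"],
    ["6", "55", "-", "17", "female"],
    ["7", "45", "99", "16", "male"],
]

# Assumed mappings, chosen to match the column names and codes on the slide.
RENAME = {"score1": "math", "score2": "biology"}
GENDER_MAP = {"m": "M", "male": "M", "f": "F", "female": "F"}
NUMERIC = {"stuid", "score1", "score2", "age"}


def clean(rows):
    """Deduplicate rows, rename columns, and standardize values."""
    seen = set()
    cleaned = []
    for row in rows:
        key = tuple(row)
        if key in seen:
            continue  # skip exact duplicates, keeping the first occurrence
        seen.add(key)
        record = {}
        for col, val in zip(COLUMNS, row):
            name = RENAME.get(col, col)
            if val == "-":
                record[name] = None  # explicit missing value
            elif col in NUMERIC:
                record[name] = int(val)
            else:
                # Only the gender column reaches this branch.
                record[name] = GENDER_MAP.get(val.lower(), val)
        cleaned.append(record)
    return cleaned


cleaned = clean(RAW_ROWS)
```

The raw rows are never modified, so both the raw and the preprocessed versions remain available.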
Examples
stuid  math  biology  age  gender
2      F     P        15   F
3      F     P        16   M
4      P     F        16   -
5      F     F        17   M
6      P     F        17   F
7      F     P        16   M
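This table is the previous one with the scores discretized into pass (P) and fail (F). The slides do not state the cut-off; a pass mark of 50 is an assumption that happens to reproduce the grades shown for every non-missing score. One difference is intentional: the slide shows F for student 6's missing biology score, whereas this sketch keeps a missing score missing, which is one of the decisions that should be documented. `to_grade` is a hypothetical helper.

```python
PASS_MARK = 50  # assumed cut-off; not stated in the slides

# Scores copied from the previous slide; None marks a missing score.
scores = {
    2: {"math": 33, "biology": 56},
    3: {"math": 45, "biology": 88},
    4: {"math": 90, "biology": 32},
    5: {"math": 45, "biology": 37},
    6: {"math": 55, "biology": None},
    7: {"math": 45, "biology": 99},
}


def to_grade(score, pass_mark=PASS_MARK):
    """Map a numeric score to "P" or "F"; keep missing scores as None."""
    if score is None:
        return None
    return "P" if score >= pass_mark else "F"


grades = {
    stuid: {subject: to_grade(value) for subject, value in subjects.items()}
    for stuid, subjects in scores.items()
}
```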
Examples
stuid  math  biology  age  gender
2      F     P        15   F
3      F     P        16   M
4      P     F        16   -
5      F     F        17   M
6      P     F        17   F
7      F     P        16   M

stuid  city     state       country
2      London   Kentucky    US
3      Seattle  WA          US
4      L.A.     California  US
5      Denver   Colorado    US
6      Beijing  -           China
7      -        -           India

stuid  income
2      100000
3      13400
4      20000
5      430000
6      23000
7      5400
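Integrating these tables means joining them on the shared `stuid` key and harmonizing values that are coded differently (for example "WA" next to fully spelled state names). A minimal sketch for the location and income tables follows; the `integrate` helper and the abbreviation map are assumptions for illustration, and a real project would use a complete lookup table.

```python
# Tables copied from the slide, keyed by stuid; None marks a missing value.
locations = {
    2: {"city": "London", "state": "Kentucky", "country": "US"},
    3: {"city": "Seattle", "state": "WA", "country": "US"},
    4: {"city": "L.A.", "state": "California", "country": "US"},
    5: {"city": "Denver", "state": "Colorado", "country": "US"},
    6: {"city": "Beijing", "state": None, "country": "China"},
    7: {"city": None, "state": None, "country": "India"},
}
incomes = {
    2: {"income": 100000},
    3: {"income": 13400},
    4: {"income": 20000},
    5: {"income": 430000},
    6: {"income": 23000},
    7: {"income": 5400},
}

# Illustrative harmonization maps covering only the values in this example.
STATE_NAMES = {"WA": "Washington"}
CITY_NAMES = {"L.A.": "Los Angeles"}


def integrate(*tables):
    """Merge tables keyed by stuid into one record per student."""
    merged = {}
    for table in tables:
        for stuid, fields in table.items():
            merged.setdefault(stuid, {}).update(fields)
    return merged


def harmonize(record):
    """Return a copy of the record with abbreviations spelled out."""
    fixed = dict(record)
    fixed["state"] = STATE_NAMES.get(fixed.get("state"), fixed.get("state"))
    fixed["city"] = CITY_NAMES.get(fixed.get("city"), fixed.get("city"))
    return fixed


merged = {stuid: harmonize(rec) for stuid, rec in integrate(locations, incomes).items()}
```

Because the merge is keyed on `stuid`, a student missing from one table simply lacks those fields in the merged record, which makes gaps easy to spot.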
Take home message
• Data preprocessing is hard and not taught in class
• Use Google and web forums to learn more about available
tools, guidelines and tricks for preprocessing data
• Trial and error is necessary (it’s research!)
• You should consult collaborators or experts, if needed
• Data preprocessing must also be documented! Both raw
and preprocessed data must be available.
• A well-designed data collection method helps
Resources
• stackoverflow.com
• rseek.org
• google.com
• Lots of R books

