BAS 250
Lesson 2: Data Preparation
 Explain concepts and purpose of Data Preparation
 Understand solutions for handling missing and
inconsistent data
 Utilize data and attribute reduction techniques
 Effectively work in RapidMiner to prepare your data.
This Week’s Learning Objectives
The Data Mining Process: CRISP-DM
o Join data sets that are needed for your analysis
o Reduce data sets to only include pertinent
variables
o Scrub data to remove anomalies such as outliers or
missing data
o Reformat for consistency and effective use
3. Data Preparation
 Ensure robustness of data
o Combine 2 or more data sets to create a “mini-database”
with all variables needed for analysis in one place.
o Merge by a unique identifier common to both data sets
 “Key Identifier”, “Common ID”, “ID Number”, etc.
 Example: Social Security Number (links Medical and Insurance)
Data Preparation
Data Preparation
Example: Sources of Data
Customer Purchases - “Point of Sale data” – CSV file format
Cost of Products Sold – “Accounting department” – Excel file format
Inventory of Products - “ IT Data Warehouse” - XML file format
Merge By Product ID or SKU
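The merge described above can be sketched in plain Python. This is only an illustration, not the RapidMiner workflow used in the course: the sample rows, the field names (`product_id`, `units_sold`, `unit_cost`, `in_stock`), and the helper `merge_by_product_id` are all hypothetical, and the three sources are assumed to be already loaded into memory.

```python
# Hypothetical sample rows standing in for the three sources on the slide.
purchases = [  # point-of-sale data (CSV in the example)
    {"product_id": "A100", "units_sold": 3},
    {"product_id": "B200", "units_sold": 1},
    {"product_id": "A100", "units_sold": 2},
]
costs = {"A100": 4.50, "B200": 12.00}  # accounting data (Excel in the example)
inventory = {"A100": 40, "B200": 0}    # data warehouse (XML in the example)


def merge_by_product_id(purchases, costs, inventory):
    """Attach cost and stock level to each purchase row via product_id."""
    merged = []
    for row in purchases:
        key = row["product_id"]
        merged.append({
            **row,
            # .get() leaves None when a source has no match for the key
            "unit_cost": costs.get(key),
            "in_stock": inventory.get(key),
        })
    return merged


merged = merge_by_product_id(purchases, costs, inventory)
for row in merged:
    print(row)
```

Because every source shares the same product ID, each purchase row ends up with all the variables needed for analysis in one place.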
 Data Reduction has two parts
o Observations (rows, records, instances, etc.)
o Attributes (variables, columns, fields, etc.)
Data Preparation
 Attribute reduction filters out irrelevant or
uninteresting attributes without completely removing them
from the original data set.
 Even if a variable isn’t interesting for answering some
questions, it may still be useful in others.
It is recommended to import all attributes first, then filter as necessary
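As a rough sketch of "import everything, then filter", the Python snippet below keeps the full data set intact and builds a reduced copy for one analysis. The sample rows, attribute names, and the helper `select_attributes` are assumptions made for illustration only.

```python
# Hypothetical data set with every attribute imported.
full_data = [
    {"customer_id": 1, "age": 34, "zip": "27603", "favorite_color": "blue"},
    {"customer_id": 2, "age": 51, "zip": "27511", "favorite_color": "red"},
]


def select_attributes(rows, keep):
    """Return new rows holding only the attributes in `keep`.

    The input rows are not modified, so filtered-out attributes
    remain available for other questions later.
    """
    return [{name: row[name] for name in keep} for row in rows]


reduced = select_attributes(full_data, ["customer_id", "age"])
print(reduced)
print(full_data[0].keys())  # the original still has all four attributes
```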
Data Preparation
 Observation Reduction…
 Observation reduction lowers the number of observations to create a
smaller data set.
 Some reasons to do so:
o Create a sample set for:
 Training data, proof of concept analysis, testing theories, sharing data
o Improve analysis speed or process time
o Data scrubbing for outliers, missing values, etc.
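One common way to create a sample set is simple random sampling. The sketch below assumes a hypothetical helper `sample_observations` and made-up rows; the fixed seed is an assumption added so the same sample can be reproduced when sharing or testing.

```python
import random


def sample_observations(rows, fraction, seed=42):
    """Return a random subset of rows; the seed makes the sample repeatable."""
    rng = random.Random(seed)
    k = max(1, round(len(rows) * fraction))  # always keep at least one row
    return rng.sample(rows, k)


# Hypothetical data set of 100 observations.
observations = [{"id": i, "value": i * 10} for i in range(100)]
sample = sample_observations(observations, 0.10)
print(len(sample))
```

A smaller sample like this is usually enough for a proof of concept, and it runs faster than the full data set.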
Data Preparation
 Ensure consistency of data
o Missing information
o Spelling errors, typos
o Multiple responses for an attribute
o Characters in numeric fields and vice-versa
Data Preparation
 Ensure consistency of data
Data Preparation
KEY: Missing data is data that does not exist in a data set
• Not the same as zero or some other value
• In a dataset, it is blank and the value is unknown
• Sometimes referred to as null values
• Depending on your objective and the circumstance, you may
choose to leave missing data as they are or replace with some
other value
 Ensure consistency of data
Data Preparation
KEY: Inconsistent data is different from missing data
• Occurs when a value does exist but its value is not valid
or meaningful.
• Common examples: a “.” or a “zero” recorded in place of a real value
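The difference between missing and inconsistent values can be made concrete with a small check. This is a sketch under stated assumptions: the attribute is taken to be an age stored as text, valid ages are assumed to be greater than 0 and at most 120, and the function name `classify_value` is hypothetical.

```python
def classify_value(raw, low=0, high=120):
    """Label a raw 'age' entry as 'missing', 'inconsistent', or 'ok'."""
    if raw is None or raw.strip() == "":
        return "missing"        # blank: the value is unknown
    try:
        age = float(raw)
    except ValueError:
        return "inconsistent"   # e.g. "." or text in a numeric field
    if not (low < age <= high):
        return "inconsistent"   # e.g. a zero placeholder or an impossible age
    return "ok"


raw_ages = ["34", "", ".", "0", "abc", "51", None]
labels = [classify_value(value) for value in raw_ages]
print(labels)
```

Whether a zero counts as inconsistent depends on the attribute; for an age it is assumed invalid here, but for a count of purchases it could be a perfectly good value.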
 Ensure consistency of data
Data Preparation
Replace or remove missing or inconsistent data
• For numeric data…
• Can be replaced using Measures of Central Tendency
• Mean, Median, and Mode
• Mean - Average value
• Median - Middle value
• Mode - Most frequent or common value
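The three measures can be compared on a tiny example using Python's standard `statistics` module. The helper `fill_missing` and the sample ages are hypothetical; missing values are represented as `None`.

```python
from statistics import mean, median, mode


def fill_missing(values, strategy="median"):
    """Replace None entries using a measure of central tendency."""
    known = [v for v in values if v is not None]
    measure = {"mean": mean, "median": median, "mode": mode}[strategy]
    replacement = measure(known)
    return [replacement if v is None else v for v in values]


ages = [20, 30, 30, None, 100]
print(fill_missing(ages, "mean"))    # fills with 45, the average of 20, 30, 30, 100
print(fill_missing(ages, "median"))  # fills with 30, the middle of the known values
print(fill_missing(ages, "mode"))    # fills with 30, the most frequent known value
```

Note how the single large value (100) pulls the mean up to 45 while the median and mode stay at 30, which is one reason the choice of measure should be deliberate.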
 Ensure consistency of data
Data Preparation
Replace or remove missing or inconsistent data
• For character data…
• Can be replaced using Best Estimated Value
• “Like Others”
• Ex. All males in data like bass fishing. If attribute “Fish Type” is blank and
attribute “Gender” equals male, then “Bass”
• “Clustering Techniques”
• “Best Guess”
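The "Like Others" idea from the bass fishing example can be sketched as filling a blank with the most common value among rows that share another attribute. The helper `fill_like_others`, the field names, and the survey rows are hypothetical, and blanks are assumed to be empty strings.

```python
from collections import Counter


def fill_like_others(rows, target, group_by):
    """Fill blank `target` values with the most common value among
    rows that share the same `group_by` value."""
    counts = {}
    for row in rows:
        if row[target]:
            counts.setdefault(row[group_by], Counter())[row[target]] += 1
    filled = []
    for row in rows:
        new_row = dict(row)  # copy so the original data stays unchanged
        if not new_row[target] and new_row[group_by] in counts:
            new_row[target] = counts[new_row[group_by]].most_common(1)[0][0]
        filled.append(new_row)
    return filled


survey = [
    {"gender": "male", "fish_type": "Bass"},
    {"gender": "male", "fish_type": "Bass"},
    {"gender": "female", "fish_type": "Trout"},
    {"gender": "male", "fish_type": ""},
]
filled = fill_like_others(survey, "fish_type", "gender")
print(filled[3])
```

If a group has no known values at all, the blank is left as it is rather than guessed.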
 Ensure consistency of data
Data Preparation
• Replacing missing or inconsistent values found in
data should be done:
• With intention, not haphazardly
• Use common sense
• Be transparent
It is recommended to always document how you handle
missing or inconsistent data.
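One way to stay transparent is to record every replacement as it is made. The sketch below is an assumption about how such a record could look, not a RapidMiner feature: the helper `replace_and_log`, the `income` attribute, and the replacement value are all made up for illustration.

```python
def replace_and_log(rows, column, is_bad, replacement, reason):
    """Replace flagged values and return (cleaned rows, change log)."""
    cleaned, log = [], []
    for index, row in enumerate(rows):
        new_row = dict(row)  # copy so the original data stays unchanged
        if is_bad(new_row[column]):
            log.append({
                "row": index,
                "column": column,
                "old": new_row[column],
                "new": replacement,
                "reason": reason,
            })
            new_row[column] = replacement
        cleaned.append(new_row)
    return cleaned, log


records = [{"income": 52000}, {"income": None}, {"income": 61000}]
cleaned, change_log = replace_and_log(
    records,
    "income",
    lambda value: value is None,
    56500,  # mean of the two known incomes
    "missing income replaced with mean of known values",
)
for entry in change_log:
    print(entry)
```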
 This course is a practical application course in Data Mining. Learning to
use RapidMiner is required.
 If you have not done so yet, please plan to walk through the tutorial
examples in RapidMiner.
 To assist you in understanding RapidMiner, I will take screenshots of what
I am doing to get the results we are looking for.
 RapidMiner is pretty intuitive. You will get it quickly.
Basics of RapidMiner
 Types of files that can be imported into RapidMiner:
o CSV File
o Excel File
o XML File
o Access Database Table
o … and much more
 We mainly use CSV files, which contain Comma Separated Values. Be mindful if
the values in your dataset contain commas
o Alternative delimiters can be selected in this case:
 Tab
 Semicolon
 Pipe ( | ), etc.
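The delimiter issue is easy to see outside RapidMiner as well. The sketch below uses Python's standard `csv` module on two hypothetical in-memory files: one is comma-delimited with the comma-containing value wrapped in quotes, the other uses a pipe as the delimiter so the comma needs no special treatment.

```python
import csv
import io

# Hypothetical file contents; a name such as "Smith, Jane" contains a comma.
comma_text = 'name,city\n"Smith, Jane",Raleigh\n'
pipe_text = "name|city\nSmith, Jane|Raleigh\n"

# Comma-delimited: the quotes keep "Smith, Jane" together as one value.
comma_rows = list(csv.reader(io.StringIO(comma_text)))

# Pipe-delimited: the comma is ordinary text, so no quoting is needed.
pipe_rows = list(csv.reader(io.StringIO(pipe_text), delimiter="|"))

print(comma_rows)
print(pipe_rows)
```

Both versions produce the same two columns. Without the quotes or an alternative delimiter, the name would be split into two separate values.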
Basics of RapidMiner
 Three main areas that contain useful tools in
RapidMiner:
o Operators – Every possible task you can think of
o Repositories – Where you store your data
o Parameters – Task set up details
Basics of RapidMiner
 Explain concepts and purpose of Data Preparation
 Understand solutions for handling missing and inconsistent
data
 Utilize data and attribute reduction techniques
 Effectively work in RapidMiner to prepare your data.
Summary
“This workforce solution was funded by a grant awarded by the U.S. Department of Labor’s
Employment and Training Administration. The solution was created by the grantee and does not
necessarily reflect the official position of the U.S. Department of Labor. The Department of Labor
makes no guarantees, warranties, or assurances of any kind, express or implied, with respect to such
information, including any information on linked sites and including, but not limited to, accuracy of the
information or its completeness, timeliness, usefulness, adequacy, continued availability, or ownership.”
Except where otherwise stated, this work by Wake Technical Community College Building Capacity in
Business Analytics, a Department of Labor, TAACCCT funded project, is licensed under the Creative
Commons Attribution 4.0 International License. To view a copy of this license, visit
http://creativecommons.org/licenses/by/4.0/
Copyright Information
