Machine Learning
Prepared by
Sana Iftikhar
BSCS 7th
Lec2(Types of ML) & Preprocessing of data.pptx
Data Preprocessing
Types of Data
• Numerical data
a) Discrete - Date, No. of students in a class
b) Continuous – Cost of House (in decimal)
• Categorical data
a) Nominal – Gender
b) Ordinal – Grades of students (splitting them into ordered groups)
c) Dichotomous – Cancerous, Non-cancerous
• Time series data
Sequence of numbers collected at regular intervals over some period
of time
• Text
Preprocessing of Data
• Good Data Preparation is key to producing
valid and reliable models
Where can I get datasets?
• https://www.superdatascience.com/machine-learning
• https://www.kaggle.com/dataset
• http://registry.opendata.aws/
• https://archive.ics.uci.edu/
Data Quality: Why Preprocess the Data?
• Measures for data quality: A multidimensional view
– Accuracy: correct or wrong, accurate or not
– Completeness: not recorded, unavailable, …
– Consistency: some modified but some not, dangling, …
– Timeliness: timely update?
– Believability: how far can the data be trusted to be correct?
– Interpretability: how easily the data can be
understood?
Data Cleaning
• Data in the real world is dirty: lots of potentially incorrect data, e.g., from faulty instruments, human or computer
error, or transmission error
– incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate
data
• e.g., Occupation=“ ” (missing data)
– noisy: containing noise, errors, or outliers
• e.g., Salary=“−10” (an error)
– inconsistent: containing discrepancies in codes or names, e.g.,
• Age=“42”, Birthday=“03/07/2010”
• Was rating “1, 2, 3”, now rating “A, B, C”
• discrepancy between duplicate records
– Intentional (e.g., disguised missing data)
• Jan. 1 as everyone’s birthday?
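The three kinds of dirty data named above (incomplete, noisy, inconsistent) can be checked with simple rules. The following is only a minimal sketch: the records, the field names, the helper `find_problems`, and the assumed reference year are all hypothetical and chosen to mirror the slide's examples.

```python
# Hypothetical records illustrating the kinds of dirty data named above.
records = [
    {"occupation": "engineer", "salary": 52000, "age": 42, "birth_year": 1983},
    {"occupation": "", "salary": 48000, "age": 30, "birth_year": 1995},
    {"occupation": "teacher", "salary": -10, "age": 35, "birth_year": 1990},
    {"occupation": "nurse", "salary": 39000, "age": 42, "birth_year": 2010},
]


def find_problems(record, reference_year):
    """Return a list of data-quality problems found in one record."""
    problems = []
    # Incomplete: an empty or whitespace-only value counts as missing.
    if not record["occupation"].strip():
        problems.append("incomplete: occupation missing")
    # Noisy: a salary cannot be negative.
    if record["salary"] < 0:
        problems.append("noisy: negative salary")
    # Inconsistent: age should match the birth year to within one year.
    expected_age = reference_year - record["birth_year"]
    if abs(expected_age - record["age"]) > 1:
        problems.append("inconsistent: age does not match birth year")
    return problems


REFERENCE_YEAR = 2025  # assumed year in which the data was collected
report = {i: find_problems(r, REFERENCE_YEAR) for i, r in enumerate(records)}
```

Rules like these only flag suspicious records; deciding whether to correct, impute, or drop them is a separate step.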
Incomplete (Missing) Data
• Data is not always available
– E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
• Missing data may be due to
– equipment malfunction
– values that were inconsistent with other recorded data and thus deleted
– data not entered due to misunderstanding
– certain data not being considered important at the
time of entry
– history or changes of the data not being recorded
• Missing data may need to be inferred
Missing values
How to Handle Missing Data?
• Ignore the record (drop the row): usually done when the class label is missing
(when doing classification); not effective when the % of
missing values per attribute varies considerably
• Fill in the missing value manually: tedious and often infeasible
• Fill it in automatically with
– a global constant: e.g., “unknown” (which effectively becomes a new class)
– the attribute mean, median, or mode (imputation with a central
value of the column)
– the most probable value: inference-based, such as a Bayesian
formula or decision tree
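Filling missing values with a column's mean, median, or mode can be sketched with Python's standard `statistics` module. This is an illustration under stated assumptions, not the lecture's own code: the helper `impute`, the use of `None` to mark a missing value, and the sample columns are all hypothetical.

```python
from statistics import mean, median, mode


def impute(values, strategy="mean"):
    """Replace None entries with the mean, median, or mode of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    # Pick the summary function by name and compute the fill value once.
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](observed)
    return [fill if v is None else v for v in values]


income = [30, None, 50, 40, None, 100]          # hypothetical numeric column
city = ["Lahore", None, "Karachi", "Lahore"]    # hypothetical categorical column

income_mean = impute(income, "mean")      # missing values filled with 55
income_median = impute(income, "median")  # missing values filled with 45.0
city_mode = impute(city, "mode")          # missing value filled with "Lahore"
```

The median is less affected by the outlier 100 than the mean, and the mode is the only one of the three that applies to a categorical column.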
Categorical Data
• Categorical variables are typically stored as string values
and can be used to group information.
• For example, in a dataset of employees, gender,
hometown, religion, etc. can be categorical
variables. Since machine learning models are
mathematical models, most of them accept only numeric
values. Hence, it is important to convert
categorical variables into a
form that can be fed into the model.
How to handle categorical data
• Find and replace
• Label encoding
• Binary encoding
• One-hot encoding
• Ordinal encoding
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
df["Gender"] = encoder.fit_transform(df["Gender"])
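To show what label, ordinal, and one-hot encoding actually produce, here is a dependency-free sketch. The function names and sample columns are hypothetical and are not part of scikit-learn. The label encoder here numbers categories in sorted order, which is the same ordering scikit-learn's `LabelEncoder` uses for its classes.

```python
def label_encode(values):
    """Map each category to an integer, using sorted order of the unique values."""
    mapping = {cat: i for i, cat in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


def ordinal_encode(values, order):
    """Map each category to its position in an explicitly given order."""
    rank = {cat: i for i, cat in enumerate(order)}
    return [rank[v] for v in values]


def one_hot_encode(values):
    """Return the sorted categories and one 0/1 row per value."""
    categories = sorted(set(values))
    rows = [[1 if v == cat else 0 for cat in categories] for v in values]
    return categories, rows


gender = ["Male", "Female", "Female", "Male"]         # nominal, two values
grades = ["B", "A", "C", "A"]                         # ordinal
hometown = ["Lahore", "Karachi", "Multan", "Lahore"]  # nominal, no order

gender_codes, gender_mapping = label_encode(gender)
grade_ranks = ordinal_encode(grades, order=["C", "B", "A"])
town_categories, town_rows = one_hot_encode(hometown)
```

Integer codes imply an order, so they suit ordinal data such as grades. For nominal data with more than two values, one-hot encoding avoids suggesting that one category is "greater" than another.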
Training and testing data
• “Training data teaches a machine
learning model how to behave while
testing data evaluates how well the
model has learned.”
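A train/test split can be sketched as a seeded shuffle followed by a cut. The function below is a hypothetical stand-in written with the standard library only; the 80/20 ratio and the seed are assumptions, not values given in the lecture.

```python
import random


def split_train_test(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and split it into training and testing parts."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)
    # A dedicated, seeded generator makes the split reproducible.
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


samples = list(range(10))  # hypothetical dataset of ten rows
train_rows, test_rows = split_train_test(samples, test_fraction=0.2, seed=42)
```

Shuffling before the cut keeps the test set from being just the last rows of the file, and keeping the two parts disjoint means the model is evaluated only on data it has not seen.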
Thank You

Editor's Notes

  • #25: Time series data example: rise and fall of temperature over a day
  • #26: Missing data; categorical data; train and test data