Lec2(Types of ML) & Preprocessing of data.pptx

Machine Learning
Prepared by
Sana Iftikhar
BSCS 7th

Types of Data
• Numerical data
a) Discrete – Date, no. of students in a class
b) Continuous – Cost of House (in decimal)
• Categorical data
a) Nominal – Gender
b) Ordinal – grades of students (split into ordered groups)
c) Dichotomous – Cancerous, Non-cancerous
• Time series data
Sequence of numbers collected at regular intervals over some period
of time
• Text
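
The data types above can be sketched in pandas as follows. This is a minimal illustration, not part of the original slides: the column names and values are made up, and the grade ranking C < B < A is an assumption.

```python
import pandas as pd

# Hypothetical toy records, one column per data type from the list above.
df = pd.DataFrame({
    "students_in_class": [30, 28, 35],               # numerical, discrete
    "house_cost": [250000.50, 310999.99, 189500.0],  # numerical, continuous
    "gender": ["Female", "Male", "Female"],          # categorical, nominal
    "grade": ["B", "A", "C"],                        # categorical, ordinal
    "cancerous": [True, False, False],               # categorical, dichotomous
})

# Ordinal data has an order, so it is declared as an ordered category.
# The ranking C < B < A is assumed for this example.
df["grade"] = pd.Categorical(df["grade"], categories=["C", "B", "A"], ordered=True)

# Time series: values indexed by timestamps at a regular (daily) interval.
temperature = pd.Series(
    [21.5, 22.1, 20.8],
    index=pd.date_range("2024-01-01", periods=3, freq="D"),
)

print(df.dtypes)
print(df["grade"].max())  # highest grade under the declared order
```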

Preprocessing of Data
• Good Data Preparation is key to producing
valid and reliable models

Where can I get dataset?
• https://www.superdatascience.com/machine-learning
• https://www.kaggle.com/dataset
• http://registry.opendata.aws/
• https://archive.ics.uci.edu/

Data Quality: Why Preprocess the Data?
• Measures for data quality: A multidimensional view
– Accuracy: correct or wrong, accurate or not
– Completeness: not recorded, unavailable, …
– Consistency: some modified but some not, dangling, …
– Timeliness: are the data updated in a timely manner?
– Believability: how much are the data trusted to be correct?
– Interpretability: how easily the data can be
understood?

Data Cleaning
• Data in the Real World Is Dirty: Lots of potentially incorrect data, e.g., faulty instruments, human or computer
error, transmission error
– incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate
data
• e.g., Occupation=“ ” (missing data)
– noisy: containing noise, errors, or outliers
• e.g., Salary=“−10” (an error)
– inconsistent: containing discrepancies in codes or names, e.g.,
• Age=“42”, Birthday=“03/07/2010”
• Was rating “1, 2, 3”, now rating “A, B, C”
• discrepancy between duplicate records
– Intentional (e.g., disguised missing data)
• Jan. 1 as everyone’s birthday?
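
The kinds of dirty data listed above can be flagged with a few simple checks. The following is a rough sketch under stated assumptions: the records are invented, birthdays are assumed to be in day/month/year format, and the reference date used to compute ages is arbitrary.

```python
import pandas as pd

# Hypothetical customer records containing the kinds of dirt listed above.
df = pd.DataFrame({
    "occupation": ["engineer", "", "teacher", "nurse"],
    "salary": [52000, 48000, -10, 61000],
    "age": [42, 35, 29, 50],
    "birthday": ["03/07/2010", "01/01/1989", "14/02/1995", "01/01/1974"],
})

# Incomplete: empty or whitespace-only strings are missing values.
missing_occupation = df["occupation"].str.strip() == ""

# Noisy: a salary cannot be negative.
bad_salary = df["salary"] < 0

# Inconsistent: the age implied by the birthday should match the recorded age.
# The reference date and the day-first date format are assumptions.
reference = pd.Timestamp("2024-06-30")
born = pd.to_datetime(df["birthday"], format="%d/%m/%Y")
implied_age = (reference - born).dt.days // 365
age_mismatch = (implied_age - df["age"]).abs() > 1

# Possibly disguised missing data: a suspicious share of 1 January birthdays.
jan_first_share = ((born.dt.day == 1) & (born.dt.month == 1)).mean()

print(df[missing_occupation | bad_salary | age_mismatch])
print(f"Share of 1 January birthdays: {jan_first_share:.0%}")
```

Checks like these only flag suspicious rows; deciding whether to correct, impute, or drop them still depends on the dataset.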

Incomplete (Missing) Data
• Data is not always available
– E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
• Missing data may be due to
– equipment malfunction
– data inconsistent with other recorded data and thus deleted
– data not entered due to misunderstanding
– certain data may not be considered important at the
time of entry
– history or changes of the data not being registered
• Missing data may need to be inferred

How to Handle Missing Data?
• Ignore the tuple (row): usually done when the class label is missing
(when doing classification); not effective when the % of
missing values per attribute varies considerably
• Fill in the missing value manually: tedious and often infeasible
• Fill it in automatically with
– a global constant: e.g., “unknown” (which may act as a new class)
– the attribute mean, median, or mode for the column (imputation)
– the most probable value: inference-based, such as a Bayesian
formula or decision tree
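
The automatic strategies above might look like this in pandas. This is a minimal sketch, not the lecture's own code: the DataFrame and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical sales data with gaps (None marks a missing value).
df = pd.DataFrame({
    "income": [40000.0, None, 55000.0, 61000.0, None],
    "city": ["Lahore", "Karachi", None, "Lahore", "Lahore"],
    "label": ["yes", "no", "yes", None, "no"],
})

# Drop rows whose class label is missing.
df = df.dropna(subset=["label"]).copy()

# Global constant: treat "unknown" as its own category.
city_constant = df["city"].fillna("unknown")

# Mode: fill with the most frequent value of the column.
city_mode = df["city"].fillna(df["city"].mode()[0])

# Mean / median: fill a numeric column with a summary statistic.
income_mean = df["income"].fillna(df["income"].mean())
income_median = df["income"].fillna(df["income"].median())

print(income_mean.tolist())
print(city_mode.tolist())
```

The median is usually the safer choice when a numeric column contains outliers, since a few extreme values can pull the mean far from typical values.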

Categorical Data
• Categorical variables are essentially string values
and at times can be used to group information.
• For example, in a dataset of employees, gender,
hometown, religion, etc. can be categorical
variables. Since machine learning models are
mathematical models, they can only accept numeric
values. Hence, it is important to handle
categorical variables and convert them into a
form that can be fed into the model.

How to handle categorical data
• Find and replace
• Label encoding
• Binary coding
• One hot encoding
• Ordinal encoder
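from sklearn.preprocessing import LabelEncoder
# df is assumed to be a pandas DataFrame with a "Gender" column
encoder = LabelEncoder()
# Replace each distinct label with an integer (labels are numbered in sorted order)
df["Gender"] = encoder.fit_transform(df["Gender"])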
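
Two of the other techniques in the list, find and replace and one hot encoding, can be sketched with pandas alone. The data, the column names, and the grade ranking C < B < A are assumptions made for this example.

```python
import pandas as pd

# Hypothetical employee data with a nominal and an ordinal column.
df = pd.DataFrame({
    "hometown": ["Lahore", "Karachi", "Lahore"],
    "grade": ["B", "A", "C"],
})

# Find and replace: map each ordinal level to a number that keeps the order.
# The ranking C < B < A is an assumption made for this example.
grade_order = {"C": 0, "B": 1, "A": 2}
df["grade_encoded"] = df["grade"].map(grade_order)

# One hot encoding: one 0/1 column per category, so no order is implied.
one_hot = pd.get_dummies(df["hometown"], prefix="hometown", dtype=int)

print(df)
print(one_hot)
```

Integer codes suit ordinal data because the numbers carry the ranking. For nominal data such as hometown, integer codes would suggest an order that does not exist, which is why one hot encoding is often preferred there.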

Training and testing data
• “Training data teaches a machine
learning model how to behave while
testing data evaluates how well the
model has learned.”
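
One common way to produce the two sets is scikit-learn's train_test_split. The sketch below uses made-up features and labels; the 80/20 split ratio and the random seed are arbitrary choices, not values from the lecture.

```python
from sklearn.model_selection import train_test_split

# Hypothetical features (X) and labels (y): ten samples, two classes.
X = [[i] for i in range(10)]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

# Hold out 20% of the samples for testing. random_state makes the split
# repeatable, and stratify keeps the class balance the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(len(X_train), len(X_test))
```

The model is fitted only on X_train and y_train; X_test and y_test are kept aside so that the evaluation reflects data the model has not seen.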
