Types of Data in Machine Learning: Numerical and Categorical

Data Exploration, Feature
Engineering and Visualization
Dr.M.Shanthi,
ADS
ODD SEM-2024-2025

Unit-I
EDA Fundamentals - Understanding Data Science
- Significance of EDA - Making Sense of Data -
Comparing EDA with Classical and Bayesian
Analysis - Software Tools Available for EDA.

Understanding Data Science
• Data Science: Scientific Study of Data.
• Data science involves cross-disciplinary knowledge from computer science,
statistics and mathematics.
• Data Analysis Phases:
1. Data Requirements
2. Data Collection
3. Data Processing
4. Data Cleaning
5. EDA: Transformation, Descriptive Statistics
6. Modeling and Algorithm
7. Data Product
8. Communication – Data Visualization

The Significance of EDA
- Different fields (science, economics, engineering, and
marketing) accumulate and store the data in electronic
formats.
- Appropriate, well-founded decisions should be made
using the data collected.
- It is practically impossible to make decisions from large datasets without the help
of computer programs.
- Data mining → insights from the data → further decisions.
- Exploratory Data Analysis is the first exercise in data mining.
- Visualize the data to understand it and to create hypotheses for
further analysis.

The Significance of EDA
• EDA reveals the ground truth of the content
without making any underlying assumptions.
• Data scientists use EDA to decide what type of
modeling and hypotheses can be created.
• EDA consists of:
- summarizing data [pandas]
- statistical analysis [scipy]
- visualization of the data [matplotlib]
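
The three parts of EDA listed above can be sketched in a few lines. This is only an illustrative example: the DataFrame, its column names (hours, score) and its values are hypothetical, and the Pearson correlation is just one of many statistics scipy offers.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
from scipy import stats

# Hypothetical sample: study hours and exam scores for eight students.
df = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5, 6, 7, 8],
    "score": [35, 42, 50, 55, 63, 70, 78, 85],
})

# 1. Summarizing data with pandas (count, mean, std, min, quartiles, max).
summary = df.describe()

# 2. Statistical analysis with scipy: Pearson correlation between the columns.
r, p_value = stats.pearsonr(df["hours"], df["score"])

# 3. Visualization with matplotlib.
fig, ax = plt.subplots()
ax.scatter(df["hours"], df["score"])
ax.set_xlabel("hours")
ax.set_ylabel("score")
```

In an interactive session, drop the matplotlib.use("Agg") line and call plt.show() to see the plot.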

Steps in EDA
• Problem Definition
- Define the business problem
- Data analysis plan execution
- Main deliverables
- Obtaining the current status of the data
- Performing cost/benefit Analysis
• Data Preparation
- Define the sources of the data
- Define data schemas and tables
- Understand the main characteristics of the data
- Clean the dataset
- Delete the non-relevant datasets
- Transform the data
- Divide the data into required chunks for analysis
• Data Analysis
- Summarizing the data
- Finding the hidden correlations and relationships among the data
• Development and representation of the results.
- Graphs
- Summary Tables
- Plots

Making sense of Data
Types of data:
1. Numerical data [Quantitative data]
- Discrete data (fixed and distinct values)
Ex: Country code variable
Rank of students
- Continuous data
Can take an infinite number of numerical values within a
specific range

Making Sense of Data
2. Categorical data [Qualitative data]
Categorical data represents the characteristics of an object.
Example:
Gender
Marital status
Movie genres
Blood type
Types of drugs
Types:
- Binary (dichotomous) variable: can take exactly two values, of which exactly one
is selected.
- Polytomous variable: can take more than two possible values (e.g., marital status).
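
As a rough illustration of the numerical/categorical split, the sketch below tags the columns of a small pandas DataFrame. The table, its column names and its values are hypothetical and exist only for this example.

```python
import pandas as pd

# Hypothetical patient records mixing numerical and categorical columns.
df = pd.DataFrame({
    "age": [34, 51, 27, 45],                  # discrete numerical
    "weight_kg": [70.5, 82.1, 64.0, 90.3],    # continuous numerical
    "blood_type": ["A", "O", "B", "O"],       # polytomous categorical
    "smoker": ["yes", "no", "no", "yes"],     # binary (dichotomous) categorical
})

# Mark the qualitative columns explicitly as categorical.
df["blood_type"] = df["blood_type"].astype("category")
df["smoker"] = df["smoker"].astype("category")

numerical_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(include="category").columns.tolist()

# A binary variable has exactly two distinct values.
is_binary = {col: df[col].nunique() == 2 for col in categorical_cols}
```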

Measurement Scales
• Most categorical datasets follow either a
nominal or an ordinal measurement scale.
• The four measurement scales are:
- Nominal
- Ordinal
- Interval
- Ratio

Measurement Scales
Nominal
• Used for labeling variables without any quantitative value.
• The scales are generally referred to as labels.
• The labels are mutually exclusive and do not carry numerical values.
Example:
• What is your gender?
Male, Female, Third gender/Non-binary,
I prefer not to answer, Other
• The languages that are spoken in a particular country:
Tamil, Telugu, Malayalam, etc.
• Biological species
• Parts of speech in grammar
Important note: Even when numbers are used as labels on a nominal scale, they
have no concrete numerical value or meaning.
- No form of arithmetic calculation can be made on nominal measures.

Measurement Scales
In the case of a nominal dataset, you can compute the following:
- Frequency: the rate at which a label occurs over a period of time
within the dataset.
- Proportion: the frequency divided by the total number of events.
- Percentage: the proportion multiplied by 100.
- Visualization: pie chart or bar chart.
Important note:
The type of data determines the type of computation, the type of model, and the type of
visualization.
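
The frequency, proportion and percentage of a nominal variable can be obtained with pandas value_counts. A minimal sketch, assuming a hypothetical list of survey answers:

```python
import pandas as pd

# Hypothetical nominal variable: language spoken by each survey respondent.
languages = pd.Series(["Tamil", "Telugu", "Tamil", "Malayalam", "Tamil", "Telugu"])

frequency = languages.value_counts()                 # count of each label
proportion = languages.value_counts(normalize=True)  # frequency / total
percentage = proportion * 100
```

Note that only counting is applied to the labels; no arithmetic is performed on the labels themselves.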

Measurement Scales
Ordinal
- The difference between the ordinal and nominal scales is the
order.
- The order of the values is a significant factor.
- Commonly represented by a Likert scale.
(Diagram to be attached.)
- Ordinal scales express an order of ranking.
- The median is allowed as the measure of central tendency.
- The average (mean) is not permitted.
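
One way to work with ordinal data in pandas is an ordered categorical. The sketch below uses hypothetical Likert responses and finds the median item by rank rather than by averaging; it assumes an odd number of responses so that a single middle item exists.

```python
import pandas as pd

# Hypothetical Likert-scale responses, from lowest to highest rank.
levels = ["Strongly disagree", "Disagree", "Neutral", "Agree", "Strongly agree"]
responses = pd.Series(
    pd.Categorical(
        ["Agree", "Neutral", "Strongly agree", "Agree", "Disagree"],
        categories=levels,
        ordered=True,
    )
)

# Sort by rank and take the middle item (odd number of responses assumed).
ranked = responses.sort_values().reset_index(drop=True)
median_response = ranked.iloc[len(ranked) // 2]

# Order comparisons are meaningful on an ordered categorical.
at_least_agree = (responses >= "Agree").sum()
```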

Measurement Scales
Interval
• Both the order and the exact differences between the
values are significant.
• Widely used in statistics.
• Supports measures of central tendency (mean, median, mode)
and the standard deviation.
• Example: Temperature.

Measurement Scales
Ratio:
- Has order, exact values, and an absolute zero.
- Possible to apply descriptive and inferential statistics.
- Measures of central tendency and measures of dispersion (the scatter/spread of the data) apply.
- Coefficient of variation (ratio of the standard deviation to the mean).
Examples:
- Dose amount
- Reaction rate
- Flow rate
- Concentration
- Pulse
- Weight
- Length
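
For ratio-scale data, the coefficient of variation can be computed as the standard deviation divided by the mean. A minimal sketch with hypothetical weights; the sample standard deviation (ddof=1) is an assumption here, and the population version (ddof=0) is equally common.

```python
import numpy as np

# Hypothetical ratio-scale measurements: weights in kilograms.
weights = np.array([60.0, 65.0, 70.0, 75.0, 80.0])

mean = weights.mean()
std = weights.std(ddof=1)              # sample standard deviation
coefficient_of_variation = std / mean  # unitless measure of relative spread

# Ratios are meaningful because the scale has an absolute zero.
ratio = weights[4] / weights[0]        # heaviest relative to lightest
```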

Comparing EDA with Classical and Bayesian
Analysis

Software tools available for EDA
• Python
• R programming Language
• Weka
• KNIME

Visual Aids for EDA
• Line Chart
• Bar Chart
• Scatter Plot
• Area Plot and stacked plot
• Pie Chart
• Table chart
• Polar Chart
• Histogram
• Lollipop Chart
• Choosing the best Chart
• Other Libraries to explore

Line Chart
• A line chart is used to illustrate the relationship
between two or more continuous variables.
• It can be drawn with the Matplotlib library.
• Example:
- Date vs Stock_price
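
A minimal Matplotlib sketch of the Date vs Stock_price example. The dates and prices are hypothetical placeholders; real data would come from a file or a data provider.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical closing prices for five consecutive days.
dates = pd.date_range("2024-01-01", periods=5, freq="D")
stock_price = [101.2, 102.8, 101.9, 104.3, 105.0]

fig, ax = plt.subplots()
(line,) = ax.plot(dates, stock_price, marker="o")
ax.set_xlabel("Date")
ax.set_ylabel("Stock_price")
ax.set_title("Date vs Stock_price")
fig.autofmt_xdate()  # tilt the date labels so they do not overlap
```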

Lollipop chart
• A lollipop chart can be used to display rankings
in the data.
• It is similar to an ordered bar chart.
• The line with a circle on top gives a clear
illustration of, for example, different types of cars and their
associated mileage.
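
Matplotlib has no dedicated lollipop function, so one common approach is to draw thin vertical lines and put a marker on top of each. The sketch below assumes hypothetical car names and mileage figures.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical mileage figures, sorted so the chart shows a ranking.
cars = pd.DataFrame({
    "car": ["Car A", "Car B", "Car C", "Car D"],
    "mpg": [22.5, 31.0, 18.4, 27.3],
}).sort_values("mpg").reset_index(drop=True)

positions = list(range(len(cars)))

fig, ax = plt.subplots()
ax.vlines(x=positions, ymin=0, ymax=cars["mpg"])   # the stick
ax.scatter(positions, cars["mpg"], s=80, zorder=3)  # the circle on top
ax.set_xticks(positions)
ax.set_xticklabels(cars["car"])
ax.set_ylabel("Miles per gallon")
```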

Bar Chart
• Bar charts are among the most frequently used charts.
• They are used to compare objects across distinct
collections or to track variations over
time.
• Bars can be drawn horizontally or vertically to
represent the categorical variables.
• Example: A pharmacy in Norway keeps track of
the amount of Zoloft sold every month.
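
A minimal sketch of a monthly sales bar chart in the spirit of the pharmacy example. The month labels and unit counts are hypothetical, not real sales figures.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Hypothetical units sold per month.
months = ["Jan", "Feb", "Mar", "Apr"]
units_sold = [120, 135, 128, 150]

fig, ax = plt.subplots()
bars = ax.bar(months, units_sold)  # vertical bars; ax.barh draws them horizontally
ax.set_xlabel("Month")
ax.set_ylabel("Units sold")
```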

Table Chart
• A table chart combines a bar chart and a table.
• Example: Consider standard LED bulbs that
come in different wattages.
• The chart is based on two categorical variables, the year
and the wattage, and shows the number of units sold in a
particular year.

Histogram
• Histogram plots are used to depict the
distribution of any continuous variable.
• These types of plots are very popular in statistical
analysis.
• A histogram is the natural choice when the goal is to see
how the values of a variable are distributed.
• Example: Frequency vs. years of experience with
Python programming.
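
A minimal histogram sketch for the example above. The years-of-experience values are synthetic, generated from a seeded random generator purely for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data: years of Python experience for 200 hypothetical respondents.
rng = np.random.default_rng(0)
years_of_experience = rng.normal(loc=5, scale=2, size=200).clip(min=0)

fig, ax = plt.subplots()
counts, bin_edges, _ = ax.hist(years_of_experience, bins=10)
ax.set_xlabel("Years of experience with Python")
ax.set_ylabel("Frequency")
```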

Scatter Plot
• Scatter plots are also called scatter graphs or
scatter charts.
• They use Cartesian coordinates (x, y) to display the values of two variables.

Data Transformation
• Merging database-style dataframes
• Transformation techniques
• Benefits of data transformation

Data transformation
• Concat
• Concat with an axis
• Merge
- inner join
- outer join
- left join
- right join
- merge on index
• Reshaping and pivoting
- stacking
- unstacking
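
The operations listed above map onto pandas concat, merge, stack and unstack. A minimal sketch, assuming two small hypothetical student tables:

```python
import pandas as pd

# Hypothetical student tables used only to illustrate the operations.
left = pd.DataFrame({"student_id": [1, 2, 3], "math": [80, 65, 90]})
right = pd.DataFrame({"student_id": [2, 3, 4], "science": [70, 85, 60]})

# Concat: stack rows (axis=0, the default) or place tables side by side (axis=1).
rows = pd.concat([left, left], ignore_index=True)
side_by_side = pd.concat([left, right], axis=1)

# Merge: database-style joins on a key column.
inner = pd.merge(left, right, on="student_id", how="inner")
outer = pd.merge(left, right, on="student_id", how="outer")
left_join = pd.merge(left, right, on="student_id", how="left")
right_join = pd.merge(left, right, on="student_id", how="right")

# Merge on the index instead of a column.
by_index = pd.merge(
    left.set_index("student_id"),
    right.set_index("student_id"),
    left_index=True,
    right_index=True,
)

# Reshaping: stack moves columns into the row index, unstack reverses it.
stacked = inner.set_index("student_id").stack()
unstacked = stacked.unstack()
```

The inner join keeps only students present in both tables, while the outer join keeps all students and fills the gaps with NaN.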

Transformation Techniques
• Data deduplication
• Replacing values
• Handling missing data

• Dropping the Missing Values
- Row-wise
- Column-wise
- Based on a threshold

• Filling the Missing Values
- Fill with zero
- Fill by forward/backward filling
- Fill by interpolation
Descriptive Statistics
• Simple summaries of the entire dataset.
• Measures of central tendency:
- Mean
- Median
- Mode
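
The three measures of central tendency are available directly on a pandas Series. A minimal sketch with hypothetical values:

```python
import pandas as pd

# Hypothetical sample of six values.
values = pd.Series([2, 3, 3, 5, 7, 10])

mean = values.mean()      # sum of the values divided by their count
median = values.median()  # middle value (average of the two middle values here)
mode = values.mode()      # Series holding the most frequent value(s)
```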

Descriptive Statistics
The mean (average) alone might not be the best representation of a dataset, so measures of dispersion are also reported.
Measures of Dispersion
1. Standard Deviation
2. Variance
3. Skewness (measure of the symmetry or asymmetry of a variable)
- Positive skewness
- Symmetrical
- Negative skewness
4. Kurtosis (heaviness of the tails of the distribution)
- Mesokurtic: excess kurtosis around 0
- Leptokurtic: excess kurtosis above 0
- Platykurtic: excess kurtosis below 0
5. Percentile (the value below which a given percentage of the values in a dataset
lie)
- 25%
- 50%
- 75%
- 100%
6. Quartiles
- Visualization of quartiles
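
A short sketch of why the mean may mislead and how dispersion measures help. The values are hypothetical, with one deliberately extreme value added to show its effect.

```python
import pandas as pd

# Hypothetical sample; 40 is a deliberate outlier.
values = pd.Series([2, 3, 3, 5, 7, 10, 40])

mean = values.mean()        # pulled upward by the outlier
median = values.median()    # robust to the outlier

std = values.std()          # sample standard deviation
variance = values.var()     # sample variance (std squared)
skewness = values.skew()    # positive here: long tail on the right
```

With the outlier present, the mean is twice the median, which is the kind of gap that signals a skewed distribution.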

Skewness
• Skewness measures the asymmetry of a variable in the dataset
about its mean.
• Positive
• Negative
• Symmetrical

Kurtosis
Function: df.kurt()
• Kurtosis is a statistical measure that illustrates
how heavily the tails of a distribution differ
from those of a normal distribution.
• It identifies whether a given distribution contains
extreme values.
• It is a measure of outlier presence in a given
distribution.
• High kurtosis → more outliers.

Kurtosis
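• There are three types of kurtosis (values are excess kurtosis, as reported by df.kurt(), where a normal distribution scores 0):
- Mesokurtic: around 0; tails similar to a normal distribution
- Leptokurtic: greater than 0 (plain kurtosis > 3); fat tails, more outliers
- Platykurtic: less than 0 (plain kurtosis < 3); thin tails, fewer outliers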
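
The three kurtosis types can be illustrated with df.kurt() on synthetic samples. The choice of the normal, Laplace and uniform distributions is an assumption made for this sketch; pandas reports excess kurtosis (Fisher's definition), so the normal sample comes out near 0.

```python
import numpy as np
import pandas as pd

# Synthetic samples drawn with a fixed seed, one column per kurtosis type.
rng = np.random.default_rng(42)
n = 100_000
df = pd.DataFrame({
    "normal": rng.normal(size=n),           # mesokurtic reference
    "laplace": rng.laplace(size=n),         # heavier tails: leptokurtic
    "uniform": rng.uniform(-1, 1, size=n),  # lighter tails: platykurtic
})

# Excess kurtosis per column (a normal distribution scores about 0).
kurtosis = df.kurt()
```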

Percentile
Function: np.percentile(attribute, 50)
• A percentile is the value below which a given percentage
of the values in a dataset lie; the 50th percentile is the median.

Quartiles
• Quartiles are values that split the given
dataset into quarters.
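
Percentiles and quartiles can both be computed with np.percentile. A minimal sketch, where attribute is a hypothetical array of nine values and NumPy's default linear interpolation is assumed:

```python
import numpy as np

# Hypothetical attribute with nine sorted values.
attribute = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

median = np.percentile(attribute, 50)

# The quartiles are the 25th, 50th and 75th percentiles.
q1, q2, q3 = np.percentile(attribute, [25, 50, 75])

# Interquartile range: the spread of the middle half of the data.
iqr = q3 - q1
```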

Grouping Datasets
• Groupby Mechanisms
- Grouping by features, hierarchically
- Aggregating a dataset by groups
- Applying custom aggregation functions to
groups
- Transforming a dataset groupwise

Grouping the Datasets
• Selecting a subset of columns
• Max and Min
• Mean
