Data Exploratory, Feature
Engineering and Visualization
Dr.M.Shanthi,
ADS
ODD SEM-2024-2025
Unit-I
EDA Fundamentals: Understanding Data Science – Significance of EDA – Making Sense of Data – Comparing EDA with Classical and Bayesian Analysis – Software Tools Available for EDA.
Understanding Data Science
• Data Science: Scientific Study of Data.
• Data science involves cross-disciplinary knowledge from computer science,
statistics and mathematics.
• Data Analysis Phases:
1. Data Requirements
2. Data Collection
3. Data Processing
4. Data Cleaning
5. EDA → Transformation, Descriptive Statistics
6. Modeling and Algorithm
7. Data Product
8. Communication – Data Visualization
The Significance of EDA
- Different fields (science, economics, engineering, and marketing) accumulate and store data in electronic formats.
- Appropriate, well-founded decisions should be made using the data collected.
- It is practically impossible to make decisions from datasets without the help of computer programs.
- Data mining → extract insights from the data and use them to make further decisions.
- Exploratory Data Analysis (EDA) is the first exercise in data mining.
- Visualize the data → understand it and create hypotheses for further analysis.
The Significance of EDA
• EDA reveals the ground truth of the data without making any underlying assumptions.
• Data scientists use EDA to understand what type of modeling and hypotheses can be created.
• EDA involves:
- summarizing data [pandas]
- statistical analysis [scipy]
- visualization of the data [matplotlib]
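As a minimal sketch of the first two components listed above, the snippet below summarizes a small table with pandas and runs one statistical computation with scipy. The column names and values are hypothetical and invented only for illustration; visualization is left out to keep the example short.

```python
import pandas as pd
from scipy import stats

# Hypothetical data, invented purely for illustration.
df = pd.DataFrame({
    "age": [23, 35, 31, 47, 52, 29],
    "income": [28, 41, 39, 60, 75, 33],
})

# Summarizing data with pandas: count, mean, std, min, quartiles, max.
summary = df.describe()

# Statistical analysis with scipy: Pearson correlation between two columns.
r, p_value = stats.pearsonr(df["age"], df["income"])

print(summary)
print(f"Pearson r = {r:.3f}")
```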
Steps in EDA
• Problem Definition
- Define the business problem
- Data analysis plan execution
- Main deliverables
- Obtaining the current status of the data
- Performing cost/benefit Analysis
• Data Preparation
- Sources of the data
- Define data schemas and tables
- Main characteristics of the data
- Clean the dataset
- Delete the non-relevant datasets
- Transform the data
- Divide the data into required chunks for analysis
• Data Analysis
- Summarizing the data
- Finding the hidden correlations and relationships among the data
• Development and representation of the results.
- Graphs
- Summary Tables
- Plots
Making sense of Data
Types of data:
1. Numerical data [Quantitative data]
- Discrete data (countable, with fixed and distinct values)
Ex: Country code variable
Rank for students
- Continuous data
Can take an infinite number of numerical values within a specific range
Making Sense of Data
2. Categorical data[Qualitative data]
Categorical data represents the characteristics of an object.
Example:
Gender
Marital status
Movie Genres
Blood Type
Types of drugs
Types:
A binary (dichotomous) categorical variable can take exactly two values, of which one is selected.
A polytomous variable can take more than two possible values (e.g., marital status).
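The distinction between numerical and categorical variables, and between binary and polytomous categories, can be sketched with pandas as follows. The table and its column names are hypothetical; `select_dtypes` and `nunique` are standard pandas calls.

```python
import pandas as pd

# Hypothetical table mixing the data types discussed above.
df = pd.DataFrame({
    "age": [25, 32, 47, 51],                    # numerical, discrete
    "height_cm": [170.2, 165.5, 180.1, 175.0],  # numerical, continuous
    "smoker": ["yes", "no", "no", "yes"],       # categorical, binary
    "marital_status": ["single", "married", "divorced", "married"],  # categorical, polytomous
})

# Split the columns into numerical and categorical by dtype.
numerical = df.select_dtypes(include="number").columns.tolist()
categorical = df.select_dtypes(exclude="number").columns.tolist()

# A categorical column with exactly two distinct values is binary (dichotomous);
# one with more than two is polytomous.
n_levels = df[categorical].nunique()
binary = n_levels[n_levels == 2].index.tolist()
polytomous = n_levels[n_levels > 2].index.tolist()

print(numerical, categorical, binary, polytomous)
```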
Measurement Scales
• Most categorical datasets follow either a nominal or an ordinal measurement scale.
• The four measurement scales are:
- Nominal
- Ordinal
- Interval
- Ratio
Measurement Scales
Nominal
• Used for labeling variables without any quantitative value.
• The scales are generally referred to as labels.
• The labels are mutually exclusive and do not carry numerical values.
Example:
• What is the gender?
Male, Female, Third gender/Non-binary
I prefer not to answer, Other
• The languages that are spoken in a particular country
Tamil, Telugu, Malayalam etc.
• Biological Species
• Parts of speech in grammar
Important note: Even when numbers are used as labels in the nominal measurement sense, they have no concrete numerical value or meaning.
→ No form of arithmetic calculation can be made on nominal measures.
Measurement Scales
In the case of a nominal dataset, you can compute the following:
Frequency → the rate at which a label occurs over a period of time within the dataset.
Proportion → the frequency divided by the total number of events.
Percentage → the proportion expressed as a percentage.
Visualize → pie chart or bar chart.
Important note:
The type of data determines the type of computation, the type of model, and the type of visualization.
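The frequency, proportion, and percentage of nominal labels can be sketched with pandas `value_counts`. The survey answers below are hypothetical and reuse the language example only for illustration.

```python
import pandas as pd

# Hypothetical survey answers on a nominal scale.
languages = pd.Series(["Tamil", "Telugu", "Tamil", "Malayalam", "Tamil", "Telugu"])

# Frequency: how often each label occurs.
frequency = languages.value_counts()

# Proportion: frequency divided by the total number of observations.
proportion = languages.value_counts(normalize=True)

# Percentage: proportion expressed out of 100.
percentage = proportion * 100

print(pd.DataFrame({"frequency": frequency,
                    "proportion": proportion,
                    "percentage": percentage}))
```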
Measurement Scales
Ordinal
- The difference between the ordinal and nominal scales is the order.
- The order of the values is a significant factor.
- Commonly represented by a Likert scale.
Diagram to be attached.
- Ordinal scales express an order of ranking.
- Median → the permitted measure of central tendency.
- The average (mean) is not permitted.
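One way to sketch an ordinal variable in pandas is an ordered categorical. The Likert labels and responses below are hypothetical. The median is taken over the rank codes and mapped back to a label, so no arithmetic mean is computed on the labels themselves.

```python
import pandas as pd

# Hypothetical five-point Likert scale, listed from lowest to highest rank.
levels = ["Strongly disagree", "Disagree", "Neutral", "Agree", "Strongly agree"]

responses = pd.Series(
    ["Agree", "Neutral", "Strongly agree", "Agree", "Disagree"],
    dtype=pd.CategoricalDtype(categories=levels, ordered=True),
)

# Order-based operations are meaningful on an ordinal scale.
lowest = responses.min()
highest = responses.max()

# Median: take the median of the rank codes and map it back to its label.
# With an odd number of responses the median is always an observed rank.
median_code = int(responses.cat.codes.median())
median_label = levels[median_code]

print(lowest, highest, median_label)
```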
Measurement Scales
Interval
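• The order of the values and the exact differences between them are both significant.
• Widely used in statistics.
• Supports measures of central tendency (mean, median, mode) and the standard deviation.
• Example: temperature in degrees Celsius, where zero is an arbitrary point rather than an absolute zero.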
Measurement Scales
Ratio:
A ratio scale has order, exact differences between values, and an absolute zero.
Both descriptive and inferential statistics can be applied.
- Central tendencies and measures of dispersion (how scattered the data are)
- Coefficient of variation (ratio of the standard deviation to the mean)
Examples:
- Dose amount
- Reaction rate
- Flow rate
- Concentration
- Pulse
- Weight
- Length
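The coefficient of variation mentioned above is the standard deviation divided by the mean, which is only meaningful on a ratio scale. A minimal sketch, using hypothetical weights:

```python
import numpy as np

# Hypothetical weights in kilograms (a ratio-scale variable).
weights_kg = np.array([60.0, 65.0, 70.0, 75.0, 80.0])

mean = weights_kg.mean()
std = weights_kg.std(ddof=1)  # sample standard deviation

# Coefficient of variation: dispersion relative to the mean (unitless).
cv = std / mean

print(f"mean={mean:.1f} kg, std={std:.3f} kg, CV={cv:.3f}")
```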
Comparing EDA with Classical and Bayesian
Analysis
Software tools available for EDA
• Python
• R programming Language
• Weka
• KNIME
Visual Aids for EDA
• Line Chart
• Bar Chart
• Scatter Plot
• Area Plot and stacked plot
• Pie Chart
• Table chart
• Polar Chart
• Histogram
• Lollipop Chart
• Choosing the best Chart
• Other Libraries to explore
Line Chart
• A line chart is used to illustrate the relationship between two or more continuous variables.
• It can be drawn with the Matplotlib library.
• Example:
- Date vs Stock_price
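A minimal Matplotlib sketch of the Date vs Stock_price example follows. The dates and prices are hypothetical values made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical daily closing prices.
dates = pd.date_range("2024-01-01", periods=5, freq="D")
stock_price = [101.2, 102.8, 101.9, 104.5, 106.0]

fig, ax = plt.subplots()
ax.plot(dates, stock_price, marker="o")
ax.set_xlabel("Date")
ax.set_ylabel("Stock_price")
ax.set_title("Date vs Stock_price")
fig.autofmt_xdate()  # tilt the date labels so they do not overlap
# In an interactive session, call plt.show() to display the figure.
```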
Lollipop chart
• A Lollipop chart can be used to display ranking
in the data.
• It is similar to an ordered bar chart.
• Example: the line (stem) and the circle on top illustrate different types of cars and their associated mileage.
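A lollipop chart can be sketched in Matplotlib as a vertical line per item with a marker on top. The car names and mileage figures are hypothetical. Sorting first gives the ranking effect described above.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt

# Hypothetical cars and their mileage.
cars = ["Car A", "Car B", "Car C", "Car D"]
mileage = [18.5, 24.0, 21.3, 30.1]

# Sort by mileage so the chart reads as a ranking.
order = sorted(range(len(cars)), key=lambda i: mileage[i])
sorted_cars = [cars[i] for i in order]
sorted_mileage = [mileage[i] for i in order]
positions = list(range(len(sorted_cars)))

fig, ax = plt.subplots()
ax.vlines(positions, ymin=0, ymax=sorted_mileage)  # the "stick"
ax.plot(positions, sorted_mileage, "o")            # the "candy"
ax.set_xticks(positions)
ax.set_xticklabels(sorted_cars)
ax.set_ylabel("Mileage")
```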
Bar Chart
• Bar charts are among the most frequently used charts.
• They distinguish objects between distinct collections and can track variations over time.
• Bars can be drawn horizontally or vertically to represent the categorical variables.
• Example: a pharmacy in Norway keeps track of the amount of Zoloft sold every month.
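The pharmacy example could be drawn as the bar chart below. The monthly unit counts are hypothetical and are not real sales data.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt

# Hypothetical units sold per month.
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
units_sold = [120, 135, 128, 150, 142, 160]

fig, ax = plt.subplots()
bars = ax.bar(months, units_sold)  # use ax.barh(...) for horizontal bars
ax.set_xlabel("Month")
ax.set_ylabel("Units sold")
ax.set_title("Monthly units sold")
```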
Table Chart
• A table chart combines a bar chart and a table.
• Example: Consider the standard LED bulbs that
come in different wattages.
• The table is based on two categorical variables, the year and the wattage, and records the number of units sold in a particular year.
Histogram
• Histogram plots are used to depict the distribution of any continuous variable.
• These types of plots are very popular in statistical analysis.
• Example: frequency vs. years of experience with Python programming.
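The years-of-experience example can be sketched as follows. The survey values are hypothetical. `np.histogram` computes the bin counts that `ax.hist` then draws.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical survey: years of experience with Python programming.
years_experience = np.array([1, 1, 2, 2, 2, 3, 3, 4, 5, 5, 6, 8, 10, 12, 15])

# Bin counts: how many respondents fall into each of 5 equal-width bins.
counts, bin_edges = np.histogram(years_experience, bins=5)

fig, ax = plt.subplots()
ax.hist(years_experience, bins=5, edgecolor="black")
ax.set_xlabel("Years of experience")
ax.set_ylabel("Frequency")

print(counts, bin_edges)
```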
Scatter Plot
• Scatter plots are also called scatter graphs or scatter charts.
• They use Cartesian coordinates (x, y) to display the values of two variables.
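A minimal scatter plot sketch, with each observation placed at its (x, y) Cartesian coordinates. The two variables and their values are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt

# Hypothetical paired observations.
hours_studied = [1, 2, 3, 4, 5, 6, 7, 8]
exam_score = [52, 55, 61, 64, 70, 74, 79, 85]

fig, ax = plt.subplots()
points = ax.scatter(hours_studied, exam_score)
ax.set_xlabel("Hours studied (x)")
ax.set_ylabel("Exam score (y)")
```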
Cartesian Co-Ordinates
Polar Co-ordinates
Polar Chart
Data Transformation
• Merging database-style dataframes
• Transformation techniques
• Benefits of data transformation
Data transformation
• Concat
• Concat with an axis
• Merge
- inner join
- outer join
- left join
- right join
- merge on index
• Reshaping and pivoting
- stacking
- unstacking
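The operations listed above map onto pandas `concat`, `merge`, `stack`, and `unstack`. The sketch below uses small hypothetical tables whose names and values are invented for illustration.

```python
import pandas as pd

# Hypothetical tables sharing a key column.
left = pd.DataFrame({"student_id": [1, 2, 3], "name": ["Asha", "Ben", "Chitra"]})
right = pd.DataFrame({"student_id": [2, 3, 4], "score": [88, 92, 75]})

# Concat: stack rows (axis=0) or place frames side by side (axis=1).
rows = pd.concat([left, left], axis=0, ignore_index=True)
side_by_side = pd.concat([left, right], axis=1)

# Merge: database-style joins on the key column.
inner = pd.merge(left, right, on="student_id", how="inner")  # keys in both
outer = pd.merge(left, right, on="student_id", how="outer")  # keys in either
left_join = pd.merge(left, right, on="student_id", how="left")
right_join = pd.merge(left, right, on="student_id", how="right")

# Merge on index instead of a column.
by_index = pd.merge(left.set_index("student_id"), right.set_index("student_id"),
                    left_index=True, right_index=True)

# Reshaping: stack pivots columns into the index, unstack reverses it.
scores = pd.DataFrame({"math": [80, 90], "physics": [70, 85]}, index=["Asha", "Ben"])
long_form = scores.stack()
wide_form = long_form.unstack()
```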
Transformation Techniques
• Data deduplication (removing duplicate rows)
• Replacing values
• Handling missing data
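The first two techniques can be sketched with pandas `drop_duplicates` and `replace`. The table is hypothetical, and the placeholder values `"N/A"` and `-999` are assumed conventions for missing entries, not something the slides specify.

```python
import numpy as np
import pandas as pd

# Hypothetical table with a duplicate row and placeholder values.
df = pd.DataFrame({
    "city": ["Chennai", "Chennai", "Madurai", "N/A"],
    "sales": [100, 100, 250, -999],
})

# Deduplication: keep the first occurrence of each identical row.
deduplicated = df.drop_duplicates()

# Replacing values: turn placeholder codes into proper missing values.
cleaned = deduplicated.replace({"N/A": np.nan, -999: np.nan})

print(cleaned)
```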
Transformation Techniques
• Dropping the Missing Values
– Row-wise
– Column-wise
– Based on threshold
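The three ways of dropping missing values correspond to arguments of pandas `dropna`. A minimal sketch on a hypothetical frame:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with missing entries.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [4.0, 5.0, np.nan, np.nan],
    "c": [7.0, 8.0, 9.0, 10.0],
})

# Row-wise: drop every row that contains at least one missing value.
rows_dropped = df.dropna()

# Column-wise: drop every column that contains at least one missing value.
cols_dropped = df.dropna(axis=1)

# Threshold: keep only rows with at least 2 non-missing values.
at_least_two = df.dropna(thresh=2)
```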
Transformation Techniques
• Filling the Missing Values
- Fill by zero value
- Fill by Forward/Backward Filling
- Fill by interpolating method
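The filling strategies above can be sketched with pandas `fillna`, `ffill`, `bfill`, and `interpolate` (linear by default). The series is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical series with gaps.
s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan, 6.0])

zero_filled = s.fillna(0)       # replace each gap with zero
forward_filled = s.ffill()      # carry the last valid value forward
backward_filled = s.bfill()     # pull the next valid value backward
interpolated = s.interpolate()  # linear interpolation between neighbors
```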
Descriptive Statistics
• Simple summaries of the entire dataset.
Central Tendencies
Mean
Median
Mode
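The three central tendencies can be computed directly in pandas. The marks below are hypothetical. The single high value pulls the mean above the median.

```python
import pandas as pd

# Hypothetical exam marks with one high value.
marks = pd.Series([55, 60, 60, 70, 95])

mean = marks.mean()           # arithmetic average
median = marks.median()       # middle value of the sorted data
mode = marks.mode().tolist()  # most frequent value(s); can be more than one

print(mean, median, mode)
```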
Descriptive Statistics
The mean/average might not always be the best representation of the dataset, so measures of dispersion are examined as well.
Measures of Dispersion
1. Standard Deviation
2. Variance
3. Skewness (measure of the asymmetry of a variable's distribution)
Positive Skewness
Symmetrical
Negative Skewness
4. Kurtosis (heaviness of the tails of the distribution; values below are excess kurtosis, where a normal distribution scores 0)
(≈ 0) Mesokurtic
(> 0) Leptokurtic
(< 0) Platykurtic
5. Percentile (the value below which a given percentage of the values in a dataset lie)
25%
50%
75%
100%
6. Quartiles
- Visualization of Quartiles
Skewness
• Asymmetry of the variable in the dataset
about its mean.
• Positive
• Negative
• Symmetrical
Skewness
Function: df.skew()
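A short sketch of the `skew()` call on three hypothetical series, one for each case listed above:

```python
import pandas as pd

# Hypothetical series illustrating the three cases.
right_skewed = pd.Series([1, 1, 2, 2, 3, 10])   # long tail to the right
symmetric = pd.Series([1, 2, 3, 4, 5])          # mirror image about the mean
left_skewed = pd.Series([1, 8, 9, 9, 10, 10])   # long tail to the left

positive = right_skewed.skew()
zero = symmetric.skew()
negative = left_skewed.skew()

print(positive, zero, negative)
```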
Kurtosis
Function: df.kurt()
• Kurtosis is a statistical measure that illustrates how heavily the tails of a distribution differ from those of a normal distribution.
• It helps identify whether a given distribution contains extreme values.
• It is a measure of outlier presence in a given distribution.
• High kurtosis → heavy tails → more outliers.
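Kurtosis
• There are three types of kurtosis. The values below are excess kurtosis, as returned by df.kurt(), where a normal distribution scores 0 (equivalently 3 on the Pearson scale):
Mesokurtic → K ≈ 0, tails similar to a normal distribution
Leptokurtic → K > 0, sharp peak and heavy tails → more outliers
Platykurtic → K < 0, flat peak and light tails → fewer outliers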
Percentile
Function: np.percentile(attribute, 50)
• A percentile is the value below which a given percentage of the values in a dataset lie; the call above returns the 50th percentile (the median).
Quartiles
• Quartiles are values that split the given
dataset into quarters.
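Quartiles are the 25th, 50th, and 75th percentiles, so they can be obtained with `np.percentile`. The data below are hypothetical. The interquartile range is added as a common follow-up and is not part of the slides.

```python
import numpy as np

# Hypothetical data: the integers 1 through 9.
data = np.arange(1, 10)

# Quartiles are the 25th, 50th and 75th percentiles.
q1, q2, q3 = np.percentile(data, [25, 50, 75])

# Interquartile range: spread of the middle half of the data.
iqr = q3 - q1

print(q1, q2, q3, iqr)
```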
Grouping Datasets
• Groupby Mechanisms
- Grouping by features, hierarchically
- Aggregating a dataset by groups
- Applying custom aggregation functions to
groups
- Transforming a dataset groupwise
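The groupby mechanisms listed above can be sketched as follows. The car table, its column names, and the derived column `price_vs_make_mean` are hypothetical.

```python
import pandas as pd

# Hypothetical car listings.
df = pd.DataFrame({
    "make": ["audi", "audi", "bmw", "bmw", "bmw"],
    "body": ["sedan", "wagon", "sedan", "sedan", "wagon"],
    "price": [30, 34, 40, 44, 48],
})

# Grouping hierarchically by two features.
by_make_body = df.groupby(["make", "body"])["price"].mean()

# Aggregating a dataset by groups with several built-in functions.
aggregated = df.groupby("make")["price"].agg(["min", "max", "mean"])

# Applying a custom aggregation function to each group.
price_range = df.groupby("make")["price"].agg(lambda s: s.max() - s.min())

# Transforming group-wise: the result keeps the original row count.
df["price_vs_make_mean"] = df["price"] - df.groupby("make")["price"].transform("mean")
```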
Grouping the Datasets
• Selecting a subset of columns
• Max and Min
• Mean
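Selecting a subset of columns before aggregating restricts the computation to the columns of interest. A minimal sketch with a hypothetical table:

```python
import pandas as pd

# Hypothetical car listings with more columns than are needed.
df = pd.DataFrame({
    "body": ["sedan", "sedan", "wagon", "wagon"],
    "price": [30, 40, 34, 50],
    "horsepower": [110, 150, 120, 180],
    "length": [4.5, 4.7, 4.8, 4.9],
})

# Select only the columns of interest after grouping.
grouped = df.groupby("body")[["price", "horsepower"]]

maximum = grouped.max()
minimum = grouped.min()
average = grouped.mean()

print(maximum, minimum, average, sep="\n\n")
```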
