EDA Visualization
Orozco Hsu
2023-10-31
About me
• Education
• NCU (MIS), NCCU (CS)
• Work Experience
• Telecom big data Innovation
• AI projects
• Retail marketing technology
• User Group
• TW Spark User Group
• TW Hadoop User Group
• Taiwan Data Engineer Association Director
• Research
• Big Data / ML / AIoT / AI columnist
Tutorial Content
• Data Science Process
• Exploratory Data Analysis and Visualization
• Homework
Code
• Download materials:
• https://drive.google.com/drive/folders/1KPC3K19_vJgRb5Op5M9bqQYJEh2zvrnr?usp=sharing
Get Orange 3 ready
• Version: 3.36.1
The Pyramid of Data Needs (and why it matters for your career) | by Hugh Williams | Medium
Static chart
• Drawing a chart generally takes three steps: observe the data, determine the relationship, and select the chart.
• Start with what type of data you have and what you want to express. Common data types:
• Category
• Numeric
• Text
• Datetime
• Once it is clear what you want to express, choose the chart that expresses it.
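The first step, working out which of the four data types a column holds, can be sketched in plain Python. This is only a rough heuristic, not how Orange detects types: the function name `infer_kind`, the `max_categories` threshold, and the ISO date format are all assumptions made for illustration.

```python
from datetime import datetime


def infer_kind(values, max_categories=10):
    """Guess whether a column of strings is numeric, datetime, category, or text."""
    present = [v for v in values if v not in ("", None)]
    if not present:
        return "unknown"

    def all_parse(parser):
        # True when every non-missing value is accepted by the parser.
        try:
            for v in present:
                parser(v)
        except ValueError:
            return False
        return True

    if all_parse(float):
        return "numeric"
    if all_parse(datetime.fromisoformat):
        return "datetime"
    # Assumption: a small set of repeated labels is a category, anything else is free text.
    if len(set(present)) <= max_categories:
        return "category"
    return "text"


# Illustrative values only.
print(infer_kind(["22", "38", ""]))               # numeric
print(infer_kind(["1912-04-10", "1912-04-11"]))   # datetime
print(infer_kind(["male", "female", "male"]))     # category
```

A numerically coded category (for example a class stored as 1, 2, 3) would be reported as numeric, so the result still needs a human check.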
Pie chart
• A pie chart needs a whole amount that is divided into a number of distinct parts.
• Its primary objective is to compare each group’s contribution to the whole.
Line chart
• Line charts provide the clearest graphical representation of time-related variables and are the preferred mode for representing trends or variables over time.
Histogram
• It summarizes discrete or continuous data measured on an interval scale.
• It is often used to show the major features of the data's distribution in a convenient form.
Bar chart
• It shows data values as bars, so that multiple data sets can be compared side by side.
Differences between histogram and bar chart
Comparison terms | Bar chart | Histogram
Usage | To compare different categories of data. | To display the distribution of a variable.
Type of variable | Categorical variables | Numeric variables
Rendering | Each data point is rendered as a separate bar. | The data points are grouped and rendered based on the bin value. The entire range of data values is divided into a series of non-overlapping intervals.
Space between bars | Can have space. | No space.
Reordering bars | Can be reordered. | Cannot be reordered.
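The difference in rendering can be sketched by computing the numbers behind each chart: one count per category for a bar chart, one count per interval for a histogram. The function names and the sample values below are made up for illustration and are not taken from titanic.csv.

```python
from collections import Counter


def category_counts(values):
    """Bar chart data: one bar per distinct category."""
    return Counter(values)


def histogram_counts(values, bin_width, start=0.0):
    """Histogram data: counts per non-overlapping interval [low, low + bin_width)."""
    counts = Counter()
    for v in values:
        counts[int((v - start) // bin_width)] += 1
    if not counts:
        return []
    # Bins stay in numeric order and empty bins are kept, so there are no gaps.
    return [
        (start + i * bin_width, start + (i + 1) * bin_width, counts[i])
        for i in range(min(counts), max(counts) + 1)
    ]


# Illustrative values only.
ports = ["S", "C", "S", "Q", "S", "C"]
ages = [4, 22, 25, 31, 38, 39, 54]

print(category_counts(ports).most_common())  # bars can be sorted in any order
print(histogram_counts(ages, bin_width=10))  # bins follow the number line
```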
Scatter Plot
• It uses dots to represent the values of two different numeric variables, which makes it possible to observe the relationship between them.
Box plot
• Q1: The first quartile (25%) position.
• Q3: The third quartile (75%) position.
• Interquartile range (IQR): Q3 − Q1.
• Lower and upper 1.5*IQR whiskers: These mark the boundaries beyond which observations count as outliers.
• Outliers: Observations that fall below Q1 − 1.5 IQR or above Q3 + 1.5 IQR.
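The quartile and outlier definitions above can be worked through in a few lines. This is a minimal sketch using Python's `statistics.quantiles`; the function name `box_plot_summary` and the sample values are invented for illustration, and the "inclusive" quartile method is an assumption (other tools may use a slightly different convention and report slightly different quartiles).

```python
import statistics


def box_plot_summary(values):
    """Return the quartiles, IQR, 1.5*IQR fences, and outliers of a numeric list."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        "outliers": outliers,
    }


# Illustrative values only, not taken from titanic.csv.
sample = [7, 8, 8, 10, 13, 15, 26, 30, 512]
summary = box_plot_summary(sample)
print(summary)
```

For this sample Q1 is 8 and Q3 is 26, so the IQR is 18, the fences sit at −19 and 53, and only 512 is flagged as an outlier.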
Dataset description
• This dataset is used to predict whether a passenger survived the Titanic disaster.
Data Summary
• Load titanic.csv
• Data description
• Names, Types, Role, Values
• Change the Columns
Data Summary
• Missing values
• Use the Feature Statistics widget
• What are the missing-value ratios?
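Outside Orange, the missing ratio of each column can be checked with a short script. The sketch below uses a few made-up rows rather than the real titanic.csv; the column names and the set of tokens treated as missing are assumptions.

```python
import csv
import io

# Made-up rows for illustration; not the contents of titanic.csv.
SAMPLE_CSV = """Survived,Pclass,Sex,Age,Cabin
0,3,male,22,
1,1,female,38,C85
1,3,female,,
0,3,male,,
"""


def missing_ratios(rows, missing_tokens=("", "?", "NA")):
    """Return the fraction of missing values per column."""
    rows = list(rows)
    if not rows:
        return {}
    ratios = {}
    for column in rows[0].keys():
        missing = sum(
            1 for row in rows if (row[column] or "").strip() in missing_tokens
        )
        ratios[column] = missing / len(rows)
    return ratios


reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
for column, ratio in missing_ratios(reader).items():
    print(f"{column}: {ratio:.0%}")
```

A column with a very high ratio is a candidate for removal, while a moderate ratio may be worth imputing.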
Preprocess (Remove or Impute columns)
• Remove columns
Preprocess (Remove or Impute columns)
• Impute columns
• Set a default method for all columns
• Set a method for each individual column
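The idea of a default method plus per-column overrides can be sketched as follows. Mean and most-frequent value are used here only as assumed examples of imputation methods; they are not necessarily the settings chosen in the deck, and the function names and sample values are hypothetical.

```python
import statistics


def impute_column(values, method):
    """Fill None entries using the mean or the most frequent present value."""
    present = [v for v in values if v is not None]
    if method == "mean":
        fill = statistics.mean(present)
    elif method == "most_frequent":
        fill = statistics.mode(present)
    else:
        raise ValueError(f"unknown method: {method}")
    return [fill if v is None else v for v in values]


def impute_table(columns, default_method, per_column=None):
    """Apply the default method to every column unless a column has its own."""
    per_column = per_column or {}
    return {
        name: impute_column(values, per_column.get(name, default_method))
        for name, values in columns.items()
    }


# Illustrative values only.
table = {
    "Age": [20, None, 40, 30],
    "Embarked": ["S", "C", None, "S"],
}
filled = impute_table(table, "mean", per_column={"Embarked": "most_frequent"})
print(filled)
```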
Pie chart
• Orange 3 has deprecated the Pie Chart widget
• Use a Python script instead
• Find the output file
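A script along these lines could produce the pie chart as an image file. This is a sketch under stated assumptions, not the script used in the deck: it assumes matplotlib is installed, the function names and the output file name `pie_chart.png` are hypothetical, and how the labels are read from the Orange input table is left out.

```python
from collections import Counter


def pie_shares(labels):
    """Return each group's share of the whole, which is what a pie chart shows."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def save_pie_chart(labels, path="pie_chart.png"):
    """Draw the shares as a pie chart and save it to a file."""
    # Imported here so the share calculation works even without matplotlib.
    from matplotlib.figure import Figure

    shares = pie_shares(labels)
    fig = Figure()
    ax = fig.subplots()
    ax.pie(list(shares.values()), labels=list(shares.keys()), autopct="%1.1f%%")
    ax.set_title("Share of each group")
    fig.savefig(path)
    return path


# Illustrative labels only.
print(pie_shares(["S", "S", "C", "Q"]))
```

Calling `save_pie_chart(labels)` writes the image into the current working directory, which is where to look for the output file.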
Line chart
• Trend-analysis charts are typically drawn from time-based data.
Distribution chart
• Used to show how frequently each value occurs
• In Orange 3, both numeric and categorical data can be presented here
• The bar chart widget is used less often than the others
Scatter plot
• It is used to observe the degree of correlation between features:
• Positive correlation
• Negative correlation
• No correlation
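The degree of correlation seen in a scatter plot can also be expressed as a number. The sketch below computes the Pearson correlation coefficient by hand; the 0.3 cut-off used to label the result is an arbitrary assumption, and the sample values are made up.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def describe(r, threshold=0.3):
    """Label a coefficient; the threshold is an assumption, not a standard."""
    if r >= threshold:
        return "positive correlation"
    if r <= -threshold:
        return "negative correlation"
    return "no clear correlation"


# Illustrative values only.
print(describe(pearson_r([1, 2, 3], [2, 4, 6])))
print(describe(pearson_r([1, 2, 3], [6, 4, 2])))
print(describe(pearson_r([1, 2, 3, 4], [1, -1, -1, 1])))
```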
Box plot
• Compare multiple features with each other
Pivot Table
• It summarizes the data of a more extensive table into a table of statistics.
• The statistics can include sums, averages, counts, etc.
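What a pivot table computes can be sketched with a dictionary keyed by (row group, column group). The function name `pivot`, the column names, and the sample rows are assumptions for illustration.

```python
from collections import defaultdict


def pivot(rows, index, column, value, aggregate):
    """Group `value` by (index, column) and reduce each group with `aggregate`."""
    cells = defaultdict(list)
    for row in rows:
        cells[(row[index], row[column])].append(row[value])
    return {key: aggregate(values) for key, values in cells.items()}


# Made-up rows for illustration; not taken from titanic.csv.
rows = [
    {"Sex": "female", "Pclass": 1, "Fare": 80.0},
    {"Sex": "female", "Pclass": 1, "Fare": 100.0},
    {"Sex": "male", "Pclass": 3, "Fare": 8.0},
    {"Sex": "male", "Pclass": 3, "Fare": 7.0},
    {"Sex": "male", "Pclass": 1, "Fare": 50.0},
]

counts = pivot(rows, "Sex", "Pclass", "Fare", len)
sums = pivot(rows, "Sex", "Pclass", "Fare", sum)
means = pivot(rows, "Sex", "Pclass", "Fare", lambda v: sum(v) / len(v))
print(counts)
print(sums)
print(means)
```

Passing a different `aggregate` function is what switches the table between counts, sums, and averages.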
1. Show me top 10 data rows
• Hint: Use the Data Sampler widget
2. Show me dataset info
• How many rows?
• How many features?
• Any other information of this kind
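For comparison with what Orange reports, the same basic information can be read from a CSV file with the standard library. In this sketch every CSV column is counted as a feature, including the target, which is an assumption; the sample rows are made up.

```python
import csv
import io

# Made-up rows for illustration; not the contents of titanic.csv.
SAMPLE_CSV = """Survived,Pclass,Sex,Age
0,3,male,22
1,1,female,38
1,2,female,
"""


def dataset_info(csv_file):
    """Return the row count, column count, and column names of a CSV file object."""
    reader = csv.DictReader(csv_file)
    rows = list(reader)
    names = list(reader.fieldnames or [])
    return {"rows": len(rows), "columns": len(names), "names": names}


info = dataset_info(io.StringIO(SAMPLE_CSV))
print(info)
```

For a file on disk, the same function can be called inside `with open("titanic.csv", newline="") as f:`, assuming the file sits in the working directory.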
3. Get a count of the number of survivors
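In script form this is a simple frequency count. The sketch assumes the target column is coded 1 for survived and 0 otherwise, and uses made-up values rather than the real data.

```python
from collections import Counter


def count_survivors(survived_column):
    """Count survivors (1) and non-survivors (0) in a 0/1 column."""
    counts = Counter(survived_column)
    return {"survived": counts[1], "not_survived": counts[0]}


# Made-up values for illustration.
survived = [0, 1, 1, 0, 0, 1, 0]
result = count_survivors(survived)
print(result)
```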
4. Survival Conclusion
• For the features SEX, PCLASS, SIBSP, PARCH, and EMBARKED:
• Women had a higher chance of survival than men.
• First-class passengers had a higher chance of survival.
• Passengers with siblings or spouses aboard had a higher chance of survival.
• Passengers with children or parents aboard had a higher chance of survival.
• Departing from port S may be associated with a lower cabin class and a lower chance of survival.
5. Show me the survival rate by sex
6. Look at survival rate by SEX and PCLASS
• Women in first class had a survival rate as high as 96.8%. In contrast, men in economy class had only a 13.54% chance of survival.
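A survival rate per group is the number of survivors in the group divided by the group size. The sketch below groups by any list of columns, so the same function covers the rate by sex alone and by sex and class together. The rows are made up, so the output will not reproduce the percentages quoted above; the column names are assumptions.

```python
from collections import defaultdict


def survival_rate(rows, keys, target="Survived"):
    """Return the share of survivors for each combination of the given columns."""
    totals = defaultdict(int)
    survivors = defaultdict(int)
    for row in rows:
        group = tuple(row[k] for k in keys)
        totals[group] += 1
        survivors[group] += row[target]
    return {group: survivors[group] / totals[group] for group in totals}


# Made-up rows for illustration; not taken from titanic.csv.
rows = [
    {"Sex": "female", "Pclass": 1, "Survived": 1},
    {"Sex": "female", "Pclass": 1, "Survived": 1},
    {"Sex": "female", "Pclass": 3, "Survived": 0},
    {"Sex": "female", "Pclass": 3, "Survived": 1},
    {"Sex": "male", "Pclass": 1, "Survived": 0},
    {"Sex": "male", "Pclass": 3, "Survived": 0},
    {"Sex": "male", "Pclass": 3, "Survived": 0},
    {"Sex": "male", "Pclass": 3, "Survived": 1},
]

by_sex = survival_rate(rows, ["Sex"])
by_sex_and_class = survival_rate(rows, ["Sex", "Pclass"])
print(by_sex)
print(by_sex_and_class)
```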
7. Look at survival rate by SEX, AGE and PCLASS
• In this dataset, women in first class or business class had a 90% chance of survival regardless of age.
• On the other hand, a man in economy class who was older than 18 had only a 13.36% chance of survival.
• To summarize, in a disaster scenario, girls and women have a higher chance of survival than boys and men.
• Additionally, the higher the class (such as first class), the higher the chance of survival.
8. The price paid in each class
• Try plotting Fare against Pclass to visualize the data.
• Every class had someone who boarded for free, while others spent over 500 pounds for a first-class ticket. It's quite an interesting observation!
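The range of fares in each class can be checked numerically before plotting. This sketch uses made-up rows, so the values are not the real fares; the column names `Pclass` and `Fare` and the function name are assumptions.

```python
def fare_range_by_class(rows):
    """Return the (lowest, highest) fare seen in each passenger class."""
    ranges = {}
    for row in rows:
        pclass, fare = row["Pclass"], row["Fare"]
        low, high = ranges.get(pclass, (fare, fare))
        ranges[pclass] = (min(low, fare), max(high, fare))
    return dict(sorted(ranges.items()))


# Made-up rows for illustration; not taken from titanic.csv.
rows = [
    {"Pclass": 3, "Fare": 8.0},
    {"Pclass": 1, "Fare": 60.0},
    {"Pclass": 2, "Fare": 13.0},
    {"Pclass": 1, "Fare": 0.0},
    {"Pclass": 3, "Fare": 0.0},
    {"Pclass": 2, "Fare": 0.0},
    {"Pclass": 1, "Fare": 500.0},
    {"Pclass": 3, "Fare": 20.0},
]

fare_ranges = fare_range_by_class(rows)
for pclass, (low, high) in fare_ranges.items():
    print(f"class {pclass}: {low} to {high}")
```

A minimum of zero in every class is the kind of pattern the observation above refers to.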