National Taipei University of Nursing and Health Sciences (NTUNHS)
Visualization
Orozco Hsu
2021-12-13
About me
• Education
  • NCU (MIS), NCCU (CS)
• Work Experience
  • Telecom big data innovation
  • AI projects
  • Retail marketing technology
• User Group
  • TW Spark User Group
  • TW Hadoop User Group
  • Taiwan Data Engineer Association Director
• Research
  • Big Data / ML / AIoT / AI columnist
Tutorial Content
• Exploratory Data Analysis (EDA)
• Using tools for EDA
• Informative Visualization
• Homework
Code
• Download code
  • https://github.com/orozcohsu/ntunhs_2021.git
• Folder
  • 20211213_inter_master
EDA
• EDA is the critical process of performing initial investigations on
data in order to discover patterns, spot anomalies, test hypotheses,
and check assumptions, with the help of summary statistics and
graphical representations.
EDA
• Useful Python packages for EDA:
  • matplotlib
  • pandas
  • seaborn
• A useful Python tool for interactive visualization:
  • dash
Ref: https://dash.plotly.com/basic-callbacks
Using pandas
• First, load the CSV file into a DataFrame.
• Then check the basic information of the DataFrame with these attributes and methods:
  • head()
  • tail()
  • shape
  • info()
  • describe(include='all')
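The inspection steps above can be sketched as follows. This is a minimal illustration rather than the course notebook: the inline CSV text and its column names are made up so that the block runs without an external file. With a real file, its path would be passed to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Hypothetical CSV content, inlined so the example is self-contained.
csv_text = """name,age,city
Amy,23,Taipei
Bob,35,Tainan
Cat,29,Taipei
Dan,41,Hsinchu
"""

df = pd.read_csv(io.StringIO(csv_text))  # load the CSV into a DataFrame

print(df.head(2))  # first two rows
print(df.tail(2))  # last two rows
print(df.shape)    # (rows, columns); an attribute, so no parentheses
df.info()          # column dtypes and non-null counts (prints directly)

# include="all" summarizes text columns (unique, top, freq) as well as
# numeric ones (mean, std, quartiles).
summary = df.describe(include="all")
print(summary)
```

Note that `shape` is an attribute while the others are method calls, which is why it is the only entry in the list without parentheses.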
Using pandas
• Visualize directly from the DataFrame with these methods:
  • corr
  • hist
  • scatter
  • line
  • bar
  • pie
  • boxplot
pandas.ipynb
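The methods listed on this slide can be tried out with a sketch like the one below. It is not taken from pandas.ipynb: the DataFrame and its column names are hypothetical, and the `Agg` backend is selected only so that the script also runs on a machine without a display. Each chart is drawn on its own subplot so that the plots do not overwrite each other.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical measurements, made up for illustration.
df = pd.DataFrame({
    "height": [150, 160, 165, 170, 175, 180],
    "weight": [45, 55, 60, 68, 72, 80],
    "group": ["a", "b", "a", "b", "a", "b"],
})

# corr() returns a table of pairwise correlations rather than a chart.
corr = df[["height", "weight"]].corr()
print(corr)

fig, axes = plt.subplots(2, 3, figsize=(12, 7))
df["height"].plot.hist(bins=3, ax=axes[0, 0])             # histogram
df.plot.scatter(x="height", y="weight", ax=axes[0, 1])    # scatter plot
df.plot.line(y="weight", ax=axes[0, 2])                   # line chart over the index

counts = df["group"].value_counts()                       # rows per category
counts.plot.bar(ax=axes[1, 0])                            # bar chart
counts.plot.pie(ax=axes[1, 1])                            # pie chart
df.boxplot(column=["height", "weight"], ax=axes[1, 2])    # box plots
fig.tight_layout()
```

In a notebook the figure renders inline; in a script, `fig.savefig(...)` would write it to a file.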
Using seaborn
• Seaborn offers a rich set of chart types built on top of matplotlib
and works with NumPy and pandas data structures.
  • heatmap
  • kdeplot/displot
    • cut, cumulative
  • jointplot
  • pairplot
  • lmplot
  • barplot
  • countplot
  • catplot
seaborn.ipynb
Boxplot
Ref: https://help.ezbiocloud.net/box-plot/
Ref: https://zh.wikipedia.org/wiki/File:Boxplot_vs_PDF.svg
Boxplot
• Given 20 sample points, sorted in ascending order:
  • 5,17,17,18,18,19,19,19,20,20,20,21,22,22,22,22,23,23,23
• Q1: position = (25/100) x 20 = 5, so Q1 = (X5+X6)/2 = (18+19)/2 = 18.5
• Q2: position = (50/100) x 20 = 10, so Q2 = (X10+X11)/2 = (20+20)/2 = 20
• Q3: position = (75/100) x 20 = 15, so Q3 = (X15+X16)/2 = (22+22)/2 = 22
• IQR = Q3 - Q1 = 22 - 18.5 = 3.5
• Fences:
  • Lower: Q1 - 1.5 x IQR = 18.5 - 5.25 = 13.25
  • Upper: Q3 + 1.5 x IQR = 22 + 5.25 = 27.25
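The quartile and fence arithmetic can be checked with a short sketch using only the Python standard library. Two assumptions apply: the values are entered exactly as printed in the list above, which shows 19 numbers, and the quartiles come from `statistics.quantiles` with `method="inclusive"` (linear interpolation between sorted points). Quartile conventions differ between textbooks and libraries, so other methods may give slightly different results on other data.

```python
import statistics

# Values as printed in the list (19 numbers appear there).
data = [5, 17, 17, 18, 18, 19, 19, 19, 20, 20,
        20, 21, 22, 22, 22, 22, 23, 23, 23]

# n=4 splits the data into quartiles and returns the three cut points.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

iqr = q3 - q1                   # interquartile range
lower_fence = q1 - 1.5 * iqr    # points below this are flagged as outliers
upper_fence = q3 + 1.5 * iqr    # points above this are flagged as outliers
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, q2, q3)                # 18.5 20.0 22.0
print(iqr)                       # 3.5
print(lower_fence, upper_fence)  # 13.25 27.25
print(outliers)                  # [5]
```

The value 5 falls below the lower fence, so a box plot of this data would draw it as a separate outlier point rather than extending the whisker to it.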
Regression hypothesis
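• For each predicted value, the observed response is assumed to follow a normal distribution centered on that prediction; equivalently, the residuals are assumed to be normally distributed.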
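One informal way to look at this assumption is to fit a line and inspect the residuals. The sketch below uses simulated data (the true line y = 2x + 1, the noise level, and the seed are all arbitrary choices for illustration) and an ordinary least-squares fit written out by hand. It then compares the share of residuals within one and two standard deviations with the roughly 68% and 95% expected under a normal distribution. It is a rough check, not a formal normality test.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Simulated data: a straight line plus normally distributed noise (sd = 1).
xs = [i / 20 for i in range(200)]
ys = [2 * x + 1 + random.gauss(0, 1) for x in xs]

# Ordinary least squares for a single predictor.
x_mean = statistics.fmean(xs)
y_mean = statistics.fmean(ys)
sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
sxx = sum((x - x_mean) ** 2 for x in xs)
slope = sxy / sxx
intercept = y_mean - slope * x_mean

# Residual = observed value minus predicted value.
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]

sd = statistics.stdev(residuals)
within_1sd = sum(abs(r) <= sd for r in residuals) / len(residuals)
within_2sd = sum(abs(r) <= 2 * sd for r in residuals) / len(residuals)

print(f"slope={slope:.3f} intercept={intercept:.3f}")
print(f"share within 1 sd: {within_1sd:.2f}, within 2 sd: {within_2sd:.2f}")
```

A histogram or box plot of `residuals` gives the same information visually and is often the first thing to look at.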
How to test for a normal distribution
• The following variables are approximately normally distributed:
  • Height of a population
  • Blood pressure of adult humans
  • Position of a particle that experiences diffusion
  • Measurement errors
  • Residuals in regression
  • Shoe size of a population
  • Amount of time it takes for employees to reach home
  • A large number of educational measures
How to test for a normal distribution
• A normal distribution is fully determined by two parameters: the
mean and the standard deviation.
  • Mean: the average of all points in the sample, computed by
  summing the values and dividing by the number of values.
  • Standard deviation: a measure of how far the values in the
  sample typically deviate from the mean.
Ref: https://www.varsitytutors.com/hotmath/hotmath_help/topics/normal-distribution-of-data
test_for_a_Normal_Distribution.ipynb
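The two parameters can be computed and plugged into a normal distribution with the standard library alone. The sketch below is an illustration under stated assumptions, not the content of the notebook: the sample values are made up, and `statistics.stdev` is used, which divides by n - 1 (the sample standard deviation).

```python
import statistics
from statistics import NormalDist

# Hypothetical sample, made up for illustration.
sample = [2, 4, 4, 4, 5, 5, 7, 9]

mean = sum(sample) / len(sample)  # sum of the values divided by their count
sd = statistics.stdev(sample)     # sample standard deviation (n - 1 denominator)

# A normal distribution is fixed once the mean and standard deviation are chosen.
dist = NormalDist(mu=mean, sigma=sd)

# Probability mass within one and two standard deviations of the mean.
within_1sd = dist.cdf(mean + sd) - dist.cdf(mean - sd)
within_2sd = dist.cdf(mean + 2 * sd) - dist.cdf(mean - 2 * sd)

print(mean)                  # 5.0
print(round(sd, 3))          # 2.138
print(round(within_1sd, 4))  # 0.6827
print(round(within_2sd, 4))  # 0.9545
```

The last two numbers are the same for every normal distribution, whatever its mean and standard deviation. Comparing them with the observed shares in a data set gives a quick, informal indication of whether the data look normal.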
Homework
• Visualize winequality-red.csv with the following charts, and report
your findings.
