Leland Wilkinson
Chief Scientist
H2O.ai
Automatic Visualization
Big Data
set cover (core sets)
Big Data
[Figure: two charts with both axes running from -10 to 90]
Outliers
• An anomaly is an observation inconsistent with a set of beliefs.
• The anomaly depends on these beliefs.
• An outlier is an observation inconsistent with a set of points.
• The points are presumed to be generated by a probabilistic process in a vector space.
• All outliers are anomalies, but not all anomalies are outliers.
• Some anomalies are logical or mathematical.
• Outliers are probabilistic.
• Outlier detection has a history of more than 200 years.
• The goal was to reduce bias in models.
• The goal today is to learn interesting stuff from examining outliers.
• Statisticians no longer delete outliers. They use robust methods.
Outliers
• Barnett & Lewis (1994). Outliers in Statistical Data.
• Rousseeuw & Leroy (1987). Robust Regression & Outlier Detection.
• Hartigan (1975). Clustering Algorithms.
Beauty is truth, truth beauty,—that is all
Ye know on earth, and all ye need to know.
Outliers
• Univariate outliers
• Distance from Center Rule
• Gaps Rule
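The slide names the two rules without defining them, so the following is only a sketch of one common reading. It assumes that "distance from center" means a robust z-score (median and MAD) and that the "gaps rule" means looking for an unusually large gap in the sorted values. The function names, the cutoff of 3.5, and the gap factor of 5 are all illustrative assumptions, not values from the talk.

```python
import statistics


def distance_from_center_outliers(values, threshold=3.5):
    """Flag values whose robust z-score exceeds `threshold` (assumed cutoff)."""
    center = statistics.median(values)
    mad = statistics.median([abs(v - center) for v in values])
    if mad == 0:
        return []
    scale = 1.4826 * mad  # makes the MAD comparable to a standard deviation
    return [v for v in values if abs(v - center) / scale > threshold]


def gap_outliers(values, factor=5.0):
    """Flag values cut off from the bulk by a gap larger than `factor` times the median gap."""
    xs = sorted(values)
    gaps = [b - a for a, b in zip(xs, xs[1:])]
    typical = statistics.median(gaps)
    if typical == 0:
        return []
    n = len(xs)
    flagged = []
    # Scan from the middle toward the upper end for the first large gap.
    for i in range(n // 2, n - 1):
        if gaps[i] > factor * typical:
            flagged.extend(xs[i + 1:])
            break
    # Scan from the middle toward the lower end for the first large gap.
    for i in range(n // 2 - 1, -1, -1):
        if gaps[i] > factor * typical:
            flagged.extend(xs[:i + 1])
            break
    return sorted(flagged)


data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.3, 9.7, 10.0, 25.0]
print(distance_from_center_outliers(data))
print(gap_outliers(data))
```

Both functions single out the value 25.0 in this toy sample. The first judges each point against the center, while the second looks only at the spacing between neighboring values.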
Outliers
• Multivariate outliers
• Distance from Center Rule
• Gaps Rule
Outliers
1. Map categorical variables to continuous values (SVD).
2. If p is large, use random projections to reduce dimensionality.
3. Normalize columns to [0, 1].
4. If n is large, aggregate.
• If p = 2, you could use gridding or hex binning.
• But the general solution is based on Hartigan’s Leader algorithm.
5. Compute nearest-neighbor distances between points.
6. Fit an exponential distribution to the largest distances.
7. Reject points in the upper tail of this distribution.
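Steps 3 to 7 above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the implementation behind the talk. The Leader radius (0.02), the share of distances treated as "largest" (the top half), and the tail probability (0.01) are made-up parameters, and the exponential fit is a plain maximum-likelihood estimate on the excesses over the smallest tail distance. Steps 1 and 2 (categorical mapping and random projection) are left out.

```python
import math
import random


def normalize_columns(rows):
    """Step 3: rescale every column to [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    spans = [(max(c) - min(c)) or 1.0 for c in columns]  # guard constant columns
    return [[(v - lo) / s for v, lo, s in zip(row, lows, spans)] for row in rows]


def leader_aggregate(rows, radius):
    """Step 4: single-pass Leader clustering.

    A row joins the first leader within `radius`; otherwise it becomes a leader.
    Returns the leaders and, for each leader, the indices of its member rows.
    """
    leaders, members = [], []
    for index, row in enumerate(rows):
        for k, leader in enumerate(leaders):
            if math.dist(row, leader) <= radius:
                members[k].append(index)
                break
        else:
            leaders.append(row)
            members.append([index])
    return leaders, members


def nearest_neighbor_distances(points):
    """Step 5: brute-force distance from each point to its nearest other point."""
    return [
        min(math.dist(p, q) for j, q in enumerate(points) if j != i)
        for i, p in enumerate(points)
    ]


def upper_tail_indices(distances, tail_fraction=0.5, alpha=0.01):
    """Steps 6-7: fit an exponential to the largest distances, flag its upper tail."""
    order = sorted(range(len(distances)), key=distances.__getitem__)
    tail = order[int(len(order) * (1 - tail_fraction)):]
    threshold = distances[tail[0]]  # smallest distance in the tail
    excess = {i: distances[i] - threshold for i in tail}
    mean_excess = sum(excess.values()) / len(excess)  # MLE of the exponential mean
    if mean_excess == 0:
        return []
    # Flag a point when the fitted survival probability of its excess is below alpha.
    return [i for i in tail if math.exp(-excess[i] / mean_excess) < alpha]


random.seed(0)
rows = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(300)]
rows.append([12.0, 12.0])  # planted outlier, row index 300

leaders, members = leader_aggregate(normalize_columns(rows), radius=0.02)
flagged = upper_tail_indices(nearest_neighbor_distances(leaders))
outlier_rows = sorted(i for k in flagged for i in members[k])
print(outlier_rows)
```

Because the distances are computed between leaders rather than raw rows, the quadratic nearest-neighbor step runs on the aggregated set only. Each flagged leader is then mapped back to the rows it represents.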
Outliers
• Low-dimensional projections are not reliable ways to discover high-dimensional outliers.
Outliers
• Parallel coordinates, SPLOMs, and other multivariate visualizations are not
reliable ways to discover high-dimensional outliers.
[Figure: panels A to F, plots of points labeled 1 to 6 on axes from -4 to 4]
Outliers
• Popular ML algorithms are not reliable ways to identify outliers.
Scagnostics
• We characterize a scatterplot (2D point set) with nine measures.
• We base our measures on three geometric graphs.
• Convex Hull
• Alpha Shape
• Minimum Spanning Tree
Scagnostics
• Each geometric graph is a subset of the Delaunay triangulation
Scagnostics
Shape
• Convex: ratio of the area of the alpha shape to the area of the convex hull.
• Skinny: ratio of perimeter to area of the alpha shape.
• Stringy: ratio of the diameter of the MST to the length of the MST. Similar to Skinny. The diameter of a graph is the longest shortest path between a pair of its vertices.
• Stringy (alternative definition): ratio of the number of 2-degree vertices in the MST to the number of vertices of degree greater than 1.
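As an illustration of the degree-based Stringy definition, here is a minimal sketch that builds a Euclidean minimum spanning tree with Prim's algorithm and counts vertex degrees. The function names are hypothetical, and the brute-force O(n²) MST is an assumption made for brevity. The slides derive the MST from the Delaunay triangulation, which this sketch does not do.

```python
import math


def minimum_spanning_tree(points):
    """Prim's algorithm on the complete Euclidean graph.

    Returns a list of (i, j, length) edges.
    """
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n  # cheapest known connection of each vertex to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=best.__getitem__)
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u, best[u]))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v], parent[v] = d, u
    return edges


def stringy(points):
    """Number of degree-2 MST vertices divided by number of vertices with degree > 1."""
    degree = [0] * len(points)
    for i, j, _ in minimum_spanning_tree(points):
        degree[i] += 1
        degree[j] += 1
    interior = sum(1 for d in degree if d > 1)
    return sum(1 for d in degree if d == 2) / interior if interior else 0.0


line = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0)]
star = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]
print(stringy(line), stringy(star))
```

Points along a line give an MST that is a single path, so every interior vertex has degree 2 and the measure is at its maximum. A star-shaped set has one high-degree hub and the measure drops to zero.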
Scagnostics
Density
• Skewed: the ratio (Q90 - Q50) / (Q90 - Q10), where the quantiles are taken from the MST edge lengths.
• Clumpy: 1 minus the ratio of the longest edge in the largest runt (blue) to the length of the runt-cutting edge (red). The Hartigan RUNT statistic for a node of a hierarchical clustering tree is the smaller of the number of leaves owned by each of its two children. We derive this for each vertex in the MST using an edge-cutting algorithm.
[Figure: MST with the largest runt and the longest edge in the runt highlighted]
• Outlying: proportion of total MST length due to edges adjacent to outliers.
Scagnostics
Density
• Sparse: 90th percentile of the distribution of edge lengths in the MST.
• Striated: proportion of all vertices in the MST that are degree-2 and have a cosine between adjacent edges less than -0.75.
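The quantile-based measures (Skewed and Sparse) and the angle-based Striated measure can be sketched as below. The sketch assumes the MST has already been computed and is passed in as a list of vertex-index pairs. The function names and the "inclusive" quantile interpolation are illustrative choices, since the slides do not say which quantile estimator is used.

```python
import math
import statistics


def skewed_and_sparse(lengths):
    """Return (Skewed, Sparse) from a list of MST edge lengths."""
    deciles = statistics.quantiles(lengths, n=10, method="inclusive")
    q10, q50, q90 = deciles[0], deciles[4], deciles[8]
    skewed = (q90 - q50) / (q90 - q10) if q90 > q10 else 0.0
    return skewed, q90  # Sparse is the 90th percentile itself


def striated(points, edges, cutoff=-0.75):
    """Share of all vertices that have degree 2 and a cosine below `cutoff`."""
    neighbors = {i: [] for i in range(len(points))}
    for i, j in edges:
        neighbors[i].append(j)
        neighbors[j].append(i)
    count = 0
    for v, adjacent in neighbors.items():
        if len(adjacent) != 2:
            continue
        # Vectors from the vertex to each of its two neighbors.
        (ax, ay), (bx, by) = (
            (points[u][0] - points[v][0], points[u][1] - points[v][1])
            for u in adjacent
        )
        cosine = (ax * bx + ay * by) / (math.hypot(ax, ay) * math.hypot(bx, by))
        if cosine < cutoff:
            count += 1
    return count / len(points)


# Hand-built example: four collinear points plus one point bending off at a right angle.
points = [(0, 0), (1, 0), (2, 0), (3, 0), (3, 1)]
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]  # the MST of these points, entered by hand
lengths = [math.dist(points[i], points[j]) for i, j in edges]

print(striated(points, edges))
print(skewed_and_sparse([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20]))
```

A cosine near -1 means the two edges meeting at a vertex point in nearly opposite directions, so the vertex lies on a locally straight stretch of the tree. That is what the -0.75 cutoff is selecting for.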
AutoVis
Graham Wills and Leland Wilkinson. 2010. AutoVis: automatic visualization.
Information Visualization 9, 1 (March 2010), 47-69.
H2O AutoViz
Future Plans
1. Add brushing to graphics.
2. Create a case-weight vector for Driverless AI (DAI), where 0 = exclude.
3. Suggest additional features to pass to DAI.
4. Animate visualizations.
5. Add natural language explanations to graphics.
Thank You!
Visualizing Big Data
• Complexity: Many functions are polynomial or exponential
• Curse of Dimensionality: distances tend toward constant as the dimensionality p grows
• Chokepoint: Cannot send big data over the wire
• Real Estate: Cannot plot big data on the client
• Cheesy solutions in 2D
• Pixelate (too complex for higher dimensions)
• Project (usually violates triangle inequality for )
• Image maps (OK for popups and simple links, not for EDA)
• Viable solutions
• Aggregate (big n) to a few thousand rows
• Project (big p) to a few dozen columns
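The curse-of-dimensionality point can be illustrated with a small experiment. This is only a sketch on uniform random data, and `relative_contrast` is a hypothetical helper name. It measures how much the farthest neighbor of a reference point differs from its nearest neighbor, relative to the nearest distance.

```python
import math
import random


def relative_contrast(n_points, p, rng):
    """(max - min) / min over distances from one point to the rest, uniform in [0, 1]^p."""
    points = [[rng.random() for _ in range(p)] for _ in range(n_points)]
    dists = [math.dist(points[0], q) for q in points[1:]]
    return (max(dists) - min(dists)) / min(dists)


rng = random.Random(0)
for p in (2, 10, 100, 1000):
    print(p, round(relative_contrast(200, p, rng), 3))
```

As p grows, the printed contrast shrinks toward zero: nearest and farthest neighbors end up at nearly the same distance, which is why distance-based displays and methods lose resolution in high dimensions.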

