Edwin de Jonge, December 3, 2013
Big Data Visualization
“Turning Statistics into Knowledge”, Aguascalientes
With thanks to Piet Daas, Martijn Tennekes
and Alex Priem
Overview
• Big Data
• Research ‘theme’ at Statistics Netherlands
• Data-driven approach
• Visualization as a tool
• Why?
• Examples in our office
• Census
• Social Security
• Social Media
• Not shown: Traffic loops, Mobile phone data
Why Visualization?
October 1st 2013, Statistics Netherlands
Effective Display!
(see Tor Norretranders, “Bandwidth of our senses”)
Anscombe’s quartet…
    DS1           DS2           DS3           DS4
  x      y      x      y      x      y      x      y
 10   8.04     10   9.14     10   7.46      8   6.58
  8   6.95      8   8.14      8   6.77      8   5.76
 13   7.58     13   8.74     13  12.74      8   7.71
  9   8.81      9   8.77      9   7.11      8   8.84
 11   8.33     11   9.26     11   7.81      8   8.47
 14   9.96     14   8.10     14   8.84      8   7.04
  6   7.24      6   6.13      6   6.08      8   5.25
  4   4.26      4   3.10      4   5.39     19  12.50
 12  10.84     12   9.13     12   8.15      8   5.56
  7   4.82      7   7.26      7   6.42      8   7.91
  5   5.68      5   4.74      5   5.73      8   6.89
Anscombe’s quartet
Property                                   Value
Mean of x1, x2, x3, x4                     All equal: 9
Variance of x1, x2, x3, x4                 All equal: 11
Mean of y1, y2, y3, y4                     All equal: 7.50
Variance of y1, y2, y3, y4                 All equal: 4.1
Correlation for ds1, ds2, ds3, ds4         All equal: 0.816
Linear regression for ds1, ds2, ds3, ds4   All equal: y = 3.00 + 0.500x
Looks the same, right?
Let’s plot!
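The summary statistics in the table can be checked with a short sketch. It uses only the Python standard library; the data are copied from the slide, while the names `QUARTET` and `summarize` are illustrative and not part of the original talk.

```python
from statistics import mean, variance

# Anscombe's quartet, copied from the table on the slide.
X_123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "DS1": (X_123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "DS2": (X_123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "DS3": (X_123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "DS4": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Return sample moments, Pearson correlation and the least-squares line."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return {
        "mean_x": mx,
        "var_x": variance(x),        # sample variance (n - 1 denominator)
        "mean_y": my,
        "var_y": variance(y),
        "corr": sxy / (sxx * syy) ** 0.5,
        "intercept": my - slope * mx,
        "slope": slope,
    }


for name, (x, y) in QUARTET.items():
    s = summarize(x, y)
    print(f"{name}: mean_y={s['mean_y']:.2f} var_y={s['var_y']:.2f} "
          f"corr={s['corr']:.3f} y = {s['intercept']:.2f} + {s['slope']:.3f}x")
```

The four data sets agree on every one of these numbers to about two decimals, which is why only a plot reveals how different they are.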
Visualization
For Big Data:
Use appropriate:
- Summarization
- Granularity
- Noise filtering
Research: What works for big data?
Scatter plot with 100 data points
Scatter plot with 100 000 data points
Example 1: Census
Example Virtual Census
‐ Every 10 years a Census needs to be conducted
‐ No longer with surveys in the Netherlands
• Last traditional census was in 1971
‐ Now by (re-)using existing information
• Linking administrative sources and available sample
survey data at a large scale
• Check result
• How?
• With a visualisation method: the Tableplot
Making the Tableplot
1. Load file (17 million records)
2. Sort records according to key variable (17 million records)
   • Age in this example
3. Combine records into 100 groups (170,000 records each)
   • Numeric variables
     • Calculate average (avg. age)
   • Categorical variables
     • Ratio between categories present (male vs. female)
4. Plot figure of a selected number of variables (up to 12)
   • Colours used are important
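Steps 2 and 3 above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used at Statistics Netherlands: the function name `tableplot_summary`, the record layout (a list of dicts) and the synthetic test records are all hypothetical, and the plotting step is left out.

```python
import random
from collections import Counter
from statistics import mean


def tableplot_summary(records, sort_key, numeric, categorical, n_bins=100):
    """Sort records on one variable, split them into bins, summarize each bin."""
    ordered = sorted(records, key=lambda r: r[sort_key])
    n = len(ordered)
    rows = []
    for i in range(n_bins):
        # Contiguous slice of the sorted records; sizes differ by at most one.
        chunk = ordered[i * n // n_bins:(i + 1) * n // n_bins]
        if not chunk:
            continue
        row = {"n": len(chunk)}
        for var in numeric:
            # Numeric variable: average within the bin.
            row[var] = mean(r[var] for r in chunk)
        for var in categorical:
            # Categorical variable: share of each category within the bin.
            counts = Counter(r[var] for r in chunk)
            row[var] = {cat: c / len(chunk) for cat, c in counts.items()}
        rows.append(row)
    return rows


# Synthetic records for illustration only (not census data).
rng = random.Random(0)
records = [{"age": rng.randint(0, 99), "sex": rng.choice(["male", "female"])}
           for _ in range(10_000)]
summary = tableplot_summary(records, "age", ["age"], ["sex"], n_bins=100)
```

Each row of `summary` corresponds to one horizontal band of the tableplot: a bar length for every numeric variable and a stacked bar of category shares for every categorical one.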
Tableplot of the census test file
Tableplot: Monitor data quality
– All data in the office pass through these stages:
‐ Raw data (collected)
‐ Preprocessed (technically correct)
‐ Edited (completed data)
‐ Final (removal of outliers etc.)
Processing of data
Raw (unedited) data
Edited data
Final data
Example 2: Social Security Register
Social Security Register
– Contains all financial data on jobs, benefits and
pensions in the Netherlands
‐ Collected by the Dutch Tax office
‐ A total of 20 million records each month
‐ How to obtain insight into so much data?
• With a visualisation method: a heat map
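The core of a heat map is two-dimensional binning: instead of drawing millions of points, count how many records fall into each cell of a grid and colour the cell by its count. The sketch below illustrates that idea with the standard library only; the function `bin_2d`, the bin widths and the synthetic (age, income) pairs are assumptions made for illustration and do not come from the register.

```python
import random
from collections import Counter


def bin_2d(points, x_width, y_width):
    """Count the points falling into each (x_bin, y_bin) grid cell."""
    counts = Counter()
    for x, y in points:
        # Floor division maps a value to the index of its bin.
        counts[(int(x // x_width), int(y // y_width))] += 1
    return counts


# Synthetic (age, monthly amount in euro) pairs, for illustration only.
rng = random.Random(42)
points = [(rng.randint(16, 90), rng.uniform(0, 6000)) for _ in range(100_000)]

# 5-year age bins by 1000-euro bins; each count would drive one cell's colour.
grid = bin_2d(points, x_width=5, y_width=1000)
```

The number of cells is fixed by the bin widths, so the size of the result does not grow with the number of records, which is what makes the approach workable for 20 million records a month.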
Heat map: Age vs. ‘Income’
[Figure: heat map; axes: Age, Income (euro); label: amount]
Example 3: Social media
Daily Sentiment in Dutch Social Media
Social media: daily sentiment in Dutch messages
Granularity: From day to week
Social media: daily & weekly sentiment in Dutch messages
Granularity: From day to month
Social media: daily, weekly & monthly sentiment in Dutch messages
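Changing granularity amounts to grouping the daily values by a coarser time key and averaging within each group. The following sketch shows one way to do that with the standard library; the helper names (`aggregate`, `iso_week`, `month`) and the synthetic daily series are hypothetical and stand in for the real sentiment data, which are not reproduced here.

```python
from collections import defaultdict
from datetime import date, timedelta
from statistics import mean


def aggregate(daily, key):
    """Average a {date: value} series within the groups defined by key(date)."""
    groups = defaultdict(list)
    for day, value in daily.items():
        groups[key(day)].append(value)
    return {k: mean(v) for k, v in sorted(groups.items())}


def iso_week(day):
    """Group key: (ISO year, ISO week number)."""
    year, week, _ = day.isocalendar()
    return (year, week)


def month(day):
    """Group key: (year, month)."""
    return (day.year, day.month)


# Synthetic daily series with a weekly cycle, 1 January to 28 February 2013.
start = date(2013, 1, 1)
daily = {start + timedelta(days=i): float(i % 7) for i in range(59)}

weekly = aggregate(daily, iso_week)
monthly = aggregate(daily, month)
```

Averaging over longer periods smooths out short-term fluctuation, at the cost of fewer data points per series.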
Enter: Consumer confidence!
Social media: monthly sentiment in Dutch messages & Consumer confidence
Corr: 0.88
Conclusions
Big data is a very interesting data source for official statistics
Visualisation is a great way of getting/creating insight
Not only for data exploration, but also for finding errors
The future of statistics?
