Edwin de Jonge
In cooperation with Piet Daas, Martijn Tennekes, Marco Puts, Alex Priem
Big Data
Case studies in Official Statistics
From an Official Statistics point of view
Three types of data:
1. Survey data = data collected by SN (Statistics Netherlands) with questionnaires
2. Admin data = administrative (register) data collected by third parties such as the Tax Office
3. Big data = machine-generated data of events
Big Data case studies
Big data = machine-generated data of events

Source               Statistics
Social media         Sentiment (as indicator for business cycle)
Mobile phone data    Daytime population, tourism statistics
Traffic loops        Traffic index statistics

At the end of this talk:
Visualization methods for Big Data
Overview
• Big Data
• Research ‘theme’ at Statistics Netherlands
• Data-driven approach
• Visualization as a tool
• Why?
• Examples in our office
• Issues & challenges
• From an official statistical perspective
Big data approach
Case study 1: Social media
– 3 billion messages as of 2009 gathered from Facebook, Twitter, LinkedIn, Google+ by the Dutch intermediary company Coosto.
– Sentiment per message determined by classifying words as negative or positive.
– Could be used as an indicator for the business cycle. Could it be fitted to consumer confidence, the leading business cycle indicator?
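A minimal sketch of word-based sentiment classification as described above. The word lists, the tie rule and the index definition (share of positive minus share of negative messages) are assumptions for illustration; the lexicon and scoring actually used in the study are not given on the slide.

```python
import re

# Hypothetical word lists, for illustration only.
POSITIVE_WORDS = {"good", "great", "happy", "love"}
NEGATIVE_WORDS = {"bad", "awful", "sad", "hate"}


def message_sentiment(text):
    """Return +1, -1 or 0 depending on which kind of word dominates."""
    words = re.findall(r"[a-z']+", text.lower())
    positives = sum(word in POSITIVE_WORDS for word in words)
    negatives = sum(word in NEGATIVE_WORDS for word in words)
    if positives > negatives:
        return 1
    if negatives > positives:
        return -1
    return 0


def sentiment_index(messages):
    """Share of positive messages minus share of negative messages."""
    if not messages:
        return 0.0
    scores = [message_sentiment(message) for message in messages]
    return (scores.count(1) - scores.count(-1)) / len(scores)
```

Computing `sentiment_index` per month would give a monthly series that can be compared with consumer confidence.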
Sentiment in social media
Platform specific sentiment
Table 1. Social media message properties for various platforms and their correlation with consumer confidence

Columns: social media platform; number of social media messages (1); number of messages as percentage of total (%); correlation coefficient of monthly sentiment index and consumer confidence, r (2), two values per platform.

Platform                  Messages (1)     % of total   r (2)    r (2)
All platforms combined    3,153,002,327    100           0.75     0.78
Facebook                    334,854,088    10.6          0.81*    0.85*
Twitter                   2,526,481,479    80.1          0.68     0.70
Hyves                        45,182,025    1.4           0.50     0.58
News sites                   56,027,686    1.8           0.37     0.26
Blogs                        48,600,987    1.5           0.25     0.22
Google+                         644,039    0.02         -0.04    -0.09
LinkedIn                        565,811    0.02         -0.23    -0.25
YouTube                       5,661,274    0.2          -0.37    -0.41
Forums                      134,984,938    4.3          -0.45    -0.49

(1) Period covered: June 2010 until November 2013
(2) Confirmed by visually inspecting scatterplots and additional checks (see text)
* Cointegrated
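The r values in Table 1 are correlation coefficients between two monthly series. A minimal sketch of that calculation, assuming a plain Pearson correlation; the function name and the example series are made up for illustration and are not the study's data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length (at least 2 values)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Made-up monthly values, for illustration only.
monthly_sentiment = [0.12, 0.10, 0.08, 0.11, 0.15, 0.14]
consumer_confidence = [-30, -33, -36, -32, -25, -27]
r = pearson_r(monthly_sentiment, consumer_confidence)
```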
Platform specific results
Granger causality reveals that Consumer Confidence precedes Facebook sentiment! (p-value < 0.001)
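A rough sketch of the idea behind a Granger causality test, assuming a single lag and ordinary least squares: does adding last month's value of one series improve the prediction of the other beyond its own past? The lag order and settings used in the study are not stated on the slide, and this sketch returns only the F statistic, not a p-value. All names are illustrative.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def residual_sum_of_squares(rows, targets):
    """Fit targets on rows by least squares (normal equations); return the RSS."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    beta = solve_linear_system(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    return sum((t - f) ** 2 for t, f in zip(targets, fitted))


def granger_f_statistic(cause, effect):
    """F statistic for 'cause precedes effect', using one lag."""
    if len(cause) != len(effect) or len(effect) < 5:
        raise ValueError("need two series of equal length (at least 5 values)")
    targets = effect[1:]
    # Restricted model: effect explained by its own previous value only.
    restricted = [[1.0, effect[t - 1]] for t in range(1, len(effect))]
    # Unrestricted model: also uses the previous value of the candidate cause.
    unrestricted = [[1.0, effect[t - 1], cause[t - 1]] for t in range(1, len(effect))]
    rss_r = residual_sum_of_squares(restricted, targets)
    rss_u = residual_sum_of_squares(unrestricted, targets)
    dof = len(targets) - 3
    return (rss_r - rss_u) / (rss_u / dof)
```

Calling `granger_f_statistic(consumer_confidence, facebook_sentiment)` and then swapping the arguments compares the two directions; a much larger F statistic in the first call would be in line with the result on the slide.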
Case study 2: mobile phone metadata
– Pilot study with a cell phone provider with a market share of 1/3 in the Netherlands.
– Aggregated data is queried by the intermediary company Mezuro and delivered to SN. Privacy is guaranteed!
– Applications: daytime population, tourism statistics, economic activity, mobility studies, etc.
Mobile phone population
MPRD (Municipal Personal Records Database) = Dutch population
Subpopulations model
Mobile phone metadata weighted to the MPRD.
MPRD data & Education Registers.
MPRD data only.
Mobile phone metadata
Event Detail Records (EDR) contain metadata on mobile phone events (i.e. call, SMS or data transfer).
Aggregated table: number of unique devices × time period × current region × residential region.
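A minimal sketch of how event records could be reduced to the aggregated table described above. The record layout (device id, time period, current region, residential region) and the function name are assumptions for illustration; the actual EDR format is not shown on the slide.

```python
from collections import defaultdict


def aggregate_unique_devices(events):
    """Count unique devices per (time period, current region, residential region).

    `events` is an iterable of tuples
    (device_id, time_period, current_region, residential_region).
    """
    devices_per_cell = defaultdict(set)
    for device_id, period, current_region, residential_region in events:
        cell = (period, current_region, residential_region)
        devices_per_cell[cell].add(device_id)
    # Only the counts leave this function; device ids are dropped.
    return {cell: len(devices) for cell, devices in devices_per_cell.items()}


# Made-up events: device "a" generates two events in the same cell.
example_events = [
    ("a", "2013-05-01 09h", "Amsterdam", "Almere"),
    ("a", "2013-05-01 09h", "Amsterdam", "Almere"),
    ("b", "2013-05-01 09h", "Amsterdam", "Almere"),
    ("c", "2013-05-01 09h", "Almere", "Almere"),
]
table = aggregate_unique_devices(example_events)
```

A device with several events in the same hour and region is counted once, which is what "unique devices" requires.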
Daytime population results
Almere: commuter town?
Foreigners at Schiphol Airport
Dutch population totals
Case study 3: Traffic loops
Traffic loop data
‐ Each minute (24/7) the number of passing vehicles is counted in around 20,000 ‘loops’ in the Netherlands (100 million records a day)
‐ Nice data source for transport and traffic statistics (and more)
Traffic loops on main roads
A close look at the highways around Utrecht
Traffic loops on main roads (2)
Traffic loops everywhere…
Traffic loops on main roads (3)
Highways simplified for analysis
Raw data: Total number of vehicles a day
[Figure, x-axis: Time (hour)]
Correct for missing data: macro level
Sliding window of 5 min. Impute missing data.
Before: Total = ~295 million detected vehicles
After: Total = ~330 million detected vehicles (+12%)
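One possible reading of the sliding-window imputation, as a sketch: a missing minute count is replaced by the mean of the observed counts in a 5-minute window centred on it. The centring, the use of the mean and the function name are assumptions; the slide does not specify the exact rule.

```python
def impute_missing_counts(counts, window=5):
    """Fill missing per-minute counts (None) from observed neighbours.

    Each missing value becomes the mean of the observed values inside a
    window of `window` minutes centred on it. If the window holds no
    observed value, the gap is left as None.
    """
    half = window // 2
    imputed = list(counts)
    for i, value in enumerate(counts):
        if value is not None:
            continue
        start = max(0, i - half)
        neighbours = [c for c in counts[start:i + half + 1] if c is not None]
        if neighbours:
            imputed[i] = sum(neighbours) / len(neighbours)
    return imputed


# Made-up counts for one loop; None marks a minute without a reading.
minute_counts = [10, None, 14, 12]
filled = impute_missing_counts(minute_counts)
```

Only original observations feed the mean, so imputed values never build on other imputed values.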
Data by type of vehicle
Small vehicles (<= 5.6 meter)
Medium vehicles (> 5.6 & <= 12.2 meter)
Long vehicles (> 12.2 meter)
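The three length classes can be written as a small function. The thresholds come from the slide; the function name and category labels are illustrative.

```python
def vehicle_category(length_m):
    """Classify a vehicle by its measured length in metres."""
    if length_m <= 0:
        raise ValueError("length must be positive")
    if length_m <= 5.6:
        return "small"
    if length_m <= 12.2:
        return "medium"
    return "long"
```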
All Dutch vehicles in September
Selectivity of big data
– Big Data sources may be selective when
‐ Only part of the population contributes to the data set (e.g. mobile phone owners)
‐ The measurement mechanism is selective (e.g. the placement of traffic loops on Dutch highways is not random)
– Many Big Data sources contain events
‐ How to associate events with units?
‐ Number of events per unit may vary.
– Correcting for selectivity
‐ Background characteristics – or features – are needed (linking with registers; profiling)
‐ Use predictive modeling / machine learning to produce population estimates
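As one simple illustration of correcting for selectivity with background characteristics, the sketch below uses post-stratification: units observed in the big data source are weighted up to register totals per stratum. This technique is swapped in for illustration and is not necessarily the method used by the authors; the strata, names and numbers are made up.

```python
def poststratification_weights(register_counts, observed_counts):
    """Weight per stratum = register population / units observed in the source."""
    weights = {}
    for stratum, population in register_counts.items():
        observed = observed_counts.get(stratum, 0)
        if observed == 0:
            raise ValueError(f"no observed units in stratum {stratum!r}")
        weights[stratum] = population / observed
    return weights


def weighted_total(observed_totals, weights):
    """Population estimate: sum of weight times observed total per stratum."""
    return sum(weights[stratum] * total for stratum, total in observed_totals.items())


# Made-up example: younger people are over-represented in the source.
register = {"young": 1000, "old": 500}
observed = {"young": 100, "old": 10}
weights = poststratification_weights(register, observed)
estimate = weighted_total({"young": 30, "old": 2}, weights)
```

The stratum with few observed units gets a large weight, which shows why strongly selective sources lead to unstable estimates.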
Why Visualization?
October 1st 2013, Statistics Netherlands
Effective Display!
(see Tor Nørretranders, “Bandwidth of our senses”)
Visualization of Big Data
– Large volume:
‐ Data binning or aggregation
– High velocity:
‐ Animations
‐ Dashboard / small multiples
– Large variety:
‐ Interactive interface
‐ Advanced visualization methods
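A minimal sketch of data binning for large volumes: individual records are reduced to counts per rectangular bin, and only the bin counts are drawn, for example as a heat map. The bin widths, the function name and the example records are assumptions for illustration.

```python
from collections import Counter


def bin_2d(points, x_width, y_width):
    """Count points per rectangular bin; keys are (x_bin, y_bin) indices."""
    counts = Counter()
    for x, y in points:
        counts[(int(x // x_width), int(y // y_width))] += 1
    return counts


# Made-up (age, income in euro) records.
records = [(23, 18000), (27, 19500), (41, 52000), (44, 58000), (29, 31000)]
bins = bin_2d(records, x_width=10, y_width=10000)
```

The number of bins, not the number of records, then determines how much has to be drawn.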
Tableplot: Dutch (Virtual) Census
Heat map: Age vs. ‘Income’
[Figure: heat map; axes: Age, Income (euro)]
Questions?
