Piet Daas, Statistics Netherlands &
Eindhoven University of Technology
Big Data in Official Statistics
Statistics Netherlands
Two locations: The Hague & Heerlen
Around 1,800 people
We produce statistics for the whole of the country
For this we need DATA and reliable METHODS
From Primary to Secondary to New Data sources
Surveys: targeted data collection, predetermined questions and indicators
Admin sources: structured data, collected by government, not produced for NSI purposes
Big Data: not collected for NSI purposes, data in high volume, velocity, variety
Timeline labels: 2000 BC, 20th century, 21st century, Censuses
CBDS Mission
The Center explores and exploits new data sources, applies state-of-the-art
methodology in collaboration with partners, in order to provide timely and
comprehensive information on social phenomena relevant to users.
New
Fast
More detail
Reduce burden
Center for Big Data Statistics
− Started in September 2016
− Set up to:
– Stimulate the creation of Big Data based statistics
− New, fast, more detail & reduce burden
– Increase the adoption of Big Data based statistics in our office
− Road sensor data and beyond
– Stimulate the exchange of knowledge on the use of Big Data
− New methodology
– Cooperate with partners and obtain funding
− Very successful
Examples of CBDS/Big Data work @ Stat. Neth.
1 - Road sensors
• First official Big Data based statistics
• Relation with GDP (?)
2 - Mobility of people
• commuting, visualization, day-time population, tourism
3 - Movement of ships
• AIS, transhipment
4 - Innovation
• Innovative companies and web pages
5 - Social media (using text)
• Sentiment, social tension, ‘wish to move’
6 - Solar panel detection (using images)
• Energy transition theme
1 - First official Big Data based statistics
Road sensor data
− Passing vehicle counts for each minute (24/7) by about 60,000 sensors
− 20,000 on the Dutch highways
− Types of sensors:
– Induction loop
– Camera
– Bluetooth
− Large volume: approx. 230 million records/day
Dutch highways
Dutch highways + road sensors (20,000 sensors on highways)
Minute data of 1 sensor for 196 days
‘Afsluitdijk’ (IJsselmeer dam)
‘Afsluitdijk’ (IJsselmeer dam) (2)
‘Reducing’ Big Data
Big Data steps (1), (2), (3)
1. Traffic intensity and GDP
- GDP
- Traffic
Correlation
• 91% from 2011-Q2 till 2014-Q4
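The 91% figure is a correlation between the quarterly GDP series and the traffic intensity series. A minimal sketch of how such a coefficient can be computed is given here; the index values and the names `gdp_index` and `traffic_index` are made up for illustration and are not CBS data.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical quarterly index values, for illustration only.
gdp_index = [100.0, 99.6, 99.1, 98.9, 99.0, 98.5, 98.2, 98.4, 98.9, 99.3]
traffic_index = [100.0, 99.8, 99.0, 98.7, 99.2, 98.6, 98.0, 98.5, 98.8, 99.5]

r = pearson(gdp_index, traffic_index)
print(f"correlation: {r:.2f}")
```

A high coefficient only says that the two series move together over the period studied; it does not by itself establish that traffic can predict GDP.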
2. Mobility
- Movement of people
- From home to work
- Assumption: all go by car
2. Mobility (2)
https://research.cbs.nl/colordotmap/woonwerk
- Develop visualizations to help understand the data
- Dotmap
Future plans
- Add public transport and bicycle as modes
- Model the choice of means of transport
2. Mobility: mobile phone data: Day Time Population (DTP)
- CDR/EDR data from Dutch mobile phone devices: weighted to the Dutch population minus Dutch people abroad (from the Holiday Survey)
- CDR roaming data: weighted to the total number of foreigners in the Netherlands (from other sources)
- Population and education registers: estimating without CDR/EDR data, by using admin data
- Result: Day Time Population
2. Mobility: mobile phone data: DTP (2)
• Hourly changes of mobile phone activity
• Only data for areas with > 15 events per hour
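The rule that only areas with more than 15 events per hour are shown can be sketched as a simple count-and-filter step. The event layout (one `(area, hour)` pair per phone event) and the example data are assumptions made for this illustration.

```python
from collections import Counter

MIN_EVENTS_PER_HOUR = 15  # threshold from the slide: keep only cells with > 15 events


def hourly_activity(events, min_events=MIN_EVENTS_PER_HOUR):
    """Count events per (area, hour) and drop cells at or below the threshold.

    `events` is an iterable of (area, hour) pairs, one pair per phone event.
    """
    counts = Counter(events)
    return {cell: n for cell, n in counts.items() if n > min_events}


# Hypothetical events: area "A" is busy, area "B" is nearly empty.
events = [("A", 9)] * 40 + [("A", 10)] * 16 + [("B", 9)] * 15 + [("B", 10)] * 3
published = hourly_activity(events)
print(published)
```

Cells with exactly 15 events are dropped because the slide states a strict "greater than" condition.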
Belgium: Mobile phone data: population density
2b. Mobile Phone data: (Inbound) tourism
− Required: roaming data in which devices can be traced for at least 2 weeks, preferably 1 month. In the old dataset, devices were re-hashed every day.
Mobility: mobile phone data (2)
Map panels: France, Eastern Europe
3. AIS Data
− Automatic Identification System
− Supplements Marine Radar
− “GPS signal of ships”
Used for
− Collision avoidance
− Fishing fleet monitoring
− Vessel Traffic Services (VTS)
− Maritime Security
− Aids to Navigation
− Accident investigation
− Fleet and Cargo Tracking
3. AIS data quality issues
• An AIS signal is basically composed of
– A unique identifier
– A geocode (latitude and longitude)
• All can be disturbed, resulting in:
– New unique identifiers (‘ghost ships’)
• Construct a frame of repeatedly observed ids
– Scrambled locations (the ship very likely is not, or has not been, there)
• Use a median filter (10 min slot)
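The two remedies above can be sketched as follows. This is a minimal illustration, not the CBS implementation: the message layout `(ship_id, timestamp_in_seconds, lat, lon)`, the ship ids and the cut-off `min_observations` are all assumptions, and the median is taken per coordinate.

```python
from collections import Counter, defaultdict
from statistics import median

SLOT_SECONDS = 600  # 10-minute slot, as on the slide


def build_frame(messages, min_observations=3):
    """Keep ship ids observed at least `min_observations` times.

    The cut-off is an illustrative assumption; the slide gives no value.
    """
    seen = Counter(ship_id for ship_id, _, _, _ in messages)
    return {ship_id for ship_id, n in seen.items() if n >= min_observations}


def median_positions(messages, frame):
    """Median latitude and longitude per ship and 10-minute slot."""
    slots = defaultdict(list)
    for ship_id, timestamp, lat, lon in messages:
        if ship_id in frame:
            slots[(ship_id, timestamp // SLOT_SECONDS)].append((lat, lon))
    return {
        key: (median(lat for lat, _ in points), median(lon for _, lon in points))
        for key, points in slots.items()
    }


# Hypothetical messages: (ship_id, timestamp in seconds, latitude, longitude).
messages = [
    ("244000001", 0, 52.00, 4.00),
    ("244000001", 120, 52.01, 4.01),
    ("244000001", 240, 12.00, 80.00),  # scrambled location
    ("244000001", 360, 52.02, 4.02),
    ("244000001", 480, 52.03, 4.03),
    ("999999999", 130, 52.50, 4.50),   # id seen only once: treated as a ghost ship
]

frame = build_frame(messages)
positions = median_positions(messages, frame)
print(positions)
```

The median is used rather than the mean because a single scrambled position would pull a mean far away from the actual track, while the median stays on it.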
3. Illustration of AIS errors
Automatic Identification System data of ships
3. AIS data: Netherlands
Data: Rijkswaterstaat AIS data
One month of data
Transshipment
https://research.cbs.nl/AIS_transshipment
4. Innovative companies: Web sites
− Web pages of companies provide information
– Can this be used to meet the SN need for information?
– Web pages can be ‘scraped’ fairly easily
− In this study we looked at:
– The potential of web pages to provide information on the innovative character of a company
– For both large and small companies
4. The Community Innovation Survey
− The Community Innovation Survey (CIS)
– Focuses on the innovativeness of companies
– Is a standardized European survey
− The questionnaire is sent every other year to about 10,000 companies in the Netherlands
– Stratified sample of companies
– Minimum number of working persons (WP) is 10
– So no info on small companies (such as start-ups!)
4. Need for URLs
− From the CIS response we took a sample of the Innovative and a sample of the Non-innovative companies
– 3,340 Innovative and 3,002 Non-innovative
− For each company we needed the corresponding website
– This was missing for 2/3 of the companies in our own Business register
− So URLs were additionally searched for
– Via Google (company name, address info etc.)
– Manually (to check correctness of URLs and to find missing ones)
4. Text mining: general overview
- Multistage model development
- Most common approach applied
1. ‘Text indexing’: Webpages → Preprocessing (lowercase, only characters, stop word removal, character length, stemming, ...)
2. ‘Text encoding’: Feature extraction (‘words’ and word combinations) → Document Term Matrix with TF-IDF values
3. ‘Text categorisation’: training/test set (based on survey data) → ‘Machine Learning’ algorithm (training, testing) → text-based classification model → Class (0/1)
More in: Jo, T. (2019). Text mining: concepts, implementation, and big data challenge. Springer.
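The first two stages of this pipeline (preprocessing and encoding into a document-term matrix with TF-IDF values) can be sketched in plain Python. This is an illustration under simplifying assumptions: the stop word list is a tiny made-up one, stemming is left out, only the letters a-z are kept, and the example pages are invented.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "and", "we", "our", "for", "a", "of"}  # tiny illustrative list


def preprocess(text, min_length=3):
    """Lowercase, keep letters only, drop stop words and short words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS and len(w) >= min_length]


def tfidf_matrix(documents):
    """Return (vocabulary, rows); rows[i][j] is the TF-IDF of term j in document i."""
    tokenized = [preprocess(doc) for doc in documents]
    vocabulary = sorted({w for words in tokenized for w in words})
    n_docs = len(tokenized)
    doc_freq = Counter(w for words in tokenized for w in set(words))
    rows = []
    for words in tokenized:
        counts = Counter(words)
        total = len(words) or 1  # avoid division by zero for empty pages
        rows.append([
            (counts[term] / total) * math.log(n_docs / doc_freq[term])
            for term in vocabulary
        ])
    return vocabulary, rows


# Hypothetical web page texts.
pages = [
    "We develop innovative software and data technology",
    "Buy shoes in our shop: exclusive sale",
    "Software for data analysis",
]
vocabulary, rows = tfidf_matrix(pages)
print(vocabulary)
```

Each row of the matrix can then be paired with the survey-based class (0/1) of the company and passed to a classification algorithm.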
4. Model building: important considerations
− Web pages were processed (html-files) and words extracted
– Effect of various pre-processing steps
– Later on: additional removal of words
− A supervised classification task
– Tested various algorithms (80/20 training/test set)
– Compared various metrics (Accuracy is best)
– Started with TF-IDF value in DTM (Log(TF-IDF+1) is better)
– Effect of including words above a specific number of characters (2, 3)
– Effect of including word embeddings (relation between words)
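Two of the considerations above, the Log(TF-IDF+1) transformation and the 80/20 training/test split, can be sketched as small helper functions. The matrix values, the labels and the fixed random seed are assumptions for illustration; the classifier itself is left out.

```python
import math
import random


def log_scale(matrix):
    """Apply log(x + 1) to every TF-IDF value in a document-term matrix."""
    return [[math.log(value + 1.0) for value in row] for row in matrix]


def split_80_20(rows, labels, seed=0):
    """Shuffle the documents and split them 80/20 into a training and a test set."""
    order = list(range(len(rows)))
    random.Random(seed).shuffle(order)
    cut = int(0.8 * len(order))
    train, test = order[:cut], order[cut:]
    return ([rows[i] for i in train], [labels[i] for i in train],
            [rows[i] for i in test], [labels[i] for i in test])


def accuracy(predicted, actual):
    """Share of documents whose predicted class equals the survey-based class."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical TF-IDF rows (two terms) and survey-based classes for ten companies.
tfidf_rows = [[0.0, 0.22], [0.08, 0.0], [0.0, 0.0], [0.15, 0.03], [0.4, 0.0],
              [0.0, 0.1], [0.3, 0.3], [0.05, 0.0], [0.0, 0.5], [0.2, 0.1]]
labels = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

scaled = log_scale(tfidf_rows)
x_train, y_train, x_test, y_test = split_80_20(scaled, labels)
print(len(x_train), len(x_test))
```

Repeating the split with different seeds and averaging the accuracy gives the kind of mean and standard deviation over many tries that the results table reports.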
Table 1. Accuracy results for the various classification algorithms tested. Default settings were used.
The average and standard deviation of 1000 tries are shown as percentages.

Algorithm                                 Words of 2+ chars (%)   Words of 3+ chars (%)   Words of 3+ chars incl. word embeddings (%)
Bernoulli Naïve Bayes                     87 ± 1                  60 ± 1                  61 ± 2
Logistic Regression (L1 regularization)   94 ± 1                  60 ± 2                  93 ± 1
Nearest Neighbors (k = 2)                 61 ± 2                  52 ± 1                  58 ± 1
Support Vector Machine (linear)           93 ± 1                  60 ± 1                  81 ± 1
Support Vector Machine (radial basis)     53 ± 1                  53 ± 1                  57 ± 3
Stochastic Gradient Descent               93 ± 1                  58 ± 2                  79 ± 3
Quadratic Discriminant Analysis           77 ± 8                  57 ± 2                  56 ± 2
Neural Network (multi-layer perceptron)   92 ± 1                  62 ± 2                  74 ± 3
Decision Tree                             94 ± 1                  54 ± 1                  61 ± 2
Random Forests                            94 ± 1                  56 ± 2                  64 ± 1
Gradient Tree Boosting                    94 ± 1                  59 ± 1                  71 ± 1

2 character words (such as ‘nl’, ‘de’, ‘en’) dominated the model and were therefore ignored
4. External validation & model stability
− Tested the model on:
– Web sites of start-ups (92% innovative)
– Web sites of small companies (WP < 10) (around 33% innovative)
− However, long-term stability was found to be an issue
– Solved this by including additional classified data
– This is a standard Machine Learning way to deal with this problem
4. Model details
− A single model including words in both languages (Dutch and English)
− 88% Accuracy over various datasets
− 580 stemmed words included in the model
− An English website is a positive indication for innovation (compared to Dutch)
− Depending on the language, there are words that are clearly positively or negatively associated with innovation.
– Positive: Technology, innovation, software, data
– Negative: Sale, buy, shopping cart, exclusive, service
4. What about Small innovative companies?
− The Model is developed on large companies
– All WP >= 10
– But tests on small company data demonstrated that the Model can be used to detect small innovative companies as well
− Startups & manual checking of BR sample
– Subsequently applied the model to websites of all companies included in the SN Business register for which the website is known
− Scraped ~850,000 web sites
− Census-based approach
4. Small Innovative companies: Web sites
Based on URLs included in the Business Register of CBS (~850,000)
4. Small innovative companies
Map labels: Amsterdam, Eindhoven
4. Results of both approaches
− Results of new (Big Data) and old (survey) approaches compared
− For large companies (WP >= 10)
– Old: 19,916 technologically innovative companies (± 680)
– New: 19,276 technologically innovative companies (± 23)
− For small companies (WP >= 2 and WP < 10)
– Old: no estimate of technologically innovative companies
– New: 33,599 technologically innovative companies
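The two estimates for large companies can be compared with a few lines of arithmetic. The slides do not say what the ± values represent, so the sketch below simply treats ± 680 as a margin around the survey estimate; that reading is an assumption.

```python
def within_margin(estimate, reference, margin):
    """True if `estimate` lies inside reference ± margin."""
    return abs(estimate - reference) <= margin


old_estimate, old_margin = 19_916, 680  # survey-based values from the slide
new_estimate = 19_276                   # web-based value from the slide

difference = old_estimate - new_estimate
relative_difference = difference / old_estimate
inside = within_margin(new_estimate, old_estimate, old_margin)
print(difference, f"{relative_difference:.1%}", inside)
```

Under that reading, the web-based estimate falls inside the margin of the survey-based one.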
5 - Social media studies
Only publicly available messages are used!
Examples
5.1. Sentiment in social media
• What is the development of the average sentiment in social media over time?
5.2. Feelings of safety/social tension
• Can social media be used to measure specific feelings in (the online) society?
5.3. Propensity to move (‘Wish to move’)
• Can we identify messages of people that wish to move to another house?
5.1. Social media: sentiment
5.1. Social media sentiment
− Facebook and Twitter messages both contribute
− Daily data is highly volatile
− Monthly aggregates correlate well with consumer confidence (> 0.9)
− Including sentiment series improves the accuracy of consumer confidence series (survey data)
− Product:
• Averaged monthly or smoothed weekly online Dutch sentiment could be a potential new indicator
• Can also be produced for large Dutch cities
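Because the daily series is highly volatile, it is aggregated to months before being compared with consumer confidence. A minimal sketch of that aggregation step follows; the daily scores and the name `daily_sentiment` are invented for illustration.

```python
from collections import defaultdict
from datetime import date


def monthly_means(daily_values):
    """Average a {date: value} series per calendar month."""
    buckets = defaultdict(list)
    for day, value in daily_values.items():
        buckets[(day.year, day.month)].append(value)
    return {month: sum(values) / len(values) for month, values in sorted(buckets.items())}


# Hypothetical daily sentiment scores.
daily_sentiment = {
    date(2015, 1, 1): 0.10, date(2015, 1, 2): -0.30, date(2015, 1, 3): 0.20,
    date(2015, 2, 1): 0.40, date(2015, 2, 2): 0.00,
}
monthly = monthly_means(daily_sentiment)
print(monthly)
```

The resulting monthly series is what would be correlated with the monthly consumer confidence figures.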
5.2. Social tension indicator
− Develop a timely social media based indicator
− What does a fast (real-time) statistic look like?
– Based on all previous experiences with social media
– Making use of the typical strengths of social media
− On Twitter people really want to be the first to report interesting news.
− Started with the idea of a ‘real-time’ safety monitor
– Can we measure the feeling of safety/unsafety online?
5.2. Social tension indicator (2)
− Interviewed people on the type of words used to express safety/unsafety
– Ended up with a list of ~350 words
− Checked how often these words were used on Twitter
– Used Coosto access, a nearly complete Dutch Twitter database
– Only ~150 words are used frequently enough online
– Removed messages of people from Flanders as well as possible
− An interesting profile emerged, but:
– Are we measuring safety/unsafety feeling?
– Check what the peaks represent
Social unrest
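The word-list approach above amounts to counting, per day, the messages that contain at least one listed word. A minimal sketch follows; the three words and the messages are hypothetical, since the actual list of about 150 words is not given in the slides.

```python
import re
from collections import Counter


def daily_word_hits(messages, word_list):
    """Count, per day, the messages that contain at least one word from the list."""
    words = {w.lower() for w in word_list}
    hits = Counter()
    for day, text in messages:
        tokens = set(re.findall(r"\w+", text.lower()))
        if tokens & words:
            hits[day] += 1
    return dict(hits)


# Hypothetical word list: "unsafe", "afraid", "fear" in Dutch.
safety_words = ["onveilig", "bang", "angst"]
messages = [
    ("2015-11-14", "Ik voel me onveilig in de stad"),
    ("2015-11-14", "Bang na het nieuws van vanavond"),
    ("2015-11-14", "Lekker weer vandaag"),
    ("2015-11-15", "Geen angst, gewoon doorgaan"),
]
hits = daily_word_hits(messages, safety_words)
print(hits)
```

Peaks in such a daily series then need to be checked by hand to see what they actually represent.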
5.2. Social tension indicator: Filtered (+ peaks)*
Peaks labelled in the figure:
• Remembrance day, A’dam, May 4, 2010
• Project X, Haren, Sept 22, 2012
• Love parade, Duisburg, July 24, 2010
• MH17 disaster, July 17, 2014
• Charlie Hebdo, Paris, Jan 7, 2015
• Attacks, Paris, Nov 14, 2015
• Armenian children gone, Sept 8, 2018
• Attacks, Brussels, Mar 22, 2016
• Attack Manchester, May 22, 2017
5.2. Social tension indicator: next step
5.3. Social media: ‘Wish to move’
− Current topic of research
• Social media contains messages that indicate a ‘wish’ of people to move to another house (on all platforms)
• Select messages containing ‘verhuiz*’ or ‘verhuis*’
• Created a model to identify messages of people that wish to move (accuracy 0.87)
• Checked if accounts that posted ‘wish to move’ messages actually moved (~50% moved)
• Study time series to check at what frequency such an indicator could best be produced
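The selection on ‘verhuiz*’ or ‘verhuis*’ can be expressed as a single regular expression. The sketch below shows one way to do it; the example messages are invented, and case-insensitive matching is an assumption.

```python
import re

# 'verhuiz*' or 'verhuis*': any word starting with "verhuiz" or "verhuis"
# (Dutch word stems for moving house).
MOVE_PATTERN = re.compile(r"\bverhui[zs]\w*", re.IGNORECASE)


def select_move_messages(messages):
    """Return the messages that contain a word matching the pattern."""
    return [m for m in messages if MOVE_PATTERN.search(m)]


# Hypothetical messages.
messages = [
    "Wij gaan volgend jaar verhuizen naar Utrecht",
    "Verhuisdozen gezocht!",
    "Mooi huis gezien vandaag",
    "Ik ben net verhuisd",
]
selected = select_move_messages(messages)
print(selected)
```

This step only narrows down the candidate messages; deciding whether a message really expresses a wish to move is left to the classification model.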
5.3. Social media: ‘Wish to move’ (2)
− Tried various models on training and test set of messages
− Logistic regression model contains single words, bigrams and trigrams
Algorithm                          Classification accuracy
Logistic Regression (L1 penalty)   0.87
Random Forest                      0.87
Support Vector Machine             0.85
Boosted Trees                      0.87
Deep learning                      0.88
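The logistic regression model uses single words, bigrams and trigrams as features. A minimal sketch of how such features can be generated from a tokenized message follows; the helper name `ngrams` and the example sentence are illustrative.

```python
def ngrams(tokens, max_n=3):
    """Return all word n-grams of length 1..max_n as space-joined strings."""
    features = []
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            features.append(" ".join(tokens[start:start + n]))
    return features


# Hypothetical message: "I would like to move" in Dutch.
tokens = "ik wil graag verhuizen".split()
features = ngrams(tokens)
print(features)
```

Word combinations such as "wil graag verhuizen" carry the intent that a single word like "verhuizen" does not.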
5.3. Social media: ‘Wish to move’ (3)
− Subsequent steps
• Collect messages containing ‘verhuiz*’ or ‘verhuis*’ for the whole period studied
• Remove messages from non-Dutch users
– Mainly people from the upper part of Belgium (Flemish people)
– Found to be around 15% of all messages
• Apply the LR model to the remaining messages to determine the number of ‘Wish-to-move’ messages per day
• Plot the results; the series was very noisy, so a smoothing filter was applied
– Our first application of a spline filter
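The effect of smoothing a noisy daily series can be illustrated with a centred moving average. Note that this is a simpler stand-in chosen for the sketch, not the spline filter mentioned on the slide, and the daily counts are invented.

```python
def moving_average(series, window=7):
    """Centred moving average; the window shrinks at both ends of the series."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    smoothed = []
    for i in range(len(series)):
        chunk = series[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Hypothetical noisy daily counts of 'Wish-to-move' messages.
daily_counts = [12, 40, 9, 35, 11, 38, 10, 42, 13]
smoothed = moving_average(daily_counts, window=3)
print(smoothed)
```

A wider window gives a smoother line but responds more slowly to real changes in the series.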
However: not every Big Data based approach is successful
Trend should be upward
6. Solar panels
6. Deep solaris project: overview
− Problem statement: CBS does not have a comprehensive dataset (neither via surveys nor via administrative data) to create statistics on the energy transition
− Suggested solution: train a machine learning algorithm to identify solar panels on aerial pictures
6. Aerial pictures: Solar panel detection
6. Apply Deep Learning
− Train a neural network to detect solar panels
− Output class: YES / NO
− 40 million weights is kind of a black box
6. Need to have examples of houses with solar panels
− From solar panel registration of Dutch tax office
6. Need to have examples of houses with solar panels (2)
6. Results of model on various aerial pictures
CBDS results: Beta product web site
https://www.cbs.nl/en-gb/our-services/innovation/
Big Data and official statistics with examples of their use
Thank you for your attention
