Responsible Data Science at
Statistics Netherlands:
implications for big data research
Piet Daas
Senior Methodologist, Theme coordinator Big Data research
& lead Data Scientist at the Center for Big Data Statistics
Responsible Data Science @ CBS
– CBS
‐ About CBS and the CBS law
‐ A relevant example
‐ Responsible Statistics
– Big Data
‐ Center for Big Data Statistics
‐ Responsible Data Science
‐ Implications (challenges) for Big Data
‐ Some examples of the things we do
Statistics Netherlands
Heerlen
The Hague
Centraal Bureau voor de Statistiek
Statistics Netherlands mission
Publish reliable and coherent statistical information that
responds to the needs of Dutch society.
– Independent organization
– When Statistics Netherlands requests information:
• companies and institutes are legally obliged to cooperate
• persons and households provide this on a voluntary basis
– To reduce the response burden, SN has access to registers (admin
data) of governmental and semi-governmental organisations.
– We present facts with a ‘short story’ (but no ‘cherry picking’)
– Information is made available to everyone at the same time, free
of charge.
About Statistics Netherlands
Statistics Netherlands was founded in 1899
It started in 2 rooms at ‘het Binnenhof’ with 5 employees
We currently have ~2000 employees
(max in 1982: 3600)
We produce >500 statistics per year
80% of them are based on EU regulations
There is a solid legal basis for accessing all kinds of
data and for processing personal data:
- the CBS law
- the CBS data collection law
The Statistics Netherlands law
It is our intention to ‘burden’ people and companies as little as possible
with requests for data
- Re-use as much data as possible, such as data collected by others
- Increasing use of admin data, hence our interest in Big Data
Handling data since 1899
Responsible Statistics
– Social Statistical Database (SSD)
– Combination of predominantly administrative data on
persons with a number of surveys
– The SSD is used for a whole range of social statistics, for
social research and for the virtual census
What’s in the SSD?
The data is combined at the individual level and covers a
long period (start date 1999)
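Combining sources at the individual level can be pictured with a minimal sketch. It assumes each source is keyed by the same pseudonymous person identifier; the source names, fields and the helper `link_sources` are hypothetical and only illustrate the idea, not the actual SSD implementation.

```python
# Hypothetical extracts, each keyed by a pseudonymous person identifier.
population = {"a1": {"birth_year": 1980}, "b2": {"birth_year": 1975}}
tax = {"a1": {"income": 31000}, "b2": {"income": 42000}}
survey = {"a1": {"employed": True}}  # a survey covers only a sample of persons


def link_sources(*sources):
    """Merge records from several sources that share the same identifier."""
    linked = {}
    for source in sources:
        for person_id, fields in source.items():
            # Create the person record on first sight, then add the new fields.
            linked.setdefault(person_id, {}).update(fields)
    return linked


people = link_sources(population, tax, survey)
```

Persons missing from a source simply lack its fields, which mirrors the fact that survey data exist only for sampled persons while register data cover (nearly) everyone.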
How does the SSD work?
Privacy
SSD ‘under the hood’
– All data is processed in our most secure internal
environment
– Personally Identifiable Data (PID, such as CSN, addresses and
names) are removed ASAP from data files
– The CSN is converted to a so-called RIN number (a
non-identifiable unique number)
– Researchers only get access to the variables they need
(nothing more; even for SN-colleagues)
– Output is rigorously checked for disclosure (if there is a
risk, part of the data is perturbed)
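The replacement of identifying fields by a pseudonym can be illustrated with a minimal sketch. It assumes a keyed hash (HMAC-SHA256) as the pseudonymization function; the slides do not say how RIN numbers are actually generated, and `SECRET_KEY`, `IDENTIFYING_COLUMNS`, `to_pseudonym` and `strip_identifiers` are hypothetical names.

```python
import hashlib
import hmac

# Hypothetical secret; it would have to stay inside the secure environment.
SECRET_KEY = b"example-key-not-for-real-use"

# Columns treated as directly identifying in this sketch (an assumption).
IDENTIFYING_COLUMNS = {"csn", "name", "address"}


def to_pseudonym(csn: str) -> str:
    """Map an identifier to a stable pseudonym that cannot be reversed without the key."""
    digest = hmac.new(SECRET_KEY, csn.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def strip_identifiers(record: dict) -> dict:
    """Drop identifying columns and add a pseudonym that can still be used for linking."""
    cleaned = {k: v for k, v in record.items() if k not in IDENTIFYING_COLUMNS}
    cleaned["rin"] = to_pseudonym(record["csn"])
    return cleaned


# Made-up record, for illustration only.
record = {"csn": "123456782", "name": "J. Jansen", "address": "Main St 1", "income": 31000}
safe = strip_identifiers(record)
```

Because the same input always yields the same pseudonym, records from different files can still be combined per person after the identifying columns are gone.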
Responsible Statistics (2)
– Fairness
‐ PIDs are removed as early in the process as possible
‐ The data-minimization principle is applied
‐ Data is re-used as much as possible (to reduce the response burden).
– Accuracy
‐ Only well-established methods are used. They should be part of the
Methodology series of Statistics Netherlands (or published in
journals)
‐ Quality checks and confidence intervals should be available
– Confidentiality
‐ Statistics are produced on non-(de)identifiable (aggregated) units
‐ All output is checked for disclosure and (locally) perturbed if needed.
– Transparency
‐ The way data is processed and combined, and the estimation models
used, should be clearly described and internally available (no
‘personally’ produced statistics).
Responsible Data Science at Statistics Netherlands
Goals of the CBDS
- New, more detailed, real-time statistics
- Reduce the data collection footprint
- Deepen knowledge of Big Data methodology
- Address privacy aspects of the use of Big Data in official statistics
- Offer an ecosystem to exchange knowledge and resources
Data integration
Admin Data Sources
• Tax Data
• Population Register
• Insurance Register
• ...
Surveys
• Labor Force Survey
• Safety Monitor
• International Trade
• ...
Big Data
• Sensor Data
• Social Media
• Internet Data
• ...
Statistical Output
• Safety
• Mobility
• Income
• Tourism
• Environmental
• Labor force
• Census
• Health
• Economy
• Energy
CBDS
Responsible Data Science
– Fairness
‐ PIDs are removed as early in the process as possible
‐ The data-minimization principle is applied
‐ Data is re-used as much as possible (low response burden).
– Accuracy
‐ Only well-established methods are used. They should be part of the
Methodology series of Statistics Netherlands (or published in
journals)
‐ Quality checks and confidence intervals should be available (bias?)
– Confidentiality
‐ Statistics are produced on non-(de)identifiable (aggregated) units
‐ All output is checked for disclosure and (locally) perturbed if
needed. Effect of adding Big Data sources?
– Transparency
‐ The way data is processed and combined, and the estimation models
used, must be clearly described and available for everyone involved
(no ‘personally’ produced statistics). What about new processes?
Responsible Data Science: Fairness
– Fairness
‐ PIDs are removed as early in the process as possible
• Are there PIDs available?
• Identifying units in Big Data is sometimes very hard
• Some data, such as the text of a tweet, is also a PID.
• This is publicly available data, ‘consciously’ put on the
internet by a user
‐ The data-minimization principle is applied
• Our way of working is to use as much data as is available,
because of the low information content of Big Data (the data
is used for a purpose other than the one intended)
Responsible Data Science: Accuracy
– Accuracy
‐ Only well-established methods are used. They should be part
of the Methodology series of Statistics Netherlands (or
published in journals)
• There are hardly any well-established Big Data methods
at the moment (those used for satellite data?)
‐ Quality checks and confidence intervals should be
available (and bias?)
• Fully automated quality checks are needed
• What about (reasonable) confidence intervals of ML-based
methods?
• Isn’t bias more important (selectivity of data in the
source)?
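One generic way to attach a confidence interval to an estimate for which no analytic formula is available is the percentile bootstrap. The sketch below is an illustration under assumptions, not a method named in the slides: `bootstrap_ci` is a hypothetical helper, the data are synthetic, and the interval only reflects sampling variability. It does not correct for selectivity of the source, which is the bias question raised above.

```python
import random
import statistics


def bootstrap_ci(sample, stat=statistics.mean, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for `stat`, computed on `sample`."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n = len(sample)
    # Recompute the statistic on resamples drawn with replacement.
    estimates = sorted(stat(rng.choices(sample, k=n)) for _ in range(n_resamples))
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


data = [12, 15, 11, 19, 14, 13, 17, 16, 12, 18]  # synthetic values
low, high = bootstrap_ci(data)
```

The same function accepts any statistic that maps a list to a number, so a model-based estimate could be passed as `stat` in place of the mean.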
Responsible Data Science: Confidentiality
– Confidentiality
‐ Statistics are produced on non-(de)identifiable
(aggregated) units
‐ All output is checked for disclosure and (locally)
perturbed if needed.
‐ Effect of adding Big Data sources?
• Can we guarantee de-identification when more and more
Big Data is added to a statistical output?
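A very simple form of output checking is small-cell suppression: counts below a minimum are withheld. The sketch below swaps in suppression for the perturbation the slides mention, because it is the simplest rule to show; the threshold of 10 and the helper `protected_counts` are assumptions, not the rules Statistics Netherlands applies.

```python
from collections import Counter

MIN_CELL_COUNT = 10  # assumed threshold, not taken from the slides


def protected_counts(values, min_count=MIN_CELL_COUNT):
    """Count units per category and suppress cells below the threshold."""
    counts = Counter(values)
    return {
        category: (n if n >= min_count else None)  # None marks a suppressed cell
        for category, n in counts.items()
    }


# Synthetic example: the smallest group could reveal individuals.
regions = ["north"] * 25 + ["south"] * 12 + ["island"] * 3
table = protected_counts(regions)
```

A real check would also have to prevent a suppressed cell from being recovered from totals or from other published tables, which is where adding more sources makes the guarantee harder.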
Responsible Data Science: Transparency
– Transparency
‐ The way data is processed and combined, and the estimation
models used, must be clearly described and available for
everyone involved (no ‘personally’ produced statistics).
‐ Estimation models
• No black boxes. How transparent are ML-based
models?
‐ What about new processes?
• New kinds of processes emerge, e.g. starting to process
the data at the location of the data holder (not at SN)
• Example: Traffic index based on road sensor data
Process for road sensor data
Figure: Big Data steps (1), (2) and (3)
What’s next?
– Statistics Netherlands is currently changing from
Responsible Statistics to Responsible Data Science
– Clearly, additional work is needed to fully enable this
– This is important as it is an essential step in unleashing
the (full) potential of Big Data.
– For Statistics Netherlands, the latter leads to:
‐ New products
‐ More readily available statistics
‐ Improved quality of existing products
‐ Assurance that the work of SN remains relevant for the
Netherlands
Work at the Center for Big Data Statistics
- At the Center for Big Data Statistics
1. Case studies/beta products
2. Methodological & exploratory research
- Examples of our work
‐ Income data: Visualisation 2D/3D
‐ Road sensor data: Traffic intensity and GDP
‐ Scanner data: ‘Gingerbread’ index
‐ Twitter: Social tension indicator
‐ Webpages: Identifying web-only shops
‐ Webpages: Identifying innovative companies
Heat map: Age vs. ‘Income’
Figure: heat maps of income (euro) vs. age, by gender
A 3D heat map: Age vs. Income vs. Amount
Figure: 3D heat map of income vs. age, with amount on the third axis
Road sensor data
More on: https://www.cbs.nl/en-gb/our-services/innovation
Traffic intensity vs GDP
Scanner data
More on: https://www.cbs.nl/nl-nl/onze-diensten/innovatie
Turnover of ‘gingerbread’ specific to the Saint Nicholas festivities
(2015 and 2016: weekly)
Twitter data
http://research.cbs.nl/socialtension/NL/
Social tension indicator (daily)
Web pages
– From Common Crawl archive ‘2016-07’
– Found:
‐ approx. 60 million websites
‐ approx. 50,000 Dutch web shops
‐ 12,670 web shops in scope
Web-only shops in the Netherlands
Tip of the Iceberg
Thank you for your attention!
@pietdaas
Questions?
