ESSnet Big Data WP8
Methodology (+ Quality, +IT)
Deliverables prepared by: WP8 members
BDES 2018 - Sofia,
14-15 May 2018
• Introduction Piet Daas
• IT Jacek Maślankowski
• Quality Magdalena Six
• Methodology Valentin Chavdarov & Piet Daas
• Literature study Jacek Maślankowski
• Discussant Faiz Alsuhail
• Questions + Discussion All
Overview of this session
• Aim of WP8 is to generalize the findings of the pilots in ESSnet Big
Data and relate them to the conditions for future use of big data
sources within the European Statistical System.
• Only active in SGA-2 (January 2017 - May 2018)
• Focus on Methodology, Quality and IT-infrastructure
Overview of WP8
• Based on real world experiences
– Work performed in WP 1-7 of ESSnet and other work relevant for
NSIs (or similar)
• Broad area: focus on most important topics
– In three areas: IT, Quality and Methodology
– Identify the most important topics (for each area) at the start of
WP8 during a workshop with experts
– To assure a sufficiently ‘blended’ view on BD
• Follow a bottom-up approach
Starting points of WP8
• Identified most important topics for
• IT
– 10 in total
• Quality
– 7 in total
• Methodology
– 11 in total
• Topics based on the BD ‘state of the art’ in January 2017
– One topic emerged in each of the 3 areas
Results of the workshop
• A Process ‘view’ on Big Data
– IT: Data Processing Life Cycle
– Quality: Process Chain Control
– Methodology: Data Process Architecture
– This is important
• GSBPM (Generic Statistical Business Process Model) provides a general
view on NSI processes
Common topic
Big Data process: Data driven
This differs from the approach commonly used in official statistics
IT Report
in the ESSnet Big Data
Deliverable 8.3 of WP8
Prepared by: WP8 members
Jacek Maślankowski, Statistics Poland
1. Big Data processing life cycle
2. Metadata management (ontology)
3. Format of Big Data processing
4. Data-hub and Data-lake
5. Data source integration
6. Choosing the right infrastructure
7. List of secure and tested APIs
8. Shared libraries and documented standards
9. Speed of algorithms
10. Training/skills/knowledge
Information covered in the report
Conceptual Big Data platform
Data processing and data storage
Data type (examples): storage; tools
• Batch (static data)
– Structured (RDBMS, DBF, ...): relational database, files; Hadoop, MySQL, ...
– Unstructured (text, website, ...): files, NoSQL; Hadoop, Solr, ...
– Semi-structured (CSV, JSON, XML, ...): files, NoSQL or relational databases; Hadoop, HBase, ...
• Streaming (real-time data)
– Sensors (TXT or CSV files): in-memory processing engine; Spark, Kafka, ...
– Web (websites): in-memory processing engine; Spark, Storm, ...
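As a rough illustration, the mapping on this slide from processing mode and data type to storage and tools can be written as a small lookup table. This is only a sketch: the dictionary layout, the key names and the `suggest` function are hypothetical and introduced here for illustration; the entries themselves are copied from the slide.

```python
# Lookup table transcribed from the slide. The structure and the key names
# ("examples", "storage", "tools") are illustrative assumptions.
PLATFORM_MAP = {
    ("batch", "structured"): {
        "examples": "RDBMS, DBF",
        "storage": "relational database, files",
        "tools": ["Hadoop", "MySQL"],
    },
    ("batch", "unstructured"): {
        "examples": "text, website",
        "storage": "files, NoSQL",
        "tools": ["Hadoop", "Solr"],
    },
    ("batch", "semi-structured"): {
        "examples": "CSV, JSON, XML",
        "storage": "files, NoSQL or relational databases",
        "tools": ["Hadoop", "HBase"],
    },
    ("streaming", "sensors"): {
        "examples": "TXT or CSV files",
        "storage": "in-memory processing engine",
        "tools": ["Spark", "Kafka"],
    },
    ("streaming", "web"): {
        "examples": "websites",
        "storage": "in-memory processing engine",
        "tools": ["Spark", "Storm"],
    },
}


def suggest(processing: str, data_type: str) -> dict:
    """Return the slide's entry for a processing mode and data type."""
    key = (processing.lower(), data_type.lower())
    if key not in PLATFORM_MAP:
        raise KeyError(f"no entry on the slide for {key}")
    return PLATFORM_MAP[key]


if __name__ == "__main__":
    print(suggest("Streaming", "Sensors")["tools"])
```

The tool lists on the slide end with "...", so the table should be read as examples rather than a complete inventory.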
GitHub repositories
1. Awesome Official Statistics software
– Link: https://github.com/SNStatComp/awesome-official-statistics-software
– Main features: list of useful statistical software with links to other GitHub repositories, by CBS NL
2. ONS (Office for National Statistics) UK Big Data team
– Link: https://github.com/ONSBigData
– Main features: various software developed by the ONS UK Big Data team
3. ONS (Office for National Statistics) UK Data Science Campus
– Link: https://github.com/datasciencecampus
– Main features: various software developed by the ONS Data Science Campus team
4. ESTP Big Data course software
– Link: https://github.com/SNStatComp/ESTPBD
– Main features: various software developed for the ESTP Big Data training courses
APIs used
1. Twitter API
– Basic functionality: scrape tweets by keywords, hashtags, users; streaming scraping
– Restrictions: 25 to 900 requests per 15 minutes; access only to public profiles
– Potential domains (WP number): Population, Social Statistics, Tourism (WP2, WP7)
– Remarks: account and API code needed
2. Facebook Graph API
– Basic functionality: collect information from public profiles, including very specific items such as photo metatags
– Restrictions: mostly present information; typically no more than dozens of requests
– Potential domains (WP number): Population (WP7)
– Remarks: account and API code needed
3. Google Maps API
– Basic functionality: looking for any kind of object (e.g., hotels), verification of addresses, monitoring the traffic on specific roads
– Restrictions: free up to 2,500 requests per day; USD 0.50 per 1,000 additional requests, up to 100,000 daily, if billing is enabled
– Potential domains (WP number): Tourism (WP7)
– Remarks: Google account and API code needed
4. Google Custom Search API
– Basic functionality: searches through one website; with modifications it searches for keywords across the whole Internet; can be used to find the URL of a specific enterprise
– Restrictions: the JSON/Atom Custom Search API provides 100 search queries per day for free; additional requests cost USD 5 per 1,000 queries, up to 10,000 queries per day
– Potential domains (WP number): Business (WP2)
– Remarks: Google account and API code needed
5. Bing API
– Basic functionality: finding the URL of a specific enterprise
– Restrictions: 7 queries per second (QPS) per IP address
– Potential domains (WP number): Business (WP2)
– Remarks: AppID needed
6. Guardian API
– Basic functionality: collect news articles and comments from the Guardian website
– Restrictions: free for non-commercial use; up to 12 calls per second and up to 5,000 calls per day; access to article text; access to over 1,900,000 pieces of content
– Potential domains (WP number): Population, Social Statistics (WP7)
– Remarks: registered account needed
7. Copernicus Open Access Hub
– Basic functionality: access to Sentinel-1 and Sentinel-2 repositories
– Restrictions: free for registered users
– Potential domains (WP number): Agriculture (WP7)
– Remarks: registered account needed
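The quota figures in this table lend themselves to a quick back-of-the-envelope calculation when planning a collection run. The sketch below uses only numbers quoted on the slide (they reflect 2018 terms and may no longer apply). It assumes that additional requests are billed pro rata and that the daily maximum includes the free quota; both are simplifying assumptions, and the function names are hypothetical.

```python
def daily_cost_usd(requests: int, free_per_day: int,
                   usd_per_1000: float, max_per_day: int) -> float:
    """Cost of one day's requests under a 'free quota, then pay per 1,000' scheme.

    Assumes pro-rata billing of the requests above the free quota.
    """
    if requests < 0:
        raise ValueError("requests must be non-negative")
    if requests > max_per_day:
        raise ValueError(f"{requests} requests exceed the daily cap of {max_per_day}")
    billable = max(0, requests - free_per_day)
    return billable * usd_per_1000 / 1000


def min_interval_seconds(max_requests: int, window_seconds: float) -> float:
    """Even spacing between requests that just stays within a rate limit."""
    return window_seconds / max_requests


# Google Custom Search: 100 free queries/day, USD 5 per 1,000 extra, cap 10,000/day.
custom_search_cost = daily_cost_usd(2_100, free_per_day=100,
                                    usd_per_1000=5.0, max_per_day=10_000)

# Google Maps: 2,500 free requests/day, USD 0.50 per 1,000 extra, cap 100,000/day.
maps_cost = daily_cost_usd(10_500, free_per_day=2_500,
                           usd_per_1000=0.50, max_per_day=100_000)

# Twitter: at the upper limit of 900 requests per 15 minutes.
twitter_gap = min_interval_seconds(900, 15 * 60)

if __name__ == "__main__":
    print(custom_search_cost, maps_cost, twitter_gap)
```

Under these assumptions, 2,100 Custom Search queries in a day leave 2,000 billable queries, and 900 requests per 15 minutes works out to one request per second.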
1. There is no unified framework for Big Data metadata
management.
2. All WPs share common ground on tools and data
storage.
3. Data-lakes and data-hubs have not yet been explored in depth.
4. There are best practices for using the different APIs.
5. NSIs share software through GitHub repositories.
6. There is no unified framework for data source integration.
7. A variety of training courses helps data scientists build
the required skills.
Main findings
Report on
Quality Aspects of Big Data
in the ESSnet Big Data
Deliverable 8.2 of WP8
Prepared by: WP8 members
Magdalena Six, Statistics Austria
In relation to cause(s) of errors:
• Coverage, Accuracy and Selectivity
• Processing errors
• Linkability
• Measurement errors
• Model errors and precision
In relation to changes in the composition of the source
• Comparability over time
• Process chain control
7 Quality Aspects of Big Data
7 Quality Aspects in the Context of
UNECE’s Quality Framework for BD
• 3 Phases of the business process: Input, Throughput, Output
• 3 Hyperdimensions: Source, Data, Metadata
Structure of the Report on Quality in
the ESSnet Big Data
7 Chapters according to the 7 identified quality aspects
Same structure for each chapter:
1. Introduction: meaning of the respective quality aspect in the
context of Big Data
2. Examples and Methods: role of the respective quality aspect
in WP1-WP7
3. Discussion: Challenges for the quality aspect, cross connections
to other Chapters in the Quality Report, but also to IT and
Methodology Report
Examples for new (?), BD specific (?)
Error Sources
• Scrambling of the Automatic Identification System (AIS) signal of ships in WP4 ->
measurement or coverage error?
• Scraping of a deceptive job vacancy ad -> measurement or coverage error?
• Non-stable access to the BD source, change in the technological process
generating the BD, change in the use of BD-generating devices -> comparability
over time
• Multiple layers of (new) processing steps required (advanced techniques for
editing, imputation, linking, text mining algorithms, ...), which bring
new error sources
• Deduction of information about the target variable from other variables via
modelling; models based on small-sample statistical inference don’t work
Quality Measures: Challenges from the
past and Challenges ahead
• Still in the experimental phase
• Often no routine, no regular access to Big Data source
• Focus in WPs more on potential sources and potential access to
sources than on a standardized reporting of quality measures
• Experimental phase shows: Big Data sources, as well as processes
needed to work with these sources are so diverse that the
development of standardized quality measures / a quality framework
will be challenging
Report on
Methodology
in the ESSnet Big Data
Deliverable 8.4 of WP8
Prepared by: WP8 members
Valentin Chavdarov & Piet Daas
Why Big data methodology?
1. A good part of statistical methodology is built
around survey data. Many conventions in
statistical methodology reflect the failure
of surveys to capture important economic
and social phenomena.
2. Big Data is a by-product of modern society. Little
is known about the data generation process and
the units included.
3. Working in a data-driven way is new for NSIs.
Methods and principles are needed to ensure that
valid conclusions are drawn when using Big Data.
Big data methodology issues
1. Assessing accuracy
2. What should our final product look like?
3. Deal with spatial dimension
4. Changes in data sources
5. Machine learning in official statistics
6. Data linkage
7. Secure multi-party computation
8. Inference
9. Sampling
10. Data process architecture
11. Unit identification problem
Big data methodology issues
(cont.)
• Methodological issues differ in scope. Assessing accuracy,
for example, covers almost all stages of the statistical
production process: from collecting data through processing to data
analysis.
• Some of the issues are BD-specific: data linkage; changes in data
sources; the unit identification problem.
Risk of social sciences datafication
There are three ways in which Big Data can be used for official statistics
1) Survey based, as an additional source
to improve survey based estimation (~ WP2, WP7,
sentiment NL)
2) Census based, as the main/single source
Whole target population is included (WP4, road sensor NL)
3) Incomplete, as the main/single source
Only part of the target population is included (WP1, WP3, ...)
Need to correct for that
Using Big Data
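For the third case above, one common way to "correct for that" is post-stratification: estimate the quantity within strata whose population sizes are known from another source, then weight the stratum estimates by population share. The slides do not say which correction WP8 recommends, so the sketch below shows only one possible approach; the strata, values and function names are made up for illustration.

```python
def naive_mean(sample: dict) -> float:
    """Unweighted mean over all observed units, ignoring who is missing."""
    values = [v for stratum_values in sample.values() for v in stratum_values]
    return sum(values) / len(values)


def poststratified_mean(sample: dict, population_counts: dict) -> float:
    """Weight each stratum mean by its known share of the target population."""
    total = sum(population_counts.values())
    estimate = 0.0
    for stratum, count in population_counts.items():
        values = sample.get(stratum)
        if not values:
            # A stratum with no observations cannot be corrected by weighting.
            raise ValueError(f"no observations for stratum {stratum!r}")
        estimate += (count / total) * (sum(values) / len(values))
    return estimate


# Hypothetical Big Data source in which one group is over-represented.
sample = {"young": [10.0, 12.0, 14.0, 12.0], "old": [20.0]}
# Hypothetical population counts, e.g. taken from a register.
population = {"young": 500, "old": 500}

naive = naive_mean(sample)
corrected = poststratified_mean(sample, population)

if __name__ == "__main__":
    print(naive, corrected)
```

With these made-up numbers the unweighted mean is pulled towards the over-represented group, while the weighted estimate gives both groups their population share. The approach only helps if the stratifying variable is observed in the source and explains the selectivity.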
Methodology
• Bias & models
• Data-driven way of working
• Machine Learning (2 places)
• Linking (e.g. geo-loc)
• Unit identification (features)
Quality
• Coverage
• Sources of error
• Editing data
• Linkability
IT
• Choosing right infra
• Training/skills/knowledge
• Big Data libraries
• Programming languages
In these areas new methods are needed and are being developed!
More important/New to Big Data
Literature study
in the ESSnet Big Data
Deliverable 8.1 of WP8
Prepared by: WP8 members
Jacek Maślankowski, Statistics Poland
• Bibliographic data
• Link
• Short overview (strengths, weaknesses)
• Data sources
• Domains
• Keywords
• Classification (A – very relevant, B – relevant, C – less relevant)
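The fields listed above describe one record of the literature overview. A minimal sketch of such a record as a data structure follows; the class, field and function names are hypothetical, the example entries are placeholders rather than real items from the WP8 overview, and the slides do not say how the overview is actually stored.

```python
from dataclasses import dataclass, field
from enum import Enum


class Relevance(Enum):
    """Classification scale quoted on the slide."""
    A = "very relevant"
    B = "relevant"
    C = "less relevant"


@dataclass
class LiteratureEntry:
    bibliographic_data: str
    link: str
    short_overview: str  # strengths and weaknesses
    classification: Relevance
    data_sources: list[str] = field(default_factory=list)
    domains: list[str] = field(default_factory=list)
    keywords: list[str] = field(default_factory=list)


def by_relevance(entries: list, level: Relevance) -> list:
    """Select the entries with a given classification."""
    return [e for e in entries if e.classification is level]


# Placeholder records for illustration only.
entries = [
    LiteratureEntry(
        bibliographic_data="Author A (year). Placeholder title one.",
        link="https://example.org/one",
        short_overview="Strengths: ...; weaknesses: ...",
        classification=Relevance.A,
        data_sources=["web scraping"],
        domains=["business"],
        keywords=["enterprise URLs"],
    ),
    LiteratureEntry(
        bibliographic_data="Author B (year). Placeholder title two.",
        link="https://example.org/two",
        short_overview="Strengths: ...; weaknesses: ...",
        classification=Relevance.C,
    ),
]

very_relevant = by_relevance(entries, Relevance.A)

if __name__ == "__main__":
    for entry in very_relevant:
        print(entry.bibliographic_data, "-", entry.classification.value)
```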
Sharing the experience
WP8 Wiki 
Reports, milestones and
deliverables 
Literature overview
Living document
Thank you for your attention
