Piet Daas and Joep Burger
With special thanks to Marco Puts & Dong Nguyen
Profiling Big Data sources to
assess their selectivity
Big Data
– More and more organizations want to use Big Data as a
new/additional source of information
– However, there are some major challenges:
– Selectivity of Big Data
– Source does not have to completely cover the target
population
– What part of the population is included?
Profiling: extracting ‘features’
– Extract background characteristics (‘features’) from the ‘units’
in Big Data in an attempt to determine its selectivity
‐ The need for this depends on the ‘type’ of Big Data source and its
intended use
– Important background characteristics for statistics are:
‐ Persons: gender, age, income, education, origin,
urbanicity, household composition, ...
‐ Companies: number of employees, turnover, type of
economic activity, legal form, ...
Social Media: Twitter as an example
– On social media, persons, companies and ‘others’ can
create an account and post messages
‐ In the Netherlands, 70% of the population is active on social media
– What kind of information about a user is available on Twitter?
‐ Focus on gender!
– Let’s look at a profile: @pietdaas
Profile screenshot, with four elements marked:
1) Name
2) Short bio
3) Message content
4) Picture
Studied a Twitter sample
– From a list of Dutch Twitter users (~330,000)
– A random sample of 1,000 unique ids was drawn
– Of the sample:
‐ 844 profiles still existed
• 844 had a name
• 583 provided a short bio
• 473 created ‘tweets’
• 804 had a ‘non-default’ picture
• 409 Men (49%)
• 282 Women (33%)
• 153 ‘Others’ (18%)
• companies, organizations, dogs, cats, ‘bots’, ...
(Image: default Twitter profile picture)
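The counts above show how often each profile element is available at all, which bounds what any feature-based gender estimate can cover. A minimal sketch of that arithmetic, using only the counts reported on the slide (the function name `coverage` is an illustrative choice, not part of the original study):

```python
# Counts reported on the slide for the 844 profiles that still existed.
EXISTING_PROFILES = 844

FEATURE_COUNTS = {
    "name": 844,
    "short bio": 583,
    "tweets": 473,
    "non-default picture": 804,
}


def coverage(count: int, total: int = EXISTING_PROFILES) -> float:
    """Share of existing profiles for which a feature is available."""
    return count / total


# Print one line per feature: label, raw count, share of existing profiles.
for feature, count in FEATURE_COUNTS.items():
    print(f"{feature:<20} {count:>4}  {coverage(count):.0%}")
```

A feature that is missing for many profiles can still be very informative where it is present, so coverage and accuracy need to be judged separately.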
Gender findings: 1) First name
– Used the Dutch ‘Voornamenbank’ website (a first-name database)
– Score between 0 and 1 (0 = female, 1 = male); 676 of 844 (80%) names were registered
– Unknown names scored -1 (usually companies/organizations)
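The scoring rule above can be sketched as a simple table lookup. This is only an illustration: the slides do not say how the Voornamenbank was queried, so the `MALE_SHARE` table, its values, and both function names are hypothetical placeholders rather than real data or a real API.

```python
from typing import Dict

# Hypothetical excerpt of a first-name table: name -> share of male bearers.
# The values are illustrative placeholders, NOT Voornamenbank data.
MALE_SHARE: Dict[str, float] = {
    "piet": 1.0,
    "joep": 1.0,
    "anna": 0.0,
    "kim": 0.3,
}

UNKNOWN = -1.0  # score for names not in the table, as on the slide


def first_name_score(display_name: str) -> float:
    """Return a score in [0, 1] (0 = female, 1 = male), or -1 if unknown."""
    tokens = display_name.strip().lower().split()
    if not tokens:
        return UNKNOWN
    # Assumption: the first token of the display name is the first name.
    return MALE_SHARE.get(tokens[0], UNKNOWN)


def score_to_label(score: float) -> str:
    """Map a score to a label; the 0.5 cut-off is an assumption."""
    if score == UNKNOWN or score == 0.5:
        return "unknown"
    return "male" if score > 0.5 else "female"
```

For example, `first_name_score("Piet Daas")` looks up `"piet"`, while an organization name that is not in the table falls through to -1.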
Gender findings: 2) Short bio
– If a short bio is provided
‐ Quite a number of people mention their ‘position’ in the
family
• Mother, father, papa, mama, ‘son of’, etc.
‐ Sometimes occupations are mentioned that reflect
gender (‘studente’, Dutch for a female student)
‐ 155 of 583 (27%) indicated their gender in the short bio
‐ Need to check both English and Dutch texts
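A keyword match over the bio text is one way to picture this step. The sketch below is an assumption about how such a check could look, not the authors' implementation; the term lists combine the examples from the slide with a few Dutch and English equivalents added here for illustration.

```python
import re

# Terms from the slide plus a few assumed Dutch/English equivalents.
FEMALE_TERMS = {"mother", "mama", "moeder", "daughter", "dochter", "studente"}
MALE_TERMS = {"father", "papa", "vader", "son", "zoon"}


def bio_gender(bio: str) -> str:
    """Return 'female', 'male' or 'unknown' based on family/occupation cues."""
    words = set(re.findall(r"\w+", bio.lower()))
    female = bool(words & FEMALE_TERMS)
    male = bool(words & MALE_TERMS)
    if female == male:  # no cue at all, or conflicting cues
        return "unknown"
    return "female" if female else "male"
```

Matching whole words rather than substrings avoids false hits such as "son" inside "person". A bio that contains cues for both genders (for example "mother of a son") is deliberately left as unknown in this sketch.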
Gender findings: 3) Tweets content
– In cooperation with University of Twente (Dong Nguyen)
– Machine learning approach that determines gender-specific writing style
‐ Language-specific: messages need to be in Dutch!
‐ 437 of 473 (92%) persons who created tweets could be classified
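The slides do not describe the model that was used. As a stand-in, the sketch below shows the general idea of learning word usage per label with a small multinomial Naive Bayes classifier; the technique, the function names, and the artificial training tokens are all assumptions made for illustration.

```python
import math
from collections import Counter
from typing import Dict, List, Set, Tuple

Model = Tuple[Dict[str, Counter], Counter, Set[str]]


def tokenize(text: str) -> List[str]:
    return text.lower().split()


def train(examples: List[Tuple[str, str]]) -> Model:
    """Count tokens per label for a multinomial Naive Bayes model."""
    token_counts: Dict[str, Counter] = {}
    label_counts: Counter = Counter()
    for text, label in examples:
        label_counts[label] += 1
        token_counts.setdefault(label, Counter()).update(tokenize(text))
    vocabulary = {tok for counts in token_counts.values() for tok in counts}
    return token_counts, label_counts, vocabulary


def predict(text: str, model: Model) -> str:
    """Return the label with the highest log-posterior (Laplace smoothing)."""
    token_counts, label_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = "", -math.inf
    for label in sorted(label_counts):
        counts = token_counts[label]
        denom = sum(counts.values()) + len(vocabulary)
        score = math.log(label_counts[label] / total_docs)
        for tok in tokenize(text):
            if tok in vocabulary:  # ignore tokens never seen in training
                score += math.log((counts[tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Artificial placeholder tokens, not real tweets.
TOY_EXAMPLES = [
    ("alpha beta beta", "female"),
    ("beta gamma", "female"),
    ("delta epsilon", "male"),
    ("epsilon epsilon alpha", "male"),
]
```

With real data the training texts would be Dutch tweets from accounts of known gender, which is why the slide stresses that the approach is language-specific.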
Gender findings: 4) Profile picture
– Use OpenCV to process pictures
1) Face recognition
2) Standardisation of faces (resize & rotate)
3) Classify faces according to gender
‐ 603 of 804 (75%) profile pictures contained one or more faces
Gender findings: overall results
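Diagnostic Odds Ratio (DOR) = (TP/FN) / (FP/TN)
where TP, FN, FP and TN are the numbers of true positives, false negatives, false positives and true negatives
Random guessing: log(DOR) = 0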
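The definition translates directly into code. The sketch below assumes the natural logarithm (the slides do not state the base) and uses hypothetical confusion-matrix counts in the usage note; neither comes from the study.

```python
import math


def diagnostic_odds_ratio(tp: int, fn: int, fp: int, tn: int) -> float:
    """DOR = (TP/FN) / (FP/TN), as defined on the slide."""
    return (tp / fn) / (fp / tn)


def log_dor(tp: int, fn: int, fp: int, tn: int) -> float:
    """Logarithm of the DOR; natural log is an assumption."""
    return math.log(diagnostic_odds_ratio(tp, fn, fp, tn))
```

For a hypothetical classifier with TP = 90, FN = 10, FP = 20 and TN = 80, the DOR is (90/10) / (20/80) = 36. A classifier that is right half the time in both classes (all four cells equal) has a DOR of 1 and therefore a log(DOR) of 0, which is the random-guessing baseline. The functions raise `ZeroDivisionError` when FN, FP or TN is zero; in practice a small continuity correction is often applied in that case.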
‐ Multi-agent findings
• Need clever ways to combine these
• Take processing efficiency of the ‘agent’ into consideration
Diagnostic Odds Ratio (log) per feature:
First name         6.41
Short bio          3.50
Tweet content      2.36
Picture (faces)    0.72
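One naive way to combine the four ‘agents’ is a cascade: try them in a fixed order and stop at the first one that gives an answer. The sketch below orders the agents by the log(DOR) values above; that ordering, the stub agents, and the field names such as `name_label` are assumptions for illustration, and the slides leave the actual combination method open.

```python
from typing import Callable, Dict, List, Optional, Tuple

Profile = Dict[str, str]
Agent = Callable[[Profile], Optional[str]]


def field_agent(field: str) -> Agent:
    """Build a stub agent that reads a precomputed label from the profile."""
    def agent(profile: Profile) -> Optional[str]:
        return profile.get(field)
    return agent


# Ordered by the reported log(DOR), highest first (field names hypothetical).
AGENTS: List[Tuple[str, Agent]] = [
    ("first name", field_agent("name_label")),
    ("short bio", field_agent("bio_label")),
    ("tweet content", field_agent("tweet_label")),
    ("picture", field_agent("picture_label")),
]


def cascade(
    profile: Profile, agents: List[Tuple[str, Agent]] = AGENTS
) -> Tuple[Optional[str], Optional[str]]:
    """Return (label, agent name) from the first agent that gives a label."""
    for name, agent in agents:
        label = agent(profile)
        if label is not None:
            return label, name
    return None, None
```

Because later agents only run when earlier ones fail, an ordering like this also limits how often the more expensive steps are needed, which touches on the processing-efficiency point above. It ignores disagreement between agents, so a weighted vote would be a natural next refinement.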
Thank you for your attention!