Broad Twitter Corpus: A Diverse Named Entity Recognition Resource
Leon Derczynski
Kalina Bontcheva
Ian Roberts
“I strongly recommend this paper”
“It is therefore a very useful resource”
“Impact of resources: 5
Overall recommendation: 5
Reviewer Confidence: 5”
wow
so review
very paper
much japan
Most of our language tech was trained on news
The bias is:
- middle class
- white
- working age
- educated
- male
- 1980s/1990s
- from the US
- journalist
- following AP guidelines
Your phone rewards you if you talk and write like
(and that's ok.. sort of)
Photo © Michael Jang 1983
.. and punishes you when you don't.
(not cool!)
The REAL problem:
Our studies have centred on a
tiny, over-biased set of data
There is no variation!
(analyse some WSJ if you are not convinced..)
It's time to up our game;
social media is a cheap & unprecedented resource
e.g. Baldwin @ WNUT15; Hovy @ ACL15
Social media is incredibly powerful
- sample of all global discourse
- warns of earthquakes
- sends fire engines
- predicts virus outbreaks (e.g. WNV)
Traditional tools have awful performance
Stanford NER 40% F1
Single-topic recall 66%
.. cross-topic 33%
What kind of entities do we find in social media?
High variety – ages quickly
Type | News | Tweets
PER | Politicians, business leaders, journalists, celebrities | Sportsmen, actors, TV personalities, celebrities, names of friends
LOC | Countries, cities, rivers, and other places related to current affairs | Restaurants, bars, local landmarks/areas, cities, rarely countries
ORG | Public and private companies, government organisations | Bands, internet companies, sports clubs
Why a new corpus?
Existing ones are tiny, and hyperfocused
Name | Tokens | Schema | Annotation | Notes
UMBC | 7K | PLO | Crowd | Low IAA
Ritter | 46K | Freebase | Expert, single | No IAA
Microsoft | 12K | PLO + Product | ? | Private
MSM | 29K | PLO + Misc | Expert, multiple | No hashtags / usernames
What kind of variance do we see?
Temporal:
- concept drift over time
- daily cycles (work, family, socialising)
- weekly cycles
- time of year (seasonal behaviours)
Spatial
- many different anglophone regions
- different surface forms in each
- different signifiers (LLC – Ltd. – DAC)
Social
- WSJ readers and writers
- net celebrities
- tv characters
Corpus design:
Temporal
- drawn over six years, from twitter archive
- selected over multiple temporal cycles
Spatial
- spread over six anglophone regions:
UK, US, IE, CA, NZ, AU
Social
- general segment
- selection for news
- selection for commentary
Annotation problems
Workflow:
Crowdsourcing platform interfaces = pita
Not in USA, so no mturk access
Solution:
- GATE Crowdsourcing plugin
- Load corpus, set up task, add API key, launch job, done!
- Automatic result collection & alignment
- Even Java/Swing is prettier than mturk’s back end
Annotation problems
Task design
Lots of training required
Many entity types
Solution
Brief instructions
Clean interface
Annotate just one entity type at a time
- pricy but way better, and overall, quicker
Annotation problems
Annotator recall
Pretty serious problem
People have limited knowledge, limited world experience
Expert annotators actually not good – we’re desperately overfit
Don’t believe me? Who can explain this real document?
KKTNY in 45 min!!!!!
Solution:
Ignore traditional IAA
Pool the results - “max recall”
Rare knowledge ≠ Wrong knowledge
Post-solution:
Expert adjudication step
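The pooling idea can be sketched in a few lines. This is only an illustration under stated assumptions: spans are character offsets for a single entity type, the function names `pool_max_recall` and `adjudicate` are hypothetical, and the expert's decisions are modelled as a set of rejected spans. It is not the GATE plugin's actual implementation.

```python
from typing import Iterable

# A span is (start_offset, end_offset) for one entity type.
# All names here are illustrative, not taken from the slides.
Span = tuple[int, int]


def pool_max_recall(worker_spans: Iterable[Iterable[Span]]) -> list[Span]:
    """Union of every span marked by any worker, sorted by position.

    No majority vote is taken: a span found by a single worker is kept,
    because rare knowledge is not the same as wrong knowledge.
    """
    pooled: set[Span] = set()
    for spans in worker_spans:
        pooled.update(spans)
    return sorted(pooled)


def adjudicate(pooled: list[Span], rejected: set[Span]) -> list[Span]:
    """Keep the pooled spans that the expert adjudicator did not reject."""
    return [span for span in pooled if span not in rejected]


# "KKTNY in 45 min!!!!!": only one of three workers recognises the name.
workers = [[], [(0, 5)], []]
pooled = pool_max_recall(workers)
final = adjudicate(pooled, rejected=set())
```

Under majority voting the single annotation would be discarded; with pooling it survives until the expert step.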
Annotation problems
Crowd can be pretty dumb
Not its fault – we gave no education
People need precise idea of task
Solution 1
Ensure workers get good score on known data first
Lace the text with gold data, for monitoring & feedback
Solution 2
Keep task focused (just one entity type)
Give instructions & examples
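A minimal sketch of the gold-data check in Solution 1 follows. It assumes each judgement is a yes/no answer keyed by an item ID, and the 0.7 threshold is an arbitrary placeholder; the slides do not state the cut-off that was actually used, and the function names are hypothetical.

```python
def gold_accuracy(answers: dict[str, bool], gold: dict[str, bool]) -> float:
    """Fraction of gold items that a worker answered correctly.

    Items without a gold answer are ignored; a worker who has seen
    no gold items scores 0.0 and so is not trusted yet.
    """
    seen = [item for item in answers if item in gold]
    if not seen:
        return 0.0
    correct = sum(answers[item] == gold[item] for item in seen)
    return correct / len(seen)


def trusted_workers(
    all_answers: dict[str, dict[str, bool]],
    gold: dict[str, bool],
    threshold: float = 0.7,  # assumed value, not from the slides
) -> list[str]:
    """Workers whose accuracy on the laced-in gold items meets the threshold."""
    return sorted(
        worker
        for worker, answers in all_answers.items()
        if gold_accuracy(answers, gold) >= threshold
    )


gold = {"t1": True, "t2": False}
all_answers = {
    "w1": {"t1": True, "t2": False, "t3": True},   # 2/2 on gold
    "w2": {"t1": False, "t2": False},              # 1/2 on gold
}
kept = trusted_workers(all_answers, gold)
```

The same score can be recomputed as work proceeds, which is what makes gold items useful for ongoing monitoring and feedback rather than only as an entry test.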
Results – annotator quality
Experts are consistent, but don’t get far
Crowd is varied and inconsistent, but gets superior recall performance
Remember, recall is the problem with soc med!
Group | Recall over final annotations | F1 IAA
Expert | 0.309 | 0.835
Crowd | 0.837 | 0.350
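The two columns measure different things, which a small sketch may make concrete. It assumes annotations are sets of exact-match spans and that agreement is a pairwise F1 between two annotators; the slides do not say exactly how the reported figures were computed, so treat the function names and matching rule as assumptions.

```python
def recall_over_final(annotated: set, final: set) -> float:
    """Share of the final (adjudicated) entities that a group found."""
    return len(annotated & final) / len(final) if final else 0.0


def f1_agreement(a: set, b: set) -> float:
    """Exact-match F1 between two annotators: 2|A∩B| / (|A| + |B|)."""
    if not a and not b:
        return 1.0  # both marked nothing, so they agree
    return 2 * len(a & b) / (len(a) + len(b))


final = {(0, 5), (9, 14), (20, 25), (30, 33)}

# Two cautious annotators: they agree with each other but find little.
expert_1 = {(9, 14)}
expert_2 = {(9, 14)}

# Two crowd workers: each finds different entities.
crowd_1 = {(0, 5), (9, 14)}
crowd_2 = {(9, 14), (20, 25)}

expert_recall = recall_over_final(expert_1 | expert_2, final)
crowd_recall = recall_over_final(crowd_1 | crowd_2, final)
expert_iaa = f1_agreement(expert_1, expert_2)
crowd_iaa = f1_agreement(crowd_1, crowd_2)
```

In this toy data the cautious pair agrees perfectly yet covers a quarter of the final entities, while the crowd pair agrees only half the time but covers three quarters, the same pattern as the table.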
Results: size
Name | Tokens | Schema | Annotation | Notes
UMBC | 7K | PLO | Crowd | Low IAA
Ritter | 46K | Freebase | Expert, single | No IAA
Microsoft | 12K | PLO + Product | ? | Private
MSM | 29K | PLO + Misc | Expert, multiple | No hashtags / usernames
BTC (Broad Twitter Corpus) | 165K | PLO | Expert + Crowd | Source JSON available

Documents | 9 551
Tokens | 165 739
Person | 5 271
Location | 3 114
Organisation | 3 732
Total | 12 117
Results: diversity
Sorry Botswana, Bahamas, South Africa, Malta.. looking forward to seeing you crowdsource!
Results: diversity
By year, and month
Results: diversity
By day of month, weekday, and time of day
Results: IAA
Adjudication is the agreement with max-recall
Naïve is micro-averaged lenient match
Note that max-recall performs very well
(according to expert..)
Level | Adjudication | Naïve
Whole doc | 0.839 | N/A
Person | 0.920 | 0.799
Location | 0.963 | 0.861
Organisation | 0.936 | 0.954
All | 0.940 | 0.877
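One possible reading of "micro-averaged lenient match" is sketched below. It assumes that two spans match leniently when their character ranges overlap at all, and that micro-averaging means summing match counts over all documents before computing a single F1. Both are assumptions; the slides do not spell out the definition, and `lenient_micro_f1` is a hypothetical name.

```python
Span = tuple[int, int]


def overlaps(a: Span, b: Span) -> bool:
    """True if two half-open character ranges share at least one position."""
    return a[0] < b[1] and b[0] < a[1]


def lenient_micro_f1(docs: list[tuple[list[Span], list[Span]]]) -> float:
    """F1 between two annotators, pooling counts across all documents.

    docs holds one (spans_a, spans_b) pair per document.
    """
    matched_a = matched_b = total_a = total_b = 0
    for spans_a, spans_b in docs:
        total_a += len(spans_a)
        total_b += len(spans_b)
        # A span counts as matched if it overlaps any span from the other side.
        matched_a += sum(any(overlaps(a, b) for b in spans_b) for a in spans_a)
        matched_b += sum(any(overlaps(a, b) for a in spans_a) for b in spans_b)
    if total_a == 0 or total_b == 0:
        return 0.0
    precision = matched_a / total_a
    recall = matched_b / total_b
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


docs = [
    ([(0, 5)], [(0, 4)]),                 # boundaries differ, still a lenient match
    ([(10, 15), (20, 25)], [(10, 15)]),   # annotator A marks one extra span
]
score = lenient_micro_f1(docs)
```

Micro-averaging weights every entity equally, so long documents with many entities influence the score more than short ones.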
Results: popular surface forms
CONLL is: * ancient
* US and int.rel. centric
* about cricket???
Results: long tail steepness
Tail vs. head tells us something about diversity
If a few forms make up many mentions, the corpus is more boring:
- less variety (qualitative)
- harder to generalise about (maths!)
We bisect at the h-index point, and compare proportions
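The bisection can be sketched as follows, assuming the usual h-index definition applied to surface forms: the largest h such that h distinct forms each occur at least h times. The top h forms are the head and the rest the tail. Comparing the share of mentions in the head is one plausible choice of "proportion"; the slides do not specify which proportion was compared, and the example mentions are made up.

```python
from collections import Counter


def h_index(counts: list[int]) -> int:
    """Largest h such that h items each have a count of at least h."""
    h = 0
    for rank, count in enumerate(sorted(counts, reverse=True), start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


def head_share(mentions: list[str]) -> float:
    """Fraction of all mentions covered by the top-h surface forms."""
    ranked = sorted(Counter(mentions).values(), reverse=True)
    total = sum(ranked)
    if total == 0:
        return 0.0
    return sum(ranked[: h_index(ranked)]) / total


# Toy data: frequencies 4, 3, 2, 1, 1 give h = 2.
mentions = (
    ["London"] * 4 + ["Obama"] * 3 + ["BBC"] * 2 + ["Sheffield"] + ["Aarhus"]
)
share = head_share(mentions)
```

A higher head share means a few forms dominate, which is the "more boring" case described above; a lower share indicates a longer, more varied tail.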
Corpus distribution
Totally legal to give source; it’s under 50K tweets
- JSON
- GATE docs
- CoNLL
All intermediate crowdsourcing data included in the GATE docs
Available before Dec 16
To be extra sure, also available as “rehydratable standoff”
Thanks! And thank you everyone!
Alonso & Lease 2011
Bontcheva et al. 2014a
Bontcheva et al. 2014b
Callison-Burch & Dredze 2010
Difallah et al. 2013
Finin et al. 2010
Hovy et al. 2013
Khanna et al. 2010
Morris et al. 2012
Sabou et al. 2014
Balog et al. 2012
Bollacker et al. 2008
Hovy 2010
Rowe et al. 2013
Ritter et al. 2011
Rose et al. 2002
Tjong Kim Sang et al. 2003
Coppersmith et al. 2014
De Choudhury et al. 2013
Kedzie et al. 2015
Neubig et al. 2011
Tumasjan et al. 2010
Eisenstein et al. 2010
Eisenstein 2013
Hu et al. 2013
Kergl et al. 2014
Mascaro & Goggins 2012
Tufekci 2014
Bontcheva et al. 2013
Liu et al. 2011
Lui & Baldwin 2012
Magdy & Elsayed 2016
Mostafa 2013
O’Connor et al. 2010
Fromreide et al. 2014
Masud et al. 2010

