Broad Twitter Corpus: A Diverse Named Entity Recognition Resource

Leon Derczynski
Kalina Bontcheva
Ian Roberts

“I strongly recommend this paper”
“It is therefore a very useful resource”
“Impact of resources: 5
Overall recommendation: 5
Reviewer Confidence: 5”
wow
so review
very paper
much japan

Most of our language tech was trained on news
The bias is:
- middle class
- white
- working age
- educated
- male
- 1980s/1990s
- from the US
- journalist
- following AP guidelines

Your phone rewards you if you talk and write like
(and that's ok.. sort of)
Photo © Michael Jang 1983

.. and punishes you when you don't.
(not cool!)

The REAL problem:
Our studies have centred on a
tiny, over-biased set of data
There is no variation!
(analyse some WSJ if you are not convinced..)
It's time to up our game;
social media is a cheap & unprecedented resource
e.g. Baldwin @ WNUT15; Hovy @ ACL15

Social media is incredibly powerful
- sample of all global discourse
- warns of earthquakes
- sends fire engines
- predicts virus outbreaks (e.g. WNV)
Traditional tools have awful performance
Stanford NER 40% F1
Single-topic recall 66%
.. cross-topic 33%

What kind of entities do we find in social media?
High variety – ages quickly
Type  News                                   Tweets
PER   Politicians, business leaders,         Sportsmen, actors, TV personalities,
      journalists, celebrities               celebrities, names of friends
LOC   Countries, cities, rivers, and other   Restaurants, bars, local
      places related to current affairs      landmarks/areas, cities, rarely countries
ORG   Public and private companies,          Bands, internet companies,
      government organisations               sports clubs

Why a new corpus?
Existing ones are tiny, and hyperfocused
Name       Tokens  Schema         Annotation        Notes
UMBC       7K      PLO            Crowd             Low IAA
Ritter     46K     Freebase       Expert, single    No IAA
Microsoft  12K     PLO + Product  ?                 Private
MSM        29K     PLO + Misc     Expert, multiple  No hashtags / usernames

What kind of variance do we see?
Temporal:
- concept drift over time
- daily cycles (work, family, socialising)
- weekly cycles
- time of year (seasonal behaviours)
Spatial
- many different anglophone regions
- different surface forms in each
- different signifiers (LLC, Ltd., DAC)
Social
- WSJ readers and writers
- net celebrities
- tv characters

Corpus design:
Temporal
- drawn over six years, from twitter archive
- selected over multiple temporal cycles
Spatial
- spread over six anglophone regions:
UK, US, IE, CA, NZ, AU
Social
- general segment
- selection for news
- selection for commentary

Annotation problems
Workflow:
Crowdsourcing platform interfaces = pita
Not in USA, so no MTurk access
Solution:
- GATE Crowdsourcing plugin
- Load corpus, set up task, add API key, launch job, done!
- Automatic result collection & alignment
- Even Java/Swing is prettier than MTurk’s back end

Annotation problems
Task design
Lots of training required
Many entity types
Solution
Brief instructions
Clean interface
Annotate just one entity type at a time
- pricy but way better, and overall, quicker

Annotation problems
Annotator recall
Pretty serious problem
People have limited knowledge, limited world experience
Expert annotators actually not good – we’re desperately overfit
Don’t believe me? Who can explain this real document?
KKTNY in 45 min!!!!!
Solution:
Ignore traditional IAA
Pool the results - “max recall”
Rare knowledge ≠ Wrong knowledge
Post-solution:
Expert adjudication step
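
The pooling idea can be sketched as a plain union over annotators' spans. This is a minimal sketch, assuming each annotation is a (start, end, type) tuple over token offsets; the function name `pool_max_recall` and the sample data are hypothetical and not taken from the authors' pipeline.

```python
from typing import Iterable, Set, Tuple

# Assumed representation: (start token, end token, entity type)
Span = Tuple[int, int, str]

def pool_max_recall(annotations: Iterable[Set[Span]]) -> Set[Span]:
    """Union every annotator's spans: a span marked by anyone is kept."""
    pooled: Set[Span] = set()
    for spans in annotations:
        pooled |= spans
    return pooled

# Hypothetical judgements from three crowd workers on one tweet
workers = [
    {(0, 1, "ORG")},                  # only this worker knows the first mention
    {(3, 4, "PER")},
    {(3, 4, "PER"), (6, 7, "LOC")},
]
pooled = pool_max_recall(workers)
print(sorted(pooled))
```

A span that only one worker finds survives pooling, which is the point of "rare knowledge ≠ wrong knowledge"; the expert adjudication step then prunes the pooled set.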

Annotation problems
Crowd can be pretty dumb
Not its fault – we gave no education
People need precise idea of task
Solution 1
Ensure workers get good score on known data first
Lace the text with gold data, for monitoring & feedback
Solution 2
Keep task focused (just one entity type)
Give instructions & examples
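
Scoring workers on laced gold items might look like the sketch below. It assumes each worker's answers are a dict from item ID to label; the 0.8 threshold, the function names and the data are all illustrative assumptions, not values from the slides.

```python
def gold_accuracy(answers, gold):
    """Fraction of gold items answered correctly; non-gold items are ignored."""
    scored = [item for item in answers if item in gold]
    if not scored:
        return 0.0
    correct = sum(1 for item in scored if answers[item] == gold[item])
    return correct / len(scored)

def trusted_workers(all_answers, gold, threshold=0.8):
    """Workers whose accuracy on gold items meets the (assumed) threshold."""
    return sorted(
        worker for worker, answers in all_answers.items()
        if gold_accuracy(answers, gold) >= threshold
    )

# Hypothetical gold items hidden among ordinary tasks
gold = {"t1": "yes", "t2": "no"}
all_answers = {
    "w1": {"t1": "yes", "t2": "no", "t9": "yes"},  # t9 is not gold
    "w2": {"t1": "no", "t2": "no"},
}
print(trusted_workers(all_answers, gold))
```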

Results – annotator quality
Experts are consistent, but don’t get far
Crowd is varied and inconsistent, but gets superior recall performance
Remember, recall is the problem with social media!

Group   Recall over final annotations  F1 IAA
Expert  0.309                          0.835
Crowd   0.837                          0.350
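
The two measures in the table can be sketched as set operations over annotated spans. This is a minimal sketch under assumptions: exact span matching, and F1 computed pairwise between two annotators. The slides do not spell out either choice, and the toy data below is not from the corpus.

```python
def recall(found, final):
    """Share of the final annotations that a group found."""
    return len(found & final) / len(final) if final else 0.0

def f1_agreement(a, b):
    """F1 between two annotators' span sets (symmetric)."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))

# Toy spans as (start, end, type); invented for illustration only
final = {(0, 1, "PER"), (2, 3, "LOC"), (4, 5, "ORG"), (6, 7, "PER")}
expert_a = {(0, 1, "PER")}
expert_b = {(0, 1, "PER")}
crowd_a = {(0, 1, "PER"), (2, 3, "LOC"), (4, 5, "ORG")}
crowd_b = {(2, 3, "LOC"), (4, 5, "ORG"), (6, 7, "PER")}

print(recall(expert_a | expert_b, final), f1_agreement(expert_a, expert_b))
print(recall(crowd_a | crowd_b, final), f1_agreement(crowd_a, crowd_b))
```

In this toy case the experts agree perfectly but find a quarter of the entities, while the crowd disagrees more yet covers everything once pooled, which mirrors the pattern in the table.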

Results: size
Name                        Tokens  Schema         Annotation        Notes
UMBC                        7K      PLO            Crowd             Low IAA
Ritter                      46K     Freebase       Expert, single    No IAA
Microsoft                   12K     PLO + Product  ?                 Private
MSM                         29K     PLO + Misc     Expert, multiple  No hashtags / usernames
BTC (Broad Twitter Corpus)  165K    PLO            Expert + Crowd    Source JSON available

Documents       9 551
Tokens        165 739
Person          5 271
Location        3 114
Organisation    3 732
Total          12 117

Results: diversity
Sorry Botswana, Bahamas, South Africa, Malta.. looking forward to seeing you crowdsource!

Results: diversity
By year, and month

Results: diversity
By day of month, weekday, and time of day

Results: IAA
Adjudication is the agreement with max-recall
Naïve is micro-averaged lenient match
Note that max-recall performs very well
(according to expert..)
Level         Adjudication  Naïve
Whole doc     0.839         n/a
Person        0.920         0.799
Location      0.963         0.861
Organisation  0.936         0.954
All           0.940         0.877
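
A micro-averaged lenient match could be computed roughly as below. The slides do not define "lenient"; this sketch assumes the common convention that two same-type spans match if they overlap at all, and it pools counts across documents before computing F1. The names and data are hypothetical.

```python
def overlaps(a, b):
    """Same type and at least one shared position; spans are (start, end, type), end exclusive."""
    return a[2] == b[2] and a[0] < b[1] and b[0] < a[1]

def lenient_micro_f1(docs):
    """docs: list of (spans_a, spans_b) pairs, one pair per document."""
    matched_a = matched_b = total_a = total_b = 0
    for spans_a, spans_b in docs:
        matched_a += sum(any(overlaps(a, b) for b in spans_b) for a in spans_a)
        matched_b += sum(any(overlaps(b, a) for a in spans_a) for b in spans_b)
        total_a += len(spans_a)
        total_b += len(spans_b)
    if total_a == 0 or total_b == 0:
        return 0.0
    precision = matched_a / total_a   # A's spans that B also (partly) marked
    recall = matched_b / total_b      # B's spans that A also (partly) marked
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Two toy documents: a partial overlap in the first, a missed ORG in the second
docs = [
    ([(0, 2, "PER")], [(1, 2, "PER")]),
    ([(5, 6, "LOC"), (8, 9, "ORG")], [(5, 6, "LOC")]),
]
score = lenient_micro_f1(docs)
print(round(score, 3))
```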

Results: popular surface forms
CoNLL is:
- ancient
- US and international-relations centric
- about cricket???

Results: long tail steepness
Tail vs. head tells us something about diversity
If a few forms make up many mentions, the corpus is more boring:
- less variety (qualitative)
- harder to generalise about (maths!)
We bisect at h-index point, and compare proportions
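
One reading of the h-index bisection is sketched below: rank surface forms by mention count, take h as the largest rank where the h-th form still has at least h mentions, and report the share of all mentions that falls in that head. This is an interpretation of the slide, not the authors' code, and the sample mentions are invented.

```python
from collections import Counter

def h_index(counts):
    """Largest h such that h surface forms each have at least h mentions."""
    h = 0
    for rank, count in enumerate(sorted(counts, reverse=True), start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h

def head_proportion(mentions):
    """Share of mentions covered by the top-h surface forms."""
    if not mentions:
        return 0.0
    ranked = sorted(Counter(mentions).values(), reverse=True)
    h = h_index(ranked)
    return sum(ranked[:h]) / len(mentions)

# Invented mention list: counts are 4, 3, 2, 1, 1
mentions = ["London"] * 4 + ["Leeds"] * 3 + ["Cork"] * 2 + ["Galway", "Perth"]
print(h_index(Counter(mentions).values()), head_proportion(mentions))
```

A higher head proportion means a few forms dominate, which is the "more boring" case described above.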

Corpus distribution
Totally legal to give source; it’s under 50K tweets
- JSON
- GATE docs
- CoNLL
All intermediate crowdsourcing data included in the GATE docs
Available before Dec 16
To be extra sure, also available as “rehydratable standoff”
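
The slides do not describe the standoff format, so the following is only a guess at the general idea: annotations are stored as tweet IDs plus character offsets, without the text, and are "rehydrated" once the tweet text is fetched again. The field names and the `rehydrate` helper are hypothetical, and a dict stands in for whatever service returns tweet text.

```python
# Hypothetical standoff records: no tweet text, only IDs, offsets and types
standoff = [
    {"tweet_id": "1001", "start": 0, "end": 5, "type": "LOC"},
    {"tweet_id": "1002", "start": 3, "end": 9, "type": "PER"},
]

def rehydrate(records, texts):
    """Attach surface strings to standoff records, skipping unavailable tweets."""
    entities = []
    for record in records:
        text = texts.get(record["tweet_id"])
        if text is None:
            continue  # tweet deleted or not retrievable
        surface = text[record["start"]:record["end"]]
        entities.append((record["tweet_id"], surface, record["type"]))
    return entities

# Stand-in for re-fetched tweet text; tweet 1002 is no longer available
texts = {"1001": "Leeds is lovely today"}
entities = rehydrate(standoff, texts)
print(entities)
```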

Thanks! And thank you everyone!
Alonso & Lease 2011
Bontcheva et al. 2014a
Bontcheva et al. 2014b
Callison-Burch & Dredze 2010
Difallah et al. 2013
Finin et al. 2010
Hovy et al. 2013
Khanna et al. 2010
Morris et al. 2012
Sabou et al. 2014
Balog et al. 2012
Bollacker et al. 2008
Hovy 2010
Rowe et al. 2013
Ritter et al. 2011
Rose et al. 2002
Tjong Kim Sang et al. 2003
Coppersmith et al. 2014
De Choudhury et al. 2013
Kedzie et al. 2015
Neubig et al. 2011
Tumasjan et al. 2010
Eisenstein et al. 2010
Eisenstein 2013
Hu et al. 2013
Kergl et al. 2014
Mascaro & Goggins 2012
Tufekci 2014
Bontcheva et al. 2013
Liu et al. 2011
Lui & Baldwin 2012
Magdy & Elsayed 2016
Mostafa 2013
O’Connor et al. 2010
Fromreide et al. 2014
Masud et al. 2010
