Microblog-genre noise and its impact on semantic annotation accuracy

Leon Derczynski
Diana Maynard
Niraj Aswani
Kalina Bontcheva

Evolution of communication
Functional utterances
Vowels
Velar closure: consonants
Speech
New modality: writing
Digital text
E-mail
Social media
Increased machine-readable information
??

Social Media = Big Data
Gartner ''3V'' definition:
1. Volume
2. Velocity
3. Variety
High volume & velocity of messages:
Twitter has ~20 000 000 users per month
They write ~500 000 000 messages per day
Massive variety:
Stock markets;
Earthquakes;
Social arrangements;
… Bieber

What resources do we have now?
Large, content-rich, linked, digital streams of human communication
We transfer knowledge via communication
Sampling communication gives a sample of human knowledge
''You've only done that which you can communicate''
The metadata (time – place – imagery) gives a richer resource:
→A sampling of human behaviour

Linking these resources
Why is it useful to link this data?
Identifying subjects, themes, … ''entities''
What's the text about?
Who'd be interested in it?
How can we summarise it?
- Very important, given the information overload in social media!

Typical annotation pipeline
Text → Language ID → Tokenisation → PoS tagging

Typical annotation pipeline
Named entity recognition → Linking entities
dbpedia.org/resource/.....
Michael_Jackson
Michael_Jackson_(writer)

Pipeline cumulative effect
Good performance is important at each stage – not just entity linking
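A rough way to see the cumulative effect, as a sketch only: assume that stage errors are independent and that a later stage cannot recover from an earlier mistake (both are simplifying assumptions, not claims from this talk). The chance that an item passes every stage correctly is then the product of the per-stage accuracies. The figures used are the language ID and tokenisation accuracies quoted elsewhere in this deck; `pipeline_accuracy` is a hypothetical helper name.

```python
from math import prod


def pipeline_accuracy(stage_accuracies):
    """Rough estimate: product of per-stage accuracies (independence assumed)."""
    return prod(stage_accuracies)


# Accuracies quoted in this deck for microblog text.
general = {"language_id": 0.895, "tokenisation": 0.80}
customised = {"language_id": 0.974, "tokenisation": 0.96}

general_estimate = pipeline_accuracy(general.values())
customised_estimate = pipeline_accuracy(customised.values())
print(f"general: {general_estimate:.3f}, customised: {customised_estimate:.3f}")
```

Even with only two stages, the generic tools lose a large share of the input before PoS tagging or NER begins.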

Language ID
Newswire:
The Jan. 21 show started with the unveiling of an impressive three-story castle from which Gaga emerges. The band members were in various portals, separated from each other for most of the show. For the next 2 hours and 15 minutes, Lady Gaga repeatedly stormed the moveable castle, turning it into her own gothic Barbie Dreamhouse
Microblog:
LADY GAGA IS BETTER THE 5th TIME OH BABY(:

Language ID difficulties
General accuracy on microblogs: 89.5%
Problems include switching language mid-text:
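je bent Jacques cousteau niet die een nieuwe soort heeft ontdekt,
het is duidelijk, ze bedekken hun gezicht. Get over it
(Dutch, then English: ''you are not Jacques Cousteau who has discovered a new species, it is clear, they are covering their faces. Get over it'')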
New info in this format:
Metadata:
spatial information
linked URLs
Emoticons:
:) vs. ^_^
cu vs. 88
Accuracy when customised to genre: 97.4%
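The talk does not say how the language identifier was customised. One plausible ingredient, shown here purely as an assumption, is to strip genre-specific material (URLs, @mentions, hashtags, emoticons) before the text reaches the identifier, so that it only sees natural-language words. The patterns and the name `strip_microblog_noise` are illustrative and far from exhaustive.

```python
import re

# Illustrative patterns only; real tweets need broader coverage.
URL = re.compile(r"https?://\S+|\b[\w-]+\.[a-z]{2,}/\S+", re.IGNORECASE)
MENTION_OR_TAG = re.compile(r"[@#]\w+")
EMOTICON = re.compile(r"(?<!\w)[:;=][\-o']?[()\[\]dDpP/\\|](?!\w)|\^_\^|<3")


def strip_microblog_noise(text):
    """Remove URLs, @mentions, #hashtags and a few emoticons, then tidy spaces."""
    for pattern in (URL, MENTION_OR_TAG, EMOTICON):
        text = pattern.sub(" ", text)
    return " ".join(text.split())


cleaned = strip_microblog_noise("Apple trees in the home garden bit.ly/yOztKs :)")
```

The removed items need not be thrown away: they could be kept alongside the cleaned text as separate features.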

Tokenisation
General accuracy on microblogs: 80%
Goal is to convert byte stream to readily-digestible word chunks
Word boundary discovery is a critical language acquisition task
The LIBYAN AID Team successfully shipped these broadcasting
equipment to Misrata last August 2011, to establish an FM Radio
station ranging 600km, broadcasting to the west side of Libya to
help overthrow Gaddafi's regime.
RT @JosetteSheeran: @WFP #Libya breakthru! We move
urgently needed #food (wheat, flour) by truck convoy into
western Libya for 1st time :D

Tokenisation difficulties
Not curated, so typos
Improper grammar – e.g. apostrophe usage; live with it!
doesnt → doesnt
doesn't → does n't
Smileys and emoticons
I <3 you → I & lt ; you
This piece ;,,( so emotional → this piece ; , , ( so emotional
Loss of information (sentiment)
Punctuation for emphasis
*HUGS YOU**KISSES YOU* → * HUGS YOU**KISSES YOU *
Words run together

Tokenisation fixes
Custom tokeniser!
Apostrophe insertion
Slang unpacking
Ima get u → I 'm going to get you
Emoticon rules
- if we can spot them, we know not to break them
Customised accuracy on microblogs: 96%
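The emoticon rule can be sketched with an ordered regular expression: patterns for URLs, @mentions, hashtags and emoticons are tried before the generic word and punctuation patterns, so those items are never broken apart. This is a minimal illustration under assumed patterns, not the tokeniser used in the talk. It also keeps ''doesn't'' as one token rather than splitting it into ''does n't''.

```python
import re

# Order matters: genre-specific patterns come before the generic fallbacks.
TOKEN = re.compile(
    r"""
    https?://\S+                        # URLs
  | [@#]\w+                             # @mentions and #hashtags
  | <3|\^_\^                            # a few fixed emoticons
  | [:;=][\-',]*[()\[\]dDpP/\\|]+       # smileys such as :) ;-( :D ;,,(
  | \w+(?:'\w+)?                        # words, keeping an internal apostrophe
  | [^\w\s]                             # any other single punctuation mark
    """,
    re.VERBOSE,
)


def tokenise(text):
    """Split text into tokens without breaking emoticons, hashtags or mentions."""
    return TOKEN.findall(text)


tokens = tokenise("This piece ;,,( so emotional")
```

A generic tokeniser would split ''<3'' and '';,,('' into separate punctuation marks and lose the sentiment they carry.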

Part of speech tagging
Goal is to assign words to classes (verb, noun etc)
General accuracy on newswire:
97.3% token, 56.8% sentence
General accuracy on microblogs:
Sentence-level accuracy important:
without whole sentence correct, difficult to extract syntax
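The gap between the token and sentence figures follows from how the two are counted: a sentence is correct only if every tag in it is correct. If tokens were tagged independently (an assumption made only for illustration), a 20-token sentence at 97.3% token accuracy would be fully correct roughly 0.973^20 ≈ 58% of the time, which is in the region of the 56.8% above. The sketch below computes both measures on toy data; the tag sequences are invented.

```python
def tagging_accuracy(gold, predicted):
    """Return (token_accuracy, sentence_accuracy) for parallel tag sequences.

    Each gold sequence is assumed to have the same length as its prediction.
    """
    tokens = correct_tokens = correct_sentences = 0
    for gold_tags, pred_tags in zip(gold, predicted):
        matches = sum(g == p for g, p in zip(gold_tags, pred_tags))
        tokens += len(gold_tags)
        correct_tokens += matches
        # A sentence counts only if every one of its tags is right.
        correct_sentences += matches == len(gold_tags)
    return correct_tokens / tokens, correct_sentences / len(gold)


# Toy data: one wrong tag out of six ruins one of the two sentences.
gold = [["PRP", "VBP", "NN"], ["NNP", "VBZ", "JJ"]]
predicted = [["PRP", "VBP", "NN"], ["NN", "VBZ", "JJ"]]
token_acc, sentence_acc = tagging_accuracy(gold, predicted)

# Independence estimate for a 20-token sentence at 97.3% token accuracy.
independent_estimate = 0.973 ** 20
```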

Part of speech tagging difficulties
Many unknowns:
Music bands
Soulja Boy | TheDeAndreWay.com in stores Nov 2, 2010
Places
#LB #news: Silverado Park Pool Swim Lessons
Capitalisation way off
@thewantedmusic on my tv :) aka derek
last day of sorting pope visit to birmingham stuff out
Slang
~HAPPY B-DAY TAYLOR !!! LUVZ YA~
Orthographic errors
dont even have homwork today, suprising ?
Dialect
Shall we go out for dinner this evening?
Ey yo wen u gon let me tap dat

Part of speech tagging fixes
Slang dictionary for ''repair''
Won't cover previously-unseen slang
In-genre labelled data
Expensive to create!
Leverage ML
Existing taggers can handle unknown words
Maximise use of these features!
General accuracy on microblogs:
Accuracy when customised to genre:
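The slang-dictionary ''repair'' idea amounts to a token-level lookup, sketched below. The entries in `SLANG` are hypothetical examples chosen to match the tweets shown earlier, not the dictionary used in the talk. The limitation noted above is visible directly: anything absent from the dictionary passes through unchanged.

```python
# Hypothetical entries for illustration only.
SLANG = {
    "u": "you",
    "ya": "you",
    "luvz": "love",
    "wen": "when",
    "dat": "that",
    "gon": "going to",
}


def repair_slang(tokens, lexicon=SLANG):
    """Replace known slang tokens; unknown tokens are left untouched."""
    repaired = []
    for token in tokens:
        replacement = lexicon.get(token.lower())
        # A replacement may expand to several tokens ("gon" -> "going to").
        repaired.extend(replacement.split() if replacement else [token])
    return repaired


repaired = repair_slang("wen u gon let me tap dat".split())
```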

Named Entity Recognition
Goal is to find entities we might like to link
General accuracy on newswire: 89% F1
General accuracy on microblogs: 41% F1
Newswire:
London Fashion Week grows up – but mustn't take itself too seriously. Once a launching pad for new designers, it is fast becoming the main event. But LFW mustn't let the luxury and money crush its sense of silliness.
Microblog:
Gotta dress up for london fashion week and party in style!!!

NER difficulties
Rule-based systems get the bulk of entities (newswire 77% F1)
ML-based systems do well at the remainder (newswire 89% F1)
Small proportion of difficult entities
Many complex issues
Using improved pipeline:
ML struggles, even with in-genre data: 49% F1
Rules cut through microblog noise: 80% F1
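One reason rules may cope better with microblog noise is that they can be written to ignore unreliable cues such as capitalisation. The sketch below is a case-insensitive, longest-match gazetteer lookup. It is an assumed illustration of that idea, not the rule-based system evaluated in the talk, and the gazetteer entries and type labels are hypothetical.

```python
def gazetteer_ner(tokens, gazetteer):
    """Greedy longest-match lookup, ignoring case.

    Returns (start, end, type) spans with an exclusive end index.
    """
    lowered = [t.lower() for t in tokens]
    max_len = max(len(entry) for entry in gazetteer)
    spans, i = [], 0
    while i < len(tokens):
        # Try the longest candidate first so "london fashion week" beats "london".
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            label = gazetteer.get(tuple(lowered[i:i + n]))
            if label:
                spans.append((i, i + n, label))
                i += n
                break
        else:
            i += 1
    return spans


# Hypothetical gazetteer; keys are lower-cased token tuples.
GAZETTEER = {("london", "fashion", "week"): "EVENT", ("london",): "LOC"}
tweet = "Gotta dress up for london fashion week and party in style".split()
spans = gazetteer_ner(tweet, GAZETTEER)
```

A model that relies on capitalisation features learned from newswire would get no help from ''london fashion week'' in lower case.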

NER on Facebook
Longer texts than tweets
Still has informal tone
MWEs are a problem!
- all capitalised:
Green Europe Imperiled as Debt Crises
Triggers Carbon Market Drop
Difficult, though easier than Twitter
Maybe due to the possibility of including more verbal context?

Entity linking
Goal is to find out which entity a mention refers to
''The murderer was Professor Plum, in the Library,
with the Candlestick!''
Which Professor Plum?
Disambiguation is through connecting text to the web of data
dbpedia.org/resource/Professor_Plum_(astrophysicist)
Two tasks:
- Whole-text linking
- Entity-level linking
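Entity-level linking can be sketched as choosing, among candidate resources for a mention, the one whose description shares the most words with the surrounding text. This is a deliberately simple stand-in for real disambiguation. The candidate names come from the pipeline slide, while the keyword sets and the function `link_entity` are invented for the example.

```python
def link_entity(context_tokens, candidates):
    """Pick the candidate whose keywords overlap most with the context.

    Returns None when no candidate shares any word with the context.
    """
    context = {t.lower() for t in context_tokens}

    def overlap(item):
        return len(context & item[1])

    best_name, best_keywords = max(candidates.items(), key=overlap)
    return best_name if context & best_keywords else None


# Keyword sets are hypothetical; a real system would derive them from the resource.
CANDIDATES = {
    "Michael_Jackson": {"singer", "pop", "album", "thriller", "music"},
    "Michael_Jackson_(writer)": {"beer", "whisky", "writer", "journalist", "book"},
}

linked = link_entity("new book on whisky by Michael Jackson".split(), CANDIDATES)
unlinked = link_entity("Michael Jackson was here".split(), CANDIDATES)
```

The second call shows why short messages are hard: with no overlapping context words there is nothing to disambiguate on.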

''Aboutness''
Goal: answer ''What entities is this text about?''
Good for tweets:
Lack of lexicalised context
Not all related concepts are in the text
Helpful for summarisation
No concern for entity bounds (finding them is tough in microblogs!)
* but *
Added concern for themes in text
e.g. ''marketing'', ''US elections''

Aboutness performance
Corpus:
from Meij et al. ''Adding semantics to microblog posts.''
468 tweets
From one to six concepts per tweet
DBpedia Spotlight: highest recall (47.5)
TextRazor: highest precision (64.6)
Zemanta: highest F1 (41.0)
Zemanta tuned for blog entries – so compensates for some noise
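For reference, precision, recall and F1 over per-tweet concept sets can be computed as below. The sketch micro-averages over all tweets, which is an assumption: the talk does not state which averaging was used. The concept names are invented.

```python
def precision_recall_f1(gold, predicted):
    """Micro-averaged scores over parallel lists of concept sets."""
    true_positives = sum(len(g & p) for g, p in zip(gold, predicted))
    n_predicted = sum(len(p) for p in predicted)
    n_gold = sum(len(g) for g in gold)
    precision = true_positives / n_predicted if n_predicted else 0.0
    recall = true_positives / n_gold if n_gold else 0.0
    if true_positives == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical annotations for two tweets.
gold = [{"Lady_Gaga"}, {"Libya", "World_Food_Programme"}]
predicted = [{"Lady_Gaga", "Castle"}, {"Libya"}]
precision, recall, f1 = precision_recall_f1(gold, predicted)
```

A system can lead on recall or precision alone and still trail on F1, since F1 is the harmonic mean of the two.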

Word-level linking
Goal is to link an entity
Given:
The entity mention
Surrounding microblog context
No corpora exist for this exact task:
Two commercially produced ones
Policy says ''no sharing''
How can we approach this key task?

Word-level linking performance
Dataset: RepLab
Task is to determine relatedness-or-not
Six entities given
Few hundred tweets per entity
Detect mentions of entity in tweets
We disambiguate mentions to DBpedia / Wikipedia (easy to map)
General performance: F1 around 70

Word-level linking issues
NER errors
Missed entities damage or destroy linking
Specificity problems
Lufthansa Cargo
Lufthansa Cargo
Which organisation to choose?
Require good NER
Direct linking chunking reduces precision:
Apple trees in the home garden bit.ly/yOztKs
Pipeline NER does not mark Apple as entity here
Lack of disambiguation context is a problem!

Word-level linking issues
Automatic annotation:
Branching out from Lincoln park(LOC) after dark ... Hello "Russian
Navy(ORG)", it's like the same thing but with glitter!
Actual:
Branching out from Lincoln park after dark(ORG) ... Hello "Russian
Navy", it's like the same thing but with glitter!
Clue in unusual collocations
+ ?

Whole pipeline: how to fix?
Common genre problems centre on mucky, uncurated text
Orth error
Slang
Brevity
Condensed
Non-Chicago punctuation..
Maybe clearing up this will improve performance?

Normalisation
General solution for overcoming linguistic noise
How to repair?
1. Gazetteer (quick & dirty); or..
2. Noisy channel model
Task is to ''reverse engineer'' the noise on this channel
Brown clustering; double metaphone; auto orth correction
An honest, well-formed sentence u wot m8 biber #lol
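A noisy channel normaliser picks the intended word w that maximises P(w) × P(observed | w). The sketch below shows only the shape of that computation: the word priors are made up, and string similarity from `difflib` stands in for a real channel model (it is a score, not a probability). The Brown clustering and double metaphone components mentioned above are not reproduced here.

```python
from difflib import SequenceMatcher

# Hypothetical unigram priors over a tiny vocabulary.
WORD_PRIOR = {"what": 0.5, "with": 0.3, "wet": 0.2}


def channel_score(observed, intended):
    """Stand-in for P(observed | intended): character-level similarity in [0, 1]."""
    return SequenceMatcher(None, observed, intended).ratio()


def normalise(observed, prior=WORD_PRIOR):
    """Return the vocabulary word maximising prior * channel score."""
    return max(prior, key=lambda word: prior[word] * channel_score(observed, word))


normalised = normalise("wot")
```

Here ''wet'' is closer to ''wot'' in spelling, but the higher prior of ''what'' wins. The same mechanism can rewrite a correct but rare word into a common one.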

Normalisation performance
NER on tweets:
Rule-based
No normalisation F1 80%
Gazetteer normalisation F1 81%
Noisy channel F1 81%
ML-based
No normalisation F1 49.1%
Gazetteer normalisation F1 47.6%
Noisy channel F1 49.3%
Negligible performance impact, and introduces errors!
Sentiment change:
undisambiguable → disambiguable
Meaning change:
She has Huntington's → She has Huntingdon's

Future directions
MORE DATA!
and better..
No inter-annotator agreement (IAA) figures for many resources
Maybe from the crowd?
MORE CONTEXT!
Not just linguistic
Microblogs have a host of metadata
Explicit:
Time, Place, URIs, Hashtags
Implicit:
Friend network
Previous messages

Thank you!
Thank you for listening!
Do you have any questions?
