Unsupervised Word Usage Similarity in Social Media Texts

Methodology

Results

Conclusion

Unsupervised Word Usage Similarity in Social Media Texts
Spandana Gella and Paul Cook and Bo Han
Department of Computing and Information Systems
The University of Melbourne

Social media — Twitter

Huge volume of user generated text
Applications: trend analysis, event detection, natural disaster response co-ordination

Short, noisy text: Challenging for traditional NLP
Little work to date on lexical semantics on Twitter
This paper: Lexical semantic interpretation in tweets based on word usage similarity

Word sense disambiguation (WSD)

Given a word in context, select the best-fitting “sense” from a sense inventory

Tweet: ne1 headin to blue boyz footy match this weekend? #iamcarlton
  game in which players or teams compete against each other

Tweet: I think they are a perfect match! #cute #xoxo
  something that resembles or harmonizes; “that tie makes a good match with your jacket”
  a pair of people who live together; “a married couple from Chicago”

Issues with WSD

Choice of sense inventory
match: WordNet 9 senses, Macmillan 4 senses

No sense-tagged resources for social media data
Cannot capture novel usage patterns
Assumes a single sense per usage
Challenges over social media: short, noisy, non-standard syntax
Solution? An alternative representation of meaning in context

Usage similarity (Usim)
The manual task of rating the similarity of a pair of usages of a word (SPair) [Erk et al., 2009].
Similarity on a graded scale (1 to 5)
No more senses; independent of sense inventory
Novel usages: Rate similarity to established usages
SPair example (annotators’ judgement: 3.2)
Setting goals for myself this year, figured if it’s on paper I’ll be more inspired to work harder.
This is very unsmart of me to get tipsy and then have to go home and write a paper.

Gold standard — Usim-tweet annotation
10 nouns: bar, charge, execution, field, figure, function, investigator, match, paper, post
55 pairs of tweets (SPairs) were annotated per lemma (sampled from Twitter Streaming API)
Amazon Mechanical Turk annotation
Sample Annotation:

Background corpora
Original: Tweets from Streaming API containing target word as noun
gotta 3 page paper due tomorrow haven start #procrastination

Expanded: Original + document expansion based on medium-frequency hashtags
research paper due tomorrow 2 page 3 #procrastination
insert figure equations lab report less time word #physicsmajor #procrastination ...

RandExpanded: Original + extra tweets containing the target word
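
The hashtag-based expansion described above could be sketched roughly as follows. This is an illustration only, not the authors' procedure: the function name expand_by_hashtag, the thresholds min_count and max_count (a stand-in for "medium-frequency"), and the choice to concatenate every tweet that shares a hashtag are all assumptions.

```python
import re
from collections import defaultdict

HASHTAG = re.compile(r"#\w+")


def expand_by_hashtag(tweets, extra_tweets, min_count=2, max_count=100):
    """Append to each tweet the extra tweets that share one of its hashtags.

    Only hashtags that occur in between min_count and max_count extra tweets
    are used; these thresholds are hypothetical, not taken from the slides.
    """
    # Index the extra tweets by the (lower-cased) hashtags they contain.
    by_tag = defaultdict(list)
    for tweet in extra_tweets:
        for tag in set(HASHTAG.findall(tweet.lower())):
            by_tag[tag].append(tweet)

    expanded = []
    for tweet in tweets:
        doc = [tweet]
        for tag in sorted(set(HASHTAG.findall(tweet.lower()))):
            pool = by_tag.get(tag, [])
            if min_count <= len(pool) <= max_count:
                doc.extend(pool)
        # One expanded "document" per original tweet.
        expanded.append(" ".join(doc))
    return expanded
```

Called with the Original example tweet and a pool containing the two #procrastination tweets shown above, the function returns a single longer document in which all three tweets are joined.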

Model Usim for Twitter

No large annotated training resources: unsupervised
No parser: bag-of-words
Methods?
Vector space model (VSM)
Topic models (LDA) [Lui et al., 2012]
Weighted Textual Matrix Factorization (WTMF)

1 model per target word

Methods

VSM (Baseline): Second order co-occurrence
LDA (Our approach): Represent documents as topic distribution vectors
WTMF (Benchmark): Consider information from “missing” words related to the latent vector profile
State-of-the-art on a similar task
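
Once each usage is represented as a vector (for LDA, a distribution over T topics), the two usages in an SPair can be compared directly. The sketch below assumes cosine similarity as the comparison measure, which the slides do not state, and uses made-up topic distributions; in practice the vectors would be inferred by a topic model trained for the target word.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


# Hypothetical topic distributions (T = 3) for three usages of one word.
usage_a = [0.7, 0.2, 0.1]
usage_b = [0.6, 0.3, 0.1]
usage_c = [0.05, 0.05, 0.9]

# Usages dominated by the same topic score higher than usages that are not.
print(cosine(usage_a, usage_b) > cosine(usage_a, usage_c))
```

The same function applies unchanged to the vectors produced by the VSM baseline or WTMF, since it only needs two vectors of equal length.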

Topic modeling — LDA

Method overview

Results

Model      Original    Expanded    RandExpanded
Baseline   0.09        0.08        0.09
WTMF (d)   0.03 (8)    0.10 (20)   0.09 (5)
LDA (T)    0.20 (8)    0.29 (5)    0.18 (20)

Spearman’s rank correlation (ρ) for each method based on each background corpus; values in parentheses give the d (WTMF) or T (LDA) setting
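
For reference, Spearman's ρ between model scores and gold-standard ratings can be computed as the Pearson correlation of their ranks. The sketch below is a plain standard-library version with average ranks for ties; the function names and the example numbers are illustrative, not data from the study.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman's rho: Pearson correlation of the rank-transformed values."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences of at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one side is constant
    return cov / math.sqrt(var_x * var_y)


# Made-up example: gold Usim ratings (1 to 5) and model similarity scores.
gold = [3.2, 1.0, 4.8, 2.5]
model = [0.4, 0.1, 0.9, 0.5]
print(spearman_rho(gold, model))
```

Because only ranks matter, the model scores do not need to be on the same 1 to 5 scale as the annotators' ratings.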

Results on Expanded

Lemma          Best-T ρ   Best-T T   5-topics ρ
bar            0.35       50         0.1
charge         0.33       20         −0.08
execution      0.58       5          0.58
field          0.53       10         0.32
figure         0.24       250        0.14
function       0.40       10         0.27
investigator   0.50       5          0.50
match          0.45       5          0.45
paper          0.32       30         0.22
post           0.2        30         −0.01
Overall        0.29       5          0.29

ρ values that are significant (p < 0.05) are shown in bold on the original slide
Results varying d and T

[Figure: Spearman correlation ρ for the Original, Expanded and RandExpanded background corpora]
(a) WTMF: ρ versus dimensions (d)
(b) LDA: ρ versus topics (T)

Summary

Computationally modeled Usim over social media
LDA approach outperformed a baseline and a benchmark
Hashtag-based document expansion improved performance of LDA and the benchmark
Gold-standard dataset Usim-tweet will be made available
Future work
Alternative document expansion (e.g., author-based)
Context representation: POS, positional word features, etc.
Non-parametric topic modelling (e.g., HDP)

Thanks

Bibliography

Erk, K., McCarthy, D., and Gaylord, N. (2009). Investigations on word senses and word usages. In Proceedings of the Joint conference of the 47th Annual Meeting of the Association for Computational Linguistics and the 4th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing (ACL-IJCNLP 2009), pages 10–18, Singapore.

Lui, M., Baldwin, T., and McCarthy, D. (2012). Unsupervised estimation of word usage similarity. In Proceedings of the Australasian Language Technology Workshop 2012 (ALTW 2012), pages 33–41, Dunedin, New Zealand.
