Methodology

Results

Conclusion

Unsupervised Word Usage Similarity in Social
Media Texts
Spandana Gella and Paul Cook and Bo Han
Department of Computing and Information Systems
The University of Melbourne

Social media — Twitter

Huge volume of user-generated text
Applications: trend analysis, event detection, natural disaster
response co-ordination

Short, noisy text: Challenging for traditional NLP
Little work to date on lexical semantics on Twitter
This paper: Lexical semantic interpretation in tweets based on
word usage similarity

Word sense disambiguation (WSD)

Given a word in context, select the best-fitting “sense” from a
sense inventory
ne1 headin to blue boyz footy match this weekend?
#iamcarlton
game in which players or teams compete against each other

I think they are a perfect match! #cute #xoxo
something that resembles or harmonizes; “that tie makes a
good match with your jacket”
a pair of people who live together; “a married couple from
Chicago”

Issues with WSD

Choice of sense inventory
match: WordNet 9 senses, Macmillan 4 senses

No sense-tagged resources for social media data
Cannot capture novel usage patterns
Assumes a single sense per usage
Challenges over social media: short, noisy, non-standard
syntax
Solution? An alternative representation of meaning in context

Usage similarity (Usim)
The manual task of rating the similarity of a pair of usages of
a word (SPair) [Erk et al., 2009].
Similarity on a graded scale (1 – 5)
No more senses; independent of sense inventory
Novel usages: Rate similarity to established usages
SPair example (annotators’ judgement: 3.2)
Setting goals for myself this year, figured if it’s on paper I’ll
be more inspired to work harder.
This is very unsmart of me to get tipsy and then have to go
home and write a paper.
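
A fractional judgement such as 3.2 on a 1 to 5 scale is consistent with averaging several annotators' integer ratings for one SPair. The sketch below assumes that reading; the function name `gold_usim` and the five ratings are hypothetical, not the actual annotations.

```python
from statistics import mean

def gold_usim(ratings):
    """Average the per-annotator ratings (1 to 5) for one SPair."""
    if not ratings:
        raise ValueError("an SPair needs at least one rating")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie on the 1 to 5 scale")
    return mean(ratings)

# Hypothetical ratings from five annotators for the 'paper' SPair above.
example_ratings = [3, 4, 3, 3, 3]
example_score = gold_usim(example_ratings)
```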

Gold standard — Usim-tweet annotation
10 nouns: bar, charge, execution, field, figure, function,
investigator, match, paper, post
55 pairs of tweets (SPairs) were annotated per lemma
(sampled from Twitter Streaming API)
Amazon Mechanical Turk annotation
Sample Annotation:

Background corpora
Original: Tweets from Streaming API containing target
word as noun
gotta 3 page paper due tomorrow haven start
#procrastination

Expanded: Original + document expansion based on
medium-frequency hashtags
research paper due tomorrow 2 page 3 #procrastination
insert figure equations lab report less time word #physicsmajor
#procrastination ...

RandExpanded: Original + extra tweets containing
target
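
One plausible reading of the Expanded corpus is sketched below: each tweet is lengthened with the text of other tweets that share one of its medium-frequency hashtags. The slides do not define "medium-frequency", so the `min_freq` and `max_freq` thresholds, the function names, and the toy tweets are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

def hashtags(tokens):
    """Return the set of hashtag tokens in one tokenised tweet."""
    return {t for t in tokens if t.startswith("#") and len(t) > 1}

def expand_documents(tweets, min_freq=2, max_freq=3):
    """Append to each tweet the tokens of tweets sharing a medium-frequency hashtag.

    tweets: list of token lists. min_freq and max_freq are hypothetical
    bounds on how many tweets a hashtag must occur in to count as
    medium-frequency.
    """
    tag_counts = Counter(tag for toks in tweets for tag in hashtags(toks))
    medium = {t for t, c in tag_counts.items() if min_freq <= c <= max_freq}

    # Index tweets by the medium-frequency hashtags they contain.
    by_tag = defaultdict(list)
    for i, toks in enumerate(tweets):
        for tag in hashtags(toks) & medium:
            by_tag[tag].append(i)

    expanded = []
    for i, toks in enumerate(tweets):
        doc = list(toks)
        seen = {i}  # never append a tweet to itself or twice
        for tag in sorted(hashtags(toks) & medium):
            for j in by_tag[tag]:
                if j not in seen:
                    seen.add(j)
                    doc.extend(tweets[j])
        expanded.append(doc)
    return expanded

# Toy input, not real data.
toy_tweets = [
    ["paper", "due", "#procrastination"],
    ["lab", "report", "#procrastination"],
    ["nice", "bar", "#yolo"],
]
toy_expanded = expand_documents(toy_tweets)
```

RandExpanded can be read as the control for this step: it adds a comparable amount of extra text without using hashtags to choose it.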

Model Usim for Twitter

No large annotated training resources: unsupervised
No parser: bag-of-words
Methods?
Vector space model (VSM)
Topic models (LDA) [Lui et al., 2012]
Weighted Textual Matrix Factorization (WTMF)

1 model per target word

Methods

VSM (Baseline): Second-order co-occurrence
LDA (Our approach): Represent documents as topic
distribution vectors
WTMF (Benchmark): Consider information from “missing”
words related to the latent vector profile
State-of-the-art on a similar task
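
All three methods turn each usage into a vector, so a Usim prediction can be obtained by comparing two vectors. The sketch below assumes cosine similarity as the comparison (the slides do not name the measure) and shows it for a second-order co-occurrence vector and for topic-distribution vectors. The toy corpus and the topic probabilities are hypothetical; in practice the distributions would come from a trained LDA model.

```python
import math
from collections import Counter, defaultdict

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def first_order(corpus):
    """Map each word to counts of the words it co-occurs with in a tweet."""
    cooc = defaultdict(Counter)
    for tokens in corpus:
        for word in tokens:
            for other in tokens:
                if other != word:
                    cooc[word][other] += 1
    return cooc

def second_order(usage, cooc):
    """Sum the first-order vectors of the words in one usage."""
    vec = Counter()
    for word in usage:
        vec.update(cooc.get(word, {}))
    return vec

# Toy background corpus (hypothetical).
corpus = [
    ["write", "paper", "essay"],
    ["paper", "essay", "due"],
    ["bar", "drink", "beer"],
]
cooc = first_order(corpus)
vsm_close = cosine(second_order(["write", "essay"], cooc),
                   second_order(["due", "essay"], cooc))
vsm_far = cosine(second_order(["write", "essay"], cooc),
                 second_order(["drink", "beer"], cooc))

# Hypothetical topic distributions for three usages (topic id -> probability).
usage_a = {0: 0.70, 1: 0.20, 2: 0.10}
usage_b = {0: 0.60, 1: 0.30, 2: 0.10}
usage_c = {0: 0.05, 1: 0.05, 2: 0.90}
lda_close = cosine(usage_a, usage_b)
lda_far = cosine(usage_a, usage_c)
```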

Topic modeling — LDA

Method overview

Results

Model       Original    Expanded     RandExpanded
Baseline    0.09        0.08         0.09
WTMF (d)    0.03 (8)    0.10 (20)    0.09 (5)
LDA (T)     0.20 (8)    0.29 (5)     0.18 (20)

Spearman’s rank correlation (ρ) for each method based on each
background corpus
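
For reference, Spearman's ρ is the Pearson correlation of the two rank vectors, with tied values given their average rank. The sketch below is a plain-Python illustration of that definition, not the evaluation script behind the table; the `gold` and `predicted` values are hypothetical.

```python
import math

def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; returns NaN if either input is constant."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0.0 or sd_y == 0.0:
        return float("nan")
    return cov / (sd_x * sd_y)

def spearman(x, y):
    """Spearman's rank correlation between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical gold Usim ratings and model similarities for five SPairs.
gold = [3.2, 1.0, 4.6, 2.4, 3.2]
predicted = [0.41, 0.10, 0.77, 0.52, 0.38]
rho = spearman(gold, predicted)
```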

Results on Expanded
Lemma           Best-T ρ    Best-T T    5-topics ρ
bar             0.35        50          0.1
charge          0.33        20          −0.08
execution       0.58        5           0.58
field           0.53        10          0.32
figure          0.24        250         0.14
function        0.40        10          0.27
investigator    0.50        5           0.50
match           0.45        5           0.45
paper           0.32        30          0.22
post            0.2         30          −0.01
Overall         0.29        5           0.29

ρ values that are significant (p < 0.05) are shown in bold
Results varying d and T

Figure: Spearman correlation ρ for the Original, Expanded and RandExpanded background corpora
(a) WTMF: ρ versus dimensions (d)
(b) LDA: ρ versus topics (T)

Summary

Computationally modeled Usim over social media
LDA approach outperformed both the baseline and the benchmark
Hashtag-based document expansion improved performance of
LDA and benchmark
Gold-standard dataset Usim-tweet will be made available
Future work
Alternative document expansion (e.g., author-based)
Context representation: POS, positional word features, etc.
Non-parametric topic modelling (e.g., HDP)

Thanks

Bibliography

Erk, K., McCarthy, D., and Gaylord, N. (2009).
Investigations on word senses and word usages.
In Proceedings of the Joint Conference of the 47th Annual Meeting of the
Association for Computational Linguistics and the 4th International Joint
Conference on Natural Language Processing of the Asian Federation of
Natural Language Processing (ACL-IJCNLP 2009), pages 10–18,
Singapore.
Lui, M., Baldwin, T., and McCarthy, D. (2012).
Unsupervised estimation of word usage similarity.
In Proceedings of the Australasian Language Technology Workshop 2012
(ALTW 2012), pages 33–41, Dunedin, New Zealand.
