北京大学计算机科学技术研究所 
Institute of Computer Science & Technology Peking University 
Feature Extraction for Effective 
Microblog Search and Adaptive 
Clustering Algorithms for TTG 
PKUICST at TREC 2014 Microblog Track 
Chao Lv Feifan Fan Runwei Qiang Yue Fei Jianwu Yang 
qiangrw@pku.edu.cn 
Peking University
Ad hoc Search Task 
• Challenges 
• System Overview 
• Feature Extraction 
• Experimental Results 
2 
(Q1 , t1) 
(Q2 , t2) 
… 
(Qn , tn)
Challenges 
• Tweets are limited to 140 characters 
• Severe vocabulary-mismatch problem 
• It is necessary to apply query expansion techniques 
• Abundance of shortened URLs 
• We should offer ways to expand documents 
• Large quantities of pointless babble 
• Tweet quality should be defined to filter non-informative messages 
3
Motivations 
• Learning to rank can make full use of different models or factors in 
microblog search 
• different factors => different features 
4
System Framework 
5 
[Framework diagram: in training, TREC’13 topics over Tweets13 pass through Candidate Generation and Feature Generation; the Learning System combines the training-set features with relevance Labels to produce a Model. In testing, TREC’14 topics over Tweets14 follow the same pipeline, and the Ranking System applies the Model to output the Ranked Tweets.]
Feature Extraction 
Related Work in Microblog Search 
• Many features have been proven useful 
• Semantic features between query and document 
• Tweet quality features, e.g., link, retweet, and mention count/binary 
• An empirical study on learning to rank of tweets [1] (20) 
• Content relevance features (3) 
• Twitter’s specific features (6) 
• Account Authority Features (12) 
• TREC 2012 microblog track experiments at Kobe University [2] (8) 
• Feature Analysis in Microblog Retrieval Based on Learning to Rank [3] (15) 
• Exploiting Ranking Factorization Machines for Microblog Retrieval [4] (29) 
6
Feature Extraction 
Features for Traditional Web Search 
• Hundreds/Thousands of Features in the Full Ranker for Web Search 
• LETOR Dataset 
• A collection of benchmark datasets for research on learning to rank. 
• Each query-url pair is represented by a 136-dimensional vector. 
• Features such as: 
• covered query term number of body, anchor, title, url and whole document 
• PageRank 
• url click count 
• url dwell time 
• … 
7
Feature Extraction 
Retrieval Model 
Retrieval 
Model 
Query 
Document 
8 
• Okapi BM25 Score (BM25) 
• Language Model Score (LM) 
• LM.DIR 
• LM.JM 
• LM.ABS 
• TFIDF Model Score (TFIDF)
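As a concrete illustration of two of these scoring functions, here is a minimal sketch (not our actual implementation) of Okapi BM25 and Dirichlet-smoothed language model (LM.DIR) scoring over tokenized tweets. The defaults k1 = 0.3 and b = 0.05 follow the values mentioned in the speaker notes; the handling of terms unseen in the collection is an assumption.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, df, N, avgdl, k1=0.3, b=0.05):
    """Okapi BM25 score of a tokenized document for a query.

    df: term -> document frequency over the collection
    N: number of documents; avgdl: average document length.
    """
    tf = Counter(doc_terms)
    dl = len(doc_terms)
    score = 0.0
    for t in query_terms:
        if tf[t] == 0:
            continue
        # Robust (always non-negative) idf variant.
        idf = math.log((N - df.get(t, 0) + 0.5) / (df.get(t, 0) + 0.5) + 1)
        norm = tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * dl / avgdl))
        score += idf * norm
    return score

def lm_dirichlet_score(query_terms, doc_terms, coll_tf, coll_len, mu=2500):
    """Query log-likelihood under a Dirichlet-smoothed document model.

    coll_tf: term -> frequency over the whole collection; coll_len: total
    number of terms in the collection. The 0.5 pseudo-count for terms
    unseen in the collection is an illustrative assumption.
    """
    tf = Counter(doc_terms)
    dl = len(doc_terms)
    score = 0.0
    for t in query_terms:
        p_c = coll_tf.get(t, 0.5) / coll_len
        score += math.log((tf[t] + mu * p_c) / (dl + mu))
    return score
```

Both functions score higher for documents that actually contain the query terms, which is all the learning-to-rank features need: each (query variant, document variant, model) triple contributes one such score.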
Feature Extraction 
Query 
Query 
Document 
Retrieval 
Model 
9 
• Use different queries to better 
understand the user’s search 
intent 
• Original Query 
• Top Tweet Based Query 
• Web Based Query 
• Freebase Based Query 
• Whether to use PRF-based query expansion 
Feature Extraction 
Query Example 
10 
OriginQuery: Ron Weasley birthday 

Web Results 
1. Ronald Weasley - Harry Potter Wiki 
   Ronald Bilius Weasley was the sixth of seven children born to Arthur and Molly Weasley (née Prewett), and got his middle name from his uncle. He was born at … 
2. Ronald Weasley's seventeenth birthday - Harry Potter Wiki 
   Ronald Weasley's seventeenth birthday took place on 1 March, 1997. He received many gifts from … 
3. Drunk Ron Weasley Sings Happy Birthday To Harry Potter - YouTube 
   Jul 31, 2013. Drunk Ron Weasley (played by Simon Pegg) visits Jimmy Fallon to wish Harry Potter a happy birthday. Subscribe NOW to The Tonight Show … 
4. … 
5. … 

Issue Tweet 
It's Ron Weasley's birthday! The ginger who vomited slugs out from his mouth, happy birthday Ron 

Expanded queries (term weights estimated with RTRM [7]): 
WebQuery: weaslei 0.1064, ron 0.0745, potter 0.0532, birthdai 0.0532, ronald 0.0532 
IssueQuery: birthdai 0.2000, ron 0.2000, ginger 0.1000, weaslei 0.1000, vomit 0.1000 
MergeQuery: birthdai 0.2549, ron 0.2549, weaslei 0.1961, ginger 0.0588, vomit 0.0588 
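A MergeQuery combines two term distributions. A minimal sketch of such linear interpolation follows; the 0.6 weight on the first distribution matches the speaker notes, while the renormalization step is an illustrative assumption.

```python
def interpolate_queries(dist_a, dist_b, lam=0.6):
    """Linearly interpolate two term distributions: lam*A + (1-lam)*B.

    dist_a, dist_b: dicts mapping term -> weight.
    """
    terms = set(dist_a) | set(dist_b)
    merged = {t: lam * dist_a.get(t, 0.0) + (1 - lam) * dist_b.get(t, 0.0)
              for t in terms}
    z = sum(merged.values())  # renormalize in case the inputs were unnormalized
    return {t: w / z for t, w in merged.items()}
```

For example, interpolating {"a": 1.0} with {"b": 1.0} at lam = 0.6 yields {"a": 0.6, "b": 0.4}.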
Feature Extraction 
Document 
Document 
Retrieval 
Model 
Query 
• Plain Tweet Text (Origin) 
Say HappyBirthdayRonWeasley and share your creativity by 
submitting a drawing of Ron to celebrate 
• Topic Information from URL (Title) 
Pottermore Insider Happy birthday Ron Weasley 
• Merged Text (DocEx) 
Say HappyBirthdayRonWeasley and share your creativity 
by submitting a drawing of Ron to celebrate Pottermore 
Insider Happy birthday Ron Weasley 
11
Feature Extraction 
Document 
API 
• Get tweets with the common API 
• Saves crawling time 
• Uses general term statistics 
• Statistical index with Lucene 
Local 
• Local copy of the API corpus 
• Preprocessing before indexing 
• Non-English tweet removal with 
ldig 
• RT tweet removal 
• Dynamic index with Lemur 
12
Feature Extraction 
Quality Features 
• Quality Features 
1. Time Difference between Query Issue Time and Tweet Post Time 
2. Mention Count 
3. Hashtag Count 
4. Shortened URL Count 
5. Term Count of Text 
6. Length of Text 
13
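The six quality features above can be sketched as a small extraction function. The regular expressions and field names below are illustrative assumptions, not our exact implementation.

```python
import re
from datetime import datetime

def quality_features(tweet_text, post_time, query_time):
    """Compute the six tweet quality features listed above (sketch).

    post_time / query_time: datetime objects for the tweet post time
    and the query issue time.
    """
    return {
        "time_diff_secs": (query_time - post_time).total_seconds(),
        "mention_count": len(re.findall(r"@\w+", tweet_text)),
        "hashtag_count": len(re.findall(r"#\w+", tweet_text)),
        "url_count": len(re.findall(r"https?://\S+", tweet_text)),
        "term_count": len(tweet_text.split()),
        "char_length": len(tweet_text),
    }
```

These are appended to the relevance-score features to form the final feature vector for each query-tweet pair.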
Experimental Results 
• PKUICST1 [auto]: API-corpus features (4 queries × 3 document representations × 2 (with/without PRF) × 5 retrieval models + 10 = 130) 
• PKUICST2 [auto]: Local-corpus features (4 × 3 × 2 × 5 + 10 = 130) 
• PKUICST3 [auto]: both API- and Local-corpus features (120 + 120 + 10 = 250) 
• PKUICST4 [auto]: language model with web-based query expansion 
• PKUICST4[auto] Language Model, with web-based query expansion 
14 
Run MAP P@30 
PKUICST1 0.5834 0.7242 
PKUICST2 0.5648 0.7279 
PKUICST3 0.5863 0.7224 
PKUICST4 0.5422 0.6958
TTG Task 
• Challenges 
• System Overview 
• Candidate Selection 
• Clustering Algorithm 
• Experimental Results 
"I have an information need expressed by a 
query Q at time t and I would like a summary 
that captures relevant information." 
15
Challenges 
• Systems will need to address two challenges: 
• Determine how many results to return. 
• Detect (and eliminate) redundant tweets. 
16
System Overview 
17 
[System overview diagram: TREC’11-12 topics over Tweets11-12 (training set, with ground truth) and TREC’14 topics over Tweets14 (test set) feed the Ad Hoc Search System; its results pass through Candidate Selection and the Clustering Algorithm to produce the Summarized Tweets.]
Candidate Selection 
• Determine how many results to return 
• Uniform tweet number (N = 200) 
• Score threshold on the learning-to-rank score (score > 4.5, avg N = 89) 
• Manually selected tweet number N for each query (avg N = 225) 
18 
[Bar chart: manually selected N for each of the 55 queries, ranging from 0 to about 1200; the top-N ad hoc results are kept as candidates for clustering and the rest are removed.]
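The first two selection strategies can be sketched as follows; the function and parameter names are illustrative, with defaults matching the values above.

```python
def select_candidates(ranked, strategy="topn", n=200, threshold=4.5):
    """Pick clustering candidates from a ranked list of (tweet, score) pairs.

    'topn' keeps a uniform number N of tweets; 'score' keeps tweets whose
    learning-to-rank score exceeds the threshold (4.5 in our runs).
    """
    if strategy == "topn":
        return ranked[:n]
    if strategy == "score":
        return [(t, s) for t, s in ranked if s > threshold]
    raise ValueError("unknown strategy: %s" % strategy)
```

The third strategy (a manually chosen N per query) is the same top-N cut with a per-query value of n.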
Clustering Algorithm I 
Star Clustering 
19
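A minimal sketch of the star clustering procedure described in the notes: build a cosine-similarity graph with threshold σ, then repeatedly take the highest-degree remaining vertex as a star and peel off its neighbors as one cluster. The sparse term-vector representation is an assumption.

```python
import math

def star_clustering(vectors, sigma=0.7):
    """Star clustering over cosine similarities (sketch).

    vectors: list of dicts mapping term -> weight, one per tweet.
    Returns a list of (star_index, member_indices) pairs.
    """
    def cos(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        na = math.sqrt(sum(w * w for w in a.values()))
        nb = math.sqrt(sum(w * w for w in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    # Build the similarity graph: an edge wherever cosine >= sigma.
    n = len(vectors)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if cos(vectors[i], vectors[j]) >= sigma:
                adj[i].add(j)
                adj[j].add(i)

    # Iteratively extract the highest-degree vertex and its neighbors.
    remaining = set(range(n))
    clusters = []
    while remaining:
        star = max(remaining, key=lambda v: len(adj[v] & remaining))
        members = (adj[star] & remaining) | {star}
        clusters.append((star, sorted(members)))
        remaining -= members
    return clusters
```

Each star tweet then represents its cluster in the returned timeline, which removes redundant near-duplicates.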
Clustering Algorithm II 
Hierarchical Clustering 
20 
[Dendrogram over tweets t1–t7: layers L=1 through L=6, with the similarity threshold decreasing from 0.9 down to 0.3.]
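A minimal sketch of the agglomerative procedure: the two nearest clusters are merged repeatedly until the smallest inter-cluster distance exceeds the threshold β. The single-link linkage is an assumption; the deck does not specify which linkage we used.

```python
def hierarchical_clustering(points, dist, beta=0.3):
    """Single-link agglomerative clustering (sketch).

    points: list of items; dist(a, b) gives the distance between two items.
    Returns a list of clusters, each a list of item indices.
    """
    clusters = [[i] for i in range(len(points))]

    def cluster_dist(a, b):  # single-link: distance of the closest member pair
        return min(dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = cluster_dist(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        if best[0] > beta:  # stop once everything is farther apart than beta
            break
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters
```

With distance defined as one minus cosine similarity, β = 0.3 corresponds to cutting the dendrogram at similarity 0.7.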
Experimental Results 
• TTGPKUICST1 [auto] 
• star clustering with tuned parameter σ = 0.7 and uniform tweet number N = 200 
• TTGPKUICST2 [auto] 
• hierarchical clustering with distance threshold β = 0.3 and score threshold α = 4.5 
• TTGPKUICST3 [manual] 
• hierarchical clustering with distance threshold β = 0.3 and manually selected N 
• TTGPKUICST4 [manual] 
• star clustering with tuned parameter σ = 0.7 and manually selected N 
21 
Run Recall RecallW Precision F1 F1W 
TTGPKUICST1 0.5221 0.7016 0.2682 0.3544 0.3881 
TTGPKUICST2 0.3698 0.5840 0.4571 0.4088 0.5128 
TTGPKUICST3 0.4849 0.6583 0.3635 0.4156 0.4684 
TTGPKUICST4 0.5174 0.6615 0.3664 0.4290 0.4716
Reference 
1. Y. Duan, L. Jiang, T. Qin, M. Zhou and H.-Y. Shum. An empirical study on learning to rank of tweets. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING '10), pages 295–303. Association for Computational Linguistics, 2010. 
2. T. Miyanishi, N. Okamura, X. Liu, K. Seki and K. Uehara. TREC 2011 Microblog Track Experiments at Kobe University. In Proceedings of the Twentieth Text REtrieval Conference (TREC 2011), 2011. 
3. Z. Han, X. Li, M. Yang, H. Qi and S. Li. Feature Analysis in Microblog Retrieval Based on Learning to Rank. Natural Language Processing and Chinese Computing, 2013. 
4. R. Qiang, F. Liang and J. Yang. Exploiting Ranking Factorization Machines for Microblog Retrieval. 
5. X. Wang and C. Zhai. Learn from web search logs to organize search results. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2007. 
6. J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2006. 
7. F. Liang, R. Qiang and J. Yang. Exploiting real-time information retrieval in the microblogosphere. JCDL 2012. 
22


Editor's Notes

  • #2: Good morning, everybody. This is Runwei, a third-year student at Peking University. Our report is titled Feature Extraction for Effective Microblog Search and Adaptive Clustering Algorithms for TTG.
  • #3: In this year's track we again focus on the ad hoc search task: given a query Q issued at time T, find the most relevant tweets about Q posted before T. I'll first describe our system overview, then present our feature extraction method, and finally show our experimental results.
  • #4: Microblog search poses many challenges. Because tweets are very short, we face a severe vocabulary-mismatch problem, so it is necessary to apply query expansion techniques. Tweets also contain many shortened URLs, which we should use to expand the original tweets. Moreover, the microblogosphere is full of babble, so we should define tweet quality and filter out non-informative tweets.
  • #5: A learning-to-rank framework can make full use of the various factors that affect microblog search, so our system is built on learning to rank.
  • #6: With the help of the provided common API, we first retrieve topic-related candidates with the original query. To improve the recall of the retrieved results, query expansion is used to obtain more candidates. We then generate features for each query-document pair. Using the relevance labels for the TREC'13 topics, we train a model with Ranking SVM. For the test set, we re-rank the tweets according to the learned ranking function and finally return the top 1000 tweets.
  • #7: The most important part of our system is feature extraction, which directly affects the final retrieval performance. We surveyed related work on learning to rank in microblog search and found that many features have been proven useful; however, the features used in previous work are actually very few.
  • #8: In traditional web search, by contrast, hundreds or even thousands of features are used in the full ranker. For example, Microsoft published the LETOR dataset for evaluating the effectiveness of different learning-to-rank algorithms, providing 136 features for each query-url pair. Following this, we also want to extract more effective features for microblog search.
  • #9: We generate features from three perspectives: query, document, and retrieval model. For retrieval models we use the traditional BM25, language model, and TFIDF models. Specifically, we use three different smoothing methods in the language modeling framework, so we have five retrieval methods in total. For BM25 we set k1 = 0.3 and b = 0.05.
  • #10: On the query side, we expand the query with three different techniques: PRF-based, web-based, and Freebase-based query expansion. Since external evidence is encouraged in this year's track for improving retrieval performance, we turn to Google and Freebase. Because the Freebase-related features did not perform well on our training data, we do not use them in our submitted runs.
  • #11: Take the query "Ron Weasley birthday" as an example. We retrieve the top five related web documents using the Google search API with a time limitation, recognize all verbs, nouns, and adjectives with the Python NLTK toolkit, and use the top five terms as a new query named WebQuery. We also retrieve the top tweet as the issue tweet with the common API and extract topical terms to generate the IssueQuery. Besides, we combine the original query and the issue query with an interpolation parameter of 0.6 (on the original) to form a MergeQuery. http://jwebnet.net/advancedgooglesearch.html#advSearchURI
  • #12: On the document side, we crawl all the URLs embedded in the tweets and extract the title information of each page; we name this corpus TopicInfo. We also merge OriginInfo and TopicInfo to generate the DocExInfo corpus. The texts are preprocessed in several ways: … Note that aside from the corpus retrieved by the API, we also crawl a copy of the official corpus and name it the Local corpus.
  • #13: We further compare the differences between the two corpora. …
  • #14: Besides the relevance-score features, we also define some traditional quality features to filter out babble.
  • #15: Here are the experimental results. The 130 features come from 4 queries × 3 document representations × 2 (with/without PRF) × 5 retrieval models, plus 10 more. From the table we can see that the learning-to-rank runs show significant superiority over the unsupervised run (PKUICST4). The API corpus candidates are better than the Local corpus candidates, and combining the API and Local corpora yields a small improvement in MAP over the run with only API corpus features.
  • #16: Tweet Timeline Generation (TTG) is a new task for this year's Microblog track with a putative user model as follows: "I have an information need expressed by a query Q at time t and I would like a summary that captures relevant information." 
  • #17: Determine how many results to return: systems are penalized for results that are too verbose. Detect (and eliminate) redundant tweets: this is equivalent to saying that systems must detect novelty; systems are penalized for returning tweets that contain redundant information.
  • #18: Given a query, we first retrieve 1000 results using the ad hoc search system. A candidate selection component then determines how many results to keep for clustering. With the selected tweets, we apply two clustering algorithms to get the final summarized tweets. The TREC'11-12 topics are used for parameter tuning.
  • #19: Three methods are used to determine …
  • #20: Now I'll describe the first clustering algorithm, star clustering. We first build a graph according to tweet similarity: each tweet is a vertex, and two tweets are connected by an edge when their cosine similarity exceeds a threshold σ. In each iteration we find the vertex with the highest degree, regard it as the star tweet, and add every tweet that shares an edge with it to that cluster. The cluster is then deleted from the graph, and we repeat this procedure until all stars are found.
  • #21: We also try the hierarchical clustering algorithm. At first each tweet is its own cluster; we compute the distance between every pair of clusters and merge the two nearest into a new cluster. We repeat this until the minimum cluster distance exceeds a tuning parameter β.
  • #22: Here are the experimental results for our submitted runs. TTGPKUICST1 uses … From the table, we observe that TTGPKUICST1 shows significant superiority in unweighted and weighted recall over the other runs, while it performs poorly in precision. TTGPKUICST2 performs best in precision. TTGPKUICST3 and TTGPKUICST4, which use a manually selected top-N parameter, have medium, stable performance across all metrics.