北京大学计算机科学技术研究所 
Institute of Computer Science & Technology Peking University 
Feature Extraction for Effective 
Microblog Search and Adaptive 
Clustering Algorithms for TTG 
PKUICST at TREC 2014 Microblog Track 
Chao Lv Feifan Fan Runwei Qiang Yue Fei Jianwu Yang 
qiangrw@pku.edu.cn 
Peking University
Ad hoc Search Task 
• Challenges 
• System Overview 
• Feature Extraction 
• Experimental Results 
2 
(Q1 , t1) 
(Q2 , t2) 
… 
(Qn , tn)
Challenges 
• Tweets are limited to 140 characters 
• Severe vocabulary-mismatch problem 
• It is necessary to apply query expansion techniques 
• Abundance of shortened URLs 
• We should offer ways to expand documents 
• Large quantities of pointless babble 
• Tweet quality should be defined to filter non-informative messages 
3
Motivations 
• Learning to rank can make full use of different models or factors in 
microblog search 
• different factors => different features 
4
System Framework 
5 
[Framework diagram: in training, TREC’13 topics over Tweets13 pass through Candidate Generation and Feature Generation; the Learning System combines the training-set features with relevance Labels to produce a Model. In testing, TREC’14 topics over Tweets14 follow the same pipeline, and the Ranking System applies the Model to output the Ranked Tweets.]
Feature Extraction 
Related Work in Microblog Search 
• Many features have been proven useful 
• Semantic features between query and document 
• Tweet quality features, e.g., link, retweet, and mention count/binary 
• An empirical study on learning to rank of tweets [1] (20) 
• Content relevance features (3) 
• Twitter’s specific features (6) 
• Account Authority Features (12) 
• TREC 2012 microblog track experiments at Kobe University [2] (8) 
• Feature Analysis in Microblog Retrieval Based on Learning to Rank [3] (15) 
• Exploiting Ranking Factorization Machines for Microblog Retrieval [4] (29) 
6
Feature Extraction 
Features for Traditional Web Search 
• Hundreds/Thousands of Features in the Full Ranker for Web Search 
• LETOR Dataset 
• A collection of benchmark datasets for research on learning to rank. 
• Each query-url pair is represented by a 136-dimensional vector. 
• Features such as: 
• covered query term number of body, anchor, title, url and whole document 
• PageRank 
• url click count 
• url dwell time 
• … 
7
Feature Extraction 
Retrieval Model 
Retrieval 
Model 
Query 
Document 
8 
• Okapi BM25 Score (BM25) 
• Language Model Score (LM) 
• LM.DIR 
• LM.JM 
• LM.ABS 
• TFIDF Model Score (TFIDF)
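As a concrete illustration of two of these scoring functions, here is a minimal sketch (not our actual implementation) of Okapi BM25 and Dirichlet-smoothed language model (LM.DIR) scoring over tokenized tweets. The defaults k1 = 0.3 and b = 0.05 follow the values mentioned in the speaker notes; the handling of terms unseen in the collection is an assumption.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, df, N, avgdl, k1=0.3, b=0.05):
    """Okapi BM25 score of a tokenized document for a query.

    df: term -> document frequency over the collection
    N: number of documents; avgdl: average document length.
    """
    tf = Counter(doc_terms)
    dl = len(doc_terms)
    score = 0.0
    for t in query_terms:
        if tf[t] == 0:
            continue
        # Robust (always non-negative) idf variant.
        idf = math.log((N - df.get(t, 0) + 0.5) / (df.get(t, 0) + 0.5) + 1)
        norm = tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * dl / avgdl))
        score += idf * norm
    return score

def lm_dirichlet_score(query_terms, doc_terms, coll_tf, coll_len, mu=2500):
    """Query log-likelihood under a Dirichlet-smoothed document model.

    coll_tf: term -> frequency over the whole collection; coll_len: total
    number of terms in the collection. The 0.5 pseudo-count for terms
    unseen in the collection is an illustrative assumption.
    """
    tf = Counter(doc_terms)
    dl = len(doc_terms)
    score = 0.0
    for t in query_terms:
        p_c = coll_tf.get(t, 0.5) / coll_len
        score += math.log((tf[t] + mu * p_c) / (dl + mu))
    return score
```

Both functions score higher for documents that actually contain the query terms, which is all the learning-to-rank features need: each (query variant, document variant, model) triple contributes one such score.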
Feature Extraction 
Query 
Query 
Document 
Retrieval 
Model 
9 
• Use different queries to better 
understand the user’s search 
intent 
• Original Query 
• Top Tweet Based Query 
• Web Based Query 
• Freebase Based Query 
• Whether to use PRF-based query expansion 
Feature Extraction 
Query Example 
10 
OriginQuery: Ron Weasley birthday 

Web Results 
1. Ronald Weasley - Harry Potter Wiki 
   Ronald Bilius Weasley was the sixth of seven children born to Arthur and Molly Weasley (née Prewett), and got his middle name from his uncle. He was born at … 
2. Ronald Weasley's seventeenth birthday - Harry Potter Wiki 
   Ronald Weasley's seventeenth birthday took place on 1 March, 1997. He received many gifts from … 
3. Drunk Ron Weasley Sings Happy Birthday To Harry Potter - YouTube 
   Jul 31, 2013. Drunk Ron Weasley (played by Simon Pegg) visits Jimmy Fallon to wish Harry Potter a happy birthday. Subscribe NOW to The Tonight Show … 
4. … 
5. … 

Issue Tweet 
It's Ron Weasley's birthday! The ginger who vomited slugs out from his mouth, happy birthday Ron 

Expanded queries (term weights estimated with RTRM [7]): 
WebQuery: weaslei 0.1064, ron 0.0745, potter 0.0532, birthdai 0.0532, ronald 0.0532 
IssueQuery: birthdai 0.2000, ron 0.2000, ginger 0.1000, weaslei 0.1000, vomit 0.1000 
MergeQuery: birthdai 0.2549, ron 0.2549, weaslei 0.1961, ginger 0.0588, vomit 0.0588 
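A MergeQuery combines two term distributions. A minimal sketch of such linear interpolation follows; the 0.6 weight on the first distribution matches the speaker notes, while the renormalization step is an illustrative assumption.

```python
def interpolate_queries(dist_a, dist_b, lam=0.6):
    """Linearly interpolate two term distributions: lam*A + (1-lam)*B.

    dist_a, dist_b: dicts mapping term -> weight.
    """
    terms = set(dist_a) | set(dist_b)
    merged = {t: lam * dist_a.get(t, 0.0) + (1 - lam) * dist_b.get(t, 0.0)
              for t in terms}
    z = sum(merged.values())  # renormalize in case the inputs were unnormalized
    return {t: w / z for t, w in merged.items()}
```

For example, interpolating {"a": 1.0} with {"b": 1.0} at lam = 0.6 yields {"a": 0.6, "b": 0.4}.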
Feature Extraction 
Document 
Document 
Retrieval 
Model 
Query 
• Plain Tweet Text (Origin) 
Say HappyBirthdayRonWeasley and share your creativity by 
submitting a drawing of Ron to celebrate 
• Topic Information from URL (Title) 
Pottermore Insider Happy birthday Ron Weasley 
• Merged Text (DocEx) 
Say HappyBirthdayRonWeasley and share your creativity 
by submitting a drawing of Ron to celebrate Pottermore 
Insider Happy birthday Ron Weasley 
11
Feature Extraction 
Document 
API 
• Get tweets with the common API 
• Saves crawling time 
• Uses general term statistics 
• Statistical index with Lucene 
Local 
• Local copy of the API corpus 
• Preprocessing before indexing 
• Non-English tweet removal with 
ldig 
• RT tweet removal 
• Dynamic index with Lemur 
12
Feature Extraction 
Quality Features 
• Quality Features 
1. Time Difference between Query Issue Time and Tweet Post Time 
2. Mention Count 
3. Hashtag Count 
4. Shortened URL Count 
5. Term Count of Text 
6. Length of Text 
13
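The six quality features above can be sketched as a small extraction function. The regular expressions and field names below are illustrative assumptions, not our exact implementation.

```python
import re
from datetime import datetime

def quality_features(tweet_text, post_time, query_time):
    """Compute the six tweet quality features listed above (sketch).

    post_time / query_time: datetime objects for the tweet post time
    and the query issue time.
    """
    return {
        "time_diff_secs": (query_time - post_time).total_seconds(),
        "mention_count": len(re.findall(r"@\w+", tweet_text)),
        "hashtag_count": len(re.findall(r"#\w+", tweet_text)),
        "url_count": len(re.findall(r"https?://\S+", tweet_text)),
        "term_count": len(tweet_text.split()),
        "char_length": len(tweet_text),
    }
```

These are appended to the relevance-score features to form the final feature vector for each query-tweet pair.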
Experimental Results 
• PKUICST1 [auto]: API-corpus features (4 queries × 3 document representations × 2 (with/without PRF) × 5 retrieval models + 10 = 130) 
• PKUICST2 [auto]: Local-corpus features (4 × 3 × 2 × 5 + 10 = 130) 
• PKUICST3 [auto]: both API- and Local-corpus features (120 + 120 + 10 = 250) 
• PKUICST4 [auto]: language model with web-based query expansion 
• PKUICST4[auto] Language Model, with web-based query expansion 
14 
Run MAP P@30 
PKUICST1 0.5834 0.7242 
PKUICST2 0.5648 0.7279 
PKUICST3 0.5863 0.7224 
PKUICST4 0.5422 0.6958
TTG Task 
• Challenges 
• System Overview 
• Candidate Selection 
• Clustering Algorithm 
• Experimental Results 
"I have an information need expressed by a 
query Q at time t and I would like a summary 
that captures relevant information." 
15
Challenges 
• Systems will need to address two challenges: 
• Determine how many results to return. 
• Detect (and eliminate) redundant tweets. 
16
System Overview 
17 
[System overview diagram: TREC’11-12 topics over Tweets11-12 (training set, with ground truth) and TREC’14 topics over Tweets14 (test set) feed the Ad Hoc Search System; its results pass through Candidate Selection and the Clustering Algorithm to produce the Summarized Tweets.]
Candidate Selection 
• Determine how many results to return 
• Uniform tweet number (N = 200) 
• Score threshold on the learning-to-rank score (score > 4.5, avg N = 89) 
• Manually selected tweet number N for each query (avg N = 225) 
18 
[Bar chart: manually selected N for each of the 55 queries, ranging from 0 to about 1200; the top-N ad hoc results are kept as candidates for clustering and the rest are removed.]
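The first two selection strategies can be sketched as follows; the function and parameter names are illustrative, with defaults matching the values above.

```python
def select_candidates(ranked, strategy="topn", n=200, threshold=4.5):
    """Pick clustering candidates from a ranked list of (tweet, score) pairs.

    'topn' keeps a uniform number N of tweets; 'score' keeps tweets whose
    learning-to-rank score exceeds the threshold (4.5 in our runs).
    """
    if strategy == "topn":
        return ranked[:n]
    if strategy == "score":
        return [(t, s) for t, s in ranked if s > threshold]
    raise ValueError("unknown strategy: %s" % strategy)
```

The third strategy (a manually chosen N per query) is the same top-N cut with a per-query value of n.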
Clustering Algorithm I 
Star Clustering 
19
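A minimal sketch of the star clustering procedure described in the notes: build a cosine-similarity graph with threshold σ, then repeatedly take the highest-degree remaining vertex as a star and peel off its neighbors as one cluster. The sparse term-vector representation is an assumption.

```python
import math

def star_clustering(vectors, sigma=0.7):
    """Star clustering over cosine similarities (sketch).

    vectors: list of dicts mapping term -> weight, one per tweet.
    Returns a list of (star_index, member_indices) pairs.
    """
    def cos(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        na = math.sqrt(sum(w * w for w in a.values()))
        nb = math.sqrt(sum(w * w for w in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    # Build the similarity graph: an edge wherever cosine >= sigma.
    n = len(vectors)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if cos(vectors[i], vectors[j]) >= sigma:
                adj[i].add(j)
                adj[j].add(i)

    # Iteratively extract the highest-degree vertex and its neighbors.
    remaining = set(range(n))
    clusters = []
    while remaining:
        star = max(remaining, key=lambda v: len(adj[v] & remaining))
        members = (adj[star] & remaining) | {star}
        clusters.append((star, sorted(members)))
        remaining -= members
    return clusters
```

Each star tweet then represents its cluster in the returned timeline, which removes redundant near-duplicates.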
Clustering Algorithm II 
Hierarchical Clustering 
20 
[Dendrogram over tweets t1–t7: layers L=1 through L=6, with the similarity threshold decreasing from 0.9 down to 0.3.]
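A minimal sketch of the agglomerative procedure: the two nearest clusters are merged repeatedly until the smallest inter-cluster distance exceeds the threshold β. The single-link linkage is an assumption; the deck does not specify which linkage we used.

```python
def hierarchical_clustering(points, dist, beta=0.3):
    """Single-link agglomerative clustering (sketch).

    points: list of items; dist(a, b) gives the distance between two items.
    Returns a list of clusters, each a list of item indices.
    """
    clusters = [[i] for i in range(len(points))]

    def cluster_dist(a, b):  # single-link: distance of the closest member pair
        return min(dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = cluster_dist(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        if best[0] > beta:  # stop once everything is farther apart than beta
            break
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters
```

With distance defined as one minus cosine similarity, β = 0.3 corresponds to cutting the dendrogram at similarity 0.7.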
Experimental Results 
• TTGPKUICST1 [auto] 
• star clustering with tuned parameter σ = 0.7 and uniform tweet number N = 200 
• TTGPKUICST2 [auto] 
• hierarchical clustering with distance threshold β = 0.3 and score threshold α = 4.5 
• TTGPKUICST3 [manual] 
• hierarchical clustering with distance threshold β = 0.3 and manually selected N 
• TTGPKUICST4 [manual] 
• star clustering with tuned parameter σ = 0.7 and manually selected N 
21 
Run Recall RecallW Precision F1 F1W 
TTGPKUICST1 0.5221 0.7016 0.2682 0.3544 0.3881 
TTGPKUICST2 0.3698 0.5840 0.4571 0.4088 0.5128 
TTGPKUICST3 0.4849 0.6583 0.3635 0.4156 0.4684 
TTGPKUICST4 0.5174 0.6615 0.3664 0.4290 0.4716
Reference 
1. Y. Duan, L. Jiang, T. Qin, M. Zhou and H.-Y. Shum. An empirical study on learning to rank of tweets. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING '10), pages 295–303. Association for Computational Linguistics, 2010. 
2. T. Miyanishi, N. Okamura, X. Liu, K. Seki and K. Uehara. TREC 2011 Microblog Track Experiments at Kobe University. In Proceedings of the Twentieth Text REtrieval Conference (TREC 2011), 2011. 
3. Z. Han, X. Li, M. Yang, H. Qi and S. Li. Feature Analysis in Microblog Retrieval Based on Learning to Rank. Natural Language Processing and Chinese Computing, 2013. 
4. R. Qiang, F. Liang and J. Yang. Exploiting Ranking Factorization Machines for Microblog Retrieval. 
5. X. Wang and C. Zhai. Learn from web search logs to organize search results. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2007. 
6. J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2006. 
7. F. Liang, R. Qiang and J. Yang. Exploiting real-time information retrieval in the microblogosphere. JCDL 2012. 
22


Editor's Notes

  • #2: Good morning, everybody. This is Runwei, a third-year student at Peking University. Our report is titled Feature Extraction for Effective Microblog Search and Adaptive Clustering Algorithms for TTG.
  • #3: In this year's track we again focus on the ad hoc search task: given a query Q issued at time T, find the most relevant tweets about Q posted before T. I'll first describe our system overview, then present our feature extraction method, and finally show our experimental results.
  • #4: Microblog search poses many challenges. Because tweets are very short, we face a severe vocabulary-mismatch problem, so it is necessary to apply query expansion techniques. Tweets also contain many shortened URLs, which we should use to expand the original tweets. Moreover, the microblogosphere is full of babble, so we should define tweet quality and filter out non-informative tweets.
  • #5: A learning-to-rank framework can make full use of the various factors that affect microblog search, so our system is built on learning to rank.
  • #6: With the help of the provided common API, we first retrieve topic-related candidates with the original query. To improve the recall of the retrieved results, query expansion is used to obtain more candidates. We then generate features for each query-document pair. Using the relevance labels for the TREC'13 topics, we train a model with Ranking SVM. For the test set, we re-rank the tweets according to the learned ranking function and finally return the top 1000 tweets.
  • #7: The most important part of our system is feature extraction, which directly affects the final retrieval performance. We surveyed related work on learning to rank in microblog search and found that many features have been proven useful; however, the features used in previous work are actually very few.
  • #8: In traditional web search, by contrast, hundreds or even thousands of features are used in the full ranker. For example, Microsoft published the LETOR dataset for evaluating the effectiveness of different learning-to-rank algorithms, providing 136 features for each query-url pair. Following this, we also want to extract more effective features for microblog search.
  • #9: We generate features from three perspectives: query, document, and retrieval model. For retrieval models we use the traditional BM25, language model, and TFIDF models. Specifically, we use three different smoothing methods in the language modeling framework, so we have five retrieval methods in total. For BM25 we set k1 = 0.3 and b = 0.05.
  • #10: On the query side, we expand the query with three different techniques: PRF-based, web-based, and Freebase-based query expansion. Since external evidence is encouraged in this year's track for improving retrieval performance, we turn to Google and Freebase. Because the Freebase-related features did not perform well on our training data, we do not use them in our submitted runs.
  • #11: Take the query "Ron Weasley birthday" as an example. We retrieve the top five related web documents using the Google search API with a time limitation, recognize all verbs, nouns, and adjectives with the Python NLTK toolkit, and use the top five terms as a new query named WebQuery. We also retrieve the top tweet as the issue tweet with the common API and extract topical terms to generate the IssueQuery. Besides, we combine the original query and the issue query with an interpolation parameter of 0.6 (on the original) to form a MergeQuery. http://jwebnet.net/advancedgooglesearch.html#advSearchURI
  • #12: On the document side, we crawl all the URLs embedded in the tweets and extract the title information of each page; we name this corpus TopicInfo. We also merge OriginInfo and TopicInfo to generate the DocExInfo corpus. The texts are preprocessed in several ways: … Note that aside from the corpus retrieved by the API, we also crawl a copy of the official corpus and name it the Local corpus.
  • #13: We further compare the differences between the two corpora. …
  • #14: Besides the relevance-score features, we also define some traditional quality features to filter out babble.
  • #15: Here are the experimental results. The 130 features come from 4 queries × 3 document representations × 2 (with/without PRF) × 5 retrieval models, plus 10 more. From the table we can see that the learning-to-rank runs show significant superiority over the unsupervised run (PKUICST4). The API corpus candidates are better than the Local corpus candidates, and combining the API and Local corpora yields a small improvement in MAP over the run with only API corpus features.
  • #16: Tweet Timeline Generation (TTG) is a new task for this year's Microblog track with a putative user model as follows: "I have an information need expressed by a query Q at time t and I would like a summary that captures relevant information." 
  • #17: Determine how many results to return: systems are penalized for results that are too verbose. Detect (and eliminate) redundant tweets: this is equivalent to saying that systems must detect novelty; systems are penalized for returning tweets that contain redundant information.
  • #18: Given a query, we first retrieve 1000 results using the ad hoc search system. A candidate selection component then determines how many results to keep for clustering. With the selected tweets, we apply two clustering algorithms to get the final summarized tweets. The TREC'11-12 topics are used for parameter tuning.
  • #19: Three methods are used to determine …
  • #20: Now I'll describe the first clustering algorithm, star clustering. We first build a graph according to tweet similarity: each tweet is a vertex, and two tweets are connected by an edge when their cosine similarity exceeds a threshold σ. In each iteration we find the vertex with the highest degree, regard it as the star tweet, and add every tweet that shares an edge with it to that cluster. The cluster is then deleted from the graph, and we repeat this procedure until all stars are found.
  • #21: We also try the hierarchical clustering algorithm. At first each tweet is its own cluster; we compute the distance between every pair of clusters and merge the two nearest into a new cluster. We repeat this until the minimum cluster distance exceeds a tuning parameter β.
  • #22: Here are the experimental results for our submitted runs. TTGPKUICST1 uses … From the table, we observe that TTGPKUICST1 shows significant superiority in unweighted and weighted recall over the other runs, while it performs poorly in precision. TTGPKUICST2 performs best in precision. TTGPKUICST3 and TTGPKUICST4, which use a manually selected top-N parameter, have medium, stable performance across all metrics.