北京大学计算机科学技术研究所 
Institute of Computer Science & Technology Peking University 
Feature Extraction for Effective 
Microblog Search and Adaptive 
Clustering Algorithms for TTG 
PKUICST at TREC 2014 Microblog Track 
Chao Lv Feifan Fan Runwei Qiang Yue Fei Jianwu Yang 
qiangrw@pku.edu.cn 
Peking University
Ad hoc Search Task 
• Challenges 
• System Overview 
• Feature Extraction 
• Experimental Results 
2 
(Q1 , t1) 
(Q2 , t2) 
… 
(Qn , tn)
Challenges 
• Tweets are limited to 140 characters 
• Severe vocabulary-mismatch problem 
• It is necessary to apply query expansion techniques 
• Abundance of shortened URLs 
• We should expand documents with the content behind their URLs 
• Large quantities of pointless babble 
• Tweet quality should be modeled to filter out non-informative messages 
3
Motivations 
• Learning to rank can make full use of different models or factors in 
microblog search 
• different factors => different features 
4
System Framework 
5 
[Pipeline diagram] Training: TREC’13 Topics + Tweets13 → Candidate Generation → Feature Generation → Training Set (with relevance Labels) → Learning System → Model. 
Test: TREC’14 Topics + Tweets14 → Candidate Generation → Feature Generation → Test Set → Ranking System (applies the Model) → Ranked Tweets.
Feature Extraction 
Related Work in Microblog Search 
• Many features have been proven useful 
• Semantic features between query and document 
• Tweet quality features, e.g., link, retweet, and mention count/binary 
• An empirical study on learning to rank of tweets [1] (20 features) 
• Content relevance features (3) 
• Twitter-specific features (6) 
• Account authority features (12) 
• TREC 2011 microblog track experiments at Kobe University [2] (8) 
• Feature Analysis in Microblog Retrieval Based on Learning to Rank [3] (15) 
• Exploiting Ranking Factorization Machines for Microblog Retrieval [4] (29) 
6
Feature Extraction 
Features for Traditional Web Search 
• Hundreds/Thousands of Features in the Full Ranker for Web Search 
• LETOR Dataset 
• A package of benchmark data sets for research on learning to rank 
• Each query-URL pair is represented by a 136-dimensional feature vector 
• Features such as: 
• covered query term number of body, anchor, title, URL and whole document 
• PageRank 
• URL click count 
• URL dwell time 
• … 
7
Feature Extraction 
Retrieval Model 
Retrieval 
Model 
Query 
Document 
8 
• OKAPI BM25 Score (BM25) 
• Language Model Score (LM) 
• LM.DIR 
• LM.JM 
• LM.ABS 
• TFIDF Model Score (TFIDF)
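As an illustration of one such retrieval-model feature, here is a minimal sketch of an Okapi BM25 scorer over tokenized text. The parameter values k1 = 0.3 and b = 0.05 are those mentioned in the speaker notes; the tokenization and the exact IDF variant are assumptions, not the authors' implementation.

```python
import math

def bm25_score(query_terms, doc_terms, df, n_docs, avg_len, k1=0.3, b=0.05):
    """Okapi BM25 score of a tokenized document for a tokenized query.

    df maps a term to its document frequency in the corpus.
    """
    doc_len = len(doc_terms)
    score = 0.0
    for term in set(query_terms):
        tf = doc_terms.count(term)
        if tf == 0:
            continue  # absent terms contribute nothing
        idf = math.log(1 + (n_docs - df.get(term, 0) + 0.5) / (df.get(term, 0) + 0.5))
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))
    return score
```

Each of the five retrieval models (BM25, three LM smoothings, TFIDF) yields one such score per query-document pair, and each score becomes one feature dimension.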
Feature Extraction 
Query 
Query 
Document 
Retrieval 
Model 
9 
• Use different queries to better 
understand the user’s search 
intent 
• Original Query 
• Top Tweet Based Query 
• Web Based Query 
• Freebase Based Query 
• With or without PRF-based query expansion
Feature Extraction 
Query Example 
10 
Ron Weasley 
birthday 
Web Results 
1. Ronald Weasley - Harry Potter Wiki 
Ronald Bilius Weasley was the sixth of seven children born to Arthur and 
Molly Weasley (née Prewett), and got his middle name from his uncle. He 
was born at… 
2. Ronald Weasley's seventeenth birthday - Harry Potter Wiki 
Ronald Weasley's seventeenth birthday took place on 1 March, 1997. He 
received many gifts from… 
3. Drunk Ron Weasley Sings Happy Birthday To Harry Potter - YouTube 
Jul 31, 2013. Drunk Ron Weasley (played by Simon Pegg) visits Jimmy 
Fallon to wish Harry Potter a happy birthday. Subscribe NOW to The 
Tonight Show… 
4. … 
5. … 
Issue Tweet 
It s Ron Weasley s birthday The ginger who vomited slugs out from his 
mouth happy birthday Ron 
Expanded query term weights (stemmed): 
• WebQuery (built with RTRM [7] from the web results): weaslei 0.1064, ron 0.0745, potter 0.0532, birthdai 0.0532, ronald 0.0532 
• IssueQuery (from the issue tweet): birthdai 0.2000, ron 0.2000, ginger 0.1000, weaslei 0.1000, vomit 0.1000 
• MergeQuery (OriginQuery interpolated with IssueQuery): birthdai 0.2549, ron 0.2549, weaslei 0.1961, ginger 0.0588, vomit 0.0588
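A merged query of this kind can be produced by linearly interpolating two term distributions. A minimal sketch, assuming the interpolation weight of 0.6 on the original query mentioned in the speaker notes (the function name is illustrative):

```python
def merge_queries(p_origin, p_expand, lam=0.6):
    """Interpolate two term distributions: lam * origin + (1 - lam) * expansion,
    then renormalize so the merged weights sum to 1."""
    terms = set(p_origin) | set(p_expand)
    merged = {t: lam * p_origin.get(t, 0.0) + (1 - lam) * p_expand.get(t, 0.0)
              for t in terms}
    total = sum(merged.values())
    return {t: w / total for t, w in merged.items()}
```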
Feature Extraction 
Document 
Document 
Retrieval 
Model 
Query 
• Plain Tweet Text (Origin) 
Say HappyBirthdayRonWeasley and share your creativity by 
submitting a drawing of Ron to celebrate 
• Topic Information from URL (Title) 
Pottermore Insider Happy birthday Ron Weasley 
• Merged Text (DocEx) 
Say HappyBirthdayRonWeasley and share your creativity 
by submitting a drawing of Ron to celebrate Pottermore 
Insider Happy birthday Ron Weasley 
11
Feature Extraction 
Document 
API 
• Get tweets with common API 
• Save time for crawling 
• Use general term statistics 
• Statistical Index with Lucene 
Local 
• Local copy of the API corpus 
• Preprocessing before indexing 
• Non-English tweets removal with 
ldig 
• RT tweets removal 
• Dynamic Index with Lemur 
12
Feature Extraction 
Quality Features 
• Quality Features 
1. Time Difference between Query Issue Time and Tweet Post Time 
2. Mention Count 
3. Hashtag Count 
4. Shortened URL Count 
5. Term Count of Text 
6. Length of Text 
13
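A minimal sketch of extracting quality features 2–6 (plus the time difference) from raw tweet text. The regular expressions are rough approximations for illustration, not the exact definitions used in the runs:

```python
import re

def quality_features(tweet_text, query_time, post_time):
    """Tweet quality features: time difference, mention count,
    hashtag count, URL count, term count, and text length."""
    return {
        "time_diff": query_time - post_time,                       # feature 1
        "mention_count": len(re.findall(r"@\w+", tweet_text)),     # feature 2
        "hashtag_count": len(re.findall(r"#\w+", tweet_text)),     # feature 3
        "url_count": len(re.findall(r"https?://\S+", tweet_text)), # feature 4
        "term_count": len(tweet_text.split()),                     # feature 5
        "length": len(tweet_text),                                 # feature 6
    }
```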
Experimental Results 
• PKUICST1 [auto]: API corpus related features (4 queries × 3 documents × 2 PRF settings × 5 retrieval models + 10 = 130) 
• PKUICST2 [auto]: Local corpus related features (4 × 3 × 2 × 5 + 10 = 130) 
• PKUICST3 [auto]: both API and Local corpus related features (120 + 120 + 10 = 250) 
• PKUICST4 [auto]: Language Model with web-based query expansion (unsupervised) 
14 
Run MAP P@30 
PKUICST1 0.5834 0.7242 
PKUICST2 0.5648 0.7279 
PKUICST3 0.5863 0.7224 
PKUICST4 0.5422 0.6958
TTG Task 
• Challenges 
• System Overview 
• Candidate Selection 
• Clustering Algorithm 
• Experimental Results 
"I have an information need expressed by a 
query Q at time t and I would like a summary 
that captures relevant information." 
15
Challenges 
• Systems will need to address two challenges: 
• Determine how many results to return. 
• Detect (and eliminate) redundant tweets. 
16
System Overview 
17 
[Pipeline diagram] Training: TREC’11-12 Topics + Tweets11-12 → Ad Hoc Search System → Candidate Selection → Clustering Algorithm, tuned against Ground Truth (Training Set). 
Test: TREC’14 Topics + Tweets14 → Ad Hoc Search System → Candidate Selection → Clustering Algorithm → Summarized Tweets (Test Set).
Candidate Selection 
• Determine how many results to return 
• Unified Tweet Number (N=200) 
• Score Threshold (Learning to rank score) (score > 4.5, Avg N = 89) 
• Manually Selected Tweet Number N for Each Query (Avg N=225) 
18 
[Bar chart: manually selected tweet number N for each of the 55 queries, ranging from 0 up to about 1,200. The top-N ad hoc results per query are kept as candidates for clustering; the rest are removed.]
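The first two selection strategies above differ only in how the cutoff is chosen. A sketch of both (fixed N versus score threshold), with illustrative function and parameter names:

```python
def select_candidates(ranked, n_fixed=200, threshold=4.5, mode="threshold"):
    """Pick how many ad hoc results to keep for clustering.

    ranked: list of (tweet_id, score) pairs, sorted by score descending.
    mode "fixed": keep a uniform top-N for every query (N = 200 above).
    mode "threshold": keep tweets whose learning-to-rank score exceeds
    the threshold (4.5 above), so N varies per query.
    """
    if mode == "fixed":
        return ranked[:n_fixed]
    return [(tid, score) for tid, score in ranked if score > threshold]
```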
Clustering Algorithm I 
Star Clustering 
19
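A sketch of the star clustering procedure described in the speaker notes: build a similarity graph with an edge wherever pairwise similarity exceeds σ, then repeatedly take the highest-degree remaining vertex as a star center and its neighbors as one cluster. The similarity function is pluggable; cosine similarity over tweet term vectors is what the notes describe.

```python
def star_clustering(tweets, sim, sigma=0.7):
    """Star clustering: edge between i and j when sim > sigma; each
    iteration picks the highest-degree remaining vertex as a star and
    removes it together with its neighbors as one cluster."""
    remaining = set(range(len(tweets)))
    adj = {i: {j for j in remaining if j != i and sim(tweets[i], tweets[j]) > sigma}
           for i in remaining}
    clusters = []
    while remaining:
        center = max(remaining, key=lambda i: len(adj[i] & remaining))
        members = {center} | (adj[center] & remaining)
        clusters.append(members)
        remaining -= members
    return clusters
```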
Clustering Algorithm II 
Hierarchical Clustering 
20 
[Dendrogram: tweets t1–t7 are merged bottom-up across layers L=1 to L=6 as the similarity threshold decreases from 0.9 to 0.3.]
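A sketch of the agglomerative procedure: start from singleton clusters and keep merging the closest pair until the smallest inter-cluster distance exceeds β (0.3 in the runs below). Average-link distance is an assumption here; the slide does not specify the linkage.

```python
def hierarchical_clustering(items, dist, beta=0.3):
    """Agglomerative clustering with average-link distance, stopping
    once the closest pair of clusters is farther apart than beta."""
    clusters = [[i] for i in range(len(items))]

    def cdist(a, b):  # average pairwise distance between two clusters
        return sum(dist(items[i], items[j]) for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        d, ia, ib = min((cdist(a, b), ia, ib)
                        for ia, a in enumerate(clusters)
                        for ib, b in enumerate(clusters) if ia < ib)
        if d > beta:
            break  # nearest clusters are too far apart to merge
        clusters[ia] = clusters[ia] + clusters[ib]
        del clusters[ib]
    return clusters
```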
Experimental Results 
• TTGPKUICST1 [auto] 
• star clustering with tuned parameter σ = 0.7 and uniform tweet number N = 200 
• TTGPKUICST2 [auto] 
• hierarchical clustering method with distance threshold β = 0.3 and score threshold α = 4.5 
• TTGPKUICST3 [manual] 
• hierarchical clustering method with distance threshold β = 0.3 and manually selected N 
• TTGPKUICST4 [manual] 
• star clustering with tuned parameter σ = 0.7 and manually selected N 
21 
Run Recall RecallW Precision F1 F1W 
TTGPKUICST1 0.5221 0.7016 0.2682 0.3544 0.3881 
TTGPKUICST2 0.3698 0.5840 0.4571 0.4088 0.5128 
TTGPKUICST3 0.4849 0.6583 0.3635 0.4156 0.4684 
TTGPKUICST4 0.5174 0.6615 0.3664 0.4290 0.4716
Reference 
1. Y. Duan, L. Jiang, T. Qin, M. Zhou and H.-Y. Shum. An empirical study on learning to rank of tweets. In Proceedings of the 23rd International Conference on Computational Linguistics, COLING ’10, pages 295–303. Association for Computational Linguistics, 2010. 
2. T. Miyanishi, N. Okamura, X. Liu, K. Seki and K. Uehara. TREC 2011 Microblog Track Experiments at Kobe University. In Proceedings of the Twentieth Text REtrieval Conference, 2011. 
3. Z. Han, X. Li, M. Yang, H. Qi and S. Li. Feature Analysis in Microblog Retrieval Based on Learning to Rank. Natural Language Processing and Chinese Computing, 2013. 
4. R. Qiang, F. Liang and J. Yang. Exploiting Ranking Factorization Machines for Microblog Retrieval. 
5. X. Wang and C. Zhai. Learn from web search logs to organize search results. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2007. 
6. J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2006. 
7. F. Liang, R. Qiang and J. Yang. Exploiting real-time information retrieval in the microblogosphere. JCDL 2012. 
22
北京大学计算机科学技术研究所 
Institute of Computer Science & Technology Peking University 
Feature Extraction for Effective 
Microblog Search and Adaptive 
Clustering Algorithms for TTG 
PKUICST at TREC 2014 Microblog Track 
Chao Lv Feifan Fan Runwei Qiang Yue Fei Jianwu Yang 
qiangrw@pku.edu.cn 
Peking University


Editor's Notes

  • #2: Good morning, everybody. This is Runwei. I’m a third-year student at Peking University. Our report is titled Feature Extraction for Effective Microblog Search and Adaptive Clustering Algorithms for TTG.
  • #3: During this year’s track, we still focus more on the ad hoc search task: given a query Q at time T, find the most relevant tweets about Q posted before T. I’ll first describe our system overview, then present our feature extraction method, and finally show our experimental results.
  • #4: There are many challenges in microblog search. Because tweets are very short, we face a severe vocabulary-mismatch problem, so it is necessary to apply query expansion techniques. There are also many shortened URLs, which we should use to expand the original tweets. Moreover, there is a great deal of babble in the microblogosphere, so we should model tweet quality and filter out non-informative tweets.
  • #5: The learning to rank framework can make full use of the various factors that affect microblog search, so our system is built on learning to rank.
  • #6: With the provided common API, we first retrieve topic-related candidates using the original query. To improve the recall of the retrieved results, query expansion is used to obtain more candidates. Then we generate features for each query-document pair. With the labels of the TREC ’13 topics, we train a model with Ranking SVM. For the test set, we re-rank the tweets according to the learned ranking function and finally return the top 1000 tweets.
  • #7: The most important part of our system is feature extraction, which directly affects the final retrieval performance. We investigated much related work on learning to rank in microblog search and found that many features have been proven useful. However, the features used in previous work are actually quite few.
  • #8: In traditional web search, by contrast, hundreds or even thousands of features are used in the full ranker. For example, Microsoft published a dataset for evaluating the effectiveness of different learning to rank algorithms, providing 136 features for each query-URL pair. Following this, we also want to extract more effective features for microblog search.
  • #9: We generate features from three perspectives: query, document and retrieval model. For retrieval models, we use the traditional BM25, language model and TFIDF models. Specifically, we use three different smoothing methods in the language modeling framework, giving five retrieval methods in total. (BM25 parameters: k1 = 0.3 and b = 0.05.)
  • #10: From the query side, we expand the query using three different query expansion techniques: PRF-based, web-based and Freebase-based. Since external evidence is encouraged in this year’s track for improving retrieval performance, we turn to Google and Freebase. As the Freebase-related features do not perform very well on our training data, we do not use them in our submitted runs.
  • #11: Take the query “Ron Weasley birthday” as an example. We search for the top five related web documents using the Google search API with a time limitation, recognize all verbs, nouns and adjectives with the Python NLTK toolkit, and use the top five terms as a new query named WebQuery. We also retrieve the top tweet as the issue tweet with the common API, then extract topical terms to generate the IssueQuery. Besides, we combine the original query and the issue query with an interpolation parameter of 0.6 (on the original) to form a MergeQuery. http://jwebnet.net/advancedgooglesearch.html#advSearchURI
  • #12: From the document side, we crawl all the URLs embedded in the tweets and extract the title information of each page; we name this corpus TopicInfo. We also merge the OriginInfo and TopicInfo to generate the DocExInfo corpus. The texts are preprocessed in several ways: … Note that aside from the corpus retrieved by the API, we also crawl a copy of the official corpus and name it the Local corpus.
  • #13: We further compare the differences between the two corpora. …
  • #14: Besides the relevance-score-related features, we also define some traditional quality features to filter babble.
  • #15: Here are the experimental results (feature dimensions: 4 queries × 3 documents × 2 PRF settings × 5 retrieval models). From the table we can see that the learning to rank runs show significant superiority over the unsupervised run (PKUICST4). The API corpus is better than the Local corpus as a candidate source, and combining the API and Local corpora gives a small improvement in MAP over the run with only API corpus features.
  • #16: Tweet Timeline Generation (TTG) is a new task for this year's Microblog track with a putative user model as follows: "I have an information need expressed by a query Q at time t and I would like a summary that captures relevant information." 
  • #17: Determine how many results to return : that is, systems will be penalized for results that are too verbose. Detect (and eliminate) redundant tweets: this is equivalent to saying that systems must detect novelty. Systems will be penalized for returning tweets that contain redundant information.
  • #18: Given a query, we first obtain 1000 results using the ad hoc search system. A candidate selection component then determines how many results to return for clustering. With the selected tweets, we apply two clustering algorithms to get the final summarized tweets. Topics from TREC ’11-12 are used for parameter tuning.
  • #19: Three methods are used to determine …
  • #20: Now I’ll describe the first clustering algorithm: star clustering. We first build a graph according to tweet similarity. Each tweet is a vertex; when two tweets’ cosine similarity is larger than a threshold σ, there is an edge between them. In each iteration, we find the vertex with the highest degree, regard it as the star tweet, and add every tweet that shares an edge with it to that cluster. The cluster is then deleted from the graph, and we repeat this procedure until all stars are found.
  • #21: We also try the hierarchical clustering algorithm. Initially each tweet is its own cluster; we compute the distance between clusters and merge the two nearest clusters into a new one. We repeat this until the minimum cluster distance exceeds the tuning parameter β.
  • #22: Here are the experimental results for our submitted runs. PKUICST1 uses … From the table, we observe that TTGPKUICST1 shows significant superiority in unweighted and weighted recall over the other runs, while it performs poorly in precision. TTGPKUICST2 performs better in precision than the other runs. TTGPKUICST3 and TTGPKUICST4, which use a manually selected top-N parameter, have stable, mid-range performance across all metrics.