Retrieval and Feedback Models for Blog Feed Search. SIGIR 2008, Singapore. Jonathan Elsas, Jaime Arguello, Jamie Callan & Jaime Carbonell, LTI/SCS/CMU
Outline: The task; Overview of Blogs & Blog Search; Challenges in Blog Search; Our approach; Retrieval Models; Query Expansion Models; Conclusion
Background
What is a Blog?
What is a Feed?
<xml>
  <feed>
    <entry>
      <author>Peter …</>
      <title>Good, Evil…</>
      <content>I’ve said…</>
    </entry>
    <entry>
      <author>Peter …</>
      <title>Agreeing…</>
      <content>Some peo…</>
    </entry>
    …
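To make the feed/entry structure concrete, here is a minimal Python sketch that splits a feed into its entries. The sample markup is a hypothetical, well-formed stand-in for the abbreviated markup on the slide, and the helper name `parse_entries` is an assumption for illustration; real Atom or RSS feeds use namespaces and more elements than shown here.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the abbreviated markup on the slide.
SAMPLE_FEED = """
<feed>
  <entry>
    <author>Author A</author>
    <title>First post</title>
    <content>Text of the first post.</content>
  </entry>
  <entry>
    <author>Author A</author>
    <title>Second post</title>
    <content>Text of the second post.</content>
  </entry>
</feed>
"""


def parse_entries(feed_xml):
    """Return one dict per <entry>, keyed by the tag name of each child element."""
    root = ET.fromstring(feed_xml.strip())
    return [
        {child.tag: (child.text or "").strip() for child in entry}
        for entry in root.iter("entry")
    ]


entries = parse_entries(SAMPLE_FEED)
for entry in entries:
    print(entry["title"], "by", entry["author"])
```

Each parsed entry can then be treated as a small document, while the whole list stands for the feed.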
Blog-Feed Correspondence: Blog ↔ Feed; Post ↔ Entry; HTML ↔ XML
Why are Blogs important? Technorati is currently tracking: > 112.8 million blogs; > 175,000 new blogs per day; > 1.6 million posts per day [http://www.technorati.com/about/]
The Task
Feed Search at TREC: ranking blogs/feeds (collections of posts) in response to a user’s query, [X]. “A relevant feed should have a principal and recurring interest in X” (TREC 2007 Blog Track, a.k.a. Blog Distillation)
Feed Search at TREC. Example queries: [Gardening] [Apple iPod] [Violence in Sudan] [Gun Control] [Food] [Wine]. Queries represent ongoing information needs and are frequently very general.
Challenges in Feed Search
Challenges in Feed Search 1. A feed is a collection of documents (entries over time). How does relevance at the entry level correspond to relevance at the feed level?
Challenges in Feed Search 2. Even a topical feed is topically diverse. [Figure: entries of a Space Exploration feed plotted by time and topic: NASA, China’s plans for the moon, shuttle launch, My dog, Mars rover, Boeing] Can we favor entries close to the central topic of the feed?
Challenges in Feed Search 3. Feeds are noisy: spam blogs, spam & off-topic comments.
Challenges in Feed Search 4. General & Ongoing Information Needs. [Mac]: “… post regularly about new products, features, or application software of Apple Mac computers.” [Music]: “… describing songs, biographies of musicians, musical styles and their influences of music on people are discussed.” [Wine]: “… such as tastings, reviews, food matching or pairing, and oenophile news and events.” [Food]: “… describing experiences eating cuisines, culinary delights, recipes, nutrition plans.”
Our Approach
Challenges: feeds are topically diverse, noisy collections; information needs are general & ongoing. Our Approach: Retrieval Models; Feedback Models.
Retrieval Models. Challenge: ranking topically diverse collections. Representation: feed vs. entry. Model the topical relationship between entries.
Large Document (Feed) Model [Figure: query [Q] run against the Feed Document Collection (one XML document per feed, containing its entries) to produce Ranked Feeds] Rank by Indri’s standard retrieval model [Metzler and Croft, 2004; 2005]
Large Document (Feed) Model. Advantages: a straightforward application of existing retrieval techniques. Potential pitfalls: large entries dominate a feed’s language model; ignores the relationship among entries. [Figure: a feed made up of entries of differing sizes]
Small Document (Entry) Model: document = entry. [Figure: query [Q] run against the Entry Document Collection to produce Ranked Entries; some rank aggregation function is then applied to produce Ranked Feeds] Rank by: [formula not preserved]
Small Document (Entry) Model [Formula not preserved; labeled components: Query Likelihood, Entry Centrality, Feed Prior (favors longer feeds)] ReDDE Federated Search Algorithm [Si & Callan, 2003]
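The slide's formula did not survive in the text; only its three labels did (feed prior, query likelihood, entry centrality). The sketch below shows one plausible way those pieces could combine: a feed prior multiplied by the sum over entries of query likelihood times centrality. That combination, the `log(1 + n)` form of the length prior, and the function names are all assumptions made for illustration, not the authors' confirmed model.

```python
import math


def feed_score(entry_likelihoods, prior="uniform"):
    """Score one feed from the query likelihoods P(Q|E) of its entries.

    Assumed form: prior(F) * sum_E P(Q|E) * P(E|F), with uniform centrality
    P(E|F) = 1 / (number of entries).
    """
    n = len(entry_likelihoods)
    if n == 0:
        return 0.0
    centrality = 1.0 / n  # uniform entry centrality
    # Assumed length prior: log(1 + n) so that one-entry feeds are not zeroed out.
    feed_prior = math.log(1 + n) if prior == "log" else 1.0
    return feed_prior * sum(p * centrality for p in entry_likelihoods)


def rank_feeds(likelihoods_by_feed, prior="uniform"):
    """Return feed ids sorted by descending score."""
    scored = {f: feed_score(ps, prior) for f, ps in likelihoods_by_feed.items()}
    return sorted(scored, key=scored.get, reverse=True)


# Toy data: feed id -> P(Q|E) for each of its entries (made-up numbers).
toy = {"a": [0.2, 0.2, 0.2, 0.2], "b": [0.3], "c": [0.05, 0.05]}
print(rank_feeds(toy, prior="uniform"))
print(rank_feeds(toy, prior="log"))
```

With the toy numbers, the length-based prior moves the longer feed "a" ahead of the single-entry feed "b", which is the effect the slide's "favors longer feeds" label describes.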
Entry Centrality. Uniform: [formula not preserved]. Geometric Mean: [formula not preserved]. [Figure: entries plotted by time and topic]
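Since the two centrality formulas are missing from the text, the following is a sketch under stated assumptions rather than the authors' definition. Uniform centrality gives every entry the same weight. For the geometric-mean variant, this sketch assumes each entry is scored by the geometric mean of the feed-level probabilities of its terms (the |E|-th root of the likelihood of generating the entry from the feed's language model), then normalized over the feed. The maximum-likelihood feed model and all function names are hypothetical.

```python
import math
from collections import Counter


def uniform_centrality(entries):
    """Every entry gets weight 1/n."""
    n = len(entries)
    return [1.0 / n] * n


def geometric_mean_centrality(entries):
    """entries: list of token lists. Returns weights that sum to 1.

    Assumption: phi(E, F) = (prod_{t in E} P(t|F)) ** (1 / |E|), where P(t|F)
    is the maximum-likelihood term probability over the whole feed.
    """
    feed_tf = Counter(t for entry in entries for t in entry)
    feed_len = sum(feed_tf.values())
    phis = []
    for entry in entries:
        if not entry:  # an empty entry carries no evidence
            phis.append(0.0)
            continue
        log_lik = sum(math.log(feed_tf[t] / feed_len) for t in entry)
        phis.append(math.exp(log_lik / len(entry)))
    total = sum(phis)
    if total == 0:
        return uniform_centrality(entries)
    return [phi / total for phi in phis]


feed = [
    ["nasa", "shuttle", "launch"],
    ["nasa", "mars", "rover"],
    ["my", "dog"],
]
weights = geometric_mean_centrality(feed)
print(weights)
```

In this toy feed the off-topic "my dog" entry shares no terms with the others, so it receives the lowest weight, which is the behavior asked for in "Can we favor entries close to the central topic of the feed?".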
Small Document (Entry) Model. Advantages: controls for differing entry length; models topical relationship among entries. Disadvantages: centrality computation is slow(er). Not only improves speed, also performance.
Retrieval Model Results
Retrieval Model Results: 45 queries from the TREC 2007 Blog Distillation Task; BLOG06 test collection, XML feeds only; 5-fold cross-validation for all retrieval model smoothing parameters.
Retrieval Model Results [Chart: Mean Average Precision of the Large Document (Feed) Model and the Small Document (Entry) Models, with Uniform and Log(Feed Length) priors; one labeled value: MAP 0.188]
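The results slides report Mean Average Precision (MAP). As a reminder of how that metric is usually computed, here is a minimal sketch; the function names and toy rankings are illustrative and are not taken from the talk. It assumes each ranking contains no duplicate ids.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average of precision@k over the ranks k at which a relevant item appears."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    # Relevant items that were never retrieved contribute zero.
    return total / len(relevant)


def mean_average_precision(runs):
    """runs: list of (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


runs = [
    (["a", "b", "c", "d"], {"a", "c"}),  # AP = (1/1 + 2/3) / 2
    (["x", "y"], {"y"}),                 # AP = 1/2
]
print(mean_average_precision(runs))
```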
Feedback Models. Challenge: noisy collection with general & ongoing information needs. Use a cleaner external collection for query expansion (Wikipedia), with an expansion technique designed to identify multiple query facets.
Query Expansion (PRF): [Q] → BLOG06 Collection → related terms from top K documents → [Q + Terms] [Lavrenko & Croft, 2001]
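The pseudo-relevance feedback loop (retrieve, take the top K documents, pull related terms, re-query) can be sketched as follows. This is a simplified weighting in the spirit of relevance models, where each term is weighted by its relative frequency in a document times that document's normalized retrieval score; it is not the exact estimator from the cited work. The function name, parameters, and toy documents are assumptions.

```python
from collections import Counter


def expansion_terms(ranked_docs, k=10, n_terms=5, stopwords=frozenset()):
    """Pick expansion terms from the top-k retrieved documents.

    ranked_docs: list of (retrieval_score, tokens), best first.
    Term weight (assumed form): sum over top-k docs of
    normalized_score(doc) * relative_frequency(term, doc).
    """
    top = [(score, tokens) for score, tokens in ranked_docs[:k] if tokens]
    total = sum(score for score, _ in top)
    if total == 0:
        return []
    weights = Counter()
    for score, tokens in top:
        for term, count in Counter(tokens).items():
            if term not in stopwords:
                weights[term] += (score / total) * (count / len(tokens))
    return [term for term, _ in weights.most_common(n_terms)]


docs = [
    (0.6, ["camera", "lens", "camera"]),
    (0.4, ["camera", "film"]),
    (0.1, ["spam", "spam", "spam"]),  # ranked below the cutoff k=2
]
terms = expansion_terms(docs, k=2, n_terms=3)
expanded_query = ["photography"] + terms
print(expanded_query)
```

If the top K documents are noisy, the same procedure will happily return noisy terms, which is the failure mode the talk's [Photography] example shows for the blog collection.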
Query Expansion Example: [Photography]. Ideal: digital photography, depth of field, photographic film, photojournalism, cinematography. PRF: photography, nude, erotic, art, girl, free, teen, fashion, women.
Feedback Model Results [Chart: Mean Average Precision for None vs. PRF]
Query Expansion (Wikipedia PRF): [Q] → Wikipedia [Diaz & Metzler, 2006] → related terms from top K documents → [Q + Terms] [Lavrenko & Croft, 2001] → BLOG06 Collection
Query Expansion Example: [Photography]. Ideal: digital photography, depth of field, photographic film, photojournalism, cinematography. PRF: photography, nude, erotic, art, girl, free, teen, fashion, women. Wikipedia PRF: photography, director, special, film, art, camera, music, cinematographer, photographic.
Feedback Model Results [Chart: Mean Average Precision for None, PRF, and Wiki. PRF]
Query Expansion (Wikipedia Link): [Q] → Wikipedia → related terms from link structure → [Q + Terms] → BLOG06 Collection
Wikipedia Link-Based Query Expansion
Wikipedia Link-Based Expansion: run query Q against Wikipedia. Relevance Set: top R = 100 articles. Working Set: top W = 1000 articles.
Wikipedia Link-Based Expansion. Relevance Set: top R = 100. Working Set: top W = 1000. Extract anchor text from links in the Working Set that point to the Relevance Set.
Wikipedia Link-Based Expansion. Relevance Set: top R = 500. Working Set: top W = 1000. Extract anchor text from links in the Working Set that point to the Relevance Set. Combines relevance and popularity. Relevance: an anchor phrase that links to a high-ranked article gets a high score. Popularity: an anchor phrase that links many times to mid-ranked articles also gets a high score.
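The link-based procedure can be sketched in a few lines. The slides describe the two properties of the score (relevance and popularity) but not its formula, so the rule used here is an assumption: each link from a Working Set article to a Relevance Set article adds `R - rank(target)` to its anchor phrase, with ranks starting at 0. The data layout, function name, and toy link graph are also hypothetical. The slides show R as 100 on some builds and 500 on another, so `r` is left as a parameter.

```python
from collections import Counter


def link_based_expansion(ranked_articles, links, r=100, w=1000, n_terms=20):
    """Score anchor phrases by the Relevance Set articles they point to.

    ranked_articles: article ids, best first (result of running Q on Wikipedia).
    links: dict mapping a source article id to a list of (anchor_text, target_id).
    Assumed score: each qualifying link adds (r - rank of its target).
    """
    working_set = ranked_articles[:w]
    rank_in_relevance_set = {a: i for i, a in enumerate(ranked_articles[:r])}
    scores = Counter()
    for source in working_set:
        for anchor, target in links.get(source, []):
            rank = rank_in_relevance_set.get(target)
            if rank is not None:  # ignore links that leave the Relevance Set
                scores[anchor.lower()] += r - rank
    return scores.most_common(n_terms)


# Toy example with R = 2 and W = 4.
ranked = ["A", "B", "C", "D"]
toy_links = {
    "A": [("Alpha", "A")],
    "C": [("alpha", "A")],
    "D": [("beta", "B"), ("beta", "B"), ("beta", "B"), ("gamma", "D")],
}
result = link_based_expansion(ranked, toy_links, r=2, w=4)
print(result)
```

In the toy graph, "alpha" scores well because it points at the top-ranked article (relevance), while "beta" scores well because it points three times at a lower-ranked one (popularity); "gamma" is dropped because its target lies outside the Relevance Set.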
Query Expansion Example: [Photography]. Wikipedia Link-Based: photography, photographer, digital photography, photographic, depth of field, feature photography, film, photographic film, photojournalism. PRF: photography, nude, erotic, art, girl, free, teen, fashion, women. Ideal: digital photography, depth of field, photographic film, photojournalism, cinematography.
Feedback Model Results [Chart: Mean Average Precision for None, PRF, Wiki. PRF, and Wiki. Link]
Conclusion. Feed search challenges: feeds are topically diverse, noisy collections, ranked against ongoing & general information needs. Novel retrieval models: ranking collections, sensitive to the topical relationship among entries. Novel feedback models: discover multiple query facets & robust to collection noise.
Thank You! Student Travel Grant funding from: ACM SIGIR, Amit Singhal, Microsoft Research
Entry Centrality GM Derivation [Equations not preserved; surviving labels: “Entry Generation Likelihood” and |E|]
Query Expansion Examples: [Music]. Wikipedia Expansion: Music, Folk music, Electronic music, Folk, Music video, World music, Ambient, Electronic, Country music. PRF: Music, Country, Download, Free, MP3, Mp3andmore, Lyric, Listen, Song.
Query Expansion Examples: [Scottish Independence]. Wikipedia Expansion: scotland, scottish parliament, scottish, scottish national party, wars of scottish independence, scottish independence, william wallace, glasgow, scottish socialist party. PRF: scotland, independence, party, convention, politics, snp, national, people, scot.
Query Expansion Examples: [Machine Learning]. Wikipedia Expansion: machine learning, learning, artificial intelligence, turing machine, machine gun, neural network, support vector machine, supervised learning, artificial neural network. PRF: learn, machine, credit, card, karaoke, journal, sex, model, sew.
Query Generality Characteristics. Query Length: BLOG 1.9 words; TB04 3.2 words; TB05 3.0 words. ODP Depth: BLOG 4.7 levels; TB04 5.2 levels; TB05 5.3 levels.
Relevance Set Cohesiveness (Relevance Set: top R = 100 Wikipedia articles). Cohesiveness = |L_in| / |L_in ∪ L_out|
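The slide gives the ratio but does not define L_in and L_out in the surviving text. The sketch below assumes L_in is the set of links from Relevance Set articles to other Relevance Set articles, and L_out is the set of links from Relevance Set articles to articles outside it. Under that reading the measure is the fraction of the set's outgoing links that stay inside the set. The function name and toy link list are hypothetical.

```python
def cohesiveness(relevance_set, links):
    """Fraction of links leaving Relevance Set articles that stay inside the set.

    relevance_set: iterable of article ids.
    links: iterable of (source_id, target_id) pairs.
    Assumed definitions: L_in = links with both ends in the set,
    L_out = links whose source is in the set and whose target is not.
    """
    members = set(relevance_set)
    l_in, l_out = set(), set()
    for source, target in links:
        if source in members:
            (l_in if target in members else l_out).add((source, target))
    total = len(l_in | l_out)
    return len(l_in) / total if total else 0.0


toy_links = [("A", "B"), ("B", "A"), ("A", "X"), ("Y", "A")]
value = cohesiveness({"A", "B"}, toy_links)
print(value)
```

A value near 1 would indicate a tightly interlinked Relevance Set; a value near 0 would indicate that most of its links point elsewhere.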
Relevance Set Cohesiveness
Is it the Queries? Feed Search Queries ≠ TB Adhoc Queries. But none of these measures predicts whether Wikipedia expansion helps…
