Compare & Contrast: Using the Web to Discover Comparable Cases for News Stories Jiahui Liu, Earl Wagner, Larry Birnbaum Northwestern University Intelligent Information Laboratory Reporter: Chieh-Chang Yang Date: 2007/07/05
Outline Introduction The problem and overview of the proposed solution Implementation of Compare & Contrast Evaluation Conclusion
Introduction Comparing and contrasting is an important strategy people employ to understand new situations and create solutions for new problems. In writing a news story, a reporter often compares the new event with other similar events to make it more familiar to readers, as well as to analyze any trends involving the new event.
Introduction In this paper, we present Compare & Contrast, a system that uses the Web to discover comparable cases for news stories: documents about similar situations but involving distinct entities. The system analyzes a news story given by the user and builds a model of the story. With the story model, the system dynamically discovers entities comparable to the main entity in the original story and uses these comparable entities as seeds to retrieve web pages about comparable cases.
 
Introduction To do this, the system identifies the generic situation keywords (terms and phrases describing the situation of the story) and the main entity (the person, place, or organization that the story is about). The system dynamically discovers comparable entities, that is, entities involved in situations similar to that of the main entity, based on the similarity of their word contexts. The system formulates queries to general web search engines by combining the comparable entities and the generic situation keywords to retrieve web pages about comparable cases.
The problem and overview of the proposed solution In describing an event, a reporter seeks to answer the five W questions: who, what, where, when and why. We note that the who, where and when of a news account are named entities. Actions and relationships among the actors, on the other hand, appear as non-named entity terms and give information about the what and why, which constitute the generic situation. Based on this insight, we propose an approach for finding comparable cases that uses the named entities and the non-named entity terms differently in modeling the story and retrieving information.
The problem and overview of the proposed solution In terms of our theory of comparable cases, documents about comparable cases should contain non-named entity terms similar to those of the original story, but different named entities. The system selects the top non-named entity terms and phrases as the generic situation keywords to query for relevant documents. However, whether two entities are comparable depends on the context of the situation, not just on their static similarities and distinctions.
 
 
Implementation of Compare & Contrast News story modeling Comparable entity discovery Page filtering to remedy noise on the Web
News story modeling When the URL of a news web page is sent to Compare&Contrast, the system retrieves the web page, extracts the news content from the page, splits it into sentences and tags the named entities. For named entity recognition, the system uses the web service provided by ClearForest Semantic Web Services (SWS), adopting its tags for person, organization, company, product and geographical location.
News story modeling For the non-named entity terms, stop words are removed and the rest of the terms are stemmed with a Porter Stemmer. To create a vector representation of the non-named entity terms, we used a modified TF-IDF model which incrementally decreases the importance of terms appearing later in the news article. To implement this idea, we assign scores to sentences according to their position.
News story modeling When computing the term frequency (TF) for the non-named entity terms, each occurrence of a term is given the score of the sentence it appears in, rather than being counted evenly. Moreover, the TF of terms in the title or the lead sentence is doubled. The IDF of terms is computed using an archive of 343,187 news stories collected from April 2004 to June 2006. We found that it would be beneficial to capture the important word groups in event descriptions, such as "open source" or "nuclear test." So the system treats stemmed bigrams that appear more than three times in the article as phrases. The TF of a phrase is computed in the same way as that of a unigram. The IDF of a phrase is the maximum of the IDFs of the two terms of the phrase.
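The position-weighted TF and the phrase rules above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the slides do not give the sentence scoring function, so the linear decay in `sentence_score` is an assumption, as is treating the title as a sentence with score 1.0 before doubling. All function names are hypothetical.

```python
from collections import Counter, defaultdict


def sentence_score(index, total):
    """Assumed linear decay: 1.0 for the first sentence, 0.5 for the last."""
    if total <= 1:
        return 1.0
    return 1.0 - 0.5 * index / (total - 1)


def weighted_tf(title_terms, sentences):
    """Position-weighted TF over stemmed terms.

    sentences is a list of term lists in document order. Each occurrence
    adds the score of its sentence; title and lead-sentence occurrences
    count double, as described on the slide.
    """
    tf = defaultdict(float)
    for term in title_terms:
        tf[term] += 2.0  # assumed title score of 1.0, doubled
    total = len(sentences)
    for index, terms in enumerate(sentences):
        score = sentence_score(index, total)
        if index == 0:
            score *= 2.0  # lead sentence is doubled
        for term in terms:
            tf[term] += score
    return dict(tf)


def find_phrases(sentences, min_count=4):
    """Bigrams appearing more than three times (that is, at least four)."""
    counts = Counter()
    for terms in sentences:
        counts.update(zip(terms, terms[1:]))
    return {bigram for bigram, count in counts.items() if count >= min_count}


def phrase_idf(idf, first, second):
    """IDF of a phrase is the maximum of the IDFs of its two terms."""
    return max(idf[first], idf[second])
```

With three sentences, the assumed decay gives sentence scores of 1.0, 0.75 and 0.5, so a term in the lead sentence contributes 2.0 while the same term in the last sentence contributes 0.5.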
News story modeling Unlike the non-named entity terms, the vector representation of named entities only uses TF. The TF for named entities is computed the same way as non-named entity terms. The named entity with the highest score is chosen as the main entity. A tricky issue in counting named entities is that different references to the same entity should be grouped together. In writing the news stories, journalists usually give the full name of the named entity at the first mention, but use some shortened form later.
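One simple way to group references to the same entity is sketched below. The slides do not say how the system resolves shortened forms, so the token-subset rule used here (a later mention is merged into the earliest name that contains all of its words) is an assumption made for illustration, and the function names are hypothetical.

```python
def group_mentions(mentions):
    """Sum scores of mentions that refer to the same entity.

    mentions is a list of (surface_form, score) pairs in document order.
    Assumed rule: a mention is merged into the first earlier name whose
    lower-cased tokens include all of the mention's tokens.
    """
    canonical = []  # names in order of first appearance
    totals = {}
    for name, score in mentions:
        tokens = set(name.lower().split())
        target = None
        for full_name in canonical:
            if tokens <= set(full_name.lower().split()):
                target = full_name
                break
        if target is None:
            canonical.append(name)
            totals[name] = 0.0
            target = name
        totals[target] += score
    return totals


def pick_main_entity(totals):
    """The named entity with the highest summed score is the main entity."""
    return max(totals, key=totals.get)
```

For example, the mentions "Acme Software Corporation" and a later "Acme" would be counted together, so the full name accumulates the score of both.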
Comparable entity discovery The system tries to retrieve a set of potentially relevant pages using the query: -"main entity" {generic situation keywords}. We defined the word context of a named entity as the terms and phrases co-occurring with the main entity. A word context vector is built for the main entity in the original news article by collecting all the terms and phrases co-occurring with the main entity.
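The query and the word context vector described above might look like this in code. It is a sketch under two assumptions that the slides do not confirm: co-occurrence is taken at the sentence level, and the vector holds raw co-occurrence counts. The function names and the input layout are hypothetical.

```python
from collections import Counter


def seed_query(main_entity, keywords):
    """Build the query -"main entity" {generic situation keywords}."""
    return '-"{}" {}'.format(main_entity, " ".join(keywords))


def word_context_vector(sentences, entity):
    """Count terms that share a sentence with the given entity.

    sentences is a list of (entities, terms) pairs, where entities is the
    set of named entities tagged in the sentence and terms is the list of
    its non-named entity terms and phrases.
    """
    vector = Counter()
    for entities, terms in sentences:
        if entity in entities:
            vector.update(terms)
    return vector
```

The leading minus sign excludes pages that mention the main entity, so the pages returned are more likely to describe the same kind of situation with other entities.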
Comparable entity discovery The potentially relevant pages are preprocessed. To compute the similarity of word context, each sentence in the potentially relevant pages is scored using the word context vector. Entities of the same type as the main entity in a potentially relevant page are considered candidates for comparable entities. The similarity score of an entity is computed from the scores of all the sentences it appears in.
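A minimal sketch of this scoring step follows. The slides do not give the exact formulas, so two choices here are assumptions: a sentence is scored by summing the context-vector weights of its terms, and an entity's similarity score is the sum of the scores of its sentences. Skipping the main entity itself is also an assumption. Names are hypothetical.

```python
from collections import defaultdict


def score_sentence(terms, context_vector):
    """Assumed sentence score: sum of context-vector weights of its terms."""
    return sum(context_vector.get(term, 0.0) for term in terms)


def sim_scores(page_sentences, context_vector, main_type, main_entity):
    """Score candidate entities on one potentially relevant page.

    page_sentences is a list of (terms, entities) pairs, where entities is
    a list of (name, entity_type) tuples tagged in that sentence. Only
    entities of the same type as the main entity are candidates.
    """
    scores = defaultdict(float)
    for terms, entities in page_sentences:
        sentence_value = score_sentence(terms, context_vector)
        for name, entity_type in set(entities):
            if entity_type == main_type and name != main_entity:
                scores[name] += sentence_value
    return dict(scores)
```

An entity that only appears in sentences sharing no terms with the context vector still becomes a candidate, but with a score of 0.0.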
Comparable entity discovery After this process, each potentially relevant page has a set of candidates for comparable entities with their simScores. We observed that some of these potentially relevant pages describe the same events, so it is beneficial to cluster articles about the same event together. We developed a method for clustering articles according to the overlap of the important entities in the articles. The simScores of the same named entity within a cluster are added together.
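The clustering step could be approximated as below. The slides only say that articles are clustered by the overlap of their important entities; the Jaccard measure, the 0.5 threshold, the greedy single-pass strategy and the use of all candidate entities as the "important" ones are assumptions chosen to keep the sketch short. Names are hypothetical.

```python
def jaccard(first, second):
    """Jaccard overlap of two sets; 0.0 when both are empty."""
    union = first | second
    return len(first & second) / len(union) if union else 0.0


def cluster_pages(pages, threshold=0.5):
    """Greedily cluster pages by entity overlap and sum simScores.

    pages is a list of dicts mapping entity name to simScore. A page joins
    the first cluster whose entity set overlaps enough with its own;
    otherwise it starts a new cluster.
    """
    clusters = []
    for scores in pages:
        entities = set(scores)
        for cluster in clusters:
            if jaccard(entities, cluster["entities"]) >= threshold:
                cluster["entities"] |= entities
                for name, value in scores.items():
                    cluster["scores"][name] = cluster["scores"].get(name, 0.0) + value
                break
        else:
            clusters.append({"entities": set(entities), "scores": dict(scores)})
    return clusters
```

Summing within a cluster means an entity that is reported by several articles about the same event accumulates a higher score than one mentioned once.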
Comparable entity discovery After Compare&Contrast identifies the comparable entities, the system uses them as seeds to retrieve comparable cases with the query: +"comparable entity" -"main entity" {generic situation keywords}. To verify the comparable cases, the system uses the search result counts returned by web search engines to calculate the relevance score of each comparable case. The benefit of taking the search result count into account is twofold. First, more hits mean that the case has wider public coverage. Second, the system may produce false comparable entities, but very few web pages describe a false comparable entity together with the generic situation keywords, so such entities receive low counts.
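The query format and the hit-count check can be sketched as follows. The slides do not give the relevance score formula, so this example simply drops candidates whose result count falls below a cutoff and ranks the rest by count; the cutoff value `min_hits` is an assumption. `count_hits` is a hypothetical callable standing in for whatever search engine interface is available, since no specific API is named in the slides.

```python
def case_query(comparable_entity, main_entity, keywords):
    """Build +"comparable entity" -"main entity" {generic situation keywords}."""
    return '+"{}" -"{}" {}'.format(
        comparable_entity, main_entity, " ".join(keywords)
    )


def verify_candidates(candidates, main_entity, keywords, count_hits, min_hits=10):
    """Keep candidates with enough search hits, most-covered first.

    count_hits is a caller-supplied function from a query string to the
    number of results reported by a search engine.
    """
    ranked = []
    for candidate in candidates:
        hits = count_hits(case_query(candidate, main_entity, keywords))
        if hits >= min_hits:
            ranked.append((candidate, hits))
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked
```

Passing the search function in as a parameter keeps the sketch testable with canned counts and avoids assuming any particular search service.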
Page filtering to remedy noise on the Web Within the set of potentially relevant pages, there are some irrelevant web pages. We identified two categories of harmful pages and developed a filter for each. Directory pages: the percentage of upper-case characters is often higher than in other pages, so pages with more than 28% upper-case characters are filtered out. Irrelevant pages: the summaries of the results returned by web search engines are compared with the vector representation of the generic situation. These two filters are executed before the Comparable Entity Identifier.
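Both filters are simple enough to sketch. The 28% threshold comes from the slide; everything else is an assumption: the upper-case percentage is computed over alphabetic characters only, the comparison between a result summary and the generic situation vector is done with cosine similarity, and the function names are hypothetical.

```python
import math


def uppercase_ratio(text):
    """Fraction of alphabetic characters that are upper case."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(ch.isupper() for ch in letters) / len(letters)


def is_directory_page(text, threshold=0.28):
    """Flag pages with more than 28% upper-case characters."""
    return uppercase_ratio(text) > threshold


def cosine(first, second):
    """Cosine similarity of two sparse term-weight dicts (assumed measure)."""
    dot = sum(weight * second.get(term, 0.0) for term, weight in first.items())
    norm_first = math.sqrt(sum(w * w for w in first.values()))
    norm_second = math.sqrt(sum(w * w for w in second.values()))
    if not norm_first or not norm_second:
        return 0.0
    return dot / (norm_first * norm_second)
```

A page made mostly of capitalized link labels such as "HOME NEWS SPORTS" exceeds the threshold easily, while ordinary news prose stays well below it.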
Evaluation We need a collection of news articles for which comparable cases can be found on the Web. However, some news articles describe and discuss general phenomena or specific events, so they have no focal entity and no comparable cases exist. We noticed that a moderate portion of news articles contain comparisons or contrasts inside the articles. These news articles are good candidates for our test cases. Moreover, the comparable cases mentioned in these news articles can be used as answer keys for evaluation.
 
Evaluation So we built a collection of test cases by gathering news articles mentioning comparable cases. We collected 40 news articles from various news websites and divided the test cases into three categories: politics, business, and technology. We conducted two experiments on the collection. (1) We ran Compare&Contrast on all the test cases and used the comparable cases given in the articles to evaluate the effectiveness of the technique for discovering comparable entities. (2) We randomly selected 6 test cases from the collection and invited 5 people to judge whether the web pages Compare&Contrast found are about relevant cases comparable to the original news stories.
Effectiveness of Comparable Entity Discovery The 40 news articles were fed into Compare&Contrast. For each news article, the system returned its top five, or fewer, comparable entities with scores above a certain threshold. If any of the returned comparable entities is mentioned in the comparison part of the test case, the test case is counted as a hit.
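The hit criterion above can be written down directly. This is an illustrative sketch: the threshold value is not given in the slides and is left as a parameter, entity names are matched by exact string equality (an assumption), and the function names are hypothetical.

```python
def top_entities(scored, threshold, top_k=5):
    """Up to top_k entity names whose score exceeds the threshold, best first."""
    kept = [(name, score) for name, score in scored if score > threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in kept[:top_k]]


def is_hit(scored, answer_key, threshold):
    """A test case is a hit if any returned entity is in its answer key."""
    return any(name in answer_key for name in top_entities(scored, threshold))


def hit_rate(test_cases, threshold):
    """Fraction of test cases counted as hits.

    test_cases is a list of (scored, answer_key) pairs, where scored is a
    list of (entity, score) tuples and answer_key is a set of entity names
    taken from the comparison part of the article.
    """
    hits = sum(is_hit(scored, key, threshold) for scored, key in test_cases)
    return hits / len(test_cases)
```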
Relevance of Retrieved Pages To evaluate the relevance of the web pages the system found, we randomly selected 6 test cases and invited 5 people to judge the relevance of the retrieved pages. The 5 users consisted of 2 graduate students, 2 staff members, and 1 undergraduate student. Of the 6 test cases, 4 were hit cases. For each test case, there were 5 comparable entities, and the system presented 3 or fewer retrieved web pages for each comparable entity, for a total of 85 web pages. Each web page receives 1 point from each user who judges that it contains a relevant comparable case for the original news story, and 0 points otherwise, so a page's score ranges from 0 to 5. The average score over all 85 web pages is 3.13, and we consider a web page with a score of 3 or more to be relevant.
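The scoring scheme above amounts to counting votes per page. The sketch below shows the arithmetic on made-up judgments; the votes are illustrative only and are not the study's data. Function names are hypothetical.

```python
def page_score(votes):
    """Number of judges who marked the page as a relevant comparable case."""
    return sum(1 for vote in votes if vote)


def summarize(judgments, cutoff=3):
    """Return (mean page score, number of pages scoring at least cutoff).

    judgments is a list with one entry per page; each entry is a list of
    booleans, one per judge.
    """
    scores = [page_score(votes) for votes in judgments]
    mean = sum(scores) / len(scores)
    relevant = sum(1 for score in scores if score >= cutoff)
    return mean, relevant
```

With 5 judges, the cutoff of 3 corresponds to a majority vote: a page counts as relevant when at least 3 of the 5 judges accept it.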
Conclusion In this paper, we analyzed the problem of finding comparable cases for news stories, characterizing comparable cases as those that share a situation similar to that of the original story but involve different entities. We presented Compare&Contrast, a system that finds comparable cases by automatically formulating queries based on the story model derived from the original article and by dynamically discovering the comparable entities involved in comparable cases. We plan to investigate a more sophisticated use of named entities and to develop a more intelligent query formulation mechanism to find better results.

