Grouping Search-Engine Returned Citations for Person-Name Queries
Reema Al-Kamha, David W. Embley, WIDM’04
Brigham Young University
Advisor: Chia-Hui Chang
Student: Kuan-Hua Huo
Date: 2010-6-11

Outline
- Introduction
- Related work
- A multi-faceted approach
- Experimental results
- Concluding remarks

Introduction
- Citations: the results that a search engine returns for a specific query.
- A citation contains the title of the web page found, the text below the title that includes the keywords of the query, and the URL of the web page found.
- Example: using Google with the query “Kelly Flanagan”.

Introduction (cont.)
In the output:
- retain the basic search-engine returned citations;
- within each group, maintain the search engine ranking order;
- among groups, maintain the relative order of citations as originally presented by the search engine.

Related work
- A cross-document coreference occurs when the same person, place, event, or concept is discussed in more than one text source.
- We would need to find a context, which is not straightforward for the mixture of structured, semistructured, and unstructured documents on the web.
- But we found that neither produced satisfactory results.

Related work (cont.)
- Object identification refers to the task of deciding that two observed objects are in fact one and the same object.
- Object identification techniques compare the objects' shared attributes in order to identify matching objects.
- Our technique involves links and page similarity in addition to attributes.

Related work (cont.)
- One technique that is typically used to resolve object identity is probabilistic modeling.
- Probabilistic modeling compares objects based on shared attributes and uses appearance probability to determine the similarity between objects.

A multi-faceted approach
- Each facet represents a relevant aspect of the problem space about which we can gather evidence that two citations reference the same person or different persons.
- The facets are attributes about a person, links within and among sites, and page similarity.

A multi-faceted approach (cont.)
- Attributes
- Links
- Page similarity
- Confidence Matrix Construction
- Final Confidence Matrix
- Grouping Algorithm

Attributes
Attributes that appear in the web pages of citations:
- phone number
- email address
- state (from a list that contains all state names and their abbreviations)
- city
- zip code

Attributes (cont.)
- We only extract state, city, and zip code values in an address context.
- To extract a city, we extract all strings that match the regular expression ([A-Z](\w)+( )?){1,3} and satisfy the context specification: the string is followed by an optional comma, white space, and a state name.

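The city rule above can be sketched in Python roughly as follows. This is only an illustration under assumptions: the `STATES` list is a short hypothetical stand-in for the full list of state names and abbreviations, and the function name `extract_cities` is not from the paper.

```python
import re

# Hypothetical, abbreviated state list; the slides use a complete list
# of state names and their abbreviations.
STATES = ["Utah", "UT", "Texas", "TX", "New York", "NY"]

# The slide's city pattern: one to three capitalized words.
CITY = r"(?:[A-Z]\w+ ?){1,3}"
# Longer names first so that "New York" is tried before "NY".
STATE = "|".join(re.escape(s) for s in sorted(STATES, key=len, reverse=True))
# Context specification: city, optional comma, white space, state name.
CITY_IN_CONTEXT = re.compile(rf"\b({CITY}),?\s+(?:{STATE})\b")

def extract_cities(text):
    """Return city candidates that appear in an address context."""
    return [m.group(1).strip() for m in CITY_IN_CONTEXT.finditer(text)]
```

For instance, `extract_cities("Office in Provo, UT 84602")` yields `["Provo"]`, while capitalized words that are not followed by a state name are ignored.
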
Links
- If two URLs share a common host, we can be reasonably confident that they refer to the same person.

Links (cont.)
- Popular hosts are an exception: it is common for two citations on a popular host such as www.yahoo.com to refer to two different persons who have the same name.
- To find the number of pages that point to a host, we query link:siteURL in Google, which shows all pages that point to that URL and gives their count.

Links (cont.)
- We determined empirically that a host h is popular for person-name queries if more than 400 pages point to h.
- If two citations c1 and c2 returned for a person-name query share the same non-popular host, or if the URL of citation c1 has the same non-popular host as one of the URLs that belongs to the web page referenced by citation c2, then we can be confident about grouping c1 and c2 together for the same person.

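A minimal sketch of the two host checks, assuming the in-link counts have already been collected. The dictionary `inlink_counts` is a hypothetical stand-in for the counts returned by the `link:` queries, the function names are illustrative, and treating a host with an unknown count as non-popular is an assumption of this sketch.

```python
from urllib.parse import urlparse

POPULARITY_THRESHOLD = 400  # from the slides: popular if more than 400 pages point to the host

def host_of(url):
    """Host part of a URL, lower-cased."""
    return urlparse(url).netloc.lower()

def is_non_popular(host, inlink_counts):
    # Assumption: a host without a recorded count is treated as non-popular.
    return host != "" and inlink_counts.get(host, 0) <= POPULARITY_THRESHOLD

def share_non_popular_host(url1, url2, inlink_counts):
    """The two citation URLs are on the same non-popular host."""
    h1 = host_of(url1)
    return h1 == host_of(url2) and is_non_popular(h1, inlink_counts)

def links_to_same_host(url1, page2_urls, inlink_counts):
    """The URL of one citation shares a non-popular host with a URL found on the other page."""
    h1 = host_of(url1)
    return is_non_popular(h1, inlink_counts) and any(host_of(u) == h1 for u in page2_urls)
```

With made-up counts such as `{"www.yahoo.com": 100000, "people.example.edu": 35}`, two citations on `people.example.edu` pass the check, while two citations on `www.yahoo.com` do not.
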
Page similarity
- This facet uses the similarity between the web pages referenced by two returned citations.
- If two web pages refer to the same person, there are specific words associated with that person.
- Example: David Embley, who is a professor and a co-director of the Data Extraction Research Group in the Computer Science Department at Brigham Young University.
- Example: pages about Sandra Rogers contain Lessons from the Light, a book she wrote.

Page similarity (cont.)
- We consider pairs of words that start with a capital letter and that are either adjacent or separated by a connector (and, or, but), by a preposition which may be followed by an article (a, an, the), or by a single capital letter followed by a dot.
- Adjacent cap-word pairs: Cap-Word (Connector | Preposition (Article)? | (Capital-Letter Dot))? Cap-Word

Page similarity (cont.)
- We eliminate uninformative pairs by constructing a stop-word list, which is a list of frequently appearing adjacent cap-word pairs.
- We collected approximately 10,000 web documents taken at random from the Open Directory Project (DMOZ).
- We constructed all adjacent cap-word pairs.
- We considered all pairs with a frequency greater than two to be stop words.

Page similarity (cont.)
- We use the number of shared adjacent cap-word pairs as an indicator of the similarity between two web pages.
- In particular, we consider whether two web pages share exactly one, exactly two, exactly three, or four or more adjacent cap-word pairs.
- The greater the number of shared adjacent cap-word pairs, the greater the similarity between the pages.

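The cap-word pattern and the Share1 to Share≥4 buckets might be sketched as below. Several details are assumptions, not taken from the paper: the preposition list, the definition of a cap-word as a capitalized token of at least two letters, the tokenizer, and representing a pair by its two cap-words only.

```python
import re

CONNECTORS = {"and", "or", "but"}   # from the slides
ARTICLES = {"a", "an", "the"}       # from the slides
PREPOSITIONS = {"of", "in", "at", "from", "for", "on", "to", "with", "by"}  # assumed list

TOKEN = re.compile(r"[A-Z]\.|[A-Za-z]+|[^\sA-Za-z]")  # initials, words, other symbols
CAP_WORD = re.compile(r"[A-Z][A-Za-z]+")              # assumed cap-word definition
INITIAL = re.compile(r"[A-Z]\.")                      # single capital letter plus dot

def cap_word_pairs(text):
    """Set of (first, second) cap-word pairs matching the slide's pattern."""
    toks = TOKEN.findall(text) + ["", "", ""]  # padding avoids index checks
    pairs = set()
    for i, tok in enumerate(toks[:-3]):
        if not CAP_WORD.fullmatch(tok):
            continue
        j = i + 1
        if toks[j] in CONNECTORS or INITIAL.fullmatch(toks[j]):
            j += 1                      # skip one connector or initial
        elif toks[j] in PREPOSITIONS:
            j += 1                      # skip the preposition ...
            if toks[j] in ARTICLES:
                j += 1                  # ... and an optional article
        if CAP_WORD.fullmatch(toks[j]):
            pairs.add((tok, toks[j]))
    return pairs

def share_bucket(page1, page2, stop_pairs=frozenset()):
    """Name of the Share feature for two page texts, or None if nothing is shared."""
    shared = (cap_word_pairs(page1) & cap_word_pairs(page2)) - set(stop_pairs)
    n = len(shared)
    if n == 0:
        return None
    return f"Share{n}" if n <= 3 else "Share>=4"
```

Here `stop_pairs` stands for the stop-word list built from the background documents; how its frequencies are counted is not specified on the slides, so it is passed in rather than computed.
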
Confidence Matrix Construction
- Construct one confidence matrix for each facet: attributes, links, and page similarity.
- Each confidence matrix is an upper triangular matrix over all pairs of the n returned citations C1, C2, ..., Cn.
- Element Cij (i < j) of the confidence matrix is the confidence that Ci and Cj refer to the same person.

Confidence Matrix Construction (cont.)
- In order to compute the conditional probabilities that represent confidence values, we construct a training set.
- We entered each name as a query for Google, and we collected the first 50 returned citations for each name.
- Each name yields 49 + 48 + ... + 2 + 1 = 1,225 comparison pairs.
- Total number of comparisons: 9 × 1,225 = 11,025.

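The pair counts can be checked with a few lines of Python. The sketch enumerates the (i, j) positions with i < j of an upper triangular matrix; the function name `citation_pairs` is illustrative, and reading the factor 9 as the number of names in the training set is an inference from the slide's arithmetic.

```python
from itertools import combinations

def citation_pairs(n):
    """All index pairs (i, j) with i < j over citations numbered 1..n."""
    return list(combinations(range(1, n + 1), 2))

pairs_per_name = len(citation_pairs(50))   # 49 + 48 + ... + 1
total_comparisons = 9 * pairs_per_name     # factor 9 as given on the slide
```
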
Confidence Matrix Construction (cont.)
Values recorded for each pair of citations:
- Same Person: whether the names are for the same person
- Phone: whether the web pages to which the citations link contain the same phone number
- Email: whether the web pages to which the citations link contain the same email address
- Zip: whether the web pages to which the citations link contain the same address zip code
- City: whether the web pages to which the citations link contain the same address city
- State: whether the web pages to which the citations link contain the same address state
- Host1: whether the citations have URLs in the same host
- Host2: whether the URL of one citation has the same host as one of the URLs that belongs to the web page of the other citation
- Share1: whether the web pages referenced by the citations share exactly one adjacent cap-word pair
- Share2: whether the web pages referenced by the citations share exactly two adjacent cap-word pairs
- Share3: whether the web pages referenced by the citations share exactly three adjacent cap-word pairs
- Share≥4: whether the web pages referenced by the citations share four or more adjacent cap-word pairs

Confidence Matrix Construction (cont.)
- For the attribute facet, the web pages referenced by the citations have one of these attributes or any combination of these attributes.
- For example, we estimate P(Same Person = “Yes” | Email = “Yes”), which is the probability that two citations refer to the same person knowing that the web pages referenced by them have the same email address, by dividing the number of citation pairs that are related to the same person and have the same email address by the number of citation pairs that have the same email address in the training set.

Confidence Matrix Construction (cont.)
- For pairs, triples, quadruples, and quintuples of attributes, we also compute conditional probabilities.
- For example, we estimate P(Same Person = “Yes” | City = “Yes” and State = “Yes”), which is the probability that two citations refer to the same person knowing that the web pages referenced by them share the same address city and state, by dividing the number of citation pairs that are related to the same person and have the same address city and state by the number of citation pairs that share the same address city and state in the training set.

Confidence Matrix Construction (cont.)
- For the link facet, the URLs of the citations share the same non-popular host.
- For example, we estimate P(Same Person = “Yes” | Host1 = “Yes” and Host1 is non-popular) by dividing the number of citation pairs that are related to the same person and have the same non-popular host by the number of citation pairs that share a common, non-popular host.

Confidence Matrix Construction (cont.)
- For the page similarity facet, the web pages referenced by the citations share exactly one, two, three, or four or more adjacent cap-word pairs.
- For example, we estimate P(Same Person = “Yes” | Share2 = “Yes”) by dividing the number of citation pairs that are related to the same person and share two cap-word pairs by the number of citation pairs that share two cap-word pairs.

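The conditional probabilities described above are ratios of counts over the training pairs. A minimal sketch, assuming each training pair is stored as a dictionary of boolean features: the field names and the `toy_pairs` data are made up for illustration, and conditioning only on the listed features (leaving the others unconstrained) is an assumption.

```python
def estimate_confidence(training_pairs, evidence):
    """Estimate P(Same Person = Yes | evidence) as a ratio of counts.

    evidence maps feature names to required values, e.g. {"email": True}.
    Returns None when no training pair matches the evidence.
    """
    matching = [p for p in training_pairs
                if all(p[name] == value for name, value in evidence.items())]
    if not matching:
        return None
    same = sum(1 for p in matching if p["same_person"])
    return same / len(matching)

# Made-up toy data, not results from the paper.
toy_pairs = [
    {"same_person": True,  "email": True,  "city": True},
    {"same_person": True,  "email": True,  "city": False},
    {"same_person": True,  "email": True,  "city": False},
    {"same_person": False, "email": True,  "city": True},
    {"same_person": False, "email": False, "city": True},
]
```

On this toy data, `estimate_confidence(toy_pairs, {"email": True})` counts four pairs with a shared email address, three of which are the same person, and returns 0.75.
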
Final Confidence Matrix
- Generate the final confidence matrix by combining the confidence matrices using Stanford certainty theory.
- Stanford certainty theory defines a confidence measure and generates some simple rules for combining independent evidence.

Final Confidence Matrix (cont.)
- CF(E1): the certainty factor associated with evidence E1 for some observation B
- CF(E2): the certainty factor associated with evidence E2 for the same observation B
- The new certainty factor CF of B (called the compound certainty factor of B) is CF(E1) + CF(E2) - (CF(E1) * CF(E2)).

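The combination rule is easy to try out in code. The sketch below applies the formula from the slide to two certainty factors and folds it over a list, which is one plausible way to combine the three facet values for a single citation pair; the function names are illustrative.

```python
from functools import reduce

def combine_cf(cf1, cf2):
    """Compound certainty factor: CF(E1) + CF(E2) - CF(E1) * CF(E2)."""
    return cf1 + cf2 - cf1 * cf2

def combine_all(cfs):
    """Fold the rule over several certainty factors; 0.0 means no evidence."""
    return reduce(combine_cf, cfs, 0.0)
```

For example, two pieces of evidence with certainty 0.5 each combine to 0.5 + 0.5 - 0.25 = 0.75, and adding a facet with certainty 0 leaves the result unchanged.
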

Grouping algorithm
- Input: the final confidence matrix
- Output: groups of the search engine returned citations
- Suppose we are highly confident about grouping two citations Ci and Cj together in a set S1, and highly confident about grouping two citations Cj and Ck together in a set S2.
- S1 and S2 share one or more citations (Cj in our example), so we are confident about grouping S1 and S2 together in one group S3.
- The threshold we use for “highly confident” is 0.8, which we determined empirically.

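The merging of overlapping sets can be sketched with a union-find structure, which is a swapped-in implementation technique and not necessarily how the authors implemented it. Assumptions: citations are numbered 0 to n-1 in search-engine rank order, the final confidence matrix is given as a dictionary keyed by (i, j) with i < j, and "highly confident" is read as a value of at least 0.8.

```python
def group_citations(n, confidence, threshold=0.8):
    """Group citations 0..n-1 given pairwise confidences {(i, j): value}."""
    parent = list(range(n))

    def find(x):
        # Follow parent links to the representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (i, j), cf in confidence.items():
        if cf >= threshold:              # highly confident pair
            ri, rj = find(i), find(j)
            if ri != rj:
                parent[max(ri, rj)] = min(ri, rj)   # merge the two sets

    groups = {}
    for c in range(n):                   # rank order is kept inside each group
        groups.setdefault(find(c), []).append(c)
    return list(groups.values())         # groups ordered by their first citation
```

With confidences `{(0, 1): 0.9, (1, 3): 0.85, (2, 4): 0.3}` over five citations, citations 0, 1, and 3 end up in one group because the pairs (0, 1) and (1, 3) share citation 1, while 2 and 4 stay in groups of their own.
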
Experimental results
- To evaluate the performance of our system, we used split and merge measures.
- For example, consider eight returned citations C1, C2, C3, C4, C5, C6, C7, C8.
- Correct grouping result: Group 1: {C1, C2, C4, C6, C7}, Group 2: {C3, C8}, Group 3: {C5}
- Our system: Group 1: {C1, C2, C4}, Group 2: {C3, C6, C7}, Group 3: {C5, C8}

Experimental results (cont.)
- The average normalized score for splits for all facets is 0.004.
- The average normalized score for merges is 0.014.

Concluding remarks
- We designed and implemented a system that can automatically group the returned citations from a search engine person-name query, such that each group of citations refers to the same person.
- We gave experimental evidence to show that our approach can be successful.