Grouping search engine returned citations for person-name queries

Grouping Search-Engine Returned Citations for Person-Name Queries
Reema Al-Kamha, David W. Embley, WIDM’04
Brigham Young University
Advisor: Chia-Hui Chang
Student: Kuan-Hua Huo
Date: 2010-6-11

Outline
- Introduction
- Related work
- A multi-faceted approach
- Experimental results
- Concluding remarks

Introduction
- Citations: the returned results that are related to a specific query in a search engine.
- A citation contains the title of the web page found, text below the title that includes the keywords of the query, and the URL of the web page found.
- Example: using Google with the query “Kelly Flanagan”.

Introduction cont.
- In the output: retain the basic search-engine returned citations.
- Within each group: maintain the search engine ranking order.
- Among groups: maintain the relative order of citations as originally presented by the search engine.

Related work
- A cross-document coreference occurs when the same person, place, event, or concept is discussed in more than one text source.
- We would need to find a context, which is not straightforward for the mixture of structured, semistructured, and unstructured documents on the web.
- But we found that neither produced satisfactory results.

Related work cont.
- Object identification refers to the task of deciding that two observed objects are in fact one and the same object.
- Object identification techniques compare objects' shared attributes in order to identify matching objects.
- Our technique involves links and page similarity in addition to attributes.

Related work cont.
- One technique that is typically used to resolve object identity is probabilistic modeling.
- Probabilistic modeling:
  - compares objects based on shared attributes
  - uses appearance probability to determine the similarity between objects

A multi-faceted approach
- Each facet represents a relevant aspect of the problem space about which we can gather evidence that two citations reference the same person or different persons.
- We use three facets:
  - attributes about a person
  - links within and among sites
  - page similarity

A multi-faceted approach cont.
- Facets: Attributes, Links, Page similarity
- Confidence Matrix Construction
- Final Confidence Matrix
- Grouping Algorithm

Attributes
- Attributes appear in the web pages of citations:
  - phone number
  - email address
  - state (from a list that contains all state names and their abbreviations)
  - city
  - zip code

Attributes cont.
- We only extract state, city, and zip code values in an address context.
- To extract a city, we extract all strings that match the regular expression ([A-Z](\w)+( )?){1,3} and satisfy the context specification: the string is followed by an optional comma, white space, and a state name.
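
The city rule above can be sketched in Python as follows. This is only an illustration under assumptions: STATES is a hypothetical, abbreviated stand-in for the full list of state names and abbreviations, and the names CITY_RE and extract_city_state are invented for this example. The pattern is the slide's regular expression with non-capturing groups, followed by the optional comma, white space, and a state name.

```python
import re

# Hypothetical, abbreviated state list (assumption); the slides refer to a
# full list of state names and abbreviations that is not reproduced here.
STATES = ["Utah", "UT", "Texas", "TX", "New York", "NY"]

# Longest names first so that "Utah" is tried before "UT".
STATE_ALT = "|".join(re.escape(s) for s in sorted(STATES, key=len, reverse=True))

# One to three capitalized words, then optional comma, white space, state.
CITY_RE = re.compile(r"((?:[A-Z]\w+ ?){1,3}),?\s+(" + STATE_ALT + r")\b")


def extract_city_state(text):
    """Return (city, state) tuples found in an address-like context."""
    return [(m.group(1).strip(), m.group(2)) for m in CITY_RE.finditer(text)]


print(extract_city_state("Provo, UT 84602"))
print(extract_city_state("Salt Lake City, Utah"))
```

Because the pattern is greedy, a capitalized word that happens to precede the city (for example at the start of a sentence) would be absorbed into the city string, so a real extractor would likely need a stricter address context.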

Links
- If two URLs share a common host, we can be reasonably confident that they refer to the same person.

Links cont.
- It is common for two citations that share a popular host, such as www.yahoo.com, to refer to two different persons who have the same name.
- To find the number of pages that point to a host: the Google query link:siteURL shows all pages that point to that URL and gives their count.

Links cont.
- We determined empirically that a host h is popular for person-name queries if more than 400 pages point to h.
- If two citations c1 and c2 that are results of a person-name query share the same non-popular host, or if the URL of citation c1 has the same non-popular host as one of the URLs on the web page referenced by the other citation c2, then we can be confident about grouping c1 and c2 together for the same person.
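
A minimal sketch of the two host checks described above, assuming the inbound-page counts have already been collected. The inbound_counts dictionary and all function names are hypothetical; the slides obtain the counts from a Google link: query, which is not called here. Treating a host with no recorded count as non-popular is also an assumption.

```python
from urllib.parse import urlparse

POPULAR_THRESHOLD = 400  # from the slides: popular if more than 400 pages point to the host


def host_of(url):
    """Return the lower-cased host part of a URL."""
    return urlparse(url).netloc.lower()


def share_non_popular_host(url1, url2, inbound_counts):
    """True if both URLs have the same host and that host is not popular."""
    h1, h2 = host_of(url1), host_of(url2)
    if not h1 or h1 != h2:
        return False
    # Assumption: a host with no recorded count is treated as non-popular.
    return inbound_counts.get(h1, 0) <= POPULAR_THRESHOLD


def host2_match(url, links_on_other_page, inbound_counts):
    """True if url shares a non-popular host with any link on the other page."""
    return any(share_non_popular_host(url, link, inbound_counts)
               for link in links_on_other_page)


counts = {"www.cs.byu.edu": 120, "www.yahoo.com": 5000}  # made-up counts
print(share_non_popular_host("http://www.cs.byu.edu/a", "http://www.cs.byu.edu/b", counts))
```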

Page similarity
- The similarity between the web pages referenced by the two returned citations.
- If two web pages refer to the same person, there are specific words associated with that person.
- Example: David Embley is a professor and a co-director of the Data Extraction Research Group in the Computer Science Department at Brigham Young University.
- Example: pages about Sandra Rogers contain Lessons from the Light, a book she wrote.

Page similarity cont.
- We consider pairs of words that start with a capital letter and that are either adjacent or separated by a connector (and, or, but), by a preposition that may be followed by an article (a, an, the), or by a single capital letter followed by a dot.
- Adjacent cap-word pairs:
  Cap-Word (Connector | Preposition (Article)? | (Capital-Letter Dot))? Cap-Word
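
The pattern above could be approximated with a regular expression along these lines. This is a sketch, not the authors' extractor: the preposition list is an assumed sample (the slides do not enumerate prepositions), a cap-word is assumed to be a capital letter followed by one or more letters, and the function name is invented. A lookahead is used so that overlapping pairs such as (Brigham, Young) and (Young, University) are both found.

```python
import re

CAP = r"[A-Z][A-Za-z]+"                                # assumed cap-word shape
CONNECTOR = r"(?:and|or|but)"
PREPOSITION = r"(?:of|in|at|for|from|on|with|by|to)"   # assumed sample list
ARTICLE = r"(?:a|an|the)"
MIDDLE = rf"(?:{CONNECTOR}|{PREPOSITION}(?:\s+{ARTICLE})?|[A-Z]\.)"

# Lookahead keeps matches zero-width so consecutive pairs can overlap.
PAIR_RE = re.compile(rf"(?=\b({CAP})\s+(?:{MIDDLE}\s+)?({CAP})\b)")


def cap_word_pairs(text):
    """Return the set of adjacent cap-word pairs found in text."""
    return {(m.group(1), m.group(2)) for m in PAIR_RE.finditer(text)}


print(sorted(cap_word_pairs("Lessons from the Light by Sandra Rogers")))
```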

Page similarity cont.
- We eliminate common pairs by constructing a stop-word list, which is a list of frequently appearing adjacent cap-word pairs.
- We collected approximately 10,000 web documents taken at random from the Open Directory Project, DMOZ.
- We constructed all adjacent cap-word pairs from these documents.
- We considered all pairs with a frequency greater than two to be stop words.
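
The stop-word construction can be sketched with a frequency counter. The slides do not say whether frequency means total occurrences or the number of documents containing a pair; this sketch assumes total occurrences across the corpus, and the function and parameter names are hypothetical.

```python
from collections import Counter


def build_stop_pairs(pairs_per_document, min_frequency=2):
    """Return cap-word pairs whose total frequency exceeds min_frequency.

    pairs_per_document: an iterable with one iterable of pairs per document
    (repeats allowed). Counting total occurrences is an assumption.
    """
    counts = Counter()
    for pairs in pairs_per_document:
        counts.update(pairs)
    return {pair for pair, n in counts.items() if n > min_frequency}


corpus = [
    [("New", "York"), ("Home", "Page")],
    [("New", "York")],
    [("New", "York"), ("Home", "Page")],
]
print(build_stop_pairs(corpus))
```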

Page similarity cont.
- We use the number of shared adjacent cap-word pairs as an indicator of the similarity between two web pages.
- In particular, we consider whether two web pages share exactly one, exactly two, exactly three, or four or more adjacent cap-word pairs.
- The greater the number of shared adjacent cap-word pairs, the greater the similarity between the pages.
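
As an illustration of the bucketing described above, the sketch below maps two pages' pair sets to one of four labels. The label strings, the function name, and the choice to return None when no pair is shared are assumptions made for this example.

```python
def share_feature(pairs_a, pairs_b, stop_pairs=frozenset()):
    """Bucket the number of shared, non-stop cap-word pairs of two pages.

    Returns "Share1", "Share2", "Share3", "Share>=4", or None when the
    pages share no pair (the None case is an assumption of this sketch).
    """
    shared = (set(pairs_a) & set(pairs_b)) - set(stop_pairs)
    n = len(shared)
    if n == 0:
        return None
    return f"Share{n}" if n <= 3 else "Share>=4"


page_a = {("Data", "Extraction"), ("Brigham", "Young")}
page_b = {("Data", "Extraction"), ("Brigham", "Young"), ("New", "York")}
print(share_feature(page_a, page_b))
```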

Confidence Matrix Construction
- Construct a confidence matrix, one for each facet: attributes, links, and page similarity.
- The confidence matrix is an upper triangular matrix over all pairs of the n returned citations C1, C2, ..., Cn.
- Element Cij (i < j) in the confidence matrix is the confidence that Ci and Cj refer to the same person.
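
A small sketch of the data structure: an n-by-n list of lists in which only the cells with i < j are filled. The confidence argument is a hypothetical callback standing in for whatever a facet computes for a pair of citations; cells on or below the diagonal are left as None.

```python
from itertools import combinations


def confidence_matrix(citations, confidence):
    """Build an upper triangular matrix of pairwise confidence values.

    confidence(ci, cj) is a caller-supplied function (hypothetical here);
    cells with i >= j stay None.
    """
    n = len(citations)
    matrix = [[None] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):  # yields only i < j
        matrix[i][j] = confidence(citations[i], citations[j])
    return matrix


# Toy confidence: 0.9 if the two strings start with the same letter, else 0.1.
toy = confidence_matrix(["ab", "ac", "bc"], lambda x, y: 0.9 if x[0] == y[0] else 0.1)
print(toy)
```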

Confidence Matrix Construction cont.
- In order to compute the conditional probabilities that represent confidence values, we construct a training set.
- We entered each name as a query for Google, and we collected the first 50 returned citations for each name.
- 49 + 48 + ... + 2 + 1 = 1,225 comparison pairs per name
- Total number of comparisons: 9 * 1,225 = 11,025
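
The pair count above is the number of unordered pairs among 50 citations. The factor 9 presumably corresponds to nine query names, which the slide does not state explicitly.

```latex
\[
\sum_{k=1}^{49} k \;=\; \binom{50}{2} \;=\; \frac{50 \cdot 49}{2} \;=\; 1225,
\qquad
9 \times 1225 \;=\; 11025 .
\]
```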

Confidence Matrix Construction cont.
- Same Person: whether the names are for the same person
- Phone: whether the web pages to which the citations link contain the same phone number
- Email: whether the web pages to which the citations link contain the same email address
- Zip: whether the web pages to which the citations link contain the same address zip code
- City: whether the web pages to which the citations link contain the same address city
- State: whether the web pages to which the citations link contain the same address state
- Host1: whether the citations have URLs in the same host
- Host2: whether the URL of one citation has the same host as one of the URLs that belongs to the web page of the other citation
- Share1: whether the web pages referenced by the citations share exactly one adjacent cap-word pair
- Share2: whether the web pages referenced by the citations share exactly two adjacent cap-word pairs
- Share3: whether the web pages referenced by the citations share exactly three adjacent cap-word pairs
- Share≥4: whether the web pages referenced by the citations share four or more adjacent cap-word pairs

Confidence Matrix Construction cont.
- For the attribute facet: the web pages referenced by the citations have one of these attributes or any combination of these attributes.
- For example, we estimate P(Same Person = “Yes” | Email = “Yes”), which is the probability that two citations refer to the same person knowing that the web pages referenced by them have the same email address, by dividing the number of citation pairs that are related to the same person and have the same email address by the number of citation pairs that have the same email address in the training set.

Confidence Matrix Construction cont.
- For pairs, triples, quadruples, and quintuples of attributes, we also compute conditional probabilities.
- For example, we estimate P(Same Person = “Yes” | City = “Yes” and State = “Yes”), which is the probability that two citations refer to the same person knowing that the web pages referenced by them share the same address city and state, by dividing the number of citation pairs that are related to the same person and have the same address city and state by the number of citation pairs that share the same address city and state in the training set.
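
The ratio described above can be sketched as a count over labelled training pairs. The record layout (one dictionary of boolean features per citation pair, with a same_person label) and the function name are assumptions for illustration; the toy data is made up and is not from the presentation's training set.

```python
def conditional_confidence(training_pairs, evidence):
    """Estimate P(Same Person = Yes | every feature in evidence = Yes).

    training_pairs: list of dicts of boolean features, each with a
    "same_person" label (hypothetical layout). Returns None when no
    training pair shows the evidence.
    """
    matching = [p for p in training_pairs if all(p.get(k) for k in evidence)]
    if not matching:
        return None
    same = sum(1 for p in matching if p["same_person"])
    return same / len(matching)


# Made-up toy data, for illustration only.
toy_training = [
    {"same_person": True,  "email": True,  "city": True,  "state": True},
    {"same_person": True,  "email": True,  "city": False, "state": True},
    {"same_person": False, "email": True,  "city": True,  "state": True},
    {"same_person": False, "email": False, "city": False, "state": True},
]
print(conditional_confidence(toy_training, ["email"]))
print(conditional_confidence(toy_training, ["city", "state"]))
```

The same function covers single attributes and combinations, since a combination is simply a longer evidence list.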

Confidence Matrix Construction cont.
- For the link facet: the URLs of the citations share the same non-popular host.
- For example, we estimate P(Same Person = “Yes” | Host1 = “Yes” and Host1 is non-popular) by dividing the number of citation pairs that are related to the same person and have the same non-popular host by the number of citation pairs that share a common, non-popular host.

Confidence Matrix Construction cont.
- For the page similarity facet: the web pages referenced by the citations share exactly one, exactly two, exactly three, or four or more adjacent cap-word pairs.
- For example, we estimate P(Same Person = “Yes” | Share2 = “Yes”) by dividing the number of citation pairs that are related to the same person and share exactly two cap-word pairs by the number of citation pairs that share exactly two cap-word pairs.

Final Confidence Matrix
- Generate the final confidence matrix by combining the confidence matrices using Stanford certainty theory.
- Stanford certainty theory defines a confidence measure and generates some simple rules for combining independent evidence.

Final Confidence Matrix cont.
- CF(E1): the certainty factor associated with evidence E1 for some observation B
- CF(E2): the certainty factor associated with evidence E2 for the same observation B
- The new certainty factor CF of B (called the compound certainty factor of B) is:
  CF(E1) + CF(E2) - (CF(E1) * CF(E2))
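
A minimal sketch of the combination rule above, applied cell by cell to several facet matrices. The matrix layout (lists of lists with None where a facet has no evidence) and the rule of skipping None cells are assumptions of this example, not something the slides specify.

```python
from functools import reduce


def combine_cf(cf1, cf2):
    """Compound certainty factor: CF(E1) + CF(E2) - CF(E1) * CF(E2)."""
    return cf1 + cf2 - cf1 * cf2


def combine_cells(values):
    """Fold the available certainty factors of one cell; None if none exist."""
    present = [v for v in values if v is not None]
    return reduce(combine_cf, present) if present else None


def final_matrix(matrices):
    """Combine same-sized facet matrices cell by cell (None = no evidence)."""
    n = len(matrices[0])
    return [[combine_cells(m[i][j] for m in matrices) for j in range(n)]
            for i in range(n)]


attributes = [[None, 0.5], [None, None]]
links = [[None, 0.5], [None, None]]
similarity = [[None, None], [None, None]]
print(final_matrix([attributes, links, similarity]))
```

Because the rule is commutative and associative, the order in which the facet matrices are folded does not change the result.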
