Web search-metrics-tutorial-www2010-section-3of7-coverage

1

Web Search Engine Metrics
for Measuring User
Satisfaction
[Section 3 of 7: Coverage]
Ali Dasdan, eBay
Kostas Tsioutsiouliklis, Yahoo!
Emre Velipasaoglu, Yahoo!
With contributions from Prasad Kantamneni, Yahoo!
27 Apr 2010
(Update in Aug 2015: The authors work in different companies now.)

2

Tutorial
@
19th International
World Wide Web
Conference
http://www2010.org/
April 26-30, 2010

© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009-2010.
Disclaimers
•  This talk presents the opinions of the
authors. It does not necessarily reflect
the views of our employers.
•  This talk does not imply that these
metrics are used by our employers or,
if they are used, that they are used
in the way described in this talk.
•  The examples are just that – examples.
Please do not generalize them to the
level of comparing search engines.
3

4

Coverage Metrics
Section 3/7
of
WWW’10 Tutorial on Web Search Engine Metrics
by
A. Dasdan, K. Tsioutsiouliklis, E. Velipasaoglu

Example on coverage: URL was not
found
5

Example on coverage: But content
was found under different URLs
6

Example on coverage: URL was
also found after some time
7

Definitions for coverage
•  Coverage refers to presence of
content of interest in a catalog.
•  Coverage ratio
– defined as the ratio of the number of
documents (pages) found to the number of
documents (pages) tested
– Can be represented as a distribution when
many document attributes are considered
together
8
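
The coverage ratio defined above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the record layout (a boolean "found" flag plus attribute keys such as "lang") and the sample URLs are assumptions made for the example.

```python
from collections import defaultdict

def coverage_ratio(found_flags):
    """Ratio of documents found to documents tested."""
    flags = list(found_flags)
    if not flags:
        raise ValueError("no documents were tested")
    return sum(1 for f in flags if f) / len(flags)

def coverage_by_attribute(results, attribute):
    """Coverage ratio per value of one document attribute.

    `results` is a list of dicts with a boolean "found" key plus
    arbitrary attribute keys (a hypothetical record layout).
    """
    groups = defaultdict(list)
    for record in results:
        groups[record[attribute]].append(record["found"])
    return {value: coverage_ratio(flags) for value, flags in groups.items()}

# Made-up test outcomes for four documents.
results = [
    {"url": "a.example/1", "lang": "en", "found": True},
    {"url": "a.example/2", "lang": "en", "found": False},
    {"url": "b.example/1", "lang": "de", "found": True},
    {"url": "b.example/2", "lang": "en", "found": True},
]
overall = coverage_ratio(r["found"] for r in results)  # 3 of 4 found
by_lang = coverage_by_attribute(results, "lang")       # one ratio per language
```

Grouping by several attributes at once (for example language and site) would give the distribution mentioned in the last bullet.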

Some background:
Shingling and Jaccard Index
9

Doc = (a b c d e) (5 terms)
2-grams: (a b, b c, c d, d e)
Shingles for 2-grams (after hashing them): 10, 3, 7, 16
Min shingle: 3 (used as a signature of Doc)
Doc1 = (a b c d e)
Doc2 = (a e f g)
Doc1 ∩ Doc2 = (a e)
Doc1 ∪ Doc2 = (a b c d e f g)
Jaccard index = |Doc1 ∩ Doc2| / |Doc1 ∪ Doc2|
= 2 / 7 ≈ 29%
(Broder’s shingling method estimates this index.)
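
The example above can be reproduced with a short sketch. The hash function (MD5 truncated to 64 bits) is an arbitrary stable choice made for this illustration, so the shingle values will differ from the 10, 3, 7, 16 shown on the slide. The Jaccard index is computed over term sets, as in the slide's example.

```python
import hashlib

def ngrams(terms, n=2):
    """All runs of n consecutive terms, in document order."""
    return [tuple(terms[i:i + n]) for i in range(len(terms) - n + 1)]

def shingle(gram):
    """Map an n-gram to an integer (MD5 is an assumed, arbitrary hash)."""
    digest = hashlib.md5(" ".join(gram).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def min_shingle(terms, n=2):
    """Smallest shingle of the document, used as a one-value signature.

    Raises ValueError if the document has fewer than n terms.
    """
    return min(shingle(g) for g in ngrams(terms, n))

def jaccard(a, b):
    """|A ∩ B| / |A ∪ B| for two collections treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

doc1 = "a b c d e".split()
doc2 = "a e f g".split()

grams = ngrams(doc1)            # (a b), (b c), (c d), (d e)
signature = min_shingle(doc1)   # stable across runs for the same document
similarity = jaccard(doc1, doc2)
print(round(similarity, 3))     # 2 / 7, printed as 0.286
```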

How to measure coverage
•  Given an input document with its URL
•  Query by URL (QBU)
–  enter URL at the target search engine’s query interface
–  if the URL is not found, then iterate using “normalized” forms of
the same URL
•  Query by content (QBC)
–  if URL is not given or URL search has failed, then perform this
search
–  generate a set of queries (called strong queries) from the
document
–  submit the queries to the target search engine’s query interface
–  combine the returned results
–  perform a more thorough similarity check between the returned
documents and the input document
•  Compute coverage ratio over multiple documents
10
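
The procedure above can be sketched as follows. Everything about the search engine is passed in as a function, because the slides do not specify an API: `search_by_url`, `search_by_content`, `make_queries`, `is_similar`, and `url_variants` are hypothetical names, and the in-memory `index` is a toy stand-in for a real engine.

```python
def is_covered(doc, search_by_url, search_by_content, make_queries,
               is_similar, url_variants=lambda url: [url]):
    """Return "QBU", "QBC", or None for one input document."""
    url = doc.get("url")
    if url:
        # Query by URL, iterating over normalized forms of the URL.
        for candidate in url_variants(url):
            if search_by_url(candidate):
                return "QBU"
    # Query by content: submit strong queries and combine the results.
    combined = {}
    for query in make_queries(doc["text"]):
        for hit in search_by_content(query):
            combined.setdefault(hit["url"], hit)
    # More thorough similarity check on the combined results.
    for hit in combined.values():
        if is_similar(doc["text"], hit["text"]):
            return "QBC"
    return None

def measure_coverage(docs, **engine):
    """Coverage ratio over multiple documents."""
    outcomes = [is_covered(d, **engine) for d in docs]
    return sum(o is not None for o in outcomes) / len(outcomes)

# Toy engine: two indexed pages, one of them a copy under another URL.
index = {
    "example.com/a": "alpha beta gamma delta",
    "mirror.example/x": "red green blue cyan",
}
docs = [
    {"url": "example.com/a", "text": "alpha beta gamma delta"},
    {"url": "example.com/b", "text": "red green blue cyan"},
    {"url": "example.com/c", "text": "one two three four"},
]
engine = dict(
    search_by_url=lambda url: url in index,
    search_by_content=lambda q: [{"url": u, "text": t}
                                 for u, t in index.items() if q in t],
    make_queries=lambda text: [" ".join(text.split()[:2])],
    is_similar=lambda a, b: a == b,  # placeholder for a shingle-based check
)
ratio = measure_coverage(docs, **engine)  # 2 of 3 documents are covered
```

Returning which step succeeded ("QBU" or "QBC") keeps the two cases from the earlier examples apart: URL found versus content found under a different URL.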

Query-by-Content flowchart
11

[Flowchart: terms from page form string signatures → strings are combined into queries → search results are extracted → similarity check using shingles]

Query by content:
How to generate queries
•  Select sequences of terms by frequency
–  terms with the lowest frequency or highest TF-IDF
•  Select sequences of terms by position
–  +/- two terms at every 5th term
•  Select sequences of terms randomly
–  find a sequence of consecutive terms randomly in the
document
•  Select sequences of terms randomly via shingles
–  find the document’s shingles signature
–  find the corresponding sequences of terms
–  This method produces the same query signature for the
same document, unlike the purely random selection above.
–  This method also outperforms purely random selection.
12
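
Three of the selection strategies above can be sketched as follows. This is an illustration under assumptions, not the published method: the window length `n`, the number of queries `k`, and the SHA-1 based hash are all choices made for the example.

```python
import hashlib
import random

def windows(terms, n):
    """All sequences of n consecutive terms."""
    return [terms[i:i + n] for i in range(len(terms) - n + 1)]

def stable_hash(seq):
    """Shingle value of a term sequence (SHA-1 is an assumed choice)."""
    digest = hashlib.sha1(" ".join(seq).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def queries_by_position(terms, step=5, radius=2):
    """Take `radius` terms on each side of every `step`-th term."""
    out = []
    for center in range(step - 1, len(terms), step):
        start = max(0, center - radius)
        out.append(" ".join(terms[start:center + radius + 1]))
    return out

def queries_random(terms, n=4, k=3, rng=random):
    """Pick k windows at random; the result changes from run to run."""
    ws = windows(terms, n)
    return [" ".join(w) for w in rng.sample(ws, min(k, len(ws)))]

def queries_by_shingles(terms, n=4, k=3):
    """Pick the k windows with the smallest shingles.

    The hash makes the choice look random, yet the same document
    always yields the same queries.
    """
    ranked = sorted(windows(terms, n), key=stable_hash)
    return [" ".join(w) for w in ranked[:k]]

text = "the quick brown fox jumps over the lazy dog near the old river bank"
terms = text.split()
positional = queries_by_position(terms)
by_shingle = queries_by_shingles(terms)
sampled = queries_random(terms, rng=random.Random(0))
```

For this sample text, `queries_by_position` returns the windows around "jumps" (the 5th term) and "near" (the 10th term).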

Further issues to consider
•  URL normalization
–  Example:
•  wikipedia.org/wiki/Casino_Royale or wikipedia.org/?title=Casino_Royale
–  see Dasgupta, Kumar, and Sasturkar (2008)
•  Page templates and ads
–  or how to avoid undesired matches
•  Search for non-textual content
–  images, mathematical formulas, tables and other similar
structures
•  Definition of content similarity
•  Syntactic vs. semantic match
•  How to balance coverage against other
objectives
–  E.g., what if a page is found at position 1 vs. position 10?
13
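
The URL normalization issue above can be illustrated with a small sketch. The generic rules (force the http scheme, lowercase the host, drop a leading "www.", a trailing slash, and the fragment) are assumptions made for this example, and `wiki_title_rule` is a hand-written, hypothetical rewrite rule for the Casino_Royale URLs; Dasgupta, Kumar, and Sasturkar (2008) learn such rules automatically instead.

```python
from urllib.parse import parse_qs, urlsplit, urlunsplit

def normalize(url):
    """Generic, site-independent canonical form (an assumed rule set)."""
    if "://" not in url:
        url = "http://" + url
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/") or "/"
    # Scheme is forced to http and the fragment is dropped.
    return urlunsplit(("http", host, path, parts.query, ""))

def wiki_title_rule(url):
    """Hypothetical site-specific rule: /?title=X becomes /wiki/X."""
    parts = urlsplit(url)
    title = parse_qs(parts.query).get("title")
    if parts.path in ("", "/") and title:
        return urlunsplit((parts.scheme, parts.netloc,
                           "/wiki/" + title[0], "", ""))
    return url

def variants(url, rules=(wiki_title_rule,)):
    """Normalized forms to try, in order, when querying by URL."""
    base = normalize(url)
    out = [base]
    for rule in rules:
        rewritten = rule(base)
        if rewritten not in out:
            out.append(rewritten)
    return out

forms = variants("wikipedia.org/?title=Casino_Royale")
canonical = normalize("wikipedia.org/wiki/Casino_Royale")
```

With this rule, both URL forms from the example end up with a common variant, so a query-by-URL loop over `variants(url)` would try both.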

Key problems
•  Measure web growth in general and
along any dimension
•  Compare search engines
automatically and reliably
•  Improve content-based search,
including semantic-similarity search
•  Improve copy detection methods for
quality and performance, including
URL based copy detection
14

Reference review on coverage
metrics
•  Luhn (1957)
–  summarizes an input document by selecting terms or sentences
by frequency
–  Bharat and Broder (1998) discovered the same method
independently for a different purpose.
–  Aside: Luhn also invented hashing with chaining.
•  Bar-Yossef and Gurevich (2008)
–  introduces improved methods to randomly sample pages from a
search engine’s index using its public query interface, a
problem introduced by Bharat and Broder (1998)
15

Reference review on coverage
metrics
•  Dasdan et al. (2009), Pereira and Ziviani (2004)
–  represents an input document by selecting (sequences of)
terms randomly or by frequency
–  uses the term-based document signature as queries (called
strong queries) for similarity search
–  Yang et al. (2009) proposes similar methods for blog search.
–  Dasdan et al. (2009) proposes random sequence selection via
shingling.
•  Olston and Najork (2010)
–  gives a detailed survey of web crawling
–  discusses how to optimize for both coverage and freshness in a
web crawler
16

References
•  Z. Bar-Yossef and M. Gurevich (2008), Random sampling from a
search engine’s index, J. ACM, 55(5).
•  K. Bharat and A. Broder (1998), A technique for measuring the relative
size and overlap of public Web search engines, WWW’98.
•  S. Brin, J. Davis, and H. Garcia-Molina (1995), Copy detection
mechanisms for digital documents, SIGMOD’95.
•  A. Dasdan, P. D’Alberto, C. Drome, and S. Kolay (2009), Automating
retrieval for similar content using search engine query interface,
CIKM’09.
•  A. Dasgupta, R. Kumar, and A. Sasturkar (2008), De-duping URLs via
Rewrite Rules, KDD’08.
–  Also Koppula et al. (2010), Learning URL patterns for webpage de-duplication, WSDM’10.
•  H. P. Luhn (1957), A statistical approach to mechanized encoding and
searching of literary information, IBM J. Research and Dev.,
1(4):309–317.
•  H. P. Luhn (1958), The automatic creation of literature abstracts, IBM
J. Research and Dev., 2(2).
•  C. Olston and M. Najork (2010), Web crawling, Chapter in Foundations
and Trends in Information Retrieval, 4(3):175–246.
•  A.R. Pereira Jr. and N. Ziviani (2004), Retrieving similar documents
from the Web, J. Web Engineering, 2(4):247–261.
•  Y. Yang, N. Bansal, W. Dakka, P. Ipeirotis, N. Koudas, and D. Papadias
(2009), Query by document, WSDM’09.
17
