Web Search Engine Metrics
for Measuring User
Satisfaction
[Section 3 of 7: Coverage]
Ali Dasdan, eBay
Kostas Tsioutsiouliklis, Yahoo!
Emre Velipasaoglu, Yahoo!
With contributions from Prasad Kantamneni, Yahoo!
27 Apr 2010
(Update in Aug 2015: The authors work in different companies now.)
Tutorial
@
19th International
World Wide Web
Conference
http://www2010.org/
April 26-30, 2010
© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009-2010.
Disclaimers
•  This talk presents the opinions of the
authors. It does not necessarily reflect
the views of our employers.
•  This talk does not imply that these
metrics are used by our employers; if
they are used, they may not be
used in the way described in this talk.
•  The examples are just that – examples.
Please do not generalize them to the
level of comparing search engines.
Coverage Metrics
Section 3/7
of
WWW’10 Tutorial on Web Search Engine Metrics
by
A. Dasdan, K. Tsioutsiouliklis, E. Velipasaoglu
Example on coverage: URL was not found
Example on coverage: But content was found under different URLs
Example on coverage: URL was also found after some time
Definitions for coverage
•  Coverage refers to the presence of
content of interest in a catalog.
•  Coverage ratio
– is defined as the ratio of the number of
documents (pages) found to the number of
documents (pages) tested
– can be represented as a distribution when
many document attributes are considered
together
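A minimal sketch of the coverage ratio defined above, assuming each tested document has already been marked as found or not found. The function names, the record layout, and the `language` attribute are illustrative assumptions and do not come from the tutorial.

```python
from collections import defaultdict


def coverage_ratio(found_flags):
    """Ratio of documents found to documents tested."""
    flags = list(found_flags)
    if not flags:
        raise ValueError("no documents were tested")
    return sum(flags) / len(flags)


def coverage_by_attribute(records, attribute):
    """Coverage ratio per attribute value, i.e. a simple distribution.

    Each record is a dict with a boolean 'found' key plus attributes.
    """
    groups = defaultdict(list)
    for record in records:
        groups[record[attribute]].append(record["found"])
    return {value: coverage_ratio(flags) for value, flags in groups.items()}


# Hypothetical test results for four pages.
records = [
    {"language": "en", "found": True},
    {"language": "en", "found": False},
    {"language": "de", "found": True},
    {"language": "de", "found": True},
]

overall = coverage_ratio(r["found"] for r in records)
per_language = coverage_by_attribute(records, "language")
print(overall)       # 0.75
print(per_language)  # {'en': 0.5, 'de': 1.0}
```

Grouping by several attributes at once (for example language and page type) would give the multi-attribute distribution mentioned in the definition.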
Some background:
Shingling and Jaccard Index
Doc = (a b c d e) (5 terms)
2-grams: (a b, b c, c d, d e)
Shingles for 2-grams (after hashing them): 10, 3, 7, 16
Min shingle: 3 (used as a signature of Doc)
Doc1 = (a b c d e)
Doc2 = (a e f g)
Doc1 ∩ Doc2 = (a e)
Doc1 ∪ Doc2 = (a b c d e f g)
Jaccard index = |Doc1 ∩ Doc2| / |Doc1 ∪ Doc2|
= 2 / 7 ≈ 29%
(Broder’s shingling method estimates this index.)
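The slide's example can be reproduced with a short sketch. It assumes documents are whitespace-separated term lists and uses MD5 as an arbitrary hash, so the shingle values will differ from the 10, 3, 7, 16 shown on the slide. The function names are illustrative, and this computes the exact Jaccard index rather than Broder's estimate.

```python
import hashlib


def ngrams(terms, n=2):
    """All runs of n consecutive terms."""
    return [tuple(terms[i:i + n]) for i in range(len(terms) - n + 1)]


def shingle(gram):
    """Hash an n-gram to an integer (MD5 is an arbitrary choice here)."""
    digest = hashlib.md5(" ".join(gram).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def min_shingle(terms, n=2):
    """Smallest shingle, usable as a one-value signature of the document."""
    return min(shingle(gram) for gram in ngrams(terms, n))


def jaccard(a, b):
    """|A ∩ B| / |A ∪ B| over two collections treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


doc1 = "a b c d e".split()
doc2 = "a e f g".split()

print(ngrams(doc1))                    # [('a', 'b'), ('b', 'c'), ('c', 'd'), ('d', 'e')]
print(round(jaccard(doc1, doc2), 3))   # 0.286
print(min_shingle(doc1))               # value depends on the hash function
```

Passing `ngrams(doc1)` and `ngrams(doc2)` to `jaccard` compares the documents on 2-grams instead of single terms, which is stricter about word order.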
How to measure coverage
•  Given an input document with its URL
•  Query by URL (QBU)
–  enter URL at the target search engine’s query interface
–  if the URL is not found, then iterate using “normalized” forms of
the same URL
•  Query by content (QBC)
–  if URL is not given or URL search has failed, then perform this
search
–  generate a set of queries (called strong queries) from the
document
–  submit the queries to the target search engine’s query interface
–  combine the returned results
–  perform a more thorough similarity check between the returned
documents and the input document
•  Compute coverage ratio over multiple documents
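The procedure above can be sketched as follows. This is only an outline under stated assumptions: `search` stands for a hypothetical wrapper around a search engine's query interface that returns `(url, text)` pairs, and `normalize`, `make_queries`, and `similar` are placeholder callables supplied by the caller. The toy index at the end exists only to make the sketch runnable.

```python
def is_covered(doc_url, doc_text, search, normalize, make_queries, similar):
    """Return True if the document is found by URL or by content."""
    # Query by URL (QBU): try the URL as given, then its normalized forms.
    if doc_url:
        for candidate in [doc_url, *normalize(doc_url)]:
            if any(url == candidate for url, _ in search(candidate)):
                return True
    # Query by content (QBC): submit strong queries and combine the results.
    combined = {}
    for query in make_queries(doc_text):
        for url, text in search(query):
            combined.setdefault(url, text)
    # A more thorough similarity check on the returned documents.
    return any(similar(doc_text, text) for text in combined.values())


def measure_coverage(docs, **helpers):
    """Coverage ratio over (url, text) pairs."""
    docs = list(docs)
    if not docs:
        raise ValueError("no documents were tested")
    found = [is_covered(url, text, **helpers) for url, text in docs]
    return sum(found) / len(found)


# Toy stand-in for a search engine: matches an exact URL or a text substring.
index = {
    "http://example.com/a": "alpha beta gamma delta",
    "http://example.com/copy-of-b": "one two three four",
}


def toy_search(query):
    return [(u, t) for u, t in index.items() if query == u or query in t]


docs = [
    ("http://example.com/a", "alpha beta gamma delta"),   # found by URL
    ("http://example.com/b", "one two three four"),       # found by content
    ("http://example.com/c", "never indexed text here"),  # not found
]

ratio = measure_coverage(
    docs,
    search=toy_search,
    normalize=lambda url: [url.rstrip("/")],
    make_queries=lambda text: [" ".join(text.split()[:3])],
    similar=lambda a, b: a == b,
)
print(ratio)  # 2 of the 3 documents are found
```

In practice the equality test used for `similar` would be replaced by a shingle-based comparison with a threshold, since copies of a page rarely match byte for byte.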
Query-by-Content flowchart
[Flowchart: terms from page → string signature → strings combined into queries → search results extraction → similarity check using shingles]
Query by content:
How to generate queries
•  Select sequences of terms by frequency
–  terms with the lowest frequency or highest TF-IDF
•  Select sequences of terms by position
–  +/- two terms at every 5th term
•  Select sequences of terms randomly
–  find a sequence of consecutive terms randomly in the
document
•  Select sequences of terms randomly via shingles
–  find the document’s shingles signature
–  find the corresponding sequences of terms
–  Unlike purely random selection, this method produces the same
query signature for the same document.
–  This method also outperforms purely random selection.
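Three of the selection strategies above can be sketched as follows. The window length, the number of queries, the use of MD5, and all function names are assumptions made for illustration; the tutorial does not specify them. The shingle-based variant here simply keeps the windows with the smallest hash values.

```python
import hashlib
import random


def term_windows(terms, n):
    """All runs of n consecutive terms."""
    return [terms[i:i + n] for i in range(len(terms) - n + 1)]


def queries_by_position(terms, step=5, radius=2):
    """Take the terms within +/- radius of every step-th term."""
    queries = []
    for center in range(step - 1, len(terms), step):
        start = max(0, center - radius)
        queries.append(" ".join(terms[start:center + radius + 1]))
    return queries


def queries_random(terms, n=4, k=3, rng=random):
    """Pick k windows of n consecutive terms at random positions."""
    windows = term_windows(terms, n)
    return [" ".join(w) for w in rng.sample(windows, min(k, len(windows)))]


def queries_via_shingles(terms, n=4, k=3):
    """Pick the k windows with the smallest hash values (shingles).

    The choice looks random but is repeatable for the same document.
    """
    def shingle(window):
        return hashlib.md5(" ".join(window).encode("utf-8")).hexdigest()

    windows = sorted(term_windows(terms, n), key=shingle)
    return [" ".join(w) for w in windows[:k]]


terms = [f"t{i}" for i in range(1, 13)]  # t1 .. t12

print(queries_by_position(terms))   # ['t3 t4 t5 t6 t7', 't8 t9 t10 t11 t12']
print(queries_random(terms))        # differs from run to run
print(queries_via_shingles(terms))  # identical on every run
```

Because `queries_via_shingles` depends only on the document's terms, two measurements of the same page issue the same queries, which makes results comparable over time.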
Further issues to consider
•  URL normalization
–  Example:
•  wikipedia.org/wiki/Casino_Royale or wikipedia.org/?title=Casino_Royale
–  see Dasgupta, Kumar, and Sasturkar (2008)
•  Page templates and ads
–  or how to avoid undesired matches
•  Search for non-textual content
–  images, mathematical formulas, tables and other similar
structures
•  Definition of content similarity
•  Syntactic vs. semantic match
•  How to balance coverage against other
objectives
–  E.g., what if a page is found at position 1 vs. position 10?
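For the URL normalization issue listed first above, a minimal sketch using Python's standard `urllib.parse` is shown here. The generic steps (lowercasing the host, dropping default ports and fragments, sorting query parameters) are common syntactic normalizations. The Wikipedia rewrite is a hypothetical hand-written rule that covers only the slide's example; it is not the learned rewrite rules of Dasgupta, Kumar, and Sasturkar (2008).

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url):
    """Return one canonical form of a URL (illustrative rules only)."""
    # Assume http when the scheme is missing, as in the slide's example.
    parts = urlsplit(url if "://" in url else "http://" + url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    # Keep the port only when it is not the default for the scheme.
    netloc = host if port in (None, DEFAULT_PORTS.get(scheme)) else f"{host}:{port}"
    path = parts.path.rstrip("/") or "/"
    query_pairs = sorted(parse_qsl(parts.query, keep_blank_values=True))

    # Hypothetical site-specific rule: /?title=X is the same page as /wiki/X.
    if (host.endswith("wikipedia.org") and path == "/"
            and len(query_pairs) == 1 and query_pairs[0][0] == "title"):
        path, query_pairs = "/wiki/" + query_pairs[0][1], []

    # The fragment is dropped because it does not identify a different page.
    return urlunsplit((scheme, netloc, path, urlencode(query_pairs), ""))


a = normalize_url("wikipedia.org/wiki/Casino_Royale")
b = normalize_url("wikipedia.org/?title=Casino_Royale")
print(a)       # http://wikipedia.org/wiki/Casino_Royale
print(a == b)  # True
print(normalize_url("HTTP://Example.com:80/a/?b=2&a=1#frag"))
# http://example.com/a?a=1&b=2
```

During query by URL, each normalized form can be tried in turn when the original URL is not found.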
Key problems
•  Measure web growth in general and
along any dimension
•  Compare search engines
automatically and reliably
•  Improve content-based search,
including semantic-similarity search
•  Improve copy detection methods for
quality and performance, including
URL based copy detection
Reference review on coverage
metrics
•  Luhn (1957)
–  summarizes an input document by selecting terms or sentences
by frequency
–  Bharat and Broder (1998) discovered the same method
independently for a different purpose.
–  Aside: Luhn also invented hashing with chaining.
•  Bar-Yossef and Gurevich (2008)
–  introduce improved methods to randomly sample pages from a
search engine’s index using its public query interface, a
problem introduced by Bharat and Broder (1998)
Reference review on coverage
metrics
•  Dasdan et al. (2009), Pereira and Ziviani (2004)
–  represent an input document by selecting (sequences of)
terms randomly or by frequency
–  use the term-based document signature as queries (called
strong queries) for similarity search
–  Yang et al. (2009) propose similar methods for blog search.
–  Dasdan et al. (2009) propose random sequence selection via
shingling.
•  Olston and Najork (2010)
–  give a detailed survey of web crawling
–  discuss how to optimize for both coverage and freshness in a
web crawler
References
•  Z. Bar-Yossef and M. Gurevich (2008), Random sampling from a
search engine’s index, J. ACM, 55(5).
•  K. Bharat and A. Broder (1998), A technique for measuring the relative
size and overlap of public Web search engines, WWW’98.
•  S. Brin, J. Davis, and H. Garcia-Molina (1995), Copy detection
mechanisms for digital documents, SIGMOD’95.
•  A. Dasdan, P. D’Alberto, C. Drome, and S. Kolay (2009), Automating
retrieval for similar content using search engine query interface,
CIKM’09.
•  A. Dasgupta, R. Kumar, and A. Sasturkar (2008), De-duping URLs via
Rewrite Rules, KDD’08.
–  Also Koppula et al. (2009), Learning URL patterns from web page de-duplication, WSDM’10.
•  H. P. Luhn (1957), A statistical approach to mechanized encoding and
searching of literary information, IBM J. Research and Dev., 1(4):309–317.
•  H. P. Luhn (1958), The automatic creation of literature abstracts, IBM
J. Research and Dev., 2(2).
•  C. Olston and M. Najork (2010), Web crawling, Chapter in Foundations
and Trends in Information Retrieval, 4(3):175–246.
•  A. R. Pereira Jr. and N. Ziviani (2004), Retrieving similar documents
from the Web, J. Web Engineering, 2(4):247–261.
•  Y. Yang, N. Bansal, W. Dakka, P. Ipeirotis, N. Koudas, and D. Papadias
(2009), Query by document, WSDM’09.