SIMS 202
Information Organization and Retrieval

Prof. Marti Hearst and Prof. Ray Larson
UC Berkeley SIMS
Tues/Thurs 9:30-11:00am
Fall 2000



Last Time
Web Search
– Directories vs. Search engines
– How web search differs from other search
   » Type of data searched over
   » Type of searches done
   » Type of searchers doing search
– Web queries are short
   » This probably means people are often using search
     engines to find starting points
   » Once at a useful site, they must follow links or use
     site search
– Web search ranking combines many features
What about Ranking?
Lots of variation here
– Pretty messy in many cases
– Details usually proprietary and fluctuating
Combining subsets of:
–   Term frequencies
–   Term proximities
–   Term position (title, top of page, etc)
–   Term characteristics (boldface, capitalized, etc)
–   Link analysis information
–   Category information
–   Popularity information
Most use a variant of vector space ranking to
combine these
Here’s how it might work:
– Make a vector of weights for each feature
– Multiply this by the counts for each feature
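The weight-vector idea above can be sketched in a few lines of Python. The feature names and weight values here are invented for illustration; real engines tune such weights and keep them proprietary:

```python
# Hypothetical feature weights -- real engines tune these and keep them secret.
WEIGHTS = {
    "term_frequency": 1.0,
    "term_in_title": 5.0,
    "term_boldface": 2.0,
    "inlink_count": 3.0,
}

def score(feature_counts):
    """Dot product of the weight vector with a page's feature counts."""
    return sum(WEIGHTS[f] * feature_counts.get(f, 0) for f in WEIGHTS)

# A page where the term appears 4 times, once in the title, with 2 inlinks:
page = {"term_frequency": 4, "term_in_title": 1, "inlink_count": 2}
# score(page) = 1.0*4 + 5.0*1 + 3.0*2 = 15.0
```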
From description of the NorthernLight search engine, by Mark Krellenstein
http://www.infonortics.com/searchengines/sh00/krellenstein_files/frame.htm
High-Precision Ranking

Proximity search can help get high-precision
results if > 1 term
– Hearst ’96 paper:
  » Combine Boolean and passage-level proximity
  » Shows significant improvements when
    retrieving the top 5, 10, 20, 30 documents
  » Results reproduced by Mitra et al. 98
  » Google uses something similar
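A minimal illustration of combining a Boolean AND with passage-level proximity: all query terms must appear, and they must co-occur inside a small window of tokens. The window size and whitespace tokenization are simplifying assumptions, not the method of the Hearst '96 paper:

```python
def within_window(doc_tokens, terms, window=10):
    """True if every query term appears, and all terms co-occur in a window."""
    positions = {t: [i for i, tok in enumerate(doc_tokens) if tok == t]
                 for t in terms}
    if any(not p for p in positions.values()):
        return False  # Boolean AND: every term must appear somewhere
    # Check each occurrence of the first term as a candidate passage anchor.
    for start in positions[terms[0]]:
        if all(any(abs(p - start) < window for p in positions[t])
               for t in terms):
            return True
    return False

doc = "cheap used cars for sale in berkeley".split()
within_window(doc, ["used", "cars"])  # terms are adjacent -> True
```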
Boolean Formulations, Hearst 96



Results
Spam

Email Spam:
– Undesired content
Web Spam:
– Content is disguised as something it is
  not, in order to
  » Be retrieved more often than it otherwise
    would
  » Be retrieved in contexts that it otherwise
    would not be retrieved in
Web Spam
What are the types of Web spam?
– Add extra terms to get a higher ranking
   » Repeat “cars” thousands of times
– Add irrelevant terms to get more hits
   » Put a dictionary in the comments field
   » Put extra terms in the same color as the background
     of the web page
– Add irrelevant terms to get different types of
  hits
   » Put “sex” in the title field in sites that are selling
     cars
– Add irrelevant links to boost your link analysis
  ranking
There is a constant “arms race” between
web search companies and spammers
Commercial Issues
General internet search is often
commercially driven
– Commercial sector sometimes hides things –
  harder to track than research
– On the other hand, most CTOs for search
  engine companies used to be researchers, and
  so help us out
– Commercial search engine information changes
  monthly
– Sometimes motivations are commercial rather
  than technical
   » Goto.com uses payments to determine ranking order
   » iwon.com gives out prizes
Web Search Architecture

Preprocessing
– Collection gathering phase
  » Web crawling
– Collection indexing phase
Online
– Query servers
– This part not talked about in the
  readings
From description of the FAST search engine, by Knut Risvik
http://www.infonortics.com/searchengines/sh00/risvik_files/frame.htm
Standard Web Search Engine Architecture
[diagram: crawl the web → check for duplicates, store the documents → DocIds → create an inverted index; user query → search engine servers consult the inverted index → show results to user]
More detailed
architecture,
from Brin & Page
98.

Only covers the
preprocessing in
detail, not the
query serving.
Inverted Indexes for Web Search Engines

Inverted indexes are still used, even
though the web is so huge
Some systems partition the indexes across
different machines; each machine handles
different parts of the data
Other systems duplicate the data across
many machines; queries are distributed
among the machines
Most do a combination of these
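The combined scheme (partition across columns, replicate across rows) can be sketched as query routing over a grid of servers. The grid dimensions and server names are invented for illustration:

```python
import random

# 3 columns (partitions of the document set) x 2 rows (replicas of each
# partition). Hypothetical layout, not any real engine's configuration.
ROWS, COLS = 2, 3
GRID = [[f"server-r{r}c{c}" for c in range(COLS)] for r in range(ROWS)]

def route_query(query):
    """Pick one replica row, then ask every partition in that row."""
    row = random.randrange(ROWS)       # load balancing across replicas
    servers = GRID[row]                # every partition must be consulted
    return [f"{s}:results({query})" for s in servers]

hits = route_query("berkeley")  # one partial result list per partition
```

Adding a row multiplies query capacity; adding a column grows the number of pages that can be indexed, which mirrors the FAST description above.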
In this example, the data
for the pages is
partitioned across
machines. Additionally,
each partition is allocated
multiple machines to
handle the queries.

Each row can handle 120
queries per second

Each column can handle
7M pages

To handle more queries,
add another row.




From description of the FAST search engine, by Knut Risvik
http://www.infonortics.com/searchengines/sh00/risvik_files/frame.htm
Cascading Allocation of CPUs
A variation on this that produces a
cost-savings:
– Put high-quality/common pages on many
  machines
– Put lower quality/less common pages on
  fewer machines
– Query goes to high quality machines
  first
– If no hits found there, go to other
  machines
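A sketch of the cascading fallback; the tier contents here are invented for illustration:

```python
# High-quality/common pages live in a large primary tier; the long tail of
# lower-quality pages in a smaller fallback tier. (Hypothetical contents.)
TIERS = [
    {"name": "primary", "index": {"cars": ["page1", "page2"]}},
    {"name": "fallback", "index": {"obscure": ["page9"]}},
]

def cascading_search(term):
    """Try each tier in order; stop at the first tier that returns hits."""
    for tier in TIERS:
        hits = tier["index"].get(term, [])
        if hits:
            return tier["name"], hits
    return None, []
```

Most queries are answered entirely by the well-provisioned primary tier, so the fallback machines can be fewer and cheaper.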
Web Crawlers

How do the web search engines get all
of the items they index?
Main idea:
–   Start with known sites
–   Record information for these sites
–   Follow the links from each site
–   Record information found at new sites
–   Repeat
Web Crawlers
How do the web search engines get all of
the items they index?
More precisely:
– Put a set of known sites on a queue
– Repeat the following until the queue is empty:
   » Take the first page off of the queue
   » If this page has not yet been processed:
        Record the information found on this page
          – Positions of words, links going out, etc
        Add each link on the current page to the queue
        Record that this page has been processed
In what order should the links be followed?
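The queue-based procedure above, sketched over a toy link graph (the graph itself is invented). Here the "already processed" check is done at enqueue time rather than dequeue time, which is equivalent for this purpose:

```python
from collections import deque

# Toy link graph: page -> outgoing links.
LINKS = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["d"],
    "d": [],
}

def crawl(seeds):
    """Breadth-first crawl: record each page, enqueue its unseen links."""
    queue = deque(seeds)
    seen = set(seeds)
    processed = []
    while queue:
        page = queue.popleft()            # FIFO queue => breadth-first order
        processed.append(page)            # "record the information found"
        for link in LINKS.get(page, []):
            if link not in seen:          # skip pages already queued/processed
                seen.add(link)
                queue.append(link)
    return processed
```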
Page Visit Order
Animated examples of breadth-first vs depth-first search on trees:
http://www.rci.rutgers.edu/~cfs/472_html/AI_SEARCH/ExhaustiveSearch.html




                       Structure to be traversed
Page Visit Order
Animated examples of breadth-first vs depth-first search on trees:
http://www.rci.rutgers.edu/~cfs/472_html/AI_SEARCH/ExhaustiveSearch.html




                                       Breadth-first search
                                       (must be in presentation mode to see this animation)
Page Visit Order
Animated examples of breadth-first vs depth-first search on trees:
http://www.rci.rutgers.edu/~cfs/472_html/AI_SEARCH/ExhaustiveSearch.html




Depth-first search
(must be in presentation mode to see this animation)
Page Visit Order
Animated examples of breadth-first vs depth-first search on trees:
http://www.rci.rutgers.edu/~cfs/472_html/AI_SEARCH/ExhaustiveSearch.html
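The two visit orders differ only in the data structure holding unvisited nodes: a FIFO queue gives breadth-first, a LIFO stack gives depth-first. A sketch on an invented tree:

```python
from collections import deque

# Toy tree: node -> children.
TREE = {"root": ["a", "b"], "a": ["a1", "a2"], "b": ["b1"],
        "a1": [], "a2": [], "b1": []}

def visit(start, breadth_first=True):
    """Traverse TREE from start; the frontier discipline sets the order."""
    frontier = deque([start])
    order = []
    while frontier:
        # popleft (FIFO queue) => breadth-first; pop (LIFO stack) => depth-first
        node = frontier.popleft() if breadth_first else frontier.pop()
        order.append(node)
        frontier.extend(TREE[node])
    return order
```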
Depth-First Crawling
(more complex – graphs & sites)
[diagram: a link graph spanning Sites 1, 2, 3, 5, and 6; the depth-first visit order is listed as (site, page) pairs: 1-1, 1-2, 1-4, 1-6, 1-3, 1-5, 3-1, 5-1, …]
Breadth-First Crawling
(more complex – graphs & sites)
[diagram: the same link graph; the breadth-first visit order is listed as (site, page) pairs: 1-1, 2-1, 1-2, 1-6, 1-3, 2-2, 2-3, 1-4, 3-1, 1-5, 5-1, 5-2, 6-1]
Web Crawling Issues
Keep out signs
– A file called robots.txt tells the crawler which
  directories are off limits
Freshness
– Figure out which pages change often
– Recrawl these often
Duplicates, virtual hosts, etc
– Convert page contents with a hash function
– Compare new pages to the hash table
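Hash-based duplicate detection can be sketched with Python's hashlib. The whitespace/case normalization here is a deliberate simplification; production crawlers use more robust fingerprinting:

```python
import hashlib

seen_hashes = set()

def is_duplicate(page_text):
    """Hash the (lightly normalized) page contents; an identical hash
    means we have already stored this page under some other URL."""
    digest = hashlib.sha1(page_text.strip().lower().encode("utf-8")).hexdigest()
    if digest in seen_hashes:
        return True
    seen_hashes.add(digest)
    return False
```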
Lots of problems
–   Server unavailable
–   Incorrect HTML
–   Missing links
–   Infinite loops
Web crawling is difficult to do robustly!
Cha-Cha

Cha-cha searches an intranet
– Sites associated with an organization
Instead of hand-edited categories
– Computes shortest path from the root
  for each hit
– Organizes search results according to
  which subdomain the pages are found in
Cha-Cha Web Crawling Algorithm
Start with a list of servers to crawl
– for UCB, simply start with www.berkeley.edu
Restrict crawl to certain domain(s)
– *.berkeley.edu
Obey No Robots standard
Follow hyperlinks only
– do not read local filesystems
   » links are placed on a queue
   » traversal is breadth-first
See first lecture or the technical papers for
more information
Summary
Web search differs from traditional IR
systems
– Different kind of collection
– Different kinds of users/queries
– Different economic motivations
Ranking combines many features in a
difficult-to-specify manner
– Link analysis and proximity of terms seem
  especially important
– This is in contrast to the term-frequency
  orientation of standard search
   » Why?
Summary (cont.)

Web search engine architecture
– Similar in many ways to standard IR
– Indexes usually duplicated across
  machines to handle many queries quickly
Web crawling
– Used to create the collection
– Can be guided by quality metrics
– Is very difficult to do robustly
Web Search Statistics
Searches per Day
[chart; data missing for fast.com, Excite, NorthernLight, etc.]
Information from searchenginewatch.com
Web Search Engine Visits
[chart]
Information from searchenginewatch.com
Percentage of web users who visit the site shown
[chart]
Information from searchenginewatch.com
Search Engine Size (July 2000)
[chart]
Information from searchenginewatch.com
Does size matter? You can’t access many hits anyhow.
[chart]
Information from searchenginewatch.com
Increasing numbers of indexed pages, self-reported
[chart]
Information from searchenginewatch.com
Increasing numbers of indexed pages (more recent), self-reported
[chart]
Information from searchenginewatch.com
Web Coverage
[chart]
Information from searchenginewatch.com
From description of the FAST search engine, by Knut Risvik
http://www.infonortics.com/searchengines/sh00/risvik_files/frame.htm
Directory sizes
[chart]
Information from searchenginewatch.com
