Lecture 39: …and the World Wide Web




cs1120 Fall 2011
David Evans
http://www.cs.virginia.edu/evans
Announcements
Exam 2 due 61 seconds ago!

Friday: we will return graded Exam 2, along with
  guidance about the Final

  Must be present (or email me in advance) to win!

 If you want to present your PS8 in class Monday, remember to email me!



Plan
The World Wide Web
Building Web Applications
How Google Works
  (or, going back to pre-PS5 to make things
      really fast again!)
cs1120 recap in one (heavily animated) slide!



The World Wide Web
The “Desk Wide Web”




            Memex Machine
Vannevar Bush, As We May Think, LIFE, 1945
WorldWideWeb




Sir Tim Berners-Lee (CERN, Switzerland; MIT)
First web server and client, 1990 (this picture, 1993)
Overview:
Many of the discussions of the future at CERN and the LHC era end with the question – “Yes, but how will we ever keep track of such a large project?” This proposal provides an answer to such questions. Firstly, it discusses the problem of information access at CERN. Then, it introduces the idea of linked information systems, and compares them with less flexible ways of finding information.

http://www.w3.org/History/1989/proposal-msw.html
A Practical Project




WorldWideWeb
Established a common language for sharing
  information on computers

Lots of previous attempts (Gopher, WAIS,
  Archie, Xanadu, etc.) failed




Why the World Wide Web?
World Wide Web succeeded because it was simple!

Didn’t attempt to maintain links, just a common
  way to name things
Uniform Resource Locators (URL)
   http://www.cs.virginia.edu/cs1120/index.html
     Service:   http (HyperText Transfer Protocol)
     Hostname:  www.cs.virginia.edu
     File Path: /cs1120/index.html
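
The three parts of a URL can be pulled apart mechanically. A minimal sketch using Python's standard `urllib.parse` module; the helper name `split_url` is made up for this example:

```python
from urllib.parse import urlsplit

def split_url(url):
    """Return the (service, hostname, file path) parts named on the slide."""
    parts = urlsplit(url)
    return parts.scheme, parts.hostname, parts.path

service, hostname, path = split_url("http://www.cs.virginia.edu/cs1120/index.html")
print(service)   # http
print(hostname)  # www.cs.virginia.edu
print(path)      # /cs1120/index.html
```

The service says which protocol to speak, the hostname says which machine to contact, and the file path says what to ask that machine for.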
HyperText Transfer Protocol

Client (Browser) → Server:
    GET /cs1120/index.html HTTP/1.0

Server → Client (Browser):
    <html>
    <head>
    …
    (contents of file, written in HTML: HyperText Markup Language)
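
The request the browser sends is just a line of text. The sketch below builds and takes apart that line in Python without touching the network; the function names `build_get_request` and `parse_request_line` are hypothetical, and only the simplest HTTP/1.0 request (no headers) is assumed:

```python
def build_get_request(path):
    """Build the text a client sends to ask a server for a file (HTTP/1.0)."""
    # The request line is followed by an empty line that ends the request.
    return "GET " + path + " HTTP/1.0\r\n\r\n"

def parse_request_line(request):
    """Split the first line of a request into (method, path, version)."""
    first_line = request.split("\r\n")[0]
    method, path, version = first_line.split(" ")
    return method, path, version

request = build_get_request("/cs1120/index.html")
print(parse_request_line(request))
```

A server does the reverse of the client: it reads the request line, finds the file named by the path, and sends its contents back.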
HTML: HyperText Markup Language

Language for controlling display of web pages
Uses formatting tags: between < and >
        Document ::= <html> Header Body </html>
        Header ::= <head> HeadElements </head>
        HeadElements ::= ε | HeadElement HeadElements
        HeadElement ::= <title> Element </title>
        Body ::= <body> Elements </body>
        Elements ::= ε | Element Elements
        Element ::= <p> Element </p>
        Element ::= <center> Element </center>
        …
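
One way to read the grammar is as a recipe for generating documents. The sketch below follows the Document, Header, and Body rules to build a small page in Python. The helper names are invented for this example, and it assumes plain text is one of the Element forms hidden behind the "…":

```python
def element(tag, content):
    """Wrap content in an opening and closing tag, e.g. <p> ... </p>."""
    return "<" + tag + ">" + content + "</" + tag + ">"

def document(title, paragraphs):
    """Document ::= <html> Header Body </html>"""
    header = element("head", element("title", title))       # Header rule
    body = element("body",                                   # Body rule
                   "".join(element("p", text) for text in paragraphs))
    return element("html", header + body)

print(document("cs1120", ["Hello", "World Wide Web"]))
```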
Popular Web Site: Strategy 1
Static, Authored Web Site
Content Producer
http://www.twinkiesproject.com/

Drawbacks:
• Have to do all the work yourself
• The world may already have enough Twinkie-experiment websites
Popular Web Site: Strategy 2
Dynamic Web Applications
Web Programmer: Seed content and function
  → Attracts users
  → Produce more content
Example: eBay in 1997
http://web.archive.org/web/19970614001443/http://www.ebay.com/
Popular Web Site: Strategy 2
Dynamic Web Applications
Seed content and function → Attracts users → Produce more content
Examples: reddit.com in 2005, reddit.com today

Advantages:
• Users do most of the work
• If you’re lucky, they might even pay you for the privilege!

Disadvantages:
• Lose control over the content (you might get sued for things your users do)
• Have to know how to program a web application
Dynamic Web Sites
Programs that run on the web server
   Can be written in any language (often in Python or Java), just
     need a way to connect the web server to the program
   Program generates HTML (often JavaScript also now)
   Every useful web site does this
Programs that run on the client’s machine
   Java, JavaScript (aka, “Scheme for the Web”), Flash, etc.:
     language must be supported by the client’s browser
   Responsive interface: limited round-trips to server
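
As a rough illustration of "program generates HTML", here is a Python function that turns a list of user-submitted strings into a page. The name `render_listing` is hypothetical and the web-server connection is left out. Escaping the user text with the standard `html.escape` is one small defense against users supplying their own markup:

```python
from html import escape

def render_listing(title, items):
    """Generate an HTML page from data, as a server-side program would."""
    # escape() replaces characters like < and > so user text cannot add tags.
    rows = "".join("<p>" + escape(item) + "</p>" for item in items)
    return ("<html><head><title>" + escape(title) + "</title></head>"
            "<body>" + rows + "</body></html>")

print(render_listing("Submissions", ["First post", "<b>bold claim</b>"]))
```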
Searching the Web




Building a Web Search Engine
Database of web pages
  Crawling the web collecting pages and links
  Indexing them efficiently
Responding to Searches
  Spell checking – edit distance
  How to find documents that match a query
  How to rank the “best” documents
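
The slide names edit distance without defining it. A minimal sketch, assuming the usual Levenshtein definition (fewest single-character insertions, deletions, or substitutions), computed one row at a time:

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    # previous[j] is the distance between the prefix of a handled so far
    # and b[:j]; it starts as the distances from the empty string.
    previous = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        current = [i]  # turning a[:i] into "" takes i deletions
        for j in range(1, len(b) + 1):
            substitute = previous[j - 1] + (a[i - 1] != b[j - 1])
            current.append(min(previous[j] + 1,     # delete a[i-1]
                               current[j - 1] + 1,  # insert b[j-1]
                               substitute))         # keep or replace
        previous = current
    return previous[len(b)]

print(edit_distance("gogle", "google"))
```

A spell checker could suggest the known words with the smallest distance to what the user typed.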
Crawling Crawler
activeURLs = ["www.yahoo.com"]
while len(activeURLs) > 0:
    newURLs = []
    for URL in activeURLs:
        page = downloadPage(URL)           # fetch the page contents
        newURLs += extractLinks(page)      # collect every link on the page
    activeURLs = newURLs                   # next round: crawl the new links

Problems:
  Will keep revisiting the same pages
  Will take very long to get a good view of the web
  Will annoy web server admins
  downloadPage and extractLinks must be very robust
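
Two of the listed problems have simple partial fixes: remember which pages were already visited, and keep going when a download fails. The sketch below is one possible variant, not the lecture's crawler. The download and link-extraction functions are passed in as parameters, `max_pages` is an assumed safety limit, and a tiny in-memory dictionary stands in for the web:

```python
def crawl(seed_urls, download_page, extract_links, max_pages=1000):
    """Breadth-first crawl that never downloads the same URL twice."""
    visited = set()
    active_urls = list(seed_urls)
    while active_urls and len(visited) < max_pages:
        new_urls = []
        for url in active_urls:
            if url in visited or len(visited) >= max_pages:
                continue                  # skip repeats and respect the limit
            visited.add(url)
            try:
                page = download_page(url)
            except Exception:
                continue                  # one bad page should not stop the crawl
            new_urls += extract_links(page)
        active_urls = new_urls
    return visited

# Hypothetical toy "web": each URL maps directly to the links on its page.
toy_web = {"a": ["b", "c"], "b": ["a"], "c": ["d", "missing"], "d": []}
print(sorted(crawl(["a"], lambda url: toy_web[url], lambda page: page)))
```

This does nothing about speed or about annoying server administrators; a real crawler would also need to pace its requests to each site.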
Building an Index
What if we just stored all the pages?
Answering a query would be Θ(size of the database)
      (need to look at all characters in database)

Google: about 40 billion pages (1 trillion URLs, but the number
actually indexed is a closely kept corporate secret)
               × 60 KB (average web page size)
       = ~2.4 quadrillion bytes to search!

Linear is not nearly good enough when n is in the quadrillions
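
The arithmetic can be checked directly. The page count and page size below are the slide's estimates. The scan rate of 1 GB per second is purely an assumed figure, used to get a feel for what "linear" means at this scale:

```python
pages = 40 * 10**9             # about 40 billion pages (slide's estimate)
bytes_per_page = 60 * 10**3    # 60 KB per page, taking 1 KB = 1000 bytes
total_bytes = pages * bytes_per_page
print(total_bytes)             # 2400000000000000, i.e. 2.4 quadrillion

scan_rate = 10**9              # ASSUMPTION: one machine reads 1 GB per second
seconds = total_bytes / scan_rate
days = seconds / (60 * 60 * 24)
print(round(days, 1))          # days needed to scan everything once
```

Under that assumption a single query that reads every byte would take weeks, which is why the index matters.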
Hash Table
     Index          Key-Value Pairs
       0            { <“Colleen”, ?>, <“virginia”, ?>, … }
       1            { <“Bob”, ?>, … }
       2
       3
       …
     [about a million bins?]

def lookup(key, table): return searchEntries(table[H(key, len(table))])

       Finding a good H is difficult
           You can download Google’s from
           http://code.google.com/p/google-sparsehash/
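
A minimal sketch of the idea in Python, with each bin holding a list of key-value pairs. The hash function `H` here is a deliberately naive stand-in (it just sums character codes), and `make_table` and `insert` are names invented for this example:

```python
def H(key, nbins):
    """Toy hash: sum of character codes, modulo the number of bins."""
    return sum(ord(c) for c in key) % nbins

def make_table(nbins):
    return [[] for _ in range(nbins)]   # one empty bin per index

def insert(key, value, table):
    bucket = table[H(key, len(table))]
    for i, (existing_key, _) in enumerate(bucket):
        if existing_key == key:
            bucket[i] = (key, value)    # replace the old value
            return
    bucket.append((key, value))

def lookup(key, table):
    # Only one bin is searched, not the whole table.
    for existing_key, value in table[H(key, len(table))]:
        if existing_key == key:
            return value
    return None

table = make_table(8)
insert("Colleen", 1, table)
insert("virginia", 2, table)
print(lookup("virginia", table), lookup("Bob", table))
```

This `H` sends all anagrams of a word to the same bin, which hints at why finding a good hash function is difficult.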
Google’s Lexicon
1998: 14 million words (billions today?)
Lookup word in H(word, nbins): maps to WordID

    Key          Words
      0          [<“aardvark”, 1024235>, ...]
      1          [<“aaa”, 224155>, ..., <“zzz”, 29543>]
     ...         ...
  nbins – 1      [<“abba”, 25583>, ..., <“zeit”, 50395>]
Google’s Reverse Index
 (Based on the 1998 paper; it has definitely changed since then, but now they are secretive!)

  WordId       ndocs    pointer (into the “Inverted Barrels”)
  00000000         3
  00000001        15
  ...
  16777215       105

  Lexicon: 293 MB (1998); today: many GB?
  “Inverted Barrels”: 41 GB (1998); today: many TB?
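
The essence of a reverse (inverted) index is a map from each word to the documents containing it. The sketch below is a simplified in-memory version. It uses words as keys rather than WordIds and plain lists of document ids rather than barrels, and all names in it are hypothetical:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs maps docid -> text; returns word -> sorted list of docids."""
    index = defaultdict(set)
    for docid, text in docs.items():
        for word in text.lower().split():
            index[word].add(docid)
    return {word: sorted(docids) for word, docids in index.items()}

def search(index, query):
    """Return the docids that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return []
    result = set(index.get(words[0], []))
    for word in words[1:]:
        result &= set(index.get(word, []))   # keep docs that have this word too
    return sorted(result)

docs = {1: "the world wide web", 2: "the desk wide web", 3: "world news"}
index = build_inverted_index(docs)
print(search(index, "wide web"))
```

Answering a query now costs a few lookups and set intersections instead of a scan over every stored page.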
Inverted Barrels
Each entry: docid (27 bits), nhits (5 bits), hits (16 bits each)
Example entry: 7630486927 23 ...

plain hit (16 bits):
  capitalized: 1 bit
  font size: 3 bits
  position: 12 bits (first 4095 chars, everything else)

extra info for anchors, titles (fewer position bits)

Suggested experiment for winter break:
is the position field still only 12 bits?
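
Packing a plain hit into 16 bits is a matter of shifting and masking. The field widths below come from the slide. The order of the fields within the 16 bits, and the choice to clamp large positions to 4095, are assumptions made for this sketch:

```python
def pack_hit(capitalized, font_size, position):
    """Pack a plain hit into 16 bits: 1 + 3 + 12."""
    if not 0 <= font_size < 8:
        raise ValueError("font size must fit in 3 bits")
    position = min(position, 4095)      # ASSUMPTION: clamp to the 12-bit maximum
    return (int(capitalized) << 15) | (font_size << 12) | position

def unpack_hit(hit):
    """Recover (capitalized, font_size, position) from a packed hit."""
    return bool(hit >> 15), (hit >> 12) & 0b111, hit & 0xFFF

hit = pack_hit(True, 3, 100)
print(hex(hit), unpack_hit(hit))
```

Two bytes per occurrence of a word is what keeps the barrels small enough to store and scan.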
Finding the “Best” Documents
Humans rate them
  “Jerry and David’s Guide to the World Wide Web”
    (became Yahoo!)
Machines rate them
  Count number of occurrences of keyword
     Easy for sites to rig this
  Machine language understanding not good enough
Business Model
  Whoever pays you the most is listed first
PageRank
If a site is important and interesting, other sites
   will link to it.
 Don’t ever take <a href=http://www.cs.virginia.edu/cs1120>cs1120</a>!



But…not all links are equal:
  if a lot of highly-ranked sites link to this site,
  this site should be highly-ranked.


PageRank
def pageRank(u):
    rank = 0
    for b in linksToPage(u):
        rank = rank + pageRank(b) / Links(b)
    return rank


                                Would this work?
Converging PageRank
Ranks of all pages depend on ranks of all other
  pages
Keep recalculating ranks until they converge
def CalculatePageRanks (urls):
 initially, every rank is 1
 for as many times as necessary
    calculate a new rank for each page (using old ranks)
    replace the old ranks with the new ranks
                                  How do initial ranks affect results?
                                  How many iterations are necessary?
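
The recalculation loop can be sketched in Python as follows. It uses the rule from the earlier slide (each page gives rank / number-of-links to every page it links to) with no damping factor, so it is simpler than the published PageRank formulation. The function name, the tolerance, and the three-page toy graph are all made up for illustration:

```python
def calculate_page_ranks(links, tolerance=1e-9, max_iterations=1000):
    """links maps each page to the list of pages it links to.

    Assumes every page appears as a key and has at least one outgoing link.
    """
    ranks = {page: 1.0 for page in links}        # initially, every rank is 1
    for _ in range(max_iterations):
        new_ranks = {page: 0.0 for page in links}
        for page, targets in links.items():
            share = ranks[page] / len(targets)   # old rank, split over links
            for target in targets:
                new_ranks[target] += share
        change = max(abs(new_ranks[p] - ranks[p]) for p in links)
        ranks = new_ranks                        # replace old ranks with new
        if change < tolerance:
            break                                # ranks have converged
    return ranks

toy_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(calculate_page_ranks(toy_links))
```

On this toy graph the ranks settle near A = 1.2, B = 0.6, C = 1.2. Because new ranks are computed only from old ranks, there is no recursion that could run forever.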
PageRank: 1998
Crawlable web (1998):
  150 million pages, 1.7 Billion links
Database of 322 million links
  Converges in about 50 iterations
Initialization matters
  All pages = 1: very democratic, models browser
    equally likely to start on random page
  www.yahoo.com = 1, ..., all others = 0
     More like what Google probably uses
Do we have a
  search engine?


Theoretician: Sure!

Ali G: No way! It’ll blow up.




                                Google’s First Server
How do we make our service fast
enough to index the whole web
 and serve billions of requests?




Counting Word Occurrences

“When in the Course of human events, it becomes necessary for one people to dissolve the political bands which have connected them with another, …”
    → [<“When”, 1>, <“in”, 1>, <“the”, 2>, …]

“We the People of the United States, in Order to form a more perfect Union, establish Justice, insure domestic Tranquility, provide for the …”
    → [<“We”, 1>, <“in”, 1>, <“the”, 2>, …]

map(doc, countWords)
If we have enough machines, can we do this fast for the whole web?
[<“When”, 1>, <“in”, 1>, <“the”, 2>, …]
[<“We”, 1>, <“in”, 1>, <“the”, 2>, …]
    reduce → [<“We”, 1>, <“in”, 2>, …]

[<“a”, 5>, <“in”, 3>, <“the”, 2>, …]
[<“apple”, 1>, <“in”, 1>, <“the”, 7>, …]
    reduce → [<“a”, 5>, <“in”, 4>, …]

Combining the two partial results:
    reduce → [<“a”, 5>, <“in”, 6>, …]
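
The map and reduce steps can be imitated on one machine with Python's built-in `map` and `functools.reduce`. This is only a sketch of the programming pattern, not of Google's distributed system. The function names and the two shortened documents are chosen for this example:

```python
from collections import Counter
from functools import reduce

def count_words(doc):
    """Map step: one document in, one table of word counts out."""
    return Counter(doc.split())

def merge_counts(a, b):
    """Reduce step: combine two count tables into a new one."""
    return a + b     # Counter addition builds a new Counter; a and b are unchanged

docs = ["When in the Course of human events",
        "We the People of the United States in Order"]
total = reduce(merge_counts, map(count_words, docs), Counter())
print(total["the"], total["in"], total["of"])
```

Neither step modifies shared data, so each document could be counted on a different machine and the partial results merged in any grouping.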
MapReduce




Key to Massive Parallel Execution


   Get rid of state and mutation!




cs1120 recap (animated slide, layers listed top to bottom):

Objects: SimObject, PhysicalObject, MobileObject, Place
State and Mutation: m1: 1 2 3
Functional Programming (PS 1-4):
    (define (count-matches p b)
      (list-sum (map (lambda (v) (if (eq? v b) 1 0)) p)))
Interpreters:
    def meval(expr, env):
      … return evalApplication(expr, env)
Any Mechanical Computation: Turing Machine
    ... # 1 0 1 1 0 1 1 1 0 1 1 0 1 1 1 # ...
Any Discrete Function: truth table with inputs A, B, C and outputs R1, R0
Mechanical Logic: AND, NOT
    (or a b) = (not (and (not a) (not b)))
“Magic” Transistors
Objects
State and Mutation
Functional Programming (PS 1-4)
Interpreters
Any Mechanical Computation
Any Discrete Function
Mechanical Logic
“Magic” Transistors

Recursive Definitions, Universality, Abstraction

Charge
Now, you know almost everything you need to build the next reddit or google!

Class 39: ...and the World Wide Web

  • 1. Lecture 39: …and the World Wide Web cs1120 Fall 2011 David Evans http://www.cs.virginia.edu/evans
  • 2. Announcements Exam 2 due 61 seconds ago! 70 69 68 67 66 65 64 63 62 60 Friday: we will return graded Exam 2, along with guidance about the Final Must be present (or email me in advance) to win! If you want to present your PS8 in class Monday, remember to email me! 2
  • 3. Plan The World Wide Web Building Web Applications How Google Works (or, going back to pre-PS5 to make things really fast again!) cs1120 recap in one (heavily animated) slide! 3
  • 5. The “Desk Wide Web” Memex Machine Vannevar Bush, As We May Think, LIFE, 1945
  • 6. WorldWideWeb Sir Tim Berners-Lee First web server and client, 1990 CERN (Switzerland) (This picture, 1993) MIT
  • 7. Overview: Many of the discussions of the future at CERN and the LHC era end with the question – “Yes, but how will we ever keep track of such a large project?” This proposal provides an answer to such questions. Firstly, it discusses the problem of information access at CERN. Then, it introduces the idea of linked information systems, and compares them with less flexible ways of finding information. http://www.w3.org/History/1989/proposal-msw.html
  • 9. 9
  • 10. WorldWideWeb Established a common language for sharing information on computers Lots of previous attempts (Gopher, WAIS, Archie, Xanadu, etc.) failed 10
  • 11. Why the World Wide Web? World Wide Web succeeded because it was simple! Didn’t attempt to maintain links, just a common way to name things Uniform Resource Locators (URL) http://www.cs.virginia.edu/cs1120/index.html Service Hostname File Path HyperText Transfer Protocol
  • 12. HyperText Transfer Protocol Server GET /cs1120/index.html HTTP/1.0 <html> <head> Contents … of file Client (Browser) HTML HyperText Markup Language
  • 13. HTML: HyperText Markup Language Language for controlling display of web pages Uses formatting tags: between < and > Document ::= <html> Header Body </html> Header ::= <head> HeadElements </head> HeadElements ::= HeadElement HeadElements HeadElements ::= ε | <title> Element </title> Body ::= <body> Elements </body> Elements ::= ε | Element Elements Element ::= <p> Element </p> Element ::= <center> Element </center> …
  • 14. Popular Web Site: Strategy 1 Static, Authored Web Site Drawbacks: •Have to do all the work yourself •The world may already have enough Twinkie-experiment websites Content Producer http://www.twinkiesproject.com/
  • 15. Popular Web Site: Strategy 2 Dynamic Web Applications Attracts users Seed content and function Web Programmer Produce more content eBay in 1997 http://web.archive.org/web/19970614001443/http://www.ebay.com/
  • 16. Popular Web Site: Strategy 2 Dynamic Web Applications Attracts users Seed content and function Advantages: • Users do most of the work • If you’re lucky, they might even pay you for the privilege! Disadvantages: • Lose control over the content (you might Produce more get sued for things your users do) content reddit.com today • Have to know how to program a web application reddit.com in 2005
  • 17. Dynamic Web Sites Programs that run on the web server Can be written in any language (often in Python or Java), just need a way to connect the web server to the program Program generates HTML (often JavaScript also now) Every useful web site does this Programs that run on the client’s machine Java, JavaScript (aka, “Scheme for the Web”), Flash, etc.: language must be supported by the client’s browser Responsive interface: limited round-trips to server
  • 19. 19
  • 20. Building a Web Search Engine Database of web pages Crawling the web collecting pages and links Indexing them efficiently Responding to Searches Spell checking – edit distance How to find documents that match a query How to rank the “best” documents
  • 21. Crawling Crawler activeURLs = * “www.yahoo.com” + while (len(activeURLs) > 0) : newURLs = [ ] for URL in activeURLs: page = downloadPage (URL) newURLs += extractLinks (page) activeURLs = newURLs Problems: Will keep revisiting the same pages Will take very long to get a good view of the web Will annoy web server admins downloadPage and extractLinks must be very robust
  • 22. Building a Web Search Engine Database of web pages Crawling the web collecting pages and links Indexing them efficiently Responding to Searches How to find documents that match a query How to rank the “best” documents
  • 23. Building an Index What if we just stored all the pages? Answering a query would be (size of the database) (need to look at all characters in database) Google: about 40 Billion pages (1 Trillion URLs, but number actually indexed is a closely kept corporate secret) * 60 KB (average web page size) = ~2.4 Quadrillion bytes to search! Linear is not nearly good enough when n is Quadrillions
  • 24. Hash Table Index Key-Value Pairs 0 , <“Colleen”, ? >, <“virginia”, ? >, … - 1 , <“Bob”, ? >, … - 2 3 … [about a million bins?] def lookup(key, table) : searchEntries(table[H(key, len(table))]) Finding a good H is difficult You can download google’s from http://code.google.com/p/google-sparsehash/
  • 25. Google’s Lexicon 1998: 14 million words (billions today?) Lookup word in H(word, nbins): maps to WordID Key Words 0 *<“aardvark”, 1024235>, ... + 1 *<“aaa”, 224155>, ..., <“zzz”, 29543> + ... ... nbins – 1 *<“abba”, 25583>, ..., <“zeit”, 50395> +
  • 26. Google’s Reverse Index (Based on 1998 paper…definitely changed some since then, but now they are secretive!) WordId ndocs pointer 00000000 3 00000001 15 ... “Inverted Barrels”: 16777215 105 41 GB (1998) Today: many TB? Lexicon: 293 MB (1998) Today: many GB?
  • 27. Inverted Barrels Entry layout: docid (27 bits), nhits (5 bits), hits (16 bits each). Example: 7630486927 23 ... Plain hit: capitalized: 1 bit; font size: 3 bits; position: 12 bits (first 4095 chars, everything else). Extra info for anchors, titles (fewer position bits). Suggested experiment for winter break: is the position field still only 12 bits?
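The 16-bit plain hit can be sketched with bit shifts and masks. The field widths come from the slide; the field order (capitalized in the top bit, then font size, then position) and the choice to clamp positions above 4095 are assumptions made for illustration.

```python
def pack_plain_hit(capitalized, font_size, position):
    """Pack a plain hit into 16 bits: 1 + 3 + 12 (field order is assumed)."""
    if not 0 <= font_size < 8:
        raise ValueError("font_size must fit in 3 bits")
    if position < 0:
        raise ValueError("position must be non-negative")
    position = min(position, 4095)   # assumption: clamp to the 12-bit maximum
    return (int(bool(capitalized)) << 15) | (font_size << 12) | position

def unpack_plain_hit(hit):
    """Inverse of pack_plain_hit: returns (capitalized, font_size, position)."""
    return bool(hit >> 15), (hit >> 12) & 0b111, hit & 0xFFF
```

Twelve position bits can distinguish only 4096 values, which is why the slide asks whether that field is still so small.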
  • 28. Building a Web Search Engine Database of web pages Crawling the web collecting pages and links Indexing them efficiently Responding to Searches Spell checking – edit distance How to find documents that match a query How to rank the “best” documents
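For the spell-checking item, edit distance can be computed with a standard dynamic-programming table. This sketch uses the common Levenshtein definition (insertions, deletions, and substitutions each cost 1); the slides do not say which variant is intended, so treat that choice as an assumption.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    # prev[j] = distance between the first i-1 chars of a and first j chars of b
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]   # turning i chars of a into "" takes i deletions
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]
```

A spell checker could suggest the lexicon words with the smallest distance to the query word.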
  • 29. Finding the “Best” Documents Humans rate them “Jerry and David’s Guide to the World Wide Web” (became Yahoo!) Machines rate them Count number of occurrences of keyword Easy for sites to rig this Machine language understanding not good enough Business Model Whoever pays you the most is listed first
  • 30. PageRank If a site is important and interesting, other sites will link to it. Don’t ever take <a href=http://www.cs.virginia.edu/cs1120>cs1120</a>! But…not all links are equal: if a lot of highly-ranked sites link to this site, this site should be highly-ranked.
  • 31. PageRank def pageRank (u): rank = 0 for b in linksToPage (u): rank = rank + pageRank (b) / Links (b) return rank Would this work?
  • 32. Converging PageRank Ranks of all pages depend on ranks of all other pages Keep recalculating ranks until they converge def CalculatePageRanks (urls): initially, every rank is 1 for as many times as necessary calculate a new rank for each page (using old ranks) replace the old ranks with the new ranks How do initial ranks affect results? How many iterations are necessary?
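The iterative procedure on this slide can be sketched as follows, assuming the link graph fits in a dictionary. The function name, the `links` representation, and the fixed iteration count are illustrative choices. The sketch follows the slides' simplified formula with no damping factor, so pages without outgoing links leak rank and some graphs will not converge.

```python
def calculate_page_ranks(links, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    ranks = {p: 1.0 for p in pages}          # initially, every rank is 1
    for _ in range(iterations):
        new_ranks = {p: 0.0 for p in pages}
        for src, targets in links.items():
            if targets:
                # each page splits its old rank evenly among its links
                share = ranks[src] / len(targets)
                for t in targets:
                    new_ranks[t] += share
        ranks = new_ranks                    # replace old ranks with new
    return ranks
```

For example, with `{"a": ["b", "c"], "b": ["c"], "c": ["a"]}` the ranks settle near a = 1.2, b = 0.6, c = 1.2, because `c` collects rank from both other pages and passes all of it to `a`.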
  • 33. PageRank: 1998 Crawlable web (1998): 150 million pages, 1.7 Billion links Database of 322 million links Converges in about 50 iterations Initialization matters All pages = 1: very democratic, models browser equally likely to start on random page www.yahoo.com = 1, ..., all others = 0 More like what Google probably uses
  • 34. Do we have a search engine? Theoretician: Sure! Ali G: No way! It’ll blow up. Google’s First Server
  • 35. How do we make our service fast enough to index the whole web and serve billions of requests?
  • 36. Counting Word Occurrences “When in the Course of human events, it becomes necessary for one people to dissolve the political bands which have connected them with another, …” → [ <“When”, 1>, <“in”, 1>, <“the”, 2>, … ] “We the People of the United States, in Order to form a more perfect Union, establish Justice, insure domestic Tranquility, provide for the …” → [ <“We”, 1>, <“in”, 1>, <“the”, 2>, … ] map(doc, countWords) If we have enough machines, can we do this fast for the whole web?
  • 37. reduce([ <“When”, 1>, <“in”, 1>, <“the”, 2>, … ], [ <“We”, 1>, <“in”, 1>, <“the”, 2>, … ]) → [ <“We”, 1>, <“in”, 2>, … ]; reduce([ <“a”, 5>, <“in”, 3>, <“the”, 2>, … ], [ <“apple”, 1>, <“in”, 1>, <“the”, 7>, … ]) → [ <“a”, 5>, <“in”, 4>, … ]; reduce of those two results → [ <“a”, 5>, <“in”, 6>, … ]
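The map and reduce steps for word counting can be sketched on a single machine with Python's built-ins. The names `count_words`, `merge_counts`, and `word_count` are illustrative, and this is not the API of Google's MapReduce. The point is that both steps are pure functions, so the map calls could run on different machines and the merges could be combined in any grouping.

```python
from collections import Counter
from functools import reduce

def count_words(doc):
    """Map step: count word occurrences in one document."""
    return Counter(doc.split())

def merge_counts(a, b):
    """Reduce step: combine two count tables without mutating either."""
    return a + b

def word_count(docs):
    """Apply the map step to every document, then fold the results together."""
    return reduce(merge_counts, map(count_words, docs), Counter())
```

Because `merge_counts` is associative, pairs of partial results can be merged in a tree, as in the reduce diagram on the slide.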
  • 38. MapReduce
  • 39. Key to Massive Parallel Execution Get rid of state and mutation!
  • 40. Functional Programming (PS 1-4): (define (count-matches p b) (list-sum (map (lambda (v) (if (eq? v b) 1 0)) p))) Interpreters: def meval(expr, env): … return evalApplication(expr, env) Any Mechanical Computation: Turing Machine (tape: ... # 1 0 1 1 0 1 1 1 0 1 1 0 1 1 1 # ...; states 1, 2, 3) Any Discrete Function: truth table with columns A B C R1 R0 (rows 0 0 0 → 0 0; 0 0 1 → 0 1; …); (or a b) = (not (and (not a) (not b))) Mechanical Logic: AND, NOT “Magic” Transistors
  • 42. Objects: SimObject, PhysicalObject, Place, MobileObject State and Mutation: m1: 1 2 3 Functional Programming (PS 1-4): (define (count-matches p b) (list-sum (map (lambda (v) (if (eq? v b) 1 0)) p))) Interpreters: def meval(expr, env): … return evalApplication(expr, env) Any Mechanical Computation: Turing Machine (tape: ... # 1 0 1 1 0 1 1 1 0 1 1 0 1 1 1 # ...; states 1, 2, 3) A B C R1 R0 (or a b)
  • 44. Layers: Objects; State and Mutation; Functional Programming (PS 1-4); Interpreters; Any Mechanical Computation; Any Discrete Function; Mechanical Logic; “Magic” Transistors. Themes: Recursive Definitions, Universality, Abstraction. Charge: Now, you know almost everything you need to build the next reddit or Google!