Building Mini‐Google in Ruby 


                                                                               Ilya Grigorik 
                                                                                        @igrigorik 


Building Mini‐Google in Ruby    http://bit.ly/railsconf-pagerank    @igrigorik #railsconf 
postrank.com/topic/ruby 




                                The slides…                           Twitter                        My blog 


Ruby + Math 
                                                                             PageRank 
               Optimization 




   Misc Fun                     Examples                                           Indexing 


PageRank                                        PageRank + Ruby 




      Tools 
        +                        Examples                                        Indexing 
   Optimization 

Consume with care… 
                     everything that follows is based on released / public domain info 




Search‐engine graveyard 
                                                                   Google did pretty well… 




Query: Ruby 




                                                                                              Results 




       1. Crawl                            2. Index                                          3. Rank 




                                                                   Search pipeline 
                                                                                    50,000‐foot view 



             Bah                           Interesting                                      Fun 




          CPU speed                                         333 MHz
          RAM                                               32-64 MB

          Index                                             27,000,000 documents
          Index refresh                                     once a month-ish
          PageRank computation                              several days

          Laptop CPU                                        2.1 GHz
          VM RAM                                            1 GB
          1-million-page web                                ~10 minutes


                                                                       circa 1997-1998



Creating & Maintaining an Inverted Index  
                                                                     DIY and the gotchas within 




    require 'set'

    pages = {
      "1" => "it is what it is",
      "2" => "what is it",
      "3" => "it is a banana"
    }

    index = {}

    # map every word to the set of pages that contain it
    pages.each do |page, content|
      content.split(/\s+/).each do |word|
        if index[word]
          index[word] << page
        else
          index[word] = Set.new([page])
        end
      end
    end

    # index:
    # {
    #   "it"=>#<Set: {"1", "2", "3"}>,
    #   "a"=>#<Set: {"3"}>,
    #   "banana"=>#<Set: {"3"}>,
    #   "what"=>#<Set: {"1", "2"}>,
    #   "is"=>#<Set: {"1", "2", "3"}>
    # }


                                                 Building an Inverted Index 

                                                 Word => [Document] 

 # query: "what is banana"
 p index["what"] & index["is"] & index["banana"]
 # > #<Set: {}>

 # query: "a banana"
 p index["a"] & index["banana"]
 # > #<Set: {"3"}>

 # query: "what is"
 p index["what"] & index["is"]
 # > #<Set: {"1", "2"}>

 # index:
 # {
 #   "it"=>#<Set: {"1", "2", "3"}>,
 #   "a"=>#<Set: {"3"}>,
 #   "banana"=>#<Set: {"3"}>,
 #   "what"=>#<Set: {"1", "2"}>,
 #   "is"=>#<Set: {"1", "2", "3"}>
 # }
                                                                   Querying the index 
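The set intersections used for these queries generalize to a query of any length. A minimal sketch, assuming AND semantics for every query word; the `search` helper and the `QUERY_INDEX` constant are hypothetical names, not part of the talk:

```ruby
require 'set'

# The inverted index from the slide: word => set of page ids.
QUERY_INDEX = {
  "it"     => Set["1", "2", "3"],
  "a"      => Set["3"],
  "banana" => Set["3"],
  "what"   => Set["1", "2"],
  "is"     => Set["1", "2", "3"]
}

# Hypothetical helper: pages that contain every word of the query.
def search(index, query)
  words = query.split(/\s+/)
  return Set.new if words.empty?

  # A word missing from the index matches no page at all.
  postings = words.map { |word| index.fetch(word, Set.new) }
  postings.inject { |hits, pages| hits & pages }
end

p search(QUERY_INDEX, "what is banana")
p search(QUERY_INDEX, "a banana")
p search(QUERY_INDEX, "what is")
```

The result is still an unordered set, so ranking the hits remains an open problem.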

                                                                   What order? 
                                                                   query "what is" => [1, 2] or [2, 1] 

                                                 Building an Inverted Index 
                                                          Hmmm? 

                                •  PDF, HTML, RSS? 
                                •  Lowercase / Upcase? 
                                •  Compact Index? 
                                •  Stop words? 
                                •  Persistence? 
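Some of these gotchas are cheap to handle by hand. A rough sketch, assuming a tiny made-up stop-word list and `Marshal` for persistence; `STOP_WORDS`, `tokenize` and `build_index` are illustrative names only:

```ruby
require 'set'

# Assumed stop-word list, for illustration only.
STOP_WORDS = Set["a", "is", "it", "the"]

# Lowercase the text, strip punctuation, drop stop words.
def tokenize(text)
  text.downcase.scan(/[a-z0-9]+/).reject { |word| STOP_WORDS.include?(word) }
end

# word => set of page ids; the default block replaces the if/else.
def build_index(pages)
  index = Hash.new { |hash, word| hash[word] = Set.new }
  pages.each do |page, content|
    tokenize(content).each { |word| index[word] << page }
  end
  index
end

pages = {
  "1" => "It is what it is",
  "2" => "What is it?",
  "3" => "It is a banana"
}

index = build_index(pages)

# Persistence: Marshal cannot dump a Hash that has a default block,
# so copy the entries into a plain Hash first.
blob     = Marshal.dump(Hash[index.to_a])
restored = Marshal.load(blob)

p restored
```

Formats such as PDF, HTML or RSS, and a compact on-disk layout, are a different amount of work, which is where a library starts to pay off.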

 
                Ferret is a high-performance, full-featured text search engine library written for Ruby



require 'ferret'
    include Ferret

    index = Index::Index.new()

    index << {:title => "1", :content => "it is what it is"}
    index << {:title => "2", :content => "what is it"}
    index << {:title => "3", :content => "it is a banana"}

    index.search_each('content:"banana"') do |id, score|
     puts "Score: #{score}, #{index[id][:title]} "
    end


    > Score: 1.0, 3




                                Hmmm? 




class Ferret::Analysis::Analyzer                                        class Ferret::Search::BooleanQuery 
   class Ferret::Analysis::AsciiLetterAnalyzer                             class Ferret::Search::ConstantScoreQuery 
   class Ferret::Analysis::AsciiLetterTokenizer                            class Ferret::Search::Explanation 
   class Ferret::Analysis::AsciiLowerCaseFilter                            class Ferret::Search::Filter 
   class Ferret::Analysis::AsciiStandardAnalyzer                           class Ferret::Search::FilteredQuery 
   class Ferret::Analysis::AsciiStandardTokenizer                          class Ferret::Search::FuzzyQuery 
   class Ferret::Analysis::AsciiWhiteSpaceAnalyzer                         class Ferret::Search::Hit 
   class Ferret::Analysis::AsciiWhiteSpaceTokenizer                        class Ferret::Search::MatchAllQuery 
   class Ferret::Analysis::HyphenFilter                                    class Ferret::Search::MultiSearcher 
   class Ferret::Analysis::LetterAnalyzer                                  class Ferret::Search::MultiTermQuery 
   class Ferret::Analysis::LetterTokenizer                                 class Ferret::Search::PhraseQuery 
   class Ferret::Analysis::LowerCaseFilter                                 class Ferret::Search::PrefixQuery 
   class Ferret::Analysis::MappingFilter                                   class Ferret::Search::Query 
   class Ferret::Analysis::PerFieldAnalyzer                                class Ferret::Search::QueryFilter 
   class Ferret::Analysis::RegExpAnalyzer                                  class Ferret::Search::RangeFilter 
   class Ferret::Analysis::RegExpTokenizer                                 class Ferret::Search::RangeQuery 
   class Ferret::Analysis::StandardAnalyzer                                class Ferret::Search::Searcher 
   class Ferret::Analysis::StandardTokenizer                               class Ferret::Search::Sort 
   class Ferret::Analysis::StemFilter                                      class Ferret::Search::SortField 
   class Ferret::Analysis::StopFilter                                      class Ferret::Search::TermQuery 
   class Ferret::Analysis::Token                                           class Ferret::Search::TopDocs 
   class Ferret::Analysis::TokenStream                                     class Ferret::Search::TypedRangeFilter 
   class Ferret::Analysis::WhiteSpaceAnalyzer                              class Ferret::Search::TypedRangeQuery 
   class Ferret::Analysis::WhiteSpaceTokenizer                             class Ferret::Search::WildcardQuery 



ferret.davebalmain.com/trac 




Ranking Results 
                                                                       0‐60 with PageRank… 




    index.search_each('content:"the brown cow"') do |id, score|
     puts "Score: #{score}, #{index[id][:title]} "
    end

    > Score: 0.827, 3
    > Score: 0.523, 5                                                   Relevance? 
    > Score: 0.125, 4

                              Doc 3                 Doc 5                Doc 4 
            the                 4                     3                    5 
          brown                 1                     3                    1 
            cow                 1                     4                    1 
          Score                 6                    10                    7 


                                                                Naïve: Term Frequency 
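The naïve score is just a sum of raw counts. A small sketch that reproduces the Score row from the term counts shown on the slide; `TERM_COUNTS` and `naive_score` are made-up names:

```ruby
# Term counts per document, copied from the slide.
TERM_COUNTS = {
  "3" => { "the" => 4, "brown" => 1, "cow" => 1 },
  "5" => { "the" => 3, "brown" => 3, "cow" => 4 },
  "4" => { "the" => 5, "brown" => 1, "cow" => 1 }
}

# Naive term frequency: add up how often each query word occurs.
def naive_score(counts, query)
  query.split(/\s+/).inject(0) { |sum, word| sum + counts.fetch(word, 0) }
end

scores = {}
TERM_COUNTS.each do |doc, counts|
  scores[doc] = naive_score(counts, "the brown cow")
end

# Highest score first.
ranking = scores.sort_by { |_doc, score| -score }.map { |doc, _score| doc }

p scores  # {"3"=>6, "5"=>10, "4"=>7}
p ranking # ["5", "4", "3"]
```

Doc 4 outranks doc 3 only because it repeats "the" one more time.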

          Skew: frequent words such as "the" inflate the naïve score. 

                              Doc 3                      Doc 5             Doc 4 
            the                 4                          3                 5                          Skew 
          brown                 1                          3                 1 
            cow                 1                          4                 1 

                                # of docs containing the term 
            the                 6 
          brown                 3 
            cow                 4 

      Total # of documents:    10

          Score = TF * IDF
          TF    = # occurrences / # words in document
          IDF   = ln(# docs / # docs with W)

                                                                                                       TF-IDF 
                                              Term Frequency * Inverse Document Frequency 


                              Doc 3                      Doc 5               Doc 4 
            the                 4                          3                   5 
          brown                 1                          3                   1 
            cow                 1                          4                   1 

                                # of docs containing the term 
            the                 6 
          brown                 3 
            cow                 4 

      Total # of documents:    10
      # words in document:     10

      Doc # 3 score for 'the':      4/10 * ln(10/6) = 0.204
      Doc # 3 score for 'brown':    1/10 * ln(10/3) = 0.120
      Doc # 3 score for 'cow':      1/10 * ln(10/4) = 0.092

      Score = 0.204 + 0.120 + 0.092 = 0.416                                     TF-IDF 
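The same arithmetic as a sketch in Ruby, using the natural log as in the worked example. The constant names and the `tf_idf` helper are made up for illustration:

```ruby
TOTAL_DOCS     = 10
WORDS_IN_DOC_3 = 10
DOCS_WITH_TERM = { "the" => 6, "brown" => 3, "cow" => 4 }
DOC_3_COUNTS   = { "the" => 4, "brown" => 1, "cow" => 1 }

# TF  = occurrences / words in the document
# IDF = ln(total docs / docs containing the word)
def tf_idf(occurrences, words_in_doc, total_docs, docs_with_word)
  tf  = occurrences.to_f / words_in_doc
  idf = Math.log(total_docs.to_f / docs_with_word)
  tf * idf
end

score = %w[the brown cow].inject(0.0) do |sum, word|
  sum + tf_idf(DOC_3_COUNTS[word], WORDS_IN_DOC_3, TOTAL_DOCS, DOCS_WITH_TERM[word])
end

puts score.round(3) # 0.416
```

"the" still contributes the most here, but only because it occurs four times; per occurrence it is worth less than half of what "brown" is.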

                      W1        W2    …           …             …        …        …             …      WK 

         Doc 1        15        23    … 
         Doc 2        24        12    … 
         …            …         …     … 
         … 
         Doc N 

         Size = N * K * size of Ruby object
                                                                                     Ouch. 
          Pages = N = 10,000
          Words = K = 2,000
          Ruby Object = 20+ bytes

          Footprint = 384 MB                                              Frequency Matrix 
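A back-of-the-envelope check of that footprint, plus what the same counts cost when packed as raw 2-byte integers. This is only a sketch: the 20 bytes per object is the slide's lower bound, and `Array#pack` stands in here for a real typed-array library:

```ruby
pages            = 10_000
words            = 2_000
bytes_per_object = 20 # lower bound from the slide ("20+ bytes")

cells       = pages * words
dense_bytes = cells * bytes_per_object # one Ruby object per cell

# Four counts stored as 2-byte signed integers instead of Ruby objects.
packed = [15, 23, 24, 12].pack("s*")

puts "cells:             #{cells}"
puts "as Ruby objects:   #{dense_bytes} bytes"
puts "as 2-byte ints:    #{cells * 2} bytes"
puts "4 packed counts:   #{packed.bytesize} bytes"
```

At exactly 20 bytes per object this comes to 400,000,000 bytes, roughly 380 MB and in the same ballpark as the figure above; 2-byte integers need a tenth of that.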

NArray is a numerical N-dimensional array class (implemented in C)  

       NArray.new(typecode, size, ...)    # create new NArray, initialized with 0
       NArray.byte(size, ...)             # 1-byte unsigned integer
       NArray.sint(size, ...)             # 2-byte signed integer
       NArray.int(size, ...)              # 4-byte signed integer
       NArray.sfloat(size, ...)           # single-precision float
       NArray.float(size, ...)            # double-precision float
       NArray.scomplex(size, ...)         # single-precision complex
       NArray.complex(size, ...)          # double-precision complex
       NArray.object(size, ...)           # Ruby object


                                                                                                NArray 
                                                                     http://narray.rubyforge.org/ 



Links as votes 




                                                                                          PageRank 
                Problem: link gaming                                                    the google juice 




          P = 0.85      Follow a link from the page he/she is currently on.  

          P = 0.15      Teleport to a random location on the web. 

                                                                                  Random Surfer 
                                                                                      powerful abstraction 




Follow a link from the page he/she is currently on.  
Teleport to a random location on the web. 

[Figure: pages N, M, and K] 
                                                                                                      Surfin’ 
                                                                           rinse & repeat, ad nauseam 




          On page P, clicks on link to K                P = 0.85 
          On page K, clicks on link to M                P = 0.85 
          On page M, teleports to X                     P = 0.15 
          … 
                                                                                                      Surfin’ 
                                                                           rinse & repeat, ad nauseam 
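The rinse-and-repeat loop can be simulated directly. A rough sketch on a made-up four-page web (the link structure in `SURF_LINKS` is an assumption, not the graph from the slides), counting how often the surfer lands on each page:

```ruby
# Hypothetical four-page web: page => pages it links to.
SURF_LINKS = {
  "N" => ["K"],
  "K" => ["M", "X"],
  "M" => ["K"],
  "X" => ["K"]
}

FOLLOW = 0.85 # probability of following a link; otherwise teleport

def surf(links, steps, rng)
  pages  = links.keys
  visits = Hash.new(0)
  page   = pages[rng.rand(pages.size)] # start on a random page

  steps.times do
    out = links[page]
    if out.empty? || rng.rand >= FOLLOW
      page = pages[rng.rand(pages.size)] # teleport anywhere
    else
      page = out[rng.rand(out.size)]     # follow a random out-link
    end
    visits[page] += 1
  end

  # Fraction of the time spent on each page.
  freq = {}
  visits.each { |name, count| freq[name] = count.to_f / steps }
  freq
end

freq = surf(SURF_LINKS, 100_000, Random.new(42))
freq.sort_by { |_name, share| -share }.each do |name, share|
  puts "#{name}: #{share.round(3)}"
end
```

Page K collects links from every other page, so the surfer spends the largest share of time there.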




[Figure: web graph of pages N, X, K, and M, annotated with probabilities 0.05, 0.20, 0.15, and 0.6] 

                                                        Analyzing the Web Graph 
                                                                                  extracting PageRank 




What is PageRank? 
                                                                                               It’s a scalar! 

                                                                        What is PageRank? 
                                                                                         it’s a probability! 

          Higher Pr, Higher Importance? 

Teleportation? 
                                                                                             sci‐fi fans, … ? 




          1. No in-links! 
          2. No out-links! 
          3. Isolated Web 

                                                   Reasons for teleportation 
                                                                       enumerating edge cases 



                                •  Breadth First Search 
                                •  Depth First Search 
                                •  A* Search  
                                •  Lexicographic Search  
                                •  Dijkstra’s Algorithm  
                                •  Floyd-Warshall  
                                •  Triangulation and Comparability detection  

require 'gratr/import'

dg = Digraph[1,2, 2,3, 2,4, 4,5, 6,4, 1,6]

dg.directed? # true
dg.vertex?(4) # true
dg.edge?(2,4) # true
dg.vertices # [5, 6, 1, 2, 3, 4]

Graph[1,2,1,3,1,4,2,5].bfs # [1, 2, 3, 4, 5]
Graph[1,2,1,3,1,4,2,5].dfs # [1, 2, 5, 3, 4]

                                                                        Exploring Graphs 
                                                                                gratr.rubyforge.com 
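For a feel of what the two traversals do, here is a plain-Ruby sketch over the same edge list. It is not how GRATR implements them; `adjacency`, `bfs` and `dfs` are stand-in helpers written for this example:

```ruby
# Undirected edges from the example: Graph[1,2, 1,3, 1,4, 2,5].
EDGES = [[1, 2], [1, 3], [1, 4], [2, 5]]

# vertex => neighbours, in insertion order.
def adjacency(edges)
  adj = Hash.new { |hash, vertex| hash[vertex] = [] }
  edges.each do |a, b|
    adj[a] << b
    adj[b] << a
  end
  adj
end

# Breadth first: visit all neighbours before going deeper.
def bfs(adj, start)
  seen  = [start]
  queue = [start]
  until queue.empty?
    adj[queue.shift].each do |vertex|
      next if seen.include?(vertex)
      seen  << vertex
      queue << vertex
    end
  end
  seen
end

# Depth first: follow each neighbour as far as it goes.
def dfs(adj, vertex, seen = [])
  seen << vertex
  adj[vertex].each { |nxt| dfs(adj, nxt, seen) unless seen.include?(nxt) }
  seen
end

adj = adjacency(EDGES)
p bfs(adj, 1) # [1, 2, 3, 4, 5]
p dfs(adj, 1) # [1, 2, 5, 3, 4]
```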



          P(T) = 0.15 / # of pages 

[Figure: web graph with every page labelled P(T) = 0.03] 

                                                                              Teleportation 
                                                                                                  probabilities 



    Assume the web is N pages big 
    Assume that the probability of teleportation (t) is 0.15, and of following a link (s) is 0.85 
    Assume that the teleportation probability (E) is uniform 
    Assume that you start on any random page (uniform distribution L) 

    Then after one step, the probability you're on page X is: 




                      PageRank: Simplified Mathematical Def’n 
                                                                    cause that’s how we roll 
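One way to write that step out, as a reconstruction from the definitions above rather than the formula from the slide itself. It assumes G is normalized so that column j holds the probabilities of following a link from page j to each other page:

```latex
\begin{align*}
P_1     &= s\,G\,L + t\,E            && \text{one step from the start distribution } L \\
P_{n+1} &= s\,G\,P_n + t\,E          && \text{rinse and repeat} \\
P       &= s\,G\,P + t\,E            && \text{steady state} \\
(I - sG)\,P &= t\,E \\
P       &= t\,(I - sG)^{-1}\,E       && \text{PageRank vector}
\end{align*}
```

Entry X of P is then the long-run probability of finding the surfer on page X, with s = 0.85 and t = 0.15.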



Link Graph 

                           1    2                …                 …          N 
              1            1    0                …                 …           0      <- no link from 1 to N 
              2            0    1                …                 …           1 
             …             …    …                …                 …          … 
             …             …    …                …                 …          … 
              N            0    1                …                 …           1 


                     Huge!                                          G = The Link Graph 
                                                                           ginormous and sparse 



                                # Page => Links to… 
                                {
                                    "1"      =>         [25, 26],
                                    "2"      =>         [1],
                                    "5"      =>         [123, 2],
                                    "6"      =>         [67, 1]
                                }



                                                                            G as a dictionary 
                                                                                            more compact… 
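Going from the dictionary back to a matrix is a short loop. A sketch on a tiny made-up graph; splitting each page's weight evenly over its out-links is an assumption of this example, and `DICT_LINKS` and `link_matrix` are hypothetical names:

```ruby
# Hypothetical link dictionary: page => pages it links to.
DICT_LINKS = {
  0 => [1, 2],
  1 => [2],
  2 => [0],
  3 => [2]
}

# Dense matrix where g[to][from] is the probability of moving from
# page `from` to page `to` by following one of its links.
def link_matrix(links, size)
  g = Array.new(size) { Array.new(size, 0.0) }
  links.each do |from, targets|
    targets.each { |to| g[to][from] = 1.0 / targets.size }
  end
  g
end

g = link_matrix(DICT_LINKS, 4)
g.each { |row| p row }

# How sparse is it? Count the non-zero cells.
non_zero = g.flatten.count { |cell| cell > 0 }
puts "#{non_zero} of #{g.flatten.size} cells are non-zero"
```

Even in this toy graph most cells are zero, which is why the dictionary form is the more compact one.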



Follow a link from the page he/she is currently on.  
Teleport to a random location on the web. 



                                                                           Computing PageRank 
                                                                                              the tedious way 
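The tedious way, sketched as repeated application of the two rules above: every page gets an equal teleportation share, then each page passes 0.85 of its current rank along its links. The web in `WEB` is made up, and the sketch assumes every page has at least one out-link:

```ruby
# Hypothetical four-page web: page => pages it links to.
WEB = {
  "N" => ["K"],
  "K" => ["M", "X"],
  "M" => ["K"],
  "X" => ["K"]
}

def pagerank_iterative(links, s = 0.85, rounds = 100)
  pages = links.keys
  t     = 1.0 - s

  rank = {}
  pages.each { |page| rank[page] = 1.0 / pages.size } # uniform start

  rounds.times do
    nxt = {}
    pages.each { |page| nxt[page] = t / pages.size }   # teleportation share
    links.each do |page, targets|
      share = s * rank[page] / targets.size            # split over out-links
      targets.each { |to| nxt[to] += share }
    end
    rank = nxt
  end

  rank
end

rank = pagerank_iterative(WEB)
rank.each { |page, score| puts "#{page}: #{score.round(4)}" }
```

Page N has no in-links, so all it ever receives is its teleportation share of 0.15 / 4.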



Don’t trust me! Verify it yourself! 




                                Identity matrix 




                                                                      Computing PageRank 
                                                                                               in one swoop 



Enough hand‐waving, dammit! 
                                                                                   show me the code 




Hot, Fast, Awesome 


                                                                   Birth of EM‐Proxy 
                                                                              flash of the obvious 




http://rb-gsl.rubyforge.org/ 




                                                                         Hot, Fast, Awesome 




                       Click there!  …  Give yourself a weekend.  


http://ruby-gsl.sourceforge.net/ 
                       Click there!  …  Give yourself a weekend.  


require "gsl"
include GSL

# INPUT: link structure matrix (NxN)
# OUTPUT: pagerank scores
def pagerank(g)
  # verify NxN
  raise ArgumentError, "link matrix must be NxN" if g.size1 != g.size2

  i = Matrix.I(g.size1)                         # identity matrix
  p = (1.0 / g.size1) * Matrix.ones(g.size1, 1) # teleportation vector

  # constants
  s = 0.85   # probability of following a link
  t = 1 - s  # probability of teleportation

  t * ((i - s * g).invert) * p  # PageRank!
end

PageRank in Ruby
6 lines, or less
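For readers without GSL installed, the same closed form can be checked against Ruby's standard-library `Matrix` class. This is a hedged sketch rather than the presenter's implementation: the name `pagerank_stdlib` and the keyword argument `s:` are illustrative, and the two matrices are the ones used on the example slides that follow.

```ruby
require "matrix"

# Same closed form as the GSL version, t * (I - s*G)^-1 * p,
# written against Ruby's standard-library Matrix class.
def pagerank_stdlib(g, s: 0.85)
  raise ArgumentError, "link matrix must be NxN" unless g.square?

  n = g.row_count
  i = Matrix.identity(n)                 # identity matrix
  p = Vector.elements([1.0 / n] * n)     # uniform teleportation vector
  t = 1 - s                              # probability of teleportation

  ((i - g * s).inverse * p) * t
end

circular  = Matrix[[0, 0, 1], [0, 0, 1], [1, 0, 0]]
all_roads = Matrix[[0, 0, 0], [0.5, 0, 0], [0.5, 1, 1]]

puts pagerank_stdlib(circular).to_a.inspect
puts pagerank_stdlib(all_roads).to_a.inspect
```

Inverting the matrix is fine for a handful of pages; the standard-library `Matrix` is pure Ruby, so it is a verification aid rather than a substitute for GSL on large graphs.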



[Figure: three-page graph with nodes X, N and K, each with P = 0.33]

pagerank(Matrix[[0,0,1], [0,0,1], [1,0,0]])
> [0.33, 0.33, 0.33]

Ex: Circular Web
testing intuition…




[Figure: three-page graph with nodes X, N and K; P = 0.05, P = 0.07, and P = 0.87 at K]

pagerank(Matrix[[0,0,0], [0.5,0,0], [0.5,1,1]])
> [0.05, 0.07, 0.87]

Ex: All roads lead to K
testing intuition…



PageRank + Ferret
awesome search, ftw!




require 'ferret'
include Ferret
index = Index::Index.new()

index << {:title => "1", :content => "it is what it is", :pr => 0.05 }
index << {:title => "2", :content => "what is it", :pr => 0.07 }
index << {:title => "3", :content => "it is a banana", :pr => 0.87 }

Store PageRank




# TF-IDF search
index.search_each('content:"world"') do |id, score|
  puts "Score: #{score}, #{index[id][:title]} (PR: #{index[id][:pr]})"
end

puts "*" * 50

# PageRank FTW!
sf_pr = Search::SortField.new(:pr, :type => :float, :reverse => true)

index.search_each('content:"world"', :sort => sf_pr) do |id, score|
  puts "Score: #{score}, #{index[id][:title]}, (PR: #{index[id][:pr]})"
end

#   Others:
#   Score: 0.267119228839874, 3 (PR: 0.87)
#   Score: 0.17807948589325, 1 (PR: 0.05)
#   Score: 0.17807948589325, 2 (PR: 0.07)
#   ***********************************
#   Google:
#   Score: 0.267119228839874, 3, (PR: 0.87)
#   Score: 0.17807948589325, 2, (PR: 0.07)
#   Score: 0.17807948589325, 1, (PR: 0.05)
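The difference between the two orderings can be reproduced without Ferret. This is a plain-Ruby sketch under stated assumptions: `Hit` is a hypothetical stand-in struct, not part of Ferret's API, and the scores and PageRank values are copied from the output above. The blended ordering at the end is one possible design and is not something the slides prescribe.

```ruby
# Hypothetical stand-in for search hits; not part of Ferret's API.
Hit = Struct.new(:title, :score, :pr)

hits = [
  Hit.new("1", 0.17807948589325, 0.05),
  Hit.new("2", 0.17807948589325, 0.07),
  Hit.new("3", 0.267119228839874, 0.87)
]

# Ordering by text relevance alone: hits with equal scores come out in
# whatever order the sort leaves them.
by_relevance = hits.sort_by { |h| -h.score }

# Ordering by stored PageRank, highest first (what a reversed sort on :pr asks for).
by_pagerank = hits.sort_by { |h| -h.pr }

# One possible blend: relevance first, PageRank as the tie-breaker.
blended = hits.sort_by { |h| [-h.score, -h.pr] }

puts by_pagerank.map(&:title).inspect
puts blended.map(&:title).inspect
```

With only three documents both PageRank-aware orderings agree; on a larger index the choice between a pure PageRank sort and a blend decides how much text relevance still counts.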



Search*: Graphs are ubiquitous! 
                                                  PageRank is a general purpose hammer 




Username               GitCred
==============================
37signals              10.00
imbriaco               9.76
why                    8.74
rails                  8.56
defunkt                8.17
technoweenie           7.83
jeresig                7.60
mojombo                7.51
yui                    7.34
drnic                  7.34
pjhyett                6.91
wycats                 6.85
dhh                    6.84

http://bit.ly/3YQPU

PageRank + Social Graph
GitHub




Hmm…

Analyze the social graph:
-  Filter messages by ‘TwitterRank’
-  Suggest users by ‘TwitterRank’
-  …

PageRank + Social Graph
Twitter




PageRank + Product Graph
E-commerce

Link items purchased in the same cart… Run PageRank on it.
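A minimal sketch of the cart idea, assuming made-up sample data: the carts, the item names, and the function name `product_rank` are all hypothetical. Items that share a cart are linked in both directions, weighted by how many carts they share, and scores come from a fixed number of power-iteration passes.

```ruby
# Carts are hypothetical sample data; item names are made up for illustration.
carts = [
  %w[camera tripod sd_card],
  %w[camera sd_card],
  %w[tripod lens],
  %w[camera lens sd_card]
]

# Link every pair of items that appear in the same cart (undirected, weighted
# by how many carts they share).
weights = Hash.new { |h, k| h[k] = Hash.new(0) }
carts.each do |cart|
  cart.uniq.combination(2) do |a, b|
    weights[a][b] += 1
    weights[b][a] += 1
  end
end

# Weighted PageRank by power iteration over the co-purchase graph.
def product_rank(weights, s: 0.85, iterations: 100)
  items = weights.keys
  n = items.size
  rank = items.to_h { |item| [item, 1.0 / n] }

  iterations.times do
    nxt = items.to_h { |item| [item, (1.0 - s) / n] }
    weights.each do |from, links|
      total = links.values.sum.to_f
      links.each { |to, w| nxt[to] += s * rank[from] * w / total }
    end
    rank = nxt
  end
  rank
end

product_rank(weights).sort_by { |_, r| -r }.each do |item, r|
  puts format("%-8s %.3f", item, r)
end
```

The sketch assumes every cart holds at least two distinct items, so no item is left without links; single-item carts would need the dangling-page handling that teleportation provides in the general case.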



PageRank = Powerful Hammer 
                                                                                            use it! 




Personalization
how would you do it?




Teleportation distribution doesn’t have to be uniform!

yahoo.com is my homepage!

PageRank + Personalization
customize the teleportation vector
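One way to sketch a customized teleportation vector, using Ruby's standard-library `Matrix` and the same closed form as the earlier code. The function name `personalized_pagerank`, its parameters, and the three-page cycle used as input are assumptions made for illustration; page 1 plays the role of the surfer's homepage.

```ruby
require "matrix"

# Closed-form PageRank with a caller-supplied teleportation vector.
# `teleport` must be non-negative and sum to 1; the name is illustrative.
def personalized_pagerank(g, teleport, s: 0.85)
  raise ArgumentError, "link matrix must be NxN" unless g.square?
  raise ArgumentError, "teleport vector must sum to 1" unless (teleport.sum - 1.0).abs < 1e-9

  i = Matrix.identity(g.row_count)
  p = Vector.elements(teleport)
  ((i - g * s).inverse * p) * (1 - s)
end

circular = Matrix[[0, 0, 1], [1, 0, 0], [0, 1, 0]]   # 1 -> 2 -> 3 -> 1

uniform  = personalized_pagerank(circular, [1.0 / 3, 1.0 / 3, 1.0 / 3])
homepage = personalized_pagerank(circular, [1.0, 0.0, 0.0])  # always teleport to page 1

puts uniform.to_a.map { |x| x.round(3) }.inspect
puts homepage.to_a.map { |x| x.round(3) }.inspect
```

With the uniform vector all three pages tie; concentrating teleportation on page 1 lifts page 1 and, to a lesser degree, the pages closest downstream of it.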




Make pages with links!

Gaming PageRank
for fun and profit (I don’t endorse it)

http://bit.ly/pagerank-spam




Slides: http://bit.ly/railsconf-pagerank

Ferret: http://bit.ly/ferret
RB-GSL: http://bit.ly/rb-gsl

PageRank on Wikipedia: http://bit.ly/wp-pagerank
Gaming PageRank: http://bit.ly/pagerank-spam

Michael Nielsen’s lectures on PageRank:
http://michaelnielsen.org/blog

Questions?

The slides…    Twitter    My blog



Building A Mini Google High Performance Computing In Ruby Presentation 1

  • 1. Building Mini‐Google in Ruby  Ilya Grigorik  @igrigorik  Building Mini‐Google in Ruby  h:p://bit.ly/railsconf‐pagerank  @igrigorik #railsconf 
  • 2. postrank.com/topic/ruby  The slides…  Twi+er  My blog  Building Mini‐Google in Ruby  h:p://bit.ly/railsconf‐pagerank  @igrigorik #railsconf 
  • 3. Ruby + Math  PageRank  OpDmizaDon  Misc Fun  Examples  Indexing  Building Mini‐Google in Ruby  h:p://bit.ly/railsconf‐pagerank  @igrigorik #railsconf 
  • 4. PageRank  PageRank + Ruby  Tools  +   Examples  Indexing  OpDmizaDon  Building Mini‐Google in Ruby  h:p://bit.ly/railsconf‐pagerank  @igrigorik #railsconf 
  • 5. Consume with care…  everything that follows is based on released / public domain info  Building Mini‐Google in Ruby  h:p://bit.ly/railsconf‐pagerank  @igrigorik #railsconf 
  • 6. Search‐engine graveyard  Google did pre9y well…  Building Mini‐Google in Ruby  h:p://bit.ly/railsconf‐pagerank  @igrigorik #railsconf 
  • 7. Query: Ruby  Results  1. Crawl  2. Index  3. Rank  Search pipeline  50,000‐foot view  Building Mini‐Google in Ruby  h:p://bit.ly/railsconf‐pagerank  @igrigorik #railsconf 
  • 8. Query: Ruby  Results  1. Crawl  2. Index  3. Rank  Bah  InteresDng  Fun  Building Mini‐Google in Ruby  h:p://bit.ly/railsconf‐pagerank  @igrigorik #railsconf 
  • 9. CPU Speed       333Mhz  RAM         32‐64MB  Index         27,000,000 documents  Index refresh      once a month~ish  PageRank computaCon  several days  Laptop CPU       2.1Ghz  VM RAM       1GB  1‐Million page web    ~10 minutes  circa 1997‐1998  Building Mini‐Google in Ruby  h:p://bit.ly/railsconf‐pagerank  @igrigorik #railsconf 
  • 10. CreaDng & Maintaining an Inverted Index   DIY and the gotchas within  Building Mini‐Google in Ruby  h:p://bit.ly/railsconf‐pagerank  @igrigorik #railsconf 
  • 11. require 'set' { pages = { "it"=>#<Set: {"1", "2", "3"}>, "1" => "it is what it is", "a"=>#<Set: {"3"}>, "2" => "what is it", "banana"=>#<Set: {"3"}>, "3" => "it is a banana" "what"=>#<Set: {"1", "2"}>, } "is"=>#<Set: {"1", "2", "3"}>} } index = {} pages.each do |page, content| content.split(/s/).each do |word| if index[word] index[word] << page else index[word] = Set.new(page) end end end Building an Inverted Index  Building Mini‐Google in Ruby  h:p://bit.ly/railsconf‐pagerank  @igrigorik #railsconf 
  • 12. require 'set' { pages = { "it"=>#<Set: {"1", "2", "3"}>, "1" => "it is what it is", "a"=>#<Set: {"3"}>, "2" => "what is it", "banana"=>#<Set: {"3"}>, "3" => "it is a banana" "what"=>#<Set: {"1", "2"}>, } "is"=>#<Set: {"1", "2", "3"}>} } index = {} pages.each do |page, content| content.split(/s/).each do |word| if index[word] index[word] << page else index[word] = Set.new(page) end end end Building an Inverted Index  Building Mini‐Google in Ruby  h:p://bit.ly/railsconf‐pagerank  @igrigorik #railsconf 
  • 13. require 'set' { pages = { "it"=>#<Set: {"1", "2", "3"}>, "1" => "it is what it is", "a"=>#<Set: {"3"}>, "2" => "what is it", "banana"=>#<Set: {"3"}>, "3" => "it is a banana" "what"=>#<Set: {"1", "2"}>, } "is"=>#<Set: {"1", "2", "3"}>} } index = {} pages.each do |page, content| Word => [Document]  content.split(/s/).each do |word| if index[word] index[word] << page else index[word] = Set.new(page) end end end Building an Inverted Index  Building Mini‐Google in Ruby  h:p://bit.ly/railsconf‐pagerank  @igrigorik #railsconf 
  • 14. # query: "what is banana" p index["what"] & index["is"] & index["banana"] # > #<Set: {}> # query: "a banana" p index["a"] & index["banana"] # > #<Set: {"3"}> # query: "what is" 1  2  3  p index["what"] & index["is"] # > #<Set: {"1", "2"}> { "it"=>#<Set: {"1", "2", "3"}>, "a"=>#<Set: {"3"}>, "banana"=>#<Set: {"3"}>, "what"=>#<Set: {"1", "2"}>, "is"=>#<Set: {"1", "2", "3"}>} } Querying the index  Building Mini‐Google in Ruby  h:p://bit.ly/railsconf‐pagerank  @igrigorik #railsconf 
  • 15. # query: "what is banana" p index["what"] & index["is"] & index["banana"] # > #<Set: {}> # query: "a banana" p index["a"] & index["banana"] # > #<Set: {"3"}> # query: "what is" 1  2  3  p index["what"] & index["is"] # > #<Set: {"1", "2"}> { "it"=>#<Set: {"1", "2", "3"}>, "a"=>#<Set: {"3"}>, "banana"=>#<Set: {"3"}>, "what"=>#<Set: {"1", "2"}>, "is"=>#<Set: {"1", "2", "3"}>} } Querying the index  Building Mini‐Google in Ruby  h:p://bit.ly/railsconf‐pagerank  @igrigorik #railsconf 
  • 16. # query: "what is banana" p index["what"] & index["is"] & index["banana"] # > #<Set: {}> # query: "a banana" p index["a"] & index["banana"] # > #<Set: {"3"}> # query: "what is" 1  2  3  p index["what"] & index["is"] # > #<Set: {"1", "2"}> { "it"=>#<Set: {"1", "2", "3"}>, "a"=>#<Set: {"3"}>, "banana"=>#<Set: {"3"}>, "what"=>#<Set: {"1", "2"}>, "is"=>#<Set: {"1", "2", "3"}>} } Querying the index  Building Mini‐Google in Ruby  h:p://bit.ly/railsconf‐pagerank  @igrigorik #railsconf 
  • 17. # query: "what is banana" p index["what"] & index["is"] & index["banana"] # > #<Set: {}> # query: "a banana" p index["a"] & index["banana"] # > #<Set: {"3"}> What order?  # query: "what is" p index["what"] & index["is"] # > #<Set: {"1", "2"}> [1, 2] or [2,1]   { "it"=>#<Set: {"1", "2", "3"}>, "a"=>#<Set: {"3"}>, "banana"=>#<Set: {"3"}>, "what"=>#<Set: {"1", "2"}>, "is"=>#<Set: {"1", "2", "3"}>} } Querying the index  Building Mini‐Google in Ruby  h:p://bit.ly/railsconf‐pagerank  @igrigorik #railsconf 
  • 18. require 'set' pages = { "1" => "it is what it is", "2" => "what is it", "3" => "it is a banana" } index = {} PDF, HTML, RSS?  Lowercase / Upcase?  pages.each do |page, content| Compact Index?  Hmmm?  content.split(/s/).each do |word| Stop words?  if index[word] Persistence?  index[word] << page else index[word] = Set.new(page) end end end Building an Inverted Index  Building Mini‐Google in Ruby  h:p://bit.ly/railsconf‐pagerank  @igrigorik #railsconf 
  • 19. Building Mini‐Google in Ruby  h:p://bit.ly/railsconf‐pagerank  @igrigorik #railsconf 
  • 20.   Ferret is a high‐performance, full‐featured text search engine library wri9en for Ruby Building Mini‐Google in Ruby  h:p://bit.ly/railsconf‐pagerank  @igrigorik #railsconf 
  • 21. require 'ferret' include Ferret index = Index::Index.new() index << {:title => "1", :content => "it is what it is"} index << {:title => "2", :content => "what is it"} index << {:title => "3", :content => "it is a banana"} index.search_each('content:"banana"') do |id, score| puts "Score: #{score}, #{index[id][:title]} " end > Score: 1.0, 3 Building Mini‐Google in Ruby  h:p://bit.ly/railsconf‐pagerank  @igrigorik #railsconf 
  • 22. require 'ferret' include Ferret index = Index::Index.new() index << {:title => "1", :content => "it is what it is"} index << {:title => "2", :content => "what is it"} index << {:title => "3", :content => "it is a banana"} index.search_each('content:"banana"') do |id, score| puts "Score: #{score}, #{index[id][:title]} " end > Score: 1.0, 3 Hmmm?  Building Mini‐Google in Ruby  h:p://bit.ly/railsconf‐pagerank  @igrigorik #railsconf 
  • 23. class Ferret::Analysis::Analyzer  class Ferret::Search::BooleanQuery  class Ferret::Analysis::AsciiLe+erAnalyzer  class Ferret::Search::ConstantScoreQuery  class Ferret::Analysis::AsciiLe+erTokenizer  class Ferret::Search::ExplanaCon  class Ferret::Analysis::AsciiLowerCaseFilter  class Ferret::Search::Filter  class Ferret::Analysis::AsciiStandardAnalyzer  class Ferret::Search::FilteredQuery  class Ferret::Analysis::AsciiStandardTokenizer  class Ferret::Search::FuzzyQuery  class Ferret::Analysis::AsciiWhiteSpaceAnalyzer  class Ferret::Search::Hit  class Ferret::Analysis::AsciiWhiteSpaceTokenizer  class Ferret::Search::MatchAllQuery  class Ferret::Analysis::HyphenFilter  class Ferret::Search::MulCSearcher  class Ferret::Analysis::Le+erAnalyzer  class Ferret::Search::MulCTermQuery  class Ferret::Analysis::Le+erTokenizer  class Ferret::Search::PhraseQuery  class Ferret::Analysis::LowerCaseFilter  class Ferret::Search::PrefixQuery  class Ferret::Analysis::MappingFilter  class Ferret::Search::Query  class Ferret::Analysis::PerFieldAnalyzer  class Ferret::Search::QueryFilter  class Ferret::Analysis::RegExpAnalyzer  class Ferret::Search::RangeFilter  class Ferret::Analysis::RegExpTokenizer  class Ferret::Search::RangeQuery  class Ferret::Analysis::StandardAnalyzer  class Ferret::Search::Searcher  class Ferret::Analysis::StandardTokenizer  class Ferret::Search::Sort  class Ferret::Analysis::StemFilter  class Ferret::Search::SortField  class Ferret::Analysis::StopFilter  class Ferret::Search::TermQuery  class Ferret::Analysis::Token  class Ferret::Search::TopDocs  class Ferret::Analysis::TokenStream  class Ferret::Search::TypedRangeFilter  class Ferret::Analysis::WhiteSpaceAnalyzer  class Ferret::Search::TypedRangeQuery  class Ferret::Analysis::WhiteSpaceTokenizer class Ferret::Search::WildcardQuery  Building Mini‐Google in Ruby  h:p://bit.ly/railsconf‐pagerank  @igrigorik #railsconf 
  • 24. ferret.davebalmain.com/trac  Building Mini‐Google in Ruby  h:p://bit.ly/railsconf‐pagerank  @igrigorik #railsconf 
  • 25. Ranking Results  0‐60 with PageRank…  Building Mini‐Google in Ruby  h:p://bit.ly/railsconf‐pagerank  @igrigorik #railsconf 
  • 26. index.search_each('content:"the brown cow"') do |id, score| puts "Score: #{score}, #{index[id][:title]} " end > Score: 0.827, 3 > Score: 0.523, 5 Relevance?  > Score: 0.125, 4 3  5  4  the  4  3  5  brown  1  3  1  cow  1  4  1  Score  6  10  7  Naïve: Term Frequency  Building Mini‐Google in Ruby  h:p://bit.ly/railsconf‐pagerank  @igrigorik #railsconf 
  • 27. index.search_each('content:"the brown cow"') do |id, score| puts "Score: #{score}, #{index[id][:title]} " end > Score: 0.827, 3 > Score: 0.523, 5 > Score: 0.125, 4 3  5  4  the  4  3  5  Skew  brown  1  3  1  cow  1  4  1  Score  6  10  7  Naïve: Term Frequency  Building Mini‐Google in Ruby  h:p://bit.ly/railsconf‐pagerank  @igrigorik #railsconf 
  • 28. 5  4  the  4  3  5  brown  1  3  1  Skew  cow  1  4  1  # of docs  Score = TF * IDF the  6  brown  3  TF = # occurrences / # words IDF = # docs / # docs with W cow  4  Total # of documents: 10 TF‐IDF  Term Frequency * Inverse Document Frequency  Building Mini‐Google in Ruby  h:p://bit.ly/railsconf‐pagerank  @igrigorik #railsconf 
  • 29. 5  4  the  4  3  5  brown  1  3  1  cow  1  4  1  # of docs  Doc # 3 score for ‘the’: 4/10 * ln(10/6) = 0.204 the  6  brown  3  Doc # 3 score for ‘brown’: 1/10 * ln(10/3) = 0.120 cow  4  Doc # 3 score for ‘cow’: 1/10 * ln(10/4) = 0.092 Total # of documents: 10 # words in document: 10 Score = 0.204 + 0.120 + 0.092 = 0.416  TF‐IDF  Building Mini‐Google in Ruby  h:p://bit.ly/railsconf‐pagerank  @igrigorik #railsconf 
  • 30. W1  W2  …  …  …  …  …  …  WN  Doc 1  15  23  …  Doc 2  24  12  …  …  …  …  …  …  Doc K  Size = N * K * size of Ruby object Ouch.  Pages = N = 10,000 Words = K = 2,000 Ruby Object = 20+ bytes Footprint = 384 MB Frequency Matrix  Building Mini‐Google in Ruby  h:p://bit.ly/railsconf‐pagerank  @igrigorik #railsconf 
  • 31. NArray is an Numerical N‐dimensional Array class (implemented in C)   # create new NArray. initialize with 0. NArray.new(typecode, size, ...) # 1 byte unsigned integer NArray.byte(size,...) # 2 byte signed integer NArray.sint(size,...) # 4 byte signed integer NArray.int(size,...) # single precision float NArray.sfloat(size,...) # double precision float NArray.float(size,...) # single precision complex NArray.scomplex(size,...) # double precision complex NArray.complex(size,...) # Ruby object NArray.object(size,...) NArray  h9p://narray.rubyforge.org/  Building Mini‐Google in Ruby  h:p://bit.ly/railsconf‐pagerank  @igrigorik #railsconf 
  • 32. NArray is an Numerical N‐dimensional Array class (implemented in C)   NArray  h9p://narray.rubyforge.org/  Building Mini‐Google in Ruby  h:p://bit.ly/railsconf‐pagerank  @igrigorik #railsconf 
  • 33. Links as votes  PageRank  Problem: link gaming  the google juice  Building Mini‐Google in Ruby  h:p://bit.ly/railsconf‐pagerank  @igrigorik #railsconf 
  • 34. P = 0.85  Follow link from page he/she is currently on.   Teleport to a random locaGon on the web.  P = 0.15  Random Surfer  powerful abstracJon  Building Mini‐Google in Ruby  h:p://bit.ly/railsconf‐pagerank  @igrigorik #railsconf 
  • 35. Follow link from page he/she is currently on.   Page K  Teleport to a random locaGon on the web.  Page N  Page M  Surfin’  rinse & repeat, ad naseum  Building Mini‐Google in Ruby  h:p://bit.ly/railsconf‐pagerank  @igrigorik #railsconf 
  • 36. On Page P, clicks on link to K  P = 0.85  On Page K clicks on link to M  P = 0.85  On Page M teleports to X  P = 0.15  …  Surfin’  rinse & repeat, ad naseum  Building Mini‐Google in Ruby  h:p://bit.ly/railsconf‐pagerank  @igrigorik #railsconf 
  • 37. P = 0.05  P = 0.20  X  N  P = 0.15  K  M P = 0.6  Analyzing the Web Graph  extracJng PageRank  Building Mini‐Google in Ruby  h:p://bit.ly/railsconf‐pagerank  @igrigorik #railsconf 
  • 38. What is PageRank?  It’s a scalar!  Building Mini‐Google in Ruby  h:p://bit.ly/railsconf‐pagerank  @igrigorik #railsconf 
  • 39. P = 0.05  P = 0.20  X  N  P = 0.15  K  M P = 0.6  What is PageRank?  it’s a probability!  Building Mini‐Google in Ruby  h:p://bit.ly/railsconf‐pagerank  @igrigorik #railsconf 
  • 40. P = 0.05  P = 0.20  X  N  P = 0.15  K  M P = 0.6  What is PageRank?  Higher Pr, Higher Importance?  it’s a probability!  Building Mini‐Google in Ruby  h:p://bit.ly/railsconf‐pagerank  @igrigorik #railsconf 
  • 41. TeleportaDon?  sci‐fi fans, … ?  Building Mini‐Google in Ruby  h:p://bit.ly/railsconf‐pagerank  @igrigorik #railsconf 
  • 42. 1. No in‐links!  3. Isolated Web  X  N  K  2. No out‐links!  M M Reasons for teleportaDon  enumeraJng edge cases  Building Mini‐Google in Ruby  h:p://bit.ly/railsconf‐pagerank  @igrigorik #railsconf 
  • 43. •  readth First Search  B •  epth First Search  D •  * Search   A •  exicographic Search   L •  ijkstra’s Algorithm   D •  loyd‐Warshall   F •  riangulaCon and Comparability detecCon   T require 'gratr/import' dg = Digraph[1,2, 2,3, 2,4, 4,5, 6,4, 1,6] dg.directed? # true dg.vertex?(4) # true dg.edge?(2,4) # true dg.vertices # [5, 6, 1, 2, 3, 4] Exploring Graphs  Graph[1,2,1,3,1,4,2,5].bfs # [1, 2, 3, 4, 5] gratr.rubyforge.com  Graph[1,2,1,3,1,4,2,5].dfs # [1, 2, 5, 3, 4] Building Mini‐Google in Ruby  h:p://bit.ly/railsconf‐pagerank  @igrigorik #railsconf 
  • 44. P(T) = 0.15 / # of pages  [Figure: nodes X, N, K, M, M with teleportation probabilities P(T) = 0.03]  Teleportation  probabilities
  • 45. PageRank: Simplified Mathematical Def’n  cause that’s how we roll  Assume the web is N pages big.  Assume that the probability of teleportation (t) is 0.15, and of following a link (s) is 0.85.  Assume that the teleportation probability (E) is uniform.  Assume that you start on any random page (uniform distribution L).  Then after one step, the probability you’re on page X is:
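The formula itself did not survive in the transcript. One reading that is consistent with the assumptions listed on the slide and with the Ruby code later in the deck is sketched below. It assumes G is the column-normalized link matrix, so that G[X][Y] is 1/out(Y) when page Y links to page X and 0 otherwise; the symbol out(Y), the number of out-links on page Y, is introduced here for illustration.

```latex
% One step of the random surfer, starting from the uniform distribution L:
P_1 = s\,G\,L + t\,E,
\qquad L = E = \tfrac{1}{N}\,\mathbf{1},
\quad s = 0.85,\ t = 0.15

% The same statement for a single page X:
P_1(X) = s \sum_{Y \to X} \frac{L(Y)}{\mathrm{out}(Y)} + t\,E(X)
       = 0.85 \sum_{Y \to X} \frac{1/N}{\mathrm{out}(Y)} + \frac{0.15}{N}
```

The first term is the chance of arriving by following a link, the second the chance of arriving by teleportation.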
  • 46. G = The Link Graph  ginormous and sparse  Huge!
        Link Graph
               1    2    …    …    N
          1    1    0    …    …    0     (No link from 1 to N)
          2    0    1    …    …    1
          …    …    …    …    …    …
          …    …    …    …    …    …
          N    0    1    …    …    1
  • 47. G as a dictionary  more compact…  (keys: Page; values: Links to…)
        {
          "1" => [25, 26],
          "2" => [1],
          "5" => [123, 2],
          "6" => [67, 1]
        }
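To connect the compact dictionary back to the matrix form, here is a minimal sketch that expands a small link dictionary into a column-normalized matrix. The three-page dictionary and the helper name `link_matrix` are hypothetical; the orientation (rows are destinations, columns are sources) is an assumption chosen to match the matrix examples later in the deck.

```ruby
# Hypothetical link dictionary in the same shape as the slide:
# page id (string) => ids of the pages it links to.
links = {
  "1" => [2, 3],
  "2" => [3],
  "3" => [1]
}

# Build G as an array of rows, where g[to][from] is the probability of
# moving from page `from` to page `to` when following a link.
def link_matrix(links)
  ids   = (links.keys.map(&:to_i) + links.values.flatten).uniq.sort
  index = ids.each_with_index.to_h # page id => row/column position
  n     = ids.size
  g     = Array.new(n) { Array.new(n, 0.0) }

  links.each do |from, targets|
    next if targets.empty? # a page with no out-links leaves its column at 0
    weight = 1.0 / targets.size
    targets.each { |to| g[index[to]][index[from.to_i]] += weight }
  end
  g
end

g = link_matrix(links)
g.each { |row| puts row.map { |v| format("%.2f", v) }.join(" ") }
```

For a web-sized graph the dense matrix is exactly what the dictionary avoids; the expansion is only practical for toy examples like this one.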
  • 48. Follow a link from the page he/she is currently on.  Teleport to a random location on the web.  [Figure: Page K]  Computing PageRank  the tedious way
  • 49. Don’t trust me! Verify it yourself!  Identity matrix  Computing PageRank  in one swoop
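The "one swoop" formula is not captured in the transcript, but it can be reconstructed from the Ruby code later in the deck, which computes `t*((i-s*g).invert)*p`. A sketch of the derivation, assuming G is the column-normalized link matrix, E the teleportation vector, I the identity matrix, s = 0.85 and t = 1 - s:

```latex
% Repeating the surfer's step gives the iteration
P_{k+1} = s\,G\,P_k + t\,E

% At the stationary distribution, P_{k+1} = P_k = P:
P = s\,G\,P + t\,E
\;\Longrightarrow\; (I - s\,G)\,P = t\,E
\;\Longrightarrow\; P = t\,(I - s\,G)^{-1} E
```

The inverse exists because every column of G sums to at most 1, so the eigenvalues of sG have magnitude at most s < 1.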
  • 50. Enough hand-waving, dammit!  show me the code
  • 51. Hot, Fast, Awesome  Birth of EM-Proxy  flash of the obvious
  • 52. http://rb-gsl.rubyforge.org/  Hot, Fast, Awesome  Click there!  …  Give yourself a weekend.
  • 53. http://ruby-gsl.sourceforge.net/  Click there!  …  Give yourself a weekend.
  • 54. PageRank in Ruby  6 lines, or less
        require "gsl"
        include GSL

        # INPUT: link structure matrix (NxN)
        # OUTPUT: pagerank scores
        def pagerank(g)
          raise if g.size1 != g.size2                   # verify NxN
          i = Matrix.I(g.size1)                         # identity matrix
          p = (1.0/g.size1) * Matrix.ones(g.size1,1)    # teleportation vector
          s = 0.85                                      # probability of following a link
          t = 1-s                                       # probability of teleportation
          t*((i-s*g).invert)*p
        end
  • 55. PageRank in Ruby  6 lines, or less  Constants…  (same code as slide 54)
  • 56. PageRank in Ruby  6 lines, or less  PageRank!  (same code as slide 54)
  • 57. Ex: Circular Web  testing intuition…  [Figure: X: P = 0.33, N: P = 0.33, K: P = 0.33]
        pagerank(Matrix[[0,0,1], [0,0,1], [1,0,0]])
        > [0.33, 0.33, 0.33]
  • 58. Ex: All roads lead to K  testing intuition…  [Figure: X: P = 0.05, N: P = 0.07, K: P = 0.87]
        pagerank(Matrix[[0,0,0], [0.5,0,0], [0.5,1,1]])
        > [0.05, 0.07, 0.87]
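The matrix inverse is convenient for three pages but does not scale to a large web. A common alternative, not shown in the slides, is power iteration: apply the surfer's step repeatedly until the ranks stop changing. The sketch below uses plain Ruby arrays so it needs no gems; the function name `pagerank_iter` and its keyword arguments are hypothetical.

```ruby
# Power iteration: repeat rank <- s*G*rank + t*E until rank stops changing.
# g is an array of rows with g[to][from] = probability of following a link
# from page `from` to page `to` (each column sums to 1).
def pagerank_iter(g, s: 0.85, tolerance: 1e-10, max_iter: 1_000)
  n = g.size
  raise ArgumentError, "g must be NxN" unless g.all? { |row| row.size == n }

  t    = 1.0 - s
  e    = Array.new(n, 1.0 / n) # uniform teleportation vector
  rank = e.dup                 # start from the uniform distribution

  max_iter.times do
    nxt = Array.new(n) do |i|
      s * g[i].each_with_index.sum { |gij, j| gij * rank[j] } + t * e[i]
    end
    delta = nxt.zip(rank).sum { |a, b| (a - b).abs }
    rank  = nxt
    break if delta < tolerance
  end
  rank
end

all_roads = [[0, 0, 0], [0.5, 0, 0], [0.5, 1, 1]]
puts pagerank_iter(all_roads).map { |v| v.round(3) }.inspect
```

On the "all roads lead to K" matrix this converges to roughly 0.05, 0.071 and 0.879, in line with the values on the slide.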
  • 59. PageRank + Ferret  awesome search, ftw!
  • 60. Store PageRank  [Figure: pages 1, 2, 3 with P = 0.05, P = 0.07, P = 0.87]
        require 'ferret'
        include Ferret

        index = Index::Index.new()

        index << {:title => "1", :content => "it is what it is", :pr => 0.05 }
        index << {:title => "2", :content => "what is it", :pr => 0.07 }
        index << {:title => "3", :content => "it is a banana", :pr => 0.87 }
  • 61. TF-IDF Search
        index.search_each('content:"world"') do |id, score|
          puts "Score: #{score}, #{index[id][:title]} (PR: #{index[id][:pr]})"
        end

        puts "*" * 50

        sf_pr = Search::SortField.new(:pr, :type => :float, :reverse => true)
        index.search_each('content:"world"', :sort => sf_pr) do |id, score|
          puts "Score: #{score}, #{index[id][:title]}, (PR: #{index[id][:pr]})"
        end

        # Score: 0.267119228839874, 3 (PR: 0.87)
        # Score: 0.17807948589325, 1 (PR: 0.05)
        # Score: 0.17807948589325, 2 (PR: 0.07)
        # ***********************************
        # Score: 0.267119228839874, 3, (PR: 0.87)
        # Score: 0.17807948589325, 2, (PR: 0.07)
        # Score: 0.17807948589325, 1, (PR: 0.05)
  • 62. PageRank FTW!  (same code and output as slide 61)
  • 63. Others (first result list, TF-IDF order)  Google (second result list, sorted by PageRank)  (same code and output as slide 61)
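The effect of the PageRank sort can be reproduced without an index. This is a plain-Ruby sketch, not Ferret's API: the `Hit` struct is hypothetical, and the scores and PageRank values are copied from the output on the slide.

```ruby
# Hypothetical search hits, using the scores and PageRank values
# printed on the slide.
Hit = Struct.new(:title, :score, :pr)

hits = [
  Hit.new("3", 0.267119228839874, 0.87),
  Hit.new("1", 0.17807948589325, 0.05),
  Hit.new("2", 0.17807948589325, 0.07)
]

# Sort by PageRank alone, highest first (what the :pr sort field does here).
by_pr = hits.sort_by { |h| -h.pr }

# Alternative: keep text relevance first and use PageRank only to break ties.
by_score_then_pr = hits.sort_by { |h| [-h.score, -h.pr] }

puts by_pr.map(&:title).inspect            # => ["3", "2", "1"]
puts by_score_then_pr.map(&:title).inspect # => ["3", "2", "1"]
```

On this tiny result set both orderings agree; they diverge once a low-relevance page carries a high PageRank, which is where the choice of blend starts to matter.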
  • 64. Search*: Graphs are ubiquitous!  PageRank is a general purpose hammer
  • 65. PageRank + Social Graph  GitHub  http://bit.ly/3YQPU
        Username       GitCred
        ==============================
        37signals      10.00
        imbriaco        9.76
        why             8.74
        rails           8.56
        defunkt         8.17
        technoweenie    7.83
        jeresig         7.60
        mojombo         7.51
        yui             7.34
        drnic           7.34
        pjhyett         6.91
        wycats          6.85
        dhh             6.84
  • 66. Hmm…  Analyze the social graph:  - Filter messages by ‘TwitterRank’  - Suggest users by ‘TwitterRank’  - …  PageRank + Social Graph  Twitter
  • 67. PageRank + Product Graph  E-commerce  Link items purchased in same cart… Run PR on it.
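The "link items purchased in same cart" step can be sketched as follows. This is an illustration under stated assumptions: the carts and item names are made up, the helper names are hypothetical, and weighting an edge by how often two items share a cart is one possible choice rather than something the slide specifies.

```ruby
# Hypothetical shopping carts; item names are made up for illustration.
carts = [
  %w[book lamp],
  %w[book pen lamp],
  %w[pen ink]
]

# Link every pair of items that share a cart (undirected, weighted by
# the number of carts the pair appears in).
def co_purchase_graph(carts)
  graph = Hash.new { |h, k| h[k] = Hash.new(0) }
  carts.each do |cart|
    cart.uniq.combination(2) do |a, b|
      graph[a][b] += 1
      graph[b][a] += 1
    end
  end
  graph
end

# Normalize each item's edge weights so they sum to 1, giving the
# "follow a link" probabilities a PageRank computation would use.
def transition_probabilities(graph)
  graph.transform_values do |neighbours|
    total = neighbours.values.sum.to_f
    neighbours.transform_values { |weight| weight / total }
  end
end

graph = co_purchase_graph(carts)
probs = transition_probabilities(graph)
probs.each { |item, nbrs| puts "#{item}: #{nbrs.inspect}" }
```

The resulting probabilities are the input a PageRank routine needs; items that co-occur with many popular items end up ranked highest.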
  • 68. PageRank = Powerful Hammer  use it!
  • 69. Personalization  how would you do it?
  • 70. Teleportation distribution doesn’t have to be uniform!  yahoo.com is my homepage!  PageRank + Personalization  customize the teleportation vector
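A non-uniform teleportation vector can be sketched with a few lines of plain Ruby. This is an illustrative sketch rather than code from the talk: the function name `personalized_pagerank`, the three-page cycle, and the choice of page 0 as the surfer's homepage are all assumptions.

```ruby
# PageRank by repeated application of rank <- s*G*rank + (1-s)*teleport,
# where `teleport` is any probability distribution over pages.
# g[to][from] is the probability of following a link from `from` to `to`.
def personalized_pagerank(g, teleport, s: 0.85, iterations: 200)
  n = g.size
  unless teleport.size == n && (teleport.sum - 1.0).abs < 1e-9
    raise ArgumentError, "teleport must be a distribution over all pages"
  end

  rank = teleport.dup
  iterations.times do
    rank = Array.new(n) do |i|
      s * g[i].each_with_index.sum { |gij, j| gij * rank[j] } + (1.0 - s) * teleport[i]
    end
  end
  rank
end

# Hypothetical three-page cycle: 0 -> 1 -> 2 -> 0.
cycle = [[0, 0, 1], [1, 0, 0], [0, 1, 0]]

uniform  = personalized_pagerank(cycle, [1.0 / 3, 1.0 / 3, 1.0 / 3])
homepage = personalized_pagerank(cycle, [1.0, 0.0, 0.0]) # always teleport to page 0

puts uniform.map { |v| v.round(3) }.inspect
puts homepage.map { |v| v.round(3) }.inspect
```

With uniform teleportation the cycle ranks every page equally; teleporting only to page 0 tilts the ranks toward that page and the pages shortly after it on the cycle.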
  • 71. Make pages with links!  Gaming PageRank  for fun and profit (I don’t endorse it)  http://bit.ly/pagerank-spam
  • 72. Questions?
        Slides: http://bit.ly/railsconf-pagerank
        Ferret: http://bit.ly/ferret
        RB-GSL: http://bit.ly/rb-gsl
        PageRank on Wikipedia: http://bit.ly/wp-pagerank
        Gaming PageRank: http://bit.ly/pagerank-spam
        Michael Nielsen’s lectures on PageRank: http://michaelnielsen.org/blog
        The slides…  Twitter  My blog