INCORPORATING PROBABILISTIC RETRIEVAL KNOWLEDGE INTO TFIDF-BASED SEARCH ENGINE
Alex Lin
Senior Architect
Intelligent Mining
alin at IntelligentMinining.com
Overview of Retrieval Models
  Boolean Retrieval
  Vector Space Model

  Probabilistic Model

  Language Model
Boolean Retrieval
  Example query: lincoln AND NOT (car AND automobile)
  The earliest model and still in use today

  The result is very easy to explain to users

  Highly efficient computationally

  The major drawback: it lacks a sophisticated ranking algorithm.
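
A minimal sketch of how the example query could be evaluated with set operations on an inverted index. The toy collection and the helper names (`DOCS`, `build_index`, `boolean_query`) are hypothetical and not from the slides. Note that the result is an unordered set, which illustrates the lack of ranking.

```python
from typing import Dict, Set

# Hypothetical toy collection (doc id -> text); not taken from the slides.
DOCS: Dict[int, str] = {
    1: "lincoln biography president",
    2: "lincoln car dealership",
    3: "lincoln town car automobile review",
    4: "ford automobile history",
}


def build_index(docs: Dict[int, str]) -> Dict[str, Set[int]]:
    """Map each term to the set of doc ids that contain it."""
    index: Dict[str, Set[int]] = {}
    for doc_id, text in docs.items():
        for term in text.split():
            index.setdefault(term, set()).add(doc_id)
    return index


def boolean_query(index: Dict[str, Set[int]]) -> Set[int]:
    """Evaluate: lincoln AND NOT (car AND automobile)."""
    lincoln = index.get("lincoln", set())
    car = index.get("car", set())
    automobile = index.get("automobile", set())
    # AND is set intersection; AND NOT is set difference.
    return lincoln - (car & automobile)


INDEX = build_index(DOCS)
RESULT = boolean_query(INDEX)
print(sorted(RESULT))
```

Every matching document is returned with equal standing; nothing in the model says which match is better.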
Vector Space Model
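  [Figure: Doc1, Doc2, and the Query drawn as vectors in term space; visible axis labels: Term2, Term3]

  $$\mathrm{Cos}(D_i, Q) = \frac{\sum_{j=1}^{t} d_{ij}\, q_j}{\sqrt{\sum_{j=1}^{t} d_{ij}^{2} \cdot \sum_{j=1}^{t} q_j^{2}}}$$

  Major flaw: it lacks guidance on the details of how weighting and
  ranking algorithms are related to relevance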
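
The cosine measure can be sketched in a few lines, assuming documents and the query are already represented as equal-length lists of term weights. The function name `cosine` and the zero-vector convention are assumptions made for illustration.

```python
import math
from typing import Sequence


def cosine(d: Sequence[float], q: Sequence[float]) -> float:
    """Cosine of the angle between a document vector and a query vector."""
    if len(d) != len(q):
        raise ValueError("vectors must have the same number of terms")
    dot = sum(x * y for x, y in zip(d, q))
    norm = math.sqrt(sum(x * x for x in d) * sum(y * y for y in q))
    # Assumed convention: an all-zero vector has similarity 0.
    return dot / norm if norm else 0.0


print(cosine([3.0, 4.0], [4.0, 3.0]))
```

How the weights in each vector should be chosen is exactly what the model leaves open.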
Probabilistic Retrieval Model

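  [Figure: a Document is assigned to the Relevant class with probability P(R|D) or to the Non-Relevant class with probability P(NR|D)]

  Bayes' Rule:

  $$P(R \mid D) = \frac{P(D \mid R)\,P(R)}{P(D)}$$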
Probabilistic Retrieval Model
  $$P(R \mid D) = \frac{P(D \mid R)\,P(R)}{P(D)} \qquad P(NR \mid D) = \frac{P(D \mid NR)\,P(NR)}{P(D)}$$

  If $P(D \mid R)\,P(R) > P(D \mid NR)\,P(NR)$,
  then classify D as relevant
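
A small sketch of this decision rule, assuming the two classes are exhaustive so that P(NR) = 1 - P(R). The function names and the example probabilities are hypothetical. It shows that P(D) cancels from the comparison and is only needed to report the posterior itself.

```python
def is_relevant(p_d_given_r: float, p_d_given_nr: float, p_r: float) -> bool:
    """Classify D as relevant if P(D|R)P(R) > P(D|NR)P(NR)."""
    p_nr = 1.0 - p_r  # assumption: Relevant and Non-Relevant are exhaustive
    return p_d_given_r * p_r > p_d_given_nr * p_nr


def posterior_relevant(p_d_given_r: float, p_d_given_nr: float, p_r: float) -> float:
    """P(R|D) by Bayes' rule, with P(D) expanded over the two classes."""
    p_nr = 1.0 - p_r
    joint_r = p_d_given_r * p_r
    p_d = joint_r + p_d_given_nr * p_nr
    return joint_r / p_d


# Made-up numbers: the document is 20x more likely under R, but R is rare.
print(is_relevant(0.02, 0.001, 0.1), posterior_relevant(0.02, 0.001, 0.1))
```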
Estimate P(D|R) and P(D|NR)
  Define $D = (d_1, d_2, \ldots, d_t)$, then

  $$P(D \mid R) = \prod_{i=1}^{t} P(d_i \mid R)$$

  $$P(D \mid NR) = \prod_{i=1}^{t} P(d_i \mid NR)$$

  Binary Independence Model:
  term independence + binary features in documents
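
Under these two assumptions the likelihood is a product of per-term factors: the occurrence probability when the term is present, and its complement when it is absent. The sketch below assumes the per-term probabilities are already known; `bim_likelihood` and the example numbers are hypothetical.

```python
from typing import Sequence


def bim_likelihood(doc: Sequence[int], term_probs: Sequence[float]) -> float:
    """P(D | class) for a binary document vector under term independence.

    doc[i] is 1 if term i occurs in the document, else 0.
    term_probs[i] is the probability that term i occurs in the class.
    """
    if len(doc) != len(term_probs):
        raise ValueError("doc and term_probs must have the same length")
    likelihood = 1.0
    for d_i, prob in zip(doc, term_probs):
        likelihood *= prob if d_i == 1 else (1.0 - prob)
    return likelihood


doc = [1, 0, 1]
p_d_given_r = bim_likelihood(doc, [0.8, 0.3, 0.5])   # relevant class
p_d_given_nr = bim_likelihood(doc, [0.2, 0.4, 0.1])  # non-relevant class
print(p_d_given_r, p_d_given_nr)
```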
Likelihood Ratio
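  Likelihood ratio: classify D as relevant if

  $$\frac{P(D \mid R)}{P(D \mid NR)} > \frac{P(NR)}{P(R)}$$

  $p_i$: probability of term i occurring in the relevant set
  $s_i$: probability of term i occurring in the non-relevant set

  $$\frac{P(D \mid R)}{P(D \mid NR)} = \prod_{i: d_i = 1} \frac{p_i}{s_i} \cdot \prod_{i: d_i = 0} \frac{1 - p_i}{1 - s_i}$$

  Dropping the factor that is the same for every document and taking the log gives the rank-equivalent score

  $$\sum_{i: d_i = 1} \log \frac{p_i (1 - s_i)}{s_i (1 - p_i)}$$

  Estimating $p_i = (r_i + 0.5)/(R + 1)$ and $s_i = (n_i - r_i + 0.5)/(N - R + 1)$, and keeping only the query terms:

  $$\sum_{i: d_i = q_i = 1} \log \frac{(r_i + 0.5)/(R - r_i + 0.5)}{(n_i - r_i + 0.5)/(N - n_i - R + r_i + 0.5)}$$

  N: total number of documents
  n_i: number of documents that contain term i
  r_i: number of relevant documents that contain term i
  R: total number of relevant documents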
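
The per-term weight inside the final sum can be computed directly from the four counts. This is a sketch; the function name `rsj_weight` is an assumption. With no relevance information (r_i = R = 0) the weight reduces to an IDF-like quantity.

```python
import math


def rsj_weight(r_i: int, R: int, n_i: int, N: int) -> float:
    """Term weight from the smoothed relevance counts.

    r_i: relevant documents containing the term
    R:   relevant documents in total
    n_i: documents containing the term
    N:   documents in total
    """
    odds_relevant = (r_i + 0.5) / (R - r_i + 0.5)
    odds_non_relevant = (n_i - r_i + 0.5) / (N - n_i - R + r_i + 0.5)
    return math.log(odds_relevant / odds_non_relevant)


# Without relevance judgments the weight depends only on n_i and N.
print(rsj_weight(0, 0, 10, 1000))
# With judgments, a term seen in half of the relevant documents:
print(rsj_weight(5, 10, 20, 1000))
```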
Combine with BM25 Ranking Algorithm
  BM25 extends the scoring function for the binary
  independence model to include document and
  query term weights.
  It performs very well in TREC experiments

  $$R(q, D) = \sum_{i \in Q} \log \frac{(r_i + 0.5)/(R - r_i + 0.5)}{(n_i - r_i + 0.5)/(N - n_i - R + r_i + 0.5)} \cdot \frac{(k_1 + 1) f_i}{K + f_i} \cdot \frac{(k_2 + 1)\, qf_i}{k_2 + qf_i}$$

  $$K = k_1 \left( (1 - b) + b \cdot \frac{dl}{avgdl} \right)$$

  k_1, k_2, b: tuning parameters
  dl: document length
  avgdl: average document length in the data set
  f_i: frequency of term i in the document
  qf_i: frequency of term i in the query
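
A sketch of the BM25 score for the case with no relevance information (r_i = R = 0). The defaults k1 = 1.2, k2 = 100, b = 0.75 are commonly used values chosen here as assumptions, not values given in the slides, and the function signature and example counts are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def bm25_score(query: List[str], doc: List[str], doc_freq: Dict[str, int],
               num_docs: int, avgdl: float,
               k1: float = 1.2, k2: float = 100.0, b: float = 0.75) -> float:
    """BM25 with r_i = R = 0 (no relevance judgments available)."""
    f = Counter(doc)     # f_i: term frequency in the document
    qf = Counter(query)  # qf_i: term frequency in the query
    K = k1 * ((1 - b) + b * len(doc) / avgdl)
    score = 0.0
    for term, q_count in qf.items():
        if f[term] == 0:
            continue  # terms missing from the document contribute nothing
        n_i = doc_freq.get(term, 0)
        # Relevance weight with r_i = R = 0.
        idf = math.log((num_docs - n_i + 0.5) / (n_i + 0.5))
        tf_doc = (k1 + 1) * f[term] / (K + f[term])
        tf_query = (k2 + 1) * q_count / (k2 + q_count)
        score += idf * tf_doc * tf_query
    return score


# Made-up statistics: a 90-token document in a collection with avgdl = 100.
doc = ["president"] * 15 + ["lincoln"] * 25 + ["filler"] * 50
score = bm25_score(["president", "lincoln"], doc,
                   {"president": 40000, "lincoln": 300},
                   num_docs=500000, avgdl=100.0)
print(score)
```

The rarer term ("lincoln" in this made-up data) dominates the score because its relevance weight is much larger, while the saturating TF factors keep raw counts from growing the score without bound.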
Weighted Fields Boolean Search
  [Table: one row per document (doc-id 1, 2, 3, …, n); columns: doc-id, field0, field1, …, text]

  $$R(q, D) = \sum_{i \in q} \sum_{f \in \mathrm{fields}} w_f\, m_i$$
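
A sketch of the weighted-field score. The slides do not define m_i; here it is assumed to be 1 when query term i occurs in field f and 0 otherwise. The field names, weights, and document content are hypothetical.

```python
from typing import Dict, List


def field_score(query: List[str], doc_fields: Dict[str, str],
                field_weights: Dict[str, float]) -> float:
    """Sum w_f over every (query term, field) pair where the term matches."""
    score = 0.0
    for term in query:
        for field, weight in field_weights.items():
            tokens = doc_fields.get(field, "").lower().split()
            # Assumption: m is a 0/1 match indicator for this term and field.
            m = 1 if term.lower() in tokens else 0
            score += weight * m
    return score


doc = {
    "field0": "Lightyear",
    "field1": "Buzz",
    "text": "Buzz Lightyear is a space ranger toy",
}
weights = {"field0": 3.0, "field1": 2.0, "text": 1.0}
total = field_score(["buzz", "lightyear"], doc, weights)
print(total)
```

Here "buzz" matches field1 and text (2 + 1) and "lightyear" matches field0 and text (3 + 1).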
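Apply Probabilistic Knowledge into Fields
  [Table: one row per document (doc-id 1, 2, 3, …, n); columns: doc-id, field0, field1, …, Text, under a gradient running from Higher (field0) to Lower (Text); in row 2, field0 = "Lightyear" and field1 = "Buzz"]

  [Figure: a Document is assigned to the Relevant class with probability P(R|D) or to the Non-Relevant class with probability P(NR|D)]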
Use the Knowledge during Ranking
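  [Table: one row per document (doc-id 1, 2, 3, …, n); columns: doc-id, field0, field1, …, Text; in row 2, field0 = "Lightyear" and field1 = "Buzz"]

  The goal is:

  $$P(D \mid R) = \prod_{i=1}^{t} P(d_i \mid R)$$

  $$\log P(D \mid R) = \sum_{i=1}^{t} \log P(d_i \mid R) \approx \sum_{i \in q} \sum_{f \in F} w_f\, m_i$$

  Learnable: the field weights $w_f$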
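
The slides do not say how the weights are learned. As one possible illustration only, the sketch below fits field weights with plain logistic regression trained by stochastic gradient ascent on relevance-feedback samples. The feature layout (one 0/1 match indicator per field), the training data, and all names are assumptions.

```python
import math
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[int], int]  # (match indicator per field, relevance label)


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def learn_field_weights(samples: List[Sample], num_fields: int,
                        lr: float = 0.5, epochs: int = 200) -> Tuple[List[float], float]:
    """Fit one weight per field plus a bias by stochastic gradient ascent."""
    weights = [0.0] * num_fields
    bias = 0.0
    for _ in range(epochs):
        for features, label in samples:
            z = bias + sum(w * x for w, x in zip(weights, features))
            error = label - sigmoid(z)  # gradient of the log-likelihood wrt z
            bias += lr * error
            for f in range(num_fields):
                weights[f] += lr * error * features[f]
    return weights, bias


# Hypothetical feedback: features are matches in (field0, field1, text);
# label 1 means the result was judged relevant (e.g. clicked in a search log).
FEEDBACK: List[Sample] = [
    ([1, 0, 1], 1),
    ([1, 0, 0], 1),
    ([0, 0, 1], 0),
    ([0, 0, 0], 0),
    ([1, 1, 0], 1),
    ([0, 1, 1], 0),
]

WEIGHTS, BIAS = learn_field_weights(FEEDBACK, num_fields=3)
print(WEIGHTS, BIAS)
```

In this made-up data a match in field0 always coincides with relevance, so its learned weight ends up above the weight of the text field.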
Comparison of Approaches
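  $$R_{TF\text{-}IDF} = tf_{ik} \cdot idf_k = \frac{f_{ik}}{\sum_{j=1}^{t} f_{ij}} \cdot \log \frac{N}{n_k}$$

  $$R_{bm25}(q, D) = \frac{(k_1 + 1) f_i}{K + f_i} \cdot \frac{(k_2 + 1)\, qf_i}{k_2 + qf_i} \qquad K = k_1 \left( (1 - b) + b \cdot \frac{dl}{avgdl} \right)$$

  $$R(q, D) = \sum_{i \in Q} \underbrace{\log \frac{(r_i + 0.5)/(R - r_i + 0.5)}{(n_i - r_i + 0.5)/(N - n_i - R + r_i + 0.5)}}_{\text{IDF}} \cdot \underbrace{\frac{(k_1 + 1) f_i}{K + f_i} \cdot \frac{(k_2 + 1)\, qf_i}{k_2 + qf_i}}_{\text{TF}}$$

  $$R(q, D) = \sum_{i \in q} \underbrace{\sum_{f \in F} w_f\, m_i}_{\text{IDF}} \cdot \underbrace{\frac{(k_1 + 1) f_i}{K + f_i} \cdot \frac{(k_2 + 1)\, qf_i}{k_2 + qf_i}}_{\text{TF}}$$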
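
A sketch of the last formula, where the weighted-field sum takes the place of the IDF factor and the BM25 factors supply the TF part. Several details are assumptions not fixed by the slides: m is treated as a 0/1 match indicator per term and field, f_i and dl are counted over all fields concatenated, and the parameter defaults, field names, and documents are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def combined_score(query: List[str], doc_fields: Dict[str, str],
                   field_weights: Dict[str, float], avgdl: float,
                   k1: float = 1.2, k2: float = 100.0, b: float = 0.75) -> float:
    """Sum over query terms of (weighted field matches) * (BM25 TF factors)."""
    tokens_by_field = {name: text.lower().split() for name, text in doc_fields.items()}
    all_tokens = [tok for toks in tokens_by_field.values() for tok in toks]
    f = Counter(all_tokens)                # f_i over the whole document (assumption)
    qf = Counter(t.lower() for t in query)
    K = k1 * ((1 - b) + b * len(all_tokens) / avgdl)
    score = 0.0
    for term, q_count in qf.items():
        # Weighted-field part: add w_f for every field that contains the term.
        field_part = sum(weight for name, weight in field_weights.items()
                         if term in tokens_by_field.get(name, []))
        tf_doc = (k1 + 1) * f[term] / (K + f[term])
        tf_query = (k2 + 1) * q_count / (k2 + q_count)
        score += field_part * tf_doc * tf_query
    return score


weights = {"field0": 3.0, "text": 1.0}
doc_a = {"field0": "lightyear", "text": "a toy story"}   # match in field0
doc_b = {"field0": "toy", "text": "a lightyear story"}   # match in text only
score_a = combined_score(["lightyear"], doc_a, weights, avgdl=4.0)
score_b = combined_score(["lightyear"], doc_b, weights, avgdl=4.0)
print(score_a, score_b)
```

Both documents have the same length and the same term frequency, so the TF part is identical and the ratio of the two scores equals the ratio of the field weights.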
Other Considerations
  This is not a formal model
  Requires user relevance feedback (search log)

  Harder to handle real-time search queries

  How to prevent Love/Hate attacks
Thank you

