Tamer Elsayed, Jimmy Lin, and Douglas Oard


         Niveda Krishnamoorthy
• Pairwise Similarity
• MapReduce Framework
• Proposed algorithm
  • Inverted Index Construction
  • Pairwise document similarity calculation
• Results
• PubMed – “More like this”
• Similar blog posts
• Google – Similar pages
• Framework that supports distributed computing on clusters of computers
• Introduced by Google in 2004
• Map step
• Reduce step
• Combine step (optional)
• Applications
Pairwise document similarity in large collections with map reduce
• Consider two files:

File 1: Hello World Bye World
File 2: Hello Hadoop Goodbye Hadoop

Word counts across both files:

Hello, 2
World, 2
Bye, 1
Hadoop, 2
Goodbye, 1
Map 1 (file 1)
  Hello      <Hello,1>
  World      <World,1>
  Bye        <Bye,1>
  World      <World,1>

Map 2 (file 2)
  Hello      <Hello,1>
  Hadoop     <Hadoop,1>
  Goodbye    <Goodbye,1>
  Hadoop     <Hadoop,1>
Map output: <Hello,1> <World,1> <Bye,1> <World,1> <Hello,1> <Hadoop,1> <Goodbye,1> <Hadoop,1>

Shuffle & Sort     Reducer     Output
<Hello (1,1)>      Reduce 1    Hello, 2
<World (1,1)>      Reduce 2    World, 2
<Bye (1)>          Reduce 3    Bye, 1
<Hadoop (1,1)>     Reduce 4    Hadoop, 2
<Goodbye (1)>      Reduce 5    Goodbye, 1
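
The word-count walk-through above can be sketched in a few lines of plain Python. This is a single-process simulation for illustration, not Hadoop code; the function names `map_words`, `shuffle`, and `reduce_counts` are hypothetical and chosen only to mirror the three stages on the slides.

```python
from collections import defaultdict

def map_words(text):
    """Map step: emit a (word, 1) pair for every token in one file."""
    return [(word, 1) for word in text.split()]

def shuffle(pairs):
    """Shuffle and sort: group all emitted values by key, in key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))

def reduce_counts(word, ones):
    """Reduce step: sum the grouped values for one word."""
    return word, sum(ones)

# The two input files from the slide.
files = ["Hello World Bye World", "Hello Hadoop Goodbye Hadoop"]

pairs = [pair for text in files for pair in map_words(text)]
word_counts = dict(reduce_counts(w, ones) for w, ones in shuffle(pairs).items())
print(word_counts)
```

In a real cluster each file would go to a separate mapper and each key to some reducer; here the grouping dictionary stands in for the shuffle.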
MAPREDUCE ALGORITHM (scalable and efficient)
• Inverted Index Computation
• Pairwise Similarity
Input                  Mapper    Output
Document 1: A A B C    Map 1     <A,(d1,2)> <B,(d1,1)> <C,(d1,1)>
Document 2: B D D      Map 2     <B,(d2,1)> <D,(d2,2)>
Document 3: A B B E    Map 3     <A,(d3,1)> <B,(d3,2)> <E,(d3,1)>
Map output: <A,(d1,2)> <B,(d1,1)> <C,(d1,1)> <B,(d2,1)> <D,(d2,2)> <A,(d3,1)> <B,(d3,2)> <E,(d3,1)>

Shuffle & Sort                   Reducer     Output
<A,[(d1,2),(d3,1)]>              Reduce 1    <A,[(d1,2),(d3,1)]>
<B,[(d1,1),(d2,1),(d3,2)]>       Reduce 2    <B,[(d1,1),(d2,1),(d3,2)]>
<C,[(d1,1)]>                     Reduce 3    <C,[(d1,1)]>
<D,[(d2,2)]>                     Reduce 4    <D,[(d2,2)]>
<E,[(d3,1)]>                     Reduce 5    <E,[(d3,1)]>
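
A minimal sketch of the inverted index job above, again simulated in one Python process. It assumes the three example documents contain the terms shown on the slide (d1: A A B C, d2: B D D, d3: A B B E); the helper names `map_document` and `build_index` are illustrative, not taken from the paper.

```python
from collections import Counter, defaultdict

def map_document(doc_id, terms):
    """Map step: emit (term, (doc_id, term_frequency)) once per distinct term."""
    return [(term, (doc_id, tf)) for term, tf in Counter(terms).items()]

def build_index(docs):
    """Shuffle and reduce: collect the postings of each term into one sorted list."""
    index = defaultdict(list)
    for doc_id, terms in docs.items():
        for term, posting in map_document(doc_id, terms):
            index[term].append(posting)
    return {term: sorted(postings) for term, postings in sorted(index.items())}

# The three example documents from the slide.
docs = {
    "d1": ["A", "A", "B", "C"],
    "d2": ["B", "D", "D"],
    "d3": ["A", "B", "B", "E"],
}

index = build_index(docs)
for term, postings in index.items():
    print(term, postings)
```

The reducer here is an identity step, as on the slide: it only writes out the grouped postings list for each term.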
• Group by document ID, not pairs
• Golomb’s compression for postings
• Individual Postings
• List of Postings
Postings list                    Mapper    Output
<A,[(d1,2),(d3,1)]>              Map 1     <(d1,d3),2>
<B,[(d1,1),(d2,1),(d3,2)]>       Map 2     <(d1,d2),1> <(d2,d3),2> <(d1,d3),2>
<C,[(d1,1)]>                               (no pairs)
<D,[(d2,2)]>                               (no pairs)
<E,[(d3,1)]>                               (no pairs)
Map output: <(d1,d3),2> <(d1,d2),1> <(d2,d3),2> <(d1,d3),2>

Shuffle & Sort      Reducer     Output
<(d1,d2)[1]>        Reduce 1    <(d1,d2)[1]>
<(d2,d3)[2]>        Reduce 2    <(d2,d3)[2]>
<(d1,d3)[2,2]>      Reduce 3    <(d1,d3)[4]>
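
The pairwise similarity job above can be sketched the same way. The example follows the slides in using the raw term frequency as the term weight, so each shared term contributes the product of its two frequencies; the experiments reported later use BM25 weights instead. The function names are hypothetical, and the index literal is the one produced by the inverted index example.

```python
from collections import defaultdict
from itertools import combinations

# Inverted index from the previous stage: term -> [(doc_id, term_frequency), ...]
index = {
    "A": [("d1", 2), ("d3", 1)],
    "B": [("d1", 1), ("d2", 1), ("d3", 2)],
    "C": [("d1", 1)],
    "D": [("d2", 2)],
    "E": [("d3", 1)],
}

def map_postings(postings):
    """Map step: for every pair of documents sharing a term,
    emit ((doc_i, doc_j), weight_i * weight_j)."""
    return [
        ((di, dj), wi * wj)
        for (di, wi), (dj, wj) in combinations(sorted(postings), 2)
    ]

def pairwise_similarity(index):
    """Shuffle and reduce: sum the partial contributions of each document pair."""
    scores = defaultdict(int)
    for postings in index.values():
        for pair, partial in map_postings(postings):
            scores[pair] += partial
    return dict(scores)

similarity = pairwise_similarity(index)
print(similarity)
```

Terms that occur in a single document (C, D, E) emit nothing, so only document pairs that share at least one term ever reach a reducer.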
• Hadoop 0.16.0
• 20 machines (4 GB memory, 100 GB disk)
• Similarity function: BM25
• Dataset: AQUAINT-2 (newswire text)
  • 2.5 GB
  • 906k documents
• Tokenization
• Stop word removal
• Stemming
• Df-cut
  • The fraction of terms with the highest document frequency is eliminated: 99% cut (9093)

Linear space and time complexity
  • 3.7 billion pairs (vs) 81. trillion pairs
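
As an illustration of the df-cut, the sketch below keeps only the given percentage of terms with the lowest document frequency and discards the rest. The function `df_cut`, its `cut_percent` parameter, and the alphabetical tie-breaking are assumptions made for this example; the slides do not say how ties are handled. The toy index is the one from the earlier example, not the AQUAINT-2 collection.

```python
def df_cut(index, cut_percent=99):
    """Keep the cut_percent% of terms with the lowest document frequency.

    index maps each term to its postings list, so the document frequency
    of a term is the length of that list. Ties are broken alphabetically
    (an assumption made for this sketch).
    """
    ranked = sorted(index, key=lambda term: (len(index[term]), term))
    keep = len(ranked) * cut_percent // 100
    return {term: index[term] for term in ranked[:keep]}

index = {
    "A": [("d1", 2), ("d3", 1)],
    "B": [("d1", 1), ("d2", 1), ("d3", 2)],
    "C": [("d1", 1)],
    "D": [("d2", 2)],
    "E": [("d3", 1)],
}

# With five terms, an 80% cut drops the single most frequent term, B.
pruned = df_cut(index, cut_percent=80)
print(sorted(pruned))
```

Dropping B removes three of the four intermediate pairs in the toy example, which is the effect the df-cut exploits at scale.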
• Complexity: O(n²)
• A df-cut of 99 percent eliminates meaning-bearing terms as well as some irrelevant terms
  • Meaning-bearing: Cornell, arthritis
  • Irrelevant: sleek, frail
• The df-cut can be relaxed to 99.9 percent
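
The quadratic bound can be made concrete by counting intermediate pairs. In the similarity map step, a term whose postings list has length df yields df(df − 1)/2 document pairs, so a term that occurs in all n documents alone yields n(n − 1)/2 pairs. The sketch below counts these for the toy index used earlier; the function names are illustrative only.

```python
def intermediate_pairs(index):
    """Number of (doc_i, doc_j) pairs emitted by the similarity map step:
    each term with document frequency df contributes df * (df - 1) / 2."""
    return sum(len(p) * (len(p) - 1) // 2 for p in index.values())

def worst_case_pairs(n):
    """Pairs emitted by a single term that occurs in all n documents."""
    return n * (n - 1) // 2

index = {
    "A": [("d1", 2), ("d3", 1)],
    "B": [("d1", 1), ("d2", 1), ("d3", 2)],
    "C": [("d1", 1)],
    "D": [("d2", 2)],
    "E": [("d3", 1)],
}

total = intermediate_pairs(index)
print(total)                 # pairs emitted for the toy index
print(worst_case_pairs(3))   # a term present in all three documents
```

Because the count grows with the square of the document frequency, removing the few most frequent terms removes most of the intermediate pairs.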
• Exact algorithms used for inverted index construction and pairwise document similarity are not specified.
• Df-cut: does a df-cut of 99 percent affect the quality of the results significantly?
• The results have not been evaluated.
