Pairwise document similarity in large collections with MapReduce

Tamer Elsayed, Jimmy Lin, and Douglas Oard

Niveda Krishnamoorthy

 Pairwise similarity
 MapReduce Framework
 Proposed algorithm
• Inverted Index Construction
• Pairwise document similarity calculation
 Results

 PubMed – “More like this”
 Similar blog posts
 Google – Similar pages

 Framework that supports distributed
computing on clusters of computers
 Introduced by Google in 2004
 Map step
 Reduce step
 Combine step (Optional)
 Applications

 Consider two files:

File 1: Hello World Bye World
File 2: Hello Hadoop Goodbye Hadoop

Output:
Hello, 2
World, 2
Bye, 1
Hadoop, 2
Goodbye, 1

Map 1 (file 1):
Hello -> <Hello,1>
World -> <World,1>
Bye -> <Bye,1>
World -> <World,1>

Map 2 (file 2):
Hello -> <Hello,1>
Hadoop -> <Hadoop,1>
Goodbye -> <Goodbye,1>
Hadoop -> <Hadoop,1>

Shuffle & sort, then reduce:
Reduce 1: <Hello,(1,1)> -> Hello, 2
Reduce 2: <World,(1,1)> -> World, 2
Reduce 3: <Bye,(1)> -> Bye, 1
Reduce 4: <Hadoop,(1,1)> -> Hadoop, 2
Reduce 5: <Goodbye,(1)> -> Goodbye, 1
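The word-count flow above can be sketched in plain Python, without Hadoop. This is only an illustration: the function names `map_words`, `shuffle`, and `reduce_counts` are hypothetical, and the two input files are assumed to be "Hello World Bye World" and "Hello Hadoop Goodbye Hadoop", which is consistent with the counts shown.

```python
from collections import defaultdict

def map_words(text):
    """Map step: emit a (word, 1) pair for every token."""
    return [(word, 1) for word in text.split()]

def shuffle(pairs):
    """Shuffle & sort: group the emitted values by key, keys in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))

def reduce_counts(groups):
    """Reduce step: sum the list of ones collected for each word."""
    return {word: sum(ones) for word, ones in groups.items()}

# Assumed contents of the two input files.
files = ["Hello World Bye World", "Hello Hadoop Goodbye Hadoop"]

pairs = [pair for text in files for pair in map_words(text)]
counts = reduce_counts(shuffle(pairs))
print(counts)
# {'Bye': 1, 'Goodbye': 1, 'Hadoop': 2, 'Hello': 2, 'World': 2}
```

In a real cluster each mapper and reducer would run on a different machine; here the three steps simply run one after another in a single process.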

MapReduce algorithm
• Inverted index computation
• Pairwise similarity
Scalable and efficient

Map 1 (Document 1: A A B C):
<A,(d1,2)>
<B,(d1,1)>
<C,(d1,1)>

Map 2 (Document 2: B D D):
<B,(d2,1)>
<D,(d2,2)>

Map 3 (Document 3: A B B E):
<A,(d3,1)>
<B,(d3,2)>
<E,(d3,1)>

Shuffle & sort, then reduce (each reducer writes out one postings list):
Reduce 1: <A,[(d1,2),(d3,1)]>
Reduce 2: <B,[(d1,1),(d2,1),(d3,2)]>
Reduce 3: <C,[(d1,1)]>
Reduce 4: <D,[(d2,2)]>
Reduce 5: <E,[(d3,1)]>
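The inverted index construction above can be sketched as follows. This is a single-process illustration, not the authors' Hadoop code: `map_document` and `build_index` are hypothetical names, and the third document is assumed to be d3 with terms A B B E, as its postings indicate.

```python
from collections import Counter, defaultdict

def map_document(doc_id, terms):
    """Map step: emit (term, (doc_id, term_frequency)) for each distinct term."""
    return [(term, (doc_id, tf)) for term, tf in Counter(terms).items()]

def build_index(docs):
    """Shuffle and reduce: gather all postings that share a term."""
    index = defaultdict(list)
    for doc_id, terms in docs.items():
        for term, posting in map_document(doc_id, terms):
            index[term].append(posting)
    # Sort terms and postings so the result is deterministic.
    return {term: sorted(postings) for term, postings in sorted(index.items())}

docs = {
    "d1": ["A", "A", "B", "C"],
    "d2": ["B", "D", "D"],
    "d3": ["A", "B", "B", "E"],
}
index = build_index(docs)
for term, postings in index.items():
    print(term, postings)
# A [('d1', 2), ('d3', 1)]
# B [('d1', 1), ('d2', 1), ('d3', 2)]
# C [('d1', 1)]
# D [('d2', 2)]
# E [('d3', 1)]
```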

 Group by document ID, not pairs

 Golomb’s compression for postings
 Individual Postings
 List of Postings

Map 1: <A,[(d1,2),(d3,1)]> -> <(d1,d3),2>
Map 2: <B,[(d1,1),(d2,1),(d3,2)]> -> <(d1,d2),1>, <(d2,d3),2>, <(d1,d3),2>
<C,[(d1,1)]>, <D,[(d2,2)]>, <E,[(d3,1)]>: single posting, no pairs emitted

Shuffle & sort, then reduce:
Reduce 1: <(d1,d2),[1]> -> <(d1,d2),1>
Reduce 2: <(d2,d3),[2]> -> <(d2,d3),2>
Reduce 3: <(d1,d3),[2,2]> -> <(d1,d3),4>
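The pairwise similarity step above can be sketched in the same style. The map step multiplies the two term weights for every pair of documents in a postings list, and the reduce step sums those products per pair. Raw term frequencies are used as weights here to match the worked example; a BM25 weight could be substituted. The names `map_postings` and `pairwise_similarity` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def map_postings(postings):
    """Map step: for each pair of documents sharing a term,
    emit the product of their term weights."""
    return [((d_i, d_j), w_i * w_j)
            for (d_i, w_i), (d_j, w_j) in combinations(sorted(postings), 2)]

def pairwise_similarity(index):
    """Reduce step: sum the partial products for each document pair."""
    scores = defaultdict(int)
    for postings in index.values():
        for pair, contribution in map_postings(postings):
            scores[pair] += contribution
    return dict(scores)

# Inverted index from the worked example.
index = {
    "A": [("d1", 2), ("d3", 1)],
    "B": [("d1", 1), ("d2", 1), ("d3", 2)],
    "C": [("d1", 1)],
    "D": [("d2", 2)],
    "E": [("d3", 1)],
}
similarity = pairwise_similarity(index)
print(similarity)
# {('d1', 'd3'): 4, ('d1', 'd2'): 1, ('d2', 'd3'): 2}
```

Terms that occur in only one document (C, D, E) produce no pairs, so document pairs with no shared term never appear in the output.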

 Hadoop 0.16.0
 20 machines (4GB memory, 100GB disk)
 Similarity function - BM25
 Dataset: AQUAINT-2 (newswire text)
• 2.5 GB
• 906k documents

 Tokenization
 Stop word removal
 Stemming
 Df-cut
• The fraction of terms with the highest document
frequency is eliminated: a 99% cut removes the top 1% (9,093 terms)
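A df-cut can be sketched as a filter over the inverted index. This is a minimal illustration under assumptions: the function name `df_cut` and its `drop_fraction` parameter are hypothetical, ties in document frequency are broken alphabetically, and a 99% cut would correspond to `drop_fraction=0.01`. The toy index has only five terms, so a larger fraction is used to make the effect visible.

```python
def df_cut(index, drop_fraction):
    """Remove the given fraction of terms with the highest document frequency."""
    # Order terms by descending document frequency (length of postings list).
    by_df = sorted(index, key=lambda term: (-len(index[term]), term))
    n_drop = int(len(by_df) * drop_fraction)
    dropped = set(by_df[:n_drop])
    return {term: postings for term, postings in index.items()
            if term not in dropped}

index = {
    "A": [("d1", 2), ("d3", 1)],
    "B": [("d1", 1), ("d2", 1), ("d3", 2)],
    "C": [("d1", 1)],
    "D": [("d2", 2)],
    "E": [("d3", 1)],
}
pruned = df_cut(index, drop_fraction=0.2)  # drops 1 of 5 terms
print(sorted(pruned))
# ['A', 'C', 'D', 'E']
```

Term B appears in all three documents, so it is the one removed.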

Linear space and time complexity

• 3.7 billion pairs vs. 8.1 trillion pairs

 Complexity: O(n²)
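One way to see where the cost comes from: a term that occurs in df documents makes the map step emit df(df-1)/2 intermediate pairs, so the most frequent terms dominate. The sketch below counts intermediate pairs on a small toy index; `intermediate_pairs` is a hypothetical helper, and the toy data is illustrative only.

```python
def intermediate_pairs(index):
    """Count the document pairs the map step would emit:
    df * (df - 1) / 2 for each term, summed over all terms."""
    return sum(len(postings) * (len(postings) - 1) // 2
               for postings in index.values())

index = {
    "A": [("d1", 2), ("d3", 1)],
    "B": [("d1", 1), ("d2", 1), ("d3", 2)],
    "C": [("d1", 1)],
    "D": [("d2", 2)],
    "E": [("d3", 1)],
}
print(intermediate_pairs(index))
# 4

# Dropping the most frequent term (B) removes most of the pairs.
without_b = {term: p for term, p in index.items() if term != "B"}
print(intermediate_pairs(without_b))
# 1
```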

 A df-cut of 99 percent eliminates meaning-bearing
terms as well as some irrelevant terms
• Cornell, arthritis
• sleek, frail
 Df-cut can be relaxed to 99.9 percent

 The exact algorithms used for inverted index
construction and pairwise document
similarity are not specified.
 Df-cut: does a df-cut of 99 percent affect
the quality of the results significantly?
 The results have not been evaluated.
