Mining of Massive Datasets
Ashic Mahtab
@ashic
www.heartysoft.com

Stream Processing
• Have I already processed this?
• How many distinct queries were made?
• How many hits did I get?

Stream Processing – Bloom Filters
• A "not seen" answer is always correct (no false negatives).
• A "seen" answer may be wrong (false positives are possible).

• Keep a collection of hash functions (h1, h2, h3…) and a bit array (the working store).
• For an input, run the hash functions. Each result maps to a position in the bit array.
• If all of those bits are lit in the working store, the input might have been processed (possibility of false positives).
• If any of those bits is not lit in the working store, the input has not been processed yet: process it and light its bits. (Guaranteed… no false negatives.)

1 0 0 1 1 0 1 1 1 0
0 0 1 0 0 1 1 0 0 0
0 0 1 1 1 0 0 0 0 1
0 1 0 0 0 1 0 0 0 1
Input 1: “Foo” hashes to:
1 0 0 1 1 0 0 0 0 0
Input 2: “Bar” hashes to:
1 0 1 1 1 0 0 0 0 0
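
The check described above can be sketched in a few lines of Python. This is a toy version, not a production filter: the class name, the bit-array size, and the trick of salting SHA-256 to simulate h1, h2, h3… are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: a bit array plus several hash functions."""

    def __init__(self, num_bits=64, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [0] * num_bits  # the "working store"

    def _positions(self, item):
        # Assumption: simulate h1, h2, h3... by salting one hash with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        # Light every bit the item hashes to.
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # False means "definitely not seen"; True means "possibly seen".
        return all(self.bits[pos] for pos in self._positions(item))


seen = BloomFilter()
seen.add("Foo")
print(seen.might_contain("Foo"))  # always True once added
print(seen.might_contain("Bar"))  # usually False; a false positive is possible
```

A bigger bit array or a better-tuned number of hash functions lowers the false positive rate, at the cost of memory or time.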

• Not just for streams (everything is a stream, right?)
• Cassandra uses Bloom filters to check whether some data is in a low-level storage file.

Map Reduce
• A little smarts goes a l-o-o-o-n-g way.

Map Reduce – Multiway Joins
• R join S join T
• size(R) = r, size(S) = s, size(T) = t
• Probability of match for R and S = p
• Probability of match for S and T = p
• Which do we join first?

• R(A, B) join S(B, C) join T(C, D)
• Communication cost:
* If we join R and S first: O(r + s + t + prs) (the intermediate result has about prs tuples)
* If we join S and T first: O(r + s + t + pst) (the intermediate result has about pst tuples)

• Can we do better?

• Hash B to b buckets, C to c buckets.
• bc = k (the number of reducers)
• Cost ~ r + 2s + t + 2 * sqrt(krt)
Usually, can neglect r + t compared to the k term. So,
2s + 2 * sqrt(krt)
[Single MR job]
• vs O(r + s + t + prs)
[Two MR jobs]
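
To get a feel for the comparison, here is a small sketch that plugs numbers into the cost formulas on these slides. The function names and the sample sizes are made up, and the single-job cost assumes the optimal split b = sqrt(kr/t), c = sqrt(kt/r), under which each R tuple is sent to c reducers and each T tuple to b reducers.

```python
import math


def cascade_cost(r, s, t, p, first="RS"):
    """Communication cost of two cascaded two-way joins (constants dropped)."""
    intermediate = p * r * s if first == "RS" else p * s * t
    return r + s + t + intermediate


def three_way_cost(r, s, t, k):
    """Single-job cost with k reducers and the optimal split bc = k."""
    b = math.sqrt(k * r / t)  # buckets for attribute B
    c = math.sqrt(k * t / r)  # buckets for attribute C
    # Each R tuple goes to c reducers, each T tuple to b reducers.
    return r + 2 * s + t + c * r + b * t  # c*r + b*t == 2*sqrt(k*r*t)


r = s = t = 1_000_000  # hypothetical relation sizes
k = 100                # hypothetical number of reducers
for p in (1e-5, 1e-4):
    print(p, cascade_cost(r, s, t, p), three_way_cost(r, s, t, k))
```

With these made-up numbers the single job wins for the larger p and loses for the smaller one, which is the point of the next slide's question.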

• So… is this always better?

Map Reduce – Complexity
• Replication rate (r):
Number of key-value pairs output by all Map tasks / number of inputs
• Reducer size (q):
Max number of values per key at the reducers
• p = number of inputs
• For multiplying two n x n matrices (p = 2n^2):
qr >= 2n^2
r >= p / q

Map Reduce – Matrix Multiplication
• Approach 1 (two steps)
• Matrices M, N
• Elements M(i, j) = mij, N(j, k) = njk
• Map1: Map elements to (j, (M, i, mij)), (j, (N, k, njk))
• Reduce1: For each key j, output ((i, k), mij*njk) for every pair of an M value and an N value
• Map2: Identity
• Reduce2: For each key (i, k), output the sum of the values.
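
The two steps above can be simulated in plain Python, with dictionaries standing in for the shuffle. This is only a sketch: the sparse `{(row, col): value}` layout, the function name, and the sample matrices are assumptions for illustration.

```python
from collections import defaultdict

# Sparse matrices as {(row, col): value}; the sample values are made up.
M = {(0, 0): 1, (0, 1): 2, (1, 0): 3, (1, 1): 4}
N = {(0, 0): 5, (0, 1): 6, (1, 0): 7, (1, 1): 8}


def two_step_multiply(M, N):
    # Map1: key every element by the shared index j.
    by_j = defaultdict(lambda: ([], []))
    for (i, j), m_ij in M.items():
        by_j[j][0].append((i, m_ij))
    for (j, k), n_jk in N.items():
        by_j[j][1].append((k, n_jk))

    # Reduce1: for each j, emit ((i, k), mij * njk) for every M/N pair.
    products = []
    for m_values, n_values in by_j.values():
        for i, m_ij in m_values:
            for k, n_jk in n_values:
                products.append(((i, k), m_ij * n_jk))

    # Map2 is the identity; Reduce2 sums the products for each (i, k).
    result = defaultdict(int)
    for key, value in products:
        result[key] += value
    return dict(result)


print(two_step_multiply(M, N))
```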

• Approach 2
• One step:
• Map:
For each mij in M, produce ((i, k), (M, j, mij)) for k = 1…number of columns in N
For each njk in N, produce ((i, k), (N, j, njk)) for i = 1…number of rows in M
• Reduce:
For each key (i, k), multiply the M and N values that share the same j, and sum the products.
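
The one-step version can be sketched the same way. Again the sparse dictionary layout, the function name, and the sample values are assumptions; the point is that every element is replicated to each output cell that needs it.

```python
from collections import defaultdict

# Sparse matrices as {(row, col): value}; the sample values are made up.
M = {(0, 0): 1, (0, 1): 2, (1, 0): 3, (1, 1): 4}
N = {(0, 0): 5, (0, 1): 6, (1, 0): 7, (1, 1): 8}


def one_step_multiply(M, N, rows_in_m, cols_in_n):
    # Map: replicate each element to every output cell (i, k) that needs it.
    groups = defaultdict(lambda: ({}, {}))
    for (i, j), m_ij in M.items():
        for k in range(cols_in_n):
            groups[(i, k)][0][j] = m_ij
    for (j, k), n_jk in N.items():
        for i in range(rows_in_m):
            groups[(i, k)][1][j] = n_jk

    # Reduce: pair up the values that share j, multiply, and sum.
    return {
        key: sum(m_vals[j] * n_vals[j] for j in m_vals if j in n_vals)
        for key, (m_vals, n_vals) in groups.items()
    }


print(one_step_multiply(M, N, rows_in_m=2, cols_in_n=2))
```

Compared with the two-step version, there is only one job, but each element is copied many times before the reducers see it.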

• Approach 3
• Two steps again.

• Communication cost for n x n matrices with reducer size q:
• One pass:
(4n^4) / q
• Two pass:
(4n^3) / sqrt(q)
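
A quick way to compare the two formulas is to evaluate them for a few reducer sizes. The sketch below uses made-up values of n and q; the two costs meet at q = n^2, and the two-pass cost is lower for any smaller q.

```python
import math


def one_pass_cost(n, q):
    """Communication cost of the one-pass method: 4n^4 / q."""
    return 4 * n**4 / q


def two_pass_cost(n, q):
    """Communication cost of the two-pass method: 4n^3 / sqrt(q)."""
    return 4 * n**3 / math.sqrt(q)


n = 1000  # hypothetical matrix dimension
for q in (n, n**2 // 100, n**2):
    print(q, one_pass_cost(n, q), two_pass_cost(n, q))
```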

Similarity - Shingling
• “abcdef” -> [“abc”, “bcd”, “cde”…]
• Jaccard similarity -> N(intersection) / N(union)
• Problem?
• Size: the shingle sets get large.
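
Shingling and Jaccard similarity fit in a few lines of Python. A minimal sketch, assuming character shingles of length 3 as in the example above; the function names are made up.

```python
def shingles(text, k=3):
    """Return the set of all length-k substrings of text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


first = shingles("abcdef")   # {"abc", "bcd", "cde", "def"}
second = shingles("abcdeg")  # {"abc", "bcd", "cde", "deg"}
print(jaccard(first, second))  # 3 shared shingles out of 5 distinct ones
```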

Similarity - Minhashing
Problem?
h(S1) = a, h(S2) = c, h(S3) = b, h(S4) = a

Similarity – Minhash Signatures
Problem? Still can’t find pairs with greatest similarity efficiently
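
For reference, a minhash signature can be sketched as below. This is one common construction, not necessarily the one on the slide: random linear hash functions (a*x + b) mod a prime stand in for random permutations, and `zlib.crc32` turns each shingle into an integer. All names and parameters are assumptions, and the sets must be non-empty.

```python
import random
import zlib

PRIME = (1 << 61) - 1  # a Mersenne prime, used as the hash modulus


def make_hash_funcs(n, seed=0):
    """Create n random (a, b) pairs for h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(n)]


def signature(items, hash_funcs):
    """One minimum hash value per hash function (items must be non-empty)."""
    xs = [zlib.crc32(item.encode("utf-8")) for item in items]
    return [min((a * x + b) % PRIME for x in xs) for a, b in hash_funcs]


def estimate_similarity(sig1, sig2):
    """Fraction of positions where the two signatures agree."""
    matches = sum(1 for u, v in zip(sig1, sig2) if u == v)
    return matches / len(sig1)


funcs = make_hash_funcs(100)
sig_a = signature({"abc", "bcd", "cde", "def"}, funcs)
sig_b = signature({"abc", "bcd", "cde", "deg"}, funcs)
print(estimate_similarity(sig_a, sig_b))  # an estimate of the Jaccard similarity
```

Signatures are much smaller than the shingle sets, but comparing every pair of signatures is still quadratic in the number of documents.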

Similarity – LSH for Minhash Signatures

Clustering – K Means
1. Pick k points (centroids)
2. Assign each point to the cluster of its nearest centroid
3. Shift each centroid to the “centre” of its cluster.
4. Repeat
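
The four steps map directly onto a short loop. A minimal sketch, assuming points are tuples of numbers, Euclidean distance, random initial centroids, and a fixed number of iterations instead of a convergence test; the function name and sample points are made up.

```python
import math
import random


def k_means(points, k, iterations=10, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # 1. pick k points as centroids
    clusters = []
    for _ in range(iterations):  # 4. repeat
        # 2. assign each point to the cluster of its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # 3. shift each centroid to the mean of its cluster
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return centroids, clusters


points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centroids, clusters = k_means(points, k=2)
print(sorted(centroids))
```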

Clustering – BFR
• 3 sets – Discard, Compressed and Retained
• First two have summaries: N, sum per dimension, sum of squares per dimension
• High dimensional Euclidean space
• Mahalanobis Distance
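
The summaries are enough to recover a centroid and a per-dimension variance, which is all the Mahalanobis distance needs when the axes are treated as independent. A sketch under that assumption; the class and method names are made up, and every dimension is assumed to have non-zero variance.

```python
import math


class ClusterSummary:
    """N, SUM and SUMSQ per dimension, as kept for summarised sets."""

    def __init__(self, dims):
        self.n = 0
        self.sum = [0.0] * dims
        self.sumsq = [0.0] * dims

    def add(self, point):
        # Update the summary; the point itself can then be thrown away.
        self.n += 1
        for d, x in enumerate(point):
            self.sum[d] += x
            self.sumsq[d] += x * x

    def centroid(self):
        return [s / self.n for s in self.sum]

    def variance(self):
        # Variance in each dimension = SUMSQ/N - (SUM/N)^2
        return [sq / self.n - (s / self.n) ** 2
                for s, sq in zip(self.sum, self.sumsq)]

    def mahalanobis(self, point):
        # Distance to the centroid, normalised by the per-dimension variance.
        return math.sqrt(sum(
            (x - c) ** 2 / v
            for x, c, v in zip(point, self.centroid(), self.variance())
        ))


summary = ClusterSummary(dims=2)
summary.add((0, 0))
summary.add((2, 4))
print(summary.centroid(), summary.variance(), summary.mahalanobis((2, 4)))
```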

Clustering – CURE
• Sample. Run clustering on sample.
• Pick “representatives” from each cluster.
• Move representatives about 20% or so to the centre.
• Merge clusters whose representatives are close.
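
The "move representatives" step is simple arithmetic. A sketch, assuming the centre is the centroid of the cluster's points and that points are tuples; the function name and sample data are made up.

```python
def shrink_representatives(representatives, cluster_points, fraction=0.2):
    """Move each representative a fraction of the way to the cluster centroid."""
    dims = range(len(cluster_points[0]))
    centroid = [sum(p[d] for p in cluster_points) / len(cluster_points)
                for d in dims]
    return [
        tuple(rep[d] + fraction * (centroid[d] - rep[d]) for d in dims)
        for rep in representatives
    ]


cluster = [(0, 0), (10, 0), (5, 0)]
reps = [(0, 0), (10, 0)]
print(shrink_representatives(reps, cluster))
```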

Dimensionality Reduction - SVD

Dimensionality Reduction - CUR
• SVD results in U and V being dense, even when M is sparse.
• SVD takes O(n^3) time.

Dimensionality Reduction - CUR
• Choose r, the number of rows and columns to sample.
• Choose r rows (R) and r columns (C) of M.
• Their intersection is W.
• Run SVD on W (much smaller than M). W = XΣY’
• Compute Σ+, the Moore-Penrose pseudoinverse of Σ.
• Then, U = Y * (Σ+)^2 * X’

Dimensionality Reduction – CUR
Choosing Rows and Columns
• Random, but with bias for importance.
• Importance is measured by the squared Frobenius norm.
• Probability of picking a row or column:
Sum of squares for row or column / Sum of squares of all elements
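
The selection probabilities can be computed directly from the matrix. A sketch, assuming a dense list-of-lists matrix; the function names and sample matrix are made up, and `random.choices` draws with replacement.

```python
import random


def row_probabilities(matrix):
    """Sum of squares of each row over the squared Frobenius norm."""
    total = sum(x * x for row in matrix for x in row)
    return [sum(x * x for x in row) / total for row in matrix]


def column_probabilities(matrix):
    """Same as row_probabilities, applied to the transpose."""
    return row_probabilities([list(col) for col in zip(*matrix)])


def sample_rows(matrix, r, seed=0):
    """Pick r row indices, biased towards rows with large entries."""
    rng = random.Random(seed)
    weights = row_probabilities(matrix)
    return rng.choices(range(len(matrix)), weights=weights, k=r)


matrix = [[1, 1], [3, 3]]
print(row_probabilities(matrix))     # the second row is far more likely
print(column_probabilities(matrix))  # both columns are equally likely
print(sample_rows(matrix, r=2))      # the same row may appear twice
```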

• Same row / column may get picked (selection with replacement).
• Duplicates reduce the rank of W.
• Duplicates can be combined: if a row or column appears k times, keep one copy and multiply it by sqrt(k).
• W is then no longer square: compute the pseudoinverse as before, but transpose the result.

Thanks
• Mining of Massive Datasets
Leskovec, Rajaraman, Ullman
Coursera / Stanford Course
Book: http://www.mmds.org/ [free]
