Mining of Massive Datasets
Ashic Mahtab 
@ashic 
www.heartysoft.com
Stream Processing
• Have I already processed this?
• How many distinct queries were made?
• How many hits did I get?
Stream Processing – Bloom Filters
• Guaranteed detection of negatives (no false negatives).
• False positives are possible.
Stream Processing – Bloom Filters
• Keep a collection of hash functions (h1, h2, h3…).
• For each input, run the hash functions and map the results to positions in a bit array.
• If all of those bits are lit in the working store, the input might have been processed already (false positives are possible).
• If any of those bits is not lit in the working store, the input is new and needs processing (guaranteed: no false negatives).
Stream Processing – Bloom Filters 
1 0 0 1 1 0 1 1 1 0 
0 0 1 0 0 1 1 0 0 0 
0 0 1 1 1 0 0 0 0 1 
0 1 0 0 0 1 0 0 0 1 
Input 1: “Foo” hashes to: 
1 0 0 1 1 0 0 0 0 0 
Input 2: “Bar” hashes to: 
1 0 1 1 1 0 0 0 0 0
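The check described on these slides can be sketched in Python. The class name, the bit-array size, and the use of salted SHA-256 as a stand-in for h1, h2, h3 are assumptions made for illustration; the slides do not specify them.

```python
import hashlib


class BloomFilter:
    """A small Bloom filter over a fixed-size bit array (illustrative sketch)."""

    def __init__(self, num_bits=64, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [0] * num_bits  # the "working store"

    def _positions(self, item):
        # Derive several hash functions from SHA-256 by salting with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # All bits lit: possibly seen (false positives are possible).
        # Any bit unlit: definitely not seen (no false negatives).
        return all(self.bits[pos] for pos in self._positions(item))


seen = BloomFilter()
seen.add("Foo")
print(seen.might_contain("Foo"))  # True: an added item is always reported
```

A lookup for an item that was never added usually returns False, but it can return True when other items happen to have lit the same bits.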
Stream Processing – Bloom Filters
• Not just for streams (everything is a stream, right?)
• Cassandra uses Bloom filters to check whether some data is in a low-level storage file.
Map Reduce
• A little smarts goes a l-o-o-o-n-g way.
Map Reduce – Multiway Joins
• R join S join T
• size(R) = r, size(S) = s, size(T) = t
• Probability of match for R and S = p
• Probability of match for S and T = p
• Which do we join first?
Map Reduce – Multiway Joins
• R(A, B) join S(B, C) join T(C, D)
• size(R) = r, size(S) = s, size(T) = t
• Probability of match for R and S = p
• Probability of match for S and T = p
• Communication cost:
* If we join R and S first: O(r + s + t + prs)
* If we join S and T first: O(r + s + t + pst)
Map Reduce – Multiway Joins
• Can we do better?
Map Reduce – Multiway Joins
• Hash B to b buckets, C to c buckets.
• bc = k (number of reducers)
• Cost ~ r + 2s + t + 2 * sqrt(krt)
Usually, r + t can be neglected compared to the k term, leaving
2s + 2 * sqrt(krt)
[Single MR job]
• vs (r + s + t + prs)
[Two MR jobs]
Map Reduce – Multiway Joins
• So…is this always better?
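One way to explore that question is to plug numbers into the two cost expressions from the slides. The sketch below does only that; the function names and the sample sizes are hypothetical. The bucket split shown is not on the slides: it assumes the reducer-side cost is s + cr + bt, which is minimised under bc = k.

```python
import math


def two_job_cost(r, s, t, p):
    """Two-job cost from the slides: r + s + t + p*r*s."""
    return r + s + t + p * r * s


def single_job_cost(r, s, t, k):
    """Single-job cost from the slides: r + 2s + t + 2*sqrt(k*r*t)."""
    return r + 2 * s + t + 2 * math.sqrt(k * r * t)


def bucket_counts(r, t, k):
    """Assumed split of k into b and c (b*c = k) that minimises c*r + b*t."""
    b = math.sqrt(k * r / t)
    c = math.sqrt(k * t / r)
    return b, c


r = s = t = 10**6
k = 100
for p in (1e-5, 1e-4):
    print(p, two_job_cost(r, s, t, p), single_job_cost(r, s, t, k))
```

With these made-up sizes the two-job plan is cheaper for the smaller p and the single job is cheaper for the larger p, so the answer depends on how selective the joins are and on k.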
Map Reduce – Complexity
• Replication Rate (r):
Number of outputs by all Map tasks / number of inputs
• Reducer Size (q):
Max number of values per key at reducers
• p = number of inputs
• For n x n matrices:
qr >= 2n^2
r >= p / q
Map Reduce – Matrix Multiplication
• Approach 1 (two steps)
• Matrices M, N
• M(i, j), N(j, k)
• Map1: Map matrix elements to (j, (M, i, mij)), (j, (N, k, njk))
• Reduce1: For each key j, output ((i, k), mij*njk) for every pair of M and N values
• Map2: Identity
• Reduce2: For each key (i, k), get the sum of values.
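The two steps can be simulated in plain Python to see the data flow. This is a single-process sketch, not a real Map Reduce job; the sparse dict representation and the function name are assumptions.

```python
from collections import defaultdict


def multiply_two_pass(M, N):
    """M and N are sparse matrices stored as {(row, col): value} dicts."""
    # Map1: key every element by the shared index j.
    by_j = defaultdict(lambda: ([], []))
    for (i, j), m_ij in M.items():
        by_j[j][0].append((i, m_ij))
    for (j, k), n_jk in N.items():
        by_j[j][1].append((k, n_jk))

    # Reduce1: for each j, emit ((i, k), m_ij * n_jk) for every pairing.
    products = []
    for m_items, n_items in by_j.values():
        for i, m_ij in m_items:
            for k, n_jk in n_items:
                products.append(((i, k), m_ij * n_jk))

    # Map2 is the identity; Reduce2 sums the products for each (i, k).
    result = defaultdict(int)
    for key, value in products:
        result[key] += value
    return dict(result)


M = {(0, 0): 1, (0, 1): 2, (1, 0): 3, (1, 1): 4}
N = {(0, 0): 5, (0, 1): 6, (1, 0): 7, (1, 1): 8}
print(multiply_two_pass(M, N))
```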
Map Reduce – Matrix Multiplication
• Approach 2
• One step:
• Map:
For M, produce ((i, k), (M, j, mij)) for k = 1…number of columns in N
For N, produce ((i, k), (N, j, njk)) for i = 1…number of rows in M
• Reduce:
For each key (i, k), multiply the values with matching j, and sum.
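The one-step approach can be simulated the same way. Again this is a single-process sketch under assumed names and an assumed sparse dict representation. Note how each element is replicated to every output cell it contributes to, which is where the extra communication comes from.

```python
from collections import defaultdict


def multiply_one_pass(M, N, rows_in_M, cols_in_N):
    """M and N are sparse matrices stored as {(row, col): value} dicts."""
    # Map: replicate each element to every output cell (i, k) it contributes to.
    groups = defaultdict(lambda: ({}, {}))
    for (i, j), m_ij in M.items():
        for k in range(cols_in_N):
            groups[(i, k)][0][j] = m_ij
    for (j, k), n_jk in N.items():
        for i in range(rows_in_M):
            groups[(i, k)][1][j] = n_jk

    # Reduce: for each (i, k), multiply the values that share j, then sum.
    result = {}
    for key, (m_by_j, n_by_j) in groups.items():
        result[key] = sum(m_by_j[j] * n_by_j[j] for j in m_by_j if j in n_by_j)
    return result


M = {(0, 0): 1, (0, 1): 2, (1, 0): 3, (1, 1): 4}
N = {(0, 0): 5, (0, 1): 6, (1, 0): 7, (1, 1): 8}
print(multiply_one_pass(M, N, rows_in_M=2, cols_in_N=2))
```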
Map Reduce – Matrix Multiplication
• Approach 3
• Two steps again.
Map Reduce – Matrix Multiplication
• One pass (communication cost):
(4n^4) / q
• Two pass (communication cost):
(4n^3) / sqrt(q)
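Comparing the two expressions for the same reducer size q shows when the two-pass version comes out ahead. This is a short worked inequality, assuming n and q are positive:

```latex
\frac{4n^{3}}{\sqrt{q}} < \frac{4n^{4}}{q}
\iff \sqrt{q} < n
\iff q < n^{2}
```

So for any reducer size below n^2 the two-pass expression is the smaller one, and the gap widens as q shrinks.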
Similarity - Shingling
• “abcdef” -> [“abc”, “bcd”, “cde”…]
• Jaccard similarity -> N(intersection) / N(union)
• Problem?
• Size
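Shingling and Jaccard similarity fit in a few lines of Python. The function names and the default shingle length of 3 (matching the slide's example) are assumptions for this sketch.

```python
def shingles(text, k=3):
    """Return the set of all length-k substrings (k-shingles) of text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity: |a & b| / |a | b| (taken as 1.0 for two empty sets)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


print(sorted(shingles("abcdef")))  # ['abc', 'bcd', 'cde', 'def']
print(jaccard(shingles("abcdef"), shingles("abcdxy")))
```

The size problem shows up quickly: a document of length L produces up to L - k + 1 shingles, and every pair of documents needs its own set comparison.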
Similarity - Minhashing 
Problem? 
h(S1) = a, h(S2) = c, h(S3) = b, h(S4) = a
Similarity – Minhash Signatures 
Problem? Still can’t find pairs with greatest similarity efficiently
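A minhash signature can be sketched as follows. The assumptions are loud ones: set elements are already mapped to integer ids smaller than PRIME, and a seeded family of hash functions h(x) = (a*x + b) mod PRIME stands in for random permutations. None of these names or parameters come from the slides.

```python
import random

PRIME = 2_147_483_647  # 2^31 - 1, assumed larger than any element id


def make_hash_functions(n, seed=0):
    """Return n hash functions h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    params = [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(n)]
    return [lambda x, a=a, b=b: (a * x + b) % PRIME for a, b in params]


def signature(items, hash_functions):
    """Minhash signature: the minimum hash value of the set under each function."""
    return [min(h(x) for x in items) for h in hash_functions]


def estimated_similarity(sig1, sig2):
    """Fraction of signature positions on which the two sets agree."""
    return sum(a == b for a, b in zip(sig1, sig2)) / len(sig1)


hashes = make_hash_functions(100)
s1, s2 = {1, 2, 3, 4}, {3, 4, 5, 6}
print(estimated_similarity(signature(s1, hashes), signature(s2, hashes)))
```

The printed estimate should land near the true Jaccard similarity of 1/3 for these two sets, though the exact value depends on the seed. Comparing every pair of signatures is still quadratic in the number of sets, which is the problem the slide points at.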
Similarity – LSH for Minhash Signatures
Clustering – Hierarchical
Clustering – K Means 
1. Pick k points (centroids) 
2. Assign points to clusters 
3. Shift centroids to “centre”. 
4. Repeat
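The four steps translate almost directly into code. In this sketch the initial centroids are passed in by the caller (step 1), distance is squared Euclidean, and an empty cluster keeps its old centroid; all three are assumptions, since the slide does not say how to handle them.

```python
def k_means(points, centroids, max_iters=100):
    """Basic k-means on tuples of numbers, starting from the given centroids."""
    centroids = [tuple(map(float, c)) for c in centroids]
    for _ in range(max_iters):
        # Step 2: assign each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)

        # Step 3: move each centroid to the mean of its cluster.
        new_centroids = []
        for old, cluster in zip(centroids, clusters):
            if cluster:
                dims = zip(*cluster)
                new_centroids.append(tuple(sum(d) / len(cluster) for d in dims))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid

        # Step 4: repeat until the centroids stop moving.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids


points = [(0, 0), (0, 1), (10, 10), (10, 11)]
print(k_means(points, [(0, 0), (10, 10)]))
```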
Clustering – BFR
• 3 sets – Discard, Compressed and Retained
• First two have summaries: N, sum per dimension, sum of squares per dimension
• High-dimensional Euclidean space
• Mahalanobis distance
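The summaries are enough to recover a cluster's centroid and per-dimension variance without keeping the points. The sketch below shows that, plus a Mahalanobis distance normalised by the per-dimension standard deviation. The class and method names are hypothetical, and it assumes every dimension has non-zero variance.

```python
import math


class ClusterSummary:
    """Summary of a cluster: N, per-dimension SUM and SUMSQ."""

    def __init__(self, dims):
        self.n = 0
        self.sum = [0.0] * dims
        self.sumsq = [0.0] * dims

    def add(self, point):
        self.n += 1
        for i, x in enumerate(point):
            self.sum[i] += x
            self.sumsq[i] += x * x

    def centroid(self):
        return [s / self.n for s in self.sum]

    def variance(self):
        # Var = SUMSQ/N - (SUM/N)^2 in each dimension.
        return [sq / self.n - (s / self.n) ** 2 for s, sq in zip(self.sum, self.sumsq)]

    def mahalanobis(self, point):
        # Distance to the centroid, scaled by the variance in each dimension.
        return math.sqrt(sum(
            (x - c) ** 2 / v
            for x, c, v in zip(point, self.centroid(), self.variance())
        ))


summary = ClusterSummary(dims=2)
summary.add((1, 2))
summary.add((3, 6))
print(summary.centroid(), summary.variance(), summary.mahalanobis((3, 8)))
```

Adding a point only touches N, SUM and SUMSQ, so a summary can absorb new points, or be merged with another summary, in time proportional to the number of dimensions.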
Clustering – CURE
• Sample. Run clustering on sample.
• Pick “representatives” from each cluster.
• Move representatives about 20% or so towards the centre.
• Merge clusters if close.
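The step that moves representatives is a simple interpolation towards the cluster centroid. A minimal sketch, with a hypothetical function name and the 20% figure as the default fraction:

```python
def shrink_representatives(representatives, centroid, fraction=0.2):
    """Move each representative the given fraction of the way to the centroid."""
    return [
        tuple(x + fraction * (c - x) for x, c in zip(rep, centroid))
        for rep in representatives
    ]


print(shrink_representatives([(10.0, 0.0), (0.0, 5.0)], centroid=(0.0, 0.0)))
```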
Dimensionality Reduction
Dimensionality Reduction - SVD
Dimensionality Reduction - CUR
• SVD results in U and V being dense, even when M is sparse.
• SVD costs O(n^3).
Dimensionality Reduction - CUR
• Choose r.
• Choose r rows and r columns of M.
• Their intersection is W.
• Run SVD on W (much smaller than M). W = XΣY’
• Compute Σ+, the Moore-Penrose pseudoinverse of Σ.
• Then, U = Y * (Σ+)^2 * X’
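A tiny worked case may help with the pseudoinverse step. The matrix W below is made up so that its SVD is trivial (X and Y are both the identity). It shows that Σ+ inverts only the non-zero singular values and leaves zeros in place:

```latex
\begin{aligned}
W &= \begin{pmatrix} 2 & 0 \\ 0 & 0 \end{pmatrix} = X \Sigma Y^{\mathsf T},
\qquad X = Y = I,
\qquad \Sigma = \begin{pmatrix} 2 & 0 \\ 0 & 0 \end{pmatrix} \\
\Sigma^{+} &= \begin{pmatrix} 1/2 & 0 \\ 0 & 0 \end{pmatrix} \\
U &= Y \, (\Sigma^{+})^{2} \, X^{\mathsf T} = \begin{pmatrix} 1/4 & 0 \\ 0 & 0 \end{pmatrix}
\end{aligned}
```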
Dimensionality Reduction – CUR
Choosing Rows and Columns
• Random, but with bias for importance.
• (Frobenius Norm)^2 = sum of squares of all elements
• Probability of picking a row or column:
Sum of squares for row or column / Sum of squares of all elements
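The selection probabilities can be computed directly from the matrix. A minimal sketch, assuming M is a non-zero dense matrix given as a list of rows; the function name is hypothetical.

```python
def selection_probabilities(M):
    """Return (row_probs, col_probs) for a matrix given as a list of rows."""
    total = sum(x * x for row in M for x in row)  # squared Frobenius norm
    row_probs = [sum(x * x for x in row) / total for row in M]
    col_probs = [sum(x * x for x in col) / total for col in zip(*M)]
    return row_probs, col_probs


row_probs, col_probs = selection_probabilities([[1, 2], [3, 4]])
print(row_probs, col_probs)
```

For this matrix the squared Frobenius norm is 30, so the rows get probabilities 5/30 and 25/30 and the columns get 10/30 and 20/30. Rows could then be drawn with replacement using, for example, `random.choices(range(len(M)), weights=row_probs, k=r)`.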
Dimensionality Reduction – CUR
Choosing Rows and Columns
• Same row / column may get picked (selection with replacement).
• Reduces rank.
• Can be combined: multiply vector by sqrt(k) if it appears k times.
• Compute pseudo-inverse as before, but transpose the result.
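Combining repeated picks can be sketched as below. The function name is hypothetical and the rows are assumed to be tuples of floats that have already been selected. The same idea applies to columns.

```python
import math
from collections import Counter


def merge_duplicate_rows(selected_rows):
    """Collapse repeated rows; a row picked k times is scaled by sqrt(k)."""
    counts = Counter(tuple(row) for row in selected_rows)
    return [
        tuple(math.sqrt(k) * x for x in row)
        for row, k in counts.items()
    ]


picked = [(1.0, 2.0), (3.0, 4.0), (1.0, 2.0)]
print(merge_duplicate_rows(picked))
```

Here the row (1.0, 2.0) was picked twice, so it appears once in the output, scaled by sqrt(2), while (3.0, 4.0) is unchanged.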
Thanks
• Mining of Massive Datasets
Leskovec, Rajaraman, Ullman
Coursera / Stanford Course
Book: http://www.mmds.org/ [free]
