MapReduce and the Art of “Thinking Parallel”
Shailesh Kumar
Third Leap, Inc.
Three I’s of a great product!
Interface: Intuitive | Functional | Elegant
Infrastructure: Storage | Computation | Network
Intelligence: Learn | Predict | Adapt | Evolve
Drowning in Data, Starving for Knowledge
ATATTAGGTTTTTACCTACCC
AGGAAAAGCCAACCAACCTC
GATCTCTTGTAGATCTGTTCT
CTAAACGAACTTTAAAATCTG
TGTAGCTGTCGCTCGGCTG
CATGCCTAGTGCACCTACGC
AGTATAAACAATAATAAATTTT
ACTGTCGTTGACAAGAAACG
AGTAACTCGTCCCTCTTCTG
CAGACTGCTTATTACGCGAC
CGTAAGCTAC…
How BIG is Big Data?
▪ 600 million tweets per day
▪ 100 hours per minute
▪ 800+ websites per minute
▪ 100 TB of data uploaded daily
▪ 3.5 billion queries per day
▪ 300 million active customers
How did we get here?
▪ Better Sensors
▪ Higher resolution, real-time, diverse measurements, …
▪ Faster Communication
▪ Network infrastructure, compression technologies, …
▪ Cheaper Storage
▪ Cloud-based storage, large warehouses, NoSQL databases
▪ Massive Computation
▪ Cloud computing, MapReduce/Hadoop parallel processing paradigms
▪ Intelligent Decisions
▪ Advances in Machine Learning and Artificial Intelligence
The Evolution of “Computing”
Parallel Computing Basics
▪ Data Parallelism (distributed computing)
▪ Lots of data → break it into “chunks”,
▪ Process each “chunk” of data in parallel,
▪ Combine results from each “chunk”
▪ MAPREDUCE = Data Parallelism
▪ Process Parallelism (data flow computing)
▪ Lots of stages → set up a process graph
▪ Pass data through all stages
▪ All stages running in parallel on different data
▪ Assembly line = process parallelism
Agenda
MAPREDUCE Background
Problem 1 – Similarity between all pairs of documents!
Problem 2 – Parallelizing K-Means clustering
Problem 3 – Finding all Maximal Cliques in a Graph
MAPREDUCE 101: A 4-stage Process
[Diagram: lots of data is split into Shards 1…N; each Map (1…K) processes N/K shards; Map → Combine (1…K) → Shuffle (1…K) → Reduce (1…R) → Output (1…R)]
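To make the four stages concrete, here is a minimal single-machine sketch in Python of the Shard → Map → Combine → Shuffle → Reduce flow. It only illustrates the data flow, not the API of Hadoop or any real framework; the names run_mapreduce, mapper, combiner and reducer are assumptions chosen for this sketch.

from collections import defaultdict

def run_mapreduce(shards, mapper, reducer, combiner=None):
    # Toy single-machine MapReduce: shards -> MAP -> COMBINE -> SHUFFLE -> REDUCE
    mapped = []
    for shard in shards:
        # MAP: each record of the shard produces zero or more (key, value) pairs
        pairs = [kv for record in shard for kv in mapper(record)]
        if combiner:
            # COMBINE: locally aggregate this shard's pairs by key before the shuffle
            local = defaultdict(list)
            for key, value in pairs:
                local[key].append(value)
            pairs = [(key, combiner(key, values)) for key, values in local.items()]
        mapped.append(pairs)
    # SHUFFLE: bring all values of the same key to the same place
    grouped = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            grouped[key].append(value)
    # REDUCE: combine the values of each key into one output value
    return {key: reducer(key, values) for key, values in grouped.items()}

The word-count task on the next slide can be run through this driver.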
MAPREDUCE 101: An example Task
▪ Count total frequency of all words on the web
▪ Total number of documents > 20 billion
▪ Total number of unique words > 20 million
▪ Non-Parallel / Linear Implementation:
for each document d on the Web:
    for each unique word w in d:
        DocCount(w, d) = # times w occurs in d
        WebCount(w) += DocCount(w, d)
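Expressed as MapReduce, the same job needs only a mapper that emits per-document word counts and a reducer that sums them. The sketch below is a hedged illustration that plugs into the toy run_mapreduce driver above; treating each record as a whitespace-separated document string is an assumption made for brevity.

def wordcount_mapper(document):
    # MAP: emit (word, count within this one document) for every unique word
    counts = {}
    for word in document.split():
        counts[word] = counts.get(word, 0) + 1
    return list(counts.items())

def wordcount_reducer(word, partial_counts):
    # COMBINE/REDUCE: sum the per-document (or per-shard) counts of this word
    return sum(partial_counts)

# Example with two shards of documents:
shards = [["the cat sat", "the dog"], ["the cat ran"]]
web_counts = run_mapreduce(shards, wordcount_mapper, wordcount_reducer,
                           combiner=wordcount_reducer)
# web_counts == {'the': 3, 'cat': 2, 'sat': 1, 'dog': 1, 'ran': 1}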
MAPREDUCE – MAP/COMBINE
Shard 1 → Map-1 output: (A,10) (B,7) (C,9) (D,3) (B,4) → Combine-1: (A,10) (B,11) (C,9) (D,3)
Shard 2 → Map-2 output: (A,3) (D,1) (C,4) (D,9) (B,6) → Combine-2: (A,3) (B,6) (C,4) (D,10)
Shard 3 → Map-3 output: (B,3) (D,5) (C,4) (A,6) (A,3) → Combine-3: (A,9) (B,3) (C,4) (D,5)
MAPREDUCE – Shuffle/Reduce
Combine outputs: (A,10) (B,11) (C,9) (D,3) | (A,3) (B,6) (C,4) (D,10) | (A,9) (B,3) (C,4) (D,5)
Shuffle:
  Reduce-1 input: (A,10) (A,3) (A,9) (C,9) (C,4) (C,4)
  Reduce-2 input: (B,11) (B,6) (B,3) (D,3) (D,10) (D,5)
Reduce:
  Reduce-1 output: (A,22) (C,17)
  Reduce-2 output: (B,20) (D,18)
Key Questions in MAPREDUCE
▪ Is the task really “data-parallelizable”?
▪ Tasks with strong sequential dependence (e.g. the Fibonacci series)
▪ Recursive tasks (e.g. Binary Search)
▪ What is the key-value pair output of the MAP step?
▪ Each Map processes only one data record at a time
▪ It can generate zero, one, or multiple key-value pairs
▪ How to combine the values of a key in the REDUCE step?
▪ The Reduce key is the same as the key of the Map output
▪ The Reduce function must be “order agnostic”
Other considerations
▪ Reliability/Robustness
▪ A processor or disk might go bad during the process
▪ Optimization/Efficiency
▪ Allocate CPUs near the data shards to reduce network overhead
▪ Scale/Parallelism
▪ Parallelism is linearly proportional to the number of machines
▪ Simplicity/Usability
▪ Just specify the Map task and the Reduce task and be done!
▪ Generality
▪ Lots of parallelizable tasks can be written in MapReduce
▪ With some creativity, many more than you can imagine!
Agenda
MAPREDUCE Background
Problem 1 – Similarity between all pairs of documents!
Problem 2 – Parallelizing K-Means clustering
Problem 3 – Finding all Maximal Cliques in a Graph
Similarity between all pairs of docs.
▪ Why bother?
▪ Document Clustering, Similar document search, etc.
▪ Document represented as a “Bag-of-Tokens”
▪ A weight associated with each token in the vocabulary.
▪ Most weights are zero – Sparsity
▪ Cosine Similarity between two documents:
$d_i = \{w_1^i, w_2^i, \dots, w_T^i\}$, $d_j = \{w_1^j, w_2^j, \dots, w_T^j\}$
$\mathrm{Sim}(d_i, d_j) = \sum_{t=1}^{T} w_t^i \times w_t^j$
Non-Parallel / Linear Implementation
For each document $d_i$:
  For each document $d_j$ ($j > i$):
    $\mathrm{Sim}(d_i, d_j) = \sum_{t=1}^{T} w_t^i \times w_t^j$

Complexity $= O(D^2 T \sigma)$
$\sigma$ = sparsity factor $= 10^{-5}$ (average fraction of the vocabulary present per document)
$D = O(10\text{B})$, $T = O(10\text{M})$
Complexity $= O(10^{20+7-5}) = O(10^{22})$
Toy Example for doc-doc similarity: a classic “Join”
Documents = {W, X, Y, Z}, Words = {a, b, c, d, e}

Input:
W → {⟨a,1⟩, ⟨b,2⟩, ⟨e,5⟩}
X → {⟨a,3⟩, ⟨c,4⟩, ⟨d,5⟩}
Y → {⟨b,6⟩, ⟨c,7⟩, ⟨d,8⟩}
Z → {⟨a,9⟩, ⟨e,10⟩}

Output:
(W,X) → Sim(W,X) = 3
(W,Y) → Sim(W,Y) = 12
(W,Z) → Sim(W,Z) = 59
(X,Y) → Sim(X,Y) = 68
(X,Z) → Sim(X,Z) = 27
(Y,Z) → Sim(Y,Z) = 0
Reverse Indexing to the rescue
First convert the data to a reverse (inverted) index.

Documents (forward index):
W → {⟨a,1⟩, ⟨b,2⟩, ⟨e,5⟩}
X → {⟨a,3⟩, ⟨c,4⟩, ⟨d,5⟩}
Y → {⟨b,6⟩, ⟨c,7⟩, ⟨d,8⟩}
Z → {⟨a,9⟩, ⟨e,10⟩}

Reverse index (word → ⟨document, weight⟩ postings):
a → {⟨W,1⟩, ⟨X,3⟩, ⟨Z,9⟩}
b → {⟨W,2⟩, ⟨Y,6⟩}
c → {⟨X,4⟩, ⟨Y,7⟩}
d → {⟨X,5⟩, ⟨Y,8⟩}
e → {⟨W,5⟩, ⟨Z,10⟩}
Key/Value for the MAP-Step
Each Map processes one reverse-index entry and, for every pair of documents in its posting list, emits key = document pair, value = product of the two weights:

a → {⟨W,1⟩, ⟨X,3⟩, ⟨Z,9⟩}  emits  (W,X) → 3, (W,Z) → 9, (X,Z) → 27
b → {⟨W,2⟩, ⟨Y,6⟩}   emits  (W,Y) → 12
c → {⟨X,4⟩, ⟨Y,7⟩}   emits  (X,Y) → 28
d → {⟨X,5⟩, ⟨Y,8⟩}   emits  (X,Y) → 40
e → {⟨W,5⟩, ⟨Z,10⟩}  emits  (W,Z) → 50

After the shuffle (grouped by document pair):
(W,X) → 3
(W,Y) → 12
(W,Z) → 9, 50
(X,Y) → 40, 28
(X,Z) → 27
Value combining in the REDUCE-Step
The Reduce step sums the partial products received for each document pair:

(W,X): 3        → Sim(W,X) = 3
(W,Y): 12       → Sim(W,Y) = 12
(W,Z): 9 + 50   → Sim(W,Z) = 59
(X,Y): 40 + 28  → Sim(X,Y) = 68
(X,Z): 27       → Sim(X,Z) = 27
(Y,Z): no shared words, so the pair never appears → Sim(Y,Z) = 0
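The whole join can be written as one MapReduce job over the reverse index. The sketch below reuses the toy run_mapreduce driver from earlier and assumes each input record is a (word, [(document, weight), …]) posting; the function names are illustrative, not from the talk.

from itertools import combinations

def pairsim_mapper(posting):
    # MAP: one reverse-index entry -> ((doc_i, doc_j), w_i * w_j) for every document pair
    _word, doc_weights = posting
    return [((di, dj), wi * wj)
            for (di, wi), (dj, wj) in combinations(sorted(doc_weights), 2)]

def pairsim_reducer(doc_pair, partial_products):
    # REDUCE: Sim(d_i, d_j) = sum of the partial products over all shared words
    return sum(partial_products)

# Toy example from the slides (one shard holding the whole reverse index):
index = [("a", [("W", 1), ("X", 3), ("Z", 9)]),
         ("b", [("W", 2), ("Y", 6)]),
         ("c", [("X", 4), ("Y", 7)]),
         ("d", [("X", 5), ("Y", 8)]),
         ("e", [("W", 5), ("Z", 10)])]
sims = run_mapreduce([index], pairsim_mapper, pairsim_reducer)
# sims == {('W','X'): 3, ('W','Z'): 59, ('X','Z'): 27, ('W','Y'): 12, ('X','Y'): 68}
# Pairs that never share a word, such as (Y, Z), simply do not appear (similarity 0).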
Agenda
MAPREDUCE Background
Problem 1 – Similarity between all pairs of documents!
Problem 2 – Parallelizing K-Means clustering
Problem 3 – Finding all Maximal Cliques in a Graph
K-Means Clustering

Centers → assignments:
$\delta_{n,k}^{(t+1)} = \mathbb{1}\!\left(k = \arg\min_{j=1\dots K} \Delta\!\left(x_n, m_j^{(t)}\right)\right)$

Assignments → centers:
$m_k^{(t+1)} \leftarrow \dfrac{\sum_{n=1}^{N} \delta_{n,k}^{(t)} x_n}{\sum_{n=1}^{N} \delta_{n,k}^{(t)}}$

[Figure: points assigned to the nearest of two centers $m_1^{(t)}, m_2^{(t)}$ ($\delta_{n,1}^{(t)}=1$ or $\delta_{n,2}^{(t)}=1$), and the updated centers $m_1^{(t+1)}, m_2^{(t+1)}$]
K-means clustering 101 – Non-parallel

E-Step – Update assignments from centers:
$\pi_n^{(t)} \leftarrow \arg\min_{k=1\dots K} \Delta\!\left(x_n, m_k^{(t)}\right)$
Cost: $O(NKD)$, where $N$ = number of data points, $K$ = number of clusters, $D$ = number of dimensions

M-Step – Update centers from cluster assignments:
$m_k^{(t+1)} \leftarrow \dfrac{\sum_{n=1}^{N} \delta\!\left(\pi_n^{(t)} = k\right) x_n}{\sum_{n=1}^{N} \delta\!\left(\pi_n^{(t)} = k\right)}$
Cost: $O(ND)$, where $N$ = number of data points, $D$ = number of dimensions
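For reference, a compact NumPy sketch of this non-parallel loop is shown below; the fixed iteration count and the choice to keep a center unchanged when its cluster is empty are simplifying assumptions of this sketch.

import numpy as np

def kmeans(X, centers, num_iters=20):
    # Plain (non-parallel) K-means: X is (N, D), centers is a (K, D) float array
    for _ in range(num_iters):
        # E-step, O(NKD): assign each point to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)  # (N, K)
        assign = dists.argmin(axis=1)                                        # (N,)
        # M-step, O(ND): each center becomes the mean of its assigned points
        for k in range(len(centers)):
            members = X[assign == k]
            if len(members) > 0:  # assumption: leave the center as-is if its cluster is empty
                centers[k] = members.mean(axis=0)
    return centers, assign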
K-Means MapReduce
Iterative MapReduce: update the cluster centers in every iteration.

Map (given the current centers $\{m_k^{(t)}\}_{k=1}^{K}$): for each point $x_n$, compute the assignment
$\pi_n^{(t)} = \arg\min_{k=1\dots K} \Delta\!\left(x_n, m_k^{(t)}\right)$
and emit key $= \pi_n^{(t)}$, value $= x_n$.

Shuffle: group the points by their assigned cluster $\pi_n^{(t)}$.

Reduce: for each cluster $k$, compute the new center
$m_k^{(t+1)} \leftarrow \dfrac{\sum_{n=1}^{N} \delta\!\left(\pi_n^{(t)} = k\right) x_n}{\sum_{n=1}^{N} \delta\!\left(\pi_n^{(t)} = k\right)}$
yielding the updated centers $\{m_k^{(t+1)}\}_{k=1}^{K}$ for the next iteration.
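One iteration of this scheme can again be phrased for the toy driver above. The sketch below follows the slide's key/value choice (key = assigned cluster, value = the point); it is an assumption of this sketch that the points fit in the driver's in-memory shards, and a real job would typically emit (sum, count) pairs so that a combiner can be used.

import numpy as np

def make_kmeans_mapper(centers):
    # MAP (with the current centers broadcast to every mapper):
    # key = index of the nearest center, value = the point itself
    centers = np.asarray(centers, dtype=float)
    def mapper(x):
        x = np.asarray(x, dtype=float)
        k = int(np.linalg.norm(centers - x, axis=1).argmin())
        return [(k, x)]
    return mapper

def kmeans_reducer(k, points):
    # REDUCE: the new center m_k is the mean of all points assigned to cluster k
    return np.mean(points, axis=0)

# One iteration of the iterative MapReduce:
# new_centers = run_mapreduce(shards_of_points,
#                             make_kmeans_mapper(old_centers),
#                             kmeans_reducer)
# Repeat until the centers stop moving.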
Agenda
MAPREDUCE Background
Problem 1 – Similarity between all pairs of documents!
Problem 2 – Parallelizing K-Means clustering
Problem 3 – Finding all Maximal Cliques in a Graph
Cliques: Useful structures in Graphs

Nodes → edges between them:
• People → Co-social
• Products → Co-purchase
• Movies → Co-like
• Keywords → Co-occurrence
• Documents → Similarity
• Genes → Co-expression
• Neurons → Co-firing

Example Concepts in IMDB (keyword cliques):
• guitarist, rock-music, guitar, song, musician, rock-band, singer, electric-guitar, singing
• university, school, college, student, classroom, school-teacher, teacher, teacher-student-relationship
• judge, lawsuit, trial, lawyer, false-persecution, perjury, courtroom
Graph, Cliques, and Maximal Cliques
Clique = a “fully connected” sub-graph
Maximal Clique = a clique with no “super-clique”
Finding all Maximal Cliques is NP-hard: $O(3^{n/3})$
[Figure: example graph on vertices {a, b, c, d, e, f, g, h} with edges a–b, a–e, b–e, b–c, b–f, b–g, c–d, c–f, c–g, f–g; h is isolated]
Neighborhood of a Clique
f is connected to BOTH b and c
g is connected to BOTH b and c
⇒ N({b,c}) = {f,g}

CLIQUEMAP: Clique (key) → its Neighborhood (value)
{a} → {b,e}
{a,b} → {e}
{b,c} → {f,g}
{b,c,f} → {g}
{h} → ∅
{c,d} → ∅
{a,b,e} → ∅
{b,c,f,g} → ∅
Growing Cliques from the CliqueMap
{b,c,f} → {g}
{b,c,f} is a clique, and g is connected to all of its members
⇒ {b,c,f,g} is a clique
MapReduce for Maximal Cliques
CliqueMap of size k → CliqueMap of size k + 1 in each iteration.

Iteration 1 – Input: Adjacency List
{a} → {b,e}
{b} → {a,c,e,f,g}
{c} → {b,d,f,g}
{d} → {c}
{e} → {a,b}
{f} → {b,c,g}
{g} → {b,c,f}
{h} → ∅

Iteration 2
{a,b} → {e}
{a,e} → {b}
{b,c} → {f,g}
{b,e} → {a}
{b,f} → {c,g}
{b,g} → {c,f}
{c,f} → {b,g}
{c,g} → {b,f}
{f,g} → {b,c}
{c,d} → ∅

Iteration 3
{a,b,e} → ∅
{b,c,f} → {g}
{b,c,g} → {f}
{b,f,g} → {c}
{c,f,g} → {b}

Iteration 4
{b,c,f,g} → ∅
Key/Value for the MAP-Step
MAP: each CliqueMap entry C → N(C) emits, for every vertex v in N(C), the key-value pair C ∪ {v} ⇒ N(C) \ {v}:

{a} → {b,e}        emits  {a,b} ⇒ {e},  {a,e} ⇒ {b}
{e} → {a,b}        emits  {a,e} ⇒ {b},  {b,e} ⇒ {a}
{b} → {a,c,e,f,g}  emits  {a,b} ⇒ {c,e,f,g},  {b,c} ⇒ {a,e,f,g},  {b,e} ⇒ {a,c,f,g},  {b,f} ⇒ {a,c,e,g},  {b,g} ⇒ {a,c,e,f}

SHUFFLE (group by key):
{a,e} ⇒ {b}, {b}
{a,b} ⇒ {e}, {c,e,f,g}
{b,e} ⇒ {a,c,f,g}, {a}
Value combining in the REDUCE-Step
Reduce = Intersection

SHUFFLE output:
{a,e} ⇒ {b}, {b}
{a,b} ⇒ {e}, {c,e,f,g}
{b,e} ⇒ {a,c,f,g}, {a}

REDUCE:
{a,e} → {b} ∩ {b} = {b}
{a,b} → {e} ∩ {c,e,f,g} = {e}
{b,e} → {a,c,f,g} ∩ {a} = {a}
Value combining in the REDUCE-Step (continued)

MAP:
{c} → {b,d,f,g}    emits  {c,d} ⇒ {b,f,g}  and  {b,c} ⇒ {d,f,g}  (among others)
{d} → {c}          emits  {c,d} ⇒ ∅
{b} → {a,c,e,f,g}  emits  {b,c} ⇒ {a,e,f,g}  (among others)

REDUCE:
{c,d} → {b,f,g} ∩ ∅ = ∅
{b,c} → {a,e,f,g} ∩ {d,f,g} = {f,g}
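Putting the pieces together, one growth iteration (CliqueMap of size k → size k + 1) can be sketched for the toy driver as a mapper that proposes extended cliques and a reducer that intersects their candidate neighborhoods. The frozenset encoding of cliques and the way results are collected below are assumptions of this sketch, not the talk's exact implementation.

def clique_mapper(entry):
    # MAP: for an entry C -> N(C), emit (C ∪ {v}, N(C) \ {v}) for every v in N(C)
    clique, nbrs = entry
    return [(clique | {v}, nbrs - {v}) for v in nbrs]

def clique_reducer(clique, neighbor_sets):
    # REDUCE: the neighborhood of the grown clique = intersection of all values received
    result = set(neighbor_sets[0])
    for s in neighbor_sets[1:]:
        result &= s
    return result

def grow(cliquemap):
    # One MapReduce iteration: CliqueMap of size k -> CliqueMap of size k + 1
    return run_mapreduce([list(cliquemap.items())], clique_mapper, clique_reducer)

# Iteration 1: the adjacency list, viewed as a CliqueMap of singleton cliques
adj = {frozenset('a'): set('be'),   frozenset('b'): set('acefg'),
       frozenset('c'): set('bdfg'), frozenset('d'): set('c'),
       frozenset('e'): set('ab'),   frozenset('f'): set('bcg'),
       frozenset('g'): set('bcf'),  frozenset('h'): set()}
step2 = grow(adj)      # e.g. step2[frozenset('bc')] == {'f', 'g'}
step3 = grow(step2)    # e.g. step3[frozenset('bcf')] == {'g'}
step4 = grow(step3)    # step4 == {frozenset('bcfg'): set()}
# Entries whose value is the empty set (e.g. {c,d}, {a,b,e}, {b,c,f,g}) cannot be
# extended any further: they are the maximal cliques found at that size.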
“Art of Thinking Parallel” is about
▪ Transforming the Input Data appropriately
▪ e.g. Reverse Indexing (doc-doc similarity)
▪ Breaking the problem into smaller ones
▪ e.g. Iterative MapReduce (clustering)
▪ Designing the Map step - Key/Value output
▪ e.g. CliqueMaps in Maximal Cliques
▪ Designing the Reduce step – Combining the values of a key
▪ e.g. Intersections in Maximal Cliques
