Thinking in MapReduce
Ryan Brush
@ryanbrush
We programmers have had it pretty good
Hardware has scaled up faster than our problem sets
[Figure: Software Engineers, Moore’s Law]
But the party is ending (or at least changing)
Data is growing faster than we can scale individual machines
So we have to spread our work across many machines
This is a big deal in health care
- Fragmented information
- Spread across many systems
- No one has the complete picture
We need to put the picture back together again
- Better-informed decisions
- Understand and improve the health of populations
- Reduce systematic friction
Chart Search
- Information extraction
- Semantic markup of documents
- Related concepts in search results
Medical Alerts
- Detect health risks in incoming data
- Notify clinicians to address those risks
- Quickly include new knowledge
Population Health
- Securely bring together health data
- Identify opportunities to improve care
- Support application of improvements
- Close the loop
The Unreasonable Effectiveness of Data
Simple models with lots of data almost always outperform complex models with less data
Peter Norvig, http://www.youtube.com/watch?v=yvDCzhbjYWs
So how can we tackle such large data sets?
Can we adapt what has worked historically?
After all, Relational Databases are Awesome
- Atomic, transactional updates
- Declarative queries
- Guaranteed consistency
- Easy to reason about
- Long track record of success
…so use them!
But…
Those advantages have a cost
Global, atomic, consistent state means global coordination
Coordination does not scale linearly
The costs of coordination
Remember the network effect?
2 nodes = 1 channel
5 nodes = 10 channels
12 nodes = 66 channels
25 nodes = 300 channels
The result is we don’t scale linearly as we add nodes
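The channel counts on this slide match the handshake formula: n nodes that all talk to each other need n(n-1)/2 pairwise channels. A minimal Python sketch, assuming full pairwise connectivity (the function name `channels` is ours, not from the talk):

```python
def channels(nodes: int) -> int:
    """Number of pairwise channels among `nodes` fully connected nodes."""
    return nodes * (nodes - 1) // 2


# Reproduce the node counts listed on the slide.
for n in (2, 5, 12, 25):
    print(f"{n} nodes = {channels(n)} channels")
```

The count grows roughly with the square of the node count, which is why coordination cost outpaces the capacity gained by adding nodes.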
Independence → Parallelizable
Parallelizable → Scalable
“Shared Nothing” architectures are the most scalable…
…but most real-world problems require us to share something…
…so our designs usually have a parallel part and a serial part
The key is to make sure the vast majority of our work in the cloud is independent and parallelizable.
Amdahl’s Law
S: speed improvement
P: ratio of the problem that can be parallelized
N: number of processors
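With the symbols defined above, Amdahl's Law is commonly stated as S = 1 / ((1 - P) + P / N). A small Python sketch (the function name `amdahl_speedup` and the 95% figure are illustrative, not from the talk) shows how quickly the serial part comes to dominate:

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Speedup S for a parallelizable fraction p running on n processors."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 / ((1.0 - p) + p / n)


# Assume 95% of the work can be parallelized and vary the processor count.
for n in (1, 10, 100, 1000):
    print(n, round(amdahl_speedup(0.95, n), 2))
```

With 95% of the work parallelizable, the speedup can never exceed 1 / (1 - 0.95) = 20, no matter how many processors are added.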
MapReduce Primer
[Diagram: Input Data is divided into Split 1, Split 2, Split 3, ... Split N. Map Phase: Mapper 1, Mapper 2, Mapper 3, ... Mapper N. Shuffle. Reduce Phase: Reducer 1, Reducer 2, ... Reducer N.]
MapReduce Example: Word Count
[Diagram: Books feed the Map Phase, where each mapper counts words per book. Shuffle. Reduce Phase: reducers sum words for the ranges A-C, D-E, ... W-Z.]
The network is a shared resource
Too much data to move to computation
So move computation to data
MapReduce Data Locality
[Diagram: Input Data is divided into Split 1, Split 2, Split 3, ... Split N. Map Phase: Mapper 1, Mapper 2, Mapper 3, ... Mapper N. Shuffle. Reduce Phase: Reducer 1, Reducer 2, ... Reducer N. A legend symbol denotes a physical machine.]
Data locality only guaranteed in the Map phase
So do as much work as possible there
Some jobs have no reducer at all!
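One way to act on this advice is to aggregate inside the mapper, before anything crosses the network. Hadoop's combiner is a related mechanism. The sketch below is plain Python with hypothetical function names, comparing how many records each approach would hand to the shuffle:

```python
from collections import Counter


def naive_map(text):
    """Emit one (word, 1) pair per occurrence; every pair is shuffled."""
    return [(word, 1) for word in text.split()]


def aggregating_map(text):
    """Pre-sum counts locally so only one pair per distinct word is shuffled."""
    return list(Counter(text.split()).items())


split = "to be or not to be"
print(len(naive_map(split)), "records shuffled without local aggregation")
print(len(aggregating_map(split)), "records shuffled with local aggregation")
```

The reducer sums the values it receives in both cases, so the final counts are the same. Only the volume of data moved changes.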
MapReduce is a building block
So let’s build higher-level functions
Grouping and Aggregating
[Diagram: Books feed the Map Phase, where each mapper counts words per book. Shuffle. Reduce Phase: reducers sum words for the ranges A-C, D-E, ... W-Z.]
Joins
[Diagram: Data Set 1 and Data Set 2 are each divided into Split 1, Split 2, Split 3. Map Phase: each split is grouped by key. Shuffle. Reduce Phase: Reducer 1, Reducer 2, ... Reducer N.]
[Diagram: the same layout with Persons and Visits as the two data sets, each split grouped by person id in the Map Phase.]
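The Persons and Visits join can be sketched as a reduce-side join in plain Python: mappers key every record by person id and tag its source, the shuffle brings matching ids together, and the reducer pairs them up. The records, field names, and function names are hypothetical:

```python
from collections import defaultdict

# Hypothetical records; the field names are made up for illustration.
persons = [{"person_id": 1, "name": "Ann"}, {"person_id": 2, "name": "Bob"}]
visits = [
    {"person_id": 1, "reason": "checkup"},
    {"person_id": 2, "reason": "flu"},
    {"person_id": 1, "reason": "follow-up"},
]


def map_tagged(records, tag):
    """Map phase: key each record by person id and tag its source."""
    return [(r["person_id"], (tag, r)) for r in records]


def shuffle(pairs):
    """Shuffle: bring all records with the same person id together."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_join(person_id, tagged):
    """Reduce phase: pair each person record with each of their visits."""
    people = [r for tag, r in tagged if tag == "person"]
    seen = [r for tag, r in tagged if tag == "visit"]
    return [(p["name"], v["reason"]) for p in people for v in seen]


def join(persons, visits):
    pairs = map_tagged(persons, "person") + map_tagged(visits, "visit")
    joined = []
    for person_id, tagged in shuffle(pairs).items():
        joined.extend(reduce_join(person_id, tagged))
    return joined


print(join(persons, visits))
```

Tagging each record with its source is what lets a single reducer tell the two data sets apart after they have been mixed by the shuffle.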
Map-Side Joins
[Diagram: Data Set 1 is divided into Split 1, Split 2, Split 3, handled by Mapper 1, Mapper 2, Mapper 3. Each mapper is shown with its own copy of Data set 2. Map Phase, then Shuffle, then Reduce Phase with Reducer 1, Reducer 2, ...]
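A map-side join can be sketched as follows, assuming the second data set is small enough to copy to every mapper, which is our reading of the diagram. The data and names are hypothetical:

```python
# Hypothetical data: a large visit log divided into splits, and a small
# lookup table that is cheap to copy to every mapper.
visit_splits = [
    [(1, "checkup"), (2, "flu")],
    [(1, "follow-up"), (3, "x-ray")],
]
person_names = {1: "Ann", 2: "Bob"}


def map_join(split, lookup):
    """Join one split against the in-memory copy; no shuffle is needed."""
    return [(lookup[pid], reason) for pid, reason in split if pid in lookup]


# Each split is joined independently; visits with no matching person are dropped.
joined = [row for split in visit_splits for row in map_join(split, person_names)]
print(joined)
```

Because every mapper holds the whole lookup table, the join itself needs no shuffle. The trade-off is that the copied data set must fit in each mapper's memory.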
Filtering
Map or reduce functions can simply discard data we’re not interested in
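A filter is the simplest map-only job: the mapper emits a record only when a predicate accepts it. A minimal sketch in plain Python, where the readings and the threshold are arbitrary values chosen for illustration:

```python
def filter_map(records, keep):
    """Map-only filter: emit a record only when the predicate accepts it."""
    for record in records:
        if keep(record):
            yield record


# Hypothetical (id, value) readings and an arbitrary threshold of 140.
readings = [(1, 118), (2, 165), (3, 142), (4, 110)]
elevated = list(filter_map(readings, lambda r: r[1] >= 140))
print(elevated)
```

Since each record is judged on its own, this job needs no shuffle and no reducer.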
And Others
More sophisticated patterns are composable:
- Distinct
- Sort
- Binning
- Top N
- ...
Chain Jobs Together
Large-scale joins must have a reduce phase
Multiple joins or group-by operations mean multiple jobs
Normalize Data → Join Related Items → Compute Summary → Output
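Chaining can be sketched as a list of functions where each job's output becomes the next job's input. This is a plain-Python illustration of the normalize, join, summarize flow. The job bodies and data are hypothetical stand-ins for real MapReduce jobs:

```python
from collections import defaultdict
from functools import reduce


def normalize(records):
    """Job 1: clean up keys so later jobs can match them."""
    return [(name.strip().lower(), value) for name, value in records]


def join_related(records):
    """Job 2: group values that share a key (needs a shuffle in MapReduce)."""
    grouped = defaultdict(list)
    for key, value in records:
        grouped[key].append(value)
    return grouped


def summarize(grouped):
    """Job 3: compute one summary value per key."""
    return {key: sum(values) for key, values in grouped.items()}


def run_pipeline(data, jobs):
    """Feed the output of each job into the next one."""
    return reduce(lambda current, job: job(current), jobs, data)


raw = [(" Ann", 2), ("ann ", 3), ("Bob", 1)]
print(run_pipeline(raw, [normalize, join_related, summarize]))
```

Keeping each job a separate function mirrors how each stage would run as its own job on a cluster, with the intermediate results written out between stages.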
Codified in High-Level Libraries
Hive, Pig, Cascading, and Crunch provide simple means to use these patterns
Apache Crunch
The era of writing MapReduce by hand is over
How do we use these tools?
Start with the question you want to ask, then transform the data to answer it.
output = transform(input)
Functional over Place-Oriented Programming
Work with data holistically
Re-running functions simpler to reason about than updating state
Hadoop makes this possible at scale
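The contrast between re-running a function and updating state can be sketched in a few lines of Python. The data and the `summarize` function are hypothetical. When a correction arrives, the whole input is transformed again instead of patching the earlier result in place:

```python
def summarize(events):
    """Pure transform: the output depends only on the input events."""
    totals = {}
    for key, amount in events:
        totals[key] = totals.get(key, 0) + amount
    return totals


events = [("a", 1), ("b", 2), ("a", 3)]
first = summarize(events)

# A correction arrives: rather than patching `first` in place, fix the
# input and re-run the same function over everything.
corrected = [("a", 1), ("b", 2), ("a", 4)]
second = summarize(corrected)

print(first, second)
```

Because `summarize` keeps no state between calls, running it again over the same input always gives the same answer, which makes a re-run easy to verify.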
Don’t be afraid to re-process the world
Something’s wrong, we’re above 95% usage!
-Traditional System Administrator
Something’s wrong, we’re below 95% usage!
-Hadoop System Administrator
Maximize Resource Usage
From Databases to Dataspaces
(Also referred to as Data Lakes)
Franklin, Halevy, Maier, http://homes.cs.washington.edu/~alon/files/dataspacesDec05.pdf
Bring all of your data together...
...structured or unstructured...
...transform it with unlimited computation...
...at any time for any new need.
And offer a variety of interactive access patterns.
SQL, Search, Domain-Specific Apps
Hadoop is becoming an adaptive, multi-purpose platform.
The gap between asking novel questions and our ability to answer them is closing.
Questions?
@ryanbrush
https://engineering.cerner.com
We’re hiring!

Thinking in MapReduce - StampedeCon 2013