8/9/2013 © MapR Confidential 1
R, Hadoop, and MapR
The bad old days (i.e. now)
• Hadoop is a silo
• HDFS isn’t a normal file system
• Hadoop doesn’t really like C++
• R is limited
• One machine, one memory space
• Isn’t there any way we can just get along?
The white knight
• MapR changes things
• Lots of new stuff like snapshots, NFS
• All you need to know, you already know
• NFS provides cluster-wide file access
• Everything works the way you expect
• Performance high enough to use as a message bus
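A minimal sketch of what "everything works the way you expect" could look like from R: files on an NFS-mounted cluster are read and written with the same base-R calls used for local files. The environment variable `CLUSTER_DIR` and the file name are hypothetical, and `tempdir()` stands in for the mount point so the snippet runs anywhere.

```r
# Hypothetical mount point: an NFS-exported cluster would appear as an
# ordinary directory. tempdir() is only a stand-in for that directory.
cluster_dir <- Sys.getenv("CLUSTER_DIR", unset = tempdir())

# Write and read with the same base-R calls used for any local file.
path <- file.path(cluster_dir, "features.csv")
write.csv(data.frame(id = 1:3, score = c(0.5, 1.5, 2.5)), path,
          row.names = FALSE)
features <- read.csv(path)
print(features)
```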
Example, out-of-core SVD
• SVD provides compressed matrix form
• Based on sum of rank-1 matrices
$$A = \sigma_1 u_1 v_1' + \sigma_2 u_2 v_2' + \epsilon$$
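The rank-1 sum can be tried directly with base R's `svd`. This is only an illustrative sketch on random data (the matrix and its size are assumptions, not from the talk): it builds the first two rank-1 terms and checks that the residual is exactly what the remaining singular values account for.

```r
set.seed(1)
a <- matrix(rnorm(20 * 10), nrow = 20)   # illustrative data only
s <- svd(a)

# Sum of the first two rank-1 terms sigma_i * u_i * v_i'
a2 <- s$d[1] * s$u[, 1] %o% s$v[, 1] +
      s$d[2] * s$u[, 2] %o% s$v[, 2]

# The residual e; its Frobenius norm matches the discarded singular values
e <- a - a2
frob_err <- sqrt(sum(e^2))
tail_norm <- sqrt(sum(s$d[-(1:2)]^2))
print(c(frob_err, tail_norm))
```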
More on SVD
• SVD provides a very nice basis
$$Ax = A \sum_i a_i v_i = \Big[ \sum_j \sigma_j u_j v_j' \Big] \Big[ \sum_i a_i v_i \Big] = \sum_i a_i \sigma_i u_i$$
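A quick numerical check of this identity, as a sketch on random data (the matrix, its size, and the name `alpha` for the coefficients a_i are illustrative assumptions): expanding x in the right singular vectors turns multiplication by A into a scaling of each coefficient by its singular value.

```r
set.seed(2)
a <- matrix(rnorm(8 * 5), nrow = 8)   # illustrative data only
s <- svd(a)

alpha <- rnorm(5)              # the coefficients a_i
x <- s$v %*% alpha             # x = sum_i a_i v_i

lhs <- a %*% x                 # A x computed directly
rhs <- s$u %*% (s$d * alpha)   # sum_i a_i sigma_i u_i
print(max(abs(lhs - rhs)))     # differs from zero only by rounding
```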
• And a nifty approximation property
$$Ax = \sigma_1 a_1 u_1 + \sigma_2 a_2 u_2 + \sum_{i>2} \sigma_i a_i u_i$$
$$\|\epsilon\|^2 \le \sum_{i>2} \sigma_i^2$$
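One way to read the bound, written out as a short derivation. Two assumptions are made here that the slide does not state: that ε denotes the tail term of the sum, and that x has unit norm. The step relies only on the u_i being orthonormal.

```latex
\epsilon = \sum_{i>2} \sigma_i a_i u_i
\quad\Longrightarrow\quad
\|\epsilon\|^2 = \sum_{i>2} \sigma_i^2 a_i^2 \le \sum_{i>2} \sigma_i^2,
\qquad \text{since } a_i^2 \le \sum_j a_j^2 = \|x\|^2 = 1 .
```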
Also known as …
• Latent Semantic Indexing
• PCA
• Eigenvectors
An application, approximate translation
• Translation distributes over concatenation
• But counting turns concatenation into addition
• This means that translation is linear!
$$T(s_1 \mid s_2) = T(s_1) \mid T(s_2)$$
$$k(s_1 \mid s_2) = k(s_1) + k(s_2)$$
$$k(T(s_1 \mid s_2)) = k(T(s_1)) + k(T(s_2))$$
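The counting step can be illustrated with a small sketch. It covers only the second identity (counts of a concatenation are the sum of the counts), not translation itself; the helper `count_words`, the vocabulary, and the sentences are hypothetical.

```r
# Hypothetical helper: word counts over a fixed vocabulary, playing the
# role of k(s) in the identities above.
count_words <- function(s, vocab) {
  words <- strsplit(s, " ", fixed = TRUE)[[1]]
  as.vector(table(factor(words, levels = vocab)))
}

vocab <- c("the", "cat", "sat", "on", "mat")
s1 <- "the cat sat"
s2 <- "on the mat"

k1 <- count_words(s1, vocab)
k2 <- count_words(s2, vocab)
k12 <- count_words(paste(s1, s2), vocab)   # counts of s1 | s2

print(all(k12 == k1 + k2))
```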
ish
Traditional computation
• Products of A are dominated by large singular values and corresponding vectors
• Subtracting these dominant singular values allows the next ones to appear
• Lanczos method, generally Krylov sub-space
$$A (A'A)^n = U \Sigma^{2n+1} V'$$
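The first two bullets can be sketched with plain power iteration plus deflation. This is a simpler relative of the Lanczos method, swapped in for brevity, and not the method the talk refers to. The helper `top_singular`, the iteration count, and the test matrix with singular values 10, 5, 2, 1 are all illustrative assumptions.

```r
# Power iteration on A'A: repeated products are dominated by the largest
# singular value, so v converges to the leading right singular vector.
top_singular <- function(a, iters = 50) {
  v <- matrix(rnorm(ncol(a)), ncol = 1)
  for (i in seq_len(iters)) {
    v <- crossprod(a, a %*% v)   # v <- A'A v
    v <- v / sqrt(sum(v^2))      # renormalize to avoid overflow
  }
  u <- a %*% v
  sigma <- sqrt(sum(u^2))
  list(d = sigma, u = u / sigma, v = v)
}

set.seed(3)
# Test matrix with known singular values 10, 5, 2, 1
u0 <- qr.Q(qr(matrix(rnorm(30 * 4), nrow = 30)))
v0 <- qr.Q(qr(matrix(rnorm(6 * 4), nrow = 6)))
a <- u0 %*% diag(c(10, 5, 2, 1)) %*% t(v0)

first <- top_singular(a)
# Subtract the dominant component so the next one can appear
deflated <- a - first$d * first$u %*% t(first$v)
second <- top_singular(deflated)
print(c(first$d, second$d))
```

Each pass of the loop needs a full sweep over A, which is the iteration cost that the following slides set out to avoid.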
But …
The gotcha
• Iteration in Hadoop is death
• Huge process invocation costs
• Lose all memory residency of data
• Total lost cause
Randomness to the rescue
• To save the day, run all iterations at the same time
$$Y = A\Omega$$
$$QR = Y$$
$$B = Q'A$$
$$U \Sigma V' = B$$
$$(QU)\, \Sigma V' \approx A$$
In R
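lsa <- function(a, k, p) {
  m <- ncol(a)
  # Random projection y = a * Omega with k + p columns (p extra columns)
  y <- a %*% matrix(rnorm(m * (k + p)), nrow = m)
  # Orthonormal basis Q for the range of y
  y.qr <- qr(y)
  # Project a onto that basis: b = Q'a
  b <- t(qr.Q(y.qr)) %*% a
  # QR of b' reduces the SVD to a small (k+p) x (k+p) problem
  b.qr <- qr(t(b))
  b.svd <- svd(t(qr.R(b.qr)))
  # Map the small factors back and keep the leading k components
  list(u = qr.Q(y.qr) %*% b.svd$u[, 1:k],
       d = b.svd$d[1:k],
       v = qr.Q(b.qr) %*% b.svd$v[, 1:k])
}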
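A usage sketch, under stated assumptions. So that the snippet runs on its own, it restates the same steps in a compact helper named `rand_svd` (a hypothetical name) that takes the SVD of B directly instead of going through a second QR. The test matrix is exactly rank 5, chosen so the result can be checked against base R's `svd`.

```r
# Compact restatement of the randomized steps (Y = A Omega, QR = Y, B = Q'A)
rand_svd <- function(a, k, p) {
  omega <- matrix(rnorm(ncol(a) * (k + p)), nrow = ncol(a))
  q <- qr.Q(qr(a %*% omega))     # orthonormal basis for the range of A Omega
  s <- svd(crossprod(q, a))      # small SVD of B = Q'A
  list(u = q %*% s$u[, 1:k], d = s$d[1:k], v = s$v[, 1:k])
}

set.seed(4)
# Exactly rank-5 matrix, illustrative only
a <- matrix(rnorm(200 * 5), nrow = 200) %*% matrix(rnorm(5 * 50), nrow = 5)

r <- rand_svd(a, k = 5, p = 5)
approx <- r$u %*% diag(r$d) %*% t(r$v)
rel_err <- sqrt(sum((a - approx)^2)) / sqrt(sum(a^2))
print(rel_err)
print(rbind(randomized = r$d, exact = svd(a)$d[1:5]))
```

For a matrix that is only approximately low rank, the error depends on how fast the singular values decay and on the choice of p.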
Not good enough yet
• Limited to memory size
• After memory limits, feature extraction dominates
Hybrid architecture
[Diagram. Labels: Input; Side-data; Feature extraction and down sampling; Data join; Map-reduce; Via NFS; Sequential SVD]
Hybrid architecture
[Diagram. Labels: Input; Side-data; Feature extraction and down sampling; Data join; Map-reduce; Via NFS; R; Visualization; Sequential SVD]
Randomness to the rescue
• To save the day again, use blocks
$$Y_i = A_i \Omega$$
$$R'R = Y'Y = \sum_i Y_i' Y_i$$
$$B_j = \sum_i (A_i \Omega R^{-1})' A_{ij}$$
$$LL' = BB'$$
$$U \Sigma V' = L$$
$$(A \Omega R^{-1} U)\, \Sigma\, (V' L^{-1} B) \approx A$$
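A single-machine sketch of the block-wise steps, assuming row blocks only (the slide also splits columns) and an in-memory test matrix that is low rank plus noise. Each `lapply` stands in for work that would run as one map-reduce pass; all names and sizes are illustrative.

```r
set.seed(5)
k <- 4; p <- 4
# Illustrative test matrix: rank-4 signal plus small noise
a <- matrix(rnorm(120 * k), nrow = 120) %*% matrix(rnorm(k * 30), nrow = k) +
  0.1 * matrix(rnorm(120 * 30), nrow = 120)
row_blocks <- split(seq_len(nrow(a)), rep(1:3, each = 40))  # three blocks A_i
omega <- matrix(rnorm(ncol(a) * (k + p)), nrow = ncol(a))

# Pass 1: each block contributes Y_i'Y_i; the sum gives R'R = Y'Y
yty <- Reduce(`+`, lapply(row_blocks, function(i) crossprod(a[i, ] %*% omega)))
r <- chol(yty)                        # upper triangular R

# Pass 2: B = sum_i (A_i Omega R^-1)' A_i
b <- Reduce(`+`, lapply(row_blocks, function(i) {
  q_i <- a[i, ] %*% omega %*% solve(r)
  crossprod(q_i, a[i, ])
}))

# Small dense steps: LL' = BB', then the SVD of L
l <- t(chol(tcrossprod(b)))           # lower triangular L
s <- svd(l)

# Assemble the leading k factors of A
u <- a %*% omega %*% solve(r) %*% s$u[, 1:k]   # (A Omega R^-1 U)
v <- t(t(s$v[, 1:k]) %*% solve(l) %*% b)       # (V' L^-1 B)'
d <- s$d[1:k]
print(rbind(blockwise = d, exact = svd(a)$d[1:k]))
```

Only the small (k+p) x (k+p) matrices Y'Y and BB' need to be gathered in one place; the passes over A never require the whole matrix in memory.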
Hybrid architecture
[Diagram. Labels: Map-reduce; Feature extraction and down sampling; Via NFS; R; Visualization; Map-reduce; Block-wise parallel SVD]
Conclusions
• Inter-operability allows massive scalability
• Prototyping in R not wasted
• Map-reduce iteration not needed for SVD
• Feasible scale ~10^9 non-zeros or more

R user group 2011 09
