Distributed Machine Learning and Graph Processing with Sparse Matrices
Speaker: LIN Qian
http://www.comp.nus.edu.sg/~linqian/
Big Data, Complex Algorithms
– PageRank (dominant eigenvector)
– Recommendations (matrix factorization)
– Anomaly detection (top-K eigenvalues)
– User importance (vertex centrality)
Machine learning + Graph algorithms
Large-Scale Processing Frameworks
Data-parallel frameworks – MapReduce/Dryad (2004)
– Process each record in parallel
– Use case: Computing sufficient statistics, analytics queries
Graph-centric frameworks – Pregel/GraphLab (2010)
– Process each vertex in parallel
– Use case: Graphical models
Array-based frameworks – MadLINQ (2012)
– Process blocks of array in parallel
– Use case: Linear Algebra Operations
PageRank using Matrices
Power Method: finds the dominant eigenvector
– M = web graph matrix
– p = PageRank vector
– Simplified algorithm: repeat { p = M*p }
Linear Algebra Operations on Sparse Matrices
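The simplified loop above can be sketched in plain Python. This is only an illustration, not Presto code: the sparse matrix is stored as a list of row dictionaries, and the names `sparse_matvec` and `power_method`, the tolerance, and the three-page example graph are all assumptions made for the example. The vector is renormalized on every step, which the slide's one-line loop leaves out.

```python
def sparse_matvec(rows, p):
    """Multiply a sparse matrix by a dense vector.

    rows[i] maps column index j to the nonzero entry M[i][j].
    """
    return [sum(v * p[j] for j, v in row.items()) for row in rows]


def power_method(rows, tol=1e-10, max_iter=1000):
    """Repeat p = M*p until p stops changing (L1 difference below tol)."""
    n = len(rows)
    p = [1.0 / n] * n                      # start from the uniform vector
    for _ in range(max_iter):
        q = sparse_matvec(rows, p)
        total = sum(q)
        q = [x / total for x in q]         # keep the entries summing to 1
        if sum(abs(a - b) for a, b in zip(p, q)) < tol:
            return q
        p = q
    return p


# Hypothetical 3-page web graph, column-stochastic:
# page 0 links to 1 and 2, page 1 links to 2, page 2 links to 0.
M = [
    {2: 1.0},          # page 0 is linked from page 2
    {0: 0.5},          # page 1 is linked from page 0
    {0: 0.5, 1: 1.0},  # page 2 is linked from pages 0 and 1
]
p = power_method(M)
```

Only the nonzero entries are stored and visited, which is why the cost of each iteration depends on the number of links rather than on N squared.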
Statistical software
– Moderately sized datasets
– Single server, entirely in memory
– Workarounds for massive datasets: vertical scalability, sampling
MapReduce
– Limited to aggregation processing
Data analytics
Deep vs. Scalable
Statistical software
(R, MATLAB, SPSS, SAS)
MapReduce
Ways to improve
1. Statistical software += large-scale data management (Parallel MATLAB, pR)
2. MapReduce += statistical functionality (HAMA, SciHadoop)
3. Combining both existing technologies
MadLINQ [EuroSys’12]
– Linear algebra platform on Dryad
– Not efficient for sparse matrix computation
Ricardo [SIGMOD’10]
– R sends aggregation-processing queries to Hadoop; Hadoop sends aggregated data back to R
– But ends up inheriting the inefficiencies of the MapReduce interface
R
– Array-based
– Single-threaded
– Limited support for scaling
Challenge 1 – Sparse Matrices
[Figure: normalized block density (log scale, 1 to 10000) versus block ID (1 to 91) for the LiveJournal, Netflix, and ClueWeb-1B datasets]
1000x more data → Computation imbalance
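A small sketch of how such an imbalance arises: equal-sized row blocks of a sparse matrix can hold very different numbers of nonzeros. The edge list, the block size, and the choice to normalize by the sparsest nonempty block are assumptions made for this example, not values taken from the datasets in the figure.

```python
def block_nonzeros(edges, n_rows, block_rows):
    """Count nonzero entries in each block of block_rows consecutive rows."""
    n_blocks = (n_rows + block_rows - 1) // block_rows
    counts = [0] * n_blocks
    for row, _col in edges:
        counts[row // block_rows] += 1
    return counts


def normalized_density(counts):
    """Scale the counts so the sparsest nonempty block has density 1."""
    smallest = min(c for c in counts if c > 0)
    return [c / smallest for c in counts]


# Hypothetical 8-vertex graph: vertex 0 is a hub linked to every other
# vertex, and every other vertex links only back to the hub.
edges = [(0, j) for j in range(1, 8)] + [(i, 0) for i in range(1, 8)]
counts = block_nonzeros(edges, n_rows=8, block_rows=2)
density = normalized_density(counts)
```

Here the block that contains the hub holds four times as many nonzeros as any other block, so the worker that owns it does four times the multiply-add work per iteration.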
Challenge 2 – Data Sharing
Sharing data through pipes/network
Time-inefficient (sending copies)
Space-inefficient (extra copies)
[Figure: processes on Server 1 and Server 2, each holding its own copy of the data; copies are made by local copy within a server and by network copy between servers]
Sparse matrices → Communication overhead
Extend R – make it scalable, distributed
Large-scale machine learning and graph
processing on sparse matrices
Presto architecture
[Figure: one master and two workers; each worker runs several R instances that share the worker's DRAM]
Distributed array (darray)
Partitioned
Shared
Dynamic
foreach
– Parallel execution of the loop body
– Barrier at the end of the loop
– Call Update to publish changes
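The foreach semantics can be imitated on one machine with a thread pool. This is a loose sketch, not Presto's implementation: the `foreach` helper, the toy partitions, and the `published` list are hypothetical, and publishing is modeled as a separate step after the barrier rather than as a call inside the loop body.

```python
from concurrent.futures import ThreadPoolExecutor


def foreach(indices, body):
    """Run body(i) for every index in parallel and wait for all of them.

    Leaving the `with` block acts as the barrier: no caller code runs
    until every loop body has finished.
    """
    with ThreadPoolExecutor() as pool:
        return list(pool.map(body, indices))


# A toy "distributed array" held as a list of partitions.
partitions = [[1, 2], [3, 4], [5, 6]]
published = list(partitions)          # what other tasks are allowed to see


def body(i):
    # Compute a new value for partition i without touching the shared view.
    return [2 * x for x in partitions[i]]


results = foreach(range(len(partitions)), body)

# Stand-in for the update call: changes become visible only after the barrier.
for i, new_part in enumerate(results):
    published[i] = new_part
```

Keeping the computed values private until they are published means that no loop body can observe a half-updated array.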
PageRank Using Presto
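M <- darray(dim=c(N,N), blocks=c(s,N))  # N x N web graph matrix, in row blocks of s rows
P <- darray(dim=c(N,1), blocks=c(s,1))  # PageRank vector, in matching blocks of s entries
while(..){
  # One task per row block; the loop ends with a barrier
  foreach(i, 1:len,
    calculate(m=splits(M,i),
              x=splits(P), p=splits(P,i)) {
      p <- m %*% x   # block i of the new PageRank vector
      update(p)      # publish the change
    }
  )}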
Create Distributed Array
[Figure: matrix M and vector p, with p partitioned into P1, P2, …, PN/s]
Execute function in a cluster
Pass array partitions
[Figure: the partitions P1, P2, …, PN/s of p and the matrix M are passed to the function]
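The way partitions are passed to the function can be mimicked in plain Python. This is a sequential sketch under stated assumptions: `split_rows`, `calculate`, and `pagerank_step` are hypothetical helpers that stand in for `splits` and the loop body of the Presto program above, and the four-page graph is made up. In Presto each block would be handled by a worker; here the blocks are processed one after another.

```python
def split_rows(rows, s):
    """Partition a list of sparse rows into blocks of s consecutive rows."""
    return [rows[k:k + s] for k in range(0, len(rows), s)]


def calculate(m_block, x):
    """Multiply one row block by the full vector x, giving one partition of p."""
    return [sum(v * x[j] for j, v in row.items()) for row in m_block]


def pagerank_step(m_blocks, p_parts):
    """One iteration: every block reads all of p and writes its own partition."""
    x = [value for part in p_parts for value in part]   # like splits(P)
    return [calculate(m, x) for m in m_blocks]          # like splits(P, i)


# Hypothetical 4-page graph stored as sparse rows (column-stochastic).
M = [
    {3: 1.0},
    {0: 0.5},
    {0: 0.5, 1: 1.0},
    {2: 1.0},
]
s = 2
m_blocks = split_rows(M, s)
p_parts = [[0.25, 0.25], [0.25, 0.25]]
p_parts = pagerank_step(m_blocks, p_parts)
```

Each task needs the whole vector as input but produces only its own s entries, which is why the vector is the piece of data that has to be shared between tasks.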
Dynamic repartitioning
To address load imbalance
Correctness
Repartitioning Matrices
Profile execution
Repartition
Invariants
compatibility in array sizes
Maintaining Size Invariants
invariant(mat, vec, type=ROW)
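A rough sketch of what a row invariant has to guarantee during repartitioning: whenever a block of the matrix is split, the matching partition of the vector is split in the same place. The helper names, the sample blocks, and the use of nonzero counts as a stand-in for profiled execution time are assumptions made for this example; this is not how Presto implements `invariant`.

```python
def check_row_invariant(mat_blocks, vec_parts):
    """Row invariant: block i of mat has as many rows as part i of vec."""
    return (len(mat_blocks) == len(vec_parts) and
            all(len(m) == len(v) for m, v in zip(mat_blocks, vec_parts)))


def split_block(mat_blocks, vec_parts, i):
    """Split block i in half, splitting the matching vector part the same way."""
    m, v = mat_blocks[i], vec_parts[i]
    half = len(m) // 2
    new_mat = mat_blocks[:i] + [m[:half], m[half:]] + mat_blocks[i + 1:]
    new_vec = vec_parts[:i] + [v[:half], v[half:]] + vec_parts[i + 1:]
    return new_mat, new_vec


def nonzeros(block):
    """Number of stored entries in a block of sparse rows."""
    return sum(len(row) for row in block)


# Hypothetical data: block 0 is much denser than block 1.
mat_blocks = [
    [{0: 1.0, 1: 1.0, 2: 1.0, 3: 1.0}, {0: 1.0, 2: 1.0, 3: 1.0}],
    [{1: 1.0}, {}],
]
vec_parts = [[0.25, 0.25], [0.25, 0.25]]

# Repartition the heaviest block and keep the vector compatible with it.
heaviest = max(range(len(mat_blocks)), key=lambda i: nonzeros(mat_blocks[i]))
mat_blocks, vec_parts = split_block(mat_blocks, vec_parts, heaviest)
```

After the split the matrix has three blocks with 4, 3, and 1 nonzeros, and the vector has three parts of matching sizes, so a block-wise multiply still lines up.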
Data sharing for multi-core
Zero-copy sharing across cores
Data sharing challenges
1. Garbage collection
2. Header conflict
[Figure: two R instances; an R object consists of an object header followed by the data part]
Overriding R’s allocator
Allocate process-local headers
Map data in shared memory
[Figure: a process-local R object header and the shared R object data part, with the data part placed between page boundaries]
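The idea of keeping a small local handle while the data part lives in shared memory can be illustrated with Python's standard `multiprocessing.shared_memory` module. This is an analogy only and says nothing about R's allocator: both attachments below happen in one process for brevity, and the sample values are made up.

```python
from array import array
from multiprocessing import shared_memory

values = array("d", [0.1, 0.2, 0.3, 0.4])     # data part of an array partition
n_bytes = len(values) * values.itemsize

# Writer: place the data part in a named shared-memory segment.
segment = shared_memory.SharedMemory(create=True, size=n_bytes)
segment.buf[:n_bytes] = values.tobytes()

# Reader: attach to the same segment by name; nothing is copied.
# `view` plays the role of a process-local header over the shared data.
attached = shared_memory.SharedMemory(name=segment.name)
view = attached.buf[:n_bytes].cast("d")
total = sum(view)
first = view[0]

# Release the view before closing, then remove the segment.
view.release()
attached.close()
segment.close()
segment.unlink()
```

A second process would attach with the same segment name and read the same bytes, so the cost of sharing does not grow with the size of the partition.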
Immutable partitions → Safe sharing
Only share read-only data
Versioning arrays
To ensure correctness when arrays
are shared across machines
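One way to picture versioned arrays is a store in which every published version is immutable and readers pin the version they started with. The `VersionedArray` class below is a hypothetical sketch of that idea on a single machine, not Presto's mechanism.

```python
class VersionedArray:
    """Keep every published version immutable; readers pin a version number."""

    def __init__(self, partitions):
        self._versions = [tuple(tuple(p) for p in partitions)]

    @property
    def latest(self):
        """Number of the most recently published version."""
        return len(self._versions) - 1

    def read(self, version, i):
        """Return partition i as it was in the given version."""
        return self._versions[version][i]

    def update(self, i, new_partition):
        """Publish a new version in which only partition i changes."""
        current = list(self._versions[-1])
        current[i] = tuple(new_partition)
        self._versions.append(tuple(current))
        return self.latest


P = VersionedArray([[0.5, 0.5], [0.0, 0.0]])
snapshot = P.latest                    # a reader pins version 0
P.update(1, [0.25, 0.75])              # a writer publishes version 1
old = P.read(snapshot, 1)              # the reader still sees version 0
new = P.read(P.latest, 1)
```

Because old versions are never modified, a reader that started before an update keeps seeing a consistent array, and unchanged partitions are shared between versions rather than copied.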
Fault tolerance
Master: primary-backup replication
Worker: heartbeat-based failure detection
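Heartbeat-based failure detection can be sketched as a table of last-seen timestamps checked against a timeout. The `HeartbeatMonitor` class, the worker names, and the 3-second timeout are hypothetical; timestamps are passed in explicitly so the example stays deterministic.

```python
class HeartbeatMonitor:
    """Mark a worker as failed when no heartbeat arrives within `timeout`."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}

    def heartbeat(self, worker, now):
        """Record that `worker` reported in at time `now` (seconds)."""
        self.last_seen[worker] = now

    def failed_workers(self, now):
        """Return workers whose last heartbeat is older than the timeout."""
        return sorted(w for w, t in self.last_seen.items()
                      if now - t > self.timeout)


monitor = HeartbeatMonitor(timeout=3.0)
monitor.heartbeat("worker-1", now=0.0)
monitor.heartbeat("worker-2", now=0.0)
monitor.heartbeat("worker-1", now=2.0)   # worker-2 stays silent
failed = monitor.failed_workers(now=4.0)
```

A master built this way would poll `failed_workers` periodically and reschedule the tasks of any worker it returns.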
Presto applications
Presto doubles the lines of code (LOC) compared with programming purely in R.
Evaluation
Faster than Spark and Hadoop
using in-memory data
Multi-core support benefits
Data sharing benefits
[Figure: compute and transfer bars at 10, 20, and 40 cores, for no sharing and for sharing. Bar labels, first panel: 4.45, 2.49, 1.63 and 0.71, 0.7, 0.72; second panel: 4.38, 2.21, 1.22 and 1.22, 2.12, 4.16]
Repartitioning benefits
[Figure: transfer and compute per worker, without and with repartitioning; axis range 0 to 160]
Repartitioning benefits
[Figure: time to convergence (s, 2000 to 8000) and cumulative partitioning time (s, 0 to 400) versus number of repartitions (0 to 20); series: convergence time, time spent partitioning]
Limitations
1. In-memory computation
2. One writer per partition
3. Array-based programming
Conclusion
• Presto: a large-scale, array-based framework that extends R
• Challenges with sparse matrices
• Repartitioning, sharing versioned arrays
IMDb Rating: 8.5
Release Date: 27 June 2008
Director: Doug Sweetland
Studio: Pixar
Runtime: 5 min
Brief:
A stage magician’s rabbit
gets into a magical onstage
brawl against his neglectful
guardian with two magic
hats.

Editor's Notes

  • #2: MapReduce excels in massively parallel processing, scalability, and fault tolerance. In terms of analytics, however, such systems have been limited primarily to aggregation processing, i.e., computation of simple aggregates such as SUM, COUNT, and AVERAGE, after using filtering, joining, and grouping operations to prepare the data for the aggregation step. Although most DMSs provide hooks for user-defined functions and procedures, they do not deliver the rich analytic functionality found in statistical packages.
  • #3: Virtually all prior work attempts to get along with only one type of system, either adding large-scale data management capability to statistical packages or adding statistical functionality to DMSs. This approach leads to solutions that are often cumbersome, unfriendly to analysts, or wasteful in that a great deal of well-established technology is needlessly re-invented or re-implemented.
  • #4: Convert matrix operations to MapReduce functions.
  • #5: R sends aggregation-processing queries to Hadoop (written in the high-level Jaql query language), and Hadoop sends aggregated data to R for advanced statistical processing or visualization.
  • #6: R has serious limitations when applied to very large datasets: limited support for distributed processing, no strategy for load balancing, no fault tolerance, and a hard ceiling set by a server’s DRAM capacity.
  • #8: Large-scale machine learning and graph processing on sparse matrices
  • #9: Distributed array (darray) provides a shared, in-memory view of multi-dimensional data stored across multiple servers.
  • #10: Repartitioning can be used to subdivide an array into a specified number of parts. Repartitioning is an optional performance optimization which helps when there is load imbalance in the system.
  • #11: Note that for programs with general data structures (e.g., trees) writing invariants is difficult. However, for matrix computation, arrays are the only data structure and the relevant invariant is the compatibility in array sizes.
  • #12: Qian’s comment: same concept as snapshot isolation.