Matrix storage Data Example: outputting many small matrices Example: Cholesky QR 
Data Structures and Performance for Scientific Computing with Hadoop and Dumbo
Austin R. Benson 
Computer Sciences Division, UC-Berkeley 
ICME, Stanford University 
May 15, 2012
1 Matrix storage 
2 Data 
3 Example: outputting many small matrices 
4 Example: Cholesky QR
Dense matrix storage 
$$
A = \begin{pmatrix}
11 & 12 & 13 & 14 \\
21 & 22 & 23 & 24 \\
31 & 32 & 33 & 34 \\
41 & 42 & 43 & 44
\end{pmatrix}
$$
How do we store the matrix in HDFS?
In HDFS: 
⟨1, [11, 12, 13, 14]⟩
⟨2, [21, 22, 23, 24]⟩
⟨3, [31, 32, 33, 34]⟩
⟨4, [41, 42, 43, 44]⟩
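The one-row-per-record layout can be sketched in plain Python. This is a minimal illustration, not code from the talk: the function name `row_records` is hypothetical, and the 1-based keys simply mirror the records shown above.

```python
def row_records(matrix):
    """Yield one (key, value) record per row; keys are 1-based row indices."""
    for i, row in enumerate(matrix, start=1):
        yield i, list(row)

A = [[11, 12, 13, 14],
     [21, 22, 23, 24],
     [31, 32, 33, 34],
     [41, 42, 43, 44]]

# Four records, one per row of A.
records = list(row_records(A))
```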
Two rows per record 
or we might use: 
⟨1, [[11, 12, 13, 14], [21, 22, 23, 24]]⟩
⟨3, [[31, 32, 33, 34], [41, 42, 43, 44]]⟩
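Grouping several rows into one record might look like the sketch below. The name `block_records` and the `rows_per_record` parameter are assumptions made for illustration; the key is taken to be the 1-based index of the first row in the block, as in the records above.

```python
def block_records(matrix, rows_per_record):
    """Yield (key, block) records, where each block is a list of consecutive rows.

    The key is the 1-based index of the first row in the block.
    """
    for start in range(0, len(matrix), rows_per_record):
        block = [list(r) for r in matrix[start:start + rows_per_record]]
        yield start + 1, block

A = [[11, 12, 13, 14],
     [21, 22, 23, 24],
     [31, 32, 33, 34],
     [41, 42, 43, 44]]

two_row_records = list(block_records(A, 2))       # keys 1 and 3
whole_matrix = list(block_records(A, len(A)))     # a single record with key 1
```

Setting `rows_per_record` to the number of rows puts the whole matrix in a single record.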
Flattened list 
or maybe 
⟨1, [11, 12, 13, 14, 21, 22, 23, 24]⟩
⟨3, [31, 32, 33, 34, 41, 42, 43, 44]⟩
... but we lose the row boundaries here: a reader must know the column count to recover the rows (maybe that is not important)
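A small sketch of what is lost when a block is flattened: the values survive, but the rows can only be rebuilt if the column count is supplied from somewhere else. The helper names `flatten` and `unflatten` are hypothetical.

```python
def flatten(block):
    """Concatenate the rows of a block into one flat list."""
    return [x for row in block for x in row]

def unflatten(values, ncols):
    """Rebuild rows from a flat list; ncols must be known from outside the record."""
    if len(values) % ncols != 0:
        raise ValueError("length of values is not a multiple of ncols")
    return [values[i:i + ncols] for i in range(0, len(values), ncols)]

block = [[11, 12, 13, 14], [21, 22, 23, 24]]
flat = flatten(block)

rows_if_4_cols = unflatten(flat, 4)   # recovers the original two rows
rows_if_2_cols = unflatten(flat, 2)   # also valid, but a different matrix shape
```

Both calls succeed, which is exactly the ambiguity: the flat record alone does not say which shape was intended.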
Full matrix 
or maybe 
⟨1, [[11, 12, 13, 14], [21, 22, 23, 24], [31, 32, 33, 34], [41, 42, 43, 44]]⟩
What is the "best" way?
Depends on the application... we will look at an example later.
2 Data
Data Serialization 
Small optimizations → 2.5x speedup!
*all data from the NERSC Magellan cluster
Data Serialization 
Same experiment but different matrix size (200 columns):
Again, 2.5x speedup!
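This transcript does not say which optimizations produced the speedup. One common small optimization, assumed here purely for illustration, is to pack a row of doubles into a single binary string rather than serializing each entry separately. The sketch uses only the standard `struct` module; the helper names `pack_row` and `unpack_row` are hypothetical.

```python
import struct

def pack_row(row):
    """Pack a list of floats into one little-endian string of 8-byte doubles."""
    return struct.pack('<%dd' % len(row), *row)

def unpack_row(buf):
    """Inverse of pack_row: recover the list of floats from the binary string."""
    count = len(buf) // 8
    return list(struct.unpack('<%dd' % count, buf))

# A row with 200 columns, matching the matrix width mentioned above.
row = [1.0 / (j + 1) for j in range(200)]
packed = pack_row(row)          # always 8 bytes per entry
restored = unpack_row(packed)   # exact round trip, no text parsing
```

The packed value has a fixed size per entry and needs no per-element type tags or string parsing on the reading side.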
Languages 
Switching from Python to C++... 
same general trend
More speedups 
Algorithm performance isn't the only place where we see speedups
Why can we expect these speedups? 
These are not high-performance implementations. We care about 
I/O performance.
3 Example: outputting many small matrices
Suppose we need to write many small matrices to disk.
Code 
Code: 
git clone git://github.com/icme/mapreduce-workshop.git 
cd mapreduce-workshop/arbenson 
Files: 
speed_test.py (tester)
small_matrix_test.py (driver)
4 Example: Cholesky QR
Algorithm 
Cholesky QR: R = chol(A^T A, 'upper')
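The formula can be tried on a small matrix with a pure-Python sketch. This is not the talk's implementation: `gram` and `cholesky_upper` are hypothetical helpers written for illustration, and the code assumes A has full column rank so that A^T A is positive definite.

```python
import math

def gram(A):
    """Return A^T A for a matrix stored as a list of rows."""
    n = len(A[0])
    G = [[0.0] * n for _ in range(n)]
    for row in A:
        for i in range(n):
            for j in range(n):
                G[i][j] += row[i] * row[j]
    return G

def cholesky_upper(G):
    """Return upper-triangular R with R^T R = G (G symmetric positive definite)."""
    n = len(G)
    R = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Diagonal entry: what remains of G[i][i] after earlier rows of R.
        R[i][i] = math.sqrt(G[i][i] - sum(R[k][i] ** 2 for k in range(i)))
        for j in range(i + 1, n):
            rest = G[i][j] - sum(R[k][i] * R[k][j] for k in range(i))
            R[i][j] = rest / R[i][i]
    return R

A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
G = gram(A)                # [[35.0, 44.0], [44.0, 56.0]]
R = cholesky_upper(G)      # the R factor of A, up to rounding
```

The orthogonal factor, if needed, follows as Q = A R^{-1}. Because the method forms A^T A, it squares the condition number of A, so it is best suited to well-conditioned matrices.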
Implementation for MapReduce
Mapper implementation 
Which of these implementations is better?
Answer: the one on the left (usually)
Why? 
1 Shuffle time
2 Reduce bottleneck 
However, the left implementation could run out of memory.
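The side-by-side mapper code from the slide is not part of this transcript. The sketch below shows one plausible pair, assumed rather than taken from the slides: a mapper that buffers its rows and emits its local A^T A once (consistent with the remark that the left version can run out of memory), and a mapper that emits an outer product for every input row. All names are hypothetical, and the functions are plain Python generators rather than Dumbo classes.

```python
def mapper_buffered(rows):
    """Collect every row, then emit the n rows of the local Gram matrix once."""
    buffered = [list(r) for r in rows]   # memory grows with the input split
    n = len(buffered[0])
    for i in range(n):
        yield i, [sum(r[i] * r[j] for r in buffered) for j in range(n)]

def mapper_per_row(rows):
    """Emit the n rows of the outer product r^T r for every input row r."""
    for r in rows:
        for i, ri in enumerate(r):
            yield i, [ri * rj for rj in r]

def reducer(pairs):
    """Sum all value lists that share a key: {row index: row of A^T A}."""
    sums = {}
    for key, vals in pairs:
        if key in sums:
            sums[key] = [a + b for a, b in zip(sums[key], vals)]
        else:
            sums[key] = list(vals)
    return sums

split = [[1, 2], [3, 4], [5, 6]]
left = list(mapper_buffered(split))    # n records per mapper
right = list(mapper_per_row(split))    # n records per input row
```

Both variants reduce to the same A^T A, but the buffered one sends far fewer records through the shuffle and leaves less summing for the reducer.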
Mapper implementation 
Can we do better? Yes
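The improved mapper itself does not appear in this transcript. One possibility, offered only as an assumption, is to keep a running n-by-n sum of outer products instead of buffering rows: memory then depends on the number of columns rather than on the size of the input split, and each mapper still emits only n records. The name `mapper_accumulate` is hypothetical.

```python
def mapper_accumulate(rows):
    """Accumulate A^T A one row at a time, then emit its n rows once."""
    gram = None
    for r in rows:
        if gram is None:
            # Allocate the n-by-n accumulator when the first row arrives.
            gram = [[0] * len(r) for _ in r]
        for i, ri in enumerate(r):
            gram_row = gram[i]
            for j, rj in enumerate(r):
                gram_row[j] += ri * rj
    if gram is not None:
        for i, gram_row in enumerate(gram):
            yield i, gram_row

emitted = list(mapper_accumulate([[1, 2], [3, 4], [5, 6]]))
nothing = list(mapper_accumulate([]))   # an empty split emits no records
```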

Data Structures and Performance for Scientific Computing with Hadoop and Dumbo (ICME MR 2012)

Questions?
Austin R. Benson
arbenson@gmail.com
https://github.com/arbenson/mrtsqr