Parallel BFS on Distributed Memory Systems
Aydin Buluc and Kamesh Madduri
Sapta
DC reading group
September 29, 2016
Outline
Introduction
Shared Memory BFS
Model
Contributions
Serial BFS overview
Another paper: Parallel BFS using 2 queues
This paper: Hybrid Parallel BFS using 2 stacks
Experimental Results
Conclusion
Introduction
BFS is important.
BFS is usually a building block of more complex graph
algorithms.
Now that we have BIG graphs, parallelizing BFS is very
important.
Shared Memory BFS involves: (1) communication between
processors and (2) distribution of the graph (vertices) among
processors.
Model
Graph G(V, E) with |V| = n and |E| = m, where m is O(n),
i.e. sparse graphs.
Edge weights = 1.
Contributions
Traditional representation: 1-dimensional BFS (1D adjacency
arrays).
Sparse matrix representation: 2D partitioning of the graph
(not discussed).
Serial BFS overview
Sequential BFS uses a queue data structure.
BFS requirement:
all vertices at distance k from the source must be “visited”
before any vertex at distance k + 1.
Explanation?
Level Synchronous BFS is a key concept in correct shared
memory BFS.
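A minimal sketch of the queue-based serial BFS described above, written in Python as an illustration (the slides show no code for it; the names `serial_bfs`, `adj` and `dist` are assumptions). The FIFO order is what enforces the requirement: a vertex at distance k + 1 is enqueued only while a vertex at distance k is being processed, so it always leaves the queue after every distance-k vertex.

```python
from collections import deque

def serial_bfs(adj, source):
    """Queue-based BFS; returns the hop distance of every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()            # oldest entry first, i.e. smallest distance
        for v in adj[u]:
            if v not in dist:          # first visit fixes the distance
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

# Small illustrative graph (not from the slides).
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
result = serial_bfs(adj, 0)
print(result)
```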
Modified BFS: Use 2 stacks
Can be parallelized as is: perform lines 6-7 in parallel,
lines 8-10 are atomic
Related Work: Level Synchronous Parallel BFS using 2
queues by Agarwal et al., SC’10 [1]
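The pseudocode that the line numbers above refer to is not part of this transcript, so the following is only a sketch of the two-stack idea under stated assumptions: `fs` holds the current frontier, `ns` collects the next one, and the two are swapped once per level. Because a whole level is finished before the next begins, the order in which vertices are popped within a level does not matter, which is why stacks can stand in for the queue. In a parallel version the "check unvisited, set level, push" step would presumably be the part that has to be atomic.

```python
def two_stack_bfs(adj, source):
    """Level-synchronous BFS with a frontier stack (fs) and a next stack (ns)."""
    level = {source: 0}
    fs = [source]      # vertices of the current level
    ns = []            # vertices discovered for the next level
    depth = 0
    while fs:
        while fs:                      # this inner loop is the parallelizable part
            u = fs.pop()
            for v in adj[u]:
                if v not in level:     # would need to be atomic when run in parallel
                    level[v] = depth + 1
                    ns.append(v)
        fs, ns = ns, []                # swap stacks; ns starts empty for the next level
        depth += 1                     # advance to the next level
    return level

# Small illustrative graph (not from the slides).
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
levels = two_stack_bfs(adj, 0)
print(levels)
```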
Hybrid 1D Parallel BFS Algorithm
One of the main areas for optimization of this basic parallel
algorithm is
load balancing: ensuring that the parallelized edge visit
steps are evenly distributed.
1D partitioning: If there are p processors in the system, give
ownership of n/p vertices to each processor.
Random shuffling of the vertex identifiers prior to
partitioning, so that all processors get roughly the same number of
vertices (n/p) and edges (m/p).
Use of local stacks NSi for pushes, followed by a global
union (overhead < 3% of execution time).
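The partitioning step above can be sketched as follows. This is an illustration only: the function name `partition_1d`, the fixed `seed`, and the contiguous-block ownership rule are assumptions, not details taken from the slides.

```python
import random
from collections import Counter

def partition_1d(n, p, seed=0):
    """Randomly relabel n vertices, then give each of p owners a block of about n/p."""
    rng = random.Random(seed)
    new_id = list(range(n))
    rng.shuffle(new_id)                # new_id[v] is the shuffled identifier of vertex v
    block = -(-n // p)                 # ceil(n / p) vertices per owner
    owner = [new_id[v] // block for v in range(n)]
    return new_id, owner

new_id, owner = partition_1d(n=12, p=4)
print(Counter(owner))                  # each owner holds 3 of the 12 vertices
```

Shuffling before cutting into blocks spreads high-degree vertices across owners, which is what makes the per-processor edge counts come out close to m/p on skewed graphs.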
1D BFS
1D BFS contd..
1D BFS errors
The value of level is never incremented.
The next stack NSi should be emptied before
traversing the next level.
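To make the two corrections concrete, here is a single-process simulation of a 1D level-synchronous BFS. It is a sketch under assumptions, not the algorithm from the slides: the p "ranks" are plain Python lists, the all-to-all exchange is a nested loop, and ownership is the hypothetical rule `v % p`. The two commented lines near the end are where the fixes apply.

```python
def bfs_1d_simulated(adj, source, p):
    """Simulate p ranks, each owning the vertices v with v % p == rank."""
    n = len(adj)
    owner = lambda v: v % p            # hypothetical ownership rule
    level = [-1] * n                   # -1 marks an unvisited vertex
    level[source] = 0
    fs = [[] for _ in range(p)]        # FS_i: current frontier of rank i
    fs[owner(source)].append(source)
    depth = 0
    while any(fs):
        # Each rank scans its frontier and buckets neighbours by owning rank.
        send = [[[] for _ in range(p)] for _ in range(p)]
        for i in range(p):
            for u in fs[i]:
                for v in adj[u]:
                    send[i][owner(v)].append(v)
        # Stand-in for the all-to-all exchange: rank j receives send[i][j] from every i.
        ns = [[] for _ in range(p)]    # fix 2: NS_i starts empty at every level
        for j in range(p):
            for i in range(p):
                for v in send[i][j]:
                    if level[v] == -1:
                        level[v] = depth + 1
                        ns[j].append(v)
        fs = ns
        depth += 1                     # fix 1: increment the level counter
    return level

# Small illustrative graph as adjacency lists (not from the slides).
adj = [[1, 2], [0, 3], [0, 3], [1, 2, 4], [3]]
levels = bfs_1d_simulated(adj, source=0, p=2)
print(levels)
```

Without the increment, every discovered vertex would be labelled level 1; without clearing NSi, vertices from earlier levels would be traversed again.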
Experiments
1D Flat MPI: one process per core
1D Hybrid: one or more MPI processes within a node
Graphs: synthetic graphs based on the R-MAT random graph
model (default m : n = 16) and a web crawl of the UK domain (133
million vertices and 5.5 billion edges).
Systems: Hopper (6392-node Cray XE6) and Franklin
(9660-node Cray XT4)
Experimental Results
Strong scaling on Franklin
Higher is better
GTEPS: Giga Traversed Edges per Second
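As a quick illustration of the metric, GTEPS is the number of edges traversed divided by the running time, expressed in billions per second. The helper name `gteps` and the 2-second timing below are hypothetical; only the 5.5 billion edge count comes from the slides.

```python
def gteps(edges_traversed, seconds):
    """Giga traversed edges per second."""
    return edges_traversed / seconds / 1e9

# Hypothetical run: all 5.5 billion edges of the UK crawl traversed in 2 seconds.
rate = gteps(5.5e9, 2.0)
print(rate)
```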
Experimental Results
Lower is better
Strong scaling on Franklin
Experimental Results
Weak Scaling on Franklin
Lower is better
Experiments
Flat 1D algorithms are about 1.5 to 1.8 times faster than the
2D algorithms.
The 1D hybrid algorithm is slower than the flat 1D
algorithm at smaller concurrencies, but starts to perform
significantly faster at larger concurrencies.
Conclusion
Conjecture: Level synchronous BFS can be implemented
without any error with relaxed queues
Question: Can the error be bounded if we don’t have a level
synchronous algorithm?
[1] V. Agarwal, F. Petrini, D. Pasetto, and D.A. Bader. Scalable
graph exploration on multicore processors. In Proc. ACM/IEEE
Conference on Supercomputing (SC10), November 2010.
[2] A. Buluc and K. Madduri. Parallel breadth-first search on
distributed memory systems. In Proceedings of 2011
International Conference for High Performance Computing,
Networking, Storage and Analysis, SC ’11, pages 65:1–65:12,
New York, NY, USA, 2011. ACM.
[3] C.E. Leiserson and T.B. Schardl. A work-efficient parallel
breadth-first search algorithm (or how to cope with the
nondeterminism of reducers). In Proc. 22nd ACM Symp. on
Parallelism in Algorithms and Architectures (SPAA ’10), pages
303–314, June 2010.
Thank You :)

