Parallel bfs using 2 stacks

Parallel BFS on Distributed Memory Systems
Aydin Buluc and Kamesh Madduri
Sapta
DC reading group
September 29, 2016

Outline
Introduction
Shared Memory BFS
Model
Contributions
Serial BFS overview
Another paper: Parallel BFS using 2 queues
This paper: Hybrid Parallel BFS using 2 stacks
Experimental Results
Conclusion

Introduction
BFS is important.
BFS is usually a building block of more complex graph
algorithms.
Now that graphs are very large, parallelizing BFS is
important.
Shared Memory BFS involves: (1) communication between
processors and (2) distribution of the graph (its vertices)
among processors

Model
Graph G(V, E) with |V| = n and |E| = m; m is O(n),
i.e. sparse graphs.
Edge weights = 1.

Contributions
Traditional representation: 1 dimensional BFS (1D adjacency
arrays).
Sparse matrix representation: 2D partitioning of the graph
(Not discussed).
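
As a rough illustration of the 1D adjacency-array representation mentioned above, here is a minimal sketch in Python. It assumes an undirected graph with vertices numbered 0..n-1; the names build_adjacency_arrays, offsets, targets and neighbours are hypothetical and not taken from the paper.

```python
def build_adjacency_arrays(n, edges):
    """Build CSR-style adjacency arrays for an undirected graph on 0..n-1."""
    degree = [0] * n
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    # targets[offsets[v]:offsets[v + 1]] holds the neighbours of v.
    offsets = [0] * (n + 1)
    for v in range(n):
        offsets[v + 1] = offsets[v] + degree[v]
    targets = [0] * offsets[n]
    cursor = offsets[:-1].copy()  # next free slot for each vertex
    for u, v in edges:
        targets[cursor[u]] = v
        cursor[u] += 1
        targets[cursor[v]] = u
        cursor[v] += 1
    return offsets, targets


def neighbours(offsets, targets, v):
    """Return the neighbour list of vertex v."""
    return targets[offsets[v]:offsets[v + 1]]
```

With m in O(n), both arrays take O(n) space, which is why this layout suits sparse graphs.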

Serial BFS overview
Sequential BFS uses a queue data structure
BFS requirement:
all vertices at distance k from the source should be “visited”
before vertices at distance k + 1.
Explanation?
Level Synchronous BFS is a key concept in correct shared
memory BFS.
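
For reference, a minimal sketch of the sequential queue-based BFS described above. It assumes the graph is given as a dict mapping each vertex to a list of neighbours; the function name serial_bfs is hypothetical.

```python
from collections import deque


def serial_bfs(adj, source):
    """Return a dict mapping each reachable vertex to its distance from source."""
    level = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()  # FIFO order: closer vertices leave the queue first
        for v in adj[u]:
            if v not in level:  # first visit fixes the BFS level of v
                level[v] = level[u] + 1
                queue.append(v)
    return level
```

Because the queue is first-in first-out, every vertex at distance k is dequeued before any vertex at distance k + 1, which is the requirement stated above.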

Modified BFS: Use 2 stacks
Can be parallelized as is: perform lines 6-7 in parallel,
lines 8-10 are atomic
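
The pseudocode these line numbers refer to is not reproduced here, so the following is only a sketch of what a two-stack, level-by-level BFS could look like, written serially in Python. The names two_stack_bfs, current and nxt are hypothetical, and adj is assumed to map every vertex to its neighbour list.

```python
def two_stack_bfs(adj, source):
    """Level-by-level BFS using a current stack and a next stack."""
    level = {v: -1 for v in adj}  # -1 marks an unvisited vertex
    level[source] = 0
    current, nxt = [source], []
    depth = 1
    while current:
        # Order within a level does not matter, so a stack is sufficient.
        while current:
            u = current.pop()
            for v in adj[u]:
                # In a parallel version, this check-and-set plus the push
                # would need to be atomic.
                if level[v] == -1:
                    level[v] = depth
                    nxt.append(v)
        current, nxt = nxt, []  # the next stack becomes the current one
        depth += 1
    return level
```

The outer loop acts as the level barrier: no vertex at depth k + 1 is expanded until the current stack holding depth k is empty.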

Related Work: Level Synchronous Parallel BFS using 2
queues by Agarwal et al., SC’10 [1]

Hybrid 1D Parallel BFS Algorithm
One of the main areas for optimization of this basic parallel
algorithm is
load balancing: ensuring that the parallel edge-visit
steps are load-balanced.
1D partitioning: with p processors in the system, each
processor owns n/p vertices.
Random shuffling of the vertex identifiers prior to
partitioning, so that all processors get roughly the same
number of vertices (n/p) and edges (m/p).
Use of local stacks NSi for pushes, followed by a global
union (overhead < 3% of execution time).
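
The shuffling and 1D partitioning steps above can be sketched as follows. This is an illustrative Python version assuming contiguous blocks of ceil(n/p) relabelled vertex ids per processor; the function name shuffle_and_partition and the seed parameter are hypothetical.

```python
import random


def shuffle_and_partition(n, p, seed=0):
    """Randomly relabel n vertices, then assign contiguous id blocks to p owners."""
    rng = random.Random(seed)
    new_id = list(range(n))
    rng.shuffle(new_id)  # new_id[old] is the relabelled id of vertex `old`
    block = -(-n // p)   # ceil(n / p) vertices per processor
    owner = [new_id[v] // block for v in range(n)]
    return new_id, owner
```

Each processor ends up with at most ceil(n/p) vertices. The random relabelling is what makes the edge counts roughly even as well, since high-degree vertices are scattered across owners instead of clustering in one block.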

1D BFS errors
The value of level is not incremented
The Next Stack NSi data structure should be emptied before
traversing next level.
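
To make the two fixes concrete, here is a single-process simulation of a level-synchronous 1D BFS in Python, with the level increment and the per-level reset of the NSi stacks marked in comments. It is a sketch under assumptions: the p processors are simulated by loops, the inbox lists stand in for inter-processor communication, and the names bfs_1d_simulated, inbox and next_stacks are hypothetical.

```python
def bfs_1d_simulated(adj, source, owner, p):
    """Simulate a level-synchronous 1D BFS over p owners in one process."""
    level = {v: -1 for v in adj}  # -1 marks an unvisited vertex
    level[source] = 0
    frontier = [[] for _ in range(p)]  # per-processor current stacks
    frontier[owner[source]].append(source)
    depth = 1
    while any(frontier):
        # Fix 2: start every level with empty next stacks NS_i.
        next_stacks = [[] for _ in range(p)]
        # Stand-in for communication: send each neighbour to its owner.
        inbox = [[] for _ in range(p)]
        for i in range(p):
            for u in frontier[i]:
                for v in adj[u]:
                    inbox[owner[v]].append(v)
        # Each owner marks its newly discovered vertices.
        for i in range(p):
            for v in inbox[i]:
                if level[v] == -1:
                    level[v] = depth
                    next_stacks[i].append(v)
        frontier = next_stacks
        depth += 1  # Fix 1: increment the level after every round.
    return level
```

Omitting the depth increment would assign every discovered vertex the same level, and reusing a non-empty next stack would revisit vertices from earlier levels.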

Experiments
1D Flat MPI: one process per core
1D Hybrid: one or more MPI processes within a node
Graphs: synthetic graphs based on the R-MAT random graph
model (default m : n = 16), and a web crawl of the UK domain
(133 million vertices and 5.5 billion edges).
Systems: Hopper (6392-node Cray XE6) and Franklin
(9660-node Cray XT4)

Strong scaling on Franklin
Higher is better
GTEPS: Giga Traversed Edges per Second
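
As a small worked example of the metric, GTEPS is the number of edges traversed divided by the running time in seconds, scaled by 10^9. The function name and the numbers in the usage note are purely illustrative and are not measurements from the paper.

```python
def gteps(edges_traversed, seconds):
    """Giga traversed edges per second for one BFS run."""
    return edges_traversed / seconds / 1e9
```

For instance, a hypothetical run that traverses 4 billion edges in 2 seconds would score gteps(4_000_000_000, 2.0), i.e. 2 GTEPS.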

Strong scaling on Franklin
Lower is better

Weak Scaling on Franklin
Lower is better

Experiments
Flat 1D algorithms are about 1.5 to 1.8 times faster than the
2D algorithms.
The 1D hybrid algorithm is slower than the flat 1D
algorithm at smaller concurrencies, but starts to perform
significantly faster at larger concurrencies.

Conclusion
Conjecture: Level synchronous BFS can be implemented
without any error with relaxed queues
Question: Can the error be bounded if we don’t have a level
synchronous algorithm?

[1] V. Agarwal, F. Petrini, D. Pasetto, and D.A. Bader. Scalable
graph exploration on multicore processors. In Proc. ACM/IEEE
Conference on Supercomputing (SC10), November 2010.
[2] A. Buluc and K. Madduri. Parallel breadth-first search on
distributed memory systems. In Proceedings of 2011
International Conference for High Performance Computing,
Networking, Storage and Analysis, SC ’11, pages 65:1–65:12,
New York, NY, USA, 2011. ACM.
[3] C.E. Leiserson and T.B. Schardl. A work-efficient parallel
breadth-first search algorithm (or how to cope with the
nondeterminism of reducers). In Proc. 22nd ACM Symp. on
Parallelism in Algorithms and Architectures (SPAA ’10), pages
303–314, June 2010.
