Twitter Storm

First mention of Big Data
Michael Cox and David Ellsworth publish “Application-controlled demand paging for out-of-core visualization” in the Proceedings of the IEEE 8th Conference on Visualization, 1997

http://www.gartner.com/it-glossary/big-data/

Pregel
MapReduce

• Scalable platform for large-scale graph analytics on commodity clusters and the cloud
• Simple yet scalable programming abstraction for large-scale graph processing
• More than 10x improvement in some graph algorithms over traditional graph processing systems
• Components
– GoFS: sub-graph centric distributed graph storage
– Gopher: distributed BSP-based programming abstraction for large-scale graph analytics

• Iterative vertex-centric programming model based on the Bulk Synchronous Parallel (BSP) model
• In each iteration a vertex can
1. Receive messages sent to it in the previous iteration
2. Send messages to other vertices
3. Modify its own state

• Vertex-centric programming model: "think like a vertex"
• Very simple
• High communication overhead
• Examples: Pregel, Apache Giraph (http://giraph.apache.org/)
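
The three per-iteration actions of a vertex can be illustrated with a small single-process simulation. This is only a sketch and not the Pregel or Giraph API: the function name `vertex_centric_components` and the choice of connected components by minimum-label propagation are assumptions made for illustration.

```python
from collections import defaultdict

def vertex_centric_components(adjacency):
    """Label every vertex with the smallest vertex id in its component.

    Hypothetical helper: simulates vertex-centric BSP in one process.
    """
    label = {v: v for v in adjacency}
    # Superstep 0: every vertex sends its own label to its neighbours.
    inbox = defaultdict(list)
    for v, neighbours in adjacency.items():
        for n in neighbours:
            inbox[n].append(label[v])
    supersteps = 1
    while inbox:  # no messages in flight: every vertex is idle
        outbox = defaultdict(list)
        for v, messages in inbox.items():
            best = min(messages)              # 1. receive messages from the previous iteration
            if best < label[v]:
                label[v] = best               # 3. modify own state
                for n in adjacency[v]:
                    outbox[n].append(best)    # 2. send messages to other vertices
        inbox = outbox                        # barrier: delivered in the next iteration
        supersteps += 1
    return label, supersteps

graph = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}
labels, steps = vertex_centric_components(graph)
print(labels, steps)
```

Because a label moves only one hop per iteration, the number of iterations grows with the graph diameter, which is where the communication overhead noted above comes from.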
• Partition the graph into sets of vertices and edges
• Sub-graphs – connected components within a partition
• User logic operates on a sub-graph
– Independent unit of computation

• Resource allocation
– Single partition → single machine | single sub-graph → single CPU

• Data loading
– Entire partition is loaded into memory before computation
– Tasks retain sub-graphs in memory within the task scope
[Figure: Sub-graph-Task 1, Sub-graph-Task 2, Sub-graph-Task 3]
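
As a rough sketch of how sub-graphs could be derived from a partition, the snippet below treats a sub-graph as a connected component over the edges that stay inside the partition and records edges that cross the partition boundary separately. The function `find_subgraphs` and its input layout are hypothetical and do not reflect the GoFS storage format.

```python
def find_subgraphs(partition_vertices, edges):
    """Split one partition into sub-graphs (connected components over local edges).

    Hypothetical helper; returns (list of vertex sets, list of remote edges).
    """
    local = set(partition_vertices)
    adjacency = {v: set() for v in local}
    remote_edges = []                      # edges with one endpoint outside the partition
    for u, v in edges:
        if u in local and v in local:
            adjacency[u].add(v)
            adjacency[v].add(u)
        elif u in local or v in local:
            remote_edges.append((u, v))
    subgraphs, seen = [], set()
    for start in sorted(local):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:                       # depth-first traversal in shared memory
            v = stack.pop()
            if v in component:
                continue
            component.add(v)
            stack.extend(adjacency[v] - component)
        seen |= component
        subgraphs.append(component)
    return subgraphs, remote_edges

edges = [(1, 2), (2, 3), (4, 5), (3, 6), (5, 7)]
subgraphs, remote = find_subgraphs({1, 2, 3, 4, 5}, edges)
print(subgraphs, remote)
```

Each returned vertex set would then be handed to one sub-graph task, matching the single sub-graph → single CPU allocation described above.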

• Algorithm: sequence of super-steps separated by global barrier synchronization
• Super-step i:
1. Sub-graphs compute in parallel
2. Each sub-graph receives the messages sent to it in super-step i-1
3. Each sub-graph executes the same user-defined function
4. Each sub-graph sends messages to other sub-graphs (to be received in super-step i+1)
5. A sub-graph can vote to halt: "I'm done" / de-activate

http://www.ibm.com/developerworks/opensource/library/os-giraph/

• Global vote-to-halt check
– Termination → all sub-graphs have voted to halt

• Active sub-graphs participate in every computation
• De-activated sub-graphs will not be executed/activated unless they receive new messages

[Figure: BSP Manager, control channels, partitions (P1), data channels]
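
The super-step loop and the vote-to-halt rule can be sketched as a small single-process driver. This is an illustrative simulation under stated assumptions, not the Gopher API: `run_bsp`, the `compute` signature and the token-passing example `pass_token` are all hypothetical, and the "parallel" sub-graph tasks run sequentially here.

```python
from collections import defaultdict

def run_bsp(states, compute, initial_messages=None, max_supersteps=100):
    """Run sub-graph tasks in super-steps until all have voted to halt.

    compute(sg, states, messages, superstep) -> (list of (target, message), halt flag)
    """
    inbox = defaultdict(list, initial_messages or {})
    active = set(states)                  # every sub-graph starts active
    superstep = 0
    while (active or inbox) and superstep < max_supersteps:
        outbox = defaultdict(list)
        # A halted sub-graph runs again only if it has new messages.
        for sg in sorted(active | set(inbox)):
            sent, halt = compute(sg, states, inbox.get(sg, []), superstep)
            for target, message in sent:
                outbox[target].append(message)
            if halt:
                active.discard(sg)        # vote to halt
            else:
                active.add(sg)
        inbox = outbox                    # global barrier: deliver in the next super-step
        superstep += 1
    return superstep

def pass_token(sg, states, messages, superstep):
    """Toy user function: forward a token with a time-to-live around a ring."""
    sent = []
    for ttl in messages:
        states[sg] += 1                   # count tokens seen by this sub-graph
        if ttl > 0:
            sent.append(((sg + 1) % len(states), ttl - 1))
    return sent, True                     # always vote to halt

states = {0: 0, 1: 0, 2: 0}
steps = run_bsp(states, pass_token, {0: [4]})
print(steps, states)
```

In this toy run every sub-graph votes to halt in each super-step, so after the first super-step only the sub-graph holding the token is executed; the loop ends once nobody is active and no messages are in flight.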

• Algorithm – Max vertex value
• Input – Connected graph with different vertex values {n1, n2, …, ni}
• Output – Each vertex holds max({n1, n2, …, ni})

• Compared to the vertex-centric model
– Less communication
– Faster convergence
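
To make the comparison concrete, the sketch below runs the max-value algorithm in both styles on a six-vertex path graph split into two sub-graphs and counts super-steps and messages. It is a single-process simulation with hypothetical function names and made-up input values, not GoFFish or Giraph code.

```python
def vertex_centric_max(adjacency, values):
    """One vertex per task: a value travels one hop per super-step."""
    values = dict(values)
    changed = set(values)                 # vertices that must announce their value
    supersteps = messages = 0
    while changed:
        inbox = {}
        for v in changed:
            for n in adjacency[v]:
                inbox.setdefault(n, []).append(values[v])
                messages += 1
        changed = set()
        for v, received in inbox.items():
            best = max(received)
            if best > values[v]:
                values[v] = best
                changed.add(v)
        supersteps += 1
    return values, supersteps, messages

def subgraph_centric_max(adjacency, values, subgraph_of):
    """One sub-graph per task: local maximum is found in shared memory."""
    values = dict(values)
    members = {}
    for v, sg in subgraph_of.items():
        members.setdefault(sg, []).append(v)
    inbox = {sg: [] for sg in members}    # every sub-graph is active in super-step 0
    announced = {}                        # last maximum each sub-graph sent out
    supersteps = messages = 0
    while inbox:
        outbox = {}
        for sg, received in inbox.items():
            best = max([values[v] for v in members[sg]] + received)
            for v in members[sg]:         # update all local vertices at once
                values[v] = best
            if announced.get(sg) != best:
                announced[sg] = best
                for v in members[sg]:
                    for n in adjacency[v]:
                        if subgraph_of[n] != sg:   # only remote edges carry messages
                            outbox.setdefault(subgraph_of[n], []).append(best)
                            messages += 1
        inbox = outbox
        supersteps += 1
    return values, supersteps, messages

path = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4, 6], 6: [5]}
vals = {1: 3, 2: 1, 3: 4, 4: 1, 5: 5, 6: 9}
owner = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B", 6: "B"}
print(vertex_centric_max(path, vals)[1:])
print(subgraph_centric_max(path, vals, owner)[1:])
```

On this input the sub-graph centric run needs fewer super-steps and fewer messages, because values spread through a whole sub-graph within one super-step and only the single edge between the two sub-graphs carries messages.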
• Cluster
– 12 nodes, each with an 8-core Intel Xeon CPU
– 16 GB RAM and 1 TB HDD per node

• Network
– Gigabit Ethernet

• Apache Giraph, latest version from trunk
– Includes performance improvements from Facebook

Data Set                                   Vertices     Edges        Diameter
RN: California Road Network                1,965,206    2,766,607    849
TR: Internet path graph from Traceroutes   19,442,778   22,782,842   25
LJ: LiveJournal Social Network             4,847,571    68,475,391   10

• Connected Components
– 81x improvement using the California Road Network (RN) dataset
– 21x improvement using the traceroute path network of a CDN (TR) dataset

• PageRank
– 4x improvement using the RN dataset
– 1.5x improvement using the TR dataset
– Not an ideal algorithm for the sub-graph centric programming model

• Single Source Shortest Path
– 32x improvement using the RN dataset
– 10x improvement using the TR dataset

• Connected Components
– ~79x reduction using the RN dataset
– ~4x reduction using the TR dataset
– ~2x reduction using the LiveJournal (LJ) dataset

• Single Source Shortest Path
– ~38x reduction using the RN dataset
– 4x reduction using the TR dataset
– ~1.6x reduction using the LJ dataset

• Introduced a sub-graph centric programming abstraction for large-scale graph analytics on distributed systems
– Simple
– Enables the use of shared-memory algorithms at the sub-graph level

• Sub-graph centric algorithms and performance results
– Connected Components
– Single Source Shortest Path
– PageRank

• Issues and future work
– Sub-graph aware partitioning
– Sub-graph centric algorithms

• Try now
– https://github.com/usc-cloud/goffish

http://thesciencepresenter.wordpress.com/category/behaviour-management/



GoFFish - A Sub-graph centric framework for large scale graph analytics
