Dynamic graph / iterative computation on Apache Giraph
6/3/2014
Avery Ching
Hadoop Summit
Motivation
Apache Giraph
• Inspired by Google’s Pregel but runs on Hadoop
• “Think like a vertex”
• Maximum value vertex example
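[Figure: maximum value vertex example on Processor 1 and Processor 2 over time; the vertices start with values such as 5, 2, and 1, and the maximum, 5, spreads step by step.]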
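The "think like a vertex" model behind the maximum value example can be sketched in plain Java, without any Giraph dependency. This is only an illustration under assumptions: the class name `MaxValueSketch`, the method `propagateMax`, and the three-vertex chain used in the usage note are hypothetical and do not come from the slides.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java simulation of the maximum-value example (hypothetical, not Giraph code). */
final class MaxValueSketch {

    /**
     * Runs synchronous supersteps until no vertex changes its value.
     * neighbors[v] lists the vertices that v sends messages to.
     */
    static long[] propagateMax(long[] initialValues, int[][] neighbors) {
        long[] values = initialValues.clone();
        int n = values.length;
        boolean[] active = new boolean[n];
        Arrays.fill(active, true); // every vertex is active in the first superstep
        boolean anyActive = true;

        while (anyActive) {
            // Messages sent in this superstep are only read after all sends finish.
            List<List<Long>> inbox = new ArrayList<>();
            for (int v = 0; v < n; v++) {
                inbox.add(new ArrayList<>());
            }
            for (int v = 0; v < n; v++) {
                if (active[v]) {
                    for (int target : neighbors[v]) {
                        inbox.get(target).add(values[v]);
                    }
                }
            }
            // Each vertex keeps the largest value seen; unchanged vertices go idle.
            anyActive = false;
            for (int v = 0; v < n; v++) {
                active[v] = false;
                for (long message : inbox.get(v)) {
                    if (message > values[v]) {
                        values[v] = message;
                        active[v] = true;
                        anyActive = true;
                    }
                }
            }
        }
        return values;
    }
}
```

For a chain of three vertices holding 5, 2, and 1, `propagateMax(new long[] {5, 2, 1}, new int[][] {{1}, {0, 2}, {1}})` leaves every vertex with the value 5.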
Giraph on Hadoop / Yarn
[Figure: Giraph runs on MapReduce or YARN; Hadoop versions shown are 0.20.x, 0.20.203, 1.x, and 2.0.x.]
Page rank in Giraph
• Calculate updated page rank value from neighbors
• Send page rank value to neighbors for 30 iterations
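import java.io.IOException;
import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;

public class PageRankComputation extends BasicComputation<LongWritable,
    DoubleWritable, FloatWritable, DoubleWritable> {
  @Override
  public void compute(Vertex<LongWritable, DoubleWritable, FloatWritable> vertex,
      Iterable<DoubleWritable> messages) throws IOException {
    // From superstep 1 on, combine the rank shares received from neighbors.
    if (getSuperstep() >= 1) {
      double sum = 0;
      for (DoubleWritable message : messages) {
        sum += message.get();
      }
      vertex.setValue(new DoubleWritable((0.15d / getTotalNumVertices()) + 0.85d * sum));
    }
    // For 30 supersteps, send this vertex's rank split evenly over its out-edges.
    if (getSuperstep() < 30) {
      sendMessageToAllEdges(vertex,
          new DoubleWritable(vertex.getValue().get() / vertex.getNumEdges()));
    } else {
      vertex.voteToHalt();
    }
  }
}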
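To see the update rule from the page rank slide in isolation, here is a small plain-Java simulation that runs without Hadoop or Giraph. It is a sketch under assumptions: the class `PageRankSketch`, the adjacency-array input, and the initial rank of 1/N are all hypothetical choices (the slide does not say how vertex values are initialized).

```java
import java.util.Arrays;

/** Plain-Java simulation of the slide's page rank update (hypothetical helper). */
final class PageRankSketch {

    /**
     * outEdges[v] lists the targets of v's out-edges.
     * Each iteration mirrors one send/receive round of the vertex program.
     */
    static double[] pageRank(int[][] outEdges, int iterations) {
        int n = outEdges.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n); // assumed initial value, not from the slide

        for (int iter = 0; iter < iterations; iter++) {
            // "Send": every vertex splits its rank evenly over its out-edges.
            double[] sum = new double[n];
            for (int v = 0; v < n; v++) {
                for (int target : outEdges[v]) {
                    sum[target] += rank[v] / outEdges[v].length;
                }
            }
            // "Compute": same formula as the slide, 0.15 / N + 0.85 * sum.
            for (int v = 0; v < n; v++) {
                rank[v] = 0.15 / n + 0.85 * sum[v];
            }
        }
        return rank;
    }
}
```

On a two-vertex cycle the ranks stay at 0.5 each, since 0.15 / 2 + 0.85 * 0.5 = 0.5, which is a quick sanity check of the formula.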
Apache Giraph data flow
Loading the graph
[Figure: the input format provides Split0 to Split3; with the Master, Worker0 and Worker1 each load and send graph data.]
Compute / Iterate
[Figure: Worker0 and Worker1 compute and send messages over the in-memory graph (Part0 to Part3), then send stats to the Master and iterate.]
Storing the graph
[Figure: Worker0 and Worker1 write Part0 to Part3 through the output format.]
Pipelined computation
Master “computes”
• Sets computation, in/out message, combiner for next super step
• Can set/modify aggregator values
[Figure: timeline showing Worker 0, Worker 1, and the Master moving through phase 1a, phase 1b, phase 2, and phase 3.]
Use case
Affinity propagation
Frey and Dueck, “Clustering by passing messages between data points”, Science 2007
Organically discover exemplars based on similarity
[Figure: clustering state at Initialization, Intermediate, and Convergence.]
3 stages
Responsibility r(i,k)
• How well suited is k to be an exemplar for i?
Availability a(i,k)
• How appropriate is it for point i to choose point k as an exemplar, given all of i’s responsibilities?
Update exemplars
• Based on known responsibilities/availabilities, which vertex should be my exemplar?
* Dampen responsibility, availability
Responsibility
Every vertex i with an edge to k maintains responsibility of k for i
Sends responsibility to k in ResponsibilityMessage (senderid, responsibility(i,k))
[Figure: vertices A, B, C, D with responsibility messages r(c,a), r(d,a), r(b,d), r(b,a).]
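A minimal sketch of the responsibility stage for a single vertex i, assuming the update rule from the cited Frey and Dueck paper: r(i,k) = s(i,k) minus the largest a(i,k') + s(i,k') over the other candidates k'. The similarity map s(i,k), the class name, and the method name are assumptions for illustration; the slides do not show this code.

```java
import java.util.HashMap;
import java.util.Map;

/** Responsibility update for one vertex i (hypothetical helper, not Giraph code). */
final class ResponsibilitySketch {

    /**
     * Computes r(i,k) for every candidate exemplar k of vertex i.
     * Both maps are keyed by candidate id k and must contain the same keys.
     */
    static Map<Long, Double> responsibilities(Map<Long, Double> similarity,
                                              Map<Long, Double> availability) {
        if (similarity.size() < 2) {
            throw new IllegalArgumentException("need at least two candidates");
        }
        // Track the best and second-best a(i,k') + s(i,k') so each k can exclude itself.
        long bestId = -1;
        double best = Double.NEGATIVE_INFINITY;
        double secondBest = Double.NEGATIVE_INFINITY;
        for (Map.Entry<Long, Double> e : similarity.entrySet()) {
            double score = availability.get(e.getKey()) + e.getValue();
            if (score > best) {
                secondBest = best;
                best = score;
                bestId = e.getKey();
            } else if (score > secondBest) {
                secondBest = score;
            }
        }
        Map<Long, Double> result = new HashMap<>();
        for (Map.Entry<Long, Double> e : similarity.entrySet()) {
            // The strongest rival of k is the best-scoring candidate other than k.
            double strongestRival = (e.getKey() == bestId) ? secondBest : best;
            result.put(e.getKey(), e.getValue() - strongestRival);
        }
        return result;
    }
}
```

Each entry of the returned map corresponds to one ResponsibilityMessage payload that vertex i would send to candidate k.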
Availability
Vertex sums positive messages
Sends availability to i in AvailabilityMessage (senderid, availability(i,k))
[Figure: vertices A, B, C, D with availability messages a(c,a), a(d,a), a(b,d), a(b,a).]
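The "sums positive messages" step can be sketched as follows, assuming the availability rule from the cited Frey and Dueck paper: a(i,k) = min(0, r(k,k) + sum of the positive r(i',k) from senders other than i), and a(k,k) = the sum of all positive incoming responsibilities. The class, method names, and map-based message representation are assumptions for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Availability update at candidate exemplar k (hypothetical helper, not Giraph code). */
final class AvailabilitySketch {

    /** Sum of the positive responsibilities received by k; this is also a(k,k). */
    static double sumPositive(Map<Long, Double> receivedResponsibility) {
        double sum = 0.0;
        for (double r : receivedResponsibility.values()) {
            sum += Math.max(0.0, r);
        }
        return sum;
    }

    /**
     * Computes a(i,k) for every sender i.
     * receivedResponsibility maps sender id i to r(i,k); selfResponsibility is r(k,k).
     */
    static Map<Long, Double> availabilities(double selfResponsibility,
                                            Map<Long, Double> receivedResponsibility) {
        double positiveSum = sumPositive(receivedResponsibility);
        Map<Long, Double> result = new HashMap<>();
        for (Map.Entry<Long, Double> e : receivedResponsibility.entrySet()) {
            // Exclude the sender's own positive contribution from the sum.
            double withoutSender = positiveSum - Math.max(0.0, e.getValue());
            result.put(e.getKey(), Math.min(0.0, selfResponsibility + withoutSender));
        }
        return result;
    }
}
```

Computing the positive sum once and subtracting each sender's share keeps the work linear in the number of received messages.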
Update exemplars
Dampens availabilities and scans edges to find exemplar k
Updates self-exemplar
[Figure: vertices A, B, C, D each run an update; the labels read exemplar=a, exemplar=d, exemplar=a, exemplar=a.]
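A sketch of the two operations named on this slide: damping a value and scanning a vertex's edges for its exemplar. It assumes the conventions of the cited Frey and Dueck paper, where a damped value is lambda * old + (1 - lambda) * new and the exemplar of i is the k that maximizes a(i,k) + r(i,k). The damping factor name `lambda`, the class, and the per-edge array layout are hypothetical.

```java
import java.util.Map;

/** Damping and exemplar selection for one vertex (hypothetical helper, not Giraph code). */
final class ExemplarUpdateSketch {

    /** Damped update: keeps a fraction lambda of the old value. */
    static double dampen(double oldValue, double newValue, double lambda) {
        return lambda * oldValue + (1.0 - lambda) * newValue;
    }

    /**
     * Scans the edges of vertex i and returns the candidate k with the largest
     * a(i,k) + r(i,k). Each array holds {availability, responsibility}.
     * Returns -1 if there are no edges.
     */
    static long chooseExemplar(Map<Long, double[]> edges) {
        long bestId = -1;
        double best = Double.NEGATIVE_INFINITY;
        for (Map.Entry<Long, double[]> e : edges.entrySet()) {
            double score = e.getValue()[0] + e.getValue()[1];
            if (score > best) {
                best = score;
                bestId = e.getKey();
            }
        }
        return bestId;
    }
}
```

A vertex whose best-scoring candidate is itself would be its own exemplar, which matches the "updates self-exemplar" note on the slide.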
Master logic
[Figure: state machine with the states initial state, calculate responsibility, calculate availability, update exemplars, and halt.]
if (exemplars agree they are exemplars && changed exemplars < ∆) then halt, otherwise continue
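The master's phase switching can be sketched as a small state machine. This is an assumption-laden illustration: the enum, the `next` method, and the idea that "continue" loops back to the responsibility phase are my reading of the diagram, not code from the talk. In a real job this decision would live in the master's per-superstep logic and use aggregated values.

```java
/** Phase transitions for the affinity propagation master (hypothetical sketch). */
final class AffinityMasterSketch {

    enum Phase {
        INITIAL, CALCULATE_RESPONSIBILITY, CALCULATE_AVAILABILITY, UPDATE_EXEMPLARS, HALT
    }

    /**
     * Returns the phase to run in the next superstep.
     * exemplarsAgree: every chosen exemplar also chose itself as exemplar.
     * changedExemplars: number of vertices whose exemplar changed this round.
     * delta: the convergence threshold written as ∆ on the slide.
     */
    static Phase next(Phase current, boolean exemplarsAgree,
                      long changedExemplars, long delta) {
        switch (current) {
            case INITIAL:
                return Phase.CALCULATE_RESPONSIBILITY;
            case CALCULATE_RESPONSIBILITY:
                return Phase.CALCULATE_AVAILABILITY;
            case CALCULATE_AVAILABILITY:
                return Phase.UPDATE_EXEMPLARS;
            case UPDATE_EXEMPLARS:
                // Halt only when the slide's condition holds; otherwise start another round.
                return (exemplarsAgree && changedExemplars < delta)
                        ? Phase.HALT
                        : Phase.CALCULATE_RESPONSIBILITY;
            default:
                return Phase.HALT;
        }
    }
}
```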
Performance
Faster than Hive?
Application                    Graph Size     CPU Time Speedup    Elapsed Time Speedup
Page rank (single iteration)   400B+ edges    26x                 120x
Friends of friends score       71B+ edges     12.5x               48x
Apache Giraph scalability
[Figure: Scalability of workers (200B edges). Seconds (0 to 500) against # of Workers (50 to 300); Giraph compared with Ideal.]
[Figure: Scalability of edges (50 workers). Seconds (0 to 500) against # of Edges (1E+09 to 2E+11); Giraph compared with Ideal.]
Trillion social edges page rank
[Figure: minutes per iteration (0 to 4), 6/30/2013 compared with 6/2/2014.]
Improvements
• GIRAPH-840 - Netty 4 upgrade
• G1 Collector / tuning
Graph partitioning
Why balanced partitioning
Random partitioning == good balance
BUT ignores entity affinity
[Figure: example graph with vertices 0 to 11 under random partitioning.]
Balanced partitioning application
Results from one service:
Cache hit rate grew from 70% to 85%, bandwidth cut in half
[Figure: the same vertices 0 to 11 regrouped by balanced partitioning.]
Balanced label propagation results
* Loosely based on Ugander and Backstrom, “Balanced label propagation for partitioning massive graphs”, WSDM '13
Partitioning experiments
[Figure: seconds per iteration (0 to 160) for 345B edge page rank, Random partitioning compared with 47% Local Edges.]
Improvements
• 56% faster!
• Native vertex remapping
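The "local edges" percentage on the chart presumably measures how many edges have both endpoints in the same partition. A minimal sketch of that metric, under that assumption (the class, method, and array-based inputs are hypothetical):

```java
/** Fraction of edges whose endpoints share a partition (hypothetical helper). */
final class LocalEdgeSketch {

    /**
     * edges[e] = {source, target}; partitionOf[v] = partition id of vertex v.
     * Returns 0.0 for a graph without edges.
     */
    static double localEdgeFraction(int[][] edges, int[] partitionOf) {
        if (edges.length == 0) {
            return 0.0;
        }
        int local = 0;
        for (int[] edge : edges) {
            // A local edge never crosses a partition boundary, so its messages stay local.
            if (partitionOf[edge[0]] == partitionOf[edge[1]]) {
                local++;
            }
        }
        return (double) local / edges.length;
    }
}
```

For a four-vertex ring split into two partitions of two adjacent vertices each, half of the edges are local, so the function returns 0.5.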
Power-law graphs
Avoiding out-of-core
Example: Mutual friends calculation between
neighbors
1. Send your friends a list of your friends
2. Intersect with your friend list
1.23B (as of 1/2014)
200+ average friends (2011 S1)
8-byte ids (longs)
= 394 TB / 100 GB machines
3,940 machines (not including the graph)
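The estimate above can be reproduced with a few lines of arithmetic. This sketch assumes that 1.23B is the number of vertices (people), that every vertex sends its full friend list to each friend, and that TB and GB are decimal units; the class and method names are hypothetical.

```java
/** Back-of-the-envelope message volume for the mutual friends example (hypothetical helper). */
final class MutualFriendsEstimate {

    /** Each vertex sends avgFriends ids to each of its avgFriends friends. */
    static long messageBytes(long vertices, long avgFriends, long idBytes) {
        return vertices * avgFriends * avgFriends * idBytes;
    }

    /** Number of machines needed to hold totalBytes, rounded up. */
    static long machinesNeeded(long totalBytes, long bytesPerMachine) {
        return (totalBytes + bytesPerMachine - 1) / bytesPerMachine;
    }
}
```

With 1.23B vertices, 200 friends, and 8-byte ids this gives 393.6 TB of messages and 3,936 machines at 100 GB each, which the slide rounds to 394 TB and 3,940 machines.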
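[Figure: example graph with vertices A to E, showing the friend lists each vertex receives (for example A:{D}, D:{A,E}, E:{D}) and the intersections that remain (for example C:{D}, D:{C}).]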
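The two steps of the mutual friends calculation (send your friend list to your friends, then intersect with your own list) can be sketched in plain Java. The class, the string vertex ids, and the assumption that the friendship map is symmetric and contains every vertex are all illustrative choices, not taken from the talk.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Mutual friend counts between neighbors (hypothetical helper, not Giraph code). */
final class MutualFriendsSketch {

    /**
     * friends maps each vertex to its friend set; the map is assumed symmetric.
     * Returns, for each vertex, the number of mutual friends it has with each friend.
     */
    static Map<String, Map<String, Integer>> mutualFriendCounts(
            Map<String, Set<String>> friends) {
        Map<String, Map<String, Integer>> result = new HashMap<>();
        for (Map.Entry<String, Set<String>> receiver : friends.entrySet()) {
            Map<String, Integer> counts = new HashMap<>();
            for (String sender : receiver.getValue()) {
                // Step 1: "sender" has sent its friend list to "receiver".
                Set<String> common = new HashSet<>(friends.get(sender));
                // Step 2: the receiver intersects that list with its own friend list.
                common.retainAll(receiver.getValue());
                counts.put(sender, common.size());
            }
            result.put(receiver.getKey(), counts);
        }
        return result;
    }
}
```

Copying every sender's list for every receiver is exactly the memory blow-up the slide estimates, which is why the message volume dominates the size of the graph itself.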
Superstep splitting
Process only a subset of source/destination edges per superstep
* Currently manual - future work automatic!
[Figure: one superstep split into four steps over vertex groups A and B:
 1. Sources: A (on), B (off); Destinations: A (on), B (off)
 2. Sources: A (on), B (off); Destinations: A (off), B (on)
 3. Sources: A (off), B (on); Destinations: A (on), B (off)
 4. Sources: A (off), B (on); Destinations: A (off), B (on)]
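A rough sketch of superstep splitting: vertices are assigned to groups, and each sub-superstep only lets one source group send to one destination group, so fewer messages are in flight at once. The class, the method, and the array-based graph representation are hypothetical, and the grouping itself is assumed to be given (the slide notes it is currently chosen manually).

```java
import java.util.ArrayList;
import java.util.List;

/** Counts messages per sub-superstep when a superstep is split (hypothetical helper). */
final class SuperstepSplitSketch {

    /**
     * edges[e] = {source, target}; groupOf[v] = group id of vertex v in [0, groups).
     * Returns the message count of each sub-superstep, ordered by source group,
     * then destination group (0->0, 0->1, ..., 1->0, 1->1, ...).
     */
    static List<Integer> messagesPerSubstep(int[][] edges, int[] groupOf, int groups) {
        List<Integer> counts = new ArrayList<>();
        for (int src = 0; src < groups; src++) {
            for (int dst = 0; dst < groups; dst++) {
                int sent = 0;
                for (int[] edge : edges) {
                    // Only edges from the active source group to the active
                    // destination group carry a message in this sub-superstep.
                    if (groupOf[edge[0]] == src && groupOf[edge[1]] == dst) {
                        sent++;
                    }
                }
                counts.add(sent);
            }
        }
        return counts;
    }
}
```

Every edge is covered exactly once across the sub-supersteps, so the total work is unchanged while the peak number of buffered messages drops to the largest single count.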
Giraph in production
Over 1.5 years in production
Over 100 jobs processed a week
30+ applications in our internal application repository
Sample production job - 700B+ edges
GiraphicJam demo
Giraph related projects
Graft: The distributed Giraph debugger
Giraph roadmap
2/12 - 0.1
5/13 - 1.0
6/14 - 1.1
Future work
Automatic checkpointing to make scheduling more fair
Investigate alternative computing models
• Giraph++ (IBM research)
• Giraphx (University at Buffalo, SUNY)
Performance
Lower the barrier to entry
Applications
Our team
Maja Kabiljo
Sergey Edunov
Pavan Athivarapu
Avery Ching
Sambavi Muthukrishnan