NFSv4 Replication for Grid Computing Peter Honeyman Center for Information Technology Integration University of Michigan, Ann Arbor
Acknowledgements Joint work with Jiaying Zhang UM CSE doctoral candidate Defending later this month Partially supported by  NSF/NMI GridNFS DOE/SciDAC Petascale Data Storage Institute Network Appliance, Inc. IBM ARC
Outline Background Consistent replication Fine-grained replication control Hierarchical replication control Evaluation Durability revisited  NEW! Conclusion SKIP SKIP SKIP SKIP SKIP SKIP SKIP SKIP
Grid computing Emerging global scientific collaborations require access to widely distributed data that is reliable, efficient, and convenient SKIP SKIP SKIP Grid Computing
GridFTP Advantages Automatic negotiation of TCP options Parallel data transfer Integrated Grid security Easy to install and support across a broad range of platforms Drawbacks Data sharing requires manual synchronization SKIP SKIP SKIP
NFSv4 Advantages Traditional, well-understood file system semantics Supports multiple security mechanisms Close-to-open consistency Reader is guaranteed to see data written by the last writer to close the file Drawbacks Wide-area performance SKIP SKIP SKIP
NFSv4.r Research prototype developed at CITI Replicated file system built on NFSv4 Server-to-server replication control protocol High performance data access Conventional file system semantics SKIP SKIP SKIP
Replication in practice Read-only replication Clumsy manual release model Lacks complex data sharing (concurrent writes) Optimistic replication Inconsistent consistency SKIP SKIP SKIP
Consistent replication Problem: state of the practice in file system replication  does not  satisfy the requirements of global scientific collaborations How can we provide Grid applications efficient and reliable data access? Consistent replication SKIP SKIP SKIP
Design principles Optimal read-only behavior Performance must be identical to un-replicated local system Concurrent write behavior Ordered writes, i.e., one-copy serializability Close-to-open semantics Fine-grained replication control The granularity of replication control is a single file or directory SKIP SKIP SKIP
Replication control client When a client opens a file for writing, the selected server temporarily becomes the primary for that file Other replication servers are instructed to forward client requests for that file to the primary if concurrent writes occur SKIP SKIP SKIP wopen
Replication control client The primary server asynchronously distributes updates to other servers during file modification   SKIP SKIP SKIP write
Replication control client When the file is closed and all replication servers are synchronized, the primary server notifies the other replication servers that it is no longer the primary server for the file  SKIP SKIP SKIP close
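To make the three-step protocol on the preceding slides concrete, here is a minimal in-memory sketch of the wopen/write/close lifecycle. The class, method, and variable names (ReplicaServer, primary_for, and so on) are illustrative inventions, not the NFSv4.r implementation, and update distribution is inlined synchronously for brevity.

```python
# Minimal sketch of the per-file replication control on the three slides above.
# Names and structure are hypothetical; NFSv4.r distributes updates
# asynchronously, which is simplified here.

class ReplicaServer:
    def __init__(self, name):
        self.name = name
        self.peers = []            # the other replication servers
        self.primary_for = {}      # path -> name of the server acting as primary
        self.data = {}             # path -> file contents

    def wopen(self, path):
        # If another server is already primary for this file, the request is
        # forwarded to it; otherwise this server announces itself as primary.
        current = self.primary_for.get(path)
        if current and current != self.name:
            return current                      # forward target
        for peer in self.peers:
            peer.primary_for[path] = self.name
        self.primary_for[path] = self.name
        return self.name

    def write(self, path, data):
        # The primary applies the write locally and distributes it to peers.
        self.data[path] = data
        for peer in self.peers:
            peer.data[path] = data

    def close(self, path):
        # Once every replica is synchronized, relinquish the primary role.
        for peer in self.peers:
            peer.primary_for.pop(path, None)
        self.primary_for.pop(path, None)

# Example: two servers, one writer.
s0, s1 = ReplicaServer("s0"), ReplicaServer("s1")
s0.peers, s1.peers = [s1], [s0]
s0.wopen("/grid/out.dat")
s0.write("/grid/out.dat", b"results")
s0.close("/grid/out.dat")
assert s1.data["/grid/out.dat"] == b"results"
```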
Directory updates Prohibit concurrent updates A replication server waits for the primary to relinquish its role Atomicity for updates that involve multiple objects (e.g. rename) A server must become primary for all objects Updates are grouped and processed together SKIP SKIP SKIP
Close-to-open semantics Server becomes primary after it collects votes from a majority of replication servers Use a majority consensus algorithm Cost is dominated by the median RTT from the primary server to other replication servers Primary server must ensure that every replication server has acknowledged its election when a written file is closed Guarantees close-to-open semantics Heuristic: for file creation, a new file inherits the primary server that controls its parent directory SKIP SKIP SKIP
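A tiny sketch of the two consensus thresholds on this slide, with the RPC replies abstracted into plain counts (the grant and acknowledgement counters are stand-ins, not the real protocol messages):

```python
# Majority consensus at open, total consensus at close (hedged sketch).

def majority_elected(grants, total_servers):
    """Primary election succeeds once more than half of all replication
    servers (candidate included) have granted it; latency is therefore
    dominated by the median RTT to the other servers."""
    return grants > total_servers // 2

def close_may_return(acks, total_servers):
    """Close-to-open semantics: the close of a written file returns only
    after every replication server has acknowledged."""
    return acks == total_servers

# Example with 5 replication servers.
assert majority_elected(grants=3, total_servers=5)
assert not close_may_return(acks=4, total_servers=5)
```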
Durability guarantee “ Active view” update policy Every server keeps track of the liveness of other servers (active view) Primary server removes from its active view any server that fails to respond to its request Primary server distributes updates synchronously and in parallel Primary server acknowledges a client write after a majority of replication servers reply Primary sends other servers its active view with file close  A failed replication server must synchronize with the up-to-date copy before it can rejoin the active group I suppose this is expensive SKIP SKIP SKIP
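The active-view policy can be sketched as below; the apply_update callback and the in-memory view are hypothetical stand-ins for the synchronous, parallel server-to-server updates described above.

```python
def replicate_write(active_view, update, apply_update):
    """Distribute an update to every server in the active view (sequentially
    here; in parallel in the real system). Servers that do not respond are
    dropped from the view and must resynchronize before rejoining. The client
    write is acknowledged once a majority of all replicas have the update."""
    total = len(active_view) + 1          # backups plus the primary itself
    needed = total // 2 + 1               # majority
    replies = 1                           # the primary's own copy
    for server in list(active_view):
        if apply_update(server, update):  # False models a timeout/failure
            replies += 1
        else:
            active_view.remove(server)
    return replies >= needed              # True => acknowledge the client

# Example: one of three backups fails; the write is still acknowledged.
view = ["s1", "s2", "s3"]
assert replicate_write(view, b"block", lambda s, u: s != "s2")
assert view == ["s1", "s3"]
```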
What I skipped Not the Right Stuff GridFTP: manual synchronization NFSv4.r: write-mostly WAN performance AFS, Coda, et al.: sharing semantics Consistent replication for Grid computing Ordered writes too weak Strict consistency too strong Close-to-open just right
NFSv4.r in brief View-based replication control protocol Based on (provably correct) El-Abbadi, Skeen, and Cristian Dynamic election of primary server At the granularity of a single file or directory Majority consensus on open (for synchronization) Synchronous updates to a majority (for durability) Total consensus on close (for close-to-open)
Write-mostly WAN performance Durability overhead Synchronous updates Synchronization overhead Consensus management
Asynchronous updates Consensus requirement delays client updates Median RTT between the primary server and other replication servers is costly Synchronous write performance is worse Solution: asynchronous updates Let the application decide whether to wait for server recovery or regenerate the computation results OK for Grid computations that checkpoint Revisit at end with new ideas
Hierarchical replication control Synchronization is costly over WAN Hierarchical replication control Amortizes consensus management A primary server can assert control at different granularities
Shallow & deep control /usr bin local /usr bin local A server with a  shallow control  on a file or directory is the primary server for that single object A server with a  deep control  on a directory is the primary server for everything in the subtree rooted at that directory
Primary server election Allow deep control for a directory D only if no descendant of D is controlled by another server Grant a shallow control request for object L from peer server P if L is not controlled by a server other than P Grant a deep control request for directory D from peer server P if D is not controlled by a server other than P and no descendant of D is controlled by a server other than P SKIP SKIP SKIP
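These election rules read naturally as two predicates over the control state; the sketch below uses a plain dictionary of owners plus the per-directory counter arrays from the ancestry table on the next slide (all names are illustrative, not the actual server data structures).

```python
def grant_shallow(obj, requester, owner_of):
    """Grant shallow control of obj unless another server already controls it."""
    owner = owner_of.get(obj)
    return owner is None or owner == requester

def grant_deep(directory, requester, owner_of, descendant_counters):
    """Grant deep control only if neither the directory nor any descendant is
    controlled by a server other than the requester."""
    if not grant_shallow(directory, requester, owner_of):
        return False
    counters = descendant_counters.get(directory, {})
    return all(count == 0 or server == requester
               for server, count in counters.items())

# Example: S2 cannot take deep control of "a" while S0 controls objects below it,
# but shallow control of "a" itself is still possible.
owner_of = {}
descendant_counters = {"a": {"S0": 2, "S1": 0, "S2": 0}}
assert not grant_deep("a", "S2", owner_of, descendant_counters)
assert grant_shallow("a", "S2", owner_of)
```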
Ancestry table (figure: directory tree /root with children a, b, c; a contains d1 and d2, d2 contains f1 and f2; annotations mark objects controlled by S0, S1, and S2) The data structure of entries in the ancestry table: an ancestry entry has the following attributes: id = unique identifier of the directory; array of counters = set of counters recording which servers control the directory's descendants. Example counter arrays:
Id      S0   S1   S2
root     2    1    1
a        2    1    0
b        0    0    1
c        2    0    0
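One plausible way to maintain those counter arrays is to walk the ancestor chain whenever control of an object is acquired or released; the path handling and helper names below are assumptions for illustration, not the on-server data structures.

```python
from collections import defaultdict
import posixpath

ancestry = defaultdict(lambda: defaultdict(int))   # directory -> server -> count

def ancestors(path):
    while path not in ("/", ""):
        path = posixpath.dirname(path)
        yield path

def take_control(path, server):
    # Becoming primary for `path` bumps this server's counter in every ancestor.
    for directory in ancestors(path):
        ancestry[directory][server] += 1

def release_control(path, server):
    for directory in ancestors(path):
        ancestry[directory][server] -= 1

# Example: S0 controls two objects under /root/a, S2 controls /root/b.
take_control("/root/a/d2/f2", "S0")
take_control("/root/a/d1", "S0")
take_control("/root/b", "S2")
assert ancestry["/root/a"]["S0"] == 2
assert ancestry["/root"]["S2"] == 1
```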
Primary election (figure: servers S0, S1, S2 issue requests for control of b, control of c, and deep control of a against a directory tree containing a, b, c) S0 and S1 succeed in their primary server elections S2’s election fails due to conflicts Solution: S2 then re-tries by asking for shallow control of a SKIP SKIP SKIP
Performance vs. concurrency Associate a timer with deep control Reset the timer with subsequent updates Release deep control when the timer expires A small timer value captures bursty updates Issue a separate shallow control for a file written under a deep-controlled directory Still process the write request immediately Subsequent writes on the file do not reset the timer of the deep-controlled directory SKIP SKIP SKIP SKIP
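A sketch of the deep-control timer behaviour described on this slide; the timer value and clock choice are invented for illustration.

```python
import time

class DeepControl:
    """Deep control on a directory, released when no directory-level update
    has arrived for `timeout` seconds (a small value captures bursts)."""

    def __init__(self, directory, timeout=2.0):
        self.directory = directory
        self.timeout = timeout
        self.expires = time.monotonic() + timeout

    def on_directory_update(self):
        # Subsequent directory updates push the release time back.
        self.expires = time.monotonic() + self.timeout

    def on_file_write(self, path):
        # A file written under the directory gets its own shallow control and
        # is served immediately, but does not reset the deep-control timer.
        return ("shallow", path)

    def should_release(self):
        return time.monotonic() >= self.expires
```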
Performance vs. concurrency Increase concurrency when the system has multiple writers Send a revoke request upon concurrent writes The primary server shortens the release timer In single-writer cases, optimally issues a deep control request for a directory that receives many updates SKIP SKIP SKIP
Single remote NFS N.B.: log scale
Deep vs. shallow Shallow controls vs. deep + shallow controls
Deep control timer
Durability revisited Synchronization is expensive, but … When we abandon the durability guarantee, we risk losing the results of the computation And may be forced to rerun it But it might be worth it Goal: maximize utilization NEW NEW NEW
Utilization tradeoffs Adding synchronous replication servers enhances durability Which reduces the risk that results are lost And that the computation must be restarted Which benefits utilization But increases run time Which reduces utilization
Placement tradeoffs Nearby replication servers reduce the replication penalty Which benefits utilization Nearby replication servers are more vulnerable to correlated failure Which reduces utilization
Run-time model
Parameters F: failure free, single server run time C: replication overhead R: recovery time p fail : server failure p recover : successful recovery
F: run time Failure-free, single server run time Can be estimated or measured Our focus is on 1 to 10 days
C: replication overhead Penalty associated with replication to backup servers Proportional to RTT Ratio can be measured by running with a backup server a few msec away
R: recovery time Time to detect failure of the primary server and switch to a backup server We assume R << F Arbitrary realistic value: 10 minutes
Failure distributions Estimated by analyzing PlanetLab ping data 716 nodes, 349 sites, 25 countries All-pairs, 15 minute interval From January 2004 to June 2005 692 nodes were alive throughout We ascribe missing pings to node failure and network partition
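The failure-interval extraction can be sketched as a scan over the 15-minute ping intervals; the data layout below is an assumption about how the PlanetLab trace might be summarized, not the actual analysis code.

```python
def failure_intervals(reachable_by_interval, node):
    """reachable_by_interval: one dict per 15-minute interval, mapping a node
    to True if any ping to it succeeded. Missing or failed pings are ascribed
    to node failure or network partition, as on the slide. Returns the lengths
    (in intervals) of each consecutive unreachable run for `node`."""
    runs, current = [], 0
    for interval in reachable_by_interval:
        if interval.get(node, False):
            if current:
                runs.append(current)
            current = 0
        else:
            current += 1
    if current:
        runs.append(current)
    return runs

# Example: down for two intervals, up, then down for one more.
trace = [{"n1": True}, {"n1": False}, {"n1": False}, {"n1": True}, {"n1": False}]
assert failure_intervals(trace, "n1") == [2, 1]
```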
PlanetLab failure CDF
Same-site correlated failures
sites:    11     21     65     259
nodes
  5      0.488
  4      0.488  0.378
  3      0.538  0.440  0.546
  2      0.561  0.552  0.593  0.526
Different-site correlated failures
Run-time model Discrete event simulation yields expected run time E and utilization (F ÷ E)
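A toy version of that discrete-event simulation is sketched below. It uses a constant per-step failure probability instead of the PlanetLab-derived distributions and restarts the job from scratch on a primary failure, so the numbers are illustrative only; the parameter names follow the earlier Parameters slide.

```python
import random

def expected_runtime(F, C, R, p_fail, trials=10_000, step=1.0):
    """Expected elapsed time E for a job of failure-free length F, with
    replication overhead C per unit of work, recovery time R after a primary
    failure, and per-step failure probability p_fail (all simplifications)."""
    total = 0.0
    for _ in range(trials):
        elapsed, done = 0.0, 0.0
        while done < F:
            if random.random() < p_fail:
                elapsed += R          # detect the failure and switch/recover
                done = 0.0            # no checkpoint in this toy model: restart
            else:
                elapsed += step * (1 + C)
                done += step
        total += elapsed
    return total / trials

F, C, R = 24.0, 0.02, 10 / 60         # one day, 2% overhead, 10-minute recovery
E = expected_runtime(F, C, R, p_fail=0.001)
print("expected run time:", round(E, 2), "utilization:", round(F / E, 3))
```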
Simulated utilization F = one hour One backup server Four backup servers
Simulation results F = one day One backup server Four backup servers
Simulation results F = ten days One backup server Four backup servers
Simulation results discussion For long-running jobs Replication improves utilization Distant servers improve utilization For short jobs Replication does not improve utilization In general, multiple backup servers don’t help much Implications for checkpoint interval …
Checkpoint interval F = one day One backup server 20% checkpoint overhead F = ten days, 2% checkpoint overhead One backup server Four backup servers
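For intuition about why the checkpoint interval matters here, a textbook first-order approximation (not derived in these slides) balances the per-checkpoint cost c against the expected rework after a failure, given mean time to failure M:

```latex
\text{overhead}(\tau) \;\approx\; \frac{c}{\tau} + \frac{\tau}{2M},
\qquad
\tau^{*} \;\approx\; \sqrt{2\,c\,M}
```

This rule is offered only as a rough cross-check of the simulated trade-off, not as part of the original analysis.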
Next steps Checkpoint overhead? Replication overhead? Depends on amount of computation We measure < 10% for NAS Grid Benchmarks, which do  no  computation Refine model Account for other failures Because they are common Other model improvements
Conclusions Conventional wisdom holds that consistent   mutable   replication   in large-scale distributed systems is  too expensive to consider Our study proves otherwise
Conclusions Consistent replication in large-scale distributed storage systems is  feasible  and  practical Superior  performance Rigorous adherence to conventional file system  semantics Improves cluster  utilization
Thank you for your attention! www.citi.umich.edu Questions?
