Efficient Support for MPI-I/O Atomicity
              Based on Versioning

Viet-Trung Tran (1), Bogdan Nicolae (2), Gabriel Antoniu (2), Luc Bougé (1)
                   KerData Research Team

(1) ENS Cachan, IRISA, France
(2) INRIA, IRISA, Rennes, France




Context: Data Intensive Large-scale HPC
Simulations
 Large-scale simulations of natural phenomena
 Highly parallel platform
 I/O challenges
    High I/O performance
    Huge data sizes (~PB)
    High concurrency




Data Access Pattern

 Spatial splitting in parallelization
      Ghost cells
 Application data model vs storage model
      Storage model: sequence of bytes

 Concurrent overlapping non-contiguous I/O
      Require atomicity guarantees
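
To make the access pattern concrete, here is a minimal sketch of how one tile of a 2-D array maps onto a file stored as a flat sequence of bytes. The row-major layout, the helper names `tile_extents` and `overlaps`, and the one-byte element size are assumptions chosen for illustration; they do not come from the slides.

```python
def tile_extents(rows, cols, r0, r1, c0, c1, ghost=0, elem_size=8):
    """Byte (offset, size) pairs for the tile [r0:r1, c0:c1] of a row-major
    rows x cols array, widened by `ghost` cells on every side (clamped)."""
    r0, r1 = max(0, r0 - ghost), min(rows, r1 + ghost)
    c0, c1 = max(0, c0 - ghost), min(cols, c1 + ghost)
    # One contiguous run per row of the tile: the tile as a whole is non-contiguous.
    return [((r * cols + c0) * elem_size, (c1 - c0) * elem_size)
            for r in range(r0, r1)]


def overlaps(a, b):
    """True if any extent in `a` intersects any extent in `b`."""
    return any(ao < bo + bs and bo < ao + asz
               for ao, asz in a for bo, bs in b)


# A 4x4 array split into a left and a right half, with one ghost column each.
left = tile_extents(4, 4, 0, 4, 0, 2, ghost=1, elem_size=1)
right = tile_extents(4, 4, 0, 4, 2, 4, ghost=1, elem_size=1)
print(left)    # [(0, 3), (4, 3), (8, 3), (12, 3)]
print(right)   # [(1, 3), (5, 3), (9, 3), (13, 3)]
print(overlaps(left, right))  # True
```

Each process touches several separate byte ranges, and the ghost columns make the ranges of neighbouring processes intersect. This is the situation in which concurrent writes need an atomicity guarantee.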


Goal:

High throughput non-contiguous I/O
    under atomicity guarantees




State of The Art

 Locking-based approaches to ensure atomicity
 3 levels of implementation
    Application
    MPI-I/O
    Storage

 Parallel I/O stack (figure), top to bottom:
    Application (Visit, Tornado simulation)
    Data model (HDF5, NetCDF)
    MPI-IO middleware
    Parallel file systems (PVFS, GPFS, Lustre)



Our Approach

 Dedicated interface for atomic non-contiguous I/O
    Provide atomicity guarantees at storage level
    No need to translate MPI consistency to storage consistency model

 Shadowing as a key to enhance data access under concurrency
    No locking
    Concurrent overlapped writes are allowed
    Atomicity guarantees

 Data striping




Building Block: BlobSeer

 A KerData project (blobseer.gforge.inria.fr)
    Data striping
    Versioning-based concurrency control
    Distributed metadata management




Building Block: BlobSeer (continued)

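 Distributed metadata management
    Organized as a segment tree
    Distributed over a DHT
 Two-phase I/O
    Data access
    Metadata access

 Figure: metadata trees over a blob. Node labels read as [offset, size]:
    root [0, 8]
    inner nodes [0, 4], [4, 4], [0, 2], [2, 2], [4, 2]
    leaves [0, 1], [1, 1], [2, 1], [3, 1], [4, 1]
    The labels [0, 4], [0, 2], [2, 2], [1, 1] and [2, 1] appear twice,
    presumably once per snapshot version.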



Proposal for A Non-contiguous,
Versioning Oriented Access Interface

 Non-contiguous Write
    vw = NONCONT_WRITE(id, buffers[], offsets[], sizes[])

 Non-contiguous Read
    NONCONT_READ(id, v, buffers[], offsets[], sizes[])

 Challenges
    Noncontiguous I/O must be atomic
    Efficient under concurrency
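
The two calls above can be mimicked with a small in-memory model. This is a sketch of the interface semantics only. The class `ToyVersionedStore`, its `create` method and the full-copy snapshots are assumptions made for illustration and are not how BlobSeer stores data. The read call also returns the regions rather than filling caller-supplied buffers.

```python
class ToyVersionedStore:
    """Toy model: every write publishes one new immutable snapshot."""

    def __init__(self):
        self._snapshots = {}  # blob id -> list of snapshots (bytes)

    def create(self, blob_id, size):
        self._snapshots[blob_id] = [bytes(size)]  # version 0: all zeros

    def noncont_write(self, blob_id, buffers, offsets, sizes):
        """Apply all regions to a private draft, then publish it as one version."""
        versions = self._snapshots[blob_id]
        draft = bytearray(versions[-1])
        for buf, off, size in zip(buffers, offsets, sizes):
            if len(buf) < size or off + size > len(draft):
                raise ValueError("region does not fit")
            draft[off:off + size] = buf[:size]
        versions.append(bytes(draft))
        return len(versions) - 1  # the version number vw

    def noncont_read(self, blob_id, v, offsets, sizes):
        """Read several regions from the single snapshot `v`."""
        snap = self._snapshots[blob_id][v]
        return [snap[off:off + size] for off, size in zip(offsets, sizes)]


store = ToyVersionedStore()
store.create("blob", 8)
v1 = store.noncont_write("blob", [b"AA", b"BB"], [0, 4], [2, 2])
v2 = store.noncont_write("blob", [b"CCC"], [1], [3])
print(store.noncont_read("blob", v1, [0, 4], [2, 2]))  # [b'AA', b'BB']
print(store.noncont_read("blob", v2, [0], [8]))        # [b'ACCCBB\x00\x00']
```

A reader that names version `v1` keeps seeing both regions of that write together, even after `v2` has overwritten part of them.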




1st challenge: Non-contiguous I/O Must Be Atomic

 Shadowing techniques
 Isolate non-contiguous update into one single consistent snapshot
    Done at metadata level
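
A minimal sketch of shadowing on a segment tree, assuming a blob whose size is a power of two and leaves of one unit each. The `Node` class and the functions `build`, `shadow` and `read` are hypothetical names. The point is that all regions of one non-contiguous write become visible through a single new root, while untouched subtrees are shared with the previous snapshot.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    offset: int
    size: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    chunk: Optional[str] = None  # leaf payload: label of a data chunk


def build(offset, size, chunk="v0"):
    """Initial tree in which every leaf points at the same chunk label."""
    if size == 1:
        return Node(offset, 1, chunk=chunk)
    half = size // 2
    return Node(offset, size,
                build(offset, half, chunk), build(offset + half, half, chunk))


def shadow(node, ranges, chunk):
    """New root whose leaves inside `ranges` point at `chunk`.
    Subtrees that no range touches are reused, not copied."""
    touched = any(o < node.offset + node.size and node.offset < o + s
                  for o, s in ranges)
    if not touched:
        return node
    if node.size == 1:
        return Node(node.offset, 1, chunk=chunk)
    return Node(node.offset, node.size,
                shadow(node.left, ranges, chunk),
                shadow(node.right, ranges, chunk))


def read(root, pos):
    """Follow the tree of one snapshot down to the leaf covering `pos`."""
    node = root
    while node.size > 1:
        node = node.left if pos < node.right.offset else node.right
    return node.chunk


v0 = build(0, 4)
v1 = shadow(v0, [(0, 1), (3, 1)], "w1")    # one write, two separate regions
print([read(v1, p) for p in range(4)])     # ['w1', 'v0', 'v0', 'w1']
print([read(v0, p) for p in range(4)])     # ['v0', 'v0', 'v0', 'v0']
print(v1.left.right is v0.left.right)      # True: untouched leaf is shared
```

A reader holds either root `v0` or root `v1`, so it sees none or all of the regions of the write, without any lock on the data.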




2nd challenge: Efficiency Under Concurrent Accesses


 Advantages of shadowing
    Parallel data I/O phases
    Parallel metadata I/O phases?

                          Our approach   Locking-based approach
 Overlapping data I/O     Parallel       No




Minimize Ordering Overhead

 Ordering is done at metadata level
 Arbitrary order
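
One way to read these two bullets, sketched under the assumption that a central component hands out version numbers: writers push their data without coordination and only then request a version number, so the serialization order is whatever order the requests happen to arrive in. `VersionManager`, `assign_version` and `write` are hypothetical names. The lock below protects only a counter, not any region of the file.

```python
import itertools
import threading
from concurrent.futures import ThreadPoolExecutor


class VersionManager:
    """Hypothetical ticket counter: the only serialized step of a write."""

    def __init__(self):
        self._next = itertools.count(1)
        self._lock = threading.Lock()

    def assign_version(self):
        with self._lock:  # held for a single increment
            return next(self._next)


def write(writer_name, manager, history):
    # Phase 1 (omitted): push data chunks to storage, no coordination needed.
    # Phase 2: request a version number; this fixes the writer's place in the order.
    version = manager.assign_version()
    history[version] = writer_name
    return version


manager = VersionManager()
history = {}
names = [f"w{i}" for i in range(8)]
with ThreadPoolExecutor(max_workers=4) as pool:
    versions = list(pool.map(lambda n: write(n, manager, history), names))

# Every writer gets a distinct version; which writer gets which one is arbitrary.
print(sorted(versions))
```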




Avoid Synchronization for Concurrent Segment Tree
Generation
 Delegate the generation of shadowing trees to the client side
 Shadowing trees are generated in parallel thanks to predictable
  metadata node IDs
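
A sketch of what a predictable node ID could look like, assuming that a node is identified by its blob, version, offset and size, and that the DHT places a node by hashing that identifier. The key format and the functions `node_key`, `provider_for` and `path_keys` are illustrative assumptions, not the actual BlobSeer scheme.

```python
import hashlib


def node_key(blob_id, version, offset, size):
    """Identifier that any client can compute without asking anyone."""
    return f"{blob_id}:{version}:{offset}:{size}"


def provider_for(key, n_providers):
    """Pick a metadata provider from a stable hash of the key."""
    digest = hashlib.sha1(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % n_providers


def path_keys(blob_id, version, leaf_offset, total_size):
    """Keys of the nodes on the path from the root to one leaf."""
    keys, offset, size = [], 0, total_size
    while True:
        keys.append(node_key(blob_id, version, offset, size))
        if size == 1:
            return keys
        size //= 2
        if leaf_offset >= offset + size:
            offset += size  # descend into the right child


keys_v3 = path_keys("blob", 3, 5, 8)
keys_v4 = path_keys("blob", 4, 5, 8)
print(keys_v3)  # ['blob:3:0:8', 'blob:3:4:4', 'blob:3:4:2', 'blob:3:5:1']
print(set(keys_v3) & set(keys_v4))  # set(): the two writers never collide
```

Because the version number is part of the key, two clients holding different versions write disjoint sets of nodes and can build their trees at the same time.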




Lazy Evaluation During Border Node Calculation

 The metadata tree is built bottom-up
 Optimized for non-contiguous access patterns
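
The slide does not spell out the algorithm, so the following is only one possible interpretation. Starting from the leaves touched by all regions of a write, parents are derived level by level, so an ancestor shared by several regions is planned once. Children that the write does not create are recorded as border nodes by their label only, to be resolved later. The function `plan_tree` and the notion of "border node" used here are assumptions.

```python
def plan_tree(ranges, total_size):
    """Return (new_nodes, border_nodes) as sets of (offset, size) labels.
    Assumes total_size is a power of two and leaves have size 1."""
    # Leaves touched by any region of the non-contiguous write.
    level = {(pos, 1) for off, size in ranges for pos in range(off, off + size)}
    new_nodes = set(level)
    size = 1
    while size < total_size:
        size *= 2
        # Parent of each node on the level below; the set removes duplicates.
        level = {(off - off % size, size) for off, _ in level}
        new_nodes |= level
    # Border nodes: children of new nodes that this write does not create.
    border = set()
    for off, node_size in new_nodes:
        if node_size > 1:
            half = node_size // 2
            for child in ((off, half), (off + half, half)):
                if child not in new_nodes:
                    border.add(child)  # only the label is kept for now
    return new_nodes, border


new_nodes, border = plan_tree([(0, 1), (3, 1)], 4)
print(sorted(new_nodes))  # [(0, 1), (0, 2), (0, 4), (2, 2), (3, 1)]
print(sorted(border))     # [(1, 1), (2, 1)]
```

In this reading, the root `(0, 4)` is planned once although both regions lie under it, and the two border leaves stay as unresolved labels until they are needed.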




Summary: Overlapping Non-contiguous I/O

                      Our approach                           Locking-based
                                                             approaches
Data I/O phases       Parallel                               Serialized
Metadata I/O phases   Close to parallel thanks to            Serialized
                      1- Arbitrary ordering
                      2- Ordering at the metadata level
                      3- Client-side shadowing in parallel
                      4- Lazy evaluation




Leveraging Our Versioning-Oriented Interface in
Parallel I/O Stack


              Application (Visit, Tornado simulation)


                   Data model (HDF5, NetCDF)


                        MPI-IO middleware


               Storage optimized for atomic MPI-I/O


    Integrating BlobSeer into the MPI-I/O middleware is straightforward




Experimental Evaluation

• Our machines: Reservation on Grid'5000 platform
   – 80 nodes
   – Pentium 4 CPU @ 2.6 GHz, 4 GB RAM, Gigabit Ethernet
   – Measured bandwidth: 117.5 MB/s for MTU = 1500 B
• 3 sets of experiments:
   – Scalability of non-contiguous I/O
   – Scalability under concurrency
   – MPI-tile-I/O
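
For orientation, the measured 117.5 MB/s can be compared with a back-of-the-envelope upper bound for Gigabit Ethernet. The sketch assumes TCP over IPv4 without header options, standard Ethernet framing overhead, and MB meaning 10^6 bytes. The slides do not say how the bandwidth was measured, so this is only a plausibility check.

```python
LINK_BYTES_PER_S = 1_000_000_000 / 8  # 1 Gbit/s
MTU = 1500                             # bytes of IP packet per frame

# Per-frame Ethernet overhead: preamble + SFD (8), header (14),
# frame check sequence (4), inter-frame gap (12).
ETH_OVERHEAD = 8 + 14 + 4 + 12
IP_TCP_HEADERS = 20 + 20               # assumption: no IP or TCP options

payload = MTU - IP_TCP_HEADERS         # application bytes per frame
on_wire = MTU + ETH_OVERHEAD           # bytes of link time per frame
goodput_mb_s = LINK_BYTES_PER_S * payload / on_wire / 1e6

print(round(goodput_mb_s, 2))          # 118.66
print(round(117.5 / goodput_mb_s, 3))  # 0.99
```

Under these assumptions the measured figure is about 99% of what the link can carry, so the network is the natural per-node ceiling for the throughput experiments.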




Scalability of Non-contiguous I/O




Scalability Under Concurrency




MPI-tile-I/O: 128 KB Chunk Size




MPI-tile-I/O: 1 MB Chunk Size




Conclusion

• Experiments show promising results
   • We outperform locking-based approaches
   • Key features: shadowing, dedicated API for atomic non-contiguous I/O
   • Comparison to Lustre file system

• High throughput non-contiguous I/O under atomicity guarantees
• Future work
   • Exposing the versioning interface to MPI-I/O applications
   • Potential improvement for producer-consumer workflows
   • Pyramid: A large-scale array-oriented active storage system




Context




                    Application (Visit, Tornado simulation)

                         Data model (HDF5, NetCDF)

                              MPI-IO middleware

                             Parallel file systems



• Parallel file systems do not provide an atomic non-contiguous I/O interface




2nd challenge: Efficiency under concurrent
accesses
 Minimize ordering overhead
    Ordering is done at metadata level
    Arbitrary order

 Avoid synchronization for concurrent segment tree generation
    Delegate the generation of shadowing trees to the client side
    Shadowing trees are generated in parallel

 Lazy evaluation during border node calculation




