McMPI
Managed-code MPI library in Pure C#
Dr D Holmes, EPCC
dholmes@epcc.ed.ac.uk
Outline
• Yet another MPI library?
• Managed-code, C#, Windows
• McMPI, design and implementation details
• Object-orientation, design patterns,
communication performance results
• Threads and the MPI Standard
• Pre-“End Points proposal” ideas
Why Implement MPI Again?
• Parallel program, distributed memory => MPI library
• Most (all?) MPI libraries written in C
• MPI Standard provides C and FORTRAN bindings
• C++ can use the C functions
• Other languages can follow the C++ model
• Use the C functions
• Alternatively, MPI can be implemented in that language
• Removes inter-language function call overheads but …
• May not be possible to achieve comparable performance
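As an illustration of the inter-language overhead mentioned above (not taken from the slides), the sketch below shows what calling a native MPI library from C# typically involves, assuming MS-MPI's msmpi.dll with integer handle values and the standard C binding of MPI_SEND; a pure managed implementation such as McMPI avoids the managed-to-native transition and the buffer pinning on every call.

```csharp
// Hedged sketch only: a hypothetical P/Invoke binding to a native MPI library.
// The point is the per-call cost: a managed-to-native transition plus pinning
// of the managed buffer so the native code sees a stable address.
using System;
using System.Runtime.InteropServices;

static class NativeMpi
{
    // MPI_Send(buf, count, datatype, dest, tag, comm) from the C binding;
    // MS-MPI represents datatype and communicator handles as integers.
    [DllImport("msmpi.dll")]
    public static extern int MPI_Send(IntPtr buf, int count, int datatype,
                                      int dest, int tag, int comm);
}

static class InteropExample
{
    public static void SendBytes(byte[] data, int datatypeHandle, int dest,
                                 int tag, int commHandle)
    {
        GCHandle pin = GCHandle.Alloc(data, GCHandleType.Pinned);   // pin the managed array
        try
        {
            NativeMpi.MPI_Send(pin.AddrOfPinnedObject(), data.Length,
                               datatypeHandle, dest, tag, commHandle);
        }
        finally
        {
            pin.Free();                                             // always unpin
        }
    }
}
```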
Why Did I Choose C#?
• Experience and knowledge I gained from my career in
software development
• My impression of the popularity of C# in commercial
software development
• My desire to bridge the gap between high-performance
programming and high-productivity programming
• One of the UK research councils offered me funding for a
PhD that proposed to use C# to implement MPI
C# Myths
• C# only runs on Windows
• Not such a bad thing – 3 of the Top500 machines use Windows
• Not actually true – Mono works on multiple operating systems
• C# is a Microsoft language
• Not such a bad thing – resources, commitment, support, training
• Not actually true – C# follows ECMA and ISO standards
• C# is slow like Java
• Not such a bad thing – expressivity, readability, re-usability
• Not actually true – no easy way to prove this conclusively
• C# and its ilk are not things we need to care about
• Not such a bad thing – they will survive/thrive, or not, without us
• Not actually true – popularity trumps utility
McMPI Design & Implementation
• Desirable features of code
• Isolation of concerns -> easier to understand
• Human readability -> easier to maintain
• Compiler readability -> easier to get good performance
• Object-orientation can help with isolation of concerns
• So can modularisation and judiciously reducing LOC per code file
• Design patterns can help with human readability
• So can documentation and useful in-code comments
• Choice of language & compiler can help with performance
• So can coding style and detailed examination of compiler output
• What is the best compromise?
Communication Layer
• Abstract class factory design pattern
• Similar to plug-ins
• Enables addition of new functionality without re-compilation of the
rest of the library
• All communication modules:
• Implement the same Abstract Device Interface (ADI)
• Isolate the details of their implementation from other layers
• Provide the same semantics and capabilities
• Reliable delivery
• Ordering of delivery
• Preservation of message boundaries
• Message = fixed size envelope information and variable size user data
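A minimal sketch of how such an abstract-factory-based ADI could be expressed in C# (illustrative names only, not McMPI's actual types): every transport supplies a factory, and every module it creates promises the same reliable, ordered, boundary-preserving delivery of a fixed-size envelope plus variable-size user data.

```csharp
// Hedged sketch of an abstract-factory-style Abstract Device Interface (ADI).
public sealed class MessageEnvelope
{
    public int SourceRank, DestinationRank, Tag, PayloadLength;   // fixed-size envelope
}

public interface ICommunicationModule
{
    // Every module must provide reliable, ordered delivery that preserves
    // message boundaries (envelope + variable-size user data).
    void Send(MessageEnvelope envelope, byte[] userData);
    bool TryReceive(out MessageEnvelope envelope, out byte[] userData);
}

public abstract class CommunicationModuleFactory
{
    // New transports (shared memory, TCP, ...) plug in by supplying a factory;
    // the rest of the library is compiled only against these abstract types.
    public abstract ICommunicationModule CreateModule(string endpointConfiguration);
}
```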
Communication Layer – UML
Protocol Layer
• Bridge design pattern
• Enables addition of new functionality without re-compilation of the
rest of the library
• All protocol messages:
• Inherit from the same base class
• Isolate the details of their implementation from other layers
• Modify state of internal shared data structures independently
• Shared data structures (message ‘queues’)
• Unexpected queue – message envelope at receiver before receive
• Request queue – receive called before message envelope arrival
• Matched queue – at receiver waiting for message data to arrive
• Pending queue – message data waiting at sender
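The sketch below (again with illustrative names, reusing the MessageEnvelope type from the communication-layer sketch) shows the bridge idea: one base class for protocol messages, concrete eager/rendezvous messages inheriting from it, and the four shared matching queues they update independently.

```csharp
// Hedged sketch of the protocol layer's shared structures; not McMPI's real classes.
using System.Collections.Generic;

public abstract class ProtocolMessage
{
    public MessageEnvelope Envelope { get; protected set; }

    // Each concrete message type (eager, rendezvous request/reply, ...) decides
    // for itself which of the shared queues it reads or updates.
    public abstract void Progress(MatchingQueues queues);
}

public sealed class MatchingQueues
{
    // Envelope reached the receiver before the matching receive was posted.
    public readonly Queue<ProtocolMessage> Unexpected = new Queue<ProtocolMessage>();
    // Receive was posted before the matching envelope arrived.
    public readonly Queue<ProtocolMessage> Requests   = new Queue<ProtocolMessage>();
    // Matched at the receiver, waiting for the message data to arrive.
    public readonly Queue<ProtocolMessage> Matched    = new Queue<ProtocolMessage>();
    // Message data still waiting at the sender (e.g. for a rendezvous reply).
    public readonly Queue<ProtocolMessage> Pending    = new Queue<ProtocolMessage>();
}
```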
Protocol Layer – UML
Interface Layer
• Simple façade design pattern
• Translates MPI Standard-like syntax into protocol layer syntax
• Will become adapter design pattern
• For example, when custom data-types are implemented
• Current version of McMPI covers parts of MPI 1 only
• Initialisation and finalisation
• Administration functions, e.g. to get rank and size of communicator
• Point-to-point communication functions
• ready, synchronous, standard (not buffered)
• blocking, non-blocking, persistent
• Previous version had collectives
• Implemented on top of point-to-point
• Using hypercube or binary tree algorithms
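A minimal sketch of what the façade might look like (all names here, including the ProtocolEngine and Communicator stubs, are assumptions for illustration): the interface layer only translates MPI-Standard-like calls into protocol-layer calls, which is why a plain façade suffices until features such as derived datatypes force it to become an adapter.

```csharp
// Hedged sketch of the interface-layer façade; stub types stand in for the protocol layer.
public sealed class Communicator
{
    public int Rank { get; internal set; }
    public int Size { get; internal set; }
}

public sealed class ProtocolEngine
{
    public ProtocolEngine(string[] args) { /* create communication modules, build world */ }
    public void Shutdown() { /* drain queues, release modules */ }
    public void BlockingSend<T>(T[] buffer, int dest, int tag, Communicator comm) { /* ... */ }
    public void BlockingReceive<T>(T[] buffer, int source, int tag, Communicator comm) { /* ... */ }
}

public static class Mpi
{
    private static ProtocolEngine engine;

    public static void Init(string[] args) { engine = new ProtocolEngine(args); }
    public static void Finalise()          { engine.Shutdown(); }

    public static int CommRank(Communicator comm) { return comm.Rank; }
    public static int CommSize(Communicator comm) { return comm.Size; }

    // Standard-mode blocking point-to-point, forwarded unchanged to the protocol layer.
    public static void Send<T>(T[] buffer, int dest, int tag, Communicator comm)
    {
        engine.BlockingSend(buffer, dest, tag, comm);
    }

    public static void Recv<T>(T[] buffer, int source, int tag, Communicator comm)
    {
        engine.BlockingReceive(buffer, source, tag, comm);
    }
}
```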
McMPI Implementation Overview
Performance Results – Introduction 1
• Shared-memory results – hardware details
• Number of Nodes: 1 (Armari Magnetar server)
• CPUs per Node: 2 (Intel Xeon E5420)
• Threads per CPU: 4 (quad-core, no hyper-threading)
• Core Clock Speed: 2.5 GHz (front-side bus 1333 MHz)
• Level 1 Cache: 4 × 2 × 32 KB (data & instruction, per core)
• Level 2 Cache: 2 × 6 MB (one per pair of cores)
• Memory per Node: 16 GB DDR2 667 MHz
• Network Hardware: 2 × NIC, Intel 82575EB Gigabit Ethernet
• Operating System: WinXP Pro 64-bit with SP3 (version 5.2.3790)
Performance Results – Introduction 2
• Distributed-memory results – hardware details
• Number of Nodes: 18 (Dell PowerEdge 2900)
• CPUs per Node: 2 (Intel Xeon 5130, family 6, model 15, stepping 6)
• Threads per CPU: 2 (dual-core, no hyper-threading)
• Core Clock Speed: 2.0 GHz (front-side bus 1333 MHz)
• Level 1 Cache: 2 × 2 × 32 KB (data & instruction, per core)
• Level 2 Cache: 1 × 4 MB (one per CPU)
• Memory per Node: 4 GB DDR2 533 MHz
• Network Hardware: 2 × NIC, Broadcom BCM5708C NetXtreme II GigE
• Operating System: Windows Server 2008 x64 with SP2 (version 6.0.6002)
Shared-memory – Latency
[Chart: latency (µs, 0–6) vs message size (1 byte to 32,768 bytes) for MPICH2 shared memory, MS-MPI shared memory, and McMPI thread-to-thread]
Shared-memory – Bandwidth
[Chart: bandwidth (Mbit/s, 0–70,000) vs message size (4,096 bytes to 1,048,576 bytes) for McMPI thread-to-thread, MPICH2 shared-memory, and MS-MPI shared-memory]
Distributed-memory – Latency
[Chart: latency (µs, 0–600) vs message size (1 byte to 32,768 bytes) for McMPI Eager and MS-MPI]
Distributed-memory – Bandwidth
[Chart: bandwidth (Mbit/s, 0–1,000) vs message size (4,096 bytes to 1,048,576 bytes) for McMPI Rendezvous, McMPI Eager, and MS-MPI]
Thread-as-rank – Threading Level
• McMPI allows MPI_THREAD_AS_RANK as input for the
MPI_INIT_THREAD function
• McMPI creates new threads during initialisation
• Not needed – MPI_INIT_THREAD could instead be called once per participating thread
• McMPI uses thread-local storage to store ‘rank’
• Not needed – each communicator handle can encode ‘rank’
• Thread-to-thread message delivery is zero-copy
• Direct copy from user send buffer to user receive buffer
• Any thread can progress MPI messages
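The sketch below (hypothetical names, not the real McMPI API) illustrates the two "not needed" points: each participating thread calls the initialisation routine itself, and the handle it gets back encodes that thread's rank, so neither library-created threads nor thread-local storage are required.

```csharp
// Hedged sketch of a thread-as-rank style initialisation; illustrative only.
using System.Threading;

public sealed class WorldHandle
{
    public int Rank { get; private set; }   // rank of the thread that owns this handle
    public int Size { get; private set; }   // total ranks across all processes and threads
    internal WorldHandle(int rank, int size) { Rank = rank; Size = size; }
}

public static class ThreadAsRank
{
    private static int nextLocalThread = -1;

    // Called once by every participating thread (instead of the library creating
    // its own threads); the returned handle encodes the caller's rank, so no
    // thread-local storage is needed to recover it later.
    public static WorldHandle InitThread(int firstRankOfThisProcess, int totalRanks)
    {
        int localIndex = Interlocked.Increment(ref nextLocalThread);
        return new WorldHandle(firstRankOfThisProcess + localIndex, totalRanks);
    }
}
```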
Thread-as-rank – MPI Process
(Diagram created by Gaurav Saxena, MSc, 2013)
Thread-as-rank – MPI Standard
• Is thread-as-rank compliant with the MPI Standard?
• Does the MPI Standard allow/support thread-as-rank?
• Ambiguous/debatable at best
• The MPI Standard assumes MPI process = OS process
• Call MPI_INIT or MPI_INIT_THREAD twice in one OS process
• Erroneous by definition or results in two MPI processes?
• MPI Standard “thread compliant” prohibits thread-as-rank
• To maintain a POSIX-process-like interface for MPI process
• End-points proposal violates this principle in exactly the same way
• Other possible interfaces exist
Thread-as-rank – End-points
• Similarities
• Multiple threads can communicate reliably without using tags
• Thread ‘rank’ can be stored in thread-local storage or handles
• Most common use-case likely requires MPI_THREAD_MULTIPLE
• Differences
• Thread-as-rank part of initialisation and active until finalisation
• End-points created after initialisation and can be destroyed
• Thread-as-rank has all possible ranks in MPI_COMM_WORLD
• End-points only has some ranks in MPI_COMM_WORLD
• Thread-as-rank cannot create ranks but may need to merge ranks
• End-points can create ranks and does not need to merge ranks
Thread-as-rank – MPI Forum Proposal?
• Short answer: no
• Long answer: not yet, it’s complicated
• More likely to be suggested amendments to end-points proposal
• Thread-as-rank is a special case of end-points
• Standard MPI_COMM_WORLD replaced with an end-points
communicator during MPI_INIT_THREAD
• Thread-safety implications are similar (possibly identical?)
• Advantages/opportunities similar
• Thread-to-thread delivery rather than process-to-process delivery
• Work-stealing MPI progress engine or per-thread message queues
Questions?
