Ioannis.Charalampidis@cern.ch
Lazaros.Lazaridis@cern.ch
CERN, June 2016
libFabric
ofi://
Transport
nanomsg
ALFA
usNIC
FairMQ
Introduction
For the ALICE O2 upgrade, the simulation and reconstruction software for the ALICE experiment uses the ALFA [1] framework. Among other abstractions, the ALFA/FairROOT framework provides a message-queue abstraction library, FairMQ [2], which is a lightweight wrapper around the ØMQ and NanoMsg libraries.
Since the project’s goal is to implement a new transport for FairMQ, we decided to extend the functionality of one of these libraries. We chose NanoMsg [3] because of its clean and modular internals.
ØMQ
[Figure: RDMA* send/receive flow. On each side the buffer lies in a registered Memory Region. One side calls fi_send( endpoint, buffer, len, mr_desc, context ); and the other calls fi_recv( endpoint, buffer, len, mr_desc, context );. A SEND and an ACK cross the wire, and each side reads its Tx CQ and Rx CQ with fi_cq_read( &event );.]
* libfabric has custom event polling functions
usNIC + Kernel Bypass
One of the powerful features of the usNIC fabric is that, through the libFabric [4] library, user-space code can bypass the Linux kernel. This relieves the kernel of the IP stack overhead, freeing its CPU time for more useful operations.
The ofi:// Transport
The project is implemented as a patch to the NanoMsg sources [5] that introduces the Open Fabrics Interface (OFI) transport.
The transport translates the POSIX-like API of NanoMsg into an RDMA-like API for libFabric, transparently to the user. To achieve this, it uses a dynamic memory registration (MR) mechanism that tries to reduce the number of registrations performed while remaining agnostic of the user’s intentions.
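The idea of reusing registrations can be sketched as a small cache keyed by buffer range. This is a minimal illustration, not the transport's actual code: mr_cache_get, mock_register, the slot count, and the least-recently-used eviction are all assumptions made for the example, and the mock only counts registrations where a real transport would call into the fabric library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MR_CACHE_SLOTS 4 /* hypothetical cache size */

/* One cached registration: the address range it covers and a fake descriptor. */
struct mr_entry {
    uintptr_t base;
    size_t    len;
    int       desc;      /* stand-in for the descriptor a fabric would return */
    uint64_t  last_used; /* logical clock used for least-recently-used eviction */
};

struct mr_cache {
    struct mr_entry slots[MR_CACHE_SLOTS];
    size_t   used;
    uint64_t clock;
    int      registrations; /* number of (expensive) registrations performed */
};

/* Hypothetical stand-in for the real, expensive registration call. */
static int mock_register(struct mr_cache *c)
{
    return ++c->registrations;
}

/* Return a descriptor covering [buf, buf + len), registering only on a miss. */
static int mr_cache_get(struct mr_cache *c, const void *buf, size_t len)
{
    assert(c != NULL && buf != NULL && len > 0);
    uintptr_t lo = (uintptr_t)buf;
    struct mr_entry *victim = NULL;

    c->clock++;
    for (size_t i = 0; i < c->used; i++) {
        struct mr_entry *e = &c->slots[i];
        /* Hit: the requested range lies inside an already registered one. */
        if (lo >= e->base && len <= e->len && lo - e->base <= e->len - len) {
            e->last_used = c->clock;
            return e->desc;
        }
        /* Remember the least recently used entry in case we must evict. */
        if (victim == NULL || e->last_used < victim->last_used)
            victim = e;
    }

    /* Miss: take a free slot if there is one, otherwise reuse the victim.
     * A real transport would deregister the evicted region at this point. */
    if (c->used < MR_CACHE_SLOTS)
        victim = &c->slots[c->used++];
    victim->base = lo;
    victim->len = len;
    victim->desc = mock_register(c);
    victim->last_used = c->clock;
    return victim->desc;
}
```

With a zero-initialised struct mr_cache, sending twice from the same buffer (or from a sub-range of it) triggers a single registration, while a different buffer triggers a second one. The caller never states whether it intends to reuse a buffer, which is what being agnostic of the user's intentions amounts to in this sketch.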
1. Technical Design Report for the Upgrade of the Online–Offline Computing System, The ALICE Collaboration
2. https://github.com/FairRootGroup/FairRoot/tree/master/fairmq
3. http://nanomsg.org
4. http://ofiwg.github.io/libfabric
5. https://github.com/wavesoft/nanomsg-transport-ofi
6. https://github.com/wavesoft/robob
7. https://github.com/ofiwg/libfabric/issues?q=author%3Awavesoft
8. https://github.com/nanomsg/nanomsg/pull/612
In a similar manner, it uses the high-level libFabric polling API instead of the FD-based NanoMsg polling API, making it possible to support any fabric without modification.
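The polling style mentioned above can be sketched with a mock completion queue. Everything here (mock_cq, cq_push, cq_read, cq_drain, CQ_EAGAIN, the ring capacity) is hypothetical and stands in for the fabric library. The only behaviour borrowed from libfabric is the convention that a completion-queue read returns the number of entries read, or a negative "try again" code when nothing is ready.

```c
#include <assert.h>
#include <stddef.h>

#define CQ_CAPACITY 8     /* hypothetical ring size */
#define CQ_EAGAIN   (-11) /* hypothetical "nothing ready yet" code */

/* A completion: which operation finished and how many bytes it moved. */
struct cq_event {
    void  *context; /* pointer supplied when the operation was posted */
    size_t len;
};

struct mock_cq {
    struct cq_event ring[CQ_CAPACITY];
    size_t head, tail; /* head: next slot to read; tail: next slot to write */
};

/* Provider side: record a finished operation. Returns 0, or -1 if full. */
static int cq_push(struct mock_cq *cq, void *context, size_t len)
{
    assert(cq != NULL);
    if (cq->tail - cq->head == CQ_CAPACITY)
        return -1;
    cq->ring[cq->tail % CQ_CAPACITY] = (struct cq_event){ context, len };
    cq->tail++;
    return 0;
}

/* Application side: non-blocking read of up to `max` completions. */
static int cq_read(struct mock_cq *cq, struct cq_event *out, size_t max)
{
    assert(cq != NULL && out != NULL);
    if (cq->head == cq->tail)
        return CQ_EAGAIN;
    size_t n = 0;
    while (n < max && cq->head != cq->tail) {
        out[n++] = cq->ring[cq->head % CQ_CAPACITY];
        cq->head++;
    }
    return (int)n;
}

/* Drain the queue and return the total number of completed bytes. */
static size_t cq_drain(struct mock_cq *cq)
{
    struct cq_event ev;
    size_t total = 0;
    while (cq_read(cq, &ev, 1) == 1)
        total += ev.len;
    return total;
}
```

The point of the sketch is that the application never waits on a file descriptor: it asks the queue directly, and the context pointer in each event tells it which send or receive has completed.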
True Zero-Copy
One of the implementation requirements of the OFI transport was to ensure that no memcpy operations take place between the user’s request and the transfer on the wire. That is reasonable if you consider that the message sizes vary from 50Mb to 1Gb.
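The difference between a copying and a zero-copy send path can be illustrated with a toy model. The names wire_desc, send_with_copy, send_zero_copy, and the bytes_copied counter are invented for this sketch and are not part of NanoMsg or libFabric. The model only shows where the memcpy would occur and what the descriptor handed to the wire points at.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* What gets handed to the (mock) wire: a view of some memory. */
struct wire_desc {
    const void *data;
    size_t      len;
};

static size_t bytes_copied; /* instrumentation for the sketch only */

/* Copying path: stage the payload in a transport-owned buffer first.
 * The caller must free() the returned data pointer after the transfer. */
static struct wire_desc send_with_copy(const void *user_buf, size_t len)
{
    assert(user_buf != NULL && len > 0);
    void *staging = malloc(len);
    if (staging == NULL)
        abort();
    memcpy(staging, user_buf, len);
    bytes_copied += len;
    return (struct wire_desc){ staging, len };
}

/* Zero-copy path: the descriptor points straight at the user's buffer,
 * so the buffer must stay valid until the transfer has completed. */
static struct wire_desc send_zero_copy(const void *user_buf, size_t len)
{
    assert(user_buf != NULL && len > 0);
    return (struct wire_desc){ user_buf, len };
}
```

In the copying path the cost grows with the message size, which is why it matters for messages as large as the ones quoted above. The zero-copy path trades that cost for a lifetime constraint on the user's buffer.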
By-Products
A useful by-product of this project was the development of robob [6], a fully automated benchmarking utility for ensuring the quality of the measured values.
Outcomes
We frequently encountered roadblocks while working on this project, since we were using new products and open-source components. We had frequent interactions with NanoMsg and Cisco developers [7], and we contributed our own modifications [8].
Nonetheless, we managed to create a prototype with which we demonstrated the feasibility of the transport and its performance.
On the right we present some preliminary measurements using the OFI transport between two Intel Xeon E5-2690 machines with UCSC-PCIE-C40Q NICs, connected through a switch with a 40Gbit copper cable.
[Figure: Throughput (GB/s) for Different Message Sizes. Vertical axis from 0 to 35; message sizes from 8192 to 134217728; series: OFI [GBit/s], TCP [Gbit/s], ØMQ [Gbit/s].]
www.cern.ch/openlab
Poster by Ioannis Charalampidis. Special thanks to our supervisor, Predrag Buncic, to Artur Barczyk for his guidance, and to Mohammad Al-Turany and Peter Hristov.