LA-UR 10-05188
Implementation & Comparison of RDMA Over Ethernet
Students: Lee Gaiser, Brian Kraus, and James Wernicke
Mentors: Andree Jacobson, Susan Coulter, Jharrod LaFon, and Ben McClelland
Summary
• Background
• Objective
• Testing Environment
• Methodology
• Results
• Conclusion
• Further Work
• Challenges
• Lessons Learned
• Acknowledgments
• References & Links
• Questions
Background: Remote Direct Memory Access (RDMA)
RDMA provides high-throughput, low-latency networking:
• Reduces consumption of CPU cycles
• Reduces communication latency
Images courtesy of http://www.hpcwire.com/features/17888274.html
Background: InfiniBand
InfiniBand is a switched-fabric communication link designed for HPC:
• High throughput
• Low latency
• Quality of service
• Failover
• Scalability
• Reliable transport
How do we interface this high-performance link with existing Ethernet infrastructure?
Background: RDMA over Converged Ethernet (RoCE)
• Provides InfiniBand-like performance and efficiency on ubiquitous Ethernet infrastructure.
• Uses the same transport and network layers as the IB stack and swaps the link layer for Ethernet.
• Implements IB verbs over Ethernet.
• Not quite IB strength, but it's getting close.
• As of OFED 1.5.1, code written against the OFED RDMA (verbs) API works with RoCE without modification (see the sketch below).
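To illustrate that portability claim, here is a minimal sketch (not from the original slides) using the libibverbs API: the same device enumeration and port query work whether the device is an InfiniBand HCA or a RoCE-capable 10GbE NIC. The `link_layer` port attribute, introduced around OFED 1.5.1, is assumed here as the field that tells the two apart.

```c
/* Minimal sketch (illustrative, not from the slides): enumerate RDMA devices
 * with libibverbs. The identical verbs calls bind to an IB HCA or a
 * RoCE-capable Ethernet NIC; only the reported link layer differs. */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs) {
        perror("ibv_get_device_list");
        return 1;
    }

    for (int i = 0; i < num; i++) {
        struct ibv_context *ctx = ibv_open_device(devs[i]);
        if (!ctx)
            continue;

        struct ibv_port_attr port;
        if (ibv_query_port(ctx, 1, &port) == 0)
            printf("%s: link_layer=%s state=%d\n",
                   ibv_get_device_name(devs[i]),
                   port.link_layer == IBV_LINK_LAYER_ETHERNET ? "Ethernet" : "InfiniBand",
                   port.state);

        ibv_close_device(ctx);
    }

    ibv_free_device_list(devs);
    return 0;
}
```

Built with `gcc -libverbs`, the same binary runs unchanged on the IB, RoCE, and Soft RoCE test nodes; only the reported link layer changes.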
Objective
We would like to answer the following questions:
• What kind of performance can we get out of RoCE on our cluster?
• Can we implement RoCE in software (Soft RoCE), and how does it compare with hardware RoCE?
Testing Environment
Hardware:
• HP ProLiant DL160 G6 servers
• Mellanox MNPH29B-XTC 10GbE adapters
• 50/125 OFNR cabling
Operating System:
• CentOS 5.3, 2.6.32.16 kernel
Software/Drivers:
• OpenFabrics Enterprise Distribution (OFED) 1.5.2-rc2 (RoCE) & 1.5.1-rxe (Soft RoCE)
• OSU Micro Benchmarks (OMB) 3.1.1
• OpenMPI 1.4.2
Methodology
• Set up a pair of nodes for each technology: IB, RoCE, Soft RoCE, and no RDMA.
• Install, configure, and run minimal services on the test nodes to maximize machine performance.
• Directly connect the nodes to maximize network performance.
• Acquire latency benchmarks: OSU MPI Latency Test (a sketch of its ping-pong pattern follows this list).
• Acquire bandwidth benchmarks: OSU MPI Uni-Directional Bandwidth Test and OSU MPI Bi-Directional Bandwidth Test.
• Script it all to perform many repetitions.
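For reference, the OSU latency test measures a simple MPI ping-pong between two ranks. Below is a minimal sketch of that pattern (an illustration, not the OSU benchmark source; message size, iteration count, and warm-up handling are simplified).

```c
/* Minimal MPI ping-pong latency sketch (illustrative; not the OSU code).
 * Rank 0 sends a fixed-size message to rank 1 and waits for the echo;
 * half the average round-trip time approximates one-way latency. */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    const int iters = 1000;
    const int size = 128;            /* message size in bytes */
    char *buf = malloc(size);
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }

    double t1 = MPI_Wtime();
    if (rank == 0)
        printf("%d bytes: %.2f us one-way\n",
               size, (t1 - t0) * 1e6 / (2.0 * iters));

    free(buf);
    MPI_Finalize();
    return 0;
}
```

With the setup above, each test is launched with mpirun across the two directly connected nodes and repeated many times, as the methodology describes.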
Results: Latency
Results: Uni-directional Bandwidth
Results: Bi-directional Bandwidth
Results: Analysis
RoCE performance gains over 10GbE:
• Up to 5.7x speedup in latency
• Up to 3.7x increase in bandwidth
IB QDR vs. RoCE:
• IB is less than 1 µs faster than RoCE at a 128-byte message size.
• IB peak bandwidth is 2-2.5x greater than RoCE.

Conclusion
RoCE is capable of providing near-InfiniBand QDR performance for:
• Latency-critical applications at message sizes from 128 B to 8 KB
• Bandwidth-intensive applications for messages <1 KB
Soft RoCE is comparable to hardware RoCE at message sizes above 65 KB.
Soft RoCE can improve performance where RoCE-enabled hardware is unavailable.
Further Work & Questions
• How does RoCE perform over collectives?
• Can we further optimize the RoCE configuration to yield better performance?
• Can we stabilize the Soft RoCE configuration?
• How much does Soft RoCE affect the compute nodes' ability to perform?
• How does RoCE compare with iWARP?
Challenges
• Finding an OS that works with OFED & RDMA:
  • Fedora 13 was too new.
  • Ubuntu 10 wasn't supported.
  • CentOS 5.5 was missing some drivers.
• Had to compile a new kernel with IB/RoCE support.
• Built OpenMPI 1.4.2 from source, but it wasn't configured for RDMA; used the OpenMPI 1.4.1 supplied with OFED instead.
• The machines communicating via Soft RoCE frequently lock up during the OSU bandwidth tests.

Lessons Learned
• Installing and configuring HPC clusters
• Building, installing, and fixing the Linux kernel, modules, and drivers
• Working with IB, 10GbE, and RDMA technologies
• Using tools such as OMB 3.1.1 and netperf for benchmarking performance
References & Links
• Subramoni, H., et al. RDMA over Ethernet – A Preliminary Study. OSU. http://nowlab.cse.ohio-state.edu/publications/conf-presentations/2009/subramoni-hpidc09.pdf
• Feldman, M. RoCE: An Ethernet-InfiniBand Love Story. HPCwire.com. April 22, 2010. http://www.hpcwire.com/blogs/RoCE-An-Ethernet-InfiniBand-Love-Story-91866499.html
• Woodruff, R. Access to InfiniBand from Linux. Intel. October 29, 2009. http://software.intel.com/en-us/articles/access-to-infiniband-from-linux/
• OFED 1.5.2-rc2: http://www.openfabrics.org/downloads/OFED/ofed-1.5.2/OFED-1.5.2-rc2.tgz
• OFED 1.5.1-rxe: http://www.systemfabricworks.com/pub/OFED-1.5.1-rxe.tgz
• OMB 3.1.1: http://mvapich.cse.ohio-state.edu/benchmarks/OMB-3.1.1.tgz

Editor's Notes

  • #2: Introduce yourself, the institute, your teammates, and what you’ve been working on for the past two months.
  • #3: Rephrase these bullet points with a little more elaboration.
  • #4: Emphasize how RDMA eliminates unnecessary communication.
  • #5: Explain that we are using IB QDR and what that means.
  • #6: Emphasize that the biggest advantage of RoCE is latency, not necessarily bandwidth. Talk about 40Gb & 100Gb Ethernet on the horizon.
  • #9: OSU benchmarks were more appropriate than netperf
  • #10: The highlight here is that latency between IB & RoCE differs by only 1.7 µs at 128-byte messages, and stays very close up through 4K messages. Also notice that the latencies of RoCE and no RDMA converge at larger message sizes.
  • #11: Note that RoCE & IB are very close up to a 1K message size. IB QDR peaks out at about 3 GB/s, RoCE at 1.2 GB/s.
  • #12: Note that the bi-directional bandwidth trends are similar to the uni-directional bandwidths. Explain that Soft RoCE could not complete this test. IB QDR peaks at 5.5 GB/s and RoCE peaks at 2.3 GB/s.