ETHERNET FABRIC
MAKING CC WORK FOR LOW LATENCY
HIGH BANDWIDTH UNDER INCAST
A QUICK TOUR OF THE TRANSITION FROM SDP TO TCP
OVERVIEW
▸ What is SDP and why did it work well as a fabric?
▸ Why TCP?
▸ What are the problems with default TCP?
▸ Incast
▸ Slow start and delayed acks
▸ NewReno sawtooth - ECN and DCTCP
▸ What we did to implement the research suggestions
▸ Several overlooked implications
▸ Performance in practice
SDP - IN THE BEGINNING THERE WAS INFINIBAND
SDP - SOCKETS DIRECT PROTOCOL
▸ “The Sockets Direct Protocol (SDP) is a transport-agnostic
protocol to support stream sockets over Remote Direct Memory
Access (RDMA) network fabrics” - Wikipedia
▸ All congestion control, queueing, and pacing handled in
hardware/firmware
▸ Provides guaranteed delivery, low latency, and (in theory)
reduced CPU overhead
▸ On paper provides the perfect solution for a hardware agnostic,
robust, low-latency cluster interconnect
THE GOOD - COST
WHY TCP? ETHERNET - OBVIOUSLY
▸ “It’s a single set of skills, plugs and protocols to rule them
all. Perhaps it’s time to create the second law of Metcalfe:
Never bet against Ethernet.” NetworkWorld Aug 15, 2006
▸ Multiple Vendors
▸ Very commoditized below 40 Gbps
▸ Fewer firmware issues
▸ Much easier to understand failure modes
THE BAD & UGLY - UTILIZATION AND LOSS RECOVERY
WHAT’S WRONG WITH TCP?
▸ Incast - catastrophic loss under highly synchronized fan-in
▸ Slow start - recovery too conservative
▸ Delayed ACKs - 100 ms is a very long time when the RTT is
50 µs
▸ NewReno can’t reach consistent utilization
▸ Partially mitigated by ECN
▸ DCTCP may be a long-term fix
PROBLEMS UNIQUE TO A MESH
INCAST - DEFINITION AND MITIGATION
▸ Term coined in [PANFS] for the case where the number of simultaneously
initiated, effectively barrier-synchronized, fan-in flows into a single port
grows to the point where the instantaneous switch / NIC buffering capacity is
exceeded, causing aggregate bandwidth to decline as the need for retransmits
increases. This is further exacerbated by tail-drop behavior in the switch,
whereby multiple losses within individual streams exceed the recovery abilities
of duplicate ACKs or SACK, leading to RTOs before the flow is resumed.
▸ Solution: scale RTO continuously
▸ Make retransmit timer higher resolution
▸ Make RTT calculation higher resolution
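A minimal sketch of the "scale RTO continuously" idea above, assuming an RFC 6298-style estimator kept in microseconds; the 200 µs floor and all names here are illustrative assumptions, not the talk's actual patch:

/*
 * Sketch (not the talk's actual code) of "scale RTO continuously": an
 * RFC 6298-style estimator kept in microseconds, so the retransmit
 * timeout tracks a ~50 us RTT instead of a coarse tick-based minimum.
 */
#include <stdint.h>

#define RTO_MIN_US  200         /* hypothetical floor for a ~50 us RTT fabric */
#define RTO_MAX_US  60000000    /* 60 s upper bound */

struct rtt_state {
    int64_t srtt_us;            /* smoothed RTT */
    int64_t rttvar_us;          /* RTT variance */
};

/* Feed one RTT sample (in microseconds) and return the new RTO. */
static int64_t
rto_update(struct rtt_state *rs, int64_t sample_us)
{
    if (rs->srtt_us == 0) {                     /* first sample (RFC 6298 2.2) */
        rs->srtt_us = sample_us;
        rs->rttvar_us = sample_us / 2;
    } else {                                    /* later samples (RFC 6298 2.3) */
        int64_t delta = rs->srtt_us - sample_us;
        if (delta < 0)
            delta = -delta;
        rs->rttvar_us = (3 * rs->rttvar_us + delta) / 4;    /* beta  = 1/4 */
        rs->srtt_us   = (7 * rs->srtt_us + sample_us) / 8;  /* alpha = 1/8 */
    }
    int64_t rto = rs->srtt_us + 4 * rs->rttvar_us;
    if (rto < RTO_MIN_US)
        rto = RTO_MIN_US;
    if (rto > RTO_MAX_US)
        rto = RTO_MAX_US;
    return rto;
}

With microsecond granularity the computed RTO can follow a ~50 µs fabric RTT directly rather than being pinned to a tick-based minimum of tens or hundreds of milliseconds.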
BURSTINESS & LOSS
“SLOW START” - RENEGOTIATING THE LINK
▸ Under loss the window drops to 1 segment - given that we know the RTT
and the maximum peer bandwidth, this should really be a larger
value determined by the number of active peers (i.e. roughly
RTT * (max BW / #peers), the per-peer bandwidth-delay product; see the sketch after this list)
▸ LRO - stretch ACKs are credited as a single ACK per the RFC - Linux avoids ACK division
while still growing the window more rapidly by counting the total
bytes acknowledged
▸ Delayed ACKs (100 ms) - in slow start this can delay window
growth; at other times it can create an artificially
elevated RTO
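A sketch of the restart-window idea from the first bullet above; all names, and the choice to clamp the restart window to the per-peer bandwidth-delay product, are assumptions for illustration rather than the talk's actual change:

/*
 * Sketch: after loss, restart from a per-peer share of the
 * bandwidth-delay product instead of collapsing the window to one
 * segment.  Names and exact policy are illustrative assumptions.
 */
#include <stdint.h>

/* Bytes of window needed to sustain (max_bw / n_peers) at the given RTT. */
static uint32_t
restart_cwnd(uint64_t max_bw_bytes_per_sec, uint32_t n_peers,
             uint32_t rtt_us, uint32_t mss)
{
    uint64_t per_peer_bw = max_bw_bytes_per_sec / n_peers;
    uint64_t bdp = per_peer_bw * rtt_us / 1000000;  /* bytes in flight */

    if (bdp < mss)
        bdp = mss;                      /* never restart below one segment */
    return (uint32_t)(bdp - bdp % mss); /* round down to a multiple of MSS */
}

For example, at 40 Gb/s shared by 8 peers with a 50 µs RTT and a 1448-byte MSS this gives about 21 segments (~30 KB) instead of 1.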
CONGESTION CONTROL AND UTILIZATION
THE NEWRENO SAWTOOTH
▸ ECN
▸ when a CE mark is seen we only reduce the congestion window by half rather
than resetting it to 1
▸ still causes the window to bounce between two values even at a stable “steady
state”
▸ DCTCP
▸ allows a continuous scaling of the congestion window - each CE only
reduces the congestion window by a small fraction
▸ interoperability issues due to redefining the ECN semantics and setting the
switch’s ECN marking thresholds (min/max) to the same value K
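A sketch contrasting the two reactions above; the DCTCP update follows the published algorithm (alpha smoothed with gain g, window cut by alpha/2), but the fixed-point scaling and names are assumptions:

/*
 * Sketch of the two reactions to CE marks described above.  The DCTCP
 * update follows the published algorithm (alpha <- (1-g)*alpha + g*F,
 * cwnd <- cwnd * (1 - alpha/2)); fixed-point scaling and names are
 * illustrative assumptions.
 */
#include <stdint.h>

#define DCTCP_G_SHIFT 4             /* g = 1/16, a common choice */
#define DCTCP_SCALE   1024          /* alpha and F scaled by 1024 */

struct dctcp_state {
    uint32_t alpha_q;               /* alpha * DCTCP_SCALE */
};

/* Classic ECN: one CE-marked round trip halves the window. */
static uint32_t
ecn_react(uint32_t cwnd, uint32_t mss)
{
    uint32_t newcw = cwnd / 2;
    return newcw < mss ? mss : newcw;
}

/* DCTCP: at the end of each window, f_q is the fraction of bytes that
 * were CE-marked, scaled by DCTCP_SCALE. */
static uint32_t
dctcp_react(struct dctcp_state *ds, uint32_t cwnd, uint32_t mss, uint32_t f_q)
{
    /* alpha <- (1 - g) * alpha + g * F */
    ds->alpha_q = ds->alpha_q - (ds->alpha_q >> DCTCP_G_SHIFT)
                  + (f_q >> DCTCP_G_SHIFT);
    /* cwnd <- cwnd * (1 - alpha / 2) */
    uint32_t cut = (uint32_t)(((uint64_t)cwnd * ds->alpha_q) / (2 * DCTCP_SCALE));
    uint32_t newcw = cwnd - cut;
    return newcw < mss ? mss : newcw;
}

When only a small fraction of bytes is marked, alpha stays small and the window is trimmed by a few percent per round trip, rather than bouncing between half and full as with classic ECN.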
IMPLEMENTATION
WHAT WE DID FOR INCAST
▸ Separating callout scheduling granularity from the hardclock
frequency, and fixing callouts so that the timer interrupt can be
cleared when rescheduling
▸ High-resolution TCP timestamps - moving from 1 kHz / tick-
based to TSC-based at ~16 MHz (~64 ns / increment, the
fastest allowable for a 120 s maximum segment lifetime)
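A sketch of the TSC-based timestamp clock described above (hypothetical helper and calibration value, not the actual FreeBSD change); 2^31 ticks at ~16 MHz is about 134 s, which keeps PAWS valid for a 120 s maximum segment lifetime:

/*
 * Sketch (hypothetical helper): derive TCP timestamp ticks from the TSC
 * so one tick is ~64 ns (~16 MHz) instead of the 1 kHz hardclock tick.
 */
#include <stdint.h>
#include <x86intrin.h>                  /* __rdtsc() */

#define TS_HZ   16000000ULL             /* ~16 MHz timestamp clock */

static uint64_t tsc_hz = 2400000000ULL; /* calibrated at boot in practice */

/* Current TCP timestamp value, truncated to 32 bits as the option requires. */
static uint32_t
tcp_ts_getticks(void)
{
    uint64_t cycles_per_tick = tsc_hz / TS_HZ;  /* ~150 cycles at 2.4 GHz */
    return (uint32_t)(__rdtsc() / cycles_per_tick);
}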
PROBLEMS
UNFORESEEN PROBLEMS PART 1
▸ Cached timers don’t scale down while maintaining monotonicity
▸ scheduling of hardclock doesn’t provide consistent updates
▸ solution: use a TSC-based timer even if it is more expensive; if no invariant TSC is
available, restrict measurement values to integral multiples of ticks
▸ Idle connections stop working when the timer wraps
▸ If a connection has been idle for longer than it takes the timestamp
counter to wrap, the peer will see all segments as coming “before” the last
segment sent prior to the idle period
▸ Solution: LIE. Increment the last timestamp value sent by a large value on each
segment until the value sent to the peer catches up with the actual value.
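A sketch of the "lie" described above, with illustrative names and step size; the real change would live in the output path's timestamp option handling:

/*
 * Sketch of the "lie": if the timestamp clock wrapped while the
 * connection was idle, keep advancing the value we put on the wire by a
 * large forward step each segment until it catches up with the real
 * clock, so the peer never sees timestamps go backward and PAWS keeps
 * accepting our segments.
 */
#include <stdint.h>

#define TS_CATCHUP_STEP  0x10000000U    /* 2^28: large, but < 2^31 per step */

struct ts_state {
    uint32_t ts_last_sent;              /* last timestamp value sent */
};

static uint32_t
ts_to_send(struct ts_state *ts, uint32_t now_ticks)
{
    /* Signed 32-bit comparison: negative means "now" appears older than
     * what we last sent, i.e. the counter wrapped during the idle period. */
    if ((int32_t)(now_ticks - ts->ts_last_sent) < 0) {
        ts->ts_last_sent += TS_CATCHUP_STEP;    /* lie, but monotonically */
        return ts->ts_last_sent;
    }
    ts->ts_last_sent = now_ticks;               /* caught up: tell the truth */
    return now_ticks;
}

Each step is smaller than 2^31, so every lied value still looks newer than the previous one to the peer, and after at most sixteen segments the sent value has advanced far enough to rejoin the real clock.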
PROBLEMS
UNFORESEEN PROBLEMS PART 2
▸ SRTT & RTTVAR have much too short a memory - outlier values are quickly forgotten, leading to
frequent spurious retransmits
▸ First solution - taken from RFC 7323 App. G:
▸ RTTVAR <- (1 - beta’) * RTTVAR + beta’ * |SRTT - R’|
▸ SRTT <- (1 - alpha’) * SRTT + alpha’ * R’
▸ alpha’ = alpha / ExpectedSamples; beta’ = beta / ExpectedSamples;
ExpectedSamples = ceil(FlightSize / (SMSS * 2))
▸ Problem: when the pipe is empty or cwnd is reset after loss, we return to having a very short
memory
▸ Second solution: ExpectedSamples = ceil(cwnd / (SMSS * 2))
▸ Problem: when cwnd is reset after loss we still return to a very short memory
▸ Final solution: ExpectedSamples = ceil(max(cwnd, cwnd_prev) / (SMSS * 2))
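A sketch of the final solution above, in floating point for clarity; the names and the use of floating point are assumptions, and a kernel version would use fixed point:

/*
 * Scale the RFC 7323 App. G gains by the number of ACKs (and thus RTT
 * samples) expected per window, using max(cwnd, previous cwnd) so a
 * loss-induced cwnd reset does not shorten the estimator's memory.
 */
#include <math.h>
#include <stdint.h>

#define TCP_ALPHA  0.125    /* RFC 6298 gains */
#define TCP_BETA   0.25

struct rtt_est {
    double   srtt;
    double   rttvar;
    uint32_t cwnd_prev;     /* cwnd before the most recent reduction */
};

static void
rtt_sample(struct rtt_est *e, double r, uint32_t cwnd, uint32_t smss)
{
    uint32_t big_cwnd = cwnd > e->cwnd_prev ? cwnd : e->cwnd_prev;
    /* With delayed ACKs, expect roughly one sample per 2 * SMSS of cwnd. */
    double expected = ceil((double)big_cwnd / (double)(smss * 2));
    double alpha = TCP_ALPHA / expected;
    double beta  = TCP_BETA  / expected;

    e->rttvar = (1.0 - beta)  * e->rttvar + beta  * fabs(e->srtt - r);
    e->srtt   = (1.0 - alpha) * e->srtt   + alpha * r;
}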
THE PRODUCT
HOW DID IT TURN OUT?
▸ Target streaming throughput was 15 GB/s on 4 nodes
▸ Actually achieved 16-20 GB/s, with lower CPU utilization
than InfiniBand (!?)
▸ Next generation (2017-) will be a 100 GigE fabric
