Demystifying Data Center CLOS networks
Igor Gashinsky, Yihua He
June 2019
Overview of Yahoo’s Data Center Fabric v3 (2012-2018)
Agenda
1. What's a CLOS?
2. Different CLOS topologies
3. Gory Details of our topology
4. Scaling
5. Automation
6. Operations
What’s a CLOS?
● “A Study of Non-blocking Switching Networks,” Bell System Technical Journal, 1953 (see the condition below)
○ multistage switching network
○ ingress stage, the middle stage, and the egress stage
○ can be recursive!
● What everyone calls a “Leaf-Spine” design
● Beneš network
[Diagram: three-stage Clos network with ingress, middle, and egress switch stages; every ingress switch connects to every middle switch, and every middle switch to every egress switch]
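For reference, the headline result of the 1953 paper: a three-stage network in which each ingress switch has n inputs and there are m middle-stage switches is strict-sense non-blocking when

m ≥ 2n − 1

since a new connection can find at most n − 1 middle switches occupied from its ingress switch and at most n − 1 occupied toward its egress switch, which always leaves one middle switch free.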
Different CLOS topologies
● Not surprisingly, most modern network hardware already uses some type of CLOS internally
○ Chassis Routers/Switches
○ Linecards
Virtual Chassis
● Spine = Fabric Card
● Leaf = Linecard
● Internal interconnect = PCB x-bar
Virtual Linecard
● 32 TORs = 32 ports
● FAB(ric) switches = Fabric Chips
● Connect to a common fabric
[Diagram: TORs 1-32 (the virtual linecard's "ports"), each connected to fabric switches FAB1, FAB2, ..., FABN]
NANOG 40 - Feb 2008
Fast Forward to 2012
Multidimensional Folded Clos Fabric
● First deployed in 2012
● 1k, 5k, 10k, 20k Node Cluster Sizes in production
○ scales to 40k, 80k, 160k, 320k nodes in a single cluster
○ blast-domain vs. scale tradeoff
● Clusters interconnected with a common East-West fabric layer
● Old: 10G to Server, 40G Core
● New: 25G to Server, 100G Core
● Layer 3, dual stack IPv4/IPv6, BGP-based
● Fully automated provisioning, self-healing system
High-level Topology
[Diagram: Clusters A through N, each with egress routers EGR1, EGR2, ..., EGRN, all attached to a shared fabric layer FAB-1 ... FAB-N that connects upward to the DC WAN & Agg layers]
High-level - Cluster
[Diagram: TOR#1, TOR#2, ..., TOR#n, each uplinked to all four Virtual Chassis VC #1-VC #4]
Looks familiar?
Physical layout
First Prototype - 2011
Actual picture
TOR perspective
● Each TOR uses a unique private ASN
● Each TOR eBGP-peers with a single LEF on each of the N Virtual Chassis (config sketch below)
● TORs have network statements for Lo0 and all host subnets
[Diagram: VC #1-VC #4, each AS64512; TOR#1 = AS64513, TOR#2 = AS64514, TOR#3 = AS64515, ..., TOR#n = AS(64512+n)]
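To make the peering scheme concrete, here is a minimal sketch, written in Python in the spirit of the template-driven automation covered later, that renders an IOS-style BGP stanza for a TOR. The addresses, masks, and subnet plan are hypothetical illustrations, not the production values.

```python
# Hypothetical illustration of the TOR peering scheme on this slide:
# TOR n gets ASN 64512 + n and one eBGP session to a LEF in each VC.

def tor_bgp_config(n: int, num_vcs: int = 4) -> str:
    lines = [f"router bgp {64512 + n}"]        # unique private ASN per TOR
    for vc in range(1, num_vcs + 1):
        lef_ip = f"10.{vc}.{n}.1"              # assumed p2p address of the LEF in VC #vc
        lines.append(f" neighbor {lef_ip} remote-as 64512")  # every VC is AS64512
    # Network statements for Lo0 and the host subnet behind this TOR.
    lines.append(f" network 10.255.0.{n} mask 255.255.255.255")
    lines.append(f" network 10.100.{n}.0 mask 255.255.255.0")
    return "\n".join(lines)

print(tor_bgp_config(n=1))   # emits the stanza for TOR#1 (AS64513)
```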
Inside the Virtual Chassis
● Each Virtual Chassis uses a private ASN
● SPNs act as route reflectors for the LEFs (config sketch below)
● SPN-LEF iBGP sessions are point-to-point
○ update-source is the local interface
● SPNs have network statements for all SPN-LEF interconnects
● LEFs have network statements for LEF-TOR interconnects
● LEFs use next-hop-self on the SPN-LEF iBGP sessions
● All BGP next-hop addresses are learned via BGP
○ There is no IGP inside the VC
● All Virtual Chassis use the same ASN
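Similarly, a minimal sketch of both ends of one SPN-LEF iBGP session, showing where the route-reflector-client, update-source, and next-hop-self knobs from this slide land; every address and interface name is an invented placeholder.

```python
# Both ends of one SPN-LEF iBGP session inside a VC (everything in AS64512).
# Addresses and interface names are placeholders, not production values.

spn_side = """\
router bgp 64512
 neighbor 10.1.0.1 remote-as 64512
 neighbor 10.1.0.1 route-reflector-client   ! SPN reflects LEF routes to other LEFs
 neighbor 10.1.0.1 update-source Ethernet1  ! session sourced from the local p2p link
 network 10.1.0.0 mask 255.255.255.254      ! SPN-LEF interconnect
"""

lef_side = """\
router bgp 64512
 neighbor 10.1.0.0 remote-as 64512
 neighbor 10.1.0.0 next-hop-self            ! keeps next hops resolvable with no IGP
 neighbor 10.1.0.0 update-source Ethernet1
 network 10.2.0.0 mask 255.255.255.254      ! LEF-TOR interconnect
"""

print(spn_side)
print(lef_side)
```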
Cluster to fabric connectivity
● EGR = EGress Router
○ Same class of devices as SPN/LEF
○ Attached to the fabric in TOR-like edge positions
● EGRs are special
○ Each EGR can connect to multiple LEFs in a single VC, and to multiple VCs
○ EGRs can aggregate cluster subnets
○ A variation on the traditional CLOS architecture
● FABs (config sketch below)
○ Connect to EGRs in multiple clusters
○ Use “remove-private” so that multiple clusters can use the same set of ASNs
○ Speak eBGP to EGRs and OSPF to higher-level devices
○ Redistribute each cluster’s aggregated subnets to the rest of the DC topology
[Diagram, repeated for two clusters: VC #1-VC #4 (each AS64512) serving TOR#1 ... TOR#n, with EGR#1 (AS64513), EGR#2 (AS64514), EGR#3 (AS64515), ..., EGR#n (AS645xx) attached; all EGRs uplink to the shared FAB-1 ... FAB-N layer]
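Finally, a sketch of the FAB-facing edge with invented ASNs and prefixes: the EGR originates one aggregate for its cluster, and the FAB strips private ASNs toward the EGRs so every cluster can reuse the same private range. The FAB's ASN (65000) and the OSPF process ID are assumptions, not values from the deck.

```python
# Hypothetical EGR and FAB stanzas matching the behavior described on this slide.

egr_side = """\
router bgp 64513
 aggregate-address 10.100.0.0 255.255.0.0 summary-only  ! one aggregate per cluster
 neighbor 10.200.0.1 remote-as 65000                    ! eBGP up to a FAB
"""

fab_side = """\
router bgp 65000
 neighbor 10.200.0.0 remote-as 64513
 neighbor 10.200.0.0 remove-private-as   ! so clusters can reuse the 64512+ ASNs
router ospf 1
 redistribute bgp 65000 subnets          ! cluster aggregates into the DC core
"""

print(egr_side)
print(fab_side)
```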
Incast & Buffer Pressure
[Diagram: the SPINE, LEAF, and TOR tiers, where incast traffic creates buffer pressure]
Scaling
● Single cluster capacity (worked through in the sketch below)
○ Number of edge positions (TOR or EGR) = Rspn × Rlef / 2, where R is the switch port radix
■ Rspn = 32, Rlef = 32 => 512 positions = 20k nodes
■ Rspn = 64, Rlef = 32 => 1024 positions = 40k nodes
■ Rspn = 64, Rlef = 64 => 2048 positions = 80k nodes
■ Rspn = 128, Rlef = 128 => 8192 positions = 320k nodes
○ TOR oversubscription ratio is determined by the number of VCs (i.e., the number of uplinks)
■ 2 VCs -- 1:6, 4 VCs -- 1:3, 6 VCs -- 1:2, 8 VCs -- 1:1.5, 12 VCs -- 1:1
● Multiple clusters
○ Multiple clusters can be connected through N-way FABs
○ Additional east-west capacity between clusters can be added by:
■ Horizontal scaling (additional EGRs or additional FABs)
■ Multiple FAB planes (including dedicated “internal” FABs for east-west traffic only)
■ Turning a FAB into a VC itself
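The arithmetic behind these bullets, as a quick sketch; the TOR port assumptions (48 × 10G server-facing ports, one 40G uplink per VC) are mine, chosen because they reproduce the ratios on this slide, not taken from the deck.

```python
# Reproduce the capacity bullets: edge positions = Rspn * Rlef / 2.

def edge_positions(r_spn: int, r_lef: int) -> int:
    # Half of each LEF's radix faces TORs/EGRs; the other half faces SPNs.
    return r_spn * r_lef // 2

for r_spn, r_lef in [(32, 32), (64, 32), (64, 64), (128, 128)]:
    print(f"{r_spn}x{r_lef}: {edge_positions(r_spn, r_lef)} positions")
    # -> 512, 1024, 2048, 8192

# TOR oversubscription vs. VC count, assuming 48x10G down and 40G per uplink.
def oversub(num_vcs: int, down_gbps: int = 480, uplink_gbps: int = 40) -> float:
    return down_gbps / (num_vcs * uplink_gbps)

for vcs in (2, 4, 6, 8, 12):
    print(f"{vcs} VCs -> 1:{oversub(vcs):g}")   # 1:6, 1:3, 1:2, 1:1.5, 1:1
```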
Automation
● Management complexity:
○ Large number of devices, links, and initial configurations
○ Dynamic environment: asynchronous LEF/TOR installations, image and config updates
○ Device/link failure detection and remediation
● Automation is the solution (toy sketch below)
○ Treat the network with CI/CD principles
○ Device, topology, and config modeling abstracted by templates and a database
○ Inventory/DNS integrated with Zero Touch Provisioning for initial bootstrap and configuration
○ Separation of config intent from config state; the control loop is closed by state machines
○ Check out our NANOG 68 presentation, “Network Automation with State Machines”
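A toy illustration, not Yahoo's actual system, of the "separate intent from state, close the loop" idea: a reconciler compares each device's observed state with the intended state from the source-of-truth database and converges any drift.

```python
from dataclasses import dataclass

@dataclass
class Device:
    name: str
    observed: str   # config version actually running on the box
    intended: str   # config version the source-of-truth database wants

def reconcile(devices: list[Device]) -> None:
    """One pass of the control loop: detect drift, converge toward intent."""
    for dev in devices:
        if dev.observed != dev.intended:
            # In the real system this transition is a state machine with
            # verification at each step; here we simply "push" the intent.
            print(f"{dev.name}: drift ({dev.observed} -> {dev.intended}), remediating")
            dev.observed = dev.intended
        else:
            print(f"{dev.name}: in sync")

reconcile([Device("lef-01", "v1", "v2"), Device("tor-17", "v2", "v2")])
```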
Operational Experience
● Very stable protocol stack, fast convergence
○ 2012 => 250ms end-to-end convergence
○ 2018 => 125ms end-to-end convergence
○ Even in 2018 some BGP stacks cause micro-blackholes - watch out!
● Significantly lower hardware failure rate than expected
● Easy installation and continuous management
● Oversubscription ratios
● Buffer management techniques
Questions?
igor@verizonmedia.com
hyihua@verizonmedia.com