Hedera: Dynamic Flow Scheduling for Data Center Networks
Mohammad Al-Fares, Sivasankar
Radhakrishnan, Barath Raghavan, Nelson
Huang, Amin Vahdat
- USENIX NSDI 2010 -
Presenter: Jason, Tsung-Cheng, HOU
Advisor: Wanjiun Liao
Dec. 22nd, 2011
Problem
• Relying on multipathing, due to…
– Limited port densities of
routers/switches
– Horizontal expansion
• Multi-rooted tree topologies
– Example: Fat-tree / Clos
Problem
• BW demand is essential and volatile
– Must route among multiple paths
– Avoid bottlenecks and deliver aggregate BW
• However, current multipath routing…
– Mostly: flow-hash-based ECMP (see sketch below)
– Static and oblivious to link utilization
– Causes long-term large-flow collisions
• Inefficiently utilizing path diversity
– Need a protocol or a scheduler
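A minimal sketch of flow-hash-based ECMP path selection, assuming CRC32 as the hash and hypothetical uplink names. The point is that the uplink is a static function of the flow's 5-tuple: it never changes over a flow's lifetime and ignores link load, so two elephant flows can share one uplink while others sit idle.

import zlib

def ecmp_uplink(src_ip, dst_ip, src_port, dst_port, proto, uplinks):
    """Static flow-hash ECMP: pick an equal-cost uplink from the 5-tuple."""
    key = f"{src_ip},{dst_ip},{src_port},{dst_port},{proto}".encode()
    return uplinks[zlib.crc32(key) % len(uplinks)]

uplinks = ["core0", "core1", "core2", "core3"]
# Whether these two flows collide depends only on their headers, never on load:
print(ecmp_uplink("10.0.1.2", "10.2.0.2", 5001, 80, 6, uplinks))
print(ecmp_uplink("10.0.1.3", "10.3.0.2", 5002, 80, 6, uplinks))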
Collisions of elephant flows
• Collisions in two ways: Upward or Downward
[Figure: fat-tree with flows S1→D1 … S4→D4 colliding on upward and downward links]
Equal Cost Paths
• Many equal cost paths going up to the core
switches
• Only one path down from each core switch
• Need to find good flow-to-core mapping
[Figure: equal-cost paths from S up to the core; one path down from each core switch to D]
Goal
• Given dynamic flow demands
– Need to find paths that maximize
network bisection BW
– No end-host modifications
• However, local switch information alone
cannot find a proper allocation
– Need a central scheduler
– Must use commodity Ethernet switches
– OpenFlow
Architecture
• Detect Large Flows
– Flows that need bandwidth but are network-limited
• Estimate Flow Demands
– Use max-min fairness to allocate flows between S-D
pairs
• Allocate Flows
– Use estimated demands to heuristically find better
placement of large flows on the EC paths
– Arrange switches and iterate again
[Figure: control loop: Detect Large Flows → Estimate Flow Demands → Allocate Flows]
Architecture
• Feedback loop
• Optimize achievable bisection BW by
assigning flow-to-core mappings
• Heuristics of flow demand estimation and
placement
• Central Scheduler
– Global knowledge of all links in the network
– Control tables of all switches (OpenFlow)
[Figure: control loop: Detect Large Flows → Estimate Flow Demands → Allocate Flows]
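A minimal sketch of this feedback loop. The four callables stand in for OpenFlow statistics polling, the demand estimator, a placement heuristic, and OpenFlow rule installation; the polling period is an assumption, not a figure from the paper.

import time

def scheduler_loop(poll_edge_switches, estimate_demands, place_flows,
                   install_routes, period_s=5.0):
    """Hedera-style feedback loop (sketch): detect -> estimate -> allocate."""
    while True:
        large_flows = poll_edge_switches()       # flows above the 10% threshold
        demands = estimate_demands(large_flows)  # "natural" max-min demands
        placement = place_flows(large_flows, demands)  # GFF or simulated annealing
        install_routes(placement)                # push flow entries to switches
        time.sleep(period_s)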
Elephant Detection
Elephant Detection
• Scheduler polls edge switches
– Flows exceeding threshold are “large”
– 10% of hosts’ link capacity (> 100Mbps)
• Small flows: Default ECMP hashing
• Hedera complements ECMP
– Default forwarding is ECMP
– Only schedules large flows contributing
to bisection BW bottlenecks
• Centralize only the essential functions (detection sketch below)
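A minimal sketch of the detection step, assuming per-flow byte counters polled from edge switches; the flow_stats layout and names are hypothetical.

LINK_CAPACITY_BPS = 1_000_000_000            # 1 Gbps host links (assumption)
THRESHOLD_BPS = 0.10 * LINK_CAPACITY_BPS     # "large" = >10% of link capacity

def detect_large_flows(flow_stats, poll_interval_s):
    """Return ids of flows whose measured rate exceeds the elephant threshold.

    flow_stats: {flow_id: (bytes_now, bytes_prev)} from edge-switch
    per-flow counters. Flows below the threshold stay on default ECMP.
    """
    large = []
    for flow_id, (bytes_now, bytes_prev) in flow_stats.items():
        rate_bps = 8 * (bytes_now - bytes_prev) / poll_interval_s
        if rate_bps > THRESHOLD_BPS:
            large.append(flow_id)
    return large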
Demand Estimation
Demand Estimation
• Current flow rate: misleading
– May be already constrained by network
• Need to find flow’s “natural” BW
demand when not limited by network
– As if only limited by NIC of S or D
• Allocate S/D capacity among flows
using max-min fairness
• Equals the BW allocation under optimal
routing; input to the placement algorithm
Demand Estimation
• Given the large flows between S-D pairs, modify
each flow's estimate at S/D iteratively (sketch below)
– Sender distributes unconverged BW among its flows
– Receiver-limited: redistribute BW among
excessive-demand flows
– Repeat until all flows converge
• Guaranteed to converge in O(|F|)
– Linear in the number of flows
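A minimal sketch of this estimator, assuming NIC capacities normalized to 1.0 and flows given as src/dst pairs; all names are hypothetical. It alternates sender and receiver passes until the estimates stop changing.

def estimate_demands(flows):
    """Max-min natural-demand estimation (sketch).

    flows: list of {"src": s, "dst": d}; all NIC capacities normalized to 1.0.
    """
    for f in flows:
        f["demand"], f["converged"] = 0.0, False
    changed = True
    while changed:
        changed = False
        # Sender pass: split each sender's remaining capacity equally
        # among its not-yet-converged flows.
        for s in {f["src"] for f in flows}:
            out = [f for f in flows if f["src"] == s]
            spare = 1.0 - sum(f["demand"] for f in out if f["converged"])
            unconv = [f for f in out if not f["converged"]]
            for f in unconv:
                share = spare / len(unconv)
                if f["demand"] != share:
                    f["demand"], changed = share, True
        # Receiver pass: if a receiver is oversubscribed, cap flows above
        # their fair share and mark them converged (receiver-limited).
        for d in {f["dst"] for f in flows}:
            inn = [f for f in flows if f["dst"] == d]
            if sum(f["demand"] for f in inn) <= 1.0:
                continue
            inn.sort(key=lambda f: f["demand"])
            spare, left = 1.0, len(inn)
            for f in inn:
                share = spare / left
                if f["demand"] > share:
                    f["demand"], f["converged"], changed = share, True, True
                spare -= f["demand"]
                left -= 1
    return {(f["src"], f["dst"]): round(f["demand"], 3) for f in flows}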
Demand Estimation
[Figure: senders A, B, C; receivers X, Y; flows A→X, A→Y, B→Y, C→Y]

Flow   Estimate   Conv.?
AX     -          -
AY     -          -
BY     -          -
CY     -          -

Sender   Available Unconv. BW   Flows   Share
A        1                      2       1/2
B        1                      1       1
C        1                      1       1
Demand Estimation
Recv   RL?   Non-SL Flows   Share
X      No    -              -
Y      Yes   3              1/3

Flow   Estimate   Conv.?
AX     1/2        -
AY     1/2        -
BY     1          -
CY     1          -
Demand Estimation
Flow   Estimate   Conv.?
AX     1/2        -
AY     1/3        Yes
BY     1/3        Yes
CY     1/3        Yes

Sender   Available Unconv. BW   Flows   Share
A        2/3                    1       2/3
B        0                      0       0
C        0                      0       0
Demand Estimation
Flow   Estimate   Conv.?
AX     2/3        Yes
AY     1/3        Yes
BY     1/3        Yes
CY     1/3        Yes

Recv   RL?   Non-SL Flows   Share
X      No    -              -
Y      No    -              -
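Running the estimator sketch from above on this example reproduces the converged values in the tables:

flows = [{"src": "A", "dst": "X"}, {"src": "A", "dst": "Y"},
         {"src": "B", "dst": "Y"}, {"src": "C", "dst": "Y"}]
print(estimate_demands(flows))
# {('A', 'X'): 0.667, ('A', 'Y'): 0.333, ('B', 'Y'): 0.333, ('C', 'Y'): 0.333}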
Placement Heuristics
Placement Heuristics
• Find a good large-flow-to-core mapping
– such that average bisection BW is maximized
• Two approaches
• Global First-Fit: Greedily choose a path that
has sufficient unreserved BW
– O([ports/switch]^2)
• Simulated Annealing: Iteratively find a
globally better mapping of paths to flows
– O(# flows)
Global First-Fit
• When a new large flow is detected, linearly search all paths from S to D (sketch below)
• Place the flow on the first path whose links can fit it
• Once the flow ends, entries and reservations time out

[Figure: the scheduler probes core paths 0-3 in order for flows A, B, C from S to D]
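A minimal sketch of Global First-Fit under these assumptions: a path is a list of directed link ids, and the reservation bookkeeping names are hypothetical.

def global_first_fit(flow_demand, candidate_paths, reserved, capacity):
    """Greedily pick the first equal-cost path that can fit the flow.

    candidate_paths: paths from S to D, each a list of link ids.
    reserved: {link_id: reserved_bps}; capacity: {link_id: bps}.
    """
    for path in candidate_paths:
        if all(reserved.get(l, 0.0) + flow_demand <= capacity[l] for l in path):
            for l in path:                      # reserve until the flow ends
                reserved[l] = reserved.get(l, 0.0) + flow_demand
            return path
    return None                                 # no fit: leave the flow on ECMP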
Simulated Annealing
• Annealing: letting metal cool down slowly
to form a better crystal structure
– Heating enters a higher-energy state
– Cooling settles into a lower-energy state with a
better structure, stopping at some temperature
• Simulated Annealing:
– Search the neighborhood for possible states
– Probabilistically accept a worse state
– Accept better states, settling gradually
– Avoids local minima
Simulated Annealing
• State / State Space
– Possible solutions
• Energy
– Objective
• Neighborhood
– Other options
• Boltzmann’s Function
– Prob. to higher state
• Control Temperature
– Current temp. affects the prob. of
moving to a higher-energy state
• Cooling Schedule
– How temp. falls
• Stopping Criterion
P(E, E', t) = min(1, e^((E - E') / t)), the probability of accepting
a move from a state with energy E to one with energy E' at temperature t
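As a sketch, the acceptance rule in code, assuming energies where lower is better:

import math, random

def accept(e_current, e_neighbor, temperature):
    """Boltzmann acceptance: always take improvements,
    take worse states with probability e^((E - E') / t)."""
    if e_neighbor <= e_current:
        return True
    return random.random() < math.exp((e_current - e_neighbor) / temperature)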
Simulated Annealing
• State Space:
– All possible large-flow-to-core mappings
– However, flows to the same destination map to the same core
– Reduces the state space, provided there are not too many
large flows and the threshold is set properly
• Neighborhood:
– Swap cores for two hosts within same pod,
attached to same edge / aggregate
– Avoids local minima
Simulated Annealing
• Energy:
– Estimated demands of flows
– Total exceeded link capacity, to be minimized
• Temperature: remaining iterations
• Probability: the Boltzmann function above
• Final state is published to switches and
used as the initial state for the next round
• Incremental calculation of exceeded capacity
• No recalculation over all links; only newly found
large flows and neighborhood swaps are evaluated
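A minimal sketch of the annealing loop under the reduced state space (one core per destination host). The energy callable is assumed to compute total exceeded link capacity from the estimated demands; all names are hypothetical, and unlike the paper this sketch swaps any two hosts rather than only hosts within the same pod.

import math, random

def simulated_annealing(dst_hosts, cores, energy, iterations=1000):
    """Search for a destination-host -> core mapping minimizing energy.

    dst_hosts: list of destination hosts; energy(mapping) returns the
    total exceeded link capacity implied by routing via mapping[dst].
    """
    core_of = {h: random.choice(cores) for h in dst_hosts}   # initial state
    e = energy(core_of)
    best, best_e = dict(core_of), e
    for t in range(iterations, 0, -1):          # temperature = iterations left
        a, b = random.sample(dst_hosts, 2)      # neighbor: swap two hosts' cores
        core_of[a], core_of[b] = core_of[b], core_of[a]
        e_new = energy(core_of)
        # Boltzmann acceptance: always take improvements; sometimes take worse.
        if e_new <= e or random.random() < math.exp((e - e_new) / t):
            e = e_new
            if e < best_e:
                best, best_e = dict(core_of), e
        else:
            core_of[a], core_of[b] = core_of[b], core_of[a]  # revert the swap
    return best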
Evaluation
Implementation
• 16 hosts, k=4 fat-tree data plane
– 20 switches: 4-port NetFPGAs / OpenFlow
– Parallel 48-port non-blocking Quanta switch
– 1 scheduler, OpenFlow control protocol
– Testbed: PortLand
Simulator
• k=32; 8,192 hosts
– Packet-level simulators are not applicable
– 1 Gbps for 8k hosts takes ~2.5×10^11 pkts
• Model TCP flows
– TCP’s AIMD when constrained by topology
– Poisson arrival of flows
– No pkt size variations
– No bursty traffic
– No inter-flow dynamics
[Figure: testbed results, PortLand/OpenFlow, k=4]
[Figure: simulator results]
Reactiveness
• Demand Estimation:
– 27K hosts, 250K flows, converges < 200ms
• Simulated Annealing:
– Runtime scales with # of flows and # of
iterations; 50K flows and 1K iter.: 11ms
– Most of the final bisection BW is reached within a few hundred iter.
• Scheduler control loop:
– Polling + Est. + SA = 145ms for 27K hosts
Comments
Comments
• Flows destined to the same host share one core
– May congest at cores, but how severe?
– Large flows to/from a host: < k/2
– No proof, no evaluation
• Decreases search space and runtime
– Scalable on a per-flow basis? For large k?
• No protection for mice flows, RPCs
– Only assumed to work well under ECMP
– Not addressed when routed alongside large flows
Comments
• Own flow-level simulator
– Aim to saturate network
– No breakdown of flow counts by size
– Traffic generation: avg. flow size and arrival
rates (Poisson) with a mean
– Only above descriptions, no specific numbers
– Too ideal or not volatile enough?
– Shows avg. bisection BW, but where are real-time graphs?
• States that per-flow VLB = per-flow ECMP
– Does not compare with other options (VL2)
– No further elaboration
Comments
• Shared responsibility
– Controller only deals with critical situations
– Switches perform default measures
– Improves performance and saves time
– How to strike a balance?
– Adapt to different problems?
• Default multipath routing
– States problems of per-flow VLB and ECMP
– What about per-pkt? The authors’ future work
– How to improve switches’ default actions?
Comments
• Critical controller actions
– Considers that large flows degrade overall efficiency
– What are critical situations?
– How to detect and react?
– How to improve reactiveness and adaptability?
• Amin Vahdat’s lab
– Proposes fat-tree topology
– Develops PortLand L2 virtualization
– Hedera: enhances multipath performance
– Integrates all of the above
References
• M. Al-Fares et al., “Hedera: Dynamic Flow Scheduling for
Data Center Networks”, USENIX NSDI 2010
• Tathagata Das, “Hedera: Dynamic Flow Scheduling for Data
Center Networks”, UC Berkeley course CS 294
• M. Al-Fares, “Hedera: Dynamic Flow Scheduling for Data
Center Networks”, USENIX NSDI 2010, slides
Supplement
Fault-Tolerance
• Link / Switch failure
– Use PortLand’s fault notification protocol
– Hedera routes around failed components
[Figure: the scheduler reroutes flows A, B, C around a failed component via cores 0-3]
Fault-Tolerance
• Scheduler failure
– Soft-state, not required for correctness
(connectivity)
– Switches fall back to ECMP
[Figure: with the scheduler down, switches fall back to ECMP for flows A, B, C]
Limitations
• Dynamic workloads,
large flow turnover
faster than control
loop
– Scheduler will be
continually chasing
the traffic matrix
• Need to include a penalty term for
unnecessary SA flow re-assignments

[Figure: 2x2 matrix of flow size vs. traffic-matrix stability: ECMP suits unstable matrices; Hedera suits stable matrices with large flows]