Hedera: Dynamic Flow Scheduling for Data Center Networks
Mohammad Al-Fares, Sivasankar
Radhakrishnan, Barath Raghavan, Nelson
Huang, Amin Vahdat
- USENIX NSDI 2010 -
Presenter: Jason, Tsung-Cheng, HOU
Advisor: Wanjiun Liao
Dec. 22nd, 2011
Problem
• Relying on multipathing, due to…
– Limited port densities of
routers/switches
– Horizontal expansion
• Multi-rooted tree topologies
– Example: Fat-tree / Clos
Problem
• BW demand is essential and volatile
– Must route among multiple paths
– Avoid bottlenecks and deliver aggregate BW
• However, current multipath routing…
– Mostly: flow-hash-based ECMP (see sketch below)
– Static and oblivious to link utilization
– Causes long-term large-flow collisions
• Inefficiently utilizing path diversity
– Need a protocol or a scheduler
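A minimal sketch of flow-hash-based ECMP path selection, assuming CRC32 as the hash and hypothetical uplink names. The point is that the uplink is a static function of the flow's 5-tuple: it never changes over a flow's lifetime and ignores link load, so two elephant flows can share one uplink while others sit idle.

import zlib

def ecmp_uplink(src_ip, dst_ip, src_port, dst_port, proto, uplinks):
    """Static flow-hash ECMP: pick an equal-cost uplink from the 5-tuple."""
    key = f"{src_ip},{dst_ip},{src_port},{dst_port},{proto}".encode()
    return uplinks[zlib.crc32(key) % len(uplinks)]

uplinks = ["core0", "core1", "core2", "core3"]
# Whether these two flows collide depends only on their headers, never on load:
print(ecmp_uplink("10.0.1.2", "10.2.0.2", 5001, 80, 6, uplinks))
print(ecmp_uplink("10.0.1.3", "10.3.0.2", 5002, 80, 6, uplinks))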
Collisions of elephant flows
• Collisions in two ways: Upward or Downward
[Figure: fat-tree with flows S1→D1 … S4→D4 colliding on upward and downward links]
Equal Cost Paths
• Many equal cost paths going up to the core
switches
• Only one path down from each core switch
• Need to find good flow-to-core mapping
[Figure: equal-cost paths from S up to the core; one path down from each core switch to D]
Goal
• Given dynamic flow demands
– Need to find paths that maximize
network bisection BW
– No end-host modifications
• However, local switch information alone
cannot find a proper allocation
– Need a central scheduler
– Must use commodity Ethernet switches
– OpenFlow
Architecture
• Detect Large Flows
– Flows that need bandwidth but are network-limited
• Estimate Flow Demands
– Use max-min fairness to allocate flows between S-D
pairs
• Allocate Flows
– Use estimated demands to heuristically find better
placement of large flows on the EC paths
– Arrange switches and iterate again
[Figure: control loop: Detect Large Flows → Estimate Flow Demands → Allocate Flows]
Architecture
• Feedback loop
• Optimize achievable bisection BW by
assigning flow-to-core mappings
• Heuristics of flow demand estimation and
placement
• Central Scheduler
– Global knowledge of all links in the network
– Control tables of all switches (OpenFlow)
[Figure: control loop: Detect Large Flows → Estimate Flow Demands → Allocate Flows]
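A minimal sketch of this feedback loop. The four callables stand in for OpenFlow statistics polling, the demand estimator, a placement heuristic, and OpenFlow rule installation; the polling period is an assumption, not a figure from the paper.

import time

def scheduler_loop(poll_edge_switches, estimate_demands, place_flows,
                   install_routes, period_s=5.0):
    """Hedera-style feedback loop (sketch): detect -> estimate -> allocate."""
    while True:
        large_flows = poll_edge_switches()       # flows above the 10% threshold
        demands = estimate_demands(large_flows)  # "natural" max-min demands
        placement = place_flows(large_flows, demands)  # GFF or simulated annealing
        install_routes(placement)                # push flow entries to switches
        time.sleep(period_s)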
Elephant Detection
Elephant Detection
• Scheduler polls edge switches
– Flows exceeding threshold are “large”
– 10% of hosts’ link capacity (> 100Mbps)
• Small flows: Default ECMP hashing
• Hedera complements ECMP
– Default forwarding is ECMP
– Only schedules large flows contributing
to bisection BW bottlenecks
• Centralize only the essential functions (detection sketch below)
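A minimal sketch of the detection step, assuming per-flow byte counters polled from edge switches; the flow_stats layout and names are hypothetical.

LINK_CAPACITY_BPS = 1_000_000_000            # 1 Gbps host links (assumption)
THRESHOLD_BPS = 0.10 * LINK_CAPACITY_BPS     # "large" = >10% of link capacity

def detect_large_flows(flow_stats, poll_interval_s):
    """Return ids of flows whose measured rate exceeds the elephant threshold.

    flow_stats: {flow_id: (bytes_now, bytes_prev)} from edge-switch
    per-flow counters. Flows below the threshold stay on default ECMP.
    """
    large = []
    for flow_id, (bytes_now, bytes_prev) in flow_stats.items():
        rate_bps = 8 * (bytes_now - bytes_prev) / poll_interval_s
        if rate_bps > THRESHOLD_BPS:
            large.append(flow_id)
    return large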
Demand Estimation
Demand Estimation
• Current flow rate: misleading
– May be already constrained by network
• Need to find flow’s “natural” BW
demand when not limited by network
– As if only limited by NIC of S or D
• Allocate S/D capacity among flows
using max-min fairness
• Equals the BW allocation under optimal
routing; input to the placement algorithm
Demand Estimation
• Given the large flows between S-D pairs, modify
each flow's estimate at S/D iteratively (sketch below)
– Sender distributes unconverged BW among its flows
– Receiver-limited: redistribute BW among
excessive-demand flows
– Repeat until all flows converge
• Guaranteed to converge in O(|F|)
– Linear in the number of flows
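A minimal sketch of this estimator, assuming NIC capacities normalized to 1.0 and flows given as src/dst pairs; all names are hypothetical. It alternates sender and receiver passes until the estimates stop changing.

def estimate_demands(flows):
    """Max-min natural-demand estimation (sketch).

    flows: list of {"src": s, "dst": d}; all NIC capacities normalized to 1.0.
    """
    for f in flows:
        f["demand"], f["converged"] = 0.0, False
    changed = True
    while changed:
        changed = False
        # Sender pass: split each sender's remaining capacity equally
        # among its not-yet-converged flows.
        for s in {f["src"] for f in flows}:
            out = [f for f in flows if f["src"] == s]
            spare = 1.0 - sum(f["demand"] for f in out if f["converged"])
            unconv = [f for f in out if not f["converged"]]
            for f in unconv:
                share = spare / len(unconv)
                if f["demand"] != share:
                    f["demand"], changed = share, True
        # Receiver pass: if a receiver is oversubscribed, cap flows above
        # their fair share and mark them converged (receiver-limited).
        for d in {f["dst"] for f in flows}:
            inn = [f for f in flows if f["dst"] == d]
            if sum(f["demand"] for f in inn) <= 1.0:
                continue
            inn.sort(key=lambda f: f["demand"])
            spare, left = 1.0, len(inn)
            for f in inn:
                share = spare / left
                if f["demand"] > share:
                    f["demand"], f["converged"], changed = share, True, True
                spare -= f["demand"]
                left -= 1
    return {(f["src"], f["dst"]): round(f["demand"], 3) for f in flows}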
Demand Estimation
[Figure: senders A, B, C; receivers X, Y; flows A→X, A→Y, B→Y, C→Y]

Flow   Estimate   Conv.?
AX     -          -
AY     -          -
BY     -          -
CY     -          -

Sender   Available Unconv. BW   Flows   Share
A        1                      2       1/2
B        1                      1       1
C        1                      1       1
Demand Estimation
Recv   RL?   Non-SL Flows   Share
X      No    -              -
Y      Yes   3              1/3

Flow   Estimate   Conv.?
AX     1/2        -
AY     1/2        -
BY     1          -
CY     1          -
Demand Estimation
Flow   Estimate   Conv.?
AX     1/2        -
AY     1/3        Yes
BY     1/3        Yes
CY     1/3        Yes

Sender   Available Unconv. BW   Flows   Share
A        2/3                    1       2/3
B        0                      0       0
C        0                      0       0
Demand Estimation
Flow   Estimate   Conv.?
AX     2/3        Yes
AY     1/3        Yes
BY     1/3        Yes
CY     1/3        Yes

Recv   RL?   Non-SL Flows   Share
X      No    -              -
Y      No    -              -
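Running the estimator sketch from above on this example reproduces the converged values in the tables:

flows = [{"src": "A", "dst": "X"}, {"src": "A", "dst": "Y"},
         {"src": "B", "dst": "Y"}, {"src": "C", "dst": "Y"}]
print(estimate_demands(flows))
# {('A', 'X'): 0.667, ('A', 'Y'): 0.333, ('B', 'Y'): 0.333, ('C', 'Y'): 0.333}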
Placement Heuristics
Placement Heuristics
• Find a good large-flow-to-core mapping
– such that average bisection BW is maximized
• Two approaches
• Global First-Fit: Greedily choose a path that
has sufficient unreserved BW
– O([ports/switch]^2)
• Simulated Annealing: Iteratively find a
globally better mapping of paths to flows
– O(# flows)
Global First-Fit
• When a new large flow is detected, linearly search all paths from S to D (sketch below)
• Place the flow on the first path whose links can fit it
• Once the flow ends, entries and reservations time out

[Figure: the scheduler probes core paths 0-3 in order for flows A, B, C from S to D]
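A minimal sketch of Global First-Fit under these assumptions: a path is a list of directed link ids, and the reservation bookkeeping names are hypothetical.

def global_first_fit(flow_demand, candidate_paths, reserved, capacity):
    """Greedily pick the first equal-cost path that can fit the flow.

    candidate_paths: paths from S to D, each a list of link ids.
    reserved: {link_id: reserved_bps}; capacity: {link_id: bps}.
    """
    for path in candidate_paths:
        if all(reserved.get(l, 0.0) + flow_demand <= capacity[l] for l in path):
            for l in path:                      # reserve until the flow ends
                reserved[l] = reserved.get(l, 0.0) + flow_demand
            return path
    return None                                 # no fit: leave the flow on ECMP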
Simulated Annealing
• Annealing: letting metal cool down slowly
to form a better crystal structure
– Heating enters a higher-energy state
– Cooling settles into a lower-energy state with a
better structure, stopping at some temperature
• Simulated Annealing:
– Search the neighborhood for possible states
– Probabilistically accept a worse state
– Accept better states, settling gradually
– Avoids local minima
Simulated Annealing
• State / State Space
– Possible solutions
• Energy
– Objective
• Neighborhood
– Other options
• Boltzmann’s Function
– Prob. to higher state
• Control Temperature
– Current temp. affects the prob. of
moving to a higher-energy state
• Cooling Schedule
– How temp. falls
• Stopping Criterion
P(E, E', t) = min(1, e^((E - E') / t)), the probability of accepting
a move from a state with energy E to one with energy E' at temperature t
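As a sketch, the acceptance rule in code, assuming energies where lower is better:

import math, random

def accept(e_current, e_neighbor, temperature):
    """Boltzmann acceptance: always take improvements,
    take worse states with probability e^((E - E') / t)."""
    if e_neighbor <= e_current:
        return True
    return random.random() < math.exp((e_current - e_neighbor) / temperature)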
Simulated Annealing
• State Space:
– All possible large-flow-to-core mappings
– However, flows to the same destination map to the same core
– Reduces the state space, provided there are not too many
large flows and the threshold is set properly
• Neighborhood:
– Swap cores for two hosts within same pod,
attached to same edge / aggregate
– Avoids local minima
Simulated Annealing
• Energy:
– Estimated demands of flows
– Total exceeded link capacity, to be minimized
• Temperature: remaining iterations
• Probability: the Boltzmann function above
• Final state is published to switches and
used as the initial state for the next round
• Incremental calculation of exceeded capacity
• No recalculation over all links; only newly found
large flows and neighborhood swaps are evaluated
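A minimal sketch of the annealing loop under the reduced state space (one core per destination host). The energy callable is assumed to compute total exceeded link capacity from the estimated demands; all names are hypothetical, and unlike the paper this sketch swaps any two hosts rather than only hosts within the same pod.

import math, random

def simulated_annealing(dst_hosts, cores, energy, iterations=1000):
    """Search for a destination-host -> core mapping minimizing energy.

    dst_hosts: list of destination hosts; energy(mapping) returns the
    total exceeded link capacity implied by routing via mapping[dst].
    """
    core_of = {h: random.choice(cores) for h in dst_hosts}   # initial state
    e = energy(core_of)
    best, best_e = dict(core_of), e
    for t in range(iterations, 0, -1):          # temperature = iterations left
        a, b = random.sample(dst_hosts, 2)      # neighbor: swap two hosts' cores
        core_of[a], core_of[b] = core_of[b], core_of[a]
        e_new = energy(core_of)
        # Boltzmann acceptance: always take improvements; sometimes take worse.
        if e_new <= e or random.random() < math.exp((e - e_new) / t):
            e = e_new
            if e < best_e:
                best, best_e = dict(core_of), e
        else:
            core_of[a], core_of[b] = core_of[b], core_of[a]  # revert the swap
    return best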
Evaluation
Implementation
• 16 hosts, k=4 fat-tree data plane
– 20 switches: 4-port NetFPGAs / OpenFlow
– Parallel 48-port non-blocking Quanta switch
– 1 scheduler, OpenFlow control protocol
– Testbed: PortLand
Simulator
• k=32; 8,192 hosts
– Packet-level simulators are not applicable
– 1 Gbps for 8k hosts takes ~2.5×10^11 pkts
• Model TCP flows
– TCP’s AIMD when constrained by topology
– Poisson arrival of flows
– No pkt size variations
– No bursty traffic
– No inter-flow dynamics
[Figure: testbed results, PortLand/OpenFlow, k=4]
[Figure: simulator results]
Reactiveness
• Demand Estimation:
– 27K hosts, 250K flows, converges < 200ms
• Simulated Annealing:
– Runtime scales with # of flows and # of
iterations; 50K flows and 1K iter.: 11ms
– Most of the final bisection BW is reached within a few hundred iter.
• Scheduler control loop:
– Polling + Est. + SA = 145ms for 27K hosts
Comments
Comments
• Flows destined to the same host share one core
– May congest at cores, but how severe?
– Large flows to/from a host: < k/2
– No proof, no evaluation
• Decreases search space and runtime
– Scalable on a per-flow basis? For large k?
• No protection for mice flows, RPCs
– Only assumed to work well under ECMP
– Not addressed when routed alongside large flows
Comments
• Own flow-level simulator
– Aim to saturate network
– No breakdown of flow counts by size
– Traffic generation: avg. flow size and arrival
rates (Poisson) with a mean
– Only above descriptions, no specific numbers
– Too ideal or not volatile enough?
– Shows avg. bisection BW, but where are real-time graphs?
• States that per-flow VLB = per-flow ECMP
– Does not compare with other options (VL2)
– No further elaboration
Comments
• Shared responsibility
– Controller only deals with critical situations
– Switches perform default measures
– Improves performance and saves time
– How to strike a balance?
– Adapt to different problems?
• Default multipath routing
– States problems of per-flow VLB and ECMP
– What about per-pkt? The authors’ future work
– How to improve switches’ default actions?
Comments
• Critical controller actions
– Considers that large flows degrade overall efficiency
– What are critical situations?
– How to detect and react?
– How to improve reactiveness and adaptability?
• Amin Vahdat’s lab
– Proposes fat-tree topology
– Develops PortLand L2 virtualization
– Hedera: enhances multipath performance
– Integrates all of the above
References
• M. Al-Fares et al., “Hedera: Dynamic Flow Scheduling for
Data Center Networks”, USENIX NSDI 2010
• Tathagata Das, “Hedera: Dynamic Flow Scheduling for Data
Center Networks”, UC Berkeley course CS 294
• M. Al-Fares, “Hedera: Dynamic Flow Scheduling for Data
Center Networks”, USENIX NSDI 2010, slides
Supplement
Fault-Tolerance
• Link / Switch failure
– Use PortLand’s fault notification protocol
– Hedera routes around failed components
[Figure: the scheduler reroutes flows A, B, C around a failed component via cores 0-3]
Fault-Tolerance
• Scheduler failure
– Soft-state, not required for correctness
(connectivity)
– Switches fall back to ECMP
[Figure: with the scheduler down, switches fall back to ECMP for flows A, B, C]
Limitations
• Dynamic workloads,
large flow turnover
faster than control
loop
– Scheduler will be
continually chasing
the traffic matrix
• Need to include a penalty term for
unnecessary SA flow re-assignments

[Figure: 2x2 matrix of flow size vs. traffic-matrix stability: ECMP suits unstable matrices; Hedera suits stable matrices with large flows]