This paper was presented as part of the main technical program at IEEE INFOCOM 2011




Mahout: Low-Overhead Datacenter Traffic Management using

                             End-Host-Based Elephant Detection

                 Andrew R. Curtis*                             Wonho Kim*                       Praveen Yalagandula
                University of Waterloo                      Princeton University                      HP Labs
               Waterloo, Ontario, Canada                    Princeton, NJ, USA                  Palo Alto, CA, USA



   *This work was performed while Andrew and Wonho were interns at HP Labs, Palo Alto.

   Abstract—Datacenters need high-bandwidth interconnection fabrics. Several researchers have proposed highly-redundant topologies with multiple paths between pairs of end hosts for datacenter networks. However, traffic management is necessary to effectively utilize the bisection bandwidth provided by these topologies. This requires timely detection of elephant flows—flows that carry large amounts of data—and managing those flows. Previously proposed approaches incur high monitoring overheads, consume significant switch resources, and/or have long detection times.
   We propose, instead, to detect elephant flows at the end hosts. We do this by observing the end hosts' socket buffers, which provide better, more efficient visibility of flow behavior. We present Mahout, a low-overhead yet effective traffic management system that follows an OpenFlow-like central-controller approach to network management but augments the design with our novel end host mechanism. Once an elephant flow is detected, an end host signals the network controller using in-band signaling with low overheads. Through analytical evaluation and experiments, we demonstrate the benefits of Mahout over previous solutions.

                          I. INTRODUCTION

   Datacenter switching fabrics have enormous bandwidth demands due to the recent uptick in bandwidth-intensive applications used by enterprises to manage their exploding data. These applications transfer huge quantities of data between thousands of servers. For example, Hadoop [18] performs an all-to-all transfer of up to petabytes of files during the shuffle phase of a MapReduce job [15]. Further, to better consolidate employee desktop and other computation needs, enterprises are leveraging virtualized datacenter frameworks (e.g., using VMWare [29] and Xen [10], [30]), where timely migration of virtual machines requires a high-throughput network.
   Designing datacenter networks using redundant topologies such as Fat-tree [6], [12], HyperX [5], and Flattened Butterfly [22] solves the high-bandwidth requirement. However, traffic management is necessary to extract the best bisection bandwidth from such topologies [7]. A key challenge is that flows come and go too quickly in a datacenter to compute a route for each individually; e.g., Kandula et al. report 100K flow arrivals a second in a 1,500 server cluster [21].
   For effective utilization of the datacenter fabric, we need to detect elephant flows—flows that transfer significant amounts of data—and dynamically orchestrate their paths. Datacenter measurements [17], [21] show that a large fraction of datacenter traffic is carried in a small fraction of flows. The authors report that 90% of the flows carry less than 1MB of data and more than 90% of bytes transferred are in flows greater than 100MB. Hash-based flow forwarding techniques such as Equal-Cost Multi-Path (ECMP) routing [19] work well only for large numbers of small (or mice) flows and no elephant flows. For example, Al-Fares et al.'s Hedera [7] shows that managing elephant flows effectively can yield as much as 113% higher aggregate throughput compared to ECMP.
   Existing elephant flow detection methods have several limitations that make them unsuitable for datacenter networks. These proposals use one of three techniques to identify elephants: (1) periodic polling of statistics from switches, (2) streaming techniques like sampling or window-based algorithms, or (3) application-level modifications (full details of each approach are given in Section II). We have not seen support for Quality of Service (QoS) solutions take hold, which implies that modifying applications is probably an unacceptable solution. We will show that the other two approaches fall short in the datacenter setting due to high monitoring overheads, significant switch resource consumption, and/or long detection times.
   We assert that the right place for elephant flow detection is at the end hosts. In this paper, we describe Mahout, a low-overhead yet effective traffic management system using end-host-based elephant detection. We subscribe to the increasingly popular simple-switch/smart-controller model (as in OpenFlow [4]), and so our system is similar to NOX [28] and Hedera [7].
   Mahout augments this basic design. It has low overhead, as it monitors and detects elephant flows at the end host via a shim layer in the OS, rather than monitoring at the switches in the network. Mahout does timely management of elephant flows through an in-band signaling mechanism between the shim layer at the end hosts and the network controller. At the switches, any flow not signaled as an elephant is routed using a static load-balancing scheme (e.g., ECMP). Only elephant flows are monitored and managed by the central controller. The combination of end host elephant detection and in-band signaling eliminates the need for per-flow monitoring in the switches, and hence incurs low overhead and requires few switch resources.
   We demonstrate the benefits of Mahout using analytical evaluation and simulations and through experiments on a small testbed. We have built a Linux prototype for our end host elephant flow detection algorithm and tested its effectiveness. We have also built a Mahout controller, for setting up switches with default entries and for processing the tagged packets from the end hosts. Our analytical evaluation shows that Mahout offers one to two orders of magnitude of reduction in the number of flows processed by the controller and in switch resource requirements, compared to Hedera-like approaches. Our simulations show that Mahout can achieve considerable throughput improvements compared to static load balancing techniques while incurring an order of magnitude lower overhead than Hedera. Our prototype experiments show that the Mahout approach can detect elephant flows at least an order of magnitude sooner than statistics-polling based approaches.




   The key contributions of our work are: 1) a novel end-host-based mechanism for detecting elephant flows, 2) the design of a centralized datacenter traffic management system that has low overhead yet high effectiveness, and 3) simulation and prototype experiments demonstrating the benefits of the proposed design.

                   II. BACKGROUND & RELATED WORK

A. Datacenter networks and traffic

   The heterogeneous mix of applications running in datacenters produces flows that are generally sensitive to either latency or throughput. Latency-sensitive flows are usually generated by network protocols (such as ARP and DNS) and interactive applications. They typically transfer up to a few kilobytes. On the other hand, throughput-sensitive flows, created by, e.g., MapReduce, scientific computing, and virtual machine migration, transfer up to gigabytes. This traffic mix implies that a datacenter network needs to deliver high bisection bandwidth for throughput-sensitive flows without introducing setup delay on latency-sensitive flows.
   Designing datacenter networks using redundant topologies such as Fat-tree [6], [12], HyperX [5], or Flattened Butterfly [22] solves the high-bandwidth requirement. However, these networks use multiple end-to-end paths to provide this high bandwidth, so they need to load balance traffic across them. Load balancing can be performed with no overhead using oblivious routing, where the path a flow from node i to node j takes is randomly selected from a probability distribution over all i-to-j paths, but this has been shown to achieve less than half the optimal throughput when the traffic mix contains many elephant flows [7]. The other extreme is to perform online scheduling by selecting the path for every new flow using a load balancing algorithm, e.g., greedily adding a flow along the path with the least congestion. This approach doesn't scale well—flows arrive too quickly for a single scheduler to keep up—and it adds too much setup time to latency-sensitive flows. For example, flow installation using NOX can take up to 10ms [28]. Partition-aggregate applications (such as search and other web applications) partition work across multiple machines and then aggregate the responses. Jobs have a deadline of 10-100ms [8], so a 10ms flow setup delay can consume the entire time budget. Therefore, online scheduling is not suitable for latency-sensitive flows.

B. Identifying elephant flows

   The mix of latency- and throughput-sensitive flows in datacenters means that effective flow scheduling needs to balance visibility and overhead—a one-size-fits-all approach is not sufficient in this setting. To achieve this balance, elephant flows must be identified so that they are the only flows touched by the controller. The following are the previously considered mechanisms for identifying elephants:
   • Applications identify their flows as elephants: This solution accurately and immediately identifies elephant flows. This is a common assumption in a plethora of research work on network QoS, where the focus is to give higher priority to latency- and throughput-sensitive flows, such as those of voice and video applications (see, e.g., [11]). However, this solution is impractical for traffic management in datacenters, as each and every application must be modified to support it. If all applications are not modified, an alternative technique will still be needed to identify elephant flows initiated by unmodified applications. A related approach is to classify flows based on which application is initiating them. This classifies flows using stochastic machine learning techniques [27], or using simple matching on packet header fields (such as TCP port numbers). While this approach might be suitable for enterprise network management, it is unsuitable for datacenter network management because of the enormous amount of traffic in the datacenter and the difficulty in obtaining flow traces to train the classification algorithms.
   • Maintain per-flow statistics: In this approach, each flow is monitored at the first switch that the flow goes through. These statistics are pulled from the switches by the controller at regular intervals and used to classify elephant flows. Hedera [7] and Helios [16] are examples of systems proposing to use such a mechanism. However, this approach does not scale to large networks. First, it consumes significant switch resources: a flow table entry for each flow monitored at a switch. We will show in Section IV that this requires a considerable number of flow table entries. Second, bandwidth between the switches and the controller is limited, so much so that transferring statistics becomes the bottleneck in traffic management in a datacenter network. As a result, the flow statistics cannot be quickly transferred to the controller, resulting in prolonged sub-par routings.
   • Sampling: Instead of monitoring each flow in the network, in this approach a controller samples packets from all ports of the switches using switch sampling features such as sFlow [3]. Only a small fraction of packets are sampled (typically, 1 in 1000) at the switches, and only the headers of the sampled packets are transferred to the controller. The controller analyzes the samples and identifies a flow as an elephant after it has seen a sufficient number of samples from the flow (a minimal sketch of this sample-counting classification follows this list). However, such an approach cannot reliably detect an elephant flow before it has carried more than 10K packets, or roughly 15MB [25]. Additionally, sampling has high overhead, since the controller must process each sampled packet.
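   To make the sampling approach concrete, the following is a minimal sketch of how a controller might classify elephants from sFlow-style samples: it counts samples per flow key and declares an elephant once a chosen number of samples of the same flow has been seen. The class, the flow-key format, and the 10-sample threshold are illustrative assumptions, not details taken from [3] or [25].

   from collections import defaultdict

   # Assumed threshold: with 1-in-1000 sampling of 1500-byte packets, waiting for
   # ~10 samples means the flow has likely already sent on the order of 10K
   # packets (~15MB) -- the late-detection problem described above.
   SAMPLES_TO_CLASSIFY = 10

   class SampleBasedElephantClassifier:
       """Counts packet samples per flow and flags a flow as an elephant once
       enough samples from that flow have been observed."""

       def __init__(self, samples_to_classify: int = SAMPLES_TO_CLASSIFY):
           self.samples_to_classify = samples_to_classify
           self.sample_counts = defaultdict(int)
           self.elephants = set()

       def on_sample(self, flow_key) -> bool:
           """flow_key is, e.g., the 5-tuple (src_ip, dst_ip, proto, sport, dport).
           Returns True if the flow is (now) classified as an elephant."""
           if flow_key in self.elephants:
               return True
           self.sample_counts[flow_key] += 1
           if self.sample_counts[flow_key] >= self.samples_to_classify:
               self.elephants.add(flow_key)
               return True
           return False

   Even with such a simple classifier, every sampled header still has to reach and be processed by the controller, which is the overhead quantified in Section IV.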




Fig. 1: Mahout architecture.

C. OpenFlow

   OpenFlow [23] aims to open up the traditionally closed designs of commercial switches to enable network innovation. OpenFlow switches maintain a flow table where each entry contains a pattern to match and the actions to perform on a packet that matches that entry. OpenFlow defines a protocol for communication between a controller and an OpenFlow switch to add and remove entries from the flow table of the switch and to query statistics of the flows.
   Upon receiving a packet, if an OpenFlow switch does not have an entry in the flow table or TCAM that matches the packet, the switch encapsulates and forwards the packet to the controller over a secure connection. The controller responds with a flow table entry and the original packet. The switch then installs the entry into its flow table and forwards the packet according to the actions specified in the entry. The flow table entries expire after a set amount of time, typically 60 seconds. OpenFlow switches maintain statistics for each entry in their flow table. These statistics include a packet counter, byte counter, and duration.
   The OpenFlow 1.0 specification [2] defines matching over 12 fields of the packet header (see the top line in Figure 3). The specification defines several actions, including forwarding on a single physical port, forwarding on multiple ports, forwarding to the controller, drop, queue (to a specified queue), and defaulting to traditional switching. To support such flexibility, current commercial switch implementations of OpenFlow use TCAMs for the flow table.

                    III. OUR SOLUTION: MAHOUT

   Mahout's architecture is shown in Figure 1. In Mahout, a shim layer on each end host monitors the flows originating from that host. When this layer detects an elephant flow, it marks subsequent packets of that flow using an in-band signaling mechanism. The switches in the network are configured to forward these marked packets to the Mahout controller. This simple approach allows the controller to detect elephant flows without any switch CPU- and bandwidth-intensive monitoring. The Mahout controller then manages only the elephant flows, to maintain a globally optimal arrangement of them.
   In the following, we describe Mahout's end host shim layer for detecting elephant flows, our in-band signaling method for informing the controller about elephant flows, and the Mahout network controller.

A. Detecting Elephant Flows

   An end-host-based implementation for detecting elephant flows is better than in-network monitoring or sampling based methods, particularly in datacenters, because: (1) The network behavior of a flow is determined by how rapidly the end-point applications are generating data for the flow, and this signal is not biased by congestion in the network. In contrast to in-network monitors, the end host OS has better visibility into the behavior of applications. (2) In datacenters, it is possible to augment the end host OS; this is enabled by the single administrative domain and software uniformity typical of modern datacenters. (3) Mahout's elephant detection mechanism has very little overhead (it is implemented with two if statements) on commodity servers. In contrast, using an in-network mechanism to do fine-grained flow monitoring (e.g., using exact matching on OpenFlow's 12-tuple) can be infeasible, even on an edge switch, and even more so on a core switch, especially on commodity hardware. For example, assume that 32 servers are connected, as is typical, to a rack switch. If each server generates 20 new flows per second, with a default flow timeout period of 60 seconds, an edge switch needs to maintain and monitor 38,400 flow entries. This number is infeasible in any of the real switch implementations of OpenFlow that we are aware of.
   A key idea of the Mahout system is to monitor end host socket buffers, and thus determine elephant flows earlier and with lower overheads than with in-network monitoring systems. We demonstrate the rationale for this approach with a micro-benchmark: an ftp transfer of a 50MB file from host 1 to host 2, connected via two switches, all with 1Gbps links.

Fig. 2: Amount of data observed in the TCP buffers vs. data observed at the network layer for a flow.

   In Figure 2, we show the cumulative amount of data observed on the network, and in the TCP buffer, as time progresses. The time axis starts when the application first provides data to the kernel. From the graph, one can observe that the application fills the TCP buffers at a rate much higher than the observed network rate. If the threshold for considering a flow as an elephant is 100KB (Figure 2 of [17] shows that more than 85% of flows are less than 100KB), we can see that Mahout's end host shim layer can detect a flow to be an elephant 3x sooner than in-network monitoring. In this experiment there were no other active flows on the network. In further experimental results, presented in Section V, we observe an order of magnitude faster detection when there are other flows.
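   As a rough, user-space illustration of this idea (the Mahout shim itself is an in-kernel layer), one can ask Linux how many bytes are sitting unsent in a TCP socket's send buffer and compare that backlog against the elephant threshold. The SIOCOUTQ-based helper names and the 100KB threshold below are assumptions made for this sketch, not code from the prototype.

   import fcntl
   import socket
   import struct
   import termios

   # On Linux, SIOCOUTQ (bytes queued but not yet sent on a socket) shares its
   # ioctl number with TIOCOUTQ.
   SIOCOUTQ = termios.TIOCOUTQ

   THRESHOLD_ELEPHANT = 100 * 1024  # bytes of backlog; 100KB as in the discussion above


   def unsent_bytes(sock: socket.socket) -> int:
       """Return the number of bytes queued in the socket's send buffer."""
       raw = fcntl.ioctl(sock.fileno(), SIOCOUTQ, struct.pack("I", 0))
       return struct.unpack("I", raw)[0]


   def looks_like_elephant(sock: socket.socket) -> bool:
       # A persistent backlog means the application is producing data faster
       # than the network is draining it -- the condition Mahout treats as an
       # elephant flow.
       return unsent_bytes(sock) >= THRESHOLD_ELEPHANT

   The in-kernel shim performs the equivalent check on every send; Algorithm 1, presented with the signaling mechanism below, shows that per-packet logic.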




   Mahout uses a shim layer in the end hosts to monitor the socket buffers. When a socket buffer crosses a chosen threshold, the shim layer determines that the flow is an elephant. This simple approach ensures that flows that are bottlenecked at the application layer and not in the network layer, irrespective of how long-lived they are or how many bytes they have transferred, will not be flagged as elephant flows. Such flows need no special management in the network. In contrast, if an application is generating data for a flow faster than the flow's achieved network throughput, the socket buffer will fill up, and hence Mahout will detect this as an elephant flow that needs management.

B. In-band Signaling

   Once Mahout's shim layer has detected an elephant flow, it needs to signal this to the network controller. We do this indirectly, by marking the packets in a way that is easily and efficiently detected by OpenFlow switches, and then the switches divert the marked packets to the network controller. To avoid inundating the controller with too many packets of the same flow, the end host shim layer marks the packets of an elephant flow only once every T_tagperiod seconds (we use 1 second in our prototype).
   To mark a packet, we repurpose the Differentiated Services Field (DS Field) [26] in the IPv4 header. This field was originally called the IP Type-of-Service (IPToS) byte. The first 6 bits of the DS Field, called the Differentiated Services Code Point (DSCP), define the per-hop behavior of a packet. The current OpenFlow specification [2] allows matching on the DSCP bits, and most commercial switch implementations of OpenFlow support this feature in hardware; hence, we use the DS Field for signaling between the end host shim layer and the network controller. Currently, the code point space corresponding to xxxx11 (x denotes a wild-card bit) is reserved for experimental or local usage [20], and we leverage this space. When an end host detects an elephant flow, it sets the DSCP bits to 000011 in the packets belonging to that flow. Algorithm 1 shows pseudocode for the end host shim layer function that is executed when a TCP packet is being sent.

Algorithm 1 Pseudocode for end host shim layer
 1: When sending a packet
 2: if number of bytes in buffer ≥ threshold_elephant then
 3:    /* Elephant flow */
 4:    if now() − last_tagged_time ≥ T_tagperiod then
 5:       set DS = 00001100
 6:       last_tagged_time = now()
 7:    end if
 8: end if

C. Mahout Controller

   At each rack switch, the Mahout controller initially configures two default OpenFlow flow table entries: (i) an entry to send a copy of packets with the DSCP bits set to 000011 to the controller, and (ii) the lowest-priority entry to switch packets using the NORMAL forwarding action. We set up switches to perform ECMP forwarding by default in the NORMAL operation mode. Figure 3 shows the two default entries at the bottom. In this figure, an entry has higher priority than (is matched before) the entries below it.
   When a flow starts, it normally matches the lowest-priority (NORMAL) rule, so its packets follow ECMP forwarding. When an end host detects a flow as an elephant and marks a packet of that flow, the packet marked with DSCP 000011 matches the other default rule, and the rack switch forwards it to the Mahout controller. The controller then computes the best path for this elephant and installs a flow-specific entry in the rack switch.
   In Figure 3, we show a few example entries for elephant flows. Note that these entries are installed with higher priority than Mahout's two default rules; hence, the packets corresponding to these elephant flows are switched using the actions of these flow-specific entries rather than the actions of the default entries. Also, the DS field is set to wildcard in these elephant flow entries, so that once the flow-specific rule is installed, any tagged packets from the end hosts are not forwarded to the controller.
   Once an elephant flow is reported to the Mahout controller, it needs to be placed on the best available path. We define the best path for a flow from s to t as the least congested of all paths from s to t. The least congested s-t path is found by enumerating over all such paths.
   To manage the elephant flows, Mahout regularly pulls statistics on the elephant flows and link utilizations from the switches, and uses these statistics to optimize the elephant flows' routes. This is done with the increasing first fit algorithm given in Algorithm 2. Correa and Goemans introduced this algorithm and proved that it finds routings that have at most a 10% higher link utilization than the optimal routing [13]. While we cannot guarantee this bound because we re-route only the elephant flows, we expect this algorithm to perform as well as any other heuristic.

D. Discussion

      a) DSCP bits: In Mahout, the end host shim layer uses the DSCP bits of the DS field in the IP header for signaling elephant flows. However, there may be some datacenters where DSCP is needed for other uses, such as prioritization among different types of flows (voice, video, and data) or prioritization among different customers. In such scenarios, we plan to use the VLAN Priority Code Point (PCP) [1] bits. OpenFlow supports matching on these bits too. We can leverage the fact that it is very unlikely for both these code point fields (PCP and DSCP) to be in use simultaneously.
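   The tagging step of Algorithm 1 can also be illustrated from user space. The sketch below (an illustration under our own assumptions, not the kernel shim) writes DSCP 000011 into the IPv4 DS field via the standard IP_TOS socket option and rate-limits tagging to once per T_tagperiod. Note that IP_TOS applies to all subsequent packets on the socket, so faithful per-packet marking as in Algorithm 1 requires kernel support.

   import socket
   import time

   DSCP_ELEPHANT = 0b000011           # code point from the experimental/local-use space
   TOS_ELEPHANT = DSCP_ELEPHANT << 2  # DS byte 00001100: DSCP occupies the top 6 bits
   T_TAGPERIOD = 1.0                  # seconds, as in the prototype

   _last_tagged = {}                  # flow_key -> time the flow was last tagged


   def maybe_tag(sock: socket.socket, flow_key, is_elephant: bool) -> None:
       """Called before sending: mark upcoming packets of an elephant flow with
       the 000011 code point, at most once every T_TAGPERIOD seconds per flow."""
       now = time.monotonic()
       if is_elephant and now - _last_tagged.get(flow_key, float("-inf")) >= T_TAGPERIOD:
           sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, TOS_ELEPHANT)
           _last_tagged[flow_key] = now
       else:
           # Outside the tagging window (or not an elephant), leave packets unmarked.
           sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, 0)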




Fig. 3: An example flow table setup at a switch by the Mahout controller. Flow-specific entries for detected elephant flows (with actions such as "Forward to port 4") sit above the two default rules: "Send to Controller" for tagged packets and, at the lowest priority, "NORMAL routing (ECMP)".

Algorithm 2 Offline increasing first fit
 1: sort(F); reverse(F) /* F: set of elephant flows */
 2: for f ∈ F do
 3:    for l ∈ f.path do
 4:       l.load = l.load − f.rate
 5:    end for
 6: end for
 7: for f ∈ F do
 8:    best_paths[f].congest = ∞
 9:    /* P_st: set of all s-t paths */
10:    for path ∈ P_st do
11:       congest = (f.rate + path.load) / path.bandwidth
12:       if congest < best_paths[f].congest then
13:          best_paths[f] = path
14:          best_paths[f].congest = congest
15:       end if
16:    end for
17: end for
18: return best_paths
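   For concreteness, Algorithm 2 could be rendered in Python along the following lines. The Flow/Link/path representations, the interpretation of a path's load and bandwidth as those of its bottleneck link, and the final step of adding each flow's rate back onto its chosen path (which the displayed pseudocode leaves implicit) are our assumptions for this sketch.

   import math

   def increasing_first_fit(flows, paths_between):
       """flows: elephant flows with .rate, .src, .dst and .path (list of links);
       links carry .load and .bandwidth; paths_between(s, t) yields candidate
       paths as lists of links. Returns a dict mapping each flow to a new path."""
       ordered = sorted(flows, key=lambda f: f.rate)
       ordered.reverse()                                  # sort(F); reverse(F)

       # First pass: remove each elephant's current load from the links it uses.
       for f in ordered:
           for link in f.path:
               link.load -= f.rate

       # Second pass: place each flow on its least congested s-t path.
       best_paths = {}
       for f in ordered:
           best_congestion = math.inf
           for path in paths_between(f.src, f.dst):
               load = max(link.load for link in path)       # bottleneck interpretation
               bandwidth = min(link.bandwidth for link in path)
               congestion = (f.rate + load) / bandwidth
               if congestion < best_congestion:
                   best_congestion = congestion
                   best_paths[f] = path
           for link in best_paths[f]:                       # assumed bookkeeping step
               link.load += f.rate
       return best_paths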


      b) Virtualized Datacenter: In a virtualized datacenter, a single server will host multiple guest virtual machines, each possibly running a different operating system. In such a scenario, the Mahout shim layer needs to be deployed in each of the guest virtual machines. Note that the host operating system will not have visibility into the socket buffers of a guest virtual machine. However, in cloud computing infrastructures such as Amazon EC2 [9], the infrastructure provider typically makes available a few preconfigured OS versions, which include the paravirtualization drivers to work with the provider's hypervisor. Thus, we believe that it is feasible to deploy the Mahout shim layer in virtualized datacenters, too.
      c) Elephant flow threshold: Choosing too low a value for threshold_elephant in Algorithm 1 can cause many flows to be recognized as elephants, and hence cause the rack switches to forward too many packets to the controller. When there are many elephant flows, to avoid controller overload, we could provide a means for the controller to signal the end hosts to increase the threshold value. However, this would require an out-of-band control mechanism. An alternative is to use multiple DSCP values to denote different levels of thresholds. For example, xxxx11 can be designated to denote that a flow has more than 100KB of data, xxx111 to denote more than 1MB, xx1111 to denote more than 10MB, and so on. The controller can then change the default entry corresponding to the tagged packets (second from the bottom in Figure 3) to select higher thresholds, based on the load at the controller. Further study is needed to explore these approaches.

                     IV. ANALYTICAL EVALUATION

   In this section, we analyze the expected overhead of detecting elephant flows with Mahout, with flow sampling, and by maintaining per-flow statistics (e.g., the approach used by Hedera). We set up an analytical framework to evaluate the number of switch table entries and control messages used by each method. We evaluate each method using an example datacenter, and show that Mahout is the only solution that can scale to support large datacenters.
   Flow sampling identifies elephants by sampling an expected 1 out of k packets. Once the controller has seen enough packets from the same flow, the flow is classified as an elephant. The number of packets needed to classify an elephant does not affect our analysis in this section, so we ignore it for now. Hedera [7] uses periodic polling for elephant flow detection. Every t seconds, the Hedera controller pulls the per-flow statistics from each switch. In order to estimate the true rate of a flow (i.e., the rate of the flow if it is constrained only by its endpoints' NICs and not by any link in the network), the statistics for every flow in the network must be collected. Pulling statistics for all flows using OpenFlow requires setting up a flow table entry for every flow, so each flow must be sent to the controller before it can be started; we include this cost in our analysis.
   We consider a million-server network for the following analysis. Our notation and the assumed values are shown in Table I.
   Hedera [7]: As table entries need to be maintained for all flows, the number of flow table entries needed at each rack switch is T·F·D. In our example, this translates to 32·20·60 = 38,400 entries at each rack switch. We are not aware of any existing switch with OpenFlow support that can hold this many entries in the flow table in hardware—for example, HP ProCurve 5400zl switches support up to 1.7K OpenFlow entries per linecard. It is unlikely that any switch in the near future will support so many table entries, given the expense of high-speed memory.




TABLE I: Parameters and typical values for the analytical evaluation

  Parameter   Description                                    Value
  N           Num. of end hosts                              2^20 (1M)
  T           Num. of end hosts per rack switch              32
  S           Num. of rack switches                          2^15 (32K)
  F           Avg. new flows per second per end host         20 [28]
  D           Avg. duration of a flow in the flow table      60 seconds
  c           Size of counters in bytes                      24 [2]
  r_stat      Rate of gathering statistics                   1 per second
  p           Num. of bytes in a packet                      1500
  f_m         Fraction of mice                               0.99
  f_e         Fraction of elephants                          0.01
  r_sample    Rate of sampling                               1 in 1000
  h_sample    Size of a packet sample (bytes)                60

   The Hedera controller needs to handle N·F flow setups per second, or more than 20 million requests per second in our example. A single NOX controller can handle only 30,000 requests per second; hence one needs 667 controllers just to handle the flow setup load [28], assuming that the load can be perfectly distributed. Flow scheduling, however, does not seem to be a simple task to distribute.
   The rate at which the controller needs to process the statistics packets is

      (c · T · F · D / p) · S · r_stat

In our example, this implies (24 · 38400)/1500 · 2^15 · 1 ≈ 20.1M control packets per second. Assuming that a NOX controller can handle these packets at the rate it handles flow setup requests (30,000 per second), this translates to needing 670 controllers just to process these packets. Or, if we consider only one controller, then the statistics can be gathered only once every 670 seconds (≈ 11 minutes).
   Sampling: Sampling incurs the messaging overhead of taking samples, and then installs flow table entries when an elephant is detected. The rate at which the controller needs to process the sampled packets is

      throughput · r_sample · (bytes per sample / p)

We assume that each sample contains only a 60-byte header and that headers can be combined into 1500-byte packets, so there are 25 samples per message to the controller. The aggregate throughput of a datacenter network changes frequently, but if 10% of the hosts are sending traffic, the aggregate throughput (in Gbps) is 0.10 · N. We then find the messaging overhead of sampling to be around 550K messages per second, or, if we bundle samples into packets (i.e., 25 samples fit in a 1500-byte packet), 22K messages per second.
   At first blush, this does not seem like too much overhead; however, as the network utilization increases, the messaging overhead can reach 3.75 million (or 150K, if there are 25 samples per packet) packets per second. Therefore, sampling incurs the highest overhead when load balancing is most needed. Decreasing the sampling rate reduces this overhead but adversely impacts the effectiveness of flow scheduling, since not all elephants are detected.
   We expect the number of elephants identified by sampling to be similar to Mahout, so we do not analyze the flow table entry overhead of sampling separately.
   Mahout: Because elephant flow detection is done at the end host, switches contain flow table entries for elephant flows only. Also, statistics are gathered only for the elephant flows. So, the number of flow entries per rack switch in Mahout is T·F·D·f_e = 384. The number of flow setups that the Mahout controller needs to handle is N·F·f_e, which is about 200K requests per second and needs 7 controllers. Also, the number of packets per second that need to be processed for gathering statistics is an f_e fraction of that of Hedera. Thus 7 controllers are needed for gathering statistics at the rate of once per second, or the statistics can be gathered by a single controller at the rate of once every 7 seconds.
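   These estimates follow directly from the Table I parameters; the snippet below simply redoes the arithmetic (using the 30,000 requests-per-second NOX figure quoted above). Small differences from the rounded numbers in the text, e.g., 667 versus roughly 700 controllers, come from using 2^20·20 rather than 20M flow setups per second.

   # Parameters from Table I
   N = 2 ** 20        # end hosts
   T = 32             # end hosts per rack switch
   S = 2 ** 15        # rack switches
   F = 20             # new flows per second per end host
   D = 60             # seconds a flow stays in the flow table
   c = 24             # bytes of counters per flow entry
   p = 1500           # bytes per packet
   r_stat = 1         # statistics pulls per second
   f_e = 0.01         # fraction of elephant flows
   NOX_RATE = 30_000  # flow setups per second handled by a single NOX controller

   hedera_entries_per_rack = T * F * D                        # 38,400 entries
   hedera_setups_per_sec = N * F                              # ~21M requests/s
   hedera_setup_controllers = hedera_setups_per_sec / NOX_RATE        # roughly 667-700
   hedera_stat_pkts_per_sec = (c * hedera_entries_per_rack / p) * S * r_stat  # ~20.1M pkts/s

   mahout_entries_per_rack = T * F * D * f_e                  # 384 entries
   mahout_setups_per_sec = N * F * f_e                        # ~210K requests/s
   mahout_setup_controllers = mahout_setups_per_sec / NOX_RATE        # ~7 controllers

   print(hedera_entries_per_rack, hedera_setups_per_sec, round(hedera_setup_controllers),
         round(hedera_stat_pkts_per_sec), round(mahout_entries_per_rack),
         round(mahout_setups_per_sec), round(mahout_setup_controllers))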




250
uses sampling to detect elephant flows, and (3) the Mahout               �
                                                                         Q.

scheduler as described in Sec. III-C.                                   t[ 200                                         '"
                                                                                                                                                ,.
   The stat-pulling controller behaves like Hedera [7] and               � 150               I--        -        I--         -       I--       - ,.-               I--          -         f-
                                                                         .s::;

Helios [16]. Here, the controller pulls flow statistics from              "
                                                                          ."


                                                                         �       100         I--        -        I--         -       I--       -       r--         I--          -         f-
each switch at regular intervals. The statistics from a flow             -5
                                                                         !                   I--        -        I--         -       I--       -       I--         I--          -         f-
table entry are 24 bytes, so the amount of time to transfer the                   SO

statistics from a switch to the controller is proportional to the
                                                                          �
                                                                         » 0
                                                                                                                                                             ...         ....       ;'l
number of flow table entries at the switch. When transferring                                      '"
                                                                                       e>.         CD       CD         CD        0         0       0
                                                                                       :;;                             :;;
                                                                                                                                 S
                                                                                                                                           0       0
                                                                                                                                                             c:i
                                                                                                            �
                                                                                                                                           S
                                                                                                   00                                              0
                                                                                       �           �
                                                                                                                       0
                                                                                                                                                   ;'l
statistics, we assume that the CPU-to-controller rate is the                                                           ;'l                         -;:;-
bottleneck, not the network or OpenFiow controller itself.
Once the controller has statistics for all flows, it computes                                  Mahout (threshold)                Sampling (frac.                    Pulling (s)
                                                                                                                                    packets)
a new routing for elephant flows and reassigns paths instantly.
In practice, computing this routing and inserting updated flow        Fig. 4: Throughput results for the schedulers with various parameters.
                                                                      Error bars on all charts show 95% confidence intervals.
table entries into the switches will take up to hundreds of
milliseconds. We allow this to be done instantaneously to             intra- or inter-rack flow respectively. We select the number
find the theoretical best achievable results using an offline         of bytes in a flow following the distribution of flow sizes in
approach. The global re-routing of flows is computed using            their measurements as well. Before starting the shuffle job,
the increasing best fit algorithm described in Algorithm 2. This      we simulate this background traffic for three minutes. The
algorithm is simpler than the simulated annealing employed by         simulation ends whenever the last shuffle job flow completes.
Hedera; however, we expect the results to be similar, since this            b) Metrics: To measure the performance of each sched­
heuristic is likely to be as good as any other (as discussed in       uler, we tracked the aggregate throughput of all flows; this is
Sec. III-C)                                                           the amount of bisection bandwidth the scheduler is able to
   As we are doing flow-level simulations, sampling packets is        extract from the network. We measure overhead as before in
not straightforward since there are no packets to sample from.        Section IV, i.e., by counting the number of control messages
Instead, we sample from flows by determining the amount of            and the number of flow table entries at each switch. All
time it will take for k packets to traverse a link, given its rate,   numbers shown here are averaged from ten runs.
and then sample from the flows on the link by weighting each             2) Results: The per-second aggregate throughput for the
flow by its rate. Full details are in [14].                           various scheduling methods is shown in Figure 4. We com­
      a) Workloads: We simulate background traffic modeled on recent measurements [21] and add traffic modeled on MapReduce traffic to stress the network. We assume that the MapReduce job has just gone into its shuffle phase. In this phase, each end host transfers 128MB to each other host. Each end host opens a connection to at most five other end hosts simultaneously (as done by default in Hadoop's implementation of MapReduce). Once one of these connections completes, the host opens a connection to another end host, repeating this until it has transferred its 128MB file to each other end host. The order of these outgoing connections is randomized for each end host. For all experiments here, we used 250 randomly selected end hosts in the shuffle load. The reduce phase shuffle begins three minutes after the background traffic is started to allow the background traffic to reach a steady state, and measurements shown here are taken for five minutes after the reduce phase began.
   We added background traffic following the macroscopic flow measurements collected by Kandula et al. [17], [21] to the traffic mix because datacenters run a heterogeneous mix of services simultaneously. They give the fraction of correspondents a server has within its rack and outside of its rack over a ten second interval. We follow this distribution to decide how many inter- and intra-rack flows a server starts over ten seconds; however, they do not give a more detailed breakdown of flow destinations than this, so we assume that the selection of a destination host is uniformly random across the source server's rack or the remaining racks for an intra- or inter-rack flow, respectively. We select the number of bytes in a flow following the distribution of flow sizes in their measurements as well. Before starting the shuffle job, we simulate this background traffic for three minutes. The simulation ends whenever the last shuffle job flow completes.
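As an illustration of how such a shuffle load can be generated, the sketch below sends each host's 128MB to every other host in a random order with at most five transfers in flight at a time. The start_flow() hook and its on_done callback are hypothetical simulator interfaces introduced only for this sketch.

import random

SHUFFLE_BYTES = 128 * 1024 * 1024   # each host sends 128MB to every other host
MAX_PARALLEL = 5                    # Hadoop's default number of parallel transfers

def start_shuffle(hosts, start_flow):
    # start_flow(src, dst, nbytes, on_done) is an assumed simulator hook that
    # begins a flow and invokes on_done when the transfer finishes.
    for src in hosts:
        pending = [dst for dst in hosts if dst != src]
        random.shuffle(pending)                      # randomized destination order

        def launch_next(src=src, pending=pending):
            if pending:
                dst = pending.pop()
                # When this transfer completes, immediately start the next one.
                start_flow(src, dst, SHUFFLE_BYTES, on_done=launch_next)

        for _ in range(min(MAX_PARALLEL, len(pending))):
            launch_next()                            # keep five transfers in flight

The background flows described above would be generated by a separate process that picks intra- or inter-rack destinations and flow sizes according to the measured distributions.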
      b) Metrics: To measure the performance of each scheduler, we tracked the aggregate throughput of all flows; this is the amount of bisection bandwidth the scheduler is able to extract from the network. We measure overhead as before in Section IV, i.e., by counting the number of control messages and the number of flow table entries at each switch. All numbers shown here are averaged over ten runs.

Fig. 4: Throughput results for the schedulers with various parameters. Error bars on all charts show 95% confidence intervals.

   2) Results: The per-second aggregate throughput for the various scheduling methods is shown in Figure 4. We compare these schedulers to static load balancing with equal-cost multipath (ECMP), which uniformly randomizes the outgoing flows across a set of ports [7]. We used three different elephant thresholds for Mahout: 128KB, 1MB, and 100MB; flows carrying at least this threshold of bytes were classified as an elephant after sending 2, 20, or 2000 packets, respectively. As expected, controlling elephant flows extracts more bisection bandwidth from the network: Mahout extracts 16% more bisection bandwidth than ECMP, and the other schedulers obtain similar results depending on their parameters.
   Hedera found that flow scheduling gives a much larger improvement over ECMP (up to 113% on some workloads) than we observe here [7]. This is due to the differences in workloads: our workload is based on measurements [21], whereas their workloads are synthetic. We have repeated our simulations using some of their workloads and find similar results: the schedulers improve throughput by more than 100% compared to ECMP on their workloads.
   We examine the overhead versus performance tradeoff by counting the maximum number of flow table entries per rack switch and the number of messages to the controller. These results are shown in Figures 5 and 6.
   Mahout has the least overhead of any scheduling approach considered. Pulling statistics requires too many flow table entries per switch and sends too many packets to the controller to scale to large datacenters; here, the stat-pulling scheduler used nearly 800 flow table entries per rack switch on average
no matter how frequently the statistics were pulled. This is more than seven times the number of entries used by the sampling and Mahout controllers, and it makes the offline scheduler infeasible in larger datacenters because the flow tables will not be able to support such a large number of entries. Also, when pulling stats every 1 sec., the controller receives 10x more messages than when using Mahout with an elephant threshold of 100MB.

Fig. 5: Number of packets sent to the controller by the various schedulers. Here, we bundled samples together into a single packet (there are 25 samples per packet); each bundle of samples counts as a single controller message.

Fig. 6: Average and maximum number of flow table entries at each switch used by the schedulers.

   These simulations indicate that, for our workload, the value of threshold_elephant affects the overhead of the Mahout controller but does not have much of an impact on performance (up to a point: when we set this threshold to 1GB, not shown on the charts, the Mahout scheduler performed no better than ECMP). The number of packets to the Mahout controller goes from 328 per sec. when the elephant threshold is 128KB to 214 per sec. when the threshold is 100MB, indicating that tuning it can reduce controller overhead by more than 50% without affecting the scheduler's performance. Even so, we suggest making this threshold as small as possible to save memory at the end hosts and for quicker elephant flow detection (see the experiments on our prototype in the next section). We believe a threshold of 200-500KB is best for most workloads.

B. Prototype & Microbenchmarks
   We have implemented a prototype of the Mahout system. The shim layer is implemented as a kernel module inserted between the TCP/IP stack and the device driver, and the controller is built upon NOX [28], an open-source OpenFlow controller written in Python. For the shim layer, we created a function for the pseudocode shown in Algorithm 1 and invoke it for outgoing packets, after the IP header is created in the networking stack. Implementing it as a separate kernel module improves deployability because it can be installed without modifying or upgrading the Linux kernel. Our controller leverages the NOX platform to learn the topology and configure switches with the default entries. It also processes the packets marked by the shim layer and installs entries for the elephant flows.
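The kernel module itself is not reproduced here, but the check it performs (per Algorithm 1) is small enough to sketch in user space: if a flow's send buffer holds at least threshold_elephant bytes and the flow has not been tagged within the last T_tagperiod seconds, tag its next packet by setting the DSCP bits to 000011. The Python below is an illustrative approximation that marks the socket with setsockopt(IP_TOS) rather than stamping an individual packet inside the stack.

import socket, time

THRESHOLD_ELEPHANT = 128 * 1024    # bytes queued before a flow is treated as an elephant
T_TAGPERIOD = 1.0                  # seconds between in-band tags (prototype value)
DSCP_ELEPHANT = 0b000011           # DSCP code point from the locally reserved space
TOS_ELEPHANT = DSCP_ELEPHANT << 2  # DSCP sits in the upper 6 bits of the TOS byte (00001100)

class FlowState:
    def __init__(self):
        self.last_tagged = 0.0

def maybe_tag(sock, flow_state, bytes_in_send_buffer, now=None):
    # User-space approximation of the shim-layer check from Algorithm 1.
    now = time.time() if now is None else now
    if bytes_in_send_buffer >= THRESHOLD_ELEPHANT:            # flow looks like an elephant
        if now - flow_state.last_tagged >= T_TAGPERIOD:       # rate-limit the tagging
            # The real shim sets the DS field of the next outgoing packet in the
            # kernel; setting IP_TOS on the socket is only a user-space stand-in.
            sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, TOS_ELEPHANT)
            flow_state.last_tagged = now

In the full system, a packet tagged this way matches the controller-bound default entry at the rack switch, so the signal reaches the Mahout controller without any separate control channel.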
   Our testbed for experimenting with the different components of the prototype is shown in Figure 8. Switches 1 and 2 in the figure are HP ProCurve 5400zl switches running firmware with OpenFlow support. We have two end hosts in the testbed, one acting as the server for the flows and the other acting as the client. For some experiments, we also run background flows to emulate other network traffic.

Fig. 8: Testbed for prototype experiments.

   Since this is not a large-scale testbed, we perform microbenchmark experiments focusing on the timeliness of elephant flow detection and compare Mahout against a Hedera-like polling approach. We first present measurements of the time it takes to detect an elephant flow at the end host and then present the overall time it takes for the controller to detect an elephant flow. Our workload consists of a file transfer using ftp. For experiments with background flows present, we run 10 simultaneous iperf connections.
      a) End host elephant flow detection time: In this experiment, we ftp a 50MB file from Host-1 to Host-2. We track the number of bytes in the socket buffer for that flow and the number of bytes transferred on the network, along with timestamps. We did 30 trials of this experiment. Figure 2 shows a single run. In Figure 7, we show the time it takes before a flow can be classified as an elephant based on information from the buffer utilization versus based on the number of bytes sent on the network. Here we consider different thresholds for classifying a flow as an elephant, and we present both cases, with and without background flows. It is clear from these results that Mahout's approach of monitoring the TCP buffers detects elephant flows significantly sooner at the end hosts (more than an order of magnitude sooner in some cases) and is also not affected by congestion in the network.
      b) Elephant flow detection time at the controller: In this experiment, we measure how long it takes for an elephant flow to be detected at the controller using the Mahout approach versus a Hedera-like periodic polling approach. To be fair to the polling approach, we perform the polling at the fastest rate possible (polling in a loop without any wait periods in between). As can be seen from Table II, the Mahout controller can detect an elephant flow in a few milliseconds.
Fig. 7: For each threshold, the time taken for the TCP buffer to fill with that many bytes ("Time to Queue") versus the time taken for those bytes to appear on the network ("Time to Send"), with (a) no background flows and (b) 10 background TCP flows. Error bars show 95% confidence intervals.
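The two quantities plotted in Figure 7 can be approximated from user space, which may help in reproducing the comparison. The sketch below records, for a sending TCP socket, the cumulative bytes handed to the kernel and an estimate of the bytes that have drained from the send queue, using the Linux SIOCOUTQ ioctl; the chunked send loop, the trace format, and the helper names are assumptions for illustration, not the instrumentation used in the paper.

import fcntl, socket, struct, time

SIOCOUTQ = 0x5411   # send-queue occupancy ioctl (value from <linux/sockios.h> on Linux)

def send_queue_bytes(sock):
    # Bytes still held in the socket's send queue.
    raw = fcntl.ioctl(sock.fileno(), SIOCOUTQ, struct.pack("i", 0))
    return struct.unpack("i", raw)[0]

def transfer_and_trace(sock, data, chunk=64 * 1024):
    # Send `data`, recording (elapsed, bytes_queued, bytes_drained) samples.
    trace, written, start = [], 0, time.time()
    for off in range(0, len(data), chunk):
        written += sock.send(data[off:off + chunk])
        outq = send_queue_bytes(sock)
        trace.append((time.time() - start, written, written - outq))
    return trace

def time_to_reach(trace, threshold, drained=False):
    # First timestamp at which the chosen counter reaches `threshold` bytes.
    for elapsed, queued, left_queue in trace:
        if (left_queue if drained else queued) >= threshold:
            return elapsed
    return None

Under these assumptions, time_to_reach(trace, B) approximates the "time to queue" for a threshold of B bytes, and time_to_reach(trace, B, drained=True) approximates the "time to send".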


TABLE II: Time it takes to detect an elephant flow at the Mahout controller vs. the Hedera controller, with no other active flows.

   In contrast, Hedera takes 189.83ms before the flow can be detected as an elephant, irrespective of the threshold. All times are the same because of the switch's overhead in collecting statistics and relaying them to the central controller.
   Overall, our working prototype demonstrates the deployment feasibility of Mahout. The experiments show an order of magnitude difference in the elephant flow detection times at the controller in Mahout vs. a competing approach.

                           VI. CONCLUSION
   Previous research in datacenter network management has shown that elephant flows, flows that carry large amounts of data, need to be detected and managed for better utilization of multi-path topologies. However, previous approaches for elephant flow detection are based on monitoring the behavior of flows in the network and hence incur long detection times, high switch resource usage, and/or high control bandwidth and processing overhead. In contrast, we propose a novel end-host-based solution that monitors the socket buffers to detect elephant flows and signals the network controller using an in-band mechanism. We present Mahout, a low-overhead yet effective traffic management system based on this idea. Our experimental results show that our system can detect elephant flows an order of magnitude sooner than polling-based approaches while incurring an order of magnitude lower controller overhead than other approaches.

                        ACKNOWLEDGEMENTS
   We sincerely thank Sujata Banerjee, Jeff Mogul, Puneet Sharma, and Jean Tourrilhes for several beneficial discussions and comments on earlier drafts of this paper.

                             REFERENCES
 [1] IEEE Std. 802.1Q-2005, Virtual Bridged Local Area Networks.
 [2] OpenFlow Switch Specification, Version 1.0.0. http://www.openflowswitch.org/documents/openflow-spec-v1.0.0.pdf.
 [3] sFlow. http://www.sflow.org/.
 [4] The OpenFlow Switch Consortium. http://www.openflowswitch.org/.
 [5] J. H. Ahn, N. Binkert, A. Davis, M. McLaren, and R. S. Schreiber. HyperX: topology, routing, and packaging of efficient large-scale networks. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis (SC '09), 2009.
 [6] M. Al-Fares, A. Loukissas, and A. Vahdat. A scalable, commodity data center network architecture. In SIGCOMM, 2008.
 [7] M. Al-Fares, S. Radhakrishnan, B. Raghavan, N. Huang, and A. Vahdat. Hedera: Dynamic flow scheduling for data center networks. In NSDI, 2010.
 [8] M. Alizadeh, A. Greenberg, D. A. Maltz, J. Padhye, P. Patel, B. Prabhakar, S. Sengupta, and M. Sridharan. DCTCP: Efficient packet transport for the commoditized data center. In SIGCOMM, 2010.
 [9] Amazon EC2. http://aws.amazon.com/ec2/.
[10] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield. Xen and the art of virtualization. In SOSP, pages 164-177, 2003.
[11] R. Braden, D. Clark, and S. Shenker. Integrated services in the internet architecture: an overview. Technical report, IETF, Network Working Group, June 1994.
[12] C. Clos. A study of non-blocking switching networks. Bell System Technical Journal, 32(5):406-424, 1953.
[13] J. R. Correa and M. X. Goemans. Improved bounds on nonblocking 3-stage Clos networks. SIAM J. Comput., 37(3):870-894, 2007.
[14] A. R. Curtis, W. Kim, and P. Yalagandula. Mahout: Low-Overhead Datacenter Traffic Management using End-Host-Based Elephant Detection. Technical Report HPL-2010-91, HP Labs, 2010.
[15] J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In OSDI, 2004.
[16] N. Farrington, G. Porter, S. Radhakrishnan, H. H. Bazzaz, V. Subramanya, Y. Fainman, G. Papen, and A. Vahdat. Helios: A Hybrid Electrical/Optical Switch Architecture for Modular Data Centers. In SIGCOMM, 2010.
[17] A. Greenberg, J. R. Hamilton, N. Jain, S. Kandula, C. Kim, P. Lahiri, D. Maltz, P. Patel, and S. Sengupta. VL2: A scalable and flexible data center network. In SIGCOMM, 2009.
[18] Hadoop MapReduce. http://hadoop.apache.org/mapreduce/.
[19] C. Hopps. Analysis of an Equal-Cost Multi-Path Algorithm. RFC 2992 (Informational), Nov. 2000.
[20] IANA DSCP registry. http://www.iana.org/assignments/dscp-registry.
[21] S. Kandula, S. Sengupta, A. Greenberg, and P. Patel. The nature of datacenter traffic: Measurements & analysis. In IMC, 2009.
[22] J. Kim and W. J. Dally. Flattened butterfly: A cost-efficient topology for high-radix networks. In ISCA, 2007.
[23] N. McKeown, T. Anderson, H. Balakrishnan, G. Parulkar, L. Peterson, J. Rexford, S. Shenker, and J. Turner. OpenFlow: Enabling Innovation in Campus Networks. ACM CCR, 2008.
[24] J. C. Mogul, J. Tourrilhes, P. Yalagandula, P. Sharma, A. R. Curtis, and S. Banerjee. DevoFlow: Cost-effective flow management for high performance enterprise networks. In HotNets, 2010.
[25] T. Mori, M. Uchida, R. Kawahara, J. Pan, and S. Goto. Identifying elephant flows through periodically sampled packets. In Proc. IMC, pages 115-120, Taormina, Oct. 2004.
[26] K. Nichols, S. Blake, F. Baker, and D. Black. Definition of the Differentiated Services Field (DS Field) in the IPv4 and IPv6 Headers. RFC 2474 (Proposed Standard), Dec. 1998.
[27] M. Roughan, S. Sen, O. Spatscheck, and N. Duffield. Class-of-service mapping for QoS: A statistical signature-based approach to IP traffic classification. In IMC, pages 135-148, 2004.
[28] A. Tavakoli, M. Casado, T. Koponen, and S. Shenker. Applying NOX to the datacenter. In HotNets-VIII, 2009.
[29] VMware. http://www.vmware.com.
[30] Xen. http://www.xen.org.

Mahout low-overhead datacenter traffic management using end-host-based elephant detection

  • 1. This paper was presented as part of the main technical program at IEEE INFOCOM 2011 Mahout: Low-Overhead Datacenter Traffic Management using End-Host-Based Elephant Detection Andrew R. Curtis* Wonho Kim* Praveen Yalagandula University of Waterloo Princeton University HP Labs Waterloo, Ontario, Canada Princeton, NJ, USA Palo Alto, CA, USA Abstract-Datacenters need high-bandwidth interconnection report that 90% of the flows carry less than 1MB of data fabrics. Several researchers have proposed highly-redundant and more than 90% of bytes transferred are in flows greater topologies with multiple paths between pairs of end hosts for than 100MB. Hash-based flow forwarding techniques such as datacenter networks. However, traffic management is necessary to effectively utilize the bisection bandwidth provided by these Equal-Cost Multi-Path (ECMP) routing [19] works well only topologies. This requires timely detection of elephant flows­ for large numbers of small (or mice) flows and no elephant flows that carry large amount of data-and managing those flows. For example, AI-Fares et al. 's Hedera [7] shows that flows. Previously proposed approaches incur high monitoring managing elephant flows effectively can yield as much as overheads, consume significant switch resources, and/or have long 113% higher aggregate throughput compared to ECMP. detection times. We propose, instead, to detect elephant flows at the end hosts. Existing elephant flow detection methods have several lim­ We do this by observing the end hosts's socket buffers, which itations that make them unsuitable for datacenter networks. provide better, more efficient visibility of flow behavior. We These proposals use one of three techniques to identify ele­ present Mahout, a low-overhead yet effective traffic management phants: (1) periodic polling of statistics from switches, (2) system that follows OpenFlow-like central controller approach for streaming techniques like sampling or window-based algo­ network management but augments the design with our novel end host mechanism. Once an elephant flow is detected, an end host rithms, or (3) application-level modifications (full details of signals the network controller using in-band signaling with low each approach are given in Section II). We have not seen overheads. Through analytical evaluation and experiments, we support for Quality of Service (QoS) solutions take hold, demonstrate the benefits of Mahout over previous solutions. which implies that modifying applications is probably an unac­ ceptable solution. We will show that the other two approaches I. INT RODUCT ION fall short in the datacenter setting due to high monitoring Datacenter switching fabrics have enormous bandwidth de­ overheads, significant switch resource consumption, and/or mands due to the recent uptick in bandwidth-intensive appli­ long detection times. cations used by enterprises to manage their exploding data. We assert that the right place for elephant flow detection These applications transfer huge quantities of data between is at the end hosts. In this paper, we describe Mahout, a thousands of servers. For example, Hadoop [18] performs an low-overhead yet effective traffic management system using all-to-all transfer of up to petabytes of files during the shuffle end-host-based elephant detection. We subscribe to the in­ phase of a MapReduce job [15]. 
Further, to better consolidate creasingly popular simple-switchlsmart-controller model (as in employee desktop and other computation needs, enterprises OpenFlow [4]), and so our system is similar to NOX [28] and are leveraging virtualized datacenter frameworks (e.g., using Hedera [7]. VMWare [29] and Xen [10], [30]), where timely migration of Mahout augments this basic design. It has low overhead, as virtual machines requires high throughput network. it monitors and detects elephant flows at the end host via a Designing datacenter networks using redundant topologies shim layer in the OS, rather than monitoring at the switches such as Fat-tree [6], [12], HyperX [5], and Flattened But­ in the network. Mahout does timely management of elephant terfly [22] solves the high-bandwidth requirement. However, flows through an in-band signaling mechanism between the traffic management is necessary to extract the best bisection shim layer at the end hosts and the network controller. At the bandwidth from such topologies [7]. A key challenge is that switches, any flow not signaled as an elephant is routed using the flows come and go too quickly in a data center to compute a static load-balancing scheme (e.g., ECMP). Only elephant a route for each individually; e.g., Kandula et al. report lOOK flows are monitored and managed by the central controller. flow arrivals a second in a 1,500 server cluster [21]. The combination of end host elephant detection and in-band For effective utilization of the datacenter fabric, we need to signaling eliminates the need for per-flow monitoring in the detect elephant flows-flows that transfer significant amount switches, and hence incurs low overhead and requires few of data-and dynamically orchestrate their paths. Datacenter switch resources. measurements [17], [21] show that a large fraction of datacen­ We demonstrate the benefits of Mahout using analytical ter traffic is carried in a small fraction of flows. The authors evaluation and simulations and through experiments on a small *This work was performed while Andrew and Wonho were interns at HP testbed. We have built a Linux prototype for our end host Labs-Palo Alto. elephant flow detection algorithm and tested its effectiveness. 978-1-4244-9921-2/11/$26.00 ©2011 IEEE 1629
  • 2. We have also built a Mahout controller, for setting up switches B. Identifying elephant flows with default entries and for processing the tagged packets from The mix of latency- and throughput-sensitive flows in the the end hosts. Our analytical evaluation shows that Mahout data centers means that effective flow scheduling needs to offers one to two orders of magnitude of reduction in the balance visibility and overhead-a one size fits all approach is number of flows processed by the controller and in switch not sufficient in this setting. To achieve this balance, elephant resource requirements, compared to Hedera-like approaches. flows must be identified so that they are the only flows touched Our simulations show that Mahout can achieve considerable by the controller. The following are the previously considered throughput improvements compared to static load balancing mechanisms for identifying elephants: techniques while incurring an order of magnitude lower over­ head than Hedera. Our prototype experiments show that the • Applications identify their flows as elephants: This solu­ Mahout approach can detect elephant flows at least an order tion accurately and immediately identifies elephant flows. of magnitude sooner than statistics-polling based approaches. This is a common assumption for a plethora of research The key contributions of our work are: 1) a novel end work in network QoS where focus is to give higher host based mechanism for detecting elephant flows, 2) design priority to latency and throughput-sensitive flows such of a centralized datacenter traffic management system that as voice and video applications (see, e.g., [11]). How­ has low overhead yet high effectiveness, and 3) simulation ever, this solution is impractical for traffic management and prototype experiments demonstrating the benefits of the in datacenters as each and every application must be proposed design. modified to support it. If all applications are not modified, an alternative technique will still be needed to identify II. BACKGROUND & RELATED WORK elephant flows initiated by unmodified applications. A related approach is to classify flows based on which A. Datacenter networks and traffic application is initiating them. This classifies flows using The heterogeneous mix of applications running in datacen­ stochastic machine learning techniques [27], or using ters produces flows that are generally sensitive to either latency simple matching based on the packet header fields (such or throughput. Latency-sensitive flows are usually generated as TCP port numbers). While this approach might be by network protocols (such as ARP and DNS) and interactive suitable for enterprise network management, it is un­ applications. They typically transfer up to a few kilobytes. suitable for datacenter network management because of On the other hand, throughput-sensitive flows, created by, the enormous amount of traffic in the datacenter and the e.g., MapReduce, scientific computing, and virtual machine difficulty in obtaining flow traces to train the classification migration, transfer up to gigabytes. This traffic mix implies algorithms. that a datacenter network needs to deliver high bisection • Maintain per-flow statistics: In this approach, each flow bandwidth for throughput-sensitive flows without introducing is monitored at the first switch that the flow goes setup delay on latency-sensitive flows. through. 
These statistics are pulled from switches by Designing datacenter networks using redundant topologies the controller at regular intervals and used to classify such as Fat-tree [6], [12], HyperX [5], or Flattened But­ elephant flows. Hedera [7] and Helios [16] are examples terfly [22] solves the high-bandwidth requirement. However, of systems proposing to use such a mechanism. However, these networks use multiple end-to-end paths to provide this this approach does not scale to large networks. First, high-bandwidth, so they need to load balance traffic across this consumes significant switch resources: a flow table them. Load balancing can be performed with no overhead entry for each flow monitored at a switch. We'll show using oblivious routing, where the path a flow from node i in Section IV that this requires considerable number of to node j is routed on is randomly selected from a probability flow table entries. Second, bandwidth between switches distribution over all i to j paths, but it has been shown to and the controller is limited, so much so that transferring achieve less than half the optimal throughput when the traffic statistics becomes the bottleneck in traffic management mix contains many elephant flows [7]. The other extreme is in datacenter network. As a result, the flow statistics to perform online scheduling by selecting the path for all new cannot be quickly transferred to the controller, resulting flows using a load balancing algorithm, e.g., greedily adding in prolonged sub-par routings. a flow along the path with least congestion. This approach • Sampling: Instead of monitoring each flow in the net­ doesn't scale well-flows arrive too quickly for a single sched­ work, in this approach, a controller samples packets from uler to keep up-and it adds too much setup time to latency­ all ports of the switches using switch sampling features sensitive flows. For example, flow installation using NOX can such as sFlow [3]. Only a small fraction of packets are take up to lOms [28]. Partition-aggregate applications (such sampled (typically, 1 in 1000) at the switches and only as search and other web applications) partition work across headers of the packets are transferred to the controller. multiple machines and then aggregate the responses. Jobs have The controller analyzes the samples and identifies a flow a deadline of 10-1OOms [8], so a lOms flow setup delay can as an elephant after it has seen sufficient number of sam­ consume the entire time budget. Therefore, online scheduling ples from the flow. However, such an approach can not is not suitable for latency-sensitive flows. reliably detect an elephant flow before it has carried more 1630
  • 3. : ,---:----, :::::::::::::::::::::::::::::::::::::::::::::: :::::::::::::::............................. 100 80 in � 60 2] "' >- 40 RACK RACK 20 TCPBuffer -­ SWITCH SWITCH Sent Data -- a a 100 200 300 400 500 600 700 800 r--:::--j -.- END-HOST END-HOST END-HOST END-HOST Time (us) Fig. 2: Amount of data observed in the Tep buffers vs. data observed at the network layer for a flow. simple approach allows the controller to detect elephant flows END-HOST END-HOST END·HOST END·HOST without any switch CPU- and bandwidth-intensive monitoring. The Mahout controller then manages only the elephant flows, Fig. I: Mahout architecture. to maintain a globally optimal arrangement of them. In the following, we describe Mahout's end host shim layer than 10K packets, or roughly 15MB [25]. Additionally, for detecting elephant flows, our in-band signaling method for sampling has high overhead, since the controller must informing controller about the elephant flows, and the Mahout process each sampled packet. network controller. C. OpenFlow A. Detecting Elephant Flows OpenFlow [23] aims to open up traditionally closed de­ An end host based implementation for detecting elephant signs of commercial switches to enable network innovation. flows is better than in-network monitoring/sampling based OpenFlow switches maintain a flow-table where each entry methods, particularly in datacenters, because: (l) The network contains a pattern to match and the actions to perform on a behavior of a flow is affected by how rapidly the end-point packet that matches that entry. OpenFlow defines a protocol for applications are generating data for the flow, and this is not communication between a controller and an OpenFlow switch biased by congestion in the network. In contrast to in-network to add and remove entries from the flow table of the switch monitors, the end host OS has better visibility into the behavior and to query statistics of the flows. of applications. (2) In datacenters, it is possible to augment Upon receiving a packet, if an OpenFlow switch does not the end host OS; this is enabled by the single administrative have an entry in the flow table or TCAM that matches the domain and software uniformity typical of modern datacenters. packet, the switch encapsulates and forwards the packet to the (3) Mahout's elephant detection mechanism has very little controller over a secure connection. The controller responds overhead (it is implemented with two if statements) on com­ back with a flow table entry and the original packet. The switch modity servers. In contrast, using an in-network mechanism then installs the entry into its flow table and forwards the to do fine-grained flow monitoring (e.g., using exact matching packet according to the actions specified in the entry. The flow on OpenFlow's 12-tuple) can be infeasible, even on an edge table entries expire after a set amount of time, typically 60 switch, and even more so on a core switch, especially on seconds. OpenFlow switches maintain statistics for each entry commodity hardware. For example, assume that 32 servers in their flow table. These statistics include a packet counter, are connected, as is typical, to a rack switch. If each server byte counter, and duration. generates 20 new flows per second, with a default flow timeout The OpenFlow 1.0 specification [2] defines matching over period of 60 seconds, an edge-switch needs to maintain and 12 fields of packet header (see the top line in Figure 3). The monitor 38400 flow entries. 
This number is infeasible in any specification defines several actions including forwarding on a of the real switch implementations of OpenFlow that we are single physical port, forwarding on multiple ports, forwarding aware of. to the controller, drop, queue (to a specified queue), and A key idea of the Mahout system is to monitor end host defaulting to traditional switching. To support such flexibility, socket buffers, and thus determine elephant flows before current commercial switch implementations of OpenFlow use and with lower overheads than with in-network monitoring TCAMs for flow table. systems. We demonstrate the rationale for this approach with a micro-benchmark: an ftp transfer of a 50MB file from a host III. OUR SOLUTION: MAHOUT 1 to host 2, connected via two switches all with 1 Gbps links. Mahout's architecture is shown in Figure 1. In Mahout, a In Figure 2, we show the cumulative amount of data shim layer on each end host monitors the flows originating observed on the network, and in the TCP buffer, as time from that host. When this layer detects an elephant flow, it progresses. The time axis starts when the application first marks subsequent packets of that flow using an in-band sig­ provides data to the kernel. From the graph, one can observe naling mechanism. The switches in the network are configured that the application fills the TCP buffers at a rate much higher to forward these marked packets to the Mahout controller. This than the observed network rate. If the threshold for considering 1631
  • 4. Algorithm 1 Pseudocode for end host shim layer Algorithm 1 shows pseudocode for the end host shim layer 1: When sending a packet function that is executed when a TCP packet is being sent. 2: if number of bytes in buffer � thresholdelephant then C. Mahout Controller 3: / * Elephant flow */ 4: if last-tagged-time - nowO � Ttagperiod then At each rack switch, the Mahout controller initially config­ 5: set DS = 00001100 ures two default OpenFlow flow table entries: (i) an entry to 6: last-tagged-time = nowO send a copy of packets with the DSCP bits set to 000011 to the 7: end if controller and (ii) the lowest-priority entry to switch packets 8: end if using NORMAL forwarding action. We set up switches to per­ form ECMP forwarding by default in the NORMAL operation mode. Figure 3 shows the two default entries at the bottom. a flow as an elephant is 100KB (Figure 2. of [17] shows that In this figure, an entry has a higher priority over (is matched more than 85% of flows are less than 100KB), we can see before) entries below that entry. that Mahout's end host shim layer can detect a flow to be When a flow starts, it normally will match the lowest­ an elephant 3x sooner than in-network monitoring. In this priority (NORMAL) rule, so its packet will follow ECMP experiment there were no other active flows on the network. forwarding. When an end host detects a flow as an elephant In further experimental results, presented in Section V, we and marks a packet of that flow. That packet marked with observe an order of magnitude faster detection when there are DSCP 000011 matches the other default rule, and the rack other flows. switch forwards it to the Mahout controller. The controller Mahout uses a shim layer in the end hosts to monitor then computes the best path for this elephant, and installs a the socket buffers. When a socket buffer crosses a chosen flow-specific entry in the rack switch. threshold, the shim layer determines that the flow is an In Figure 3, we show a few example entries for the ele­ elephant. This simple approach ensures that flows that are phant flows. Note that these entries are installed with higher bottlenecked at the application layer and not in the network priority than Mahout's two default rules; hence, the packets layer, irrespective of how long-lived they are or how many corresponding to these elephant flows are switched using the bytes they have transferred, will not be determined as the actions of these flow-specific entries rather than the actions of elephant flows. Such flows need no special management in the default entries. Also, the DS field is set to wildcard for the network. In contrast, if an application is generating data these elephant flow entries, so that once the flow-specific rule for a flow faster than the flow's achieved network throughput, is installed, any tagged packets from the end hosts are not the socket buffer will fill up, and hence Mahout will detect forwarded to the controller. this an an elephant flow that needs management. Once an elephant flow is reported to the Mahout controller, it needs to be placed on the best available path. We define the B. In-band Signaling best path for a flow from s to t as the least congested of all paths from s to t. The least congested s-t path is found by Once Mahout's shim layer has detected an elephant flow, enumerating over all such paths. it needs to signal this to the network controller. 
We do this To manage the elephant flows, Mahout regularly pulls indirectly, by marking the packets in a way that is easily statistics on the elephant flows and link utilizations from the and efficiently detected by OpenFlow switches, and then the switches, and uses these statistics to optimize the elephant switches divert the marked packets to the network controller. flows' routes. This is done with the increasing first fit algo­ To avoid inundating the controller with too many packets of rithm given in Algorithm 2. Correa and Goemans introduced the same flow, the end host shim layer marks the packets of this algorithm and proved that it finds routings that have at an elephant flow only once every Ttagperiod seconds (we use most a 10% higher link utilization than the optimal routing 1 second in our prototype). [13]. While we cannot guarantee this bound because we re­ To mark a packet, we repurpose the Differentiated Services route only the elephant flows, we expect this algorithm to Field (DS Field) [26] in the IPv4 header. This field was perform as well as any other heuristic. originally called the IP Type-of-Service (IPToS) byte. The first 6 bits of the DS Field, called Differentiated Services D. Discussion Code Point (DSCP), define the per-hop behavior of a packet. a) DSCP bits: In Mahout, the end host shim layer uses The current OpenFlow specification [2] allows matching on the DSCP bits of the DS field in IP header for signaling DSCP bits, and most commercial switch implementations elephant flows. However, there may be some datacenters of OpenFlow support this feature in hardware; hence, we where DSCP may be needed for other uses, such as for use the DS Field for signaling between the end host shim prioritization among different types of flows (voice, video, and layer and the network controller. Currently the code point data) or for prioritization among different customers. In such space corresponding to xxxx11 (x denotes a wild-card bit) is scenarios, we plan to use VLAN Priority Code Point (PCP) [1] reserved for experimental or local usage [20], and we leverage bits. OpenFlow supports matching on these bits too. We can this space. When an end host detects an elephant flow, it sets leverage the fact that it is very unlikely for both these code the DSCP bits to 000011 in the packets belonging to that flow. point fields (PCP and DSCP) to be in use simultaneously. 1632
Fig. 3: An example flow table setup at a switch by the Mahout controller.

Algorithm 2 Offline increasing first fit
 1: sort(F); reverse(F)   /* F: set of elephant flows */
 2: for f ∈ F do
 3:   for l ∈ f.path do
 4:     l.load = l.load - f.rate
 5:   end for
 6: end for
 7: for f ∈ F do
 8:   best_paths[f].congest = ∞
 9:   /* P_st: set of all s-t paths */
10:   for path ∈ P_st do
11:     congest = (f.rate + path.load) / path.bandwidth
12:     if congest < best_paths[f].congest then
13:       best_paths[f] = path
14:       best_paths[f].congest = congest
15:     end if
16:   end for
17: end for
18: return best_paths
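A self-contained Python sketch of this greedy re-routing is given below. Two details are our assumptions rather than the paper's listing: the congestion of a candidate path is computed as its worst-link utilization after adding the flow, and a flow's rate is charged to its chosen path so that later placements see it.

```python
def increasing_first_fit(elephants, candidate_paths):
    """Greedily re-route elephant flows onto their least congested paths (cf. Algorithm 2).

    elephants: list of dicts {"rate": ..., "src": ..., "dst": ..., "path": [link, ...]}
    candidate_paths(src, dst): returns all src-dst paths, each a list of link dicts
                               {"load": ..., "bandwidth": ...} shared with the topology.
    """
    # Consider flows in decreasing order of rate (sort, then reverse).
    flows = sorted(elephants, key=lambda f: f["rate"], reverse=True)

    # Remove the elephants' contribution from the current link loads.
    for f in flows:
        for link in f["path"]:
            link["load"] -= f["rate"]

    best_paths = []
    for f in flows:
        best, best_congestion = None, float("inf")
        for path in candidate_paths(f["src"], f["dst"]):
            congestion = max((f["rate"] + l["load"]) / l["bandwidth"] for l in path)
            if congestion < best_congestion:
                best, best_congestion = path, congestion
        for link in best:                  # charge the flow to its new path (our addition)
            link["load"] += f["rate"]
        best_paths.append((f, best))
    return best_paths
```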
b) Virtualized Datacenter: In a virtualized datacenter, a single server will host multiple guest virtual machines, each possibly running a different operating system. In such a scenario, the Mahout shim layer needs to be deployed in each of the guest virtual machines. Note that the host operating system will not have visibility into the socket buffers of a guest virtual machine. However, in cloud computing infrastructures such as Amazon EC2 [9], the infrastructure provider typically makes available a few preconfigured OS versions, which include the paravirtualization drivers needed to work with the provider's hypervisor. Thus, we believe that it is feasible to deploy the Mahout shim layer in virtualized datacenters, too.

c) Elephant flow threshold: Choosing too low a value for threshold_elephant in Algorithm 1 can cause many flows to be recognized as elephants, and hence cause the rack switches to forward too many packets to the controller. When there are many elephant flows, to avoid controller overload, we could provide a means for the controller to signal the end hosts to increase the threshold value. However, this would require an out-of-band control mechanism. An alternative, sketched below, is to use multiple DSCP values to denote different levels of thresholds. For example, xxxx11 can be designated to denote that a flow has more than 100KB of data, xxx111 to denote more than 1MB, xx1111 to denote more than 10MB, and so on. The controller can then change the default entry corresponding to the tagged packets (second from the bottom in Figure 3) to select higher thresholds, based on the load at the controller. Further study is needed to explore these approaches.
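One possible encoding of such graduated thresholds is shown below; the specific thresholds and bit patterns follow the example in the text, but the scheme itself is left to future work, so this is only an illustration.

```python
# Candidate multi-level DSCP encoding (illustrative): more low-order bits set
# means the flow has crossed a larger threshold. All values stay inside the
# experimental/local-use pool xxxx11 [20].
THRESHOLD_LEVELS = [
    (100 * 1024,    0b000011),   # more than 100KB backlogged -> xxxx11
    (1 * 1024**2,   0b000111),   # more than 1MB              -> xxx111
    (10 * 1024**2,  0b001111),   # more than 10MB             -> xx1111
]

def dscp_level(bytes_backlogged):
    """Return the DSCP for the largest threshold this flow has crossed, or None."""
    chosen = None
    for threshold, dscp in THRESHOLD_LEVELS:
        if bytes_backlogged > threshold:
            chosen = dscp
    return chosen
```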
IV. ANALYTICAL EVALUATION

In this section, we analyze the expected overhead of detecting elephant flows with Mahout, with flow sampling, and by maintaining per-flow statistics (e.g., the approach used by Hedera). We set up an analytical framework to evaluate the number of switch table entries and control messages used by each method. We evaluate each method using an example datacenter, and show that Mahout is the only solution that can scale to support large datacenters.

Flow sampling identifies elephants by sampling an expected 1 out of k packets. Once it has seen enough packets from the same flow, the flow is classified as an elephant. The number of packets needed to classify an elephant does not affect our analysis in this section, so we ignore it for now.

Hedera [7] uses periodic polling for elephant flow detection. Every t seconds, the Hedera controller pulls the per-flow statistics from each switch. In order to estimate the true rate of a flow (i.e., the rate of the flow if it is constrained only by its endpoints' NICs and not by any link in the network), the statistics for every flow in the network must be collected. Pulling statistics for all flows using OpenFlow requires setting up a flow table entry for every flow, so each flow must be sent to the controller before it can be started; we include this cost in our analysis.

We consider a million-server network for the following analysis. Our notation and the assumed values are shown in Table I.

TABLE I: Parameters and typical values for the analytical evaluation
  N         Num. of end hosts                           2^20 (1M)
  T         Num. of end hosts per rack switch           32
  S         Num. of rack switches                       2^15 (32K)
  F         Avg. new flows per second per end host      20 [28]
  D         Avg. duration of a flow in the flow table   60 seconds
  c         Size of counters in bytes                   24 [2]
  r_stat    Rate of gathering statistics                1 per second
  p         Num. of bytes in a packet                   1500
  f_m       Fraction of mice                            0.99
  f_e       Fraction of elephants                       0.01
  r_sample  Rate of sampling                            1-in-1000
  h_sample  Size of a packet sample (bytes)             60

Hedera [7]: As table entries need to be maintained for all flows, the number of flow table entries needed at each rack switch is T·F·D. In our example, this translates to 32·20·60 = 38,400 entries at each rack switch. We are not aware of any existing switch with OpenFlow support that can hold this many entries in its hardware flow table; for example, HP ProCurve 5400zl switches support up to 1.7K OpenFlow entries per linecard. It is unlikely that any switch in the near future will support so many table entries, given the expense of high-speed memory.

The Hedera controller needs to handle N·F flow setups per second, or more than 20 million requests per second in our example. A single NOX controller can handle only 30,000 requests per second; hence one needs 667 controllers just to handle the flow setup load [28], assuming that the load can be perfectly distributed. Flow scheduling, however, does not seem to be a simple task to distribute.

The rate at which the controller needs to process the statistics packets is (c·T·F·D / p)·S·r_stat. In our example, this implies (24·38,400)/1500 · 2^15 ≈ 20 million control packets per second. Assuming that a NOX controller can handle these packets at the rate it can handle flow setup requests (30,000 per second), this translates to needing 670 controllers just to process these packets. Or, if we consider only one controller, then the statistics can be gathered only once every 670 seconds (roughly 11 minutes).
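The arithmetic behind these Hedera-style estimates is easy to reproduce; the short script below simply re-evaluates the expressions above with the Table I values (the 30,000 requests per second figure for a NOX controller is from [28]).

```python
# Parameters from Table I.
N = 2**20          # end hosts
T = 32             # end hosts per rack switch
S = 2**15          # rack switches
F = 20             # new flows per second per end host
D = 60             # average lifetime of a flow table entry, seconds
c = 24             # bytes of counters per flow table entry
p = 1500           # bytes per packet
r_stat = 1         # statistics pulls per second
NOX_RATE = 30_000  # requests per second handled by one NOX controller [28]

entries_per_rack = T * F * D                          # 38,400 flow table entries
flow_setups_per_s = N * F                             # ~21M per second ("more than 20 million")
setup_controllers = flow_setups_per_s / NOX_RATE      # ~700 (667 with the paper's 20M rounding)

stats_pkts_per_s = (c * entries_per_rack / p) * S * r_stat   # ~20M control packets per second
stats_controllers = stats_pkts_per_s / NOX_RATE              # ~670, or one pull every ~670 s

print(entries_per_rack, flow_setups_per_s, round(stats_pkts_per_s))
```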
Sampling: Sampling incurs the messaging overhead of taking samples, and then installs flow table entries when an elephant is detected. The rate at which the controller needs to process the sampled packets is throughput · r_sample · (h_sample / p). We assume that each sample contains only a 60 byte header and that headers can be combined into 1500 byte packets, so there are 25 samples per message to the controller. The aggregate throughput of a datacenter network changes frequently, but if 10% of the hosts are sending traffic, the aggregate throughput (in Gbps) is 0.10·N. We then find the messaging overhead of sampling to be around 550K messages per second; if we bundle samples into packets (i.e., 25 samples fit in a 1500 byte packet), this drops to 22K messages per second.

At first blush, this messaging overhead does not seem like too much; however, as the network utilization increases, the messaging overhead can reach 3.75 million (or 150K if there are 25 samples per packet) packets per second. Therefore, sampling incurs the highest overhead when load balancing is most needed. Decreasing the sampling rate reduces this overhead but adversely impacts the effectiveness of flow scheduling, since not all elephants are detected. We expect the number of elephants identified by sampling to be similar to Mahout, so we do not analyze the flow table entry overhead of sampling separately.

Mahout: Because elephant flow detection is done at the end host, switches contain flow table entries for elephant flows only. Also, statistics are gathered only for the elephant flows. So, the number of flow entries per rack switch in Mahout is T·F·D·f_e = 384 entries. The number of flow setups that the Mahout controller needs to handle is N·F·f_e, which is about 200K requests per second and needs 7 controllers. Also, the number of packets per second that need to be processed for gathering statistics is an f_e fraction of the corresponding number for Hedera. Thus 7 controllers are needed for gathering statistics at the rate of once per second, or the statistics can be gathered by a single controller at the rate of once every 7 seconds.
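Scaling the previous script by the elephant fraction reproduces the Mahout-side numbers quoted above; again this is only a sketch using the Table I values.

```python
N, T, S, F, D = 2**20, 32, 2**15, 20, 60
c, p, r_stat = 24, 1500, 1
f_e = 0.01         # fraction of flows that are elephants (Table I)
NOX_RATE = 30_000  # requests per second per NOX controller [28]

entries_per_rack = T * F * D * f_e                 # 384 elephant entries per rack switch
flow_setups_per_s = N * F * f_e                    # ~210K, i.e. about 200K requests per second
setup_controllers = flow_setups_per_s / NOX_RATE   # ~7 controllers

stats_pkts_per_s = (c * T * F * D / p) * S * r_stat * f_e   # an f_e fraction of Hedera's load
stats_controllers = stats_pkts_per_s / NOX_RATE             # ~7 at 1 pull/s, or 1 pull every ~7 s

print(entries_per_rack, round(flow_setups_per_s), round(stats_pkts_per_s))
```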
V. EXPERIMENTS

A. Simulations

Our goal is to compare the performance and overheads of Mahout against the competing approaches described in the previous section. To do so, we implemented a flow-level, event-based simulator that can scale to a few thousand end hosts connected using a Clos topology [12]. We now describe this simulator and our evaluation of Mahout with it.

1) Methodology: We simulate a datacenter network by modeling the behavior of flows. The network topology is modeled as a capacitated, directed graph and forms a three-level Clos topology. All simulations here are of a 1,600 server datacenter network, and they use a network with rack-to-aggregation and aggregation-to-core links that are 1:5 oversubscribed, i.e., the network has 320Gbps of bisection bandwidth. All servers have 1Gbps NICs and links have 1Gbps capacity. Our simulation is event-based, so there is no discrete clock; instead, the timing of events is accurate to floating-point precision. Input to the simulator is a file listing the start time, bytes, and endpoints of a set of flows (our workloads are described below). When a flow starts or completes, the rate of each flow is recomputed.

We model the OpenFlow protocol only by accounting for the delay when a switch sets up a flow table entry for a flow. When this occurs, the switch sends the flow to the OpenFlow controller by placing it in its OpenFlow queue. This queue has 10Mbps of bandwidth (a number measured from an OpenFlow switch [24]) and infinite capacity, so our model optimistically estimates the delay between a switch and the OpenFlow controller: a real system drops arriving packets if one of these queues is full, resulting in TCP timeouts. Moreover, we assume that there is no other overhead when setting up a flow, so the OpenFlow controller deals with the flow and installs flow table entries instantly.

We simulate three different schedulers: (1) an offline scheduler that periodically pulls flow statistics from the switches, (2) a scheduler that behaves like the Mahout scheduler but uses sampling to detect elephant flows, and (3) the Mahout scheduler as described in Sec. III-C.

The stat-pulling controller behaves like Hedera [7] and Helios [16]. Here, the controller pulls flow statistics from each switch at regular intervals. The statistics from a flow table entry are 24 bytes, so the amount of time to transfer the statistics from a switch to the controller is proportional to the number of flow table entries at the switch. When transferring statistics, we assume that the CPU-to-controller rate is the bottleneck, not the network or the OpenFlow controller itself. Once the controller has statistics for all flows, it computes a new routing for elephant flows and reassigns paths instantly. In practice, computing this routing and inserting updated flow table entries into the switches will take up to hundreds of milliseconds. We allow this to be done instantaneously to find the theoretical best achievable results for an offline approach. The global re-routing of flows is computed using the increasing first fit algorithm described in Algorithm 2. This algorithm is simpler than the simulated annealing employed by Hedera; however, we expect the results to be similar, since this heuristic is likely to be as good as any other (as discussed in Sec. III-C).

As we are doing flow-level simulations, sampling packets is not straightforward, since there are no packets to sample from. Instead, we sample from flows by determining the amount of time it will take for k packets to traverse a link, given its rate, and then sample from the flows on the link by weighting each flow by its rate. Full details are in [14].
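A minimal sketch of this flow-level emulation of 1-in-k packet sampling (the function and parameter names are ours, not the simulator's) is:

```python
import random

def next_sample(flows_on_link, k, packet_bytes=1500):
    """Emulate 1-in-k packet sampling on a link when only flow rates are known.

    flows_on_link: list of (flow_id, rate_in_bytes_per_second).
    Returns (seconds_until_next_sample, sampled_flow_id), or None for an idle link.
    The delay is the time for k packets to cross the link at its aggregate rate;
    the sampled packet belongs to a flow with probability proportional to the
    flow's share of that rate.
    """
    total_rate = sum(rate for _, rate in flows_on_link)
    if total_rate <= 0:
        return None
    delay = k * packet_bytes / total_rate
    ids = [fid for fid, _ in flows_on_link]
    weights = [rate for _, rate in flows_on_link]
    sampled_id = random.choices(ids, weights=weights, k=1)[0]
    return delay, sampled_id
```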
a) Workloads: We simulate background traffic modeled on recent measurements [21] and add traffic modeled on MapReduce to stress the network. We assume that the MapReduce job has just gone into its shuffle phase. In this phase, each end host transfers 128MB to each other host. Each end host opens a connection to at most five other end hosts simultaneously (as done by default in Hadoop's implementation of MapReduce). Once one of these connections completes, the host opens a connection to another end host, repeating this until it has transferred its 128MB file to each other end host. The order of these outgoing connections is randomized for each end host. For all experiments here, we used 250 randomly selected end hosts in the shuffle load. The reduce phase shuffle begins three minutes after the background traffic is started, to allow the background traffic to reach a steady state, and the measurements shown here are taken for five minutes after the reduce phase began.

We added background traffic following the macroscopic flow measurements collected by Kandula et al. [17], [21] to the traffic mix, because datacenters run a heterogeneous mix of services simultaneously. They give the fraction of correspondents a server has within its rack and outside of its rack over a ten second interval. We follow this distribution to decide how many inter- and intra-rack flows a server starts over ten seconds; however, they do not give a more detailed breakdown of flow destinations than this, so we assume that the selection of a destination host is uniformly random across the source server's rack or the remaining racks for an intra- or inter-rack flow, respectively. We select the number of bytes in a flow following the distribution of flow sizes in their measurements as well. Before starting the shuffle job, we simulate this background traffic for three minutes. The simulation ends whenever the last shuffle job flow completes.
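For illustration, the sketch below generates one sender's shuffle schedule under the rules just described; the names are ours, and the simulator (not shown) is what keeps at most five of these transfers active at a time.

```python
import random

TRANSFER_BYTES = 128 * 2**20   # 128MB to every other host in the shuffle phase
MAX_PARALLEL = 5               # Hadoop's default number of parallel copies

def shuffle_plan(sender, hosts, seed=None):
    """Return one sender's shuffle schedule: receivers in random order plus the window size.

    The simulator starts transfers to the first MAX_PARALLEL receivers and, whenever
    one completes, immediately starts the next receiver in the list, until every
    receiver has been sent TRANSFER_BYTES.
    """
    rng = random.Random(seed)
    receivers = [h for h in hosts if h != sender]
    rng.shuffle(receivers)
    return [(sender, r, TRANSFER_BYTES) for r in receivers], MAX_PARALLEL
```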
b) Metrics: To measure the performance of each scheduler, we tracked the aggregate throughput of all flows; this is the amount of bisection bandwidth the scheduler is able to extract from the network. We measure overhead as in Section IV, i.e., by counting the number of control messages and the number of flow table entries at each switch. All numbers shown here are averaged over ten runs.

2) Results: The per-second aggregate throughput for the various scheduling methods is shown in Figure 4. We compare these schedulers to static load balancing with equal-cost multipath (ECMP), which uniformly randomizes the outgoing flows across a set of ports [7]. We used three different elephant thresholds for Mahout: 128KB, 1MB, and 100MB; flows carrying at least this threshold of bytes were classified as an elephant after sending 2, 20, or 2000 packets, respectively. As expected, controlling elephant flows extracts more bisection bandwidth from the network: Mahout extracts 16% more bisection bandwidth than ECMP, and the other schedulers obtain similar results depending on their parameters.

Fig. 4: Throughput results for the schedulers with various parameters. Error bars on all charts show 95% confidence intervals.

Hedera's results found that flow scheduling gives a much larger improvement over ECMP than our results show (up to 113% on some workloads) [7]. This is due to the differences in workloads. Our workload is based on measurements [21], whereas their workloads are synthetic. We have repeated our simulations using some of their workloads and find similar results: the schedulers improve throughput by more than 100% compared to ECMP on their workloads.

We examine the overhead versus performance tradeoff by counting the maximum number of flow table entries per rack switch and the number of messages to the controller. These results are shown in Figures 5 and 6. Mahout has the least overhead of any scheduling approach considered. Pulling statistics requires too many flow table entries per switch and sends too many packets to the controller to scale to large datacenters; here, the stat-pulling scheduler used nearly 800 flow table entries per rack switch on average, no matter how frequently the statistics were pulled. This is more than seven times the number of entries used by the sampling and Mahout controllers, and makes the offline scheduler infeasible in larger datacenters because the flow tables will not be able to support such a large number of entries. Also, when pulling stats every 1 second, the controller receives 10x more messages than when using Mahout with an elephant threshold of 100MB.

Fig. 5: Number of packets sent to the controller by the various schedulers. Here, we bundled samples together into a single packet (there are 25 samples per packet); each bundle of samples counts as a single controller message.

Fig. 6: Average and maximum number of flow table entries at each switch used by the schedulers.

These simulations indicate that, for our workload, the value of threshold_elephant affects the overhead of the Mahout controller but does not have much of an impact on performance (up to a point: when we set this threshold to 1GB, not shown on the charts, the Mahout scheduler performed no better than ECMP). The number of packets to the Mahout controller goes from 328 per second when the elephant threshold is 128KB to 214 per second when the threshold is 100MB, indicating that tuning it can reduce controller overhead by more than 50% without affecting the scheduler's performance. Even so, we suggest making this threshold as small as possible to save memory at the end hosts and for quicker elephant flow detection (see the experiments on our prototype in the next section). We believe a threshold of 200-500KB is best for most workloads.

B. Prototype & Microbenchmarks

We have implemented a prototype of the Mahout system. The shim layer is implemented as a kernel module inserted between the TCP/IP stack and the device driver, and the controller is built upon NOX [28], an open-source OpenFlow controller written in Python. For the shim layer, we created a function for the pseudocode shown in Algorithm 1 and invoke it for outgoing packets, after the IP header is created in the networking stack. Implementing it as a separate kernel module improves deployability, because it can be installed without modifying or upgrading the Linux kernel. Our controller leverages the NOX platform to learn the topology and configure switches with the default entries. It also processes the packets marked by the shim layer and installs entries for the elephant flows.

Fig. 8: Testbed for prototype experiments.

Our testbed for experimenting with the different components of the prototype is shown in Figure 8. Switches 1 and 2 in the figure are HP ProCurve 5400zl switches running firmware with OpenFlow support. We have two end hosts in the testbed, one acting as the server for the flows and the other acting as the client. For some experiments, we also run background flows to emulate other network traffic.

Since this is not a large-scale testbed, we perform microbenchmark experiments focusing on the timeliness of elephant flow detection, and compare Mahout against a Hedera-like polling approach. We first present measurements of the time it takes to detect an elephant flow at the end host, and then present the overall time it takes for the controller to detect an elephant flow. Our workload consists of a file transfer using ftp. For experiments in the presence of background flows, we run 10 simultaneous iperf connections.
a) End host elephant flow detection time: In this experiment, we ftp a 50MB file from Host-1 to Host-2. We track the number of bytes in the socket buffer for that flow and the number of bytes transferred on the network, along with the timestamps. We did 30 trials of this experiment. Figure 2 shows a single run. In Figure 7, we show the time it takes before a flow can be classified as an elephant based on information from the buffer utilization versus based on the number of bytes sent on the network. Here we consider different thresholds for classifying a flow as an elephant, and we present both cases, with and without background flows. It is clear from these results that Mahout's approach of monitoring the TCP buffers can significantly speed up elephant flow detection at the end hosts (by more than an order of magnitude in some cases) and is also not affected by congestion in the network.

Fig. 7: For each threshold (bytes), the time taken for the TCP buffer to fill versus the time taken for that many bytes to appear on the network, with (a) no background flows and (b) 10 background TCP flows. Error bars show 95% confidence intervals.
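A user-space approximation of the quantity the shim observes is the socket's unsent backlog; on Linux this can be read with the SIOCOUTQ/TIOCOUTQ ioctl, as in the sketch below (the threshold constant is an assumption, and the prototype instead reads the buffer inside the kernel).

```python
import fcntl
import socket
import struct
import termios

THRESHOLD_ELEPHANT = 100 * 1024   # assumed threshold, matching the smallest case in Figure 7

def unsent_bytes(sock: socket.socket) -> int:
    """Bytes queued in the TCP socket's send buffer that the kernel has not yet sent.

    Uses the Linux SIOCOUTQ/TIOCOUTQ ioctl; a user-space stand-in for the
    in-kernel view that Mahout's shim layer has of the socket buffer.
    """
    raw = fcntl.ioctl(sock.fileno(), termios.TIOCOUTQ, struct.pack("i", 0))
    return struct.unpack("i", raw)[0]

def looks_like_elephant(sock: socket.socket) -> bool:
    return unsent_bytes(sock) >= THRESHOLD_ELEPHANT
```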
b) Elephant flow detection time at the controller: In this experiment, we measure how long it takes for an elephant flow to be detected at the controller using the Mahout approach versus a Hedera-like periodic polling approach. To be fair to the polling approach, we performed the periodic polling at the fastest rate possible (polling in a loop without any wait periods in between). As can be seen from Table II, the Mahout controller can detect an elephant flow in a few milliseconds. In contrast, Hedera takes 189.83ms before the flow can be detected as an elephant, irrespective of the threshold. All times are the same due to the switch's overheads in collecting statistics and relaying them to the central controller.

TABLE II: Time it takes to detect an elephant flow at the Mahout controller vs. the Hedera controller, with no other active flows.

Overall, our working prototype demonstrates the deployment feasibility of Mahout. The experiments show an order of magnitude difference in the elephant flow detection times at the controller in Mahout vs. a competing approach.

VI. CONCLUSION

Previous research in datacenter network management has shown that elephant flows (flows that carry large amounts of data) need to be detected and managed for better utilization of multi-path topologies. However, previous approaches for elephant flow detection are based on monitoring the behavior of flows in the network, and hence incur long detection times, high switch resource usage, and/or high control bandwidth and processing overhead. In contrast, we propose a novel end-host-based solution that monitors the socket buffers to detect elephant flows and signals the network controller using an in-band mechanism. We present Mahout, a low-overhead yet effective traffic management system based on this idea. Our experimental results show that our system can detect elephant flows an order of magnitude sooner than polling-based approaches while incurring an order of magnitude lower controller overhead than other approaches.

ACKNOWLEDGEMENTS

We sincerely thank Sujata Banerjee, Jeff Mogul, Puneet Sharma, and Jean Tourrilhes for several beneficial discussions and comments on earlier drafts of this paper.

REFERENCES

[1] IEEE Std. 802.1Q-2005, Virtual Bridged Local Area Networks.
[2] OpenFlow Switch Specification, Version 1.0.0. http://www.openflowswitch.org/documents/openflow-spec-v1.0.0.pdf.
[3] sFlow. http://www.sflow.org/.
[4] The OpenFlow Switch Consortium. http://www.openflowswitch.org/.
[5] J. H. Ahn, N. Binkert, A. Davis, M. McLaren, and R. S. Schreiber. HyperX: topology, routing, and packaging of efficient large-scale networks. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis (SC '09), 2009.
[6] M. Al-Fares, A. Loukissas, and A. Vahdat. A scalable, commodity data center network architecture. In SIGCOMM, 2008.
[7] M. Al-Fares, S. Radhakrishnan, B. Raghavan, N. Huang, and A. Vahdat. Hedera: Dynamic flow scheduling for data center networks. In NSDI, 2010.
[8] M. Alizadeh, A. Greenberg, D. A. Maltz, J. Padhye, P. Patel, B. Prabhakar, S. Sengupta, and M. Sridharan. DCTCP: Efficient packet transport for the commoditized data center. In SIGCOMM, 2010.
[9] Amazon EC2. http://aws.amazon.com/ec2/.
[10] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield. Xen and the art of virtualization. In SOSP, pages 164-177, 2003.
[11] R. Braden, D. Clark, and S. Shenker. Integrated services in the Internet architecture: an overview. Technical report, IETF, Network Working Group, June 1994.
[12] C. Clos. A study of non-blocking switching networks. Bell System Technical Journal, 32(5):406-424, 1953.
[13] J. R. Correa and M. X. Goemans. Improved bounds on nonblocking 3-stage Clos networks. SIAM J. Comput., 37(3):870-894, 2007.
[14] A. R. Curtis, W. Kim, and P. Yalagandula. Mahout: Low-Overhead Datacenter Traffic Management using End-Host-Based Elephant Detection. Technical Report HPL-2010-91, HP Labs, 2010.
[15] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, 2004.
[16] N. Farrington, G. Porter, S. Radhakrishnan, H. H. Bazzaz, V. Subramanya, Y. Fainman, G. Papen, and A. Vahdat. Helios: A hybrid electrical/optical switch architecture for modular data centers. In SIGCOMM, 2010.
[17] A. Greenberg, J. R. Hamilton, N. Jain, S. Kandula, C. Kim, P. Lahiri, D. Maltz, P. Patel, and S. Sengupta. VL2: a scalable and flexible data center network. In SIGCOMM, 2009.
[18] Hadoop MapReduce. http://hadoop.apache.org/mapreduce/.
[19] C. Hopps. Analysis of an Equal-Cost Multi-Path Algorithm. RFC 2992 (Informational), Nov. 2000.
[20] IANA DSCP registry. http://www.iana.org/assignments/dscp-registry.
[21] S. Kandula, S. Sengupta, A. Greenberg, and P. Patel. The nature of datacenter traffic: Measurements & analysis. In IMC, 2009.
[22] J. Kim and W. J. Dally. Flattened butterfly: A cost-efficient topology for high-radix networks. In ISCA, 2007.
[23] N. McKeown, T. Anderson, H. Balakrishnan, G. Parulkar, L. Peterson, J. Rexford, S. Shenker, and J. Turner. OpenFlow: Enabling innovation in campus networks. ACM CCR, 2008.
[24] J. C. Mogul, J. Tourrilhes, P. Yalagandula, P. Sharma, A. R. Curtis, and S. Banerjee. DevoFlow: Cost-effective flow management for high performance enterprise networks. In HotNets, 2010.
[25] T. Mori, M. Uchida, R. Kawahara, J. Pan, and S. Goto. Identifying elephant flows through periodically sampled packets. In IMC, pages 115-120, Taormina, Oct. 2004.
[26] K. Nichols, S. Blake, F. Baker, and D. Black. Definition of the Differentiated Services Field (DS Field) in the IPv4 and IPv6 Headers. RFC 2474 (Proposed Standard), Dec. 1998.
[27] M. Roughan, S. Sen, O. Spatscheck, and N. Duffield. Class-of-service mapping for QoS: A statistical signature-based approach to IP traffic classification. In IMC, pages 135-148, 2004.
[28] A. Tavakoli, M. Casado, T. Koponen, and S. Shenker. Applying NOX to the datacenter. In HotNets-VIII, 2009.
[29] VMware. http://www.vmware.com.
[30] Xen. http://www.xen.org.