This paper was presented as part of the main technical program at IEEE INFOCOM 2011




Mahout: Low-Overhead Datacenter Traffic Management using

                             End-Host-Based Elephant Detection

                 Andrew R. Curtis*                             Wonho Kim*                       Praveen Yalagandula
                University of Waterloo                      Princeton University                      HP Labs
               Waterloo, Ontario, Canada                    Princeton, NJ, USA                  Palo Alto, CA, USA



   *This work was performed while Andrew and Wonho were interns at HP Labs, Palo Alto.

   Abstract—Datacenters need high-bandwidth interconnection fabrics. Several researchers have proposed highly-redundant topologies with multiple paths between pairs of end hosts for datacenter networks. However, traffic management is necessary to effectively utilize the bisection bandwidth provided by these topologies. This requires timely detection of elephant flows—flows that carry large amounts of data—and managing those flows. Previously proposed approaches incur high monitoring overheads, consume significant switch resources, and/or have long detection times.
   We propose, instead, to detect elephant flows at the end hosts. We do this by observing the end hosts' socket buffers, which provide better, more efficient visibility of flow behavior. We present Mahout, a low-overhead yet effective traffic management system that follows an OpenFlow-like central-controller approach to network management but augments the design with our novel end host mechanism. Once an elephant flow is detected, an end host signals the network controller using in-band signaling with low overheads. Through analytical evaluation and experiments, we demonstrate the benefits of Mahout over previous solutions.

                          I. INTRODUCTION

   Datacenter switching fabrics have enormous bandwidth demands due to the recent uptick in bandwidth-intensive applications used by enterprises to manage their exploding data. These applications transfer huge quantities of data between thousands of servers. For example, Hadoop [18] performs an all-to-all transfer of up to petabytes of files during the shuffle phase of a MapReduce job [15]. Further, to better consolidate employee desktop and other computation needs, enterprises are leveraging virtualized datacenter frameworks (e.g., using VMWare [29] and Xen [10], [30]), where timely migration of virtual machines requires a high-throughput network.
   Designing datacenter networks using redundant topologies such as Fat-tree [6], [12], HyperX [5], and Flattened Butterfly [22] solves the high-bandwidth requirement. However, traffic management is necessary to extract the best bisection bandwidth from such topologies [7]. A key challenge is that flows come and go too quickly in a datacenter to compute a route for each individually; e.g., Kandula et al. report 100K flow arrivals a second in a 1,500 server cluster [21].
   For effective utilization of the datacenter fabric, we need to detect elephant flows—flows that transfer significant amounts of data—and dynamically orchestrate their paths. Datacenter measurements [17], [21] show that a large fraction of datacenter traffic is carried in a small fraction of flows. The authors report that 90% of the flows carry less than 1MB of data and more than 90% of bytes transferred are in flows greater than 100MB. Hash-based flow forwarding techniques such as Equal-Cost Multi-Path (ECMP) routing [19] work well only for large numbers of small (or mice) flows and no elephant flows. For example, Al-Fares et al.'s Hedera [7] shows that managing elephant flows effectively can yield as much as 113% higher aggregate throughput compared to ECMP.
   Existing elephant flow detection methods have several limitations that make them unsuitable for datacenter networks. These proposals use one of three techniques to identify elephants: (1) periodic polling of statistics from switches, (2) streaming techniques like sampling or window-based algorithms, or (3) application-level modifications (full details of each approach are given in Section II). We have not seen support for Quality of Service (QoS) solutions take hold, which implies that modifying applications is probably an unacceptable solution. We will show that the other two approaches fall short in the datacenter setting due to high monitoring overheads, significant switch resource consumption, and/or long detection times.
   We assert that the right place for elephant flow detection is at the end hosts. In this paper, we describe Mahout, a low-overhead yet effective traffic management system using end-host-based elephant detection. We subscribe to the increasingly popular simple-switch/smart-controller model (as in OpenFlow [4]), and so our system is similar to NOX [28] and Hedera [7].
   Mahout augments this basic design. It has low overhead, as it monitors and detects elephant flows at the end host via a shim layer in the OS, rather than monitoring at the switches in the network. Mahout does timely management of elephant flows through an in-band signaling mechanism between the shim layer at the end hosts and the network controller. At the switches, any flow not signaled as an elephant is routed using a static load-balancing scheme (e.g., ECMP). Only elephant flows are monitored and managed by the central controller. The combination of end host elephant detection and in-band signaling eliminates the need for per-flow monitoring in the switches, and hence incurs low overhead and requires few switch resources.
   We demonstrate the benefits of Mahout using analytical evaluation and simulations and through experiments on a small testbed. We have built a Linux prototype for our end host elephant flow detection algorithm and tested its effectiveness. We have also built a Mahout controller, for setting up switches with default entries and for processing the tagged packets from the end hosts. Our analytical evaluation shows that Mahout offers one to two orders of magnitude of reduction in the number of flows processed by the controller and in switch resource requirements, compared to Hedera-like approaches. Our simulations show that Mahout can achieve considerable throughput improvements compared to static load balancing techniques while incurring an order of magnitude lower overhead than Hedera. Our prototype experiments show that the Mahout approach can detect elephant flows at least an order of magnitude sooner than statistics-polling based approaches.




   The key contributions of our work are: 1) a novel end-host-based mechanism for detecting elephant flows, 2) the design of a centralized datacenter traffic management system that has low overhead yet high effectiveness, and 3) simulation and prototype experiments demonstrating the benefits of the proposed design.

                   II. BACKGROUND & RELATED WORK

A. Datacenter networks and traffic

   The heterogeneous mix of applications running in datacenters produces flows that are generally sensitive to either latency or throughput. Latency-sensitive flows are usually generated by network protocols (such as ARP and DNS) and interactive applications. They typically transfer up to a few kilobytes. On the other hand, throughput-sensitive flows, created by, e.g., MapReduce, scientific computing, and virtual machine migration, transfer up to gigabytes. This traffic mix implies that a datacenter network needs to deliver high bisection bandwidth for throughput-sensitive flows without introducing setup delay on latency-sensitive flows.
   Designing datacenter networks using redundant topologies such as Fat-tree [6], [12], HyperX [5], or Flattened Butterfly [22] solves the high-bandwidth requirement. However, these networks use multiple end-to-end paths to provide this high bandwidth, so they need to load balance traffic across them. Load balancing can be performed with no overhead using oblivious routing, where the path a flow from node i to node j takes is randomly selected from a probability distribution over all i-to-j paths, but this has been shown to achieve less than half the optimal throughput when the traffic mix contains many elephant flows [7]. The other extreme is to perform online scheduling by selecting the path for every new flow using a load balancing algorithm, e.g., greedily adding a flow along the path with the least congestion. This approach doesn't scale well—flows arrive too quickly for a single scheduler to keep up—and it adds too much setup time to latency-sensitive flows. For example, flow installation using NOX can take up to 10ms [28]. Partition-aggregate applications (such as search and other web applications) partition work across multiple machines and then aggregate the responses. Jobs have a deadline of 10-100ms [8], so a 10ms flow setup delay can consume the entire time budget. Therefore, online scheduling is not suitable for latency-sensitive flows.

B. Identifying elephant flows

   The mix of latency- and throughput-sensitive flows in datacenters means that effective flow scheduling needs to balance visibility and overhead—a one-size-fits-all approach is not sufficient in this setting. To achieve this balance, elephant flows must be identified so that they are the only flows touched by the controller. The following are the previously considered mechanisms for identifying elephants:
   • Applications identify their flows as elephants: This solution accurately and immediately identifies elephant flows. This is a common assumption in a plethora of research work on network QoS, where the focus is to give higher priority to latency- and throughput-sensitive flows, such as those of voice and video applications (see, e.g., [11]). However, this solution is impractical for traffic management in datacenters, as each and every application must be modified to support it. If all applications are not modified, an alternative technique will still be needed to identify elephant flows initiated by unmodified applications. A related approach is to classify flows based on which application is initiating them. This classifies flows using stochastic machine learning techniques [27], or using simple matching on packet header fields (such as TCP port numbers). While this approach might be suitable for enterprise network management, it is unsuitable for datacenter network management because of the enormous amount of traffic in the datacenter and the difficulty in obtaining flow traces to train the classification algorithms.
   • Maintain per-flow statistics: In this approach, each flow is monitored at the first switch that the flow goes through. These statistics are pulled from the switches by the controller at regular intervals and used to classify elephant flows. Hedera [7] and Helios [16] are examples of systems proposing to use such a mechanism. However, this approach does not scale to large networks. First, it consumes significant switch resources: a flow table entry for each flow monitored at a switch. We will show in Section IV that this requires a considerable number of flow table entries. Second, bandwidth between the switches and the controller is limited, so much so that transferring statistics becomes the bottleneck in traffic management in a datacenter network. As a result, the flow statistics cannot be quickly transferred to the controller, resulting in prolonged sub-par routings.
   • Sampling: Instead of monitoring each flow in the network, in this approach a controller samples packets from all ports of the switches using switch sampling features such as sFlow [3]. Only a small fraction of packets are sampled (typically, 1 in 1000) at the switches, and only the headers of the sampled packets are transferred to the controller. The controller analyzes the samples and identifies a flow as an elephant after it has seen a sufficient number of samples from the flow (a minimal sketch of this sample-counting classification follows this list). However, such an approach cannot reliably detect an elephant flow before it has carried more than 10K packets, or roughly 15MB [25]. Additionally, sampling has high overhead, since the controller must process each sampled packet.
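   To make the sampling approach concrete, the following is a minimal sketch of how a controller might classify elephants from sFlow-style samples: it counts samples per flow key and declares an elephant once a chosen number of samples of the same flow has been seen. The class, the flow-key format, and the 10-sample threshold are illustrative assumptions, not details taken from [3] or [25].

   from collections import defaultdict

   # Assumed threshold: with 1-in-1000 sampling of 1500-byte packets, waiting for
   # ~10 samples means the flow has likely already sent on the order of 10K
   # packets (~15MB) -- the late-detection problem described above.
   SAMPLES_TO_CLASSIFY = 10

   class SampleBasedElephantClassifier:
       """Counts packet samples per flow and flags a flow as an elephant once
       enough samples from that flow have been observed."""

       def __init__(self, samples_to_classify: int = SAMPLES_TO_CLASSIFY):
           self.samples_to_classify = samples_to_classify
           self.sample_counts = defaultdict(int)
           self.elephants = set()

       def on_sample(self, flow_key) -> bool:
           """flow_key is, e.g., the 5-tuple (src_ip, dst_ip, proto, sport, dport).
           Returns True if the flow is (now) classified as an elephant."""
           if flow_key in self.elephants:
               return True
           self.sample_counts[flow_key] += 1
           if self.sample_counts[flow_key] >= self.samples_to_classify:
               self.elephants.add(flow_key)
               return True
           return False

   Even with such a simple classifier, every sampled header still has to reach and be processed by the controller, which is the overhead quantified in Section IV.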




Fig. 1: Mahout architecture.

C. OpenFlow

   OpenFlow [23] aims to open up the traditionally closed designs of commercial switches to enable network innovation. OpenFlow switches maintain a flow table where each entry contains a pattern to match and the actions to perform on a packet that matches that entry. OpenFlow defines a protocol for communication between a controller and an OpenFlow switch to add and remove entries from the flow table of the switch and to query statistics of the flows.
   Upon receiving a packet, if an OpenFlow switch does not have an entry in the flow table or TCAM that matches the packet, the switch encapsulates and forwards the packet to the controller over a secure connection. The controller responds with a flow table entry and the original packet. The switch then installs the entry into its flow table and forwards the packet according to the actions specified in the entry. The flow table entries expire after a set amount of time, typically 60 seconds. OpenFlow switches maintain statistics for each entry in their flow table. These statistics include a packet counter, byte counter, and duration.
   The OpenFlow 1.0 specification [2] defines matching over 12 fields of the packet header (see the top line in Figure 3). The specification defines several actions, including forwarding on a single physical port, forwarding on multiple ports, forwarding to the controller, drop, queue (to a specified queue), and defaulting to traditional switching. To support such flexibility, current commercial switch implementations of OpenFlow use TCAMs for the flow table.

                    III. OUR SOLUTION: MAHOUT

   Mahout's architecture is shown in Figure 1. In Mahout, a shim layer on each end host monitors the flows originating from that host. When this layer detects an elephant flow, it marks subsequent packets of that flow using an in-band signaling mechanism. The switches in the network are configured to forward these marked packets to the Mahout controller. This simple approach allows the controller to detect elephant flows without any switch CPU- and bandwidth-intensive monitoring. The Mahout controller then manages only the elephant flows, to maintain a globally optimal arrangement of them.
   In the following, we describe Mahout's end host shim layer for detecting elephant flows, our in-band signaling method for informing the controller about elephant flows, and the Mahout network controller.

A. Detecting Elephant Flows

   An end-host-based implementation for detecting elephant flows is better than in-network monitoring or sampling based methods, particularly in datacenters, because: (1) The network behavior of a flow is determined by how rapidly the end-point applications are generating data for the flow, and this signal is not biased by congestion in the network. In contrast to in-network monitors, the end host OS has better visibility into the behavior of applications. (2) In datacenters, it is possible to augment the end host OS; this is enabled by the single administrative domain and software uniformity typical of modern datacenters. (3) Mahout's elephant detection mechanism has very little overhead (it is implemented with two if statements) on commodity servers. In contrast, using an in-network mechanism to do fine-grained flow monitoring (e.g., using exact matching on OpenFlow's 12-tuple) can be infeasible, even on an edge switch, and even more so on a core switch, especially on commodity hardware. For example, assume that 32 servers are connected, as is typical, to a rack switch. If each server generates 20 new flows per second, with a default flow timeout period of 60 seconds, an edge switch needs to maintain and monitor 38,400 flow entries. This number is infeasible in any of the real switch implementations of OpenFlow that we are aware of.
   A key idea of the Mahout system is to monitor end host socket buffers, and thus determine elephant flows earlier and with lower overheads than with in-network monitoring systems. We demonstrate the rationale for this approach with a micro-benchmark: an ftp transfer of a 50MB file from host 1 to host 2, connected via two switches, all with 1Gbps links.

Fig. 2: Amount of data observed in the TCP buffers vs. data observed at the network layer for a flow.

   In Figure 2, we show the cumulative amount of data observed on the network, and in the TCP buffer, as time progresses. The time axis starts when the application first provides data to the kernel. From the graph, one can observe that the application fills the TCP buffers at a rate much higher than the observed network rate. If the threshold for considering a flow as an elephant is 100KB (Figure 2 of [17] shows that more than 85% of flows are less than 100KB), we can see that Mahout's end host shim layer can detect a flow to be an elephant 3x sooner than in-network monitoring. In this experiment there were no other active flows on the network. In further experimental results, presented in Section V, we observe an order of magnitude faster detection when there are other flows.
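   As a rough, user-space illustration of this idea (the Mahout shim itself is an in-kernel layer), one can ask Linux how many bytes are sitting unsent in a TCP socket's send buffer and compare that backlog against the elephant threshold. The SIOCOUTQ-based helper names and the 100KB threshold below are assumptions made for this sketch, not code from the prototype.

   import fcntl
   import socket
   import struct
   import termios

   # On Linux, SIOCOUTQ (bytes queued but not yet sent on a socket) shares its
   # ioctl number with TIOCOUTQ.
   SIOCOUTQ = termios.TIOCOUTQ

   THRESHOLD_ELEPHANT = 100 * 1024  # bytes of backlog; 100KB as in the discussion above


   def unsent_bytes(sock: socket.socket) -> int:
       """Return the number of bytes queued in the socket's send buffer."""
       raw = fcntl.ioctl(sock.fileno(), SIOCOUTQ, struct.pack("I", 0))
       return struct.unpack("I", raw)[0]


   def looks_like_elephant(sock: socket.socket) -> bool:
       # A persistent backlog means the application is producing data faster
       # than the network is draining it -- the condition Mahout treats as an
       # elephant flow.
       return unsent_bytes(sock) >= THRESHOLD_ELEPHANT

   The in-kernel shim performs the equivalent check on every send; Algorithm 1, presented with the signaling mechanism below, shows that per-packet logic.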




   Mahout uses a shim layer in the end hosts to monitor the socket buffers. When a socket buffer crosses a chosen threshold, the shim layer determines that the flow is an elephant. This simple approach ensures that flows that are bottlenecked at the application layer and not in the network layer, irrespective of how long-lived they are or how many bytes they have transferred, will not be flagged as elephant flows. Such flows need no special management in the network. In contrast, if an application is generating data for a flow faster than the flow's achieved network throughput, the socket buffer will fill up, and hence Mahout will detect this as an elephant flow that needs management.

B. In-band Signaling

   Once Mahout's shim layer has detected an elephant flow, it needs to signal this to the network controller. We do this indirectly, by marking the packets in a way that is easily and efficiently detected by OpenFlow switches, and then the switches divert the marked packets to the network controller. To avoid inundating the controller with too many packets of the same flow, the end host shim layer marks the packets of an elephant flow only once every T_tagperiod seconds (we use 1 second in our prototype).
   To mark a packet, we repurpose the Differentiated Services Field (DS Field) [26] in the IPv4 header. This field was originally called the IP Type-of-Service (IPToS) byte. The first 6 bits of the DS Field, called the Differentiated Services Code Point (DSCP), define the per-hop behavior of a packet. The current OpenFlow specification [2] allows matching on the DSCP bits, and most commercial switch implementations of OpenFlow support this feature in hardware; hence, we use the DS Field for signaling between the end host shim layer and the network controller. Currently, the code point space corresponding to xxxx11 (x denotes a wild-card bit) is reserved for experimental or local usage [20], and we leverage this space. When an end host detects an elephant flow, it sets the DSCP bits to 000011 in the packets belonging to that flow. Algorithm 1 shows pseudocode for the end host shim layer function that is executed when a TCP packet is being sent.

Algorithm 1 Pseudocode for end host shim layer
 1: When sending a packet
 2: if number of bytes in buffer ≥ threshold_elephant then
 3:    /* Elephant flow */
 4:    if now() − last_tagged_time ≥ T_tagperiod then
 5:       set DS = 00001100
 6:       last_tagged_time = now()
 7:    end if
 8: end if

C. Mahout Controller

   At each rack switch, the Mahout controller initially configures two default OpenFlow flow table entries: (i) an entry to send a copy of packets with the DSCP bits set to 000011 to the controller, and (ii) the lowest-priority entry to switch packets using the NORMAL forwarding action. We set up switches to perform ECMP forwarding by default in the NORMAL operation mode. Figure 3 shows the two default entries at the bottom. In this figure, an entry has higher priority than (is matched before) the entries below it.
   When a flow starts, it normally matches the lowest-priority (NORMAL) rule, so its packets follow ECMP forwarding. When an end host detects a flow as an elephant and marks a packet of that flow, the packet marked with DSCP 000011 matches the other default rule, and the rack switch forwards it to the Mahout controller. The controller then computes the best path for this elephant and installs a flow-specific entry in the rack switch.
   In Figure 3, we show a few example entries for elephant flows. Note that these entries are installed with higher priority than Mahout's two default rules; hence, the packets corresponding to these elephant flows are switched using the actions of these flow-specific entries rather than the actions of the default entries. Also, the DS field is set to wildcard in these elephant flow entries, so that once the flow-specific rule is installed, any tagged packets from the end hosts are not forwarded to the controller.
   Once an elephant flow is reported to the Mahout controller, it needs to be placed on the best available path. We define the best path for a flow from s to t as the least congested of all paths from s to t. The least congested s-t path is found by enumerating over all such paths.
   To manage the elephant flows, Mahout regularly pulls statistics on the elephant flows and link utilizations from the switches, and uses these statistics to optimize the elephant flows' routes. This is done with the increasing first fit algorithm given in Algorithm 2. Correa and Goemans introduced this algorithm and proved that it finds routings that have at most a 10% higher link utilization than the optimal routing [13]. While we cannot guarantee this bound because we re-route only the elephant flows, we expect this algorithm to perform as well as any other heuristic.

D. Discussion

      a) DSCP bits: In Mahout, the end host shim layer uses the DSCP bits of the DS field in the IP header for signaling elephant flows. However, there may be some datacenters where DSCP is needed for other uses, such as prioritization among different types of flows (voice, video, and data) or prioritization among different customers. In such scenarios, we plan to use the VLAN Priority Code Point (PCP) [1] bits. OpenFlow supports matching on these bits too. We can leverage the fact that it is very unlikely for both these code point fields (PCP and DSCP) to be in use simultaneously.
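   The tagging step of Algorithm 1 can also be illustrated from user space. The sketch below (an illustration under our own assumptions, not the kernel shim) writes DSCP 000011 into the IPv4 DS field via the standard IP_TOS socket option and rate-limits tagging to once per T_tagperiod. Note that IP_TOS applies to all subsequent packets on the socket, so faithful per-packet marking as in Algorithm 1 requires kernel support.

   import socket
   import time

   DSCP_ELEPHANT = 0b000011           # code point from the experimental/local-use space
   TOS_ELEPHANT = DSCP_ELEPHANT << 2  # DS byte 00001100: DSCP occupies the top 6 bits
   T_TAGPERIOD = 1.0                  # seconds, as in the prototype

   _last_tagged = {}                  # flow_key -> time the flow was last tagged


   def maybe_tag(sock: socket.socket, flow_key, is_elephant: bool) -> None:
       """Called before sending: mark upcoming packets of an elephant flow with
       the 000011 code point, at most once every T_TAGPERIOD seconds per flow."""
       now = time.monotonic()
       if is_elephant and now - _last_tagged.get(flow_key, float("-inf")) >= T_TAGPERIOD:
           sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, TOS_ELEPHANT)
           _last_tagged[flow_key] = now
       else:
           # Outside the tagging window (or not an elephant), leave packets unmarked.
           sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, 0)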




Fig. 3: An example flow table setup at a switch by the Mahout controller. Flow-specific entries for detected elephant flows (with actions such as "Forward to port 4") sit above the two default rules: "Send to Controller" for tagged packets and, at the lowest priority, "NORMAL routing (ECMP)".

Algorithm 2 Offline increasing first fit
 1: sort(F); reverse(F) /* F: set of elephant flows */
 2: for f ∈ F do
 3:    for l ∈ f.path do
 4:       l.load = l.load − f.rate
 5:    end for
 6: end for
 7: for f ∈ F do
 8:    best_paths[f].congest = ∞
 9:    /* P_st: set of all s-t paths */
10:    for path ∈ P_st do
11:       congest = (f.rate + path.load) / path.bandwidth
12:       if congest < best_paths[f].congest then
13:          best_paths[f] = path
14:          best_paths[f].congest = congest
15:       end if
16:    end for
17: end for
18: return best_paths
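   For concreteness, Algorithm 2 could be rendered in Python along the following lines. The Flow/Link/path representations, the interpretation of a path's load and bandwidth as those of its bottleneck link, and the final step of adding each flow's rate back onto its chosen path (which the displayed pseudocode leaves implicit) are our assumptions for this sketch.

   import math

   def increasing_first_fit(flows, paths_between):
       """flows: elephant flows with .rate, .src, .dst and .path (list of links);
       links carry .load and .bandwidth; paths_between(s, t) yields candidate
       paths as lists of links. Returns a dict mapping each flow to a new path."""
       ordered = sorted(flows, key=lambda f: f.rate)
       ordered.reverse()                                  # sort(F); reverse(F)

       # First pass: remove each elephant's current load from the links it uses.
       for f in ordered:
           for link in f.path:
               link.load -= f.rate

       # Second pass: place each flow on its least congested s-t path.
       best_paths = {}
       for f in ordered:
           best_congestion = math.inf
           for path in paths_between(f.src, f.dst):
               load = max(link.load for link in path)       # bottleneck interpretation
               bandwidth = min(link.bandwidth for link in path)
               congestion = (f.rate + load) / bandwidth
               if congestion < best_congestion:
                   best_congestion = congestion
                   best_paths[f] = path
           for link in best_paths[f]:                       # assumed bookkeeping step
               link.load += f.rate
       return best_paths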


      b) Virtualized Datacenter: In a virtualized datacenter, a single server will host multiple guest virtual machines, each possibly running a different operating system. In such a scenario, the Mahout shim layer needs to be deployed in each of the guest virtual machines. Note that the host operating system will not have visibility into the socket buffers of a guest virtual machine. However, in cloud computing infrastructures such as Amazon EC2 [9], the infrastructure provider typically makes available a few preconfigured OS versions, which include the paravirtualization drivers to work with the provider's hypervisor. Thus, we believe that it is feasible to deploy the Mahout shim layer in virtualized datacenters, too.
      c) Elephant flow threshold: Choosing too low a value for threshold_elephant in Algorithm 1 can cause many flows to be recognized as elephants, and hence cause the rack switches to forward too many packets to the controller. When there are many elephant flows, to avoid controller overload, we could provide a means for the controller to signal the end hosts to increase the threshold value. However, this would require an out-of-band control mechanism. An alternative is to use multiple DSCP values to denote different levels of thresholds. For example, xxxx11 can be designated to denote that a flow has more than 100KB of data, xxx111 to denote more than 1MB, xx1111 to denote more than 10MB, and so on. The controller can then change the default entry corresponding to the tagged packets (second from the bottom in Figure 3) to select higher thresholds, based on the load at the controller. Further study is needed to explore these approaches.

                     IV. ANALYTICAL EVALUATION

   In this section, we analyze the expected overhead of detecting elephant flows with Mahout, with flow sampling, and by maintaining per-flow statistics (e.g., the approach used by Hedera). We set up an analytical framework to evaluate the number of switch table entries and control messages used by each method. We evaluate each method using an example datacenter, and show that Mahout is the only solution that can scale to support large datacenters.
   Flow sampling identifies elephants by sampling an expected 1 out of k packets. Once the controller has seen enough packets from the same flow, the flow is classified as an elephant. The number of packets needed to classify an elephant does not affect our analysis in this section, so we ignore it for now. Hedera [7] uses periodic polling for elephant flow detection. Every t seconds, the Hedera controller pulls the per-flow statistics from each switch. In order to estimate the true rate of a flow (i.e., the rate of the flow if it is constrained only by its endpoints' NICs and not by any link in the network), the statistics for every flow in the network must be collected. Pulling statistics for all flows using OpenFlow requires setting up a flow table entry for every flow, so each flow must be sent to the controller before it can be started; we include this cost in our analysis.
   We consider a million-server network for the following analysis. Our notation and the assumed values are shown in Table I.
   Hedera [7]: As table entries need to be maintained for all flows, the number of flow table entries needed at each rack switch is T·F·D. In our example, this translates to 32·20·60 = 38,400 entries at each rack switch. We are not aware of any existing switch with OpenFlow support that can hold this many entries in the flow table in hardware—for example, HP ProCurve 5400zl switches support up to 1.7K OpenFlow entries per linecard. It is unlikely that any switch in the near future will support so many table entries, given the expense of high-speed memory.




TABLE I: Parameters and typical values for the analytical evaluation

  Parameter   Description                                    Value
  N           Num. of end hosts                              2^20 (1M)
  T           Num. of end hosts per rack switch              32
  S           Num. of rack switches                          2^15 (32K)
  F           Avg. new flows per second per end host         20 [28]
  D           Avg. duration of a flow in the flow table      60 seconds
  c           Size of counters in bytes                      24 [2]
  r_stat      Rate of gathering statistics                   1 per second
  p           Num. of bytes in a packet                      1500
  f_m         Fraction of mice                               0.99
  f_e         Fraction of elephants                          0.01
  r_sample    Rate of sampling                               1 in 1000
  h_sample    Size of a packet sample (bytes)                60

   The Hedera controller needs to handle N·F flow setups per second, or more than 20 million requests per second in our example. A single NOX controller can handle only 30,000 requests per second; hence one needs 667 controllers just to handle the flow setup load [28], assuming that the load can be perfectly distributed. Flow scheduling, however, does not seem to be a simple task to distribute.
   The rate at which the controller needs to process the statistics packets is

      (c · T · F · D / p) · S · r_stat

In our example, this implies (24 · 38400)/1500 · 2^15 · 1 ≈ 20.1M control packets per second. Assuming that a NOX controller can handle these packets at the rate it handles flow setup requests (30,000 per second), this translates to needing 670 controllers just to process these packets. Or, if we consider only one controller, then the statistics can be gathered only once every 670 seconds (≈ 11 minutes).
   Sampling: Sampling incurs the messaging overhead of taking samples, and then installs flow table entries when an elephant is detected. The rate at which the controller needs to process the sampled packets is

      throughput · r_sample · (bytes per sample / p)

We assume that each sample contains only a 60-byte header and that headers can be combined into 1500-byte packets, so there are 25 samples per message to the controller. The aggregate throughput of a datacenter network changes frequently, but if 10% of the hosts are sending traffic, the aggregate throughput (in Gbps) is 0.10 · N. We then find the messaging overhead of sampling to be around 550K messages per second, or, if we bundle samples into packets (i.e., 25 samples fit in a 1500-byte packet), 22K messages per second.
   At first blush, this does not seem like too much overhead; however, as the network utilization increases, the messaging overhead can reach 3.75 million (or 150K, if there are 25 samples per packet) packets per second. Therefore, sampling incurs the highest overhead when load balancing is most needed. Decreasing the sampling rate reduces this overhead but adversely impacts the effectiveness of flow scheduling, since not all elephants are detected.
   We expect the number of elephants identified by sampling to be similar to Mahout, so we do not analyze the flow table entry overhead of sampling separately.
   Mahout: Because elephant flow detection is done at the end host, switches contain flow table entries for elephant flows only. Also, statistics are gathered only for the elephant flows. So, the number of flow entries per rack switch in Mahout is T·F·D·f_e = 384. The number of flow setups that the Mahout controller needs to handle is N·F·f_e, which is about 200K requests per second and needs 7 controllers. Also, the number of packets per second that need to be processed for gathering statistics is an f_e fraction of that of Hedera. Thus 7 controllers are needed for gathering statistics at the rate of once per second, or the statistics can be gathered by a single controller at the rate of once every 7 seconds.
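   These estimates follow directly from the Table I parameters; the snippet below simply redoes the arithmetic (using the 30,000 requests-per-second NOX figure quoted above). Small differences from the rounded numbers in the text, e.g., 667 versus roughly 700 controllers, come from using 2^20·20 rather than 20M flow setups per second.

   # Parameters from Table I
   N = 2 ** 20        # end hosts
   T = 32             # end hosts per rack switch
   S = 2 ** 15        # rack switches
   F = 20             # new flows per second per end host
   D = 60             # seconds a flow stays in the flow table
   c = 24             # bytes of counters per flow entry
   p = 1500           # bytes per packet
   r_stat = 1         # statistics pulls per second
   f_e = 0.01         # fraction of elephant flows
   NOX_RATE = 30_000  # flow setups per second handled by a single NOX controller

   hedera_entries_per_rack = T * F * D                        # 38,400 entries
   hedera_setups_per_sec = N * F                              # ~21M requests/s
   hedera_setup_controllers = hedera_setups_per_sec / NOX_RATE        # roughly 667-700
   hedera_stat_pkts_per_sec = (c * hedera_entries_per_rack / p) * S * r_stat  # ~20.1M pkts/s

   mahout_entries_per_rack = T * F * D * f_e                  # 384 entries
   mahout_setups_per_sec = N * F * f_e                        # ~210K requests/s
   mahout_setup_controllers = mahout_setups_per_sec / NOX_RATE        # ~7 controllers

   print(hedera_entries_per_rack, hedera_setups_per_sec, round(hedera_setup_controllers),
         round(hedera_stat_pkts_per_sec), round(mahout_entries_per_rack),
         round(mahout_setups_per_sec), round(mahout_setup_controllers))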




250
uses sampling to detect elephant flows, and (3) the Mahout               �
                                                                         Q.

scheduler as described in Sec. III-C.                                   t[ 200                                         '"
                                                                                                                                                ,.
   The stat-pulling controller behaves like Hedera [7] and               � 150               I--        -        I--         -       I--       - ,.-               I--          -         f-
                                                                         .s::;

Helios [16]. Here, the controller pulls flow statistics from              "
                                                                          ."


                                                                         �       100         I--        -        I--         -       I--       -       r--         I--          -         f-
each switch at regular intervals. The statistics from a flow             -5
                                                                         !                   I--        -        I--         -       I--       -       I--         I--          -         f-
table entry are 24 bytes, so the amount of time to transfer the                   SO

statistics from a switch to the controller is proportional to the
                                                                          �
                                                                         » 0
                                                                                                                                                             ...         ....       ;'l
number of flow table entries at the switch. When transferring                                      '"
                                                                                       e>.         CD       CD         CD        0         0       0
                                                                                       :;;                             :;;
                                                                                                                                 S
                                                                                                                                           0       0
                                                                                                                                                             c:i
                                                                                                            �
                                                                                                                                           S
                                                                                                   00                                              0
                                                                                       �           �
                                                                                                                       0
                                                                                                                                                   ;'l
statistics, we assume that the CPU-to-controller rate is the                                                           ;'l                         -;:;-
bottleneck, not the network or OpenFiow controller itself.
Once the controller has statistics for all flows, it computes                                  Mahout (threshold)                Sampling (frac.                    Pulling (s)
                                                                                                                                    packets)
a new routing for elephant flows and reassigns paths instantly.
In practice, computing this routing and inserting updated flow        Fig. 4: Throughput results for the schedulers with various parameters.
                                                                      Error bars on all charts show 95% confidence intervals.
table entries into the switches will take up to hundreds of
milliseconds. We allow this to be done instantaneously to             intra- or inter-rack flow respectively. We select the number
find the theoretical best achievable results using an offline         of bytes in a flow following the distribution of flow sizes in
approach. The global re-routing of flows is computed using            their measurements as well. Before starting the shuffle job,
the increasing best fit algorithm described in Algorithm 2. This      we simulate this background traffic for three minutes. The
algorithm is simpler than the simulated annealing employed by         simulation ends whenever the last shuffle job flow completes.
Hedera; however, we expect the results to be similar, since this            b) Metrics: To measure the performance of each sched­
heuristic is likely to be as good as any other (as discussed in       uler, we tracked the aggregate throughput of all flows; this is
Sec. III-C)                                                           the amount of bisection bandwidth the scheduler is able to
   As we are doing flow-level simulations, sampling packets is        extract from the network. We measure overhead as before in
not straightforward since there are no packets to sample from.        Section IV, i.e., by counting the number of control messages
Instead, we sample from flows by determining the amount of            and the number of flow table entries at each switch. All
time it will take for k packets to traverse a link, given its rate,   numbers shown here are averaged from ten runs.
and then sample from the flows on the link by weighting each             2) Results: The per-second aggregate throughput for the
flow by its rate. Full details are in [14].                           various scheduling methods is shown in Figure 4. We com­
      a) Workloads: We simulate background traffic modeled on recent measurements [21] and add traffic modeled on MapReduce traffic to stress the network. We assume that the MapReduce job has just gone into its shuffle phase. In this phase, each end host transfers 128MB to each other host. Each end host opens a connection to at most five other end hosts simultaneously (as done by default in Hadoop's implementation of MapReduce). Once one of these connections completes, the host opens a connection to another end host, repeating this until it has transferred its 128MB file to each other end host. The order of these outgoing connections is randomized for each end host. For all experiments here, we used 250 randomly selected end hosts in the shuffle load. The reduce phase shuffle begins three minutes after the background traffic is started to allow the background traffic to reach a steady state, and measurements shown here are taken for five minutes after the reduce phase began.
   We added background traffic following the macroscopic flow measurements collected by Kandula et al. [17], [21] to the traffic mix because datacenters run a heterogeneous mix of services simultaneously. They give the fraction of correspondents a server has within its rack and outside of its rack over a ten second interval. We follow this distribution to decide how many inter- and intra-rack flows a server starts over ten seconds; however, they do not give a more detailed breakdown of flow destinations than this, so we assume that the selection of a destination host is uniformly random across the source server's rack or the remaining racks for an intra- or inter-rack flow, respectively. We select the number of bytes in a flow following the distribution of flow sizes in their measurements as well. Before starting the shuffle job, we simulate this background traffic for three minutes. The simulation ends whenever the last shuffle job flow completes.
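As an illustration of how such a shuffle load can be generated, the sketch below sends each host's 128MB to every other host in a random order with at most five transfers in flight at a time. The start_flow() hook and its on_done callback are hypothetical simulator interfaces introduced only for this sketch.

import random

SHUFFLE_BYTES = 128 * 1024 * 1024   # each host sends 128MB to every other host
MAX_PARALLEL = 5                    # Hadoop's default number of parallel transfers

def start_shuffle(hosts, start_flow):
    # start_flow(src, dst, nbytes, on_done) is an assumed simulator hook that
    # begins a flow and invokes on_done when the transfer finishes.
    for src in hosts:
        pending = [dst for dst in hosts if dst != src]
        random.shuffle(pending)                      # randomized destination order

        def launch_next(src=src, pending=pending):
            if pending:
                dst = pending.pop()
                # When this transfer completes, immediately start the next one.
                start_flow(src, dst, SHUFFLE_BYTES, on_done=launch_next)

        for _ in range(min(MAX_PARALLEL, len(pending))):
            launch_next()                            # keep five transfers in flight

The background flows described above would be generated by a separate process that picks intra- or inter-rack destinations and flow sizes according to the measured distributions.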
      b) Metrics: To measure the performance of each scheduler, we tracked the aggregate throughput of all flows; this is the amount of bisection bandwidth the scheduler is able to extract from the network. We measure overhead as before in Section IV, i.e., by counting the number of control messages and the number of flow table entries at each switch. All numbers shown here are averaged over ten runs.

Fig. 4: Throughput results for the schedulers with various parameters. Error bars on all charts show 95% confidence intervals.

   2) Results: The per-second aggregate throughput for the various scheduling methods is shown in Figure 4. We compare these schedulers to static load balancing with equal-cost multipath (ECMP), which uniformly randomizes the outgoing flows across a set of ports [7]. We used three different elephant thresholds for Mahout: 128KB, 1MB, and 100MB; flows carrying at least this threshold of bytes were classified as an elephant after sending 2, 20, or 2000 packets, respectively. As expected, controlling elephant flows extracts more bisection bandwidth from the network: Mahout extracts 16% more bisection bandwidth than ECMP, and the other schedulers obtain similar results depending on their parameters.
   Hedera found that flow scheduling gives a much larger improvement over ECMP (up to 113% on some workloads) than we observe here [7]. This is due to the differences in workloads: our workload is based on measurements [21], whereas their workloads are synthetic. We have repeated our simulations using some of their workloads and find similar results: the schedulers improve throughput by more than 100% compared to ECMP on their workloads.
   We examine the overhead versus performance tradeoff by counting the maximum number of flow table entries per rack switch and the number of messages to the controller. These results are shown in Figures 5 and 6.
   Mahout has the least overhead of any scheduling approach considered. Pulling statistics requires too many flow table entries per switch and sends too many packets to the controller to scale to large datacenters; here, the stat-pulling scheduler used nearly 800 flow table entries per rack switch on average
no matter how frequently the statistics were pulled. This is more than seven times the number of entries used by the sampling and Mahout controllers, and it makes the offline scheduler infeasible in larger datacenters because the flow tables will not be able to support such a large number of entries. Also, when pulling stats every 1 sec., the controller receives 10x more messages than when using Mahout with an elephant threshold of 100MB.

Fig. 5: Number of packets sent to the controller by the various schedulers. Here, we bundled samples together into a single packet (there are 25 samples per packet); each bundle of samples counts as a single controller message.

Fig. 6: Average and maximum number of flow table entries at each switch used by the schedulers.

   These simulations indicate that, for our workload, the value of threshold_elephant affects the overhead of the Mahout controller but does not have much of an impact on performance (up to a point: when we set this threshold to 1GB, not shown on the charts, the Mahout scheduler performed no better than ECMP). The number of packets to the Mahout controller goes from 328 per sec. when the elephant threshold is 128KB to 214 per sec. when the threshold is 100MB, indicating that tuning it can reduce controller overhead by more than 50% without affecting the scheduler's performance. Even so, we suggest making this threshold as small as possible to save memory at the end hosts and for quicker elephant flow detection (see the experiments on our prototype in the next section). We believe a threshold of 200-500KB is best for most workloads.

B. Prototype & Microbenchmarks
   We have implemented a prototype of the Mahout system. The shim layer is implemented as a kernel module inserted between the TCP/IP stack and the device driver, and the controller is built upon NOX [28], an open-source OpenFlow controller written in Python. For the shim layer, we created a function for the pseudocode shown in Algorithm 1 and invoke it for outgoing packets, after the IP header is created in the networking stack. Implementing it as a separate kernel module improves deployability because it can be installed without modifying or upgrading the Linux kernel. Our controller leverages the NOX platform to learn the topology and configure switches with the default entries. It also processes the packets marked by the shim layer and installs entries for the elephant flows.
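The kernel module itself is not reproduced here, but the check it performs (per Algorithm 1) is small enough to sketch in user space: if a flow's send buffer holds at least threshold_elephant bytes and the flow has not been tagged within the last T_tagperiod seconds, tag its next packet by setting the DSCP bits to 000011. The Python below is an illustrative approximation that marks the socket with setsockopt(IP_TOS) rather than stamping an individual packet inside the stack.

import socket, time

THRESHOLD_ELEPHANT = 128 * 1024    # bytes queued before a flow is treated as an elephant
T_TAGPERIOD = 1.0                  # seconds between in-band tags (prototype value)
DSCP_ELEPHANT = 0b000011           # DSCP code point from the locally reserved space
TOS_ELEPHANT = DSCP_ELEPHANT << 2  # DSCP sits in the upper 6 bits of the TOS byte (00001100)

class FlowState:
    def __init__(self):
        self.last_tagged = 0.0

def maybe_tag(sock, flow_state, bytes_in_send_buffer, now=None):
    # User-space approximation of the shim-layer check from Algorithm 1.
    now = time.time() if now is None else now
    if bytes_in_send_buffer >= THRESHOLD_ELEPHANT:            # flow looks like an elephant
        if now - flow_state.last_tagged >= T_TAGPERIOD:       # rate-limit the tagging
            # The real shim sets the DS field of the next outgoing packet in the
            # kernel; setting IP_TOS on the socket is only a user-space stand-in.
            sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, TOS_ELEPHANT)
            flow_state.last_tagged = now

In the full system, a packet tagged this way matches the controller-bound default entry at the rack switch, so the signal reaches the Mahout controller without any separate control channel.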
   Our testbed for experimenting with the different components of the prototype is shown in Figure 8. Switches 1 and 2 in the figure are HP ProCurve 5400zl switches running firmware with OpenFlow support. We have two end hosts in the testbed, one acting as the server for the flows and the other acting as the client. For some experiments, we also run background flows to emulate other network traffic.

Fig. 8: Testbed for prototype experiments.

   Since this is not a large-scale testbed, we perform microbenchmark experiments focusing on the timeliness of elephant flow detection and compare Mahout against a Hedera-like polling approach. We first present measurements of the time it takes to detect an elephant flow at the end host and then present the overall time it takes for the controller to detect an elephant flow. Our workload consists of a file transfer using ftp. For experiments with background flows present, we run 10 simultaneous iperf connections.
      a) End host elephant flow detection time: In this experiment, we ftp a 50MB file from Host-1 to Host-2. We track the number of bytes in the socket buffer for that flow and the number of bytes transferred on the network, along with timestamps. We did 30 trials of this experiment. Figure 2 shows a single run. In Figure 7, we show the time it takes before a flow can be classified as an elephant based on information from the buffer utilization versus based on the number of bytes sent on the network. Here we consider different thresholds for classifying a flow as an elephant, and we present both cases, with and without background flows. It is clear from these results that Mahout's approach of monitoring the TCP buffers detects elephant flows significantly sooner at the end hosts (more than an order of magnitude sooner in some cases) and is also not affected by congestion in the network.
      b) Elephant flow detection time at the controller: In this experiment, we measure how long it takes for an elephant flow to be detected at the controller using the Mahout approach versus a Hedera-like periodic polling approach. To be fair to the polling approach, we perform the polling at the fastest rate possible (polling in a loop without any wait periods in between). As can be seen from Table II, the Mahout controller can detect an elephant flow in a few milliseconds.
Fig. 7: For each threshold, the time taken for the TCP buffer to fill with that many bytes ("Time to Queue") versus the time taken for those bytes to appear on the network ("Time to Send"), with (a) no background flows and (b) 10 background TCP flows. Error bars show 95% confidence intervals.
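The two quantities plotted in Figure 7 can be approximated from user space, which may help in reproducing the comparison. The sketch below records, for a sending TCP socket, the cumulative bytes handed to the kernel and an estimate of the bytes that have drained from the send queue, using the Linux SIOCOUTQ ioctl; the chunked send loop, the trace format, and the helper names are assumptions for illustration, not the instrumentation used in the paper.

import fcntl, socket, struct, time

SIOCOUTQ = 0x5411   # send-queue occupancy ioctl (value from <linux/sockios.h> on Linux)

def send_queue_bytes(sock):
    # Bytes still held in the socket's send queue.
    raw = fcntl.ioctl(sock.fileno(), SIOCOUTQ, struct.pack("i", 0))
    return struct.unpack("i", raw)[0]

def transfer_and_trace(sock, data, chunk=64 * 1024):
    # Send `data`, recording (elapsed, bytes_queued, bytes_drained) samples.
    trace, written, start = [], 0, time.time()
    for off in range(0, len(data), chunk):
        written += sock.send(data[off:off + chunk])
        outq = send_queue_bytes(sock)
        trace.append((time.time() - start, written, written - outq))
    return trace

def time_to_reach(trace, threshold, drained=False):
    # First timestamp at which the chosen counter reaches `threshold` bytes.
    for elapsed, queued, left_queue in trace:
        if (left_queue if drained else queued) >= threshold:
            return elapsed
    return None

Under these assumptions, time_to_reach(trace, B) approximates the "time to queue" for a threshold of B bytes, and time_to_reach(trace, B, drained=True) approximates the "time to send".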


TABLE II: Time it takes to detect an elephant flow at the Mahout controller vs. the Hedera controller, with no other active flows.

   In contrast, Hedera takes 189.83ms before the flow can be detected as an elephant, irrespective of the threshold. All times are the same because of the switch's overhead in collecting statistics and relaying them to the central controller.
   Overall, our working prototype demonstrates the deployment feasibility of Mahout. The experiments show an order of magnitude difference in the elephant flow detection times at the controller in Mahout vs. a competing approach.

                           VI. CONCLUSION
   Previous research in datacenter network management has shown that elephant flows, flows that carry large amounts of data, need to be detected and managed for better utilization of multi-path topologies. However, previous approaches for elephant flow detection are based on monitoring the behavior of flows in the network and hence incur long detection times, high switch resource usage, and/or high control bandwidth and processing overhead. In contrast, we propose a novel end-host-based solution that monitors the socket buffers to detect elephant flows and signals the network controller using an in-band mechanism. We present Mahout, a low-overhead yet effective traffic management system based on this idea. Our experimental results show that our system can detect elephant flows an order of magnitude sooner than polling-based approaches while incurring an order of magnitude lower controller overhead than other approaches.

                        ACKNOWLEDGEMENTS
   We sincerely thank Sujata Banerjee, Jeff Mogul, Puneet Sharma, and Jean Tourrilhes for several beneficial discussions and comments on earlier drafts of this paper.

                             REFERENCES
 [1] IEEE Std. 802.1Q-2005, Virtual Bridged Local Area Networks.
 [2] OpenFlow Switch Specification, Version 1.0.0. http://www.openflowswitch.org/documents/openflow-spec-v1.0.0.pdf.
 [3] sFlow. http://www.sflow.org/.
 [4] The OpenFlow Switch Consortium. http://www.openflowswitch.org/.
 [5] J. H. Ahn, N. Binkert, A. Davis, M. McLaren, and R. S. Schreiber. HyperX: topology, routing, and packaging of efficient large-scale networks. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis (SC '09), 2009.
 [6] M. Al-Fares, A. Loukissas, and A. Vahdat. A scalable, commodity data center network architecture. In SIGCOMM, 2008.
 [7] M. Al-Fares, S. Radhakrishnan, B. Raghavan, N. Huang, and A. Vahdat. Hedera: Dynamic flow scheduling for data center networks. In NSDI, 2010.
 [8] M. Alizadeh, A. Greenberg, D. A. Maltz, J. Padhye, P. Patel, B. Prabhakar, S. Sengupta, and M. Sridharan. DCTCP: Efficient packet transport for the commoditized data center. In SIGCOMM, 2010.
 [9] Amazon EC2. http://aws.amazon.com/ec2/.
[10] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield. Xen and the art of virtualization. In SOSP, pages 164-177, 2003.
[11] R. Braden, D. Clark, and S. Shenker. Integrated services in the internet architecture: an overview. Technical report, IETF, Network Working Group, June 1994.
[12] C. Clos. A study of non-blocking switching networks. Bell System Technical Journal, 32(5):406-424, 1953.
[13] J. R. Correa and M. X. Goemans. Improved bounds on nonblocking 3-stage Clos networks. SIAM J. Comput., 37(3):870-894, 2007.
[14] A. R. Curtis, W. Kim, and P. Yalagandula. Mahout: Low-Overhead Datacenter Traffic Management using End-Host-Based Elephant Detection. Technical Report HPL-2010-91, HP Labs, 2010.
[15] J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In OSDI, 2004.
[16] N. Farrington, G. Porter, S. Radhakrishnan, H. H. Bazzaz, V. Subramanya, Y. Fainman, G. Papen, and A. Vahdat. Helios: A Hybrid Electrical/Optical Switch Architecture for Modular Data Centers. In SIGCOMM, 2010.
[17] A. Greenberg, J. R. Hamilton, N. Jain, S. Kandula, C. Kim, P. Lahiri, D. Maltz, P. Patel, and S. Sengupta. VL2: A scalable and flexible data center network. In SIGCOMM, 2009.
[18] Hadoop MapReduce. http://hadoop.apache.org/mapreduce/.
[19] C. Hopps. Analysis of an Equal-Cost Multi-Path Algorithm. RFC 2992 (Informational), Nov. 2000.
[20] IANA DSCP registry. http://www.iana.org/assignments/dscp-registry.
[21] S. Kandula, S. Sengupta, A. Greenberg, and P. Patel. The nature of datacenter traffic: Measurements & analysis. In IMC, 2009.
[22] J. Kim and W. J. Dally. Flattened butterfly: A cost-efficient topology for high-radix networks. In ISCA, 2007.
[23] N. McKeown, T. Anderson, H. Balakrishnan, G. Parulkar, L. Peterson, J. Rexford, S. Shenker, and J. Turner. OpenFlow: Enabling Innovation in Campus Networks. ACM CCR, 2008.
[24] J. C. Mogul, J. Tourrilhes, P. Yalagandula, P. Sharma, A. R. Curtis, and S. Banerjee. DevoFlow: Cost-effective flow management for high performance enterprise networks. In HotNets, 2010.
[25] T. Mori, M. Uchida, R. Kawahara, J. Pan, and S. Goto. Identifying elephant flows through periodically sampled packets. In Proc. IMC, pages 115-120, Taormina, Oct. 2004.
[26] K. Nichols, S. Blake, F. Baker, and D. Black. Definition of the Differentiated Services Field (DS Field) in the IPv4 and IPv6 Headers. RFC 2474 (Proposed Standard), Dec. 1998.
[27] M. Roughan, S. Sen, O. Spatscheck, and N. Duffield. Class-of-service mapping for QoS: A statistical signature-based approach to IP traffic classification. In IMC, pages 135-148, 2004.
[28] A. Tavakoli, M. Casado, T. Koponen, and S. Shenker. Applying NOX to the datacenter. In HotNets-VIII, 2009.
[29] VMware. http://www.vmware.com.
[30] Xen. http://www.xen.org.

Mahout low-overhead datacenter traffic management using end-host-based elephant detection

  • 1. This paper was presented as part of the main technical program at IEEE INFOCOM 2011 Mahout: Low-Overhead Datacenter Traffic Management using End-Host-Based Elephant Detection Andrew R. Curtis* Wonho Kim* Praveen Yalagandula University of Waterloo Princeton University HP Labs Waterloo, Ontario, Canada Princeton, NJ, USA Palo Alto, CA, USA Abstract-Datacenters need high-bandwidth interconnection report that 90% of the flows carry less than 1MB of data fabrics. Several researchers have proposed highly-redundant and more than 90% of bytes transferred are in flows greater topologies with multiple paths between pairs of end hosts for than 100MB. Hash-based flow forwarding techniques such as datacenter networks. However, traffic management is necessary to effectively utilize the bisection bandwidth provided by these Equal-Cost Multi-Path (ECMP) routing [19] works well only topologies. This requires timely detection of elephant flows­ for large numbers of small (or mice) flows and no elephant flows that carry large amount of data-and managing those flows. For example, AI-Fares et al. 's Hedera [7] shows that flows. Previously proposed approaches incur high monitoring managing elephant flows effectively can yield as much as overheads, consume significant switch resources, and/or have long 113% higher aggregate throughput compared to ECMP. detection times. We propose, instead, to detect elephant flows at the end hosts. Existing elephant flow detection methods have several lim­ We do this by observing the end hosts's socket buffers, which itations that make them unsuitable for datacenter networks. provide better, more efficient visibility of flow behavior. We These proposals use one of three techniques to identify ele­ present Mahout, a low-overhead yet effective traffic management phants: (1) periodic polling of statistics from switches, (2) system that follows OpenFlow-like central controller approach for streaming techniques like sampling or window-based algo­ network management but augments the design with our novel end host mechanism. Once an elephant flow is detected, an end host rithms, or (3) application-level modifications (full details of signals the network controller using in-band signaling with low each approach are given in Section II). We have not seen overheads. Through analytical evaluation and experiments, we support for Quality of Service (QoS) solutions take hold, demonstrate the benefits of Mahout over previous solutions. which implies that modifying applications is probably an unac­ ceptable solution. We will show that the other two approaches I. INT RODUCT ION fall short in the datacenter setting due to high monitoring Datacenter switching fabrics have enormous bandwidth de­ overheads, significant switch resource consumption, and/or mands due to the recent uptick in bandwidth-intensive appli­ long detection times. cations used by enterprises to manage their exploding data. We assert that the right place for elephant flow detection These applications transfer huge quantities of data between is at the end hosts. In this paper, we describe Mahout, a thousands of servers. For example, Hadoop [18] performs an low-overhead yet effective traffic management system using all-to-all transfer of up to petabytes of files during the shuffle end-host-based elephant detection. We subscribe to the in­ phase of a MapReduce job [15]. 
Further, to better consolidate creasingly popular simple-switchlsmart-controller model (as in employee desktop and other computation needs, enterprises OpenFlow [4]), and so our system is similar to NOX [28] and are leveraging virtualized datacenter frameworks (e.g., using Hedera [7]. VMWare [29] and Xen [10], [30]), where timely migration of Mahout augments this basic design. It has low overhead, as virtual machines requires high throughput network. it monitors and detects elephant flows at the end host via a Designing datacenter networks using redundant topologies shim layer in the OS, rather than monitoring at the switches such as Fat-tree [6], [12], HyperX [5], and Flattened But­ in the network. Mahout does timely management of elephant terfly [22] solves the high-bandwidth requirement. However, flows through an in-band signaling mechanism between the traffic management is necessary to extract the best bisection shim layer at the end hosts and the network controller. At the bandwidth from such topologies [7]. A key challenge is that switches, any flow not signaled as an elephant is routed using the flows come and go too quickly in a data center to compute a static load-balancing scheme (e.g., ECMP). Only elephant a route for each individually; e.g., Kandula et al. report lOOK flows are monitored and managed by the central controller. flow arrivals a second in a 1,500 server cluster [21]. The combination of end host elephant detection and in-band For effective utilization of the datacenter fabric, we need to signaling eliminates the need for per-flow monitoring in the detect elephant flows-flows that transfer significant amount switches, and hence incurs low overhead and requires few of data-and dynamically orchestrate their paths. Datacenter switch resources. measurements [17], [21] show that a large fraction of datacen­ We demonstrate the benefits of Mahout using analytical ter traffic is carried in a small fraction of flows. The authors evaluation and simulations and through experiments on a small *This work was performed while Andrew and Wonho were interns at HP testbed. We have built a Linux prototype for our end host Labs-Palo Alto. elephant flow detection algorithm and tested its effectiveness. 978-1-4244-9921-2/11/$26.00 ©2011 IEEE 1629
  • 2. We have also built a Mahout controller, for setting up switches B. Identifying elephant flows with default entries and for processing the tagged packets from The mix of latency- and throughput-sensitive flows in the the end hosts. Our analytical evaluation shows that Mahout data centers means that effective flow scheduling needs to offers one to two orders of magnitude of reduction in the balance visibility and overhead-a one size fits all approach is number of flows processed by the controller and in switch not sufficient in this setting. To achieve this balance, elephant resource requirements, compared to Hedera-like approaches. flows must be identified so that they are the only flows touched Our simulations show that Mahout can achieve considerable by the controller. The following are the previously considered throughput improvements compared to static load balancing mechanisms for identifying elephants: techniques while incurring an order of magnitude lower over­ head than Hedera. Our prototype experiments show that the • Applications identify their flows as elephants: This solu­ Mahout approach can detect elephant flows at least an order tion accurately and immediately identifies elephant flows. of magnitude sooner than statistics-polling based approaches. This is a common assumption for a plethora of research The key contributions of our work are: 1) a novel end work in network QoS where focus is to give higher host based mechanism for detecting elephant flows, 2) design priority to latency and throughput-sensitive flows such of a centralized datacenter traffic management system that as voice and video applications (see, e.g., [11]). How­ has low overhead yet high effectiveness, and 3) simulation ever, this solution is impractical for traffic management and prototype experiments demonstrating the benefits of the in datacenters as each and every application must be proposed design. modified to support it. If all applications are not modified, an alternative technique will still be needed to identify II. BACKGROUND & RELATED WORK elephant flows initiated by unmodified applications. A related approach is to classify flows based on which A. Datacenter networks and traffic application is initiating them. This classifies flows using The heterogeneous mix of applications running in datacen­ stochastic machine learning techniques [27], or using ters produces flows that are generally sensitive to either latency simple matching based on the packet header fields (such or throughput. Latency-sensitive flows are usually generated as TCP port numbers). While this approach might be by network protocols (such as ARP and DNS) and interactive suitable for enterprise network management, it is un­ applications. They typically transfer up to a few kilobytes. suitable for datacenter network management because of On the other hand, throughput-sensitive flows, created by, the enormous amount of traffic in the datacenter and the e.g., MapReduce, scientific computing, and virtual machine difficulty in obtaining flow traces to train the classification migration, transfer up to gigabytes. This traffic mix implies algorithms. that a datacenter network needs to deliver high bisection • Maintain per-flow statistics: In this approach, each flow bandwidth for throughput-sensitive flows without introducing is monitored at the first switch that the flow goes setup delay on latency-sensitive flows. through. 
These statistics are pulled from switches by Designing datacenter networks using redundant topologies the controller at regular intervals and used to classify such as Fat-tree [6], [12], HyperX [5], or Flattened But­ elephant flows. Hedera [7] and Helios [16] are examples terfly [22] solves the high-bandwidth requirement. However, of systems proposing to use such a mechanism. However, these networks use multiple end-to-end paths to provide this this approach does not scale to large networks. First, high-bandwidth, so they need to load balance traffic across this consumes significant switch resources: a flow table them. Load balancing can be performed with no overhead entry for each flow monitored at a switch. We'll show using oblivious routing, where the path a flow from node i in Section IV that this requires considerable number of to node j is routed on is randomly selected from a probability flow table entries. Second, bandwidth between switches distribution over all i to j paths, but it has been shown to and the controller is limited, so much so that transferring achieve less than half the optimal throughput when the traffic statistics becomes the bottleneck in traffic management mix contains many elephant flows [7]. The other extreme is in datacenter network. As a result, the flow statistics to perform online scheduling by selecting the path for all new cannot be quickly transferred to the controller, resulting flows using a load balancing algorithm, e.g., greedily adding in prolonged sub-par routings. a flow along the path with least congestion. This approach • Sampling: Instead of monitoring each flow in the net­ doesn't scale well-flows arrive too quickly for a single sched­ work, in this approach, a controller samples packets from uler to keep up-and it adds too much setup time to latency­ all ports of the switches using switch sampling features sensitive flows. For example, flow installation using NOX can such as sFlow [3]. Only a small fraction of packets are take up to lOms [28]. Partition-aggregate applications (such sampled (typically, 1 in 1000) at the switches and only as search and other web applications) partition work across headers of the packets are transferred to the controller. multiple machines and then aggregate the responses. Jobs have The controller analyzes the samples and identifies a flow a deadline of 10-1OOms [8], so a lOms flow setup delay can as an elephant after it has seen sufficient number of sam­ consume the entire time budget. Therefore, online scheduling ples from the flow. However, such an approach can not is not suitable for latency-sensitive flows. reliably detect an elephant flow before it has carried more 1630
  • 3. : ,---:----, :::::::::::::::::::::::::::::::::::::::::::::: :::::::::::::::............................. 100 80 in � 60 2] "' >- 40 RACK RACK 20 TCPBuffer -­ SWITCH SWITCH Sent Data -- a a 100 200 300 400 500 600 700 800 r--:::--j -.- END-HOST END-HOST END-HOST END-HOST Time (us) Fig. 2: Amount of data observed in the Tep buffers vs. data observed at the network layer for a flow. simple approach allows the controller to detect elephant flows END-HOST END-HOST END·HOST END·HOST without any switch CPU- and bandwidth-intensive monitoring. The Mahout controller then manages only the elephant flows, Fig. I: Mahout architecture. to maintain a globally optimal arrangement of them. In the following, we describe Mahout's end host shim layer than 10K packets, or roughly 15MB [25]. Additionally, for detecting elephant flows, our in-band signaling method for sampling has high overhead, since the controller must informing controller about the elephant flows, and the Mahout process each sampled packet. network controller. C. OpenFlow A. Detecting Elephant Flows OpenFlow [23] aims to open up traditionally closed de­ An end host based implementation for detecting elephant signs of commercial switches to enable network innovation. flows is better than in-network monitoring/sampling based OpenFlow switches maintain a flow-table where each entry methods, particularly in datacenters, because: (l) The network contains a pattern to match and the actions to perform on a behavior of a flow is affected by how rapidly the end-point packet that matches that entry. OpenFlow defines a protocol for applications are generating data for the flow, and this is not communication between a controller and an OpenFlow switch biased by congestion in the network. In contrast to in-network to add and remove entries from the flow table of the switch monitors, the end host OS has better visibility into the behavior and to query statistics of the flows. of applications. (2) In datacenters, it is possible to augment Upon receiving a packet, if an OpenFlow switch does not the end host OS; this is enabled by the single administrative have an entry in the flow table or TCAM that matches the domain and software uniformity typical of modern datacenters. packet, the switch encapsulates and forwards the packet to the (3) Mahout's elephant detection mechanism has very little controller over a secure connection. The controller responds overhead (it is implemented with two if statements) on com­ back with a flow table entry and the original packet. The switch modity servers. In contrast, using an in-network mechanism then installs the entry into its flow table and forwards the to do fine-grained flow monitoring (e.g., using exact matching packet according to the actions specified in the entry. The flow on OpenFlow's 12-tuple) can be infeasible, even on an edge table entries expire after a set amount of time, typically 60 switch, and even more so on a core switch, especially on seconds. OpenFlow switches maintain statistics for each entry commodity hardware. For example, assume that 32 servers in their flow table. These statistics include a packet counter, are connected, as is typical, to a rack switch. If each server byte counter, and duration. generates 20 new flows per second, with a default flow timeout The OpenFlow 1.0 specification [2] defines matching over period of 60 seconds, an edge-switch needs to maintain and 12 fields of packet header (see the top line in Figure 3). The monitor 38400 flow entries. 
This number is infeasible in any specification defines several actions including forwarding on a of the real switch implementations of OpenFlow that we are single physical port, forwarding on multiple ports, forwarding aware of. to the controller, drop, queue (to a specified queue), and A key idea of the Mahout system is to monitor end host defaulting to traditional switching. To support such flexibility, socket buffers, and thus determine elephant flows before current commercial switch implementations of OpenFlow use and with lower overheads than with in-network monitoring TCAMs for flow table. systems. We demonstrate the rationale for this approach with a micro-benchmark: an ftp transfer of a 50MB file from a host III. OUR SOLUTION: MAHOUT 1 to host 2, connected via two switches all with 1 Gbps links. Mahout's architecture is shown in Figure 1. In Mahout, a In Figure 2, we show the cumulative amount of data shim layer on each end host monitors the flows originating observed on the network, and in the TCP buffer, as time from that host. When this layer detects an elephant flow, it progresses. The time axis starts when the application first marks subsequent packets of that flow using an in-band sig­ provides data to the kernel. From the graph, one can observe naling mechanism. The switches in the network are configured that the application fills the TCP buffers at a rate much higher to forward these marked packets to the Mahout controller. This than the observed network rate. If the threshold for considering 1631
  • 4. Algorithm 1 Pseudocode for end host shim layer Algorithm 1 shows pseudocode for the end host shim layer 1: When sending a packet function that is executed when a TCP packet is being sent. 2: if number of bytes in buffer � thresholdelephant then C. Mahout Controller 3: / * Elephant flow */ 4: if last-tagged-time - nowO � Ttagperiod then At each rack switch, the Mahout controller initially config­ 5: set DS = 00001100 ures two default OpenFlow flow table entries: (i) an entry to 6: last-tagged-time = nowO send a copy of packets with the DSCP bits set to 000011 to the 7: end if controller and (ii) the lowest-priority entry to switch packets 8: end if using NORMAL forwarding action. We set up switches to per­ form ECMP forwarding by default in the NORMAL operation mode. Figure 3 shows the two default entries at the bottom. a flow as an elephant is 100KB (Figure 2. of [17] shows that In this figure, an entry has a higher priority over (is matched more than 85% of flows are less than 100KB), we can see before) entries below that entry. that Mahout's end host shim layer can detect a flow to be When a flow starts, it normally will match the lowest­ an elephant 3x sooner than in-network monitoring. In this priority (NORMAL) rule, so its packet will follow ECMP experiment there were no other active flows on the network. forwarding. When an end host detects a flow as an elephant In further experimental results, presented in Section V, we and marks a packet of that flow. That packet marked with observe an order of magnitude faster detection when there are DSCP 000011 matches the other default rule, and the rack other flows. switch forwards it to the Mahout controller. The controller Mahout uses a shim layer in the end hosts to monitor then computes the best path for this elephant, and installs a the socket buffers. When a socket buffer crosses a chosen flow-specific entry in the rack switch. threshold, the shim layer determines that the flow is an In Figure 3, we show a few example entries for the ele­ elephant. This simple approach ensures that flows that are phant flows. Note that these entries are installed with higher bottlenecked at the application layer and not in the network priority than Mahout's two default rules; hence, the packets layer, irrespective of how long-lived they are or how many corresponding to these elephant flows are switched using the bytes they have transferred, will not be determined as the actions of these flow-specific entries rather than the actions of elephant flows. Such flows need no special management in the default entries. Also, the DS field is set to wildcard for the network. In contrast, if an application is generating data these elephant flow entries, so that once the flow-specific rule for a flow faster than the flow's achieved network throughput, is installed, any tagged packets from the end hosts are not the socket buffer will fill up, and hence Mahout will detect forwarded to the controller. this an an elephant flow that needs management. Once an elephant flow is reported to the Mahout controller, it needs to be placed on the best available path. We define the B. In-band Signaling best path for a flow from s to t as the least congested of all paths from s to t. The least congested s-t path is found by Once Mahout's shim layer has detected an elephant flow, enumerating over all such paths. it needs to signal this to the network controller. 
We do this To manage the elephant flows, Mahout regularly pulls indirectly, by marking the packets in a way that is easily statistics on the elephant flows and link utilizations from the and efficiently detected by OpenFlow switches, and then the switches, and uses these statistics to optimize the elephant switches divert the marked packets to the network controller. flows' routes. This is done with the increasing first fit algo­ To avoid inundating the controller with too many packets of rithm given in Algorithm 2. Correa and Goemans introduced the same flow, the end host shim layer marks the packets of this algorithm and proved that it finds routings that have at an elephant flow only once every Ttagperiod seconds (we use most a 10% higher link utilization than the optimal routing 1 second in our prototype). [13]. While we cannot guarantee this bound because we re­ To mark a packet, we repurpose the Differentiated Services route only the elephant flows, we expect this algorithm to Field (DS Field) [26] in the IPv4 header. This field was perform as well as any other heuristic. originally called the IP Type-of-Service (IPToS) byte. The first 6 bits of the DS Field, called Differentiated Services D. Discussion Code Point (DSCP), define the per-hop behavior of a packet. a) DSCP bits: In Mahout, the end host shim layer uses The current OpenFlow specification [2] allows matching on the DSCP bits of the DS field in IP header for signaling DSCP bits, and most commercial switch implementations elephant flows. However, there may be some datacenters of OpenFlow support this feature in hardware; hence, we where DSCP may be needed for other uses, such as for use the DS Field for signaling between the end host shim prioritization among different types of flows (voice, video, and layer and the network controller. Currently the code point data) or for prioritization among different customers. In such space corresponding to xxxx11 (x denotes a wild-card bit) is scenarios, we plan to use VLAN Priority Code Point (PCP) [1] reserved for experimental or local usage [20], and we leverage bits. OpenFlow supports matching on these bits too. We can this space. When an end host detects an elephant flow, it sets leverage the fact that it is very unlikely for both these code the DSCP bits to 000011 in the packets belonging to that flow. point fields (PCP and DSCP) to be in use simultaneously. 1632
Fig. 3: An example flow table setup at a switch by the Mahout controller.

Algorithm 2 Offline increasing first fit
 1: sort(F); reverse(F)   /* F: set of elephant flows */
 2: for f ∈ F do
 3:   for l ∈ f.path do
 4:     l.load = l.load - f.rate
 5:   end for
 6: end for
 7: for f ∈ F do
 8:   best_paths[f].congest = ∞
 9:   /* P_st: set of all s-t paths */
10:   for path ∈ P_st do
11:     congest = (f.rate + path.load) / path.bandwidth
12:     if congest < best_paths[f].congest then
13:       best_paths[f] = path
14:       best_paths[f].congest = congest
15:     end if
16:   end for
17: end for
18: return best_paths
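A self-contained Python sketch of this greedy re-routing is given below. Two details are our assumptions rather than the paper's listing: the congestion of a candidate path is computed as its worst-link utilization after adding the flow, and a flow's rate is charged to its chosen path so that later placements see it.

```python
def increasing_first_fit(elephants, candidate_paths):
    """Greedily re-route elephant flows onto their least congested paths (cf. Algorithm 2).

    elephants: list of dicts {"rate": ..., "src": ..., "dst": ..., "path": [link, ...]}
    candidate_paths(src, dst): returns all src-dst paths, each a list of link dicts
                               {"load": ..., "bandwidth": ...} shared with the topology.
    """
    # Consider flows in decreasing order of rate (sort, then reverse).
    flows = sorted(elephants, key=lambda f: f["rate"], reverse=True)

    # Remove the elephants' contribution from the current link loads.
    for f in flows:
        for link in f["path"]:
            link["load"] -= f["rate"]

    best_paths = []
    for f in flows:
        best, best_congestion = None, float("inf")
        for path in candidate_paths(f["src"], f["dst"]):
            congestion = max((f["rate"] + l["load"]) / l["bandwidth"] for l in path)
            if congestion < best_congestion:
                best, best_congestion = path, congestion
        for link in best:                  # charge the flow to its new path (our addition)
            link["load"] += f["rate"]
        best_paths.append((f, best))
    return best_paths
```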
b) Virtualized Datacenter: In a virtualized datacenter, a single server will host multiple guest virtual machines, each possibly running a different operating system. In such a scenario, the Mahout shim layer needs to be deployed in each of the guest virtual machines. Note that the host operating system will not have visibility into the socket buffers of a guest virtual machine. However, in cloud computing infrastructures such as Amazon EC2 [9], the infrastructure provider typically makes available a few preconfigured OS versions, which include the paravirtualization drivers needed to work with the provider's hypervisor. Thus, we believe that it is feasible to deploy the Mahout shim layer in virtualized datacenters, too.

c) Elephant flow threshold: Choosing too low a value for threshold_elephant in Algorithm 1 can cause many flows to be recognized as elephants, and hence cause the rack switches to forward too many packets to the controller. When there are many elephant flows, to avoid controller overload, we could provide a means for the controller to signal the end hosts to increase the threshold value. However, this would require an out-of-band control mechanism. An alternative, sketched below, is to use multiple DSCP values to denote different levels of thresholds. For example, xxxx11 can be designated to denote that a flow has more than 100KB of data, xxx111 to denote more than 1MB, xx1111 to denote more than 10MB, and so on. The controller can then change the default entry corresponding to the tagged packets (second from the bottom in Figure 3) to select higher thresholds, based on the load at the controller. Further study is needed to explore these approaches.
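One possible encoding of such graduated thresholds is shown below; the specific thresholds and bit patterns follow the example in the text, but the scheme itself is left to future work, so this is only an illustration.

```python
# Candidate multi-level DSCP encoding (illustrative): more low-order bits set
# means the flow has crossed a larger threshold. All values stay inside the
# experimental/local-use pool xxxx11 [20].
THRESHOLD_LEVELS = [
    (100 * 1024,    0b000011),   # more than 100KB backlogged -> xxxx11
    (1 * 1024**2,   0b000111),   # more than 1MB              -> xxx111
    (10 * 1024**2,  0b001111),   # more than 10MB             -> xx1111
]

def dscp_level(bytes_backlogged):
    """Return the DSCP for the largest threshold this flow has crossed, or None."""
    chosen = None
    for threshold, dscp in THRESHOLD_LEVELS:
        if bytes_backlogged > threshold:
            chosen = dscp
    return chosen
```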
IV. ANALYTICAL EVALUATION

In this section, we analyze the expected overhead of detecting elephant flows with Mahout, with flow sampling, and by maintaining per-flow statistics (e.g., the approach used by Hedera). We set up an analytical framework to evaluate the number of switch table entries and control messages used by each method. We evaluate each method using an example datacenter, and show that Mahout is the only solution that can scale to support large datacenters.

Flow sampling identifies elephants by sampling an expected 1 out of k packets. Once it has seen enough packets from the same flow, the flow is classified as an elephant. The number of packets needed to classify an elephant does not affect our analysis in this section, so we ignore it for now.

Hedera [7] uses periodic polling for elephant flow detection. Every t seconds, the Hedera controller pulls the per-flow statistics from each switch. In order to estimate the true rate of a flow (i.e., the rate of the flow if it is constrained only by its endpoints' NICs and not by any link in the network), the statistics for every flow in the network must be collected. Pulling statistics for all flows using OpenFlow requires setting up a flow table entry for every flow, so each flow must be sent to the controller before it can be started; we include this cost in our analysis.

We consider a million-server network for the following analysis. Our notation and the assumed values are shown in Table I.

TABLE I: Parameters and typical values for the analytical evaluation
  N         Num. of end hosts                           2^20 (1M)
  T         Num. of end hosts per rack switch           32
  S         Num. of rack switches                       2^15 (32K)
  F         Avg. new flows per second per end host      20 [28]
  D         Avg. duration of a flow in the flow table   60 seconds
  c         Size of counters in bytes                   24 [2]
  r_stat    Rate of gathering statistics                1 per second
  p         Num. of bytes in a packet                   1500
  f_m       Fraction of mice                            0.99
  f_e       Fraction of elephants                       0.01
  r_sample  Rate of sampling                            1-in-1000
  h_sample  Size of a packet sample (bytes)             60

Hedera [7]: As table entries need to be maintained for all flows, the number of flow table entries needed at each rack switch is T·F·D. In our example, this translates to 32·20·60 = 38,400 entries at each rack switch. We are not aware of any existing switch with OpenFlow support that can hold this many entries in its hardware flow table; for example, HP ProCurve 5400zl switches support up to 1.7K OpenFlow entries per linecard. It is unlikely that any switch in the near future will support so many table entries, given the expense of high-speed memory.

The Hedera controller needs to handle N·F flow setups per second, or more than 20 million requests per second in our example. A single NOX controller can handle only 30,000 requests per second; hence one needs 667 controllers just to handle the flow setup load [28], assuming that the load can be perfectly distributed. Flow scheduling, however, does not seem to be a simple task to distribute.

The rate at which the controller needs to process the statistics packets is (c·T·F·D / p)·S·r_stat. In our example, this implies (24·38,400)/1500 · 2^15 ≈ 20 million control packets per second. Assuming that a NOX controller can handle these packets at the rate it can handle flow setup requests (30,000 per second), this translates to needing 670 controllers just to process these packets. Or, if we consider only one controller, then the statistics can be gathered only once every 670 seconds (roughly 11 minutes).
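The arithmetic behind these Hedera-style estimates is easy to reproduce; the short script below simply re-evaluates the expressions above with the Table I values (the 30,000 requests per second figure for a NOX controller is from [28]).

```python
# Parameters from Table I.
N = 2**20          # end hosts
T = 32             # end hosts per rack switch
S = 2**15          # rack switches
F = 20             # new flows per second per end host
D = 60             # average lifetime of a flow table entry, seconds
c = 24             # bytes of counters per flow table entry
p = 1500           # bytes per packet
r_stat = 1         # statistics pulls per second
NOX_RATE = 30_000  # requests per second handled by one NOX controller [28]

entries_per_rack = T * F * D                          # 38,400 flow table entries
flow_setups_per_s = N * F                             # ~21M per second ("more than 20 million")
setup_controllers = flow_setups_per_s / NOX_RATE      # ~700 (667 with the paper's 20M rounding)

stats_pkts_per_s = (c * entries_per_rack / p) * S * r_stat   # ~20M control packets per second
stats_controllers = stats_pkts_per_s / NOX_RATE              # ~670, or one pull every ~670 s

print(entries_per_rack, flow_setups_per_s, round(stats_pkts_per_s))
```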
Sampling: Sampling incurs the messaging overhead of taking samples, and then installs flow table entries when an elephant is detected. The rate at which the controller needs to process the sampled packets is throughput · r_sample · (h_sample / p). We assume that each sample contains only a 60 byte header and that headers can be combined into 1500 byte packets, so there are 25 samples per message to the controller. The aggregate throughput of a datacenter network changes frequently, but if 10% of the hosts are sending traffic, the aggregate throughput (in Gbps) is 0.10·N. We then find the messaging overhead of sampling to be around 550K messages per second; if we bundle samples into packets (i.e., 25 samples fit in a 1500 byte packet), this drops to 22K messages per second.

At first blush, this messaging overhead does not seem like too much; however, as the network utilization increases, the messaging overhead can reach 3.75 million (or 150K if there are 25 samples per packet) packets per second. Therefore, sampling incurs the highest overhead when load balancing is most needed. Decreasing the sampling rate reduces this overhead but adversely impacts the effectiveness of flow scheduling, since not all elephants are detected. We expect the number of elephants identified by sampling to be similar to Mahout, so we do not analyze the flow table entry overhead of sampling separately.

Mahout: Because elephant flow detection is done at the end host, switches contain flow table entries for elephant flows only. Also, statistics are gathered only for the elephant flows. So, the number of flow entries per rack switch in Mahout is T·F·D·f_e = 384 entries. The number of flow setups that the Mahout controller needs to handle is N·F·f_e, which is about 200K requests per second and needs 7 controllers. Also, the number of packets per second that need to be processed for gathering statistics is an f_e fraction of the corresponding number for Hedera. Thus 7 controllers are needed for gathering statistics at the rate of once per second, or the statistics can be gathered by a single controller at the rate of once every 7 seconds.
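Scaling the previous script by the elephant fraction reproduces the Mahout-side numbers quoted above; again this is only a sketch using the Table I values.

```python
N, T, S, F, D = 2**20, 32, 2**15, 20, 60
c, p, r_stat = 24, 1500, 1
f_e = 0.01         # fraction of flows that are elephants (Table I)
NOX_RATE = 30_000  # requests per second per NOX controller [28]

entries_per_rack = T * F * D * f_e                 # 384 elephant entries per rack switch
flow_setups_per_s = N * F * f_e                    # ~210K, i.e. about 200K requests per second
setup_controllers = flow_setups_per_s / NOX_RATE   # ~7 controllers

stats_pkts_per_s = (c * T * F * D / p) * S * r_stat * f_e   # an f_e fraction of Hedera's load
stats_controllers = stats_pkts_per_s / NOX_RATE             # ~7 at 1 pull/s, or 1 pull every ~7 s

print(entries_per_rack, round(flow_setups_per_s), round(stats_pkts_per_s))
```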
V. EXPERIMENTS

A. Simulations

Our goal is to compare the performance and overheads of Mahout against the competing approaches described in the previous section. To do so, we implemented a flow-level, event-based simulator that can scale to a few thousand end hosts connected using a Clos topology [12]. We now describe this simulator and our evaluation of Mahout with it.

1) Methodology: We simulate a datacenter network by modeling the behavior of flows. The network topology is modeled as a capacitated, directed graph and forms a three-level Clos topology. All simulations here are of a 1,600 server datacenter network, and they use a network with rack-to-aggregation and aggregation-to-core links that are 1:5 oversubscribed, i.e., the network has 320Gbps of bisection bandwidth. All servers have 1Gbps NICs and links have 1Gbps capacity. Our simulation is event-based, so there is no discrete clock; instead, the timing of events is accurate to floating-point precision. Input to the simulator is a file listing the start time, bytes, and endpoints of a set of flows (our workloads are described below). When a flow starts or completes, the rate of each flow is recomputed.

We model the OpenFlow protocol only by accounting for the delay when a switch sets up a flow table entry for a flow. When this occurs, the switch sends the flow to the OpenFlow controller by placing it in its OpenFlow queue. This queue has 10Mbps of bandwidth (a number measured from an OpenFlow switch [24]) and infinite capacity, so our model optimistically estimates the delay between a switch and the OpenFlow controller: a real system drops arriving packets if one of these queues is full, resulting in TCP timeouts. Moreover, we assume that there is no other overhead when setting up a flow, so the OpenFlow controller deals with the flow and installs flow table entries instantly.

We simulate three different schedulers: (1) an offline scheduler that periodically pulls flow statistics from the switches, (2) a scheduler that behaves like the Mahout scheduler but uses sampling to detect elephant flows, and (3) the Mahout scheduler as described in Sec. III-C.

The stat-pulling controller behaves like Hedera [7] and Helios [16]. Here, the controller pulls flow statistics from each switch at regular intervals. The statistics from a flow table entry are 24 bytes, so the amount of time to transfer the statistics from a switch to the controller is proportional to the number of flow table entries at the switch. When transferring statistics, we assume that the CPU-to-controller rate is the bottleneck, not the network or the OpenFlow controller itself. Once the controller has statistics for all flows, it computes a new routing for elephant flows and reassigns paths instantly. In practice, computing this routing and inserting updated flow table entries into the switches will take up to hundreds of milliseconds. We allow this to be done instantaneously to find the theoretical best achievable results for an offline approach. The global re-routing of flows is computed using the increasing first fit algorithm described in Algorithm 2. This algorithm is simpler than the simulated annealing employed by Hedera; however, we expect the results to be similar, since this heuristic is likely to be as good as any other (as discussed in Sec. III-C).

As we are doing flow-level simulations, sampling packets is not straightforward, since there are no packets to sample from. Instead, we sample from flows by determining the amount of time it will take for k packets to traverse a link, given its rate, and then sample from the flows on the link by weighting each flow by its rate. Full details are in [14].
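A minimal sketch of this flow-level emulation of 1-in-k packet sampling (the function and parameter names are ours, not the simulator's) is:

```python
import random

def next_sample(flows_on_link, k, packet_bytes=1500):
    """Emulate 1-in-k packet sampling on a link when only flow rates are known.

    flows_on_link: list of (flow_id, rate_in_bytes_per_second).
    Returns (seconds_until_next_sample, sampled_flow_id), or None for an idle link.
    The delay is the time for k packets to cross the link at its aggregate rate;
    the sampled packet belongs to a flow with probability proportional to the
    flow's share of that rate.
    """
    total_rate = sum(rate for _, rate in flows_on_link)
    if total_rate <= 0:
        return None
    delay = k * packet_bytes / total_rate
    ids = [fid for fid, _ in flows_on_link]
    weights = [rate for _, rate in flows_on_link]
    sampled_id = random.choices(ids, weights=weights, k=1)[0]
    return delay, sampled_id
```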
a) Workloads: We simulate background traffic modeled on recent measurements [21] and add traffic modeled on MapReduce to stress the network. We assume that the MapReduce job has just gone into its shuffle phase. In this phase, each end host transfers 128MB to each other host. Each end host opens a connection to at most five other end hosts simultaneously (as done by default in Hadoop's implementation of MapReduce). Once one of these connections completes, the host opens a connection to another end host, repeating this until it has transferred its 128MB file to each other end host. The order of these outgoing connections is randomized for each end host. For all experiments here, we used 250 randomly selected end hosts in the shuffle load. The reduce phase shuffle begins three minutes after the background traffic is started, to allow the background traffic to reach a steady state, and the measurements shown here are taken for five minutes after the reduce phase began.

We added background traffic following the macroscopic flow measurements collected by Kandula et al. [17], [21] to the traffic mix, because datacenters run a heterogeneous mix of services simultaneously. They give the fraction of correspondents a server has within its rack and outside of its rack over a ten second interval. We follow this distribution to decide how many inter- and intra-rack flows a server starts over ten seconds; however, they do not give a more detailed breakdown of flow destinations than this, so we assume that the selection of a destination host is uniformly random across the source server's rack or the remaining racks for an intra- or inter-rack flow, respectively. We select the number of bytes in a flow following the distribution of flow sizes in their measurements as well. Before starting the shuffle job, we simulate this background traffic for three minutes. The simulation ends whenever the last shuffle job flow completes.
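For illustration, the sketch below generates one sender's shuffle schedule under the rules just described; the names are ours, and the simulator (not shown) is what keeps at most five of these transfers active at a time.

```python
import random

TRANSFER_BYTES = 128 * 2**20   # 128MB to every other host in the shuffle phase
MAX_PARALLEL = 5               # Hadoop's default number of parallel copies

def shuffle_plan(sender, hosts, seed=None):
    """Return one sender's shuffle schedule: receivers in random order plus the window size.

    The simulator starts transfers to the first MAX_PARALLEL receivers and, whenever
    one completes, immediately starts the next receiver in the list, until every
    receiver has been sent TRANSFER_BYTES.
    """
    rng = random.Random(seed)
    receivers = [h for h in hosts if h != sender]
    rng.shuffle(receivers)
    return [(sender, r, TRANSFER_BYTES) for r in receivers], MAX_PARALLEL
```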
b) Metrics: To measure the performance of each scheduler, we tracked the aggregate throughput of all flows; this is the amount of bisection bandwidth the scheduler is able to extract from the network. We measure overhead as in Section IV, i.e., by counting the number of control messages and the number of flow table entries at each switch. All numbers shown here are averaged over ten runs.

2) Results: The per-second aggregate throughput for the various scheduling methods is shown in Figure 4. We compare these schedulers to static load balancing with equal-cost multipath (ECMP), which uniformly randomizes the outgoing flows across a set of ports [7]. We used three different elephant thresholds for Mahout: 128KB, 1MB, and 100MB; flows carrying at least this threshold of bytes were classified as an elephant after sending 2, 20, or 2000 packets, respectively. As expected, controlling elephant flows extracts more bisection bandwidth from the network: Mahout extracts 16% more bisection bandwidth than ECMP, and the other schedulers obtain similar results depending on their parameters.

Fig. 4: Throughput results for the schedulers with various parameters. Error bars on all charts show 95% confidence intervals.

Hedera's results found that flow scheduling gives a much larger improvement over ECMP than our results show (up to 113% on some workloads) [7]. This is due to the differences in workloads. Our workload is based on measurements [21], whereas their workloads are synthetic. We have repeated our simulations using some of their workloads and find similar results: the schedulers improve throughput by more than 100% compared to ECMP on their workloads.

We examine the overhead versus performance tradeoff by counting the maximum number of flow table entries per rack switch and the number of messages to the controller. These results are shown in Figures 5 and 6. Mahout has the least overhead of any scheduling approach considered. Pulling statistics requires too many flow table entries per switch and sends too many packets to the controller to scale to large datacenters; here, the stat-pulling scheduler used nearly 800 flow table entries per rack switch on average, no matter how frequently the statistics were pulled. This is more than seven times the number of entries used by the sampling and Mahout controllers, and makes the offline scheduler infeasible in larger datacenters because the flow tables will not be able to support such a large number of entries. Also, when pulling stats every 1 second, the controller receives 10x more messages than when using Mahout with an elephant threshold of 100MB.

Fig. 5: Number of packets sent to the controller by the various schedulers. Here, we bundled samples together into a single packet (there are 25 samples per packet); each bundle of samples counts as a single controller message.

Fig. 6: Average and maximum number of flow table entries at each switch used by the schedulers.

These simulations indicate that, for our workload, the value of threshold_elephant affects the overhead of the Mahout controller but does not have much of an impact on performance (up to a point: when we set this threshold to 1GB, not shown on the charts, the Mahout scheduler performed no better than ECMP). The number of packets to the Mahout controller goes from 328 per second when the elephant threshold is 128KB to 214 per second when the threshold is 100MB, indicating that tuning it can reduce controller overhead by more than 50% without affecting the scheduler's performance. Even so, we suggest making this threshold as small as possible to save memory at the end hosts and for quicker elephant flow detection (see the experiments on our prototype in the next section). We believe a threshold of 200-500KB is best for most workloads.

B. Prototype & Microbenchmarks

We have implemented a prototype of the Mahout system. The shim layer is implemented as a kernel module inserted between the TCP/IP stack and the device driver, and the controller is built upon NOX [28], an open-source OpenFlow controller written in Python. For the shim layer, we created a function for the pseudocode shown in Algorithm 1 and invoke it for outgoing packets, after the IP header is created in the networking stack. Implementing it as a separate kernel module improves deployability, because it can be installed without modifying or upgrading the Linux kernel. Our controller leverages the NOX platform to learn the topology and configure switches with the default entries. It also processes the packets marked by the shim layer and installs entries for the elephant flows.

Fig. 8: Testbed for prototype experiments.

Our testbed for experimenting with the different components of the prototype is shown in Figure 8. Switches 1 and 2 in the figure are HP ProCurve 5400zl switches running firmware with OpenFlow support. We have two end hosts in the testbed, one acting as the server for the flows and the other acting as the client. For some experiments, we also run background flows to emulate other network traffic.

Since this is not a large-scale testbed, we perform microbenchmark experiments focusing on the timeliness of elephant flow detection, and compare Mahout against a Hedera-like polling approach. We first present measurements of the time it takes to detect an elephant flow at the end host, and then present the overall time it takes for the controller to detect an elephant flow. Our workload consists of a file transfer using ftp. For experiments in the presence of background flows, we run 10 simultaneous iperf connections.
a) End host elephant flow detection time: In this experiment, we ftp a 50MB file from Host-1 to Host-2. We track the number of bytes in the socket buffer for that flow and the number of bytes transferred on the network, along with the timestamps. We did 30 trials of this experiment. Figure 2 shows a single run. In Figure 7, we show the time it takes before a flow can be classified as an elephant based on information from the buffer utilization versus based on the number of bytes sent on the network. Here we consider different thresholds for classifying a flow as an elephant, and we present both cases, with and without background flows. It is clear from these results that Mahout's approach of monitoring the TCP buffers can significantly speed up elephant flow detection at the end hosts (by more than an order of magnitude in some cases) and is also not affected by congestion in the network.

Fig. 7: For each threshold (bytes), the time taken for the TCP buffer to fill versus the time taken for that many bytes to appear on the network, with (a) no background flows and (b) 10 background TCP flows. Error bars show 95% confidence intervals.
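A user-space approximation of the quantity the shim observes is the socket's unsent backlog; on Linux this can be read with the SIOCOUTQ/TIOCOUTQ ioctl, as in the sketch below (the threshold constant is an assumption, and the prototype instead reads the buffer inside the kernel).

```python
import fcntl
import socket
import struct
import termios

THRESHOLD_ELEPHANT = 100 * 1024   # assumed threshold, matching the smallest case in Figure 7

def unsent_bytes(sock: socket.socket) -> int:
    """Bytes queued in the TCP socket's send buffer that the kernel has not yet sent.

    Uses the Linux SIOCOUTQ/TIOCOUTQ ioctl; a user-space stand-in for the
    in-kernel view that Mahout's shim layer has of the socket buffer.
    """
    raw = fcntl.ioctl(sock.fileno(), termios.TIOCOUTQ, struct.pack("i", 0))
    return struct.unpack("i", raw)[0]

def looks_like_elephant(sock: socket.socket) -> bool:
    return unsent_bytes(sock) >= THRESHOLD_ELEPHANT
```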
b) Elephant flow detection time at the controller: In this experiment, we measure how long it takes for an elephant flow to be detected at the controller using the Mahout approach versus a Hedera-like periodic polling approach. To be fair to the polling approach, we performed the periodic polling at the fastest rate possible (polling in a loop without any wait periods in between). As can be seen from Table II, the Mahout controller can detect an elephant flow in a few milliseconds. In contrast, Hedera takes 189.83ms before the flow can be detected as an elephant, irrespective of the threshold. All times are the same due to the switch's overheads in collecting statistics and relaying them to the central controller.

TABLE II: Time it takes to detect an elephant flow at the Mahout controller vs. the Hedera controller, with no other active flows.

Overall, our working prototype demonstrates the deployment feasibility of Mahout. The experiments show an order of magnitude difference in the elephant flow detection times at the controller in Mahout vs. a competing approach.

VI. CONCLUSION

Previous research in datacenter network management has shown that elephant flows (flows that carry large amounts of data) need to be detected and managed for better utilization of multi-path topologies. However, previous approaches for elephant flow detection are based on monitoring the behavior of flows in the network, and hence incur long detection times, high switch resource usage, and/or high control bandwidth and processing overhead. In contrast, we propose a novel end-host-based solution that monitors the socket buffers to detect elephant flows and signals the network controller using an in-band mechanism. We present Mahout, a low-overhead yet effective traffic management system based on this idea. Our experimental results show that our system can detect elephant flows an order of magnitude sooner than polling-based approaches while incurring an order of magnitude lower controller overhead than other approaches.

ACKNOWLEDGEMENTS

We sincerely thank Sujata Banerjee, Jeff Mogul, Puneet Sharma, and Jean Tourrilhes for several beneficial discussions and comments on earlier drafts of this paper.

REFERENCES

[1] IEEE Std. 802.1Q-2005, Virtual Bridged Local Area Networks.
[2] OpenFlow Switch Specification, Version 1.0.0. http://www.openflowswitch.org/documents/openflow-spec-v1.0.0.pdf.
[3] sFlow. http://www.sflow.org/.
[4] The OpenFlow Switch Consortium. http://www.openflowswitch.org/.
[5] J. H. Ahn, N. Binkert, A. Davis, M. McLaren, and R. S. Schreiber. HyperX: topology, routing, and packaging of efficient large-scale networks. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis (SC '09), 2009.
[6] M. Al-Fares, A. Loukissas, and A. Vahdat. A scalable, commodity data center network architecture. In SIGCOMM, 2008.
[7] M. Al-Fares, S. Radhakrishnan, B. Raghavan, N. Huang, and A. Vahdat. Hedera: Dynamic flow scheduling for data center networks. In NSDI, 2010.
[8] M. Alizadeh, A. Greenberg, D. A. Maltz, J. Padhye, P. Patel, B. Prabhakar, S. Sengupta, and M. Sridharan. DCTCP: Efficient packet transport for the commoditized data center. In SIGCOMM, 2010.
[9] Amazon EC2. http://aws.amazon.com/ec2/.
[10] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield. Xen and the art of virtualization. In SOSP, pages 164-177, 2003.
[11] R. Braden, D. Clark, and S. Shenker. Integrated services in the Internet architecture: an overview. Technical report, IETF, Network Working Group, June 1994.
[12] C. Clos. A study of non-blocking switching networks. Bell System Technical Journal, 32(5):406-424, 1953.
[13] J. R. Correa and M. X. Goemans. Improved bounds on nonblocking 3-stage Clos networks. SIAM J. Comput., 37(3):870-894, 2007.
[14] A. R. Curtis, W. Kim, and P. Yalagandula. Mahout: Low-Overhead Datacenter Traffic Management using End-Host-Based Elephant Detection. Technical Report HPL-2010-91, HP Labs, 2010.
[15] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, 2004.
[16] N. Farrington, G. Porter, S. Radhakrishnan, H. H. Bazzaz, V. Subramanya, Y. Fainman, G. Papen, and A. Vahdat. Helios: A hybrid electrical/optical switch architecture for modular data centers. In SIGCOMM, 2010.
[17] A. Greenberg, J. R. Hamilton, N. Jain, S. Kandula, C. Kim, P. Lahiri, D. Maltz, P. Patel, and S. Sengupta. VL2: a scalable and flexible data center network. In SIGCOMM, 2009.
[18] Hadoop MapReduce. http://hadoop.apache.org/mapreduce/.
[19] C. Hopps. Analysis of an Equal-Cost Multi-Path Algorithm. RFC 2992 (Informational), Nov. 2000.
[20] IANA DSCP registry. http://www.iana.org/assignments/dscp-registry.
[21] S. Kandula, S. Sengupta, A. Greenberg, and P. Patel. The nature of datacenter traffic: Measurements & analysis. In IMC, 2009.
[22] J. Kim and W. J. Dally. Flattened butterfly: A cost-efficient topology for high-radix networks. In ISCA, 2007.
[23] N. McKeown, T. Anderson, H. Balakrishnan, G. Parulkar, L. Peterson, J. Rexford, S. Shenker, and J. Turner. OpenFlow: Enabling innovation in campus networks. ACM CCR, 2008.
[24] J. C. Mogul, J. Tourrilhes, P. Yalagandula, P. Sharma, A. R. Curtis, and S. Banerjee. DevoFlow: Cost-effective flow management for high performance enterprise networks. In HotNets, 2010.
[25] T. Mori, M. Uchida, R. Kawahara, J. Pan, and S. Goto. Identifying elephant flows through periodically sampled packets. In IMC, pages 115-120, Taormina, Oct. 2004.
[26] K. Nichols, S. Blake, F. Baker, and D. Black. Definition of the Differentiated Services Field (DS Field) in the IPv4 and IPv6 Headers. RFC 2474 (Proposed Standard), Dec. 1998.
[27] M. Roughan, S. Sen, O. Spatscheck, and N. Duffield. Class-of-service mapping for QoS: A statistical signature-based approach to IP traffic classification. In IMC, pages 135-148, 2004.
[28] A. Tavakoli, M. Casado, T. Koponen, and S. Shenker. Applying NOX to the datacenter. In HotNets-VIII, 2009.
[29] VMware. http://www.vmware.com.
[30] Xen. http://www.xen.org.