Cross-Platform Estimation of Network Function
Performance
Amedeo Sapio
Department of Control and
Computer Engineering
Politecnico di Torino
Torino, Italy
amedeo.sapio@polito.it
Mario Baldi
Department of Control and
Computer Engineering
Politecnico di Torino
Torino, Italy
mario.baldi@polito.it
Gergely Pongrácz
TrafficLab
Ericsson Research
Budapest, Hungary
gergely.pongracz@ericsson.com
Abstract—This work shows how the performance of a network
function can be estimated with an error margin that is small
enough to properly support orchestration of network functions
virtualization (NFV) platforms. Being able to estimate the per-
formance of a virtualized network function (VNF) on execution
hardware of various types enables its optimal placement, while
efficiently utilizing available resources. Network functions are
modeled using a methodology focused on the identification of
recurring execution patterns and aimed at providing a platform
independent representation. By mapping the model on specific
hardware, the performance of the network function can be
estimated in terms of maximum throughput that the network
function can achieve on the specific execution platform. The
approach is such that once the basic modeling building blocks
have been mapped, the estimate can be computed automatically.
This work presents the model of an Ethernet switch and evaluates
its accuracy by comparing the performance estimation it provides
with experimental results.
Keywords—Network Functions Virtualization; Virtual Network
Function; modeling; orchestration; performance estimation;
I. INTRODUCTION
For a few years now, software network appliances have been
increasingly deployed. Initially, their appeal stemmed from
their lower cost, shorter time-to-market, and ease of upgrade
when compared to purposely designed hardware devices. These
features are particularly advantageous in the case of appliances,
a.k.a. middleboxes, operating on relatively recent higher layer
protocols that are usually more complex and are possibly still
evolving. Then, with the overwhelming success and diffusion
of cloud computing and virtualization, software appliances
became natural means to ensure that network functionalities
had the same flexibility and mobility as the virtual machines
(VMs) they offer services to. Hence, value started to be seen
in the software implementation also of less complex, more
stable network functionalities. This trend led to embracing
Software Defined Networking (SDN) and Network Functions
Virtualization (NFV): the former as a hybrid hardware/software
approach to ensure high performance for lower layer packet
forwarding while retaining a high degree of flexibility and
programmability; the latter as a virtualization solution target-
ing the execution of software network functions in isolated
Virtual Machines (VMs) sharing a pool of hosts, rather than
on dedicated hardware (i.e., appliances). Such a solution en-
ables virtual network appliances (i.e., VMs executing network
functions) to be provisioned, allocated a different amount of
resources, and possibly moved across data centers in little time,
which is key in ensuring that the network can keep up with
the flexibility in the provisioning and deployment of virtual
hosts in today’s virtualized data centers. Additional flexibility
is offered when coupling NFV with SDN as network traffic can
be steered through a chain of Virtualized Network Functions
(VNFs) in order to provide aggregated services. With input
from the industry, the NFV approach was standardized by
the European Telecommunications Standards Institute (ETSI)
in 2013 [1].
The flexibility provided by NFV requires the ability to
effectively assign compute nodes to VNFs and allocate the
most appropriate amount of resources, such as CPU quota,
RAM, and virtual interfaces. In the ETSI standard the component
in charge of taking such decisions is called orchestrator and it
can also dynamically modify the amount of resources assigned
to a running VNF when needed. The orchestrator can also
request the migration of a VNF when the current compute node
executing it is no longer capable of fulfilling the VNF perfor-
mance requirements. These tasks require the orchestrator to
be able to estimate the performance of VNFs according to the
amount of resources they can use. Such estimation must take
into account the nature of the traffic manipulation performed
by the VNF at hand, some specifics of its implementation, and
the expected amount of traffic it operates on. A good estimation
is key to ensuring higher resource usage efficiency and avoiding
adjustments at runtime.
This work presents and evaluates the model of an Ethernet
switch based on a unified modeling approach [2] applicable
to any VNF, independently of the platform it is running on.
By mapping the VNF model to specific hardware, it is
possible to predict the maximum amount of traffic that the
VNF can sustain. In this work, the model is mapped to a
sample hardware platform and the predicted performance is
compared with the actual measurements.
The adopted modeling approach [2] is particularly valu-
able because it relies on a description of VNFs in terms of
basic operations, which results in a hardware independent
notation that ensures that the model is valid for any execution
platform. In addition, the mapping of the model on a target
hardware architecture (required in order to determine the actual
performance) can be automated, hence making it easy to apply
the approach to each available hardware platform and choose
the most suitable one for execution.
After discussing related work in Section II, the modeling
approach is described in Section III. Section IV presents
the modeling of an Ethernet switch and the mapping of
the model to a general purpose hardware architecture. In
order to validate the accuracy of the approach, Section V
compares the performance estimated through the model with
actual measurements obtained by running targeted experiments
with a software implementation of the Ethernet switch on the
considered hardware platform.
II. RELATED WORK
This work applies to an Ethernet switch the network function
modeling approach proposed in [2], providing experimental
measurements to validate the obtained model.
The modeling approach was inspired by [3], which aims
to demonstrate that the Software Defined Networking approach
does not necessarily imply lower performance compared to
purpose-built ASICs. To prove this, the performance of
a software implementation of an Ethernet Provider Backbone
Edge Bridge is evaluated. The execution platform considered
in that work is a hypothetical network processor, for which
a high-level model is provided. The authors do not aim at
providing a universal modeling approach for generic
network functions. Rather, their purpose is to use a specific
sample network function to demonstrate that, even for very
specific tasks, an NPU-based software implementation offers
performance only slightly lower than purpose-designed chips.
A modeling approach for describing packet processing in
middleboxes and the ways they can be deployed is presented
in [4] and applied to a NAT, an L4 load balancer, and an L7
load balancer. The proposed model is not aimed at estimating
performance and resource requirements; rather, it focuses
on accurately describing middlebox functionalities to support
decisions about their deployment.
On the other hand, a VNF modeling approach aimed at
performance estimation would be greatly beneficial to cloud
platforms where the performance of the network infrastructure
is taken into account when placing VMs [5]–[7]. For example,
[7] describes the changes needed in the OpenStack software
platform, the open-source reference cloud management system,
to enable the Nova scheduler to plan VM allocation based on
network property data and a set of constraints provided by the
orchestrator. We argue that in order to infer such constraints,
the orchestrator needs a VNF model like the ones generated
by the approach presented in this paper.
III. METHODOLOGY
The proposed modeling approach is based on the definition
of a set of processing steps, here called Elementary Operations
(EOs), that are common throughout various NF implementa-
tions. This stems from the observation that, generally, most
NFs perform a rather small set of operations when processing
the average packet, namely, a well-defined alteration of packet
headers, coupled with a data structure lookup.
An EO is informally defined as the longest sequence of
elementary steps (e.g., CPU instructions or ASIC transactions)
that is common among the processing tasks of multiple NFs. As a
consequence, an EO has variable granularity, ranging from a
simple I/O or memory load operation to a whole IP checksum
computation. On the other hand, EOs are defined so that each
can be potentially used in multiple NF models.
An NF is modeled as a sequence of EOs that represent the
actions performed for the vast majority of packets. Since we
are interested in performance estimation, we ignore handling
that affects only a small fraction of packets (i.e., less than 1%),
since these tasks have a negligible impact on performance,
even when they are more complex and resource intensive
than the most common ones. Accordingly, exceptions such as
failures, configuration changes, etc., are not considered.
It is important to highlight that NF models produced with
this approach are hardware independent, which ensures that
they can be applied when NFs are deployed on different
execution platforms. In order to estimate the performance of an
NF on a specific hardware platform, each EO must be mapped
on the hardware components involved in its execution and
their features. This mapping makes it possible to take into
account the limits of the involved hardware components and
to gather a set of constraints that affect performance (e.g.,
clock frequency). Moreover, the load incurred by each component
when executing each EO must be estimated, whether through
actual experiments or based on nominal hardware specifica-
tions. The data collected during such mapping are specific to
EOs and the hardware platform, but not to a particular NF.
Hence, they can be applied to estimate the performance of any
NF starting from its model. Specifically, the performance of
each individual EO involved in the NF model is computed and
composed considering the cumulative load that all EOs impose
on the hardware components of the execution platform, while
heeding all of the applicable constraints.
Figure 1 summarizes the steps and intermediate outputs of
the proposed approach.
[Diagram: an NF is expressed as EOs; the EOs are mapped to the HW
architecture, yielding EO performance figures and constraints that,
combined with the NF model, produce the NF performance.]
Fig. 1: NF modeling and performance estimation approach.
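For concreteness, this composition step can be sketched in a few
lines of Python; the sketch below is purely illustrative, with all
names ours (the eo_costs map is the output of the mapping step,
one cost function per EO returning clock cycles and memory accesses):

def estimate_pps(nf_model, eo_costs, cpu_hz, mem_accesses_per_s):
    # nf_model: ordered list of (EO name, parameter dict), as in Table I.
    cycles = sum(eo_costs[eo](**p)[0] for eo, p in nf_model)
    accesses = sum(eo_costs[eo](**p)[1] for eo, p in nf_model)
    bounds = [cpu_hz / cycles]                 # rate allowed by the CPU
    if accesses:
        bounds.append(mem_accesses_per_s / accesses)  # rate allowed by memory
    return min(bounds)                         # the most loaded component wins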
Table I presents a sample list of EOs that we identified
when modeling a number of NFs. Such a list is by no means
meant to be exhaustive; rather, it should be incrementally
extended whenever it turns out that a new NF being considered
cannot be described with previously identified EOs. When
defining an EO, it is important to identify the parameters
related to traffic characteristics that significantly affect the
execution and resource consumption.
TABLE I: Sample list of EOs

 #  EO            Parameters     Description
 1  mem_I/O       L1n, L2n       Packet copy between I/O and (cache) memory
 2  parse         b              Parsing a data field
 3  increase      b              Increase/decrease a field
 4  array_access  es, max        Direct access to a byte array in memory
 5  hash_lookup   N, HE, max, p  Simple hash table lookup
 6  checksum      b              Compute IP checksum
 7  sum           b              Sum 2 operands
A succinct description of the EOs listed in Table I is
provided below.
1) Packet copy between I/O and memory:
A packet is copied from/to an I/O buffer to/from
memory. L1n is the number of bytes that are prefer-
ably stored in L1 cache memory, otherwise in L2
cache or external RAM. L2n bytes are preferably
stored in L2 cache memory, otherwise in external
RAM. The parameters have been chosen taking into
consideration that some NPUs provide a manual
cache that can be explicitly loaded with the data
that need fast access. General purpose CPUs may
have assembler instructions (e.g., PREFETCHh) to
explicitly influence the cache logic.
2) Parsing a data field:
A data field of b bytes stored in memory is parsed.
A parsing operation is necessary before performing
any computation on a field (corresponds to loading
a processor register). This EO can be used also to
model the dual operation, i.e., encapsulation, which
implies storing back into memory a properly con-
structed sequence of fields.
3) Increase/decrease a field:
Increase/decrease the numerical value contained in
a field of b bytes. The field to increase must have
already been parsed.
4) Direct access to a byte array in memory:
This EO performs a direct access to an element of
an array in memory using an index. Each array entry
has size es, while the array has at most max entries.
5) Simple hash table lookup:
A simple lookup in a direct, XOR based hash table is
performed. The hash key consists of N components
and each entry has size equal to HE. The table has
at most max entries. The collision probability is p.
6) Compute IP checksum:
The standard IP checksum computation is performed
on b bytes.
7) Sum 2 operands:
Two operands of b bytes are added.
For the sake of simplicity (and without affecting the
validity of the approach, as shown by the results in Section V),
in modeling NFs by means of EOs, we assume that the number
of processor registers is larger than the number of packet fields
that must be processed simultaneously. Therefore, there is no
contention for this resource.
IV. A MODELING USE CASE
This section demonstrates the application of the modeling
approach described in the previous section. EOs are used to
describe the operation of an Ethernet switch and then they are
mapped to a general purpose hardware platform.
A. Ethernet Switch Model
For each packet the switch selects the output interface
where it must be forwarded, retrieving it from a hash table
keyed by the destination MAC address extracted from the
packet.
When the network interface receives a packet, the packet is first
stored in an I/O buffer. In order to access the Ethernet header,
the CPU/NPU must first copy the packet into cache or main
memory. Since the switch operates only on the Ethernet header,
which is of limited size (14 bytes), the header is copied into the
L1 cache, while the rest of the packet (up to 1486 bytes) can be
copied into L2 cache or main memory. To ensure generality, we
consider that an incoming packet cannot be copied directly from
one I/O buffer to another; instead, it must first be copied into
(cache) memory in any case.
The switch must then read the destination MAC address
(6 bytes) prior to using it to access the hash table to get the
appropriate output interface. The hash table has one key (the
destination MAC) and consists of 12 byte entries composed
of the key and the output interface MAC address.
Here we consider that the output interface is identified
by its Ethernet address. Different implementations can use a
different identifier, which leads to a minor variation in the
model.
The average number of entries in a real case scenario is
≈ 2M (i.e., roughly 24 MB of table data), which rules out
storing the whole table in cache under any traffic conditions.
Here we assume
that the collision probability is negligible (i.e., the hash table
is sufficiently sparse).
The packet can then be moved to the buffer of the selected
output I/O device. The resulting model is summarized in
Figure 2.
mem_I/O(14, 1486)
parse(6)
hash_lookup(1, 12, 2M, 0)
mem_I/O(14, 1486)
Fig. 2: Ethernet switch model.
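In this representation, the model of Figure 2 reduces to an
ordered list of EO invocations; a minimal sketch, with parameter
names taken from Table I and the list structure itself being
purely illustrative:

switch_model = [
    ("mem_I/O",     {"L1n": 14, "L2n": 1486}),  # NIC buffer -> (cache) memory
    ("parse",       {"b": 6}),                  # destination MAC address
    ("hash_lookup", {"N": 1, "HE": 12, "max": 2000000, "p": 0}),
    ("mem_I/O",     {"L1n": 14, "L2n": 1486}),  # (cache) memory -> NIC buffer
]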
B. Mapping to Hardware
We now proceed to map the described EOs to a specific
hardware platform. Figure 3 provides a schematic representa-
tion of the platform's main components and corresponding constraints
using the template proposed in [3]: an Intel® Xeon E5-2630
CPU, a DDR3 RAM module, and a 10 Gb Ethernet controller.
Using the CPU reference manual [8], it is possible to
determine the operations required for the execution of each
EO in Table I and estimate the achievable performance.
Intel Xeon E5-2630:
- x86-64: 6 cores/slot, 2 threads/core, 2.3–2.8 GHz, AVX, VT-d, VT-x + EPT
- L1 cache (per core): i = 32 KB, d = 32 KB
- L2 cache (per core): 256 KB
- L3 cache (per slot): 15 MB
- MCT: 4 channels, DDR3, max 340.8 Gbps
- DDR3: 1333 Mtps, max 85.2 Gbps, CAS latency 9
- I/O: PCIe v3.0, 8 Gtps, 126 Gbps (x16)
- NIC: 2x 10 GbE, 5 Gtps, PCIe v2.0 (x8), max 32 Gbps

Fig. 3: Hardware architecture description.
1. mem_I/O(L1n, L2n)
The CPU L1 and L2 data caches can move one line per
access, i.e., 512 bits (64 bytes), in 4 clock cycles and
12 clock cycles respectively, and their maximum sizes are
32 KB and 256 KB, respectively. Moreover, read and write
operations in I/O buffers require on average 40 clock cycles.
On the whole, the execution of this EO requires:
4 ∗ ⌈min(32KB, L1n) / 64B⌉
+ 12 ∗ ⌈min(256KB, max(0, L1n − 32KB) + L2n) / 64B⌉
+ 40 ∗ ⌈(L1n + L2n) / 64B⌉ clock cycles

and

⌈max(0, max(0, L1n − 32KB) + L2n − 256KB) / 64B⌉

L3 cache or DRAM accesses.
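The formula translates directly into code. A minimal sketch
(sizes in bytes, 64-byte cache lines; the function name is ours)
that also checks the per-copy cost used later in Section IV-C:

from math import ceil

L1, L2, LINE = 32 * 1024, 256 * 1024, 64

def mem_io_cost(l1n, l2n):
    # Bytes that do not fit the L1 budget spill to L2, then to L3/DRAM.
    spill = max(0, l1n - L1) + l2n
    cycles = (4 * ceil(min(L1, l1n) / LINE)        # L1 lines, 4 cycles each
              + 12 * ceil(min(L2, spill) / LINE)   # L2 lines, 12 cycles each
              + 40 * ceil((l1n + l2n) / LINE))     # I/O buffer accesses
    accesses = ceil(max(0, spill - L2) / LINE)     # L3 cache or DRAM accesses
    return cycles, accesses

assert mem_io_cost(14, 1486) == (4 + 288 + 960, 0)   # 1252 cycles per copy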
2. parse(b)
Loading a 64 bit register requires 4 clock cycles if data is
in L1 cache or 12 clock cycles if data is in L2 cache, otherwise
an additional L3 cache or DRAM memory access is required
to retrieve a 64 byte line and store it in L1 or L2 respectively:
4 ∗ ⌈b / 8B⌉ clock cycles {+ ⌈b / 64B⌉ L3 or DRAM accesses}

or

12 ∗ ⌈b / 8B⌉ clock cycles {+ ⌈b / 64B⌉ L3 or DRAM accesses}
3. increase(b)
Whether the processor provides a dedicated increment instruction
or one that adds a constant value to a 64 bit register, this EO
requires 1 clock cycle to complete. Moreover, thanks to pipelining,
up to 3 independent such instructions can be executed during 1
clock cycle:
⌈0.33 ∗ b / 8B⌉ clock cycles
4. array_access(es, max)
Direct array access needs to execute an “ADD” instruction
(1 clock cycle) to compute the index and a “LOAD” instruction
resulting in a direct memory access, plus as many clock
cycles as the number of CPU registers required to load the
selected array element:
1 + ⌈es / 8B⌉ clock cycles

+ ⌈es / 64B⌉ DRAM accesses
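Under the same conventions, the parse, increase and array_access
formulas can be sketched as follows (the in_l1 and cold flags,
which are ours, select the 4- versus 12-cycle load case and whether
cold-cache line fetches are counted):

from math import ceil

def parse_cost(b, in_l1=True, cold=False):
    cycles = (4 if in_l1 else 12) * ceil(b / 8)   # one load per 8-byte register
    return cycles, (ceil(b / 64) if cold else 0)  # line fetches on a miss

def increase_cost(b):
    return ceil(0.33 * b / 8), 0                  # ~3 adds retired per cycle

def array_access_cost(es):
    return 1 + ceil(es / 8), ceil(es / 64)        # ADD + register loads; DRAM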
5. hash_lookup(N, HE, max, p)
We assume that a simple hash lookup is implemented
according to the pseudo-code described in [3] and shown in
Figure 4 for ease of reference.
Register $1-N: key components
Register $HL: hash length
Register $HP: hash array pointer
Register $HE: hash entry size
Register $Z: result
Pseudo code:
# hash key calculation
eor $tmp, $tmp
for i in 1 ... N
eor $tmp, $i
# key is available in $tmp
# calculate hash index from key
udiv $tmp2, $tmp, $HL
mls $tmp2, $tmp2, $HL, $tmp
# index is available in $tmp2
# index -> hash entry pointer
mul $tmp, $tmp2, $HE
add $tmp, $HP
# entry pointer available in $tmp
<prefetch entry to L1 memory>
# pointer to L1 entry -> $tmp2
# hash key check (entry vs. key)
for i in 1 ... N
ldr $Z, [$tmp2], #4
# check keys
cmp $i, $Z
bne collision
# no jump means matching keys
# pointer to data available in $Z
Fig. 4: Hash lookup pseudo-code.
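For readers less familiar with the assembly notation, a Python
rendering of the same lookup is sketched below (the table layout
is an assumption of ours: a list of (key, data) entries, with
collision handling omitted as in Figure 4):

def xor_hash_lookup(key_parts, table):
    h = 0
    for part in key_parts:        # eor $tmp, $i: XOR-fold the key components
        h ^= part
    index = h % len(table)        # udiv + mls: index = key mod table length
    entry = table[index]          # mul + add + prefetch: fetch the entry
    if entry is None or entry[0] != tuple(key_parts):
        return None               # bne collision: mismatching keys
    return entry[1]               # pointer to data available in $Z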
Considering that the hash entry needs to be loaded from
memory to L1 cache, a simple hash lookup would require
approximately:
⌈(4 ∗ N + 106 + 4 ∗ ⌈HE / 8B⌉ + 4 ∗ ⌈HE / 32B⌉) ∗ (1 + p)⌉ clock cycles

and

⌈⌈HE / 64B⌉ ∗ (1 + p)⌉ DRAM accesses.
Otherwise, if the entry is already in the cache, the memory
accesses and cache store operations are not required. Notice
that in order for the whole table to be in cache, its size should
be limited to:
max ∗ HE ≤ 32KB + 256KB = 288KB
So, in the average case, a mix of cache hits and misses will
take place, depending on the specific traffic profile.
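As code, the lookup cost can be sketched as follows (the 106-cycle
constant accounts for the key/index arithmetic and control flow of
Figure 4; the function name and the cached flag are ours):

from math import ceil

def hash_lookup_cost(n, he, p, cached=False):
    # Key folding (4 cycles per component), 106 cycles of fixed
    # arithmetic and control flow, plus loading/storing the HE-byte
    # entry; a fraction p of lookups is repeated after a collision.
    cycles = ceil((4 * n + 106 + 4 * ceil(he / 8) + 4 * ceil(he / 32))
                  * (1 + p))
    accesses = 0 if cached else ceil(ceil(he / 64) * (1 + p))
    return cycles, accesses

assert hash_lookup_cost(1, 12, 0) == (122, 1)   # value used in Section IV-C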
6. checksum(b)
Figure 5 shows sample assembly code that computes a
checksum on an Intel® x86-64 processor. Assuming that the
data on which the checksum is computed is not in L1/L2 cache,
according to the Intel® documentation [8], the execution of
this code requires
7 ∗ ⌈b / 2⌉ + 8 clock cycles

+ ⌈b / 64B⌉ L3 or DRAM accesses
Register ECX: number of bytes b
Register EDX: pointer to the buffer
Register EBX: checksum

CHECKSUM_LOOP:
XOR EAX, EAX            ;EAX=0
MOV AX, WORD PTR [EDX]  ;AX <- next word
ADD EBX, EAX            ;add to checksum
SUB ECX, 2              ;update number of bytes
ADD EDX, 2              ;update buffer
CMP ECX, 1              ;check if ended
JG CHECKSUM_LOOP
MOV EAX, EBX            ;EAX=EBX=checksum
SHR EAX, 16             ;EAX=checksum>>16, i.e., the carry
AND EBX, 0xffff         ;EBX=checksum&0xffff
ADD EAX, EBX            ;EAX=(checksum>>16)+(checksum&0xffff)
MOV EBX, EAX            ;EBX=checksum
SHR EBX, 16             ;EBX=checksum>>16
ADD EAX, EBX            ;checksum+=(checksum>>16)
MOV checksum, EAX       ;checksum=EAX

Fig. 5: Sample Intel® x86 assembly code for checksum computation.
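For reference, the folding logic of Figure 5 corresponds to the
following Python sketch, which mirrors the assembly (including
dropping a trailing odd byte and omitting the final one's
complement of the full IP checksum):

def checksum16(data: bytes) -> int:
    total = 0
    for i in range(0, len(data) - 1, 2):         # MOV AX, WORD PTR [EDX]
        total += int.from_bytes(data[i:i + 2], "little")
    total = (total >> 16) + (total & 0xffff)     # add the carries back in
    total += total >> 16                         # fold a remaining carry
    return total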
7. sum(b)
On the considered architecture, the execution of this EO is
equivalent to the increase(b) EO. Please note that this is
not necessarily the case on every architecture.
TABLE II: Estimates for different packet sizes

Packet size (bytes)    Mpps    Gbps
  64                  12.05    7.91
 128                   8.38    9.69
 256                   5.21   11.24
 512                   2.97   12.34
1024                   1.59   13.01
1500                   1.09   12.95
C. Performance Estimation
Using the above mapping of EOs in the Ethernet switch
model devised in Section IV-A and shown in Figure 2, we
can estimate that forwarding a packet of the maximum size
(1500 bytes) requires:
2630 clock cycles + 1 DRAM access
As a consequence, a single core of an Intel® Xeon E5-2630
operating at 2.8 GHz can process ≈ 1.09 Mpps, while
the DDR3 memory can support 70.16 Mpps. The memory
throughput is estimated considering that each packet requires
a 12 byte memory access to read the hash table entry and that
the time to read the second 8-byte word from memory is:

((CAS latency ∗ 2) + 1) / data rate

As a result, a single core can process ≈ 12.95 Gbps.
If we consider minimum size (64 byte) packets (i.e., an
unrealistic, worst case scenario), the Ethernet switch requires:
238 clock cycles + 1 DRAM access
which means that a single core at 2.8 GHz can process ≈
12.05 Mpps (while the load and throughput of the memory
remain the same), which translates into ≈ 7.9 Gbps. Estimates
calculated for different packet sizes are reported in Table II.
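These figures can be reproduced by composing the per-EO costs
derived above. The following sketch recomputes the cycle budgets
of this section (238 and 2630 cycles) exactly, and packet rates
within a few percent of Table II (the small residual gap is
presumably due to rounding in the reported values):

from math import ceil

def mem_io(l1n, l2n):
    spill = max(0, l1n - 32 * 1024) + l2n
    cycles = (4 * ceil(min(32 * 1024, l1n) / 64)
              + 12 * ceil(min(256 * 1024, spill) / 64)
              + 40 * ceil((l1n + l2n) / 64))
    return cycles, ceil(max(0, spill - 256 * 1024) / 64)

def parse(b):                      # header fields are in L1: 4-cycle loads
    return 4 * ceil(b / 8), 0

def hash_lookup(n, he, p):
    cycles = ceil((4 * n + 106 + 4 * ceil(he / 8) + 4 * ceil(he / 32))
                  * (1 + p))
    return cycles, ceil(ceil(he / 64) * (1 + p))

def switch_cost(pkt):              # the model of Fig. 2 for a pkt-byte packet
    eos = [mem_io(14, pkt - 14), parse(6), hash_lookup(1, 12, 0),
           mem_io(14, pkt - 14)]
    return tuple(map(sum, zip(*eos)))

CPU_HZ = 2.8e9                     # one Xeon E5-2630 core
MEM_PPS = 1333e6 / (2 * 9 + 1)     # data rate / ((CAS latency * 2) + 1)

assert switch_cost(64) == (238, 1) and switch_cost(1500) == (2630, 1)
for size in (64, 128, 256, 512, 1024, 1500):
    cycles, _ = switch_cost(size)
    print(size, round(min(CPU_HZ / cycles, MEM_PPS) / 1e6, 2), "Mpps")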
V. EXPERIMENTAL VALIDATION
In order to evaluate the accuracy of the estimates provided
by the proposed modeling approach, in this section we show
measurements made in a lab setting with software switch
implementations running on the presented hardware platform.
Three software switches are used in the experiments: Open
vSwitch (OVS), eXtensible DataPath Daemon (xDPd) and
Ericsson Research Flow Switch (ERFS). These switches are
configured via the OpenFlow protocol to perform a single des-
tination MAC address-based output port selection and forward
packets on the selected interface. The execution platform is
equipped with two Xeon E5-2630 processors whose model
is provided in Figure 3. To minimize the interference of the
operating system drivers, the network interfaces are managed
by the Intel® DPDK drivers. These drivers are designed for
fast packet processing, enabling applications (i.e., the switch
implementation in this case) to receive and send packets
directly from/to a network interface card within the minimum
possible number of CPU cycles. A separate PC with the same
hardware configuration is used as a traffic generator with
the DPDK based pktgen traffic generator that is capable of
saturating a 10GbE link with minimum size packets.
[Plot: throughput (Mpps) vs. packet size (bytes, 64–1500);
curves: Estimate, OVS-DPDK, ERFS, xDPd-0.6]
Fig. 6: Performance with 100 flows.
The test traffic consists of Ethernet packets with different
destination MAC addresses in order to prevent inter-packet
caching. The total number of packets sent for each test is equal
to 100,000,000 ∗ transmission rate (in Gbps). The generator
PC is also used to compute statistics on received packets.
Figure 6 shows the results obtained using each of the
above listed switches and generating 100 concurrent flows with
different destination MAC addresses. The results make it clear
that in this scenario the switches can achieve throughput
up to the link capacity, except with very small packets. The
estimated value is above the measured value, as expected, since
the estimation considers the hardware computational capability
and not the transmission rate of the physical links. For small
packets the fully-optimized pipeline of ERFS outperforms
xDPd and OVS. With 64 byte packets the measured throughput
of ERFS significantly exceeds the estimated value, which in
turn is above the measured values for the other two switches.
In order to further test the accuracy of the estimates, we
run additional tests with bi-directional flows. The generated
traffic has the same characteristics as in the previous tests, and
in this case we calculate aggregate statistics over all output
interfaces. In this way the traffic processed by the switch
can hypothetically reach 20 Gbps. We test this configuration
with increasing packet sizes, until the link capacity is reached.
The results obtained, which involve 2 different cores, are
presented in Figure 7, together with the values estimated with
the modeling approach. As correctly estimated, a rate of only
around 22 Mpps can be reached with small packets. As is
visible, version 0.6 of xDPd has internal scalability problems,
while the other two switches are capable of scaling as needed.
The above results show that the model provides a good
estimation of the throughput limit. In the case of bi-directional
flows the computed estimation has a 9% error for 64 byte
packets, 0.2% for 128 byte packets and 6% for 256 byte
packets. The error increases for larger packets because the
computational capabilities, which are what the model takes
into account, are no longer the factor limiting performance.
[Plot: throughput (Mpps) vs. packet size (bytes, 64–1500);
curves: Estimate, OVS-DPDK, ERFS, xDPd-0.6]
Fig. 7: Performance with 100 flows and bi-directional traffic
(using 2 cores).

The results show that the proposed modeling approach
provides the means to produce a valuable estimation of network
function performance. This methodology will be further improved
by also considering the effects of packet interaction and
concurrency.
ACKNOWLEDGMENT
This work was conducted within the framework of the FP7
UNIFY project (http://www.fp7-unify.eu/), which is partially
funded by the Commission
of the European Union. Study sponsors had no role in writing
this report. The views expressed do not necessarily represent
the views of the authors’ employers, the UNIFY project, or
the Commission of the European Union.
REFERENCES
[1] ETSI ISG for NFV, “ETSI GS NFV-INF 001: Network
Functions Virtualisation (NFV); Infrastructure Overview,”
http://www.etsi.org/deliver/etsi_gs/NFV-INF/001_099/001/
01.01.01_60/gs_NFV-INF001v010101p.pdf, [Online; accessed
19-May-2015].
[2] M. Baldi and A. Sapio, “A network function modeling approach for
performance estimation,” in 2015 IEEE 1st International Forum on
Research and Technologies for Society and Industry Leveraging a better
tomorrow (RTSI 2015), Torino, Italy, Sep. 2015.
[3] G. Pongrácz, L. Molnár, Z. L. Kis, and Z. Turányi, “Cheap silicon: a myth
or reality? picking the right data plane hardware for software defined
networking,” in Proceedings of the second ACM SIGCOMM workshop
on Hot topics in software defined networking. ACM, 2013, pp. 103–108.
[4] D. Joseph and I. Stoica, “Modeling middleboxes,” Network, IEEE,
vol. 22, no. 5, pp. 20–25, 2008.
[5] A. Gember, A. Krishnamurthy, S. S. John, R. Grandl, X. Gao, A. Anand,
T. Benson, A. Akella, and V. Sekar, “Stratos: A network-aware orches-
tration layer for middleboxes in the cloud,” Technical Report, Tech. Rep.,
2013.
[6] J. Soares, M. Dias, J. Carapinha, B. Parreira, and S. Sargento,
“Cloud4nfv: A platform for virtual network functions,” in Cloud Net-
working (CloudNet), 2014 IEEE 3rd International Conference on. IEEE,
2014, pp. 288–293.
[7] F. Lucrezia, G. Marchetto, F. Risso, and V. Vercellone, “Intro-
ducing network-aware scheduling capabilities in openstack,” Network
Softwarization (NetSoft), 2015 IEEE 1st Conference on, 2015.
[8] “Intel 64 and IA-32 Architectures Optimization Reference Manual,”
http://www.intel.com/content/www/us/en/architecture-and-technology/
64-ia-32-architectures-optimization-manual.html, [Online; accessed
19-May-2015].

More Related Content

PPT
An Adaptive Load Balancing Middleware for Distributed Simulation
PPTX
NoC simulators presentation
DOCX
ORCHESTRATING BULK DATA TRANSFERS ACROSS GEO-DISTRIBUTED DATACENTERS
PDF
PR-366: A ConvNet for 2020s
PDF
PERFORMANCE ANALYSIS OF OLSR PROTOCOL IN MANET CONSIDERING DIFFERENT MOBILITY...
PPTX
MPLS in DC and inter-DC networks: the unified forwarding mechanism for networ...
PPT
Colloque IMT -04/04/2019- L'IA au cœur des mutations industrielles - L'IA pou...
An Adaptive Load Balancing Middleware for Distributed Simulation
NoC simulators presentation
ORCHESTRATING BULK DATA TRANSFERS ACROSS GEO-DISTRIBUTED DATACENTERS
PR-366: A ConvNet for 2020s
PERFORMANCE ANALYSIS OF OLSR PROTOCOL IN MANET CONSIDERING DIFFERENT MOBILITY...
MPLS in DC and inter-DC networks: the unified forwarding mechanism for networ...
Colloque IMT -04/04/2019- L'IA au cœur des mutations industrielles - L'IA pou...

What's hot (16)

PDF
EFFECTS OF MAC PARAMETERS ON THE PERFORMANCE OF IEEE 802.11 DCF IN NS-3
PPTX
Emerging Technologies in On-Chip and Off-Chip Interconnection Networks
DOC
Performance of a speculative transmission scheme for scheduling latency reduc...
DOCX
MC0085 – Advanced Operating Systems - Master of Computer Science - MCA - SMU DE
PDF
Transfer Learning for Software Performance Analysis: An Exploratory Analysis
PDF
2453
PPT
IS-ENES COMP Superscalar tutorial
PPTX
Resource management
PDF
Cross Layer- Performance Enhancement Architecture (CL-PEA) for MANET
PDF
2004 qof is_mpls_ospf
DOCX
Error tolerant resource allocation and payment minimization for cloud system
PDF
Fast switching of threads between cores - Advanced Operating Systems
PDF
COVERAGE DRIVEN FUNCTIONAL TESTING ARCHITECTURE FOR PROTOTYPING SYSTEM USING ...
PDF
TEST-COST-SENSITIVE CONVOLUTIONAL NEURAL NETWORKS WITH EXPERT BRANCHES
PDF
NETWORK-AWARE DATA PREFETCHING OPTIMIZATION OF COMPUTATIONS IN A HETEROGENEOU...
DOCX
IEEE 2014 JAVA CLOUD COMPUTING PROJECTS A stochastic model to investigate dat...
EFFECTS OF MAC PARAMETERS ON THE PERFORMANCE OF IEEE 802.11 DCF IN NS-3
Emerging Technologies in On-Chip and Off-Chip Interconnection Networks
Performance of a speculative transmission scheme for scheduling latency reduc...
MC0085 – Advanced Operating Systems - Master of Computer Science - MCA - SMU DE
Transfer Learning for Software Performance Analysis: An Exploratory Analysis
2453
IS-ENES COMP Superscalar tutorial
Resource management
Cross Layer- Performance Enhancement Architecture (CL-PEA) for MANET
2004 qof is_mpls_ospf
Error tolerant resource allocation and payment minimization for cloud system
Fast switching of threads between cores - Advanced Operating Systems
COVERAGE DRIVEN FUNCTIONAL TESTING ARCHITECTURE FOR PROTOTYPING SYSTEM USING ...
TEST-COST-SENSITIVE CONVOLUTIONAL NEURAL NETWORKS WITH EXPERT BRANCHES
NETWORK-AWARE DATA PREFETCHING OPTIMIZATION OF COMPUTATIONS IN A HETEROGENEOU...
IEEE 2014 JAVA CLOUD COMPUTING PROJECTS A stochastic model to investigate dat...
Ad

Viewers also liked (16)

PPTX
Conectores johan
PDF
Natural Stool Softener Remedies
PPTX
Halloween 1516
PDF
Digital trends (updated Future)
PDF
PDF
IPLOG-BSidesROC-2015
PPTX
Vancouver Career College Students Participate in Different Halloween Activiti...
PPTX
սանդերքի մասին
PPTX
2013 Wizards slideshow
PDF
López y león la investigación cualitativa. nuevas formas
PDF
Social Media & Journalistiek, een experimentje
PDF
Thought Studios Limited
PDF
Feasibility Studies: Ammonium Nitrate Manufacturing
PDF
HB5 Powerpoint
DOCX
Proceso didáctico
PDF
Feasibility Studies: Aluminium Chloride Manufacturing
Conectores johan
Natural Stool Softener Remedies
Halloween 1516
Digital trends (updated Future)
IPLOG-BSidesROC-2015
Vancouver Career College Students Participate in Different Halloween Activiti...
սանդերքի մասին
2013 Wizards slideshow
López y león la investigación cualitativa. nuevas formas
Social Media & Journalistiek, een experimentje
Thought Studios Limited
Feasibility Studies: Ammonium Nitrate Manufacturing
HB5 Powerpoint
Proceso didáctico
Feasibility Studies: Aluminium Chloride Manufacturing
Ad

Similar to Conference Paper: Cross-platform estimation of Network Function Performance (20)

PDF
Network Function Modeling and Performance Estimation
PDF
Network Functions Virtualization Fundamentals
PDF
Control of Communication and Energy Networks Final Project - Service Function...
PDF
Conference Paper: Network Function Chaining in DCs: the unified recurring con...
PDF
Conference Paper: Elastic Network Functions: opportunities and challenges
PDF
Analysis of basic Architectures used for Lifecycle Management and Orchestrati...
PDF
Network function virtualization
PDF
SDN: A New Approach to Networking Technology
PPTX
Modern Networking Unit 3 Network Function virtualization
PDF
Network Function Virtualisation
PDF
Why Network Functions Virtualization sdn?
PDF
Xura NFV and Messaging Infrastructure_WP_1.0
PDF
Whitepaper nfv sdn-available-now
PPT
enetwork function virtualization NFV.ppt
PDF
A VNF modeling approach for verification purposes
PDF
NFV Tutorial
PDF
NFV Tutorial
PDF
High performance and flexible networking
PDF
Service provider-considerations
DOCX
Computer Network unit-5 -SDN and NFV topics
Network Function Modeling and Performance Estimation
Network Functions Virtualization Fundamentals
Control of Communication and Energy Networks Final Project - Service Function...
Conference Paper: Network Function Chaining in DCs: the unified recurring con...
Conference Paper: Elastic Network Functions: opportunities and challenges
Analysis of basic Architectures used for Lifecycle Management and Orchestrati...
Network function virtualization
SDN: A New Approach to Networking Technology
Modern Networking Unit 3 Network Function virtualization
Network Function Virtualisation
Why Network Functions Virtualization sdn?
Xura NFV and Messaging Infrastructure_WP_1.0
Whitepaper nfv sdn-available-now
enetwork function virtualization NFV.ppt
A VNF modeling approach for verification purposes
NFV Tutorial
NFV Tutorial
High performance and flexible networking
Service provider-considerations
Computer Network unit-5 -SDN and NFV topics

More from Ericsson (20)

PDF
Ericsson Technology Review: Versatile Video Coding explained – the future of ...
PDF
Ericsson Technology Review: issue 2, 2020
PDF
Ericsson Technology Review: Integrated access and backhaul – a new type of wi...
PDF
Ericsson Technology Review: Critical IoT connectivity: Ideal for time-critica...
PDF
Ericsson Technology Review: 5G evolution: 3GPP releases 16 & 17 overview (upd...
PDF
Ericsson Technology Review: The future of cloud computing: Highly distributed...
PDF
Ericsson Technology Review: Optimizing UICC modules for IoT applications
PDF
Ericsson Technology Review: issue 1, 2020
PDF
Ericsson Technology Review: 5G BSS: Evolving BSS to fit the 5G economy
PDF
Ericsson Technology Review: 5G migration strategy from EPS to 5G system
PDF
Ericsson Technology Review: Creating the next-generation edge-cloud ecosystem
PDF
Ericsson Technology Review: Issue 2/2019
PDF
Ericsson Technology Review: Spotlight on the Internet of Things
PDF
Ericsson Technology Review - Technology Trends 2019
PDF
Ericsson Technology Review: Driving transformation in the automotive and road...
PDF
SD-WAN Orchestration
PDF
Ericsson Technology Review: 5G-TSN integration meets networking requirements ...
PDF
Ericsson Technology Review: Meeting 5G latency requirements with inactive state
PDF
Ericsson Technology Review: Cloud-native application design in the telecom do...
PDF
Ericsson Technology Review: Service exposure: a critical capability in a 5G w...
Ericsson Technology Review: Versatile Video Coding explained – the future of ...
Ericsson Technology Review: issue 2, 2020
Ericsson Technology Review: Integrated access and backhaul – a new type of wi...
Ericsson Technology Review: Critical IoT connectivity: Ideal for time-critica...
Ericsson Technology Review: 5G evolution: 3GPP releases 16 & 17 overview (upd...
Ericsson Technology Review: The future of cloud computing: Highly distributed...
Ericsson Technology Review: Optimizing UICC modules for IoT applications
Ericsson Technology Review: issue 1, 2020
Ericsson Technology Review: 5G BSS: Evolving BSS to fit the 5G economy
Ericsson Technology Review: 5G migration strategy from EPS to 5G system
Ericsson Technology Review: Creating the next-generation edge-cloud ecosystem
Ericsson Technology Review: Issue 2/2019
Ericsson Technology Review: Spotlight on the Internet of Things
Ericsson Technology Review - Technology Trends 2019
Ericsson Technology Review: Driving transformation in the automotive and road...
SD-WAN Orchestration
Ericsson Technology Review: 5G-TSN integration meets networking requirements ...
Ericsson Technology Review: Meeting 5G latency requirements with inactive state
Ericsson Technology Review: Cloud-native application design in the telecom do...
Ericsson Technology Review: Service exposure: a critical capability in a 5G w...

Recently uploaded (20)

PDF
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
PDF
Encapsulation_ Review paper, used for researhc scholars
PPTX
Understanding_Digital_Forensics_Presentation.pptx
PDF
MIND Revenue Release Quarter 2 2025 Press Release
PDF
The Rise and Fall of 3GPP – Time for a Sabbatical?
PDF
Approach and Philosophy of On baking technology
PPTX
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
PDF
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
PDF
Review of recent advances in non-invasive hemoglobin estimation
PDF
How UI/UX Design Impacts User Retention in Mobile Apps.pdf
PDF
KodekX | Application Modernization Development
PPTX
Spectroscopy.pptx food analysis technology
PDF
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
PDF
NewMind AI Weekly Chronicles - August'25 Week I
PPT
“AI and Expert System Decision Support & Business Intelligence Systems”
PDF
Network Security Unit 5.pdf for BCA BBA.
PPTX
Digital-Transformation-Roadmap-for-Companies.pptx
PDF
Building Integrated photovoltaic BIPV_UPV.pdf
PPTX
20250228 LYD VKU AI Blended-Learning.pptx
PPTX
MYSQL Presentation for SQL database connectivity
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
Encapsulation_ Review paper, used for researhc scholars
Understanding_Digital_Forensics_Presentation.pptx
MIND Revenue Release Quarter 2 2025 Press Release
The Rise and Fall of 3GPP – Time for a Sabbatical?
Approach and Philosophy of On baking technology
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
Review of recent advances in non-invasive hemoglobin estimation
How UI/UX Design Impacts User Retention in Mobile Apps.pdf
KodekX | Application Modernization Development
Spectroscopy.pptx food analysis technology
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
NewMind AI Weekly Chronicles - August'25 Week I
“AI and Expert System Decision Support & Business Intelligence Systems”
Network Security Unit 5.pdf for BCA BBA.
Digital-Transformation-Roadmap-for-Companies.pptx
Building Integrated photovoltaic BIPV_UPV.pdf
20250228 LYD VKU AI Blended-Learning.pptx
MYSQL Presentation for SQL database connectivity

Conference Paper: Cross-platform estimation of Network Function Performance

  • 1. Cross-Platform Estimation of Network Function Performance Amedeo Sapio Department of Control and Computer Engineering Politecnico di Torino Torino, Italy amedeo.sapio@polito.it Mario Baldi Department of Control and Computer Engineering Politecnico di Torino Torino, Italy mario.baldi@polito.it Gergely Pongr´acz TrafficLab Ericsson Research Budapest, Hungary gergely.pongracz@ericsson.com Abstract—This work shows how the performance of a network function can be estimated with an error margin that is small enough to properly support orchestration of network functions virtualization (NFV) platforms. Being able to estimate the per- formance of a virtualized network function (VNF) on execution hardware of various types enables its optimal placement, while efficiently utilizing available resources. Network functions are modeled using a methodology focused on the identification of recurring execution patterns and aimed at providing a platform independent representation. By mapping the model on specific hardware, the performance of the network function can be estimated in terms of maximum throughput that the network function can achieve on the specific execution platform. The approach is such that once the basic modeling building blocks have been mapped, the estimate can be computed automatically. This work presents the model of an Ethernet switch and evaluates its accuracy by comparing the performance estimation it provides with experimental results. Keywords—Network Functions Virtualization; Virtual Network Function; modeling; orchestration; performance estimation; I. INTRODUCTION For a few years now software network appliances have been increasingly deployed. Initially, their appeal stemmed from their lower cost, shorter time-to-market, ease of upgrade when compared to purposely designed hardware devices. These features are particularly advantageous in the case of appliances, a.k.a. middleboxes, operating on relatively recent higher layer protocols that are usually more complex and are possibly still evolving. Then, with the overwhelming success and diffusion of cloud computing and virtualization, software appliances became natural means to ensure that network functionalities had the same flexibility and mobility as the virtual machines (VMs) they offer services to. Hence, value started to be seen in the software implementation of also less complex, more stable network functionalities. This trend led to embracing Software Defined Networking and Network Functions Virtu- alization (NFV). The former as a hybrid hardware/software approach to ensure high performance for lower layer packet forwarding, while retaining a high degree of flexibility and programmability. The latter as a virtualization solution target- ing the execution of software network functions in isolated Virtual Machines (VMs) sharing a pool of hosts, rather than on dedicated hardware (i.e., appliances). Such a solution en- ables virtual network appliances (i.e., VMs executing network functions) to be provisioned, allocated a different amount of resources, and possibly moved across data centers in little time, which is key in ensuring that the network can keep up with the flexibility in the provisioning and deployment of virtual hosts in today’s virtualized data centers. Additional flexibility is offered when coupling NFV with SDN as network traffic can be steered through a chain of Virtualized Network Functions (VNFs) in order to provide aggregated services. 
With inputs from the industry, the NFV approach has been standardized by the European Telecommunications Standards Institute (ETSI) in 2013 [1]. The flexibility provided by NFV requires the ability to effectively assign compute nodes to VNFs and allocate the most appropriate amount of resources, such as CPU quota, RAM, virtual interfaces. In the ETSI standard the component in charge of taking such decisions is called orchestrator and it can also dynamically modify the amount of resources assigned to a running VNF when needed. The orchestrator can also request the migration of a VNF when the current compute node executing it is no longer capable of fulfilling the VNF perfor- mance requirements. These tasks require the orchestrator to be able to estimate the performance of VNFs according to the amount of resources they can use. Such estimation must take into account the nature of the traffic manipulation performed by the VNF at hand, some specifics of its implementation, and the expected amount of traffic it operates on. A good estimation is key in ensuring higher resource usage efficiency and avoid adjustments at runtime. This work presents and evaluates the model of an Ethernet switch based on a unified modeling approach [2] applicable to any VNF, independently of the platform it is running on. By mapping the VNF model to a specific hardware, it is possible to predict the maximum amount of traffic that the VNF can sustain. In this work, the model is mapped to a sample hardware platform and the predicted performance is compared with the actual measurements. The deployed modeling approach [2] is particularly valu- able because it relies on a description of VNFs in terms of basic operations, which results in a hardware independent notation that ensures that the model is valid for any execution platform. In addition, the mapping of the model on a target hardware architecture (required in order to determine the actual performance) can be automated, hence allowing to easily apply
  • 2. the approach to each available hardware platform and choose the most suitable for the execution. After discussing related work in Section II, the modeling approach is described in Section III. Section IV presents the modelization of an Ethernet switch and the mapping of the model to a general purpose hardware architecture. In order to validate the accuracy of the approach, Section V compares the performance estimated through the model with actual measurements obtained by running targeted experiments with a software implementation of the Ethernet switch on the considered hardware platform. II. RELATED WORK This work applies to an Ethernet switch the approach to network function modelization proposed in [2], providing experimental measurements to validate the obtained model. The modelization approach was inspired by [3] that aims to demonstrate that the Software Defined Networks approach does not necessarily imply lower performance compared to purpose-built ASICs. In order to prove it, the performance of a software implementation of an Ethernet Provider Backbone Edge Bridge is evaluated. The execution platform considered in this work is a hypothetical network processor, for which a high-level model is provided. The authors do not aim at providing a universal modelization approach for a generic network functions. Rather, their purpose is to use a specific sample network function to demonstrate that, even for very specific tasks, the NPU-based software implementation offers performance only slightly lower than purpose designed chips. A modeling approach for describing packet processing in middleboxes and the ways they can be deployed is presented in [4] and applied to a NAT, a L4 load balancer, and a L7 load balancer. The proposed model is not aimed at estimating performance and resources requirements, but it rather focuses on accurately describing middleboxes functionalities to support decisions in their deployment. On the other hand, a VNF modeling approach aimed at performance estimation would be greatly beneficial to cloud platforms where the performance of the network infrastructure is taken into account when placing VMs [5]–[7]. For example, [7] describes the changes needed in the OpenStack software platform, the open-source reference cloud management system, to enable the Nova scheduler to plan VM allocation based on network property data and a set of constraints provided by the orchestrator. We argue that in order to infer such constraints, the orchestrator needs a VNF model like the ones generated by the approach presented in this paper. III. METHODOLOGY The proposed modeling approach is based on the definition of a set of processing steps, here called Elementary Operations (EOs), that are common throughout various NF implementa- tions. This stems from the observation that, generally, most NFs perform a rather small set of operations when processing the average packet, namely, a well-defined alteration of packet headers, coupled with a data structure lookup. An EO is informally defined as the longest sequence of elementary steps (e.g., CPU instructions or ASIC transactions) that is common among multiple NFs processing tasks. As a consequence, an EO have variable granularity ranging from a simple I/O or memory load operation, to a whole IP checksum computation. On the other hand, EOs are defined so that each can be potentially used in multiple NF models. An NF is modeled as a sequence of EOs that represent the actions performed for the vast majority of packets. 
Since we are interested in performance estimation, we ignore handling that affects only a small number of packets (i.e., less the 1%), since these tasks have a negligible impact on performance, even when they are more complex and resource intensive than the most common ones. Accordingly exceptions, such as failures, configuration changes, etc., are not considered. It is important to highlight that NF models produced with this approach are hardware independent, which ensures that they can be applied when NFs are deployed on different execution platforms. In order to estimate the performance of an NF on a specific hardware platform, each EO must be mapped on the hardware components involved in its execution and their features. This mapping allows to take into consideration the limits of the involved hardware components and gather a set of constraints that affect the performance (e.g., clock frequency). Moreover, the load incurred by each component when executing each EO must be estimated, whether through actual experiments or based on nominal hardware specifica- tions. The data collected during such mapping are specific to EOs and the hardware platform, but not to a particular NF. Hence, they can be applied to estimate the performance of any NF starting from its model. Specifically, the performance of each individual EO involved in the NF model is computed and composed considering the cumulative load that all EOs impose on the hardware components of the execution platform, while heeding all of the applicable constraints. Figure 1 summarizes the steps and intermediate outputs of the proposed approach. NF EO HW Architecture NF Performance Express Map EO Performance Constraints NF Model Fig. 1: NF modeling and performance estimation approach. Table I presents a sample list of EOs that we identified when modeling a number of NFs. Such list is by no means meant to be exhaustive; rather, it should be incrementally extended whenever it turns out that a new NF being considered cannot be described with previously identified EOs. When defining an EO, it is important to identify the parameters related to traffic characteristics that significantly affect the execution and resource consumption.
  • 3. TABLE I: Sample list of EOs EO Parameters Description 1 mem_I/O L1n, L2n Packet copy between I/O and (cache) memory 2 parse b Parsing a data field 3 increase b Increase/decrease a field 4 array_access es, max Direct access to a byte array in memory 5 hash_lookup N, HE, max, p Simple hash table lookup 6 checksum b Compute IP checksum 7 sum b Sum 2 operands A succinct description of the EOs listed in table I is provided below. 1) Packet copy between I/O and memory: A packet is copied from/to an I/O buffer to/from memory. L1n is the number of bytes that are prefer- ably stored in L1 cache memory, otherwise in L2 cache or external RAM. L2n bytes are preferably stored in L2 cache memory, otherwise in external RAM. The parameters have been chosen taking into consideration that some NPUs provide a manual cache that can be explicitly loaded with the data that need fast access. General purpose CPUs may have assembler instructions (e.g., PREFETCHh) to explicitly influence the cache logic. 2) Parsing a data field: A data field of b bytes stored in memory is parsed. A parsing operation is necessary before performing any computation on a field (corresponds to loading a processor register). This EO can be used also to model the dual operation, i.e., encapsulation, which implies storing back into memory a properly con- structed sequence of fields. 3) Increase/decrease a field: Increase/decrease the numerical value contained in a field of b bytes. The field to increase must have already been parsed. 4) Direct access to a byte array in memory: This EO performs a direct access to an element of an array in memory using an index. Each array entry has size es, while the array has at most max entries. 5) Simple hash table lookup: A simple lookup in a direct, XOR based hash table is performed. The hash key consists of N components and each entry has size equal to HE. The table has at most max entries. The collision probability is p. 6) Compute IP checksum: The standard IP checksum computation is performed on b bytes. 7) Sum 2 operands: Two operands of b bytes are added. For the sake of simplicity (and without affecting the validity of the approach, as shown by the results in Section V), in modeling NFs by means of EOs, we assume that the number of processor registers is larger than the number of packet fields that must be processed simultaneously. Therefore there is no competition for this resource. IV. A MODELING USE CASE This section demonstrates the application of the modeling approach described in the previous section. EOs are used to describe the operation of an Ethernet switch and then they are mapped to a general purpose hardware platform. A. Ethernet Switch Model For each packet the switch selects the output interface where it must be forwarded, retrieving it from a hash table keyed by the destination MAC address extracted from the packet. When the network interface receives a packet, it is firstly stored in an I/O buffer. In order to access the Ethernet header, the CPU/NPU must first copy the packet in cache or main memory. Since the switch operates only on the Ethernet header that is of limited size (14 bytes), it is copied in the L1 cache, while the rest of the packet (up to 1486 bytes) can be copied in L2 cache or main memory. To ensure generality, we consider that an incoming packet cannot be copied directly from an I/O buffer to another, instead it must be first copied in (cache) memory in any case. 
The switch must then read the destination MAC address (6 bytes) prior to using it to access the hash table to get the appropriate output interface. The hash table has one key (the destination MAC) and consists of 12 byte entries composed by the key and the output interface MAC address. Here we considered that the output interface is identified by its Ethernet address. Different implementations can use a different identifier, which leads to a minor variation in the model. The average number of entries in a real case scenario is ≈ 2M, which can give an idea of whether it can be fully stored in cache under any traffic conditions. Here we assume that the collision probability is negligible (i.e., the hash table is sufficiently sparse). The packet can then be moved to the buffer of the selected output I/O device. The resulting model is summarized in Figure 2. mem_I/O(14, 1486) parse(6) hash_lookup(1, 12, 2M, 0) mem_I/O(14, 1486) Fig. 2: Ethernet switch model. B. Mapping to Hardware We now proceed to map the described EOs to a specific hardware platform. Figure 3 provides a schematic representa- tion of the platform main components and relative constraints using the template proposed in [3]: an Intel R Xeon E5-2630 CPU, a DDR3 RAM module and a 10Gb Ethernet Controller. Using the CPU reference manual [8], it is possible to determine the operations required for the execution of each EO in Table I and estimate the achievable performance.
  • 4. DDR3 - 1333 Mtps - Max 85.2 Gbps - CAS lat. 9 MCT - 4 ch. - DDR3 - Max 340.8 Gbps I/O PCIe v3.0 - 8 Gtps - 126 Gbps (x16) x86-64 6 cores /slot - 2 threads / core - 2.3 – 2.8 GHz AVX VT-d, VT-x + EPT L1 - per core - i=32KB - d=32KB L2 - per core - 256 KB L3 - per slot - 15 MB 2x 10 GbE - 5 Gtps - PCIe v2.0 (x8) - Max 32 Gbps Intel Xeon E5-2630 Fig. 3: Hardware architecture description. 1. mem_I/O(L1n, L2n) The CPU L1 and L2 data caches can move one line per cache cycle, i.e., 512 bits (64 bytes) in 4 clock cycles and 12 clock cycles respectively, and their maximum sizes are 32 KB and 256 KB, respectively. Moreover, read and write operations in I/O buffers require on average 40 clock cycles. On the whole, the execution of this EO requires: 4 ∗ ⌈ min(32KB, L1n) 64B ⌉+ 12 ∗ ⌈ min(256KB, max(0, L1n − 32KB) + L2n) 64B ⌉+ 40 ∗ ⌈ L1n + L2n 64B ⌉ clock cycles and ⌈ max(0, max(0, L1n − 32KB) + L2n − 256KB) 64B ⌉ L3 cache or DRAM accesses. 2. parse(b) Loading a 64 bit register requires 4 clock cycles if data is in L1 cache or 12 clock cycles if data is in L2 cache, otherwise an additional L3 cache or DRAM memory access is required to retrieve a 64 byte line and store it in L1 or L2 respectively: 4 ∗ ⌈ b 8B ⌉ clock cycles {+⌈ b 64B ⌉ L3 or DRAM accesses} or 12 ∗ ⌈ b 8B ⌉ clock cycles {+⌈ b 64B ⌉ L3 or DRAM accesses} 3. increase(b) Whether a processor includes an increase instruction or one for adding a constant value to a 64 bit register, this EO requires 1 clock cycle to complete. However, thanks to pipelining, up to 3 independent such instructions can be executed during 1 clock cycle: ⌈0.33 ∗ b 8B ⌉ clock cycles 4. array_access(es, max) Direct array access needs to execute an “ADD” instruction (1 clock cycle) for computing the index and a “LOAD” instruc- tion resulting into a direct memory access and as many clock cycles as the number of CPU registers required to load the selected array element: 1 + ⌈ es 8B ⌉ clock cycles +⌈ es 64B ⌉ DRAM accesses 5. hash_lookup(N, HE, max, p) We assume that a simple hash lookup is implemented according to the pseudo-code described in [3] and shown in Figure 4 for ease of reference. Register $1-N: key components Register $HL: hash length Register $HP: hash array pointer Register $HE: hash entry size Register $Z: result Pseudo code: # hash key calculation eor $tmp, $tmp for i in 1 ... N eor $tmp, $i # key is available in $tmp # calculate hash index from key udiv $tmp2, $tmp, $HL mls $tmp2, $tmp2, $HL, $tmp # index is available in $tmp2 # index -> hash entry pointer mul $tmp, $tmp2, $HE add $tmp, $HP # entry pointer available in $tmp <prefetch entry to L1 memory> # pointer to L1 entry -> $tmp2 # hash key check (entry vs. key) for i in 1 ... N ldr $Z, [$tmp2], #4 # check keys cmp $i, $Z bne collision # no jump means matching keys # pointer to data available in $Z Fig. 4: Hash lookup pseudo-code. Considering that the hash entry needs to be loaded from memory to L1 cache, a simple hash lookup would require approximately: ⌈(4 ∗ N + 106 + 4 ∗ ⌈ HE 8B ⌉ + 4 ∗ ⌈ HE 32B ⌉) ∗ (1 + p)⌉ clock cycles and
6. checksum(b)

Figure 5 shows sample assembly code to compute a checksum on an Intel® x86-64 processor. Assuming that the data on which the checksum is computed is not in the L1/L2 caches, according to the Intel® documentation [8] the execution of this code requires

$7 \left\lceil \frac{b}{2} \right\rceil + 8$ clock cycles $+ \left\lceil \frac{b}{64\mathrm{B}} \right\rceil$ L3 or DRAM accesses.

    ; Register ECX: number of bytes b
    ; Register EDX: pointer to the buffer
    ; Register EBX: checksum
    CHECKSUM_LOOP:
        XOR EAX, EAX            ; EAX = 0
        MOV AX, WORD PTR [EDX]  ; AX <- next 16-bit word
        ADD EBX, EAX            ; add to checksum
        SUB ECX, 2              ; update number of bytes left
        ADD EDX, 2              ; advance buffer pointer
        CMP ECX, 1              ; check if ended
        JG  CHECKSUM_LOOP
        MOV EAX, EBX            ; EAX = checksum
        SHR EAX, 16             ; EAX = checksum >> 16 (the carry)
        AND EBX, 0xffff         ; EBX = checksum & 0xffff
        ADD EAX, EBX            ; EAX = (checksum >> 16) + (checksum & 0xffff)
        MOV EBX, EAX            ; EBX = checksum
        SHR EBX, 16             ; EBX = checksum >> 16
        ADD EAX, EBX            ; checksum += (checksum >> 16)
        MOV checksum, EAX       ; store the final checksum

Fig. 5: Sample Intel® x86 assembly code for checksum computation.

7. sum(b)

On the considered architecture, the execution of this EO is equivalent to the increase(b) EO. Note that this is not necessarily the case on every architecture.

C. Performance Estimation

Using the above mapping of the EOs in the Ethernet switch model devised in Section IV-A and shown in Figure 2, we can estimate that forwarding a packet of maximum size (1500 bytes) requires

2630 clock cycles + 1 DRAM access.

As a consequence, a single core of an Intel® Xeon E5-2630 operating at 2.8 GHz can process ≈ 1.09 Mpps, while the DDR3 memory can support 70.16 Mpps. The memory throughput is estimated considering that each packet requires a 12-byte memory access to read the hash table entry, and that the time to read the second 8-byte word from memory is

$\frac{(\mathrm{CAS\ latency} \times 2) + 1}{\mathrm{data\ rate}}$.

As a result, a single core can process ≈ 12.95 Gbps. If we consider minimum-size (64-byte) packets (i.e., an unrealistic, worst-case scenario), the Ethernet switch requires

238 clock cycles + 1 DRAM access,

which means that a single core at 2.8 GHz can process ≈ 12.05 Mpps (while the load and throughput of the memory remain the same), which translates into ≈ 7.9 Gbps. Estimates calculated for different packet sizes are reported in Table II.

TABLE II: Estimates for different packet sizes

    Packet size (bytes)   Mpps    Gbps
    64                    12.05    7.91
    128                    8.38    9.69
    256                    5.21   11.24
    512                    2.97   12.34
    1024                   1.59   13.01
    1500                   1.09   12.95
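The arithmetic behind these figures can be reproduced in a few lines. The following is a sketch under the stated assumptions (2.8 GHz clock, DDR3-1333 with CAS latency 9); the raw quotients differ slightly from the Mpps values above due to rounding in the cycle counts:

    CLOCK_HZ = 2.8e9        # core clock frequency
    TRANSFERS = 1333e6      # DDR3-1333 transfer rate (transfers/second)
    CAS = 9                 # CAS latency (cycles)

    def core_limit_mpps(cycles_per_packet):
        # one core processes one packet every cycles_per_packet cycles
        return CLOCK_HZ / cycles_per_packet / 1e6

    def memory_limit_mpps():
        # per packet, reading the second 8-byte word of the hash entry
        # costs (CAS latency * 2) + 1 transfer slots
        return TRANSFERS / (2 * CAS + 1) / 1e6

    core_limit_mpps(2630)   # ~1.06 Mpps for 1500-byte packets
    core_limit_mpps(238)    # ~11.8 Mpps for 64-byte packets
    memory_limit_mpps()     # ~70.16 Mpps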
V. EXPERIMENTAL VALIDATION

In order to evaluate the accuracy of the estimates provided by the proposed modeling approach, in this section we present measurements performed in a lab setting with software switch implementations running on the presented hardware platform. Three software switches are used in the experiments: Open vSwitch (OVS), the eXtensible DataPath Daemon (xDPd) and the Ericsson Research Flow Switch (ERFS). These switches are configured via the OpenFlow protocol to select the output port based on the destination MAC address and to forward packets on the selected interface. The execution platform is equipped with two Xeon E5-2630 processors, whose characteristics are shown in Figure 3.

To minimize the interference of the operating system drivers, the network interfaces are managed by the Intel® DPDK drivers. These drivers are designed for fast packet processing, enabling applications (i.e., the switch implementations in this case) to receive and send packets directly from/to a network interface card within the minimum possible number of CPU cycles. A separate PC with the same hardware configuration is used as a traffic generator, running the DPDK-based pktgen traffic generator, which is capable of saturating a 10GbE link with minimum-size packets.
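For reference, saturating a 10GbE link with minimum-size packets means sustaining the theoretical line rate, which follows from the frame size plus 20 bytes of on-the-wire overhead (8-byte preamble and 12-byte inter-frame gap). A one-line sketch of the computation:

    def line_rate_mpps(frame_bytes, link_bps=10e9, wire_overhead=20):
        # maximum packets per second at full link utilization
        return link_bps / ((frame_bytes + wire_overhead) * 8) / 1e6

    line_rate_mpps(64)      # ~14.88 Mpps
    line_rate_mpps(1500)    # ~0.82 Mpps

This also explains why the measured curves in the following figures flatten at the link capacity for larger packets.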
Fig. 6: Throughput (Mpps) vs. packet size (bytes) with 100 flows: estimate compared with OVS-DPDK, ERFS, and xDPd-0.6.

The test traffic consists of Ethernet packets with different destination MAC addresses, in order to prevent inter-packet caching. The total number of packets sent for each test is equal to 100,000,000 × the transmission rate (in Gbps). The generator PC is also used to compute statistics on the received packets.

Figure 6 shows the results obtained using each of the above listed switches when generating 100 concurrent flows with different destination MAC addresses. The results make clear that in this scenario the switches can achieve throughput up to the link capacity, except with very small packets. The estimated value is above the measured values, as expected, since the estimation considers the hardware computational capability and not the transmission rate of the physical links. For small packets, the fully optimized pipeline of ERFS outperforms xDPd and OVS. With 64-byte packets the measured throughput of ERFS significantly exceeds the estimated value, which in turn is above the measured values of the other two switches.

In order to further test the accuracy of the estimates, we run additional tests with bi-directional flows. The generated traffic has the same characteristics as in the previous tests, and in this case we compute aggregate statistics over all output interfaces. In this way the traffic processed by the switch can hypothetically reach 20 Gbps. We test this configuration with increasing packet sizes, until the link capacity is reached. The results, which involve 2 different cores, are presented in Figure 7, together with the values estimated with the modeling approach. As correctly estimated, a rate of only around 22 Mpps can be reached with small packets. As the figure shows, version 0.6 of xDPd has internal scalability problems, while the other two switches are capable of scaling as needed.

Fig. 7: Throughput (Mpps) vs. packet size (bytes) with 100 flows and bi-directional traffic (using 2 cores): estimate compared with OVS-DPDK, ERFS, and xDPd-0.6.

The above results show that the model provides a good estimation of the throughput limit. In the case of bi-directional flows the computed estimation has a 9% error for 64-byte packets, 0.2% for 128-byte packets and 6% for 256-byte packets. The error increases for bigger packets because the computational capabilities, which are what the model takes into account, are no longer the factor limiting performance.

The results show that the proposed modeling approach provides the means to produce a valuable estimation of network function performance. This methodology will be further improved by also considering the effects of packet interaction and concurrency.

ACKNOWLEDGMENT

This work was conducted within the framework of the FP7 UNIFY project (http://www.fp7-unify.eu/), which is partially funded by the Commission of the European Union. Study sponsors had no role in writing this report. The views expressed do not necessarily represent the views of the authors' employers, the UNIFY project, or the Commission of the European Union.

REFERENCES

[1] “ETSI ISG for NFV, ETSI GS NFV-INF 001, Network Functions Virtualisation (NFV); Infrastructure Overview,” http://www.etsi.org/deliver/etsi_gs/NFV-INF/001_099/001/01.01.01_60/gs_NFV-INF001v010101p.pdf, [Online; accessed 19-May-2015].
[2] M. Baldi and A. Sapio, “A network function modeling approach for performance estimation,” in 2015 IEEE 1st International Forum on Research and Technologies for Society and Industry Leveraging a Better Tomorrow (RTSI 2015), Torino, Italy, Sep. 2015.
[3] G. Pongrácz, L. Molnár, Z. L. Kis, and Z. Turányi, “Cheap silicon: a myth or reality? Picking the right data plane hardware for software defined networking,” in Proceedings of the Second ACM SIGCOMM Workshop on Hot Topics in Software Defined Networking. ACM, 2013, pp. 103–108.
[4] D. Joseph and I. Stoica, “Modeling middleboxes,” IEEE Network, vol. 22, no. 5, pp. 20–25, 2008.
[5] A. Gember, A. Krishnamurthy, S. S. John, R. Grandl, X. Gao, A. Anand, T. Benson, A. Akella, and V. Sekar, “Stratos: A network-aware orchestration layer for middleboxes in the cloud,” Tech. Rep., 2013.
[6] J. Soares, M. Dias, J. Carapinha, B. Parreira, and S. Sargento, “Cloud4NFV: A platform for virtual network functions,” in Cloud Networking (CloudNet), 2014 IEEE 3rd International Conference on. IEEE, 2014, pp. 288–293.
[7] F. Lucrezia, G. Marchetto, F. G. O. Risso, and V. Vercellone, “Introducing network-aware scheduling capabilities in OpenStack,” in Network Softwarization (NetSoft), 2015 IEEE 1st Conference on, 2015.
[8] “Intel 64 and IA-32 Architectures Optimization Reference Manual,” http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html, [Online; accessed 19-May-2015].