Recent advances in netmap/VALE (mSwitch)
Michio Honda, Felipe Huici (NEC Europe Ltd.)
Giuseppe Lettieri and Luigi Rizzo (Università di Pisa)
Kernel/VM@Jimbo-cho, Japan, Dec. 8, 2013
michio.honda@neclab.eu / @michioh
Outline	
•  netmap API basics	

–  Architecture	

–  How to write apps	


•  VALE (mSwitch)	

–  Architecture	

–  System design	

–  Evaluation	

–  Use cases
NETMAP API BASICS
netmap overview
•  A fast packet I/O mechanism between the NIC and user space
–  Removes unnecessary metadata (e.g., sk_buff) allocation
–  Amortized system call costs, reduced/removed data copies

From http://info.iet.unipi.it/~luigi/netmap/
Performance
•  Saturates a 10 Gbps pipe at low CPU frequency

From http://info.iet.unipi.it/~luigi/netmap/
netmap API (initialization)
•  open("/dev/netmap") returns a file descriptor
•  ioctl(fd, NIOCREG, arg) puts an interface in netmap mode
•  mmap(…, fd, 0) maps buffers and rings

From http://info.iet.unipi.it/~luigi/netmap/
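For concreteness, a minimal sketch of this setup sequence in C. It follows the public netmap headers of this era, where the registration ioctl is spelled NIOCREGIF; the interface name and the missing error handling are simplifications.

    #include <fcntl.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <net/if.h>
    #include <net/netmap.h>
    #include <net/netmap_user.h>

    static struct netmap_if *netmap_open(const char *ifname, int *fdp)
    {
        struct nmreq req;
        int fd = open("/dev/netmap", O_RDWR);

        memset(&req, 0, sizeof(req));
        req.nr_version = NETMAP_API;
        strncpy(req.nr_name, ifname, sizeof(req.nr_name) - 1);
        ioctl(fd, NIOCREGIF, &req);            /* NIOCREG in the slide's shorthand */

        /* Map the shared region holding netmap rings and packet buffers. */
        char *mem = mmap(NULL, req.nr_memsize, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
        *fdp = fd;
        return NETMAP_IF(mem, req.nr_offset);  /* per-interface descriptor */
    }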
netmap API (TX)
•  TX
–  Fill up to avail buffers, starting from slot cur
–  ioctl(fd, NIOCTXSYNC) queues the packets
•  poll() can be used for blocking I/O

From http://info.iet.unipi.it/~luigi/netmap/
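A minimal TX loop sketch against the ring layout of this netmap generation (cur/avail fields and the NETMAP_TXRING/NETMAP_BUF/NETMAP_RING_NEXT macros from the netmap headers); build_packet() is a hypothetical helper that writes a frame into the buffer and returns its length.

    struct netmap_ring *txring = NETMAP_TXRING(nifp, 0);     /* first TX ring */

    while (txring->avail > 0) {
        struct netmap_slot *slot = &txring->slot[txring->cur];
        char *buf = NETMAP_BUF(txring, slot->buf_idx);

        slot->len = build_packet(buf);                        /* hypothetical helper */
        txring->cur = NETMAP_RING_NEXT(txring, txring->cur);
        txring->avail--;
    }
    ioctl(fd, NIOCTXSYNC, NULL);   /* hand the filled slots to the NIC */
    /* Alternatively, poll() with POLLOUT blocks until TX slots are available. */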
netmap API (RX)
•  RX
–  ioctl(fd, NIOCRXSYNC) reports newly received packets
–  Process up to avail buffers, starting from slot cur
•  poll() can be used for blocking I/O

From http://info.iet.unipi.it/~luigi/netmap/
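The matching RX loop, again as a sketch; consume_packet() is a hypothetical helper that processes one received frame.

    ioctl(fd, NIOCRXSYNC, NULL);   /* or poll() with POLLIN for blocking I/O */

    struct netmap_ring *rxring = NETMAP_RXRING(nifp, 0);     /* first RX ring */

    while (rxring->avail > 0) {
        struct netmap_slot *slot = &rxring->slot[rxring->cur];
        char *buf = NETMAP_BUF(rxring, slot->buf_idx);

        consume_packet(buf, slot->len);                       /* hypothetical helper */
        rxring->cur = NETMAP_RING_NEXT(rxring, rxring->cur);
        rxring->avail--;   /* the slot is returned to the kernel at the next sync */
    }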
Other features
•  Multi-queue support
–  One netmap ring is initialized for each physical ring
•  e.g., different pthreads can be assigned to different netmap/physical rings (see the sketch below)
•  Host stack support
–  The NIC is put into netmap mode, resetting its phy
–  The host stack still sees the interface, and packets can be sent to/from the NIC via "software rings"
•  Either implicitly by the kernel or explicitly by the app

From http://info.iet.unipi.it/~luigi/netmap/
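A hedged sketch of both bindings: the nr_ringid field of struct nmreq selects what a descriptor is bound to, using the NETMAP_HW_RING and NETMAP_SW_RING flags from the netmap headers; fd, host_fd and ring_id are assumed to be set up by the caller.

    struct nmreq req;

    memset(&req, 0, sizeof(req));
    req.nr_version = NETMAP_API;
    strncpy(req.nr_name, "ix0", sizeof(req.nr_name) - 1);

    /* Bind this descriptor to a single hardware ring pair, e.g. one per thread. */
    req.nr_ringid = NETMAP_HW_RING | ring_id;
    ioctl(fd, NIOCREGIF, &req);

    /* A second descriptor bound to the "software" rings exchanges packets
     * with the host stack instead of the NIC. */
    req.nr_ringid = NETMAP_SW_RING;
    ioctl(host_fd, NIOCREGIF, &req);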
Implementation
•  Available for FreeBSD and Linux
–  The Linux code is glued onto the FreeBSD code
•  Common code
–  Control path, system call backends, memory allocator, etc.
•  Device-specific code
–  Each supported driver implements a few functions (see the sketch below)
•  nm_register(struct ifnet *ifp, int onoff)
–  Put the NIC into netmap mode, allocate netmap rings and slots
•  nm_txsync(struct ifnet *ifp, u_int ring_nr)
–  Flush out the packets from the netmap ring filled by the user
•  nm_rxsync(struct ifnet *ifp, u_int ring_nr)
–  Refill the netmap ring with newly received packets
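A hedged sketch of how a driver registers these hooks with netmap_attach(); the foo_* names and softc fields are hypothetical, and the hook signatures follow the slide above rather than any particular in-tree driver.

    static int foo_netmap_reg(struct ifnet *ifp, int onoff)     { /* enter/leave netmap mode */ return 0; }
    static int foo_netmap_txsync(struct ifnet *ifp, u_int ring) { /* push user-filled slots to the NIC */ return 0; }
    static int foo_netmap_rxsync(struct ifnet *ifp, u_int ring) { /* refill slots with received packets */ return 0; }

    static void foo_netmap_attach(struct foo_softc *sc)
    {
        struct netmap_adapter na;

        bzero(&na, sizeof(na));
        na.ifp         = sc->ifp;
        na.num_tx_desc = sc->num_tx_desc;
        na.num_rx_desc = sc->num_rx_desc;
        na.nm_register = foo_netmap_reg;
        na.nm_txsync   = foo_netmap_txsync;
        na.nm_rxsync   = foo_netmap_rxsync;
        netmap_attach(&na, 1 /* number of ring pairs */);
    }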
Implementation (cont.)
•  Small modifications in device drivers

From http://info.iet.unipi.it/~luigi/netmap/
VALE (MSWITCH): A NETMAP-BASED, VERY FAST SOFTWARE SWITCH
Software switch
•  Switching packets between network interfaces
–  General-purpose OS and processor
–  Virtual and hardware interfaces
•  20 years ago
–  Prototyping, a low-performance alternative
•  Now and in the near future
–  Replacement for hardware switches
–  Hosting virtual machines (incl. virtual network functions)
Performance of today's software switches
•  Forward packets from one 10 Gbps NIC to another
–  Xeon E5-1650 (3.8 GHz), 1 CPU core used
–  Lower than 1 Gbps for minimum-sized packets

[Figure: throughput (Gbps) vs. packet size (64-1024 bytes) for the FreeBSD bridge and Open vSwitch]
Problems
1.  Inefficient packet I/O mechanism
–  Today's software switches use dynamically allocated, heavyweight metadata (e.g., sk_buff) designed for end systems
–  This should be simplified, because the packet is just forwarded
–  For switches it is more important to process small packets efficiently
2.  Inefficient packet switching algorithm
–  How to move packets from the source to the destination(s) efficiently?
–  Traditional way
•  Lock a destination, send a single packet, then unlock the destination
•  Inefficient due to locking cost/contention

[Figure: a source port forwarding packets one at a time to destination ports 1-4]
Problems (cont.)
3.  Lack of flexibility in packet processing
–  How to decide a packet's destination?
–  One could use a layer 2 learning bridge to decide a packet's destination
–  One could use OpenFlow packet matching to do so

[Figure: a "packet processing" stage sitting on top of the switch]
Solutions
1.  "Inefficient packet I/O mechanisms"
–  Simple, minimalistic packet representation (netmap API*)
•  No metadata allocation cost
•  Reduced cache pollution
2.  "Inefficient packet switching"
–  Group multiple packets going to the same destination
–  Lock the destination only once for a group of packets

[Figure: a source port forwarding grouped packets to destination ports 1-4]

* Netmap – a novel framework for fast packet I/O, Luigi Rizzo, Università di Pisa, http://info.iet.unipi.it/~luigi/netmap/
Bitmap-based forwarding algorithm
•  Algorithm in the original VALE
•  Support for unicast, multicast and broadcast
–  Get pointers to a batch of packets
–  Identify the destination(s) of each packet and represent them as a bitmap (one bit per destination port)
–  Lock each destination, and send all the packets going there
•  Problem
–  Scalability issue in the presence of many destinations

[Figure 3: bitmap-based packet forwarding — packets p0 to p4 each get a destination bitmap; the forwarder considers each destination port in turn, scanning the corresponding column of the bitmap to identify the packets bound to that port]

VALE, a Virtual Local Ethernet, Luigi Rizzo, Giuseppe Lettieri, Università di Pisa, http://info.iet.unipi.it/~luigi/vale/
List-based forwarding algorithm
•  Algorithm in the current VALE (mSwitch)
•  Support for unicast and broadcast
–  Make a linked list of packets for each destination
–  Broadcast packets are mapped to destination index 254
–  Scan each destination list; broadcast packets are inserted in order (a sketch of the grouping step follows below)

[Figure 4: list-based packet forwarding — packets p0 to p4 are appended to per-destination lists (a head/tail pair per port d0 .. d254, plus a "next" link per packet); the forwarder then walks each non-empty list]
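To make the list-based idea concrete, a hedged sketch of the grouping step in C; the field names, the 254 broadcast index and the batch layout are illustrative rather than the actual mSwitch sources.

    #include <stdint.h>

    #define SKETCH_BROADCAST 254          /* broadcast "destination" index, as above */
    #define SKETCH_MAXPORTS  255

    struct fwd_entry {
        uint32_t buf_idx;                 /* netmap buffer holding the packet */
        uint16_t len;
        uint8_t  dst;                     /* destination port, or SKETCH_BROADCAST */
        int      next;                    /* next packet for the same destination, -1 = end */
    };

    struct dst_list { int head, tail; };

    static void group_batch(struct fwd_entry *ft, int n,
                            struct dst_list dst[SKETCH_MAXPORTS])
    {
        int i, d;

        for (d = 0; d < SKETCH_MAXPORTS; d++)
            dst[d].head = dst[d].tail = -1;

        /* Single pass over the batch: append each packet to its destination list. */
        for (i = 0; i < n; i++) {
            d = ft[i].dst;
            ft[i].next = -1;
            if (dst[d].head < 0)
                dst[d].head = i;
            else
                ft[dst[d].tail].next = i;
            dst[d].tail = i;
        }
        /* The forwarder then walks only the non-empty lists, locking each
         * destination port once and copying that whole list of packets,
         * splicing broadcast packets (dst == SKETCH_BROADCAST) in order. */
    }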
Solutions (cont.)
3.  "Lack of flexibility in packet processing"
–  Separate the switching fabric from packet processing
–  Switching fabric
•  Moves packets quickly
–  Packet processing
•  Decides each packet's destination and tells the switching fabric

typedef u_int (*BDG_LOOKUP_T)(char *buf, u_int len, uint8_t *ring_nr,
                              struct netmap_adapter *srcif);

–  Return value
•  The index of the destination port for unicast
•  NM_BDG_BROADCAST for broadcast
•  NM_BDG_NOPORT for dropping this packet
–  By default L2 learning is set (a sketch of a custom lookup function follows below)
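A hedged sketch of a custom lookup function with this signature; the policy (flood group addresses, otherwise unicast to port 1) is purely illustrative, and the call that registers the function with a switch instance is not shown. The u_int type, struct netmap_adapter and the NM_BDG_* constants come from the in-kernel netmap/VALE headers.

    static u_int my_lookup(char *buf, u_int len, uint8_t *ring_nr,
                           struct netmap_adapter *srcif)
    {
        const uint8_t *dmac = (const uint8_t *)buf;  /* destination MAC: first 6 bytes */

        (void)ring_nr;
        (void)srcif;
        if (len < 14)
            return NM_BDG_NOPORT;        /* runt frame: drop it */
        if (dmac[0] & 0x01)
            return NM_BDG_BROADCAST;     /* group address: flood */
        return 1;                        /* illustrative policy: unicast to port 1 */
    }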
VALE (mSwitch) architecture

[Figure: apps/VMs attach to virtual interfaces through the netmap API, the OS stack attaches through the socket API, and NICs attach directly; a packet-processing stage sits on top of the switching fabric inside the kernel]

•  Packet forwarding (identifying a packet's destination (packet processing) and copying it to the destination ring) takes place in the sender's context
–  The receiver just consumes the packets
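Attaching an application or VM backend to a virtual port reuses the registration path shown earlier; a sketch, assuming a switch instance named vale0 and a port named vm1 (VALE ports are addressed by prefixing the port name with the switch name, e.g. "vale0:vm1").

    struct nmreq req;

    memset(&req, 0, sizeof(req));
    req.nr_version = NETMAP_API;
    /* "vale0:vm1" creates/attaches virtual port "vm1" on switch "vale0". */
    strncpy(req.nr_name, "vale0:vm1", sizeof(req.nr_name) - 1);
    ioctl(fd, NIOCREGIF, &req);
    /* From here on the port is driven exactly like a NIC in netmap mode:
     * mmap the shared region, then use NIOCTXSYNC/NIOCRXSYNC or poll(). */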
Other features
•  Indirect buffer support
–  A netmap slot can carry a pointer to the actual buffer
–  Useful to eliminate the data copy from a VM's backend into a netmap slot
•  Support for large packets
–  Multiple netmap slots (2048 bytes each by default) can be used to hold a single packet (see the sketch below)
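A hedged sketch of how a hypervisor backend might use these two features; the NS_INDIRECT and NS_MOREFRAG slot flags and the slot ptr field follow the netmap headers, while guest_buf, guest_len, tail_len and the two-slot split are illustrative.

    /* Indirect buffer: point the slot at the guest's buffer instead of
     * copying the data into the netmap-owned buffer. */
    struct netmap_slot *s0 = &txring->slot[txring->cur];
    s0->ptr   = (uintptr_t)guest_buf;
    s0->len   = guest_len;
    s0->flags = NS_INDIRECT;

    /* Large packet spanning two slots: every slot except the last one
     * sets NS_MOREFRAG so the switch treats them as one frame. */
    uint32_t next = NETMAP_RING_NEXT(txring, txring->cur);
    struct netmap_slot *s1 = &txring->slot[next];
    s0->flags |= NS_MOREFRAG;     /* more fragments follow */
    s1->len    = tail_len;        /* remaining bytes of the same packet */
    s1->flags  = 0;               /* last fragment */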
Bare mSwitch performance
•  NIC to NIC (10 Gbps)
•  "Dummy" processing module

[Figure 5(a), NIC to NIC: throughput (Gbps) vs. packet size (64-1518 bytes), and throughput vs. CPU clock frequency (1.3-3.8 GHz) for 64B/128B/256B packets, against the 10 Gbps line rate]

Experiments are done with a Xeon E5-1650 CPU (6 cores, 3.8 GHz with Turbo Boost), 16 GB DDR3 RAM (quad channel) and Intel X520-T2 10 Gbps NICs
Bare mSwitch performance
•  Virtual port to virtual port
•  "Dummy" processing module

[Figure 6: forwarding performance between two virtual ports — throughput (Gbps) vs. packet size (60 bytes to 64 KB) for 1, 2 and 3 CPU cores]
Bare mSwitch performance
•  "Dummy" packet processing module
•  N virtual ports to N virtual ports

[Figure 7: switching capacity with an increasing number of virtual ports, for unicast and broadcast with 64B/1514B/64KB packets. For unicast, each src/dst port pair is assigned a single CPU core; for broadcast, each port is given a core. For setups with more than 6 ports (the system has 6 cores), cores are assigned in a round-robin fashion.]
mSwitch's Scalability
•  A single virtual port to many virtual ports
–  Bitmap- vs. list-based algorithm
–  The list-based algorithm scales very well

[Figure 9: aggregate forwarding throughput from a single sender to a large number of active destination ports (1-250), comparing mSwitch's list-based algorithm to VALE's bitmap-based one; minimum-sized packets, single CPU core]
Learning bridge performance
•  mSwitch-learn: pure learning bridge processing
–  Adds the cost of MAC address hashing at packet processing

[Figure: (a) layer 2 learning bridge and (b) 3-tuple filter (user-space network stack support) — throughput (Gbps) vs. packet size (64-1024 bytes) for the FreeBSD bridge, mSwitch-learn and mSwitch-3-tuple]
Open vSwitch acceleration
•  mSwitch-OVS: mSwitch with Open vSwitch's packet processing

[Figure (c), Open vSwitch: throughput (Gbps) vs. packet size (64-1024 bytes) for OVS and mSwitch-OVS]
Conclusion
•  Our contribution
–  VALE (mSwitch): a fast, modular software switch
–  Very fast packet forwarding on bare metal
•  200 Gbps between virtual ports (with 1500-byte packets and 3 CPU cores)
•  Almost line rate using 1 CPU core and two 10 Gbps NICs
–  Useful for implementing various systems
•  Very fast learning bridge
•  Accelerates Open vSwitch by up to 2.6 times
–  Small modifications, preserving the control interface
•  Fast protocol multiplexer/demultiplexer for user-space protocol stacks
•  Code (Linux, FreeBSD) is available at:
–  http://info.iet.unipi.it/~luigi/netmap/
