Recent advances in netmap/VALE (mSwitch)
Michio Honda, Felipe Huici (NEC Europe Ltd.)
Giuseppe Lettieri and Luigi Rizzo (Università di Pisa)
Kernel/VM@Jimbo-cho, Japan, Dec. 8, 2013
michio.honda@neclab.eu / @michioh
Outline	
•  netmap API basics	

–  Architecture	

–  How to write apps	


•  VALE (mSwitch)	

–  Architecture	

–  System design	

–  Evaluation	

–  Use cases
NETMAP API BASICS
netmap overview
•  A fast packet I/O mechanism between the NIC and user space
–  Removes unnecessary metadata (e.g., sk_buff) allocation
–  Amortized system call costs, reduced/removed data copies

From http://info.iet.unipi.it/~luigi/netmap/
Performance
•  Saturates a 10 Gbps pipe at low CPU frequency

From http://info.iet.unipi.it/~luigi/netmap/
netmap API (initialization)
•  open("/dev/netmap") returns a file descriptor
•  ioctl(fd, NIOCREG, arg) puts an interface in netmap mode
•  mmap(…, fd, 0) maps buffers and rings

From http://info.iet.unipi.it/~luigi/netmap/
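For concreteness, a minimal sketch of this setup sequence in C. It follows the public netmap headers of this era, where the registration ioctl is spelled NIOCREGIF; the interface name and the missing error handling are simplifications.

    #include <fcntl.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <net/if.h>
    #include <net/netmap.h>
    #include <net/netmap_user.h>

    static struct netmap_if *netmap_open(const char *ifname, int *fdp)
    {
        struct nmreq req;
        int fd = open("/dev/netmap", O_RDWR);

        memset(&req, 0, sizeof(req));
        req.nr_version = NETMAP_API;
        strncpy(req.nr_name, ifname, sizeof(req.nr_name) - 1);
        ioctl(fd, NIOCREGIF, &req);            /* NIOCREG in the slide's shorthand */

        /* Map the shared region holding netmap rings and packet buffers. */
        char *mem = mmap(NULL, req.nr_memsize, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
        *fdp = fd;
        return NETMAP_IF(mem, req.nr_offset);  /* per-interface descriptor */
    }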
netmap API (TX)
•  TX
–  Fill up to avail buffers, starting from slot cur
–  ioctl(fd, NIOCTXSYNC) queues the packets
•  poll() can be used for blocking I/O

From http://info.iet.unipi.it/~luigi/netmap/
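A minimal TX loop sketch against the ring layout of this netmap generation (cur/avail fields and the NETMAP_TXRING/NETMAP_BUF/NETMAP_RING_NEXT macros from the netmap headers); build_packet() is a hypothetical helper that writes a frame into the buffer and returns its length.

    struct netmap_ring *txring = NETMAP_TXRING(nifp, 0);     /* first TX ring */

    while (txring->avail > 0) {
        struct netmap_slot *slot = &txring->slot[txring->cur];
        char *buf = NETMAP_BUF(txring, slot->buf_idx);

        slot->len = build_packet(buf);                        /* hypothetical helper */
        txring->cur = NETMAP_RING_NEXT(txring, txring->cur);
        txring->avail--;
    }
    ioctl(fd, NIOCTXSYNC, NULL);   /* hand the filled slots to the NIC */
    /* Alternatively, poll() with POLLOUT blocks until TX slots are available. */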
netmap API (RX)
•  RX
–  ioctl(fd, NIOCRXSYNC) reports newly received packets
–  Process up to avail buffers, starting from slot cur
•  poll() can be used for blocking I/O

From http://info.iet.unipi.it/~luigi/netmap/
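The matching RX loop, again as a sketch; consume_packet() is a hypothetical helper that processes one received frame.

    ioctl(fd, NIOCRXSYNC, NULL);   /* or poll() with POLLIN for blocking I/O */

    struct netmap_ring *rxring = NETMAP_RXRING(nifp, 0);     /* first RX ring */

    while (rxring->avail > 0) {
        struct netmap_slot *slot = &rxring->slot[rxring->cur];
        char *buf = NETMAP_BUF(rxring, slot->buf_idx);

        consume_packet(buf, slot->len);                       /* hypothetical helper */
        rxring->cur = NETMAP_RING_NEXT(rxring, rxring->cur);
        rxring->avail--;   /* the slot is returned to the kernel at the next sync */
    }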
Other features
•  Multi-queue support
–  One netmap ring is initialized for each physical ring
•  e.g., different pthreads can be assigned to different netmap/physical rings (see the sketch below)
•  Host stack support
–  The NIC is put into netmap mode, resetting its phy
–  The host stack still sees the interface, and packets can be sent to/from the NIC via "software rings"
•  Either implicitly by the kernel or explicitly by the app

From http://info.iet.unipi.it/~luigi/netmap/
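A hedged sketch of both bindings: the nr_ringid field of struct nmreq selects what a descriptor is bound to, using the NETMAP_HW_RING and NETMAP_SW_RING flags from the netmap headers; fd, host_fd and ring_id are assumed to be set up by the caller.

    struct nmreq req;

    memset(&req, 0, sizeof(req));
    req.nr_version = NETMAP_API;
    strncpy(req.nr_name, "ix0", sizeof(req.nr_name) - 1);

    /* Bind this descriptor to a single hardware ring pair, e.g. one per thread. */
    req.nr_ringid = NETMAP_HW_RING | ring_id;
    ioctl(fd, NIOCREGIF, &req);

    /* A second descriptor bound to the "software" rings exchanges packets
     * with the host stack instead of the NIC. */
    req.nr_ringid = NETMAP_SW_RING;
    ioctl(host_fd, NIOCREGIF, &req);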
Implementation
•  Available for FreeBSD and Linux
–  The Linux code is glued onto the FreeBSD code
•  Common code
–  Control path, system call backends, memory allocator, etc.
•  Device-specific code
–  Each supported driver implements a few functions (see the sketch below)
•  nm_register(struct ifnet *ifp, int onoff)
–  Put the NIC into netmap mode, allocate netmap rings and slots
•  nm_txsync(struct ifnet *ifp, u_int ring_nr)
–  Flush out the packets from the netmap ring filled by the user
•  nm_rxsync(struct ifnet *ifp, u_int ring_nr)
–  Refill the netmap ring with newly received packets
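A hedged sketch of how a driver registers these hooks with netmap_attach(); the foo_* names and softc fields are hypothetical, and the hook signatures follow the slide above rather than any particular in-tree driver.

    static int foo_netmap_reg(struct ifnet *ifp, int onoff)     { /* enter/leave netmap mode */ return 0; }
    static int foo_netmap_txsync(struct ifnet *ifp, u_int ring) { /* push user-filled slots to the NIC */ return 0; }
    static int foo_netmap_rxsync(struct ifnet *ifp, u_int ring) { /* refill slots with received packets */ return 0; }

    static void foo_netmap_attach(struct foo_softc *sc)
    {
        struct netmap_adapter na;

        bzero(&na, sizeof(na));
        na.ifp         = sc->ifp;
        na.num_tx_desc = sc->num_tx_desc;
        na.num_rx_desc = sc->num_rx_desc;
        na.nm_register = foo_netmap_reg;
        na.nm_txsync   = foo_netmap_txsync;
        na.nm_rxsync   = foo_netmap_rxsync;
        netmap_attach(&na, 1 /* number of ring pairs */);
    }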
Implementation (cont.)
•  Small modifications in device drivers

From http://info.iet.unipi.it/~luigi/netmap/
VALE (MSWITCH): A NETMAP-BASED, VERY FAST SOFTWARE SWITCH
Software switch
•  Switching packets between network interfaces
–  General-purpose OS and processor
–  Virtual and hardware interfaces
•  20 years ago
–  Prototyping, a low-performance alternative
•  Now and in the near future
–  Replacement for hardware switches
–  Hosting virtual machines (incl. virtual network functions)
Performance of today's software switches
•  Forward packets from one 10 Gbps NIC to another
–  Xeon E5-1650 (3.8 GHz), 1 CPU core used
–  Lower than 1 Gbps for minimum-sized packets

[Figure: throughput (Gbps) vs. packet size (64-1024 bytes) for the FreeBSD bridge and Open vSwitch]
Problems
1.  Inefficient packet I/O mechanism
–  Today's software switches use dynamically allocated, heavyweight metadata (e.g., sk_buff) designed for end systems
–  This should be simplified, because the packet is just forwarded
–  For switches it is more important to process small packets efficiently
2.  Inefficient packet switching algorithm
–  How to move packets from the source to the destination(s) efficiently?
–  Traditional way
•  Lock a destination, send a single packet, then unlock the destination
•  Inefficient due to locking cost/contention

[Figure: a source port forwarding packets one at a time to destination ports 1-4]
Problems (cont.)
3.  Lack of flexibility in packet processing
–  How to decide a packet's destination?
–  One could use a layer 2 learning bridge to decide a packet's destination
–  One could use OpenFlow packet matching to do so

[Figure: a "packet processing" stage sitting on top of the switch]
Solutions
1.  "Inefficient packet I/O mechanisms"
–  Simple, minimalistic packet representation (netmap API*)
•  No metadata allocation cost
•  Reduced cache pollution
2.  "Inefficient packet switching"
–  Group multiple packets going to the same destination
–  Lock the destination only once for a group of packets

[Figure: a source port forwarding grouped packets to destination ports 1-4]

* Netmap – a novel framework for fast packet I/O, Luigi Rizzo, Università di Pisa, http://info.iet.unipi.it/~luigi/netmap/
Bitmap-based forwarding algorithm
•  Algorithm in the original VALE
•  Support for unicast, multicast and broadcast
–  Get pointers to a batch of packets
–  Identify the destination(s) of each packet and represent them as a bitmap (one bit per destination port)
–  Lock each destination, and send all the packets going there
•  Problem
–  Scalability issue in the presence of many destinations

[Figure 3: bitmap-based packet forwarding — packets p0 to p4 each get a destination bitmap; the forwarder considers each destination port in turn, scanning the corresponding column of the bitmap to identify the packets bound to that port]

VALE, a Virtual Local Ethernet, Luigi Rizzo, Giuseppe Lettieri, Università di Pisa, http://info.iet.unipi.it/~luigi/vale/
List-based forwarding algorithm
•  Algorithm in the current VALE (mSwitch)
•  Support for unicast and broadcast
–  Make a linked list of packets for each destination
–  Broadcast packets are mapped to destination index 254
–  Scan each destination list; broadcast packets are inserted in order (a sketch of the grouping step follows below)

[Figure 4: list-based packet forwarding — packets p0 to p4 are appended to per-destination lists (a head/tail pair per port d0 .. d254, plus a "next" link per packet); the forwarder then walks each non-empty list]
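To make the list-based idea concrete, a hedged sketch of the grouping step in C; the field names, the 254 broadcast index and the batch layout are illustrative rather than the actual mSwitch sources.

    #include <stdint.h>

    #define SKETCH_BROADCAST 254          /* broadcast "destination" index, as above */
    #define SKETCH_MAXPORTS  255

    struct fwd_entry {
        uint32_t buf_idx;                 /* netmap buffer holding the packet */
        uint16_t len;
        uint8_t  dst;                     /* destination port, or SKETCH_BROADCAST */
        int      next;                    /* next packet for the same destination, -1 = end */
    };

    struct dst_list { int head, tail; };

    static void group_batch(struct fwd_entry *ft, int n,
                            struct dst_list dst[SKETCH_MAXPORTS])
    {
        int i, d;

        for (d = 0; d < SKETCH_MAXPORTS; d++)
            dst[d].head = dst[d].tail = -1;

        /* Single pass over the batch: append each packet to its destination list. */
        for (i = 0; i < n; i++) {
            d = ft[i].dst;
            ft[i].next = -1;
            if (dst[d].head < 0)
                dst[d].head = i;
            else
                ft[dst[d].tail].next = i;
            dst[d].tail = i;
        }
        /* The forwarder then walks only the non-empty lists, locking each
         * destination port once and copying that whole list of packets,
         * splicing broadcast packets (dst == SKETCH_BROADCAST) in order. */
    }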
Solutions (cont.)
3.  "Lack of flexibility in packet processing"
–  Separate the switching fabric from packet processing
–  Switching fabric
•  Moves packets quickly
–  Packet processing
•  Decides each packet's destination and tells the switching fabric

typedef u_int (*BDG_LOOKUP_T)(char *buf, u_int len, uint8_t *ring_nr,
                              struct netmap_adapter *srcif);

–  Return value
•  The index of the destination port for unicast
•  NM_BDG_BROADCAST for broadcast
•  NM_BDG_NOPORT for dropping this packet
–  By default L2 learning is set (a sketch of a custom lookup function follows below)
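A hedged sketch of a custom lookup function with this signature; the policy (flood group addresses, otherwise unicast to port 1) is purely illustrative, and the call that registers the function with a switch instance is not shown. The u_int type, struct netmap_adapter and the NM_BDG_* constants come from the in-kernel netmap/VALE headers.

    static u_int my_lookup(char *buf, u_int len, uint8_t *ring_nr,
                           struct netmap_adapter *srcif)
    {
        const uint8_t *dmac = (const uint8_t *)buf;  /* destination MAC: first 6 bytes */

        (void)ring_nr;
        (void)srcif;
        if (len < 14)
            return NM_BDG_NOPORT;        /* runt frame: drop it */
        if (dmac[0] & 0x01)
            return NM_BDG_BROADCAST;     /* group address: flood */
        return 1;                        /* illustrative policy: unicast to port 1 */
    }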
VALE (mSwitch) architecture

[Figure: apps/VMs attach to virtual interfaces through the netmap API, the OS stack attaches through the socket API, and NICs attach directly; a packet-processing stage sits on top of the switching fabric inside the kernel]

•  Packet forwarding (identifying a packet's destination (packet processing) and copying it to the destination ring) takes place in the sender's context
–  The receiver just consumes the packets
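Attaching an application or VM backend to a virtual port reuses the registration path shown earlier; a sketch, assuming a switch instance named vale0 and a port named vm1 (VALE ports are addressed by prefixing the port name with the switch name, e.g. "vale0:vm1").

    struct nmreq req;

    memset(&req, 0, sizeof(req));
    req.nr_version = NETMAP_API;
    /* "vale0:vm1" creates/attaches virtual port "vm1" on switch "vale0". */
    strncpy(req.nr_name, "vale0:vm1", sizeof(req.nr_name) - 1);
    ioctl(fd, NIOCREGIF, &req);
    /* From here on the port is driven exactly like a NIC in netmap mode:
     * mmap the shared region, then use NIOCTXSYNC/NIOCRXSYNC or poll(). */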
Other features
•  Indirect buffer support
–  A netmap slot can carry a pointer to the actual buffer
–  Useful to eliminate the data copy from a VM's backend into a netmap slot
•  Support for large packets
–  Multiple netmap slots (2048 bytes each by default) can be used to hold a single packet (see the sketch below)
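A hedged sketch of how a hypervisor backend might use these two features; the NS_INDIRECT and NS_MOREFRAG slot flags and the slot ptr field follow the netmap headers, while guest_buf, guest_len, tail_len and the two-slot split are illustrative.

    /* Indirect buffer: point the slot at the guest's buffer instead of
     * copying the data into the netmap-owned buffer. */
    struct netmap_slot *s0 = &txring->slot[txring->cur];
    s0->ptr   = (uintptr_t)guest_buf;
    s0->len   = guest_len;
    s0->flags = NS_INDIRECT;

    /* Large packet spanning two slots: every slot except the last one
     * sets NS_MOREFRAG so the switch treats them as one frame. */
    uint32_t next = NETMAP_RING_NEXT(txring, txring->cur);
    struct netmap_slot *s1 = &txring->slot[next];
    s0->flags |= NS_MOREFRAG;     /* more fragments follow */
    s1->len    = tail_len;        /* remaining bytes of the same packet */
    s1->flags  = 0;               /* last fragment */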
Bare mSwitch performance
•  NIC to NIC (10 Gbps)
•  "Dummy" processing module

[Figure 5(a), NIC to NIC: throughput (Gbps) vs. packet size (64-1518 bytes), and throughput vs. CPU clock frequency (1.3-3.8 GHz) for 64B/128B/256B packets, against the 10 Gbps line rate]

Experiments are done with a Xeon E5-1650 CPU (6 cores, 3.8 GHz with Turbo Boost), 16 GB DDR3 RAM (quad channel) and Intel X520-T2 10 Gbps NICs
Bare mSwitch performance
•  Virtual port to virtual port
•  "Dummy" processing module

[Figure 6: forwarding performance between two virtual ports — throughput (Gbps) vs. packet size (60 bytes to 64 KB) for 1, 2 and 3 CPU cores]
Bare mSwitch performance
•  "Dummy" packet processing module
•  N virtual ports to N virtual ports

[Figure 7: switching capacity with an increasing number of virtual ports, for unicast and broadcast with 64B/1514B/64KB packets. For unicast, each src/dst port pair is assigned a single CPU core; for broadcast, each port is given a core. For setups with more than 6 ports (the system has 6 cores), cores are assigned in a round-robin fashion.]
mSwitch's Scalability
•  A single virtual port to many virtual ports
–  Bitmap- vs. list-based algorithm
–  The list-based algorithm scales very well

[Figure 9: aggregate forwarding throughput from a single sender to a large number of active destination ports (1-250), comparing mSwitch's list-based algorithm to VALE's bitmap-based one; minimum-sized packets, single CPU core]
Learning bridge performance
•  mSwitch-learn: pure learning bridge processing
–  Adds the cost of MAC address hashing at packet processing

[Figure: (a) layer 2 learning bridge and (b) 3-tuple filter (user-space network stack support) — throughput (Gbps) vs. packet size (64-1024 bytes) for the FreeBSD bridge, mSwitch-learn and mSwitch-3-tuple]
Open vSwitch acceleration
•  mSwitch-OVS: mSwitch with Open vSwitch's packet processing

[Figure (c), Open vSwitch: throughput (Gbps) vs. packet size (64-1024 bytes) for OVS and mSwitch-OVS]
Conclusion
•  Our contribution
–  VALE (mSwitch): a fast, modular software switch
–  Very fast packet forwarding on bare metal
•  200 Gbps between virtual ports (with 1500-byte packets and 3 CPU cores)
•  Almost line rate using 1 CPU core and two 10 Gbps NICs
–  Useful for implementing various systems
•  Very fast learning bridge
•  Accelerates Open vSwitch by up to 2.6 times
–  Small modifications, preserving the control interface
•  Fast protocol multiplexer/demultiplexer for user-space protocol stacks
•  Code (Linux, FreeBSD) is available at:
–  http://info.iet.unipi.it/~luigi/netmap/
