Fast Userspace OVS with
AF_XDP
OVS Conference 2018
William Tu, VMware Inc
Outline
• AF_XDP Introduction
• OVS AF_XDP netdev
• Performance Optimizations
Linux AF_XDP
• A new socket type that receives/sends raw frames at high speed
• Uses an XDP (eXpress Data Path) program to trigger receive
• A userspace program manages the Rx/Tx rings and the Fill/Completion rings (socket setup sketched below)
• Zero-copy from the DMA buffer to userspace memory, with driver support
• Ingress/egress performance > 20Mpps [1]
Figure from “DPDK PMD for AF_XDP”, Zhang Qi
[1] The Path to DPDK Speeds for AF XDP, Linux Plumber 2018
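For orientation, below is a minimal sketch of how a userspace program opens an AF_XDP socket using only the raw kernel UAPI (<linux/if_xdp.h>): register a umem, size the four rings, and bind to one device queue. It is illustrative only; it omits the ring mmap (XDP_MMAP_OFFSETS), the XDP redirect program, and error handling, and it is not the OVS code.

/* Minimal AF_XDP socket setup sketch (illustrative only, no error handling). */
#include <linux/if_xdp.h>
#include <net/if.h>
#include <sys/mman.h>
#include <sys/socket.h>

#ifndef AF_XDP
#define AF_XDP 44            /* may be missing from older libc headers */
#endif
#ifndef SOL_XDP
#define SOL_XDP 283
#endif

#define NUM_FRAMES 4096
#define FRAME_SIZE 2048      /* matches the 2KB umem chunks used later */
#define RING_SIZE  2048

int xsk_setup(const char *ifname, int queue_id)
{
    int fd = socket(AF_XDP, SOCK_RAW, 0);

    /* umem: one large, page-aligned buffer carved into fixed-size chunks. */
    void *umem = mmap(NULL, NUM_FRAMES * FRAME_SIZE, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    struct xdp_umem_reg reg = {
        .addr = (unsigned long) umem,
        .len = NUM_FRAMES * FRAME_SIZE,
        .chunk_size = FRAME_SIZE,
        .headroom = 0,
    };
    setsockopt(fd, SOL_XDP, XDP_UMEM_REG, &reg, sizeof reg);

    /* Size the Fill/Completion rings (umem) and the Rx/Tx rings (socket). */
    int sz = RING_SIZE;
    setsockopt(fd, SOL_XDP, XDP_UMEM_FILL_RING, &sz, sizeof sz);
    setsockopt(fd, SOL_XDP, XDP_UMEM_COMPLETION_RING, &sz, sizeof sz);
    setsockopt(fd, SOL_XDP, XDP_RX_RING, &sz, sizeof sz);
    setsockopt(fd, SOL_XDP, XDP_TX_RING, &sz, sizeof sz);

    /* Bind to one queue of the device; XDP_ZEROCOPY can be forced in
     * sxdp_flags when the driver supports it. */
    struct sockaddr_xdp addr = {
        .sxdp_family = AF_XDP,
        .sxdp_ifindex = if_nametoindex(ifname),
        .sxdp_queue_id = queue_id,
    };
    bind(fd, (struct sockaddr *) &addr, sizeof addr);
    return fd;
}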
OVS-AF_XDP Netdev
Goal
• Use the AF_XDP socket as a fast channel to the userspace OVS datapath, dpif-netdev
• Flow processing happens in userspace
[Diagram: ovs-vswitchd's userspace datapath connects to the kernel driver + XDP through an AF_XDP socket, a high-speed channel that bypasses the kernel network stack]
OVS-AF_XDP Architecture
Existing
• netdev: abstraction layer for network devices
• dpif: datapath interface
• dpif-netdev: userspace implementation of the OVS datapath
New
• Kernel: XDP program and eBPF map
• AF_XDP netdev: implementation of the afxdp device
ovs/Documentation/topics/porting.rst
OVS AF_XDP Configuration
# ./configure
# make && make install
# make check-afxdp
# ovs-vsctl add-br br0 -- set Bridge br0 datapath_type=netdev
# ovs-vsctl add-port br0 enp2s0 -- set int enp2s0 type="afxdp"
Based on v3 patch: [ovs-dev] [PATCHv3 RFC 0/3] AF_XDP netdev support for OVS
Prototype Evaluation
• Sender sends 64-byte packets at 20Mpps to one port; measure the receive packet rate at the other port
• Measure single-flow, single-core performance with Linux kernel 4.19-rc3 and OVS master
• Enable AF_XDP zero-copy mode
• Performance goal: 20Mpps rxdrop
Testbed: 16-core Intel Xeon E5-2620 v3 2.4GHz, 32GB memory. A DPDK packet generator (Intel XL710 40GbE) sends 20Mpps to the device under test, a Netronome NFP-4000 with AF_XDP; the userspace datapath bridge br0 handles ingress/egress on enp2s0.
Time Budget
Budget your packets like you budget your money.
To achieve 20Mpps
• Budget per packet: 50ns
• On a 2.4GHz CPU: 120 cycles per packet
Facts [1]
• Cache miss: 32ns; x86 LOCK prefix: 8.25ns
• System call with/without SELinux auditing: 75ns / 42ns
Batch of 32 packets
• Budget per batch: 50ns x 32 = 1.6us (see the check below)
[1] Improving Linux networking performance, LWN, https://lwn.net/Articles/629155/, Jesper Brouer
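The budget numbers above follow directly from the target packet rate and the clock speed; a quick back-of-the-envelope check:

/* Back-of-the-envelope check of the per-packet time budget. */
#include <stdio.h>

int main(void)
{
    double pps = 20e6;                        /* target rate: 20 Mpps       */
    double hz  = 2.4e9;                       /* CPU clock: 2.4 GHz         */
    double ns_per_pkt     = 1e9 / pps;        /* 50 ns per packet           */
    double cycles_per_pkt = hz / pps;         /* 120 cycles per packet      */
    double us_per_batch   = ns_per_pkt * 32 / 1000;  /* 1.6 us per batch of 32 */

    printf("%.0f ns/packet, %.0f cycles/packet, %.1f us/batch\n",
           ns_per_pkt, cycles_per_pkt, us_per_batch);
    return 0;
}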
Optimization 1/5
• OVS pmd (poll-mode driver) netdev for rx/tx
• Before: call the poll() syscall and wait for new I/O
• After: a dedicated thread busy-polls the Rx ring (sketched after the snippet below)
• Effect: avoids system call overhead
+const struct netdev_class netdev_afxdp_class = {
+    NETDEV_LINUX_CLASS_COMMON,
+    .type = "afxdp",
+    .is_pmd = true,
     .construct = netdev_linux_construct,
     .get_stats = netdev_internal_get_stats,
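The before/after difference can be illustrated as follows; this is a sketch only, and afxdp_rxq / afxdp_rxq_recv_batch are hypothetical placeholder names rather than the actual OVS netdev-afxdp code.

/* Sketch: blocking receive vs. PMD busy polling (placeholder names). */
#include <poll.h>

struct dp_packet_batch;                         /* OVS batch type (opaque here) */
struct afxdp_rxq;                               /* wraps the AF_XDP Rx ring     */
int afxdp_rxq_recv_batch(struct afxdp_rxq *rxq,
                         struct dp_packet_batch *batch);   /* nonblocking */

/* Before: sleep in the kernel until the socket is readable, paying a
 * system call (and wakeup latency) on every iteration. */
void rx_loop_blocking(struct afxdp_rxq *rxq, int xsk_fd,
                      struct dp_packet_batch *batch)
{
    struct pollfd pfd = { .fd = xsk_fd, .events = POLLIN };
    for (;;) {
        poll(&pfd, 1, -1);                      /* one syscall per wakeup */
        afxdp_rxq_recv_batch(rxq, batch);
    }
}

/* After (.is_pmd = true): a dedicated PMD thread spins on the Rx ring in
 * userspace; no system call on the receive path. */
void rx_loop_pmd(struct afxdp_rxq *rxq, struct dp_packet_batch *batch)
{
    for (;;) {
        if (afxdp_rxq_recv_batch(rxq, batch) > 0) {
            /* hand the batch to flow processing (dp_netdev_input) */
        }
    }
}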
Optimization 2/5
• Packet metadata pre-allocation
• Before: allocate metadata when packets are received
• After: pre-allocate the metadata and initialize it once (see the sketch below)
• Effect:
  • Reduces the number of per-packet operations
  • Reduces cache misses
[Diagram: packet data lives in multiple 2KB umem chunks; packet metadata (struct dp_packet) lives in a contiguous memory region whose slots map one-to-one to the AF_XDP umem chunks]
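A sketch of the one-to-one layout described above: metadata slots are allocated and initialized once at setup time, and a received descriptor's umem address indexes straight into the metadata array. The names and fields (umem_pool, md_slot) are illustrative only, not the actual OVS structures.

/* Sketch: pre-allocated packet metadata, one slot per 2KB umem chunk. */
#include <stdint.h>
#include <stdlib.h>

#define FRAME_SIZE 2048
#define NUM_FRAMES 4096

struct md_slot {                   /* stands in for struct dp_packet */
    void     *data;                /* points into the umem chunk     */
    uint32_t  len;
    /* ... flow metadata, offsets, etc. ...                          */
};

struct umem_pool {
    void           *buf;           /* NUM_FRAMES * FRAME_SIZE bytes  */
    struct md_slot *md;            /* NUM_FRAMES contiguous slots    */
};

/* Done once at setup time: every per-packet field that never changes is
 * initialized here instead of on the receive path. */
static void md_prealloc(struct umem_pool *p)
{
    p->md = calloc(NUM_FRAMES, sizeof *p->md);
    for (int i = 0; i < NUM_FRAMES; i++) {
        p->md[i].data = (char *) p->buf + (size_t) i * FRAME_SIZE;
    }
}

/* On receive: the Rx descriptor's umem address maps directly to its
 * pre-initialized metadata slot; no allocation, fewer cold cache lines. */
static inline struct md_slot *md_from_umem_addr(struct umem_pool *p,
                                                uint64_t addr)
{
    return &p->md[addr / FRAME_SIZE];
}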
Optimizations 3-5
• Packet data memory pool for AF_XDP
  • Fast data structure to get and put free memory chunks
  • Effect: reduces cache misses
• Dedicated packet data pool per device queue
  • Effect: consumes more memory but avoids a mutex lock
• Batching the sendmsg system call
  • Effect: reduces the system call rate
(The free list and batched send are sketched after the reference below.)
Reference: Bringing the Power of eBPF to Open vSwitch, Linux Plumber 2018
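Below is a sketch of optimizations 3 and 5: a per-queue LIFO free list of umem chunks (the umem_elem_push / umem_elem_pop symbols visible in the perf profiles later; the signatures here are assumed), plus a transmit path that fills the Tx ring for a whole batch and issues a single send syscall (sendto here) per batch.

/* Sketch: per-queue umem free list and batched send (assumed signatures). */
#include <stdint.h>
#include <sys/socket.h>

#define BATCH_SIZE 32

struct umem_elem {                  /* stored inside the free chunk itself */
    struct umem_elem *next;
};

struct umem_freelist {              /* one per device queue: no mutex needed */
    struct umem_elem *head;
};

static inline void umem_elem_push(struct umem_freelist *f, void *chunk)
{
    struct umem_elem *e = chunk;
    e->next = f->head;              /* LIFO keeps recently used chunks hot */
    f->head = e;
}

static inline void *umem_elem_pop(struct umem_freelist *f)
{
    struct umem_elem *e = f->head;
    if (e) {
        f->head = e->next;
    }
    return e;
}

/* Batched transmit: enqueue up to 32 descriptors on the Tx ring, then kick
 * the kernel once instead of once per packet. */
void afxdp_send_batch(int xsk_fd, const uint64_t addrs[],
                      const uint32_t lens[], int n)
{
    for (int i = 0; i < n && i < BATCH_SIZE; i++) {
        /* write addrs[i] / lens[i] into the next Tx ring descriptor ... */
    }
    /* ... bump the Tx ring producer index, then one syscall per batch: */
    sendto(xsk_fd, NULL, 0, MSG_DONTWAIT, NULL, 0);
}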
Performance Evaluation
OVS AF_XDP RX drop
# ovs-ofctl add-flow br0 "in_port=enp2s0, actions=drop"
# ovs-appctl pmd-stats-show
[Diagram: packets received on enp2s0 are dropped in br0]
pmd-stats-show (rxdrop)
pmd thread numa_id 0 core_id 11:
packets received: 2069687732
packet recirculations: 0
avg. datapath passes per packet: 1.00
emc hits: 2069687636
smc hits: 0
megaflow hits: 95
avg. subtable lookups per megaflow hit: 1.00
miss with success upcall: 1
miss with failed upcall: 0
avg. packets per output batch: 0.00
idle cycles: 4196235931 (1.60%)
processing cycles: 258609877383 (98.40%)
avg cycles per packet: 126.98 (262806113314/2069687732)
avg processing cycles per packet: 124.95 (258609877383/2069687732)
120-cycle budget for 20Mpps
perf record -p `pidof ovs-vswitchd` sleep 10
26.91% pmd7 ovs-vswitchd [.] netdev_linux_rxq_xsk
26.38% pmd7 ovs-vswitchd [.] dp_netdev_input__
24.65% pmd7 ovs-vswitchd [.] miniflow_extract
6.87% pmd7 libc-2.23.so [.] __memcmp_sse4_1
3.27% pmd7 ovs-vswitchd [.] umem_elem_push
3.06% pmd7 ovs-vswitchd [.] odp_execute_actions
2.03% pmd7 ovs-vswitchd [.] umem_elem_pop
top
  PID USER  PR NI   VIRT   RES  SHR S  %CPU %MEM     TIME+ COMMAND
   16 root  20  0      0     0    0 R 100.0  0.0  75:16.85 ksoftirqd/1
21088 root  20  0 451400 52656 4968 S 100.0  0.2   6:58.70 ovs-vswitchd
Mempool overhead: umem_elem_push / umem_elem_pop
OVS AF_XDP l2fwd
# ovs-ofctl add-flow br0 "in_port=enp2s0 actions=set_field:14->in_port,set_field:a0:36:9f:33:b1:40->dl_src,enp2s0"
[Diagram: packets received on enp2s0 are rewritten and sent back out enp2s0 through br0]
pmd-stats-show (l2fwd)
pmd thread numa_id 0 core_id 11:
packets received: 868900288
packet recirculations: 0
avg. datapath passes per packet: 1.00
emc hits: 868900164
smc hits: 0
megaflow hits: 122
avg. subtable lookups per megaflow hit: 1.00
miss with success upcall: 2
miss with failed upcall: 0
avg. packets per output batch: 30.57
idle cycles: 3344425951 (2.09%)
processing cycles: 157004675952 (97.91%)
avg cycles per packet: 184.54 (160349101903/868900288)
avg processing cycles per packet: 180.69 (157004675952/868900288)
Extra ~55 cycles for send
perf record -p `pidof ovs-vswitchd` sleep 10
25.92% pmd7 ovs-vswitchd [.] netdev_linux_rxq_xsk
17.75% pmd7 ovs-vswitchd [.] dp_netdev_input__
16.55% pmd7 ovs-vswitchd [.] netdev_linux_send
16.10% pmd7 ovs-vswitchd [.] miniflow_extract
 4.78% pmd7 libc-2.23.so  [.] __memcmp_sse4_1
 3.67% pmd7 ovs-vswitchd [.] dp_execute_cb
 2.86% pmd7 ovs-vswitchd [.] __umem_elem_push
 2.46% pmd7 ovs-vswitchd [.] __umem_elem_pop
 1.96% pmd7 ovs-vswitchd [.] non_atomic_ullong_add
 1.69% pmd7 ovs-vswitchd [.] dp_netdev_pmd_flush_output_on_port
top results are similar to rxdrop
Mempool overhead: __umem_elem_push / __umem_elem_pop
AF_XDP PVP Performance
• QEMU 3.0.0
• VM Ubuntu 18.04
• DPDK stable 17.11.4
• OVS-DPDK vhostuserclient port
  • options:dq-zero-copy=true
  • options:n_txq_desc=128

# ./configure --with-dpdk=
# ovs-ofctl add-flow br0 "in_port=enp2s0, actions=output:vhost-user-1"
# ovs-ofctl add-flow br0 "in_port=vhost-user-1, actions=output:enp2s0"

[Diagram: PVP path: enp2s0 (XDP redirect into the AF_XDP socket) -> br0 -> vhost-user -> QEMU VM virtio, and back]
PVP CPU utilization
  PID USER  PR NI    VIRT   RES   SHR S  %CPU %MEM    TIME+ COMMAND
   16 root  20  0       0     0     0 R 100.0  0.0 88:26.26 ksoftirqd/1
21510 root  20  0 9807168 53724  5668 S 100.0  0.2  5:58.38 ovs-vswitchd
21662 root  20  0 4894752 30576 12252 S 100.0  0.1  5:21.78 qemu-system-x86
21878 root  20  0   41940  3832  3096 R   6.2  0.0  0:00.01 top
pmd-stats-show (PVP)
pmd thread numa_id 0 core_id 11:
packets received: 205680121
packet recirculations: 0
avg. datapath passes per packet: 1.00
emc hits: 205680121
smc hits: 0
megaflow hits: 0
avg. subtable lookups per megaflow hit: 0.00
miss with success upcall: 0
miss with failed upcall: 0
avg. packets per output batch: 31.01
idle cycles: 0 (0.00%)
processing cycles: 74238999024 (100.00%)
avg cycles per packet: 360.94 (74238999024/205680121)
avg processing cycles per packet: 360.94 (74238999024/205680121)
AF_XDP PVP Performance Evaluation
• ./perf record -p `pidof ovs-vswitchd` sleep 10
15.88% pmd28 ovs-vswitchd [.] rte_vhost_dequeue_burst
14.51% pmd28 ovs-vswitchd [.] rte_vhost_enqueue_burst
10.41% pmd28 ovs-vswitchd [.] dp_netdev_input__
 8.31% pmd28 ovs-vswitchd [.] miniflow_extract
 7.65% pmd28 ovs-vswitchd [.] netdev_linux_rxq_xsk
 5.59% pmd28 ovs-vswitchd [.] netdev_linux_send
 4.20% pmd28 ovs-vswitchd [.] dpdk_do_tx_copy
 3.96% pmd28 libc-2.23.so  [.] __memcmp_sse4_1
 3.94% pmd28 libc-2.23.so  [.] __memcpy_avx_unaligned
 2.45% pmd28 ovs-vswitchd [.] free_dpdk_buf
 2.43% pmd28 ovs-vswitchd [.] __netdev_dpdk_vhost_send
 2.14% pmd28 ovs-vswitchd [.] miniflow_hash_5tuple
 1.89% pmd28 ovs-vswitchd [.] dp_execute_cb
 1.82% pmd28 ovs-vswitchd [.] netdev_dpdk_vhost_rxq_recv
Performance Result

OVS AF_XDP    PPS       CPU
RX Drop       19Mpps    200%
L2fwd [2]     14Mpps    200%
PVP [3]       3.3Mpps   300%

OVS DPDK [1]  PPS       CPU
RX Drop       NA        NA
l3fwd         13Mpps    100%
PVP           7.4Mpps   200%

[1] Intel® Open Network Platform Release 2.1 Performance Test Report
[2] Demo rxdrop/l2fwd: https://www.youtube.com/watch?v=VGMmCZ6vA0s
[3] Demo PVP: https://www.youtube.com/watch?v=WevLbHf32UY
Conclusion 1/2
• AF_XDP is a high-speed Linux socket type
• We added a new netdev type based on AF_XDP
• It re-uses the userspace datapath used by OVS-DPDK
Performance
• Pre-allocate and pre-initialize as much as possible
• Batching alone does not reduce the number of per-packet operations
• Batching plus cache-aware data structures amortizes the cache misses
Conclusion 2/2
• Need high packet rates but can't deploy DPDK? Use AF_XDP!
• Still slower than OVS-DPDK [1]; more optimizations are coming [2]
Comparison with OVS-DPDK
• Better integration with the Linux kernel and management tools
• Selectively uses kernel features; no re-injection needed
• Does not require a dedicated device or CPU
[1] The eXpress Data Path: Fast Programmable Packet Processing in the Operating System Kernel
[2] The Path to DPDK Speeds for AF XDP, Linux Plumber 2018
Thank you
./perf kvm stat record -p 21662 sleep 10
Analyze events for all VMs, all VCPUs:

VM-EXIT             Samples  Samples%  Time%   Min Time  Max Time     Avg time
HLT                  298071    95.56%  99.91%    0.43us  511955.09us  32.95us ( +- 19.18% )
EPT_MISCONFIG         10366     3.32%   0.05%    0.39us      12.35us   0.47us ( +-  0.71% )
EXTERNAL_INTERRUPT     2462     0.79%   0.01%    0.33us      21.20us   0.50us ( +-  3.21% )
MSR_WRITE               761     0.24%   0.01%    0.40us      12.74us   1.19us ( +-  3.51% )
IO_INSTRUCTION          185     0.06%   0.02%    1.98us      35.96us   8.30us ( +-  4.97% )
PREEMPTION_TIMER         62     0.02%   0.00%    0.52us       2.77us   1.04us ( +-  4.34% )
MSR_READ                 19     0.01%   0.00%    0.79us       2.49us   1.37us ( +-  8.71% )
EXCEPTION_NMI             1     0.00%   0.00%    0.58us       0.58us   0.58us ( +-  0.00% )

Total Samples: 311927, Total events handled time: 9831483.62us.
root@ovs-afxdp:~/ovs# ovs-vsctl show
2ade349f-2bce-4118-b633-dce5ac51d994
    Bridge "br0"
        Port "br0"
            Interface "br0"
                type: internal
        Port "vhost-user-1"
            Interface "vhost-user-1"
                type: dpdkvhostuser
        Port "enp2s0"
            Interface "enp2s0"
                type: afxdp
QEMU
qemu-system-x86_64 -hda ubuntu1810.qcow \
  -m 4096 \
  -cpu host,+x2apic -enable-kvm \
  -chardev socket,id=char1,path=/tmp/vhost,server \
  -netdev type=vhost-user,id=mynet1,chardev=char1,vhostforce,queues=4 \
  -device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,mq=on,vectors=10,mrg_rxbuf=on,rx_queue_size=1024 \
  -object memory-backend-file,id=mem,size=4096M,mem-path=/dev/hugepages,share=on \
  -numa node,memdev=mem -mem-prealloc -smp 2

Editor's Notes
  • #5: Previous approach introduced BPF_ACTION in tc. tc is the kernel packet queuing subsystem that provides QoS. ovs-vswitchd creates the eBPF maps, loads the eBPF programs, etc.
  • #8: Compare with Linux kernel 4.9-rc3.
  • #17: ovs-ofctl add-flow br0 "in_port=enp2s0 actions=set_field:14->in_port,set_field:a0:36:9f:33:b1:40->dl_src,enp2s0"
  • #20: 10455 2018-12-04T17:34:15.952Z|00146|dpdk|INFO|VHOST_CONFIG: dequeue zero copy is enabled
  • #23: top during PVP: ksoftirqd/1 100.0% CPU, ovs-vswitchd 106.7% CPU, qemu-system-x86 106.7% CPU