PFQ: a Novel Architecture for Packet
Capture on Parallel Commodity
Hardware
Nicola Bonelli, Andrea Di Pietro,
Stefano Giordano, Gregorio Procissi
CNIT and Dip. di Ingegneria dell’Informazione - Università di Pisa
Outline
• Introduction and motivation
• Multi-core programming guidelines
• PFQ architecture
• Performance evaluation
• Conclusion and future work
Introduction and Motivations
• Designing monitoring applications has become a very challenging task:
– The hardware has evolved: 10 Gbit/s links, multi-core architectures and
multi-queue network devices (MSI-X)…
• Current traffic-monitoring software, including some parts of the
Linux kernel, is not optimized for the new hardware
– (+) kernel support for multi-queue network adapters is implemented
– (-) The Linux kernel offers poor support for monitoring applications
– (-) PF_PACKET is extremely slow, even when used in memory-mapped mode (pcap)
– (-) PF_RING has been designed for single-processor systems
• Traffic monitoring should:
– Exploit modern hardware, scaling possibly linearly with the number of cores
– Decouple the hardware parallelism from the software one
– Use a divide-and-conquer approach to steer packets to applications or threads
Multi-thread on Multi-core
• What’s wrong with the current software?
– Previous multi-threading paradigms used for single-processor systems are still
valid, but prevent the software from scaling with the number of cores.
• For software to be effective on multi-core systems…
– Semaphores, mutexes, and spinlocks are out of the question!
– R/W mutexes prevent readers from scaling, even though they are supposed to
grant concurrent access to readers
– Atomic operations are sometimes required, but must be used in
moderation
• use sparse counters instead of atomic ones
• design algorithms so that the cost of atomic operations is amortized
– Sharing (writes to shared data) has a serious impact on performance
– Writes to shared memory are delayed by the hardware, and reads must be synchronized
– False sharing must, and always can, be avoided
• Wait-free algorithms are mandatory; lock-free algorithms should be
avoided (if possible)…
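As an illustration of the counter guideline, here is a minimal sketch that reads "sparse counter" as one counter per core, each on its own cache line. That reading, the 64-byte line size, and the names `PaddedCounter` and `SparseCounter` are assumptions made for this example; the slides do not show PFQ's actual counters.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed cache-line size; 64 bytes is typical for x86-64 but not guaranteed.
constexpr std::size_t kCacheLine = 64;

// One counter per cache line, so two cores never write to the same line
// (this is what avoids false sharing).
struct alignas(kCacheLine) PaddedCounter {
    std::atomic<std::uint64_t> value{0};
};
static_assert(sizeof(PaddedCounter) == kCacheLine, "one counter per cache line");

// Hypothetical per-core counter: each core updates only its own slot.
template <std::size_t Cores>
class SparseCounter {
    PaddedCounter slots_[Cores];

public:
    // Called only by the owner of slot `core`. With a single writer per slot,
    // a relaxed load followed by a relaxed store is enough: no atomic
    // read-modify-write instruction is issued.
    void add(std::size_t core, std::uint64_t n = 1) {
        assert(core < Cores);
        std::atomic<std::uint64_t>& v = slots_[core].value;
        v.store(v.load(std::memory_order_relaxed) + n, std::memory_order_relaxed);
    }

    // Any thread may read; the sum is an approximate snapshot while
    // writers are still running.
    std::uint64_t read() const {
        std::uint64_t sum = 0;
        for (const PaddedCounter& s : slots_)
            sum += s.value.load(std::memory_order_relaxed);
        return sum;
    }
};
```

The trade-off is that reads become O(number of cores) and only approximately consistent, which is usually acceptable for statistics that are updated far more often than they are read.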
PFQ preamble
• PFQ is a novel capture system natively supporting 64-bit multi-core
architectures, designed according to all the guidelines presented
above
• PFQ is not a custom driver
• It is an architecture running on top of standard Ethernet drivers, as
well as slightly modified ones, the “PFQ-aware drivers” (an idea inherited
from PF_RING-aware drivers)
• PFQ enables packet capturing, filtering, aggregation of hardware queues and
devices, packet classification, packet steering and so forth…
• It decouples the hardware parallelism (e.g. Intel RSS) from the
software one
PFQ architecture
Built on top of the following components…
• User-space C++11 library that provides STL-like abstractions:
containers and iterators
• DB-MPSC queue: double-buffered multiple-producer, single-consumer queue (for
communication with user space):
– Allows NAPI contexts to enqueue packets concurrently
– Reduces sharing and eliminates false sharing between user-space and NAPI contexts
– Enables user space to copy packets from the queue to a private buffer in batches
• De-multiplexing matrix:
– perfect wait-free concurrently accessible data structure
– no serialization is required to steer/copy packets
• SPSC queue:
– enables batching of socket buffers (skb), to increase temporal locality for the memory
manager (SLAB for kernels prior to 2.6.39)
• Driver awareness:
– an effective idea inherited from PF_RING
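A single-producer, single-consumer queue of the kind listed above can be sketched as a ring buffer in which each index is written by only one side. This is a generic user-space illustration under assumed names (`SpscRing`, `pop_batch`); it is not PFQ's kernel code, and the element type stands in for an skb pointer.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>

// Hypothetical wait-free SPSC ring. N must be a power of two so that the
// slot position can be computed with a bit mask.
template <typename T, std::size_t N>
class SpscRing {
    static_assert(N > 0 && (N & (N - 1)) == 0, "N must be a power of two");

    std::array<T, N> buf_{};
    std::atomic<std::size_t> head_{0};  // next slot to read, written by the consumer only
    std::atomic<std::size_t> tail_{0};  // next slot to write, written by the producer only

public:
    // Producer side: returns false when the ring is full.
    bool push(const T& v) {
        const std::size_t t = tail_.load(std::memory_order_relaxed);
        if (t - head_.load(std::memory_order_acquire) == N) return false;
        buf_[t & (N - 1)] = v;
        tail_.store(t + 1, std::memory_order_release);  // publish the element
        return true;
    }

    // Consumer side: removes up to `max` elements in one go and returns how
    // many were written to `out`. The head index is updated once per batch.
    template <typename Out>
    std::size_t pop_batch(Out out, std::size_t max) {
        const std::size_t h = head_.load(std::memory_order_relaxed);
        const std::size_t avail = tail_.load(std::memory_order_acquire) - h;
        const std::size_t n = avail < max ? avail : max;
        for (std::size_t i = 0; i < n; ++i) *out++ = buf_[(h + i) & (N - 1)];
        head_.store(h + n, std::memory_order_release);  // release the slots
        return n;
    }
};
```

Because the shared index is updated once per batch rather than once per element, its cost is amortized over the batch, which is the kind of batching effect the SPSC bullet describes.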
PFQ architecture
Packet steering
Given a packet and a set of sockets, which sockets need to receive it?
• For capture engines that do not support it, filtering can be used to
dispatch packets across a number of sockets:
– Traversing the socket list to find those interested in the packet has
linear complexity O(n).
– Flexible approach because it enables dispatching as well as copies
• We designed a “packet steering” paradigm that:
– Has O(1) complexity in identifying the destination sockets
– Supports both balancing and copies of packets
– Allows custom hash functions for packet dispatching
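To make the O(1) claim concrete, the sketch below dispatches a packet to one member of a balancing group by hashing its flow key and indexing into the group. The `FlowKey` layout, the symmetric hash, and `pick_socket` are all assumptions chosen for illustration; the slides do not specify which hash functions PFQ uses.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical flow key (IPv4 addresses and transport ports).
struct FlowKey {
    std::uint32_t src_ip;
    std::uint32_t dst_ip;
    std::uint16_t src_port;
    std::uint16_t dst_port;
};

// Example of a custom hash: XOR makes it symmetric, so both directions of
// a connection hash to the same value and reach the same socket.
inline std::uint32_t symmetric_hash(const FlowKey& k) {
    const std::uint32_t ips = k.src_ip ^ k.dst_ip;
    const std::uint32_t ports = static_cast<std::uint32_t>(k.src_port ^ k.dst_port);
    return (ips * 2654435761u) ^ ports;  // multiplicative mixing constant
}

// O(1) selection: one modulo and one array access, regardless of how many
// sockets exist. `group` holds the socket ids of one balancing group.
inline int pick_socket(const std::vector<int>& group, std::uint32_t hash) {
    assert(!group.empty());
    return group[hash % group.size()];
}
```

Compared with traversing a list of per-socket filters, the cost here does not grow with the number of sockets; the price is that the dispatching policy must be expressible as a hash over packet fields.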
Packet steering
• Completely concurrent block (wait-free):
– Shared state (de-multiplexing matrix) is mostly read only
– Writes, which are in general rare events, are serialized with respect to each
other to prevent race conditions. The update of the state in the matrix is atomic
• Load balancing groups:
– A socket can create or subscribe to a load-balancing group
– Each member receives a fraction of the overall traffic
• Socket binding: a socket can be bound to
– One or more hardware queues of a given NIC
– One or more NICs
• Binding and balancing groups are orthogonal and can be used
concurrently
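One way to picture a mostly read-only, atomically updated steering state is a table indexed by (device, hardware queue) whose cells are bitmasks of bound sockets. The sketch below is only a guess at the general shape: the 64-socket limit, the cell encoding, and the names `DemuxMatrix`, `bind`, `bind_device` and `lookup` are assumptions, not PFQ's data structure.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical de-multiplexing table: bit i of a cell is set when socket i
// is bound to that (device, queue) pair.
template <std::size_t Devs, std::size_t Queues>
class DemuxMatrix {
    std::atomic<std::uint64_t> cell_[Devs][Queues];

public:
    DemuxMatrix() {
        for (std::size_t d = 0; d < Devs; ++d)
            for (std::size_t q = 0; q < Queues; ++q)
                cell_[d][q].store(0, std::memory_order_relaxed);
    }

    // Rare write path: a single atomic OR, so readers see either the old
    // or the new mask, never a partial update.
    void bind(unsigned sock, std::size_t dev, std::size_t queue) {
        assert(sock < 64 && dev < Devs && queue < Queues);
        cell_[dev][queue].fetch_or(std::uint64_t{1} << sock, std::memory_order_release);
    }

    // Binding to a whole NIC is binding to each of its queues.
    void bind_device(unsigned sock, std::size_t dev) {
        for (std::size_t q = 0; q < Queues; ++q) bind(sock, dev, q);
    }

    void unbind(unsigned sock, std::size_t dev, std::size_t queue) {
        assert(sock < 64 && dev < Devs && queue < Queues);
        cell_[dev][queue].fetch_and(~(std::uint64_t{1} << sock), std::memory_order_release);
    }

    // Hot path, executed per packet: one atomic load, no locks.
    std::uint64_t lookup(std::size_t dev, std::size_t queue) const {
        assert(dev < Devs && queue < Queues);
        return cell_[dev][queue].load(std::memory_order_acquire);
    }
};
```

The per-packet path is a single load, so concurrent capture contexts never wait on each other; only the infrequent bind and unbind calls pay for an atomic read-modify-write.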
Socket queue: DB-MPSC
• The socket queue is an unavoidable contention point:
– Load balancing shuffles packets across sockets
• How to handle contention without impacting performance?
– Use an atomic operation to reserve a slot within the queue (will be amortized
in future implementations)
– Reduce cache-coherence traffic between the cores running kernel threads and
user-space threads
– The swap between buffers is triggered by the user-space thread or by a watermark
– Packets can be copied in batches, or consumed in place
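The slot-reservation and buffer-swap ideas can be sketched as follows. This is a simplified user-space illustration under stated assumptions: the packed state word, the per-slot `ready` flag, and the names `DbMpscQueue`, `push` and `drain` are invented for the example, and only the consumer-triggered swap is shown (no watermark). PFQ's real queue lives in memory shared between kernel and user space and is not reproduced here.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical double-buffered multi-producer, single-consumer queue.
template <typename T, std::size_t N>
class DbMpscQueue {
    struct Slot {
        std::atomic<bool> ready{false};  // set once the producer has written `value`
        T value{};
    };
    static constexpr std::uint64_t kBufBit = std::uint64_t{1} << 63;

    std::array<Slot, N> buf_[2];
    // Top bit: index of the active buffer. Low bits: slots reserved in it.
    std::atomic<std::uint64_t> state_{0};

public:
    // Producer side (any number of threads): one atomic add reserves a slot.
    bool push(const T& v) {
        const std::uint64_t s = state_.fetch_add(1, std::memory_order_acq_rel);
        const std::uint64_t idx = s & ~kBufBit;
        if (idx >= N) return false;  // active buffer is full: the caller drops
        Slot& slot = buf_[(s & kBufBit) ? 1 : 0][idx];
        slot.value = v;
        slot.ready.store(true, std::memory_order_release);
        return true;
    }

    // Consumer side (exactly one thread): swap buffers with one atomic
    // exchange, then read the retired buffer while producers fill the other.
    std::vector<T> drain() {
        const std::uint64_t active = state_.load(std::memory_order_relaxed) & kBufBit;
        const std::uint64_t old = state_.exchange(active ^ kBufBit, std::memory_order_acq_rel);
        std::uint64_t count = old & ~kBufBit;
        if (count > N) count = N;  // reservations past the end were rejected

        std::vector<T> out;
        out.reserve(static_cast<std::size_t>(count));
        std::array<Slot, N>& retired = buf_[active ? 1 : 0];
        for (std::size_t i = 0; i < count; ++i) {
            Slot& slot = retired[i];
            // A producer may have reserved this slot but not finished writing.
            while (!slot.ready.load(std::memory_order_acquire)) {}
            out.push_back(slot.value);
            slot.ready.store(false, std::memory_order_relaxed);
        }
        return out;
    }
};
```

Note that the consumer's short wait on `ready` means this sketch is not strictly wait-free on the consumer side; it is meant only to show how a single reservation counter and two buffers keep producers and the consumer on separate memory most of the time.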
Testbed: Mascara & Monsters
The two hosts are connected by a 10 Gb link.
• Mascara (traffic generation):
– Xeon 6-core X5650 @2.57 GHz, 12 GBytes RAM
– Intel 82599 multi-queue 10G Ethernet adapter
– New socket PF_DIRECT for generation
– By deploying 3-4 cores, it is possible to generate up to ~12 Mpps of 64-byte packets
• Monsters (traffic capture):
– Xeon 6-core X5650 @2.57 GHz, 12 GBytes RAM
– Intel 82599 multi-queue 10G Ethernet adapter
– PFQ on board for traffic capture
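For context on the ~12 Mpps figure: on Ethernet, each frame also occupies 20 bytes on the wire beyond the frame itself (7-byte preamble, 1-byte start-of-frame delimiter, 12-byte inter-frame gap), so the theoretical maximum for 64-byte frames on a 10 Gbit/s link is about 14.88 Mpps. The helper name `max_pps` below is just for this illustration.

```cpp
#include <cassert>

// Per-frame wire overhead on Ethernet: preamble (7) + SFD (1) + inter-frame gap (12).
constexpr int kWireOverheadBytes = 20;

// Theoretical maximum packet rate for a given link speed and frame size
// (frame size includes the 4-byte FCS, as in the usual "64-byte packet").
constexpr double max_pps(double link_bps, int frame_bytes) {
    return link_bps / ((frame_bytes + kWireOverheadBytes) * 8.0);
}
```

With these numbers, `max_pps(10e9, 64)` is 10^10 / 672, roughly 14.88 million packets per second, so ~12 Mpps corresponds to about 80% of line rate for minimum-size frames.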
Single socket layout
Fully parallel layout
Load balancing across sockets
• Using 12 capturing NAPI contexts
• Varying the number of user space threads
Packet copy
• Copying packets to a variable number of user space threads
• 12 NAPI contexts within the kernel
Future directions
We are working to improve the packet steering framework…
• How can we better distribute packets according to
application-specific semantics?
• Enhance balancing groups: allow a single socket to join multiple
balancing groups
• Each group is associated with a “specific steering function”
• Investigate the implementation of wait-free stateful algorithms
(pimp/CAS)
• Add support for control-plane and data-plane sockets
• Implement a filtering mechanism by means of a Bloom filter
variant (capture filters)
Conclusions
• Modern commodity architectures are increasingly parallel
• Today's multithreaded software is not ready for multi-core
architectures:
• Coding and design rules must be strictly followed to achieve linear
scalability
• PFQ: a novel Linux packet capturing engine
– Better scalability than its competitors
– Flexible packet steering that eases the implementation of
multi-threaded user-space applications
– Decouples kernel-space and user-space parallelism
• PFQ webpage and download:
– netgroup.iet.unipi.it/software/pfq

PFQ@ PAM12
