Intro to DPDK & HW
Network Platforms Group
TRANSFORMING NETWORKING & STORAGE
2
Legal Disclaimer
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND
CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A
PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND
HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR
INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL
PRODUCT OR ANY OF ITS PARTS.
Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no
responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.
The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.
Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.
Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm. Performance tests and ratings are measured using specific computer
systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate
the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, visit Intel Performance Benchmark Limitations
All products, computer systems, dates and figures specified are preliminary based on current expectations, and are subject to change without notice.
Celeron, Intel, Intel logo, Intel Core, Intel Inside, Intel Inside logo, Intel. Leap ahead., Intel. Leap ahead. logo, Intel NetBurst, Intel SpeedStep, Intel XScale, Itanium, Pentium, Pentium Inside, VTune, Xeon, and Xeon Inside are trademarks or registered trademarks of Intel Corporation or its
subsidiaries in the United States and other countries.
Intel® Active Management Technology requires the platform to have an Intel® AMT-enabled chipset, network hardware and software, as well as connection with a power source and a corporate network connection. With regard to notebooks, Intel AMT may not be available or certain
capabilities may be limited over a host OS-based VPN or when connecting wirelessly, on battery power, sleeping, hibernating or powered off. For more information, see http://www.intel.com/technology/iamt.
64-bit computing on Intel architecture requires a computer system with a processor, chipset, BIOS, operating system, device drivers and applications enabled for Intel® 64 architecture. Performance will vary depending on your hardware and software configurations. Consult with your
system vendor for more information.
No computer system can provide absolute security under all conditions. Intel® Trusted Execution Technology is a security technology under development by Intel and requires for operation a computer system with Intel® Virtualization Technology, an Intel Trusted Execution Technology-
enabled processor, chipset, BIOS, Authenticated Code Modules, and an Intel or other compatible measured virtual machine monitor. In addition, Intel Trusted Execution Technology requires the system to contain a TPMv1.2 as defined by the Trusted Computing Group and specific
software for some uses. See http://www.intel.com/technology/security/ for more information.
†Hyper-Threading Technology (HT Technology) requires a computer system with an Intel® Pentium® 4 Processor supporting HT Technology and an HT Technology-enabled chipset, BIOS, and operating system. Performance will vary depending on the specific hardware and software you
use. See www.intel.com/products/ht/hyperthreading_more.htm for more information including details on which processors support HT Technology.
Intel® Virtualization Technology requires a computer system with an enabled Intel® processor, BIOS, virtual machine monitor (VMM) and, for some uses, certain platform software enabled for it. Functionality, performance or other benefits will vary depending on hardware and software
configurations and may require a BIOS update. Software applications may not be compatible with all operating systems. Please check with your application vendor.
* Other names and brands may be claimed as the property of others.
Other vendors are listed by Intel as a convenience to Intel's general customer base, but Intel does not make any representations or warranties whatsoever regarding quality, reliability, functionality, or compatibility of these devices. This list and/or these devices may be subject to
change without notice.
Copyright © 2013, Intel Corporation. All rights reserved.
TRANSFORMING NETWORKING & STORAGE
3
Topics
Why DPDK – PMD vs Linux interrupt driver, memory config, user space.
Licensing
Memory IA – NUMA, huge pages, TLBs on IA
Memory DPDK – mem pools, buffers, allocation etc.
Caching handling, DDIO
TRANSFORMING NETWORKING & STORAGE
4
Intel® Data Plane Development Kit (Intel® DPDK)
• Big Idea: a software solution for accelerating packet-processing workloads on IA.
• Deployment models: Concepts, Code, Commercial
• Performance: delivers a 25X performance jump over Linux
(per-core L3 forwarding performance: 1.1 Mpps with Linux vs. 28.5 Mpps with Intel® DPDK)
• Comprehensive virtualization support
• Free, open source, BSD license
• Commercial support available; enjoys vibrant community support
Disclaimer: Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and
functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
TRANSFORMING NETWORKING & STORAGE
5
What Problems Does DPDK Address?
TRANSFORMING NETWORKING & STORAGE
6
What Problem Does DPDK Address?

40 Gbps line rate (or 4x10G): Rx -> Process Packet -> Tx

Packet size: 64 bytes
40G packets/second: 59.5 million each way
Packet arrival rate: 16.8 ns
Clock cycles at 2 GHz: 33 cycles

Packet size: 1024 bytes
40G packets/second: 4.8 million each way
Packet arrival rate: 208.8 ns
Clock cycles at 2 GHz: 417 cycles

[Chart: packets per second vs. packet size (64 B to 1472 B), annotated with typical network-infrastructure and typical server packet sizes]
TRANSFORMING NETWORKING & STORAGE
7
The Problem Intel® DPDK Addresses

[Same chart and packet-rate figures as the previous slide: packets per second vs. packet size at 40 Gbps line rate (or 4x10G), for 64-byte and 1024-byte packets]

From a CPU perspective:
• L3 cache hit latency is ~40 cycles
• L3 miss, memory read is ~70 ns (140 cycles at 2 GHz)

Intel® silicon and Intel® software advances are proactively addressing this problem statement.
TRANSFORMING NETWORKING & STORAGE
8
Benefits – Eliminating / Hiding Overheads

How?
• Interrupt context-switch overhead -> eliminated by polling
• Kernel/user overhead -> eliminated by a user-mode driver
• Core-to-thread scheduling overhead -> eliminated by pthread affinity
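To make the polling model concrete, below is a minimal sketch of a poll-mode receive loop (a hypothetical helper assuming the port, queue and mbuf pool are already configured and started; BURST_SIZE is an illustrative value, and the loop simply drops what it receives):

    #include <rte_ethdev.h>
    #include <rte_mbuf.h>

    #define BURST_SIZE 32                     /* illustrative burst size */

    /* Busy-poll one RX queue: no interrupts, no kernel involvement. */
    static void rx_poll_loop(uint8_t port_id, uint16_t queue_id)
    {
        struct rte_mbuf *bufs[BURST_SIZE];

        for (;;) {
            /* Returns immediately with 0..BURST_SIZE packets. */
            uint16_t nb_rx = rte_eth_rx_burst(port_id, queue_id,
                                              bufs, BURST_SIZE);

            for (uint16_t i = 0; i < nb_rx; i++) {
                /* ... process / forward the packet here ... */
                rte_pktmbuf_free(bufs[i]); /* drop it in this sketch */
            }
        }
    }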
TRANSFORMING NETWORKING & STORAGE
9
Licensing
• DPDK is BSD licensed:
• http://opensource.org/licenses/BSD-3-Clause
• User is free to modify, copy and re-use code
• No need to provide source code in derived software (unlike GPL license)
TRANSFORMING NETWORKING & STORAGE
10
DPDK Packet Processing Concepts
• DPDK is designed for high-speed packet processing on IA. This is achieved by optimizing the
software libraries for IA using concepts such as:
• Huge pages, cache alignment, pthreads with affinity
• Prefetching, new instructions, NUMA
• Intel® DDIO, memory interleave, memory channels
• Intel® Data Direct I/O Technology (Intel® DDIO)
• Enabled by default in all Intel® Xeon® processor E5-based platforms
• Enables PCIe adapters to route I/O traffic directly to L3 cache, reducing unnecessary trips to
system memory, providing more than double the throughput of previous-generation servers,
while further reducing power consumption and I/O latency.
• Pthreads
• On startup, DPDK specifies the cores to be used and ties the application's threads to those
cores via pthread calls with affinity. This limits the kernel's ability to move the application to
another local or remote core, which would hurt performance.
• The user may still use pthread or fork calls after DPDK has started, to let threads float or to
tie multiple threads to a single core.
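A minimal sketch of the mechanism underneath, pinning the calling thread to one core with the Linux pthread affinity API (DPDK's EAL does the equivalent for each lcore at startup; core_id is just an example parameter):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    /* Pin the calling thread to a single CPU core. */
    static int pin_self_to_core(int core_id)
    {
        cpu_set_t cpuset;

        CPU_ZERO(&cpuset);
        CPU_SET(core_id, &cpuset);

        /* Once pinned, the kernel will no longer migrate this thread. */
        return pthread_setaffinity_np(pthread_self(), sizeof(cpuset), &cpuset);
    }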
TRANSFORMING NETWORKING & STORAGE
11
DPDK Packet Processing Concepts
• NUMA
• DPDK uses NUMA-aware allocation of resources to improve performance for processing
and PCIe I/O local to a processor.
• Without NUMA-aware allocation, memory in a dual-socket system is interleaved between the two sockets.
• Huge Pages
• DPDK uses 2M and 1G hugepages to reduce TLB misses, which can significantly
affect a core's overall performance.
• Cache Alignment
• Better performance by aligning structures on 64-byte cache lines (see the sketch after this list).
• Software Prefetching
• Needs to be issued "appropriately" ahead of time to be effective. Too early could cause eviction
before use.
• Allows the cache to be populated before the data is accessed.
• Memory channel use
• Memory pools add padding to objects to ensure even use of memory channels.
• Number of channels is specified at application start-up.
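The cache-alignment sketch referenced above, assuming a 64-byte IA cache line (the GCC/Clang attribute is shown; DPDK wraps the same idea in its own alignment macro):

    #include <stdint.h>

    #define CACHE_LINE_SIZE 64          /* IA cache line, as noted above */

    /* Per-core counters padded out to a full cache line so that two
     * cores never share (and bounce) the same line - "false sharing". */
    struct per_core_stats {
        uint64_t rx_packets;
        uint64_t tx_packets;
    } __attribute__((aligned(CACHE_LINE_SIZE)));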
TRANSFORMING NETWORKING & STORAGE
12
Memory configuration
Intel Architecture
TRANSFORMING NETWORKING & STORAGE
13
Memory – performance topics
• NUMA architecture
• Caching
• TLBs
• Huge pages
• Memory allocation
TRANSFORMING NETWORKING & STORAGE
14
Intel® Core™ Microarchitecture Platform
Architecture
Integrated Memory Controller
• 4 DDR3 channels per socket
• Massive memory bandwidth
• Memory Bandwidth scales with #
of processors
• Very low memory latency
Intel® QuickPath Interconnect
(Intel® QPI)
• New point-to-point interconnect
• Socket to socket connections
• Socket to chipset connections
• Build scalable solutions
[Diagram: two IVB-EP sockets linked by Intel® QPI, connected to a PCH]
Significant performance leap for new platform
TRANSFORMING NETWORKING & STORAGE
15
Non-Uniform Memory Access (NUMA)
FSB architecture (legacy)
• All memory in one location
Starting with Intel® Core™
microarchitecture (Nehalem)
• Memory located in multiple places
Latency to memory dependent on
location
Local memory
• Highest BW
• Lowest latency
Remote Memory
• Higher latency
[Diagram: two IVB-EP sockets, each with local memory, linked by Intel® QPI and connected to a PCH]
Ensure software is NUMA-optimized for best performance
TRANSFORMING NETWORKING & STORAGE
16
NUMA Considerations for Data Structure Allocation

[Diagram: Intel® NIC attached over PCIe/DMI to a platform with cores 0-3 (each with I$/D$, sharing L2 caches), DCA, QPI, and multiple memory channels; NIC rx_queues 0-3 are steered to cores by a flow hash]

hash = (tcp->th_sport) ^
       (tcp->th_dport) ^
       (ip->ip_src.s_addr) ^
       (ip->ip_dst.s_addr);
hash = hash % PRIME_NUMBER;
return lookup_table[hash];

PTU Metrics
• MEM_UNCORE_RETIRED.REMOTE_DRAM
• MEM_INSTRUCTIONS_RETIRED.LATENCY_ABOVE_THRESHOLD
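A minimal sketch of applying this with DPDK's NUMA-aware allocator, placing per-port state on the socket local to a given NIC port (rte_eth_dev_socket_id, rte_socket_id and rte_malloc_socket are DPDK calls; the tag string, alignment and fallback policy are illustrative):

    #include <rte_ethdev.h>
    #include <rte_malloc.h>
    #include <rte_lcore.h>

    /* Allocate 'size' bytes on the NUMA node closest to 'port_id', so the
     * core servicing that port never crosses QPI to reach this state. */
    static void *alloc_near_port(uint8_t port_id, size_t size)
    {
        int socket = rte_eth_dev_socket_id(port_id);

        if (socket < 0)            /* socket unknown: use the caller's node */
            socket = rte_socket_id();

        return rte_malloc_socket("rx_state", size, 64, socket);
    }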
TRANSFORMING NETWORKING & STORAGE
17
Caching on IA
TRANSFORMING NETWORKING & STORAGE
18
Caching on IA
• IA processors have cache integrated on the processor die.
• Fast-access SRAM
• Code & data from system memory (DRAM) are stored in this fast-access cache memory
• Without a cache, the CPU has to fetch instructions and data directly from system
memory
• The CPU core "stalls" while waiting for data
• Cache miss (data not in cache)
• CPU needs to get the data from system memory
• Cache is populated with the required data
• Not just the data required, but a whole block of memory is copied
• "Cache line" – 64 bytes on IA (IVB, HSW etc.)
• Cache hit – data present in cache
TRANSFORMING NETWORKING & STORAGE
19
• Cache Consistency
• Cache is a copy of a piece of memory
• Needs to always reflect what is contained in system
memory
• Snoop
• Cache watches address lines for transactions
• Cache sees if any transaction accesses memory contained
within the cache
• Cache is kept consistent with the caches of other CPU cores
• Dirty data
• Data modified in cache but not in main memory
• Stale data
• Data modified in main memory, but not in cache
Caching on IA – some terms
TRANSFORMING NETWORKING & STORAGE
20
• 3 Levels of cache (SNB, IVB, HSW processors)
• L1 cache – 32KB data and 32KB instruction caches
• L2 cache – 256KB – unified (holds code & data)
• L3 cache (LLC) – 25MB (IVB) , 30MB (HSW) common cache for
all cores in CPU socket.
• L1 cache is smallest, and fastest.
• CPU tries to access data – not in L1 cache?
• Try L2 cache - not in L2 cache?
• Try L3 cache – not in L3 cache?
• Cache miss - need to access system memory (DRAM).
• L1 & L2 caches are per physical core (shared between its logical cores)
• L3 cache is shared (per CPU socket)
Caching on IA
TRANSFORMING NETWORKING & STORAGE
21
• What can be cached?
• Only DRAM can be cached
• IO, MMIO never cached
• L1 cache is smallest, and fastest.
• L1 Code cache is read-only
• Address residing in L1/L2 must be present in L3 cache –
“inclusive cache”
Caching on IA
TRANSFORMING NETWORKING & STORAGE
22
Huge Pages
TRANSFORMING NETWORKING & STORAGE
23
• All memory addresses virtual
• Memory appears contiguous to applications, even if physically
fragmented
• Map virtual address to physical address
• Use page tables to translate virtual address to physical address
• Default page size in Linux on IA is 4kB.
• 4 layers of page tables
Huge Pages
TRANSFORMING NETWORKING & STORAGE
24
Why Hugepages?

TLB maps page numbers to page frames. Each TLB miss requires a page
walk.

[Diagram: with 4 KB pages, four memory accesses are needed to reach the page data; with 2 MB pages, only three]

DTLB:
• 4K pages: 64 entries, maps 256 KB; to access 16 GB of memory, 32 MB of PTE tables must be read by the CPU
• 2M pages: 32 entries, maps 64 MB; to access 16 GB of memory, only 64 KB of PDE tables must be read by the CPU, which fits into
the CPU cache

One 2 MB page = 512 x 4 KB pages,
so 512 fewer page-cross penalties
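A quick sanity check of those figures, assuming 8-byte page-table entries: 16 GB / 4 KB is about 4.2 million pages, and 4.2 million x 8 B is roughly 32 MB of PTEs; 16 GB / 2 MB is 8,192 pages, and 8,192 x 8 B is 64 KB of PDEs, which easily stays resident in the CPU cache.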
TRANSFORMING NETWORKING & STORAGE
25
Huge Pages
• Use Linux hugepage support through the "hugetlbfs" filesystem
• Each page is 2 MB in size, equivalent to 512 4 KB pages
• Each page requires only 1 DTLB entry
• Reduces DTLB misses, and therefore page walks
• Gives improved performance
• Need to enable & allocate huge pages on the Linux boot command line (in the GRUB
file)
• Better to enable at boot time – prevents fragmentation of physical
memory
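Huge pages can also be mapped directly from user space. A minimal sketch, assuming huge pages have already been reserved at boot as described above (DPDK's EAL instead maps hugetlbfs-backed files, but the effect on the DTLB is the same):

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <stdio.h>

    #define HUGE_2MB (2UL * 1024 * 1024)

    int main(void)
    {
        /* One anonymous 2 MB huge page: a single DTLB entry covers it all. */
        void *buf = mmap(NULL, HUGE_2MB, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (buf == MAP_FAILED) {
            perror("mmap(MAP_HUGETLB)"); /* e.g. no free huge pages reserved */
            return 1;
        }

        munmap(buf, HUGE_2MB);
        return 0;
    }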
TRANSFORMING NETWORKING & STORAGE
26
Translation Lookaside Buffers
TLBs – virtual to physical memory address translation
Intel® 64 and IA-32 Architectures Software Developer’s Manual.
Volume 3. System Programming Guide. Chapter 4.10: Caching Translation Information
Intel® 64 and IA-32 Architectures Optimization Reference Manual.
TRANSFORMING NETWORKING & STORAGE
27
Translation Lookaside Buffers (TLBs)
• TLBs – Translation Lookaside Buffers – 2 types
• Instruction TLB
• Data TLB
• TLB is cache – maps virtual memory to physical memory
• When memory requested by application, OS maps virtual
address from process to physical address in memory
• Mapping of virtual to physical memory – Page Table Entry
(PTE)
• TLB is a cache for the Page Table
• If data is found in TLB during address lookup
• TLB hit
• Otherwise – TLB miss (page walk) - performance hit
• Huge pages (Linux) – can alleviate
TRANSFORMING NETWORKING & STORAGE
28
Translation Lookaside Buffers (TLBs)
• TLBs are a cache for page tables
• If memory address lookup is not in TLB -> TLB miss
• We must then “walk the page tables”
• This is slow, and costly
• We need to minimise TLB misses
• Solution is to use huge pages
• Use 2M or 1G huge pages instead of default 4k pages
TRANSFORMING NETWORKING & STORAGE
29
TLB Invalidation
• On multi-core systems one core may change the page table which is used by
other cores
• Page table changes need to be propagated to the other cores' TLBs
• This process is known as “TLB shootdown”
• Need to invalidate the TLBs to avoid using “stale” data
• Need to be aware of other CPU cores invalidating TLBs
• Costly for data plane applications.
• Examples – page faults, VM transitions (VM exit & entry)
• More info in section 4.10.4 of Volume 3A of Intel® 64 and IA-32 Architectures
Software Developer’s Manual
• https://www-ssl.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-3a-part-1-manual.pdf
TRANSFORMING NETWORKING & STORAGE
30
IOTLBs
• As well as TLBs for memory, there are TLBs for DMA – IOTLBs
• Page table structure for DMA address translation
• Sandy Bridge – no huge page support in IOTLBs – page table
fragmentation
• 2M and 1G huge pages fragmented to 4k page size
• Causes more IOTLB misses
• SNB could not achieve near 64 byte line rates for 10G NIC
• Huge page support added in IVB
• SR-IOV performance in IVB greatly enhanced
TRANSFORMING NETWORKING & STORAGE
31
Large Page Table Support
Reducing TLB and IOTLB misses with Large Page Table support
• Intel® Data Plane Development Kit (Intel® DPDK)
utilizes large page tables to create large contiguous
buffers

[Diagram: forwarding sample code running on Intel DPDK over Intel® Architecture and a Virtual Machine Monitor; Extended Page Tables translate GPA to HPA for memory accesses, while the Intel® VT-d IOTLB / translation cache translates NIC DMA addresses]

Intel® Virtualization Technology for Directed I/O (Intel® VT-d)
TRANSFORMING NETWORKING & STORAGE
32
Memory Virtualization Challenges

[Diagram: VM0..VMn with guest page tables; the VMM maintains shadow page tables and remaps addresses; induced VM exits; CPU0 TLB; system memory]

Address Translation
• Guest OS expects contiguous, zero-based physical memory
• VMM must preserve this illusion

Page-table Shadowing
• VMM intercepts paging operations
• Constructs copy of page tables

Overheads
• VM exits add to execution time
• Shadow page tables consume significant host memory
TRANSFORMING NETWORKING & STORAGE
33
Memory Virtualization with EPT

[Diagram: VM0..VMn running over a VMM with I/O virtualization; Intel® VT-x with EPT provides a hardware EPT walker on CPU0, so guest page-table updates cause no VM exits]

Extended Page Tables (EPT)
• Map guest physical to host address
• New hardware page-table walker

Performance Benefit
• Guest OS can modify its own page tables freely
• Eliminates VM exits

Memory Savings
• Shadow page tables required for each guest user process (w/o EPT)
• A single EPT supports entire VM

Intel® Virtualization Technology (Intel® VT) for IA-32, Intel® 64 and Intel® Architecture (Intel® VT-x)
TRANSFORMING NETWORKING & STORAGE
34
Memory Configuration
DPDK
TRANSFORMING NETWORKING & STORAGE
35
Memory Object Hierarchy
[Diagram: 2 MB hugepages grouped into physically contiguous Memory Segments 0..N; memory zones (e.g. RG_RX_RING_0, RG_TX_RING_0, MP_mbuf_pool, MALLOC_HEAP0) are carved out of the segments and back the RX_RING_0 and TX_RING_0 rings, the mbuf_pool memory pool, and the malloc heap]
TRANSFORMING NETWORKING & STORAGE
36
Hugepages
• Use Linux hugepage support through “hugetlbfs” filesystem
• Each page is 2MB in size equivalent to 512 4KB pages
• Each page requires only 1 DTLB entry
• Reduce DTLB misses, and therefore page walks
• Gives improved performance
TRANSFORMING NETWORKING & STORAGE
37
Memory Segments
• Internal unit for memory management is the memory segment
• Always backed by Huge Page (2 MB/1 GB page) memory
• Each segment is contiguous in physical and virtual memory
• Broken out into smaller memory zones for individual objects
TRANSFORMING NETWORKING & STORAGE
38
Memory Zones
• Most basic unit of memory allocation – named block of memory
• Allocate-only, cannot free
• Cannot span a segment boundary – contiguous memory
• Physical address of allocated block available to caller
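A minimal sketch of reserving a named memory zone (rte_memzone_reserve is the DPDK call; the zone name, size and flags here are illustrative, and the name of the physical-address field has changed across DPDK releases):

    #include <rte_memzone.h>
    #include <rte_lcore.h>

    /* Reserve a 1 MB, physically contiguous, named block on the local
     * socket. Memzones are allocate-only: there is no corresponding free. */
    static const struct rte_memzone *reserve_state_zone(void)
    {
        const struct rte_memzone *mz =
            rte_memzone_reserve("APP_STATE", 1 << 20, rte_socket_id(), 0);

        /* On success, mz->addr is the virtual address of the block, and the
         * block's physical address is also exposed to the caller. */
        return mz;                     /* NULL if the reservation failed */
    }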
TRANSFORMING NETWORKING & STORAGE
39
Malloc support – rte_malloc/rte_free
• Malloc library provided to allow easier application porting
• Backed by one or more memzones
• Uses hugepage memory, but supports memory freeing
• Not lock-free – avoid in data path
• Physical address information not available per-allocation
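A minimal sketch of rte_malloc-style allocation at initialisation time (per the note above, these calls are kept out of the data path; the flow_table structure is just an illustrative payload):

    #include <stdint.h>
    #include <rte_malloc.h>

    struct flow_table {
        uint32_t num_entries;
        /* ... application-specific fields ... */
    };

    /* Hugepage-backed, cache-aligned allocation - init time only. */
    static struct flow_table *flow_table_create(uint32_t entries)
    {
        struct flow_table *ft =
            rte_zmalloc("flow_table", sizeof(*ft), 64); /* zeroed, 64 B aligned */

        if (ft != NULL)
            ft->num_entries = entries;
        return ft;
    }

    static void flow_table_destroy(struct flow_table *ft)
    {
        rte_free(ft);   /* unlike memzones, rte_malloc memory can be freed */
    }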
TRANSFORMING NETWORKING & STORAGE
40
Memory Pools
• Pool of fixed-size buffers
• One pool can be safely shared among many threads
• Lock-free allocation and freeing of buffers to/from pool
• Designed for fast-path use
TRANSFORMING NETWORKING & STORAGE
41
Memory Pools (continued)
[Diagram: Processor 0 running four Intel® DPDK data-plane cores (C1-C4) between two 10G ports; the cores share a packet-buffer pool (60K x 2K buffers) and two event pools (2K x 100B buffers), with per-core cached buffers in front of each pool]
• Size fixed at creation time:
• Fixed size elements
• Fixed number of elements
• Multi-producer / multi-consumer safe
• Safe for fast-path use
• Typical usage is packet buffers
• Optimized for performance:
• No locking, use CAS instructions
• All objects cache aligned
• Per core caches to minimise contention / use
of CAS instructions
• Support for bulk allocation / freeing of buffers
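A minimal sketch of the lock-free bulk interface (mp is assumed to be a mempool created earlier; the burst size is illustrative):

    #include <rte_mempool.h>

    #define BULK 32

    /* Grab and return a burst of fixed-size objects. Both operations are
     * multi-producer/multi-consumer safe and hit the per-core cache first. */
    static void mempool_bulk_example(struct rte_mempool *mp)
    {
        void *objs[BULK];

        if (rte_mempool_get_bulk(mp, objs, BULK) == 0) {
            /* ... use the BULK objects on the fast path ... */
            rte_mempool_put_bulk(mp, objs, BULK);
        }
    }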
TRANSFORMING NETWORKING & STORAGE
42
Memory allocation – summary
• For a DPDK application, all memory is allocated from huge pages
• All memory is allocated at initialisation time (not during run time)
• Pools of buffers are created
• Buffers are taken from pools as needed for packet processing
• Returned to the pool after use
• Never need to use "malloc" at runtime
• DPDK takes care of aligning memory to cache lines
TRANSFORMING NETWORKING & STORAGE
43
Memory allocation
• rte_eal_init()
• Initialises the Environment Abstraction Layer
• Takes care of allocating memory from huge pages
• rte_mempool_create()
• Creates a pool of message buffers (mbufs)
• This pool is used to hold packet data
• mbufs are taken from and returned to this pool
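A minimal sketch of that start-up sequence (the pool sizing constants are illustrative, and the exact rte_mempool_create() arguments and mbuf-pool helpers vary between DPDK releases):

    #include <rte_eal.h>
    #include <rte_mempool.h>
    #include <rte_mbuf.h>
    #include <rte_lcore.h>

    #define NB_MBUF   8192                  /* illustrative pool size */
    #define MBUF_SIZE (2048 + sizeof(struct rte_mbuf) + RTE_PKTMBUF_HEADROOM)

    int main(int argc, char **argv)
    {
        /* Map hugepage memory, pin lcores, probe devices. */
        if (rte_eal_init(argc, argv) < 0)
            return -1;

        /* Pool of packet buffers, created once at init and reused forever. */
        struct rte_mempool *mbuf_pool = rte_mempool_create(
            "mbuf_pool", NB_MBUF, MBUF_SIZE,
            32,                                       /* per-core cache size */
            sizeof(struct rte_pktmbuf_pool_private),
            rte_pktmbuf_pool_init, NULL,              /* pool constructor */
            rte_pktmbuf_init, NULL,                   /* per-mbuf constructor */
            rte_socket_id(), 0);

        return (mbuf_pool != NULL) ? 0 : -1;
    }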
TRANSFORMING NETWORKING & STORAGE
44
Memory Buffer - mbuf
Memory buffer structure used throughout the Intel® DPDK
Header holds meta-data about packet and buffer
• Buffer & packet length
• Buffer physical address
• RSS hash or flow director filter information
• Offload flags
Body holds packet data plus room for additional headers and footers.
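A minimal sketch of using a single mbuf from such a pool (rte_pktmbuf_alloc/append/free are DPDK calls; the payload-copy example is illustrative):

    #include <string.h>
    #include <rte_mbuf.h>

    /* Build a small packet in a fresh mbuf taken from 'pool'. */
    static struct rte_mbuf *build_packet(struct rte_mempool *pool,
                                         const void *payload, uint16_t len)
    {
        struct rte_mbuf *m = rte_pktmbuf_alloc(pool);
        if (m == NULL)
            return NULL;

        /* Reserve 'len' bytes in the buffer body, after the headroom. */
        char *data = rte_pktmbuf_append(m, len);
        if (data == NULL) {          /* payload larger than the buffer */
            rte_pktmbuf_free(m);     /* the mbuf goes back to its pool */
            return NULL;
        }

        memcpy(data, payload, len);  /* headroom remains for extra headers */
        return m;
    }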
TRANSFORMING NETWORKING & STORAGE
45
Memory Buffer – chained mbuf
Mbufs generally used with memory pools
Size of mbuf fixed when the mempool is created
For packets too big for a single mbuf, the mbufs can be linked together in an
“mbuf chain”
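A minimal sketch of walking such a chain to total the payload (shown with the flattened field names; older DPDK releases keep next and data_len inside a nested pkt structure):

    #include <rte_mbuf.h>

    /* Sum the data bytes across every segment of a (possibly chained) mbuf. */
    static uint32_t chain_payload_bytes(const struct rte_mbuf *m)
    {
        uint32_t total = 0;
        const struct rte_mbuf *seg;

        for (seg = m; seg != NULL; seg = seg->next)
            total += seg->data_len;  /* bytes held in this segment */

        return total;                /* should match the mbuf's packet length */
    }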
TRANSFORMING NETWORKING & STORAGE
46
DDIO
TRANSFORMING NETWORKING & STORAGE
47
Data Direct I/O (DDIO)
• Ethernet controllers & NICs talk directly with CPU cache
• DDIO makes processor cache the primary source and
destination of I/O data, rather than main memory
• DDIO reduces latency, power consumption, and
memory bandwidth
• Lower latency – I/O data does not need to go via
main memory
• Lower power consumption – reduced memory
access
• More scalable I/O bandwidth – reduced memory
bottlenecks
TRANSFORMING NETWORKING & STORAGE
48
TRANSFORMING NETWORKING & STORAGE
49
DDIO requires no complex setup
• DDIO is enabled by default on all Romley platforms, including
pre-release platforms for OEMs, IHVs, and ISVs
− DDIO has been active throughout Intel and industry Romley
development and validation
• DDIO has no hardware dependencies
• DDIO is invisible to software
− No driver changes are required
− No OS or VMM changes are required
− No application changes are required
1 intro to_dpdk_and_hw

More Related Content

PDF
DPDK: Multi Architecture High Performance Packet Processing
PDF
DPDK in Containers Hands-on Lab
PDF
DPDK & Layer 4 Packet Processing
PPTX
Understanding DPDK
PDF
Intel dpdk Tutorial
PDF
Intel DPDK Step by Step instructions
PPTX
DPDK KNI interface
PPTX
Introduction to DPDK
DPDK: Multi Architecture High Performance Packet Processing
DPDK in Containers Hands-on Lab
DPDK & Layer 4 Packet Processing
Understanding DPDK
Intel dpdk Tutorial
Intel DPDK Step by Step instructions
DPDK KNI interface
Introduction to DPDK

What's hot (20)

PDF
What are latest new features that DPDK brings into 2018?
PDF
Enabling new protocol processing with DPDK using Dynamic Device Personalization
PDF
eBPF - Rethinking the Linux Kernel
PDF
The Linux Block Layer - Built for Fast Storage
PDF
BPF Internals (eBPF)
PPTX
Dpdk applications
PDF
DPDK In Depth
PDF
Using VPP and SRIO-V with Clear Containers
PDF
Computing Performance: On the Horizon (2021)
PPSX
FD.IO Vector Packet Processing
PDF
Network Programming: Data Plane Development Kit (DPDK)
PDF
The ideal and reality of NVDIMM RAS
PPTX
Linux Network Stack
PDF
ACPI Debugging from Linux Kernel
PPTX
Debug dpdk process bottleneck & painpoints
PDF
introduction to linux kernel tcp/ip ptocotol stack
ODP
SR-IOV Introduce
PPTX
The TCP/IP Stack in the Linux Kernel
PDF
Rootless Kubernetes
PPTX
Lec04 gpu architecture
What are latest new features that DPDK brings into 2018?
Enabling new protocol processing with DPDK using Dynamic Device Personalization
eBPF - Rethinking the Linux Kernel
The Linux Block Layer - Built for Fast Storage
BPF Internals (eBPF)
Dpdk applications
DPDK In Depth
Using VPP and SRIO-V with Clear Containers
Computing Performance: On the Horizon (2021)
FD.IO Vector Packet Processing
Network Programming: Data Plane Development Kit (DPDK)
The ideal and reality of NVDIMM RAS
Linux Network Stack
ACPI Debugging from Linux Kernel
Debug dpdk process bottleneck & painpoints
introduction to linux kernel tcp/ip ptocotol stack
SR-IOV Introduce
The TCP/IP Stack in the Linux Kernel
Rootless Kubernetes
Lec04 gpu architecture
Ad

Viewers also liked (20)

PDF
3 additional dpdk_theory(1)
PDF
5 pipeline arch_rationale
PDF
4 dpdk roadmap(1)
PDF
6 profiling tools
PDF
2 new hw_features_cat_cod_etc
PDF
7 hands on
PDF
8 intel network builders overview
PDF
5. hands on - building local development environment with Open Mano
PDF
6. hands on - open mano demonstration in remote pool of servers
PDF
4. open mano set up and usage
PDF
9 creating cent_os 7_mages_for_dpdk_training
PDF
Introduction to nfv movilforum
PDF
3. configuring a compute node for nfv
PDF
Introduction to Open Mano
ODP
Dpdk performance
PPTX
Understanding DPDK algorithmics
PPTX
The Need for Complex Analytics from Forwarding Pipelines
PDF
Intrucciones reto NFV/ Instruction to apply to nfv challenge
PDF
Bases legales reto NFV/ Nfv challenge terms
PPTX
Packet Framework - Cristian Dumitrescu
3 additional dpdk_theory(1)
5 pipeline arch_rationale
4 dpdk roadmap(1)
6 profiling tools
2 new hw_features_cat_cod_etc
7 hands on
8 intel network builders overview
5. hands on - building local development environment with Open Mano
6. hands on - open mano demonstration in remote pool of servers
4. open mano set up and usage
9 creating cent_os 7_mages_for_dpdk_training
Introduction to nfv movilforum
3. configuring a compute node for nfv
Introduction to Open Mano
Dpdk performance
Understanding DPDK algorithmics
The Need for Complex Analytics from Forwarding Pipelines
Intrucciones reto NFV/ Instruction to apply to nfv challenge
Bases legales reto NFV/ Nfv challenge terms
Packet Framework - Cristian Dumitrescu
Ad

Similar to 1 intro to_dpdk_and_hw (20)

PDF
DPDK Summit - 08 Sept 2014 - Intel - Networking Workloads on Intel Architecture
PDF
Crooke CWF Keynote FINAL final platinum
PDF
Lynn Comp - Intel Big Data & Cloud Summit 2013 (2)
PDF
Droidcon2013 x86phones weggerle_taubert_intel
PDF
Алексей Слепцов_"Интернет вещей. Что это и для чего"
PDF
Accelerating SparkML Workloads on the Intel Xeon+FPGA Platform with Srivatsan...
PDF
5 Cronin Steen - IOT Smart Cities
PDF
Intel Mobile Launch Information
PDF
High Memory Bandwidth Demo @ One Intel Station
PDF
Intel Public Roadmap for Desktop, Mobile, Data Center
PDF
Accelerate Ceph performance via SPDK related techniques
PDF
High Performance Computing: The Essential tool for a Knowledge Economy
PDF
INTEL CPU 3RD GEN.pdf variadas de computacion
PPTX
Internet of Things: Lightning Round, Sargent
PDF
Intel® QuickAssist Technology (Intel® QAT) and OpenSSL-1.1.0: Performance
PDF
AI & Computer Vision (OpenVINO) - CPBR12
PDF
Preparing the Data Center for the Internet of Things
PDF
Intel_Solid State Discs and Wireless Solutions in Embedded Devices
PDF
Intel HPC Update
PDF
Accelerating Apache Spark with Intel QuickAssist Technology
DPDK Summit - 08 Sept 2014 - Intel - Networking Workloads on Intel Architecture
Crooke CWF Keynote FINAL final platinum
Lynn Comp - Intel Big Data & Cloud Summit 2013 (2)
Droidcon2013 x86phones weggerle_taubert_intel
Алексей Слепцов_"Интернет вещей. Что это и для чего"
Accelerating SparkML Workloads on the Intel Xeon+FPGA Platform with Srivatsan...
5 Cronin Steen - IOT Smart Cities
Intel Mobile Launch Information
High Memory Bandwidth Demo @ One Intel Station
Intel Public Roadmap for Desktop, Mobile, Data Center
Accelerate Ceph performance via SPDK related techniques
High Performance Computing: The Essential tool for a Knowledge Economy
INTEL CPU 3RD GEN.pdf variadas de computacion
Internet of Things: Lightning Round, Sargent
Intel® QuickAssist Technology (Intel® QAT) and OpenSSL-1.1.0: Performance
AI & Computer Vision (OpenVINO) - CPBR12
Preparing the Data Center for the Internet of Things
Intel_Solid State Discs and Wireless Solutions in Embedded Devices
Intel HPC Update
Accelerating Apache Spark with Intel QuickAssist Technology

More from videos (14)

PDF
Logros y retos evento movilforum 02/2016
PPTX
Presentación Atlantida en Networking Day moviforum
PPTX
Presentación Quetal en Networking Day moviforum
PPTX
Presentación GMTECH en Networking Day moviforum
PDF
Presentación movilok en Networking Day moviforum
PPTX
Presentación 3G mobile en Networking Day moviforum
PPTX
Presentación microestrategy en Networking Day moviforum
PPTX
Presentación Telnet en Networking Day moviforum
PPTX
Presentación Alma technology en Networking Day movilforum
PPTX
Presentación acuerdo de colaboración Fieldeas y EasyOnPad en Networking Day m...
PPTX
Presentación Icar Vision en Networking Day movilforum
PDF
Presentación Billage en Networking Day movilforum
PPSX
Presentación Face On en Networking Day movilforum
PDF
Hp nfv movilforum as innovation engine for cs ps
Logros y retos evento movilforum 02/2016
Presentación Atlantida en Networking Day moviforum
Presentación Quetal en Networking Day moviforum
Presentación GMTECH en Networking Day moviforum
Presentación movilok en Networking Day moviforum
Presentación 3G mobile en Networking Day moviforum
Presentación microestrategy en Networking Day moviforum
Presentación Telnet en Networking Day moviforum
Presentación Alma technology en Networking Day movilforum
Presentación acuerdo de colaboración Fieldeas y EasyOnPad en Networking Day m...
Presentación Icar Vision en Networking Day movilforum
Presentación Billage en Networking Day movilforum
Presentación Face On en Networking Day movilforum
Hp nfv movilforum as innovation engine for cs ps

Recently uploaded (20)

PDF
KodekX | Application Modernization Development
PDF
Machine learning based COVID-19 study performance prediction
PPTX
PA Analog/Digital System: The Backbone of Modern Surveillance and Communication
PDF
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
PDF
Per capita expenditure prediction using model stacking based on satellite ima...
PDF
Approach and Philosophy of On baking technology
PDF
NewMind AI Monthly Chronicles - July 2025
PDF
CIFDAQ's Market Insight: SEC Turns Pro Crypto
PDF
Agricultural_Statistics_at_a_Glance_2022_0.pdf
PPTX
Understanding_Digital_Forensics_Presentation.pptx
PPTX
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
PDF
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
PPTX
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
PDF
How UI/UX Design Impacts User Retention in Mobile Apps.pdf
PDF
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
PDF
Electronic commerce courselecture one. Pdf
PDF
Reach Out and Touch Someone: Haptics and Empathic Computing
PDF
Mobile App Security Testing_ A Comprehensive Guide.pdf
PDF
cuic standard and advanced reporting.pdf
PDF
Building Integrated photovoltaic BIPV_UPV.pdf
KodekX | Application Modernization Development
Machine learning based COVID-19 study performance prediction
PA Analog/Digital System: The Backbone of Modern Surveillance and Communication
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
Per capita expenditure prediction using model stacking based on satellite ima...
Approach and Philosophy of On baking technology
NewMind AI Monthly Chronicles - July 2025
CIFDAQ's Market Insight: SEC Turns Pro Crypto
Agricultural_Statistics_at_a_Glance_2022_0.pdf
Understanding_Digital_Forensics_Presentation.pptx
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
How UI/UX Design Impacts User Retention in Mobile Apps.pdf
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
Electronic commerce courselecture one. Pdf
Reach Out and Touch Someone: Haptics and Empathic Computing
Mobile App Security Testing_ A Comprehensive Guide.pdf
cuic standard and advanced reporting.pdf
Building Integrated photovoltaic BIPV_UPV.pdf

1 intro to_dpdk_and_hw

  • 1. Intro to DPDK & HW Network Platforms Group
  • 2. TRANSFORMING NETWORKING & STORAGE 2 Legal Disclaimer INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS. Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information. The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order. Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm%20 Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, visit Intel Performance Benchmark Limitations All products, computer systems, dates and figures specified are preliminary based on current expectations, and are subject to change without notice. Celeron, Intel, Intel logo, Intel Core, Intel Inside, Intel Inside logo, Intel. Leap ahead., Intel. Leap ahead. logo, Intel NetBurst, Intel SpeedStep, Intel XScale, Itanium, Pentium, Pentium Inside, VTune, Xeon, and Xeon Inside are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. 
Intel® Active Management Technology requires the platform to have an Intel® AMT-enabled chipset, network hardware and software, as well as connection with a power source and a corporate network connection. With regard to notebooks, Intel AMT may not be available or certain capabilities may be limited over a host OS-based VPN or when connecting wirelessly, on battery power, sleeping, hibernating or powered off. For more information, see http://www.intel.com/technology/iamt. 64-bit computing on Intel architecture requires a computer system with a processor, chipset, BIOS, operating system, device drivers and applications enabled for Intel® 64 architecture. Performance will vary depending on your hardware and software configurations. Consult with your system vendor for more information. No computer system can provide absolute security under all conditions. Intel® Trusted Execution Technology is a security technology under development by Intel and requires for operation a computer system with Intel® Virtualization Technology, an Intel Trusted Execution Technology- enabled processor, chipset, BIOS, Authenticated Code Modules, and an Intel or other compatible measured virtual machine monitor. In addition, Intel Trusted Execution Technology requires the system to contain a TPMv1.2 as defined by the Trusted Computing Group and specific software for some uses. See http://www.intel.com/technology/security/ for more information. †Hyper-Threading Technology (HT Technology) requires a computer system with an Intel® Pentium® 4 Processor supporting HT Technology and an HT Technology-enabled chipset, BIOS, and operating system. Performance will vary depending on the specific hardware and software you use. See www.intel.com/products/ht/hyperthreading_more.htm for more information including details on which processors support HT Technology. Intel® Virtualization Technology requires a computer system with an enabled Intel® processor, BIOS, virtual machine monitor (VMM) and, for some uses, certain platform software enabled for it. Functionality, performance or other benefits will vary depending on hardware and software configurations and may require a BIOS update. Software applications may not be compatible with all operating systems. Please check with your application vendor. * Other names and brands may be claimed as the property of others. Other vendors are listed by Intel as a convenience to Intel's general customer base, but Intel does not make any representations or warranties whatsoever regarding quality, reliability, functionality, or compatibility of these devices. This list and/or these devices may be subject to change without notice. Copyright © 2013, Intel Corporation. All rights reserved.
  • 3. TRANSFORMING NETWORKING & STORAGE 3 Topics Why DPDK – PMD vs Linux interrupt driver, memory config, user space. Licensing Memory IA – NUMA, huge pages, TLBs on IA Memory DPDK – mem pools, buffers, allocation etc. Caching handling, DDIO
  • 4. TRANSFORMING NETWORKING & STORAGE 4 Intel® Data Plane Development Kit (Intel® DPDK) • Big Idea Software solution for accelerating Packet Processing workloads on IA. • Deployment Models • Performance • Commercial Support • Delivers 25X performance jump over Linux • Free, Open Source, BSD License • Comprehensive Virtualization support • Enjoys vibrant community support Concepts Code Commercial 1.1 28.5 0 10 20 30 Linux Intel® DPDK PerCoreL3 Performance (Mpps) Platform Disclaimer: Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
  • 5. TRANSFORMING NETWORKING & STORAGE 5 What Problems Does DPDK Address?
  • 6. TRANSFORMING NETWORKING & STORAGE 6 Packet Size 64 bytes 40G Packets/second 59.5 Million each way Packet arrival rate 16.8 ns 2 GHz Clock cycles 33 cycles Typical Server Packet SizesNetwork Infrastructure Packet Sizes Packet Size (B) Packetspersecond 0 10,000,000 20,000,000 30,000,000 40,000,000 50,000,000 60,000,000 70,000,000 64 128 192 256 320 384 448 512 576 640 704 768 832 896 960 1024 1088 1152 1216 1280 1344 1408 1472 What Problem Does DPDK address ? Packet Size 1024 bytes 40G Packets/second 4.8 Million each way Packet arrival rate 208.8 ns 2 GHz Clock cycles 417 cycles 40 Gbps Line Rate (or 4x10G) Rx Process Packet Tx
  • 7. TRANSFORMING NETWORKING & STORAGE 7 Typical Server Packet SizesNetwork Infrastructure Packet Sizes Packet Size (B) Packetspersecond 40 Gbps Line Rate (or 4x10G) Packet Size 1024 bytes 40G Packets/second 4.8 Million each way Packet arrival rate 208.8 ns 2 GHz Clock cycles 417 cycles 0 10,000,000 20,000,000 30,000,000 40,000,000 50,000,000 60,000,000 70,000,000 64 128 192 256 320 384 448 512 576 640 704 768 832 896 960 1024 1088 1152 1216 1280 1344 1408 1472 The Problem Intel® DPDK Addresses From a CPU perspective: • L3 cache hit latency is ~40 cycles • L3 miss, memory read is ~70ns (140 cycles at 2GHz) Intel® silicon and Intel® software advances are proactively addressing this problem statement Packet Size 64 bytes 40G Packets/second 59.5 Million each way Packet arrival rate 16.8 ns 2 GHz Clock cycles 33 cycles
  • 8. TRANSFORMING NETWORKING & STORAGE 8 Benefits – Eliminating / Hiding Overheads Interrupt Context Switch Overhead Kernel User Overhead Core To Thread Scheduling Overhead Eliminating Polling User Mode Driver Pthread Affinity How?
  • 9. TRANSFORMING NETWORKING & STORAGE 9 • DPDK is BSD licensed: • http://opensource.org/licenses/BSD-3-Clause • User is free to modify, copy and re-use code • No need to provide source code in derived software (unlike GPL license) Licensing
  • 10. TRANSFORMING NETWORKING & STORAGE 10 DPDK Packet Processing Concepts • DPDK is designed for high-speed packet processing on IA. This is achieved by optimizing the software libraries to IA with some of the following concepts • Huge Pages Cache alignment Ptheads with Affinity • Prefetching New Instructions NUMA • Intel® DDIO Memory Interleave Memory Channel • Intel® Data Direct I/O Technology (Intel® DDIO) • Enabled by default in all Intel® Xeon® processor E5-based platforms • Enables PCIe adapters to route I/O traffic directly to L3 cache, reducing unnecessary trips to system memory, providing more than double the throughput of previous-generation servers, while further reducing power consumption and I/O latency. • Pthreads • On startup of the DPDK specifies the cores to be used via the Pthread call with affinity to tie an application to a core. Reducing the kernel’s ability of moving the application to another local or remote core affecting performance. • The user may still use Ptheads or Fork calls after the DPDK has started to allow threads to float or multiple thread to be tied to a single core.
  • 11. TRANSFORMING NETWORKING & STORAGE 11 DPDK Packet Processing Concepts • NUMA • DPDK utilizes NUMA memory for allocation of resources to improve performance for processing and PCIe I/O local to a processor. • With out the NUMA set in a dual socket system memory is interleaved between the two sockets. • Huge Pages • DPDK utilizes 2M and 1G hugepages to reduce the case of TLB misses which can significantly affect a cores overall performance. • Cache Alignment • Better performance by aligning structures on 64 Byte cache lines. • Software Prefetching • Needs to be issued “appropriately” ahead of time to be effective. Too early could cause eviction before use • Allows cache to be populated before data is accessed • Memory channel use • Memory pools add padding to objects to ensure even use of memory channels • Number of channels specified at application start up
  • 12. TRANSFORMING NETWORKING & STORAGE 12 Memory configuration Intel Architecture
  • 13. TRANSFORMING NETWORKING & STORAGE 13 Memory – performance topics • NUMA architecture • Caching • TLBs • Huge pages • Memory allocation
  • 14. TRANSFORMING NETWORKING & STORAGE 14 Intel® Core™ Microarchitecture Platform Architecture Integrated Memory Controller • 4 DDR3 channels per socket • Massive memory bandwidth • Memory Bandwidth scales with # of processors • Very low memory latency Intel® QuickPath Interconnect (Intel® QPI) • New point-to-point interconnect • Socket to socket connections • Socket to chipset connections • Build scalable solutions IVB EP IVB EP PCH Significant performance leap for new platform
  • 15. TRANSFORMING NETWORKING & STORAGE 15 Non-Uniform Memory Access (NUMA) FSB architecture (legacy) • All memory in one location Starting with Intel® Core™ microarchitecture (Nehalem) • Memory located in multiple places Latency to memory dependent on location Local memory • Highest BW • Lowest latency Remote Memory • Higher latency IVB EP IVB EP PCH Ensure software is NUMA-optimized for best performance l
  • 16. TRANSFORMING NETWORKING & STORAGE 16 NUMA Considerations for Data Structure Allocation Intel® NIC PCH Core 0 I$ D$ Core 1 I$ D$ L2 Cache Core 2 I$ D$ Core 3 I$ D$ L2 Cache rx_queue 0 rx_queue 1 rx_queue 3 hash = (tcp->th_sport) ^ (tcp->th_dport) ^ (ip->ip_src.s_addr) ^ (ip->ip_dst.s_addr); hash = hash % PRIME_NUMBER; return lookup_table[hash]; DCA Memory Memory Memory Memory Memory Memory rx_queue 2 PTU Metrics • MEM_UNCORE_RETIRED.REMOTE_DRAM • MEM_INSTRUCTIONS_RETIRED.LATENCY_ABOVE_THRESHOLD DMI PCIe QPI
  • 17. TRANSFORMING NETWORKING & STORAGE 17 Caching on IA
  • 18. TRANSFORMING NETWORKING & STORAGE 18 • IA Processors have cache integrated on processor die. • Fast access SRAM • Code & data from system memory (DRAM) stored in fast access cache memory • Without a cache – CPU runs out of instructions from system memory • CPU Core “stalls” – waiting for data • Cache miss (data not in cache) • CPU needs to get data from system memory • Cache populated with required data • Not just the data required, but a block of info is copied • “Cache line” – 64 Bytes on IA (IVB, HSW etc.)  Cache hit – data present in cache Caching on IA
  • 19. TRANSFORMING NETWORKING & STORAGE 19 • Cache Consistency • Cache is a copy of a piece of memory • Needs to always reflect what is contained in system memory • Snoop • Cache watches address lines for transaction • Cache sees if any transactions access memory contained within cache • Cache keeps consistent with caches of other CPU cores • Dirty data • Data modified in cache but not in main memory • Stale data • Data modified in main memory, but not in cache Caching on IA – some terms
  • 20. TRANSFORMING NETWORKING & STORAGE 20 • 3 Levels of cache (SNB, IVB, HSW processors) • L1 cache – 32KB data and 32KB instruction caches • L2 cache – 256KB – unified (holds code & data) • L3 cache (LLC) – 25MB (IVB) , 30MB (HSW) common cache for all cores in CPU socket. • L1 cache is smallest, and fastest. • CPU tries to access data – not in L1 cache? • Try L2 cache - not in L2 cache? • Try L3 cache – not in L3 cache? • Cache miss - need to access system memory (DRAM). • L1 & L2 cache is per physical core (shared per logical core) • L3 cache is shared (per CPU socket) Caching on IA
  • 21. TRANSFORMING NETWORKING & STORAGE 21 • What can be cached? • Only DRAM can be cached • IO, MMIO never cached • L1 cache is smallest, and fastest. • L1 Code cache is read-only • Address residing in L1/L2 must be present in L3 cache – “inclusive cache” Caching on IA
  • 22. TRANSFORMING NETWORKING & STORAGE 22 Huge Pages
  • 23. TRANSFORMING NETWORKING & STORAGE 23 • All memory addresses virtual • Memory appears contiguous to applications, even if physically fragmented • Map virtual address to physical address • Use page tables to translate virtual address to physical address • Default page size in Linux on IA is 4kB. • 4 layers of page tables Huge Pages
  • 24. TRANSFORMING NETWORKING & STORAGE 24 Why Hugepages? 1 2 3 4 1 2 3 DTLB: • 4K pages 64 entries, maps 256 KB, so to access 16G of memory 32MB of PTE tables read by CPU • 2M pages 32 entries, maps 64 MB, so to access 16G of memory 64Kb of PDE tables read by CPU, fits into CPU cache One 2MB page = 512 of 4KB pages, 512 less page cross penalties Four memory accesses to get to the page data Three memory accesses to get to the page data TLB maps page numbers to page frames. Each TLB miss requires page walk.
  • 25. TRANSFORMING NETWORKING & STORAGE 25 • Use Linux hugepage support through “hugetlbfs” filesystem • Each page is 2MB in size equivalent to 512 4KB pages • Each page requires only 1 DTLB entry • Reduce DTLB misses, and therefore page walks • Gives improved performance • Need to enable & allocate huge pages with Linux boot command (in GRUB file) • Better to enable at boot time – prevents fragmentation in physical memory Huge Pages
  • 26. TRANSFORMING NETWORKING & STORAGE 26 Translation Lookaside Buffers TLBs – virtual to physical memory address translation Intel® 64 and IA-32 Architectures Software Developer’s Manual. Volume 3. System Programming Guide. Chapter 4.10: Caching Translation Information Intel® 64 and IA-32 Architectures Optimization Reference Manual.
  • 27. TRANSFORMING NETWORKING & STORAGE 27 Translation Lookaside Buffers (TLBs) • TLBs – Translation Lookaside Buffers – 2 types • Instruction TLB • Data TLB • TLB is cache – maps virtual memory to physical memory • When memory requested by application, OS maps virtual address from process to physical address in memory • Mapping of virtual to physical memory – Page Table Entry (PTE) • TLB is a cache for the Page Table • If data is found in TLB during address lookup • TLB hit • Otherwise – TLB miss (page walk) - performance hit • Huge pages (Linux) – can alleviate
  • 28. TRANSFORMING NETWORKING & STORAGE 28 Translation Lookaside Buffers (TLBs) • TLBs are a cache for page tables • If memory address lookup is not in TLB -> TLB miss • We must then “walk the page tables” • This is slow, and costly • We need to minimise TLB misses • Solution is to use huge pages • Use 2M or 1G huge pages instead of default 4k pages
  • 29. TRANSFORMING NETWORKING & STORAGE 29 TLB Invalidation • On multi-core systems one core may change the page table which is used by other cores • Page table change needs to be propagated to other cores TLBs • This process is known as “TLB shootdown” • Need to invalidate the TLBs to avoid using “stale” data • Need to be aware of other CPU cores invalidating TLBs • Costly for data plane applications. • Examples – page faults, VM transitions (VM exit & entry) • More info in section 4.10.4 of Volume 3A of Intel® 64 and IA-32 Architectures Software Developer’s Manual • https://www-ssl.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia- 32-architectures-software-developer-vol-3a-part-1-manual.pdf
  • 30. TRANSFORMING NETWORKING & STORAGE 30 IOTLBs • As well as TLBs for memory, there are TLBs for DMA – IOTLBs • Page table structure for DMA address translation • Sandy Bridge – no huge page support in IOTLBS – page table fragmentation • 2M and 1G huge pages fragmented to 4k page size • Causes more IOTLB misses • SNB could not achieve near 64 byte line rates for 10G NIC • Huge page support added in IVB • SR-IOV performance in IVB greatly enhanced
  • 31. TRANSFORMING NETWORKING & STORAGE 31 Large Page Table Support Reducing TLB and IOTLB misses with Large Page Table support MemoryExtended Page Tables Intel® VT-d IOTLB, translation cache NIC • Intel® Data Plane Development Kit (Intel® DPDK) utilizes Large Page tables to create large contiguous buffers Intel® Architecture Virtual Machine Monitor NIC Intel DPDK GPA HPA Forwarding Sample Code NIC Intel® Virtualization Technology for Directed I/O (Intel® VT-d)
  • 32. TRANSFORMING NETWORKING & STORAGE 32 Memory Virtualization Challenges VMM CPU0 VM0 VMn Guest Page Tables TLB Shadow Page Tables Memory Induced VM Exits Remap Address Translation • Guest OS expects contiguous, zero-based physical memory • VMM must preserve this illusion Page-table Shadowing • VMM intercepts paging operations • Constructs copy of page tables Overheads • VM exits add to execution time • Shadow page tables consume significant host memory Guest Page Tables
  • 33. TRANSFORMING NETWORKING & STORAGE 33 Memory Virtualization with EPT • Extended Page Tables (EPT) • Map guest physical addresses to host addresses • New hardware page-table walker (EPT walker) • Performance Benefit • Guest OS can modify its own page tables freely • Eliminates memory-induced VM exits • Memory Savings • Without EPT, shadow page tables are required for each guest user process • A single EPT supports the entire VM • [Diagram: VM0..VMn running on a VMM with Intel® VT-x and EPT on CPU0 – no VM exits for paging] • Intel® Virtualization Technology (Intel® VT) for IA-32, Intel® 64 and Intel® Architecture (Intel® VT-x)
  • 34. TRANSFORMING NETWORKING & STORAGE 34 DPDK Memory Configuration
  • 35. TRANSFORMING NETWORKING & STORAGE 35 Memory Object Hierarchy • [Diagram: physically contiguous 2MB pages are grouped into Memory Segments 0..N; each segment is divided into Memory Zones (e.g. RG_RX_RING_0, RG_TX_RING_0, MP_mbuf_pool, MALLOC_HEAP0); the zones back the higher-level objects – rings RX_RING_0 and TX_RING_0, the mbuf_pool memory pool, and the malloc heap]
  • 36. TRANSFORMING NETWORKING & STORAGE 36 Hugepages • Use Linux hugepage support through the “hugetlbfs” filesystem • Each page is 2MB in size, equivalent to 512 4KB pages • Each page requires only 1 DTLB entry • Reduces DTLB misses, and therefore page walks • Gives improved performance • [Memory object hierarchy diagram repeated from slide 35]
  • 37. TRANSFORMING NETWORKING & STORAGE 37 Memory Segments • The internal unit for memory management is the memory segment • Always backed by huge page (2 MB / 1 GB page) memory • Each segment is contiguous in physical and virtual memory • Broken out into smaller memory zones for individual objects • [Memory object hierarchy diagram repeated from slide 35]
  • 38. TRANSFORMING NETWORKING & STORAGE 38 Memory Zones • The most basic unit of memory allocation – a named block of memory • Allocate-only, cannot be freed • Cannot span a segment boundary – contiguous memory • The physical address of the allocated block is available to the caller • [Memory object hierarchy diagram repeated from slide 35]
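A minimal sketch of reserving and inspecting a memory zone with the DPDK memzone API (the zone name and size are illustrative; field names such as phys_addr match the DPDK releases of this era):

    #include <inttypes.h>
    #include <stdio.h>
    #include <rte_memory.h>
    #include <rte_memzone.h>

    /* Reserve a named, physically contiguous block from hugepage memory. */
    static const struct rte_memzone *
    reserve_example_zone(void)
    {
        const struct rte_memzone *mz;

        /* 1 MB zone on any NUMA socket; cannot be freed once reserved. */
        mz = rte_memzone_reserve("EXAMPLE_ZONE", 1 << 20, SOCKET_ID_ANY, 0);
        if (mz == NULL)
            return NULL;

        /* Both the virtual and the physical address of the block are available. */
        printf("zone %s: virt=%p phys=0x%" PRIx64 " len=%zu\n",
               mz->name, mz->addr, (uint64_t)mz->phys_addr, mz->len);
        return mz;
    }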
  • 39. TRANSFORMING NETWORKING & STORAGE 39 Malloc support – rte_malloc/rte_free • A malloc library is provided to allow easier application porting • Backed by one or more memzones • Uses hugepage memory, but supports freeing of memory • Not lock-free – avoid in the data path • Physical address information is not available per-allocation • [Memory object hierarchy diagram repeated from slide 35]
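A minimal sketch of the rte_malloc/rte_free pattern (the tag string and the flow_table structure are illustrative, not from the slides); because the allocator takes a lock, this belongs in control-path or initialisation code rather than the packet path:

    #include <stdint.h>
    #include <rte_malloc.h>
    #include <rte_memory.h>

    struct flow_table {
        uint32_t num_entries;
        /* ... */
    };

    /* Allocate a zeroed, cache-line-aligned structure from hugepage-backed heap memory. */
    static struct flow_table *
    flow_table_create(uint32_t num_entries)
    {
        /* "flow_table" is just a debug tag used by the allocator statistics. */
        struct flow_table *ft = rte_zmalloc("flow_table", sizeof(*ft),
                                            RTE_CACHE_LINE_SIZE);
        if (ft == NULL)
            return NULL;
        ft->num_entries = num_entries;
        return ft;
    }

    static void
    flow_table_destroy(struct flow_table *ft)
    {
        rte_free(ft);   /* unlike memzones, rte_malloc memory can be freed */
    }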
  • 40. TRANSFORMING NETWORKING & STORAGE 40 Memory Pools • A pool of fixed-size buffers • One pool can be safely shared among many threads • Lock-free allocation and freeing of buffers to/from the pool • Designed for fast-path use • [Memory object hierarchy diagram repeated from slide 35]
  • 41. TRANSFORMING NETWORKING & STORAGE 41 Memory Pools (continued) • [Diagram: a memory pool of packet buffers (60K x 2K buffers) and event buffers (2K x 100B buffers) shared by DPDK data plane cores C1–C4 on Processor 0, each driving 10G ports, with per-core cached buffers] • Size fixed at creation time: • Fixed size elements • Fixed number of elements • Multi-producer / multi-consumer safe • Safe for fast-path use • Typical usage is packet buffers • Optimized for performance: • No locking, uses CAS instructions • All objects cache aligned • Per-core caches to minimise contention / use of CAS instructions • Support for bulk allocation / freeing of buffers; see the sketch after this slide
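A minimal sketch of bulk allocation and freeing from a mempool, assuming a pool has already been created (the burst size of 32 is illustrative); thanks to the per-core cache, most of these bulk operations never touch the shared pool:

    #include <rte_mempool.h>

    #define BURST 32

    /* Grab and return a burst of fixed-size objects from a shared pool. */
    static int
    process_burst(struct rte_mempool *pool)
    {
        void *objs[BURST];

        /* Bulk get: served from this core's cache when possible, lock-free otherwise. */
        if (rte_mempool_get_bulk(pool, objs, BURST) != 0)
            return -1;              /* pool temporarily empty */

        /* ... fill and process the objects ... */

        /* Bulk put: objects go back to the per-core cache first. */
        rte_mempool_put_bulk(pool, objs, BURST);
        return 0;
    }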
  • 42. TRANSFORMING NETWORKING & STORAGE 42 Memory allocation – summary • For a DPDK application – all memory is allocated from huge pages • Allocate all memory at initialisation time (not during run time) • Pools of buffers are created • Buffers are taken from the pools as needed for packet processing • Returned to the pool after use • Never need to use “malloc” at runtime • DPDK takes care of aligning memory to cache lines
  • 43. TRANSFORMING NETWORKING & STORAGE 43 Memory allocation • rte_eal_init() • Initialises the Environment Abstraction Layer • Takes care of allocating memory from huge pages • rte_mempool_create() • Creates a pool of message buffers (mbufs) • This pool is used to hold packet data • mbufs are taken from and returned to this pool; see the sketch after this slide
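A minimal initialisation sketch, assuming a DPDK release that provides the rte_pktmbuf_pool_create() convenience wrapper around rte_mempool_create(); the pool name, element count, cache size and data-room size are illustrative:

    #include <stdlib.h>
    #include <rte_eal.h>
    #include <rte_debug.h>
    #include <rte_lcore.h>
    #include <rte_mbuf.h>
    #include <rte_mempool.h>

    #define NUM_MBUFS    8191   /* (2^n - 1) element counts suit the underlying ring */
    #define MBUF_CACHE   250    /* per-core cache size */

    int main(int argc, char **argv)
    {
        struct rte_mempool *mbuf_pool;

        /* EAL init: parses the EAL command-line options, maps hugepage memory,
         * and launches the worker lcores. */
        int ret = rte_eal_init(argc, argv);
        if (ret < 0)
            rte_exit(EXIT_FAILURE, "EAL initialisation failed\n");

        /* One-off, init-time creation of the packet buffer pool from hugepages. */
        mbuf_pool = rte_pktmbuf_pool_create("MBUF_POOL", NUM_MBUFS, MBUF_CACHE,
                                            0, RTE_MBUF_DEFAULT_BUF_SIZE,
                                            rte_socket_id());
        if (mbuf_pool == NULL)
            rte_exit(EXIT_FAILURE, "Cannot create mbuf pool\n");

        /* ... port setup and packet processing loops go here ... */
        return 0;
    }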
  • 44. TRANSFORMING NETWORKING & STORAGE 44 Memory Buffer – mbuf • The memory buffer structure used throughout the Intel® DPDK • The header holds meta-data about the packet and the buffer: • Buffer & packet length • Buffer physical address • RSS hash or Flow Director filter information • Offload flags • The body holds the packet data plus room for additional headers and footers; see the sketch after this slide
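A minimal sketch of reading mbuf meta-data and packet data on receive (field names such as pkt_len, data_len, hash.rss and ol_flags match the mbuf layout of this DPDK generation; the ether_hdr cast is an illustrative use of the older type name):

    #include <rte_mbuf.h>
    #include <rte_ether.h>

    /* Inspect one received packet buffer. */
    static void
    inspect_mbuf(struct rte_mbuf *m)
    {
        /* Meta-data kept in the mbuf header. */
        uint32_t pkt_len  = rte_pktmbuf_pkt_len(m);   /* total packet length   */
        uint16_t seg_len  = rte_pktmbuf_data_len(m);  /* data in this segment  */
        uint32_t rss_hash = m->hash.rss;              /* RSS hash from the NIC */
        uint64_t flags    = m->ol_flags;              /* offload flags         */

        /* Packet data lives in the mbuf body, after the headroom. */
        struct ether_hdr *eth = rte_pktmbuf_mtod(m, struct ether_hdr *);

        (void)pkt_len; (void)seg_len; (void)rss_hash; (void)flags; (void)eth;
    }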
  • 45. TRANSFORMING NETWORKING & STORAGE 45 Memory Buffer – chained mbuf • Mbufs are generally used with memory pools • The size of the mbuf is fixed when the mempool is created • For packets too big to fit in a single mbuf, multiple mbufs can be linked together in an “mbuf chain”; see the sketch after this slide
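A minimal sketch of walking the segments of a chained mbuf (not code from the slides); the first segment's pkt_len covers the whole chain, while each segment's data_len covers only its own data:

    #include <rte_mbuf.h>

    /* Walk every segment of a (possibly multi-segment) packet. */
    static uint32_t
    count_chain_bytes(struct rte_mbuf *head)
    {
        uint32_t total = 0;

        /* The segments are linked through the 'next' pointer; head->nb_segs
         * records how many there are. */
        for (struct rte_mbuf *seg = head; seg != NULL; seg = seg->next) {
            const uint8_t *data = rte_pktmbuf_mtod(seg, const uint8_t *);
            (void)data;                      /* process this segment's bytes here */
            total += rte_pktmbuf_data_len(seg);
        }

        /* For a consistent chain this equals rte_pktmbuf_pkt_len(head). */
        return total;
    }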
  • 46. TRANSFORMING NETWORKING & STORAGE 46 DDIO
  • 47. TRANSFORMING NETWORKING & STORAGE 47 Data Direct I/O (DDIO) • Ethernet controllers & NICs talk directly with the CPU cache • DDIO makes the processor cache the primary source and destination of I/O data, rather than main memory • DDIO reduces latency, power consumption, and memory bandwidth usage • Lower latency – I/O data does not need to go via main memory • Lower power consumption – reduced memory accesses • More scalable I/O bandwidth – reduced memory bottlenecks
  • 49. TRANSFORMING NETWORKING & STORAGE 49 DDIO requires no complex setup • DDIO is enabled by default on all Romley platforms, including pre-release platforms for OEMs, IHVs, and ISVs − DDIO has been active on all Intel and industry Romley development and validation platforms • DDIO has no hardware dependencies • DDIO is invisible to software − No driver changes are required − No OS or VMM changes are required − No application changes are required