ESXi Native Networking Driver Model - Delivering on
Simplicity and Performance
Margaret Petrus, VMware
TEX4759
Disclaimer
 This presentation may contain product features that are currently
under development.
 This overview of new technology represents no commitment from
VMware to deliver these features in any generally available
product.
 Features are subject to change, and must not be included in
contracts, purchase orders, or sales agreements of any kind.
 Technical feasibility and market demand will affect final delivery.
 Pricing and packaging for any new technologies or features
discussed or presented have not been determined.
Key Takeaways
1. The benefits of moving to the native driver model, with an overview of the different layers.
2. Jumpstart to build your own native driver.
3. The significant CPU savings achieved using the native model,
while retaining simplicity and supportability.
Agenda
 Overview of Native Model
 Module Components and Interactions
 Native Network Driver Deep Dive
 Building your driver in the native model
 Advanced Features
 Performance
 Summary
Overview of Native Model
Why Native Driver Model?
 Foundation to build new extensible features for ESXi hypervisor
 The increasing number of VMs in growing cloud deployments demands:
 Device driver robustness
 Best performance
 Better supportability, manageability, and debuggability
 Provide long term binary compatibility support
 Better flexibility and support for releasing new features in the networking, storage, and other areas
High-level Native Driver Model Overview
[Diagram: inside the vmkernel, the I/O subsystems, the Device Manager, and the Device Layer sit above the drivers; the Device Layer tracks the device and driver objects produced by the drivers. Legend: physical device, logical device, relationship.]
Module Components and Interactions

Quick Comparison with VMKLNX Model
[Diagram: two driver stacks side by side. Emulated Linux driver model: VM → I/O subsystems → vmkplexer → vmklinux → Linux driver, all inside the vmKernel. Native driver model: VM → I/O subsystems → Device Layer and Device Manager → ESXi/native drivers, inside the vmKernel.]
Layer Interactions in Native Model vs. Vmklinux Model
[Diagram: at user level, vmkdevmgr and vmkctl (with driver.map); in the kernel, the device layer with the PCI and ACPI buses and the IO subsystems (scsi, net). A PCI native driver attaches directly to the device layer, whereas in the vmklinux model a vmklnx_driver sits behind the vmklinux compatibility layer.]
Native Network Driver Deep Dive

High Level Native Networking Driver Model (using elxnet)
[Diagram: elxnet, the Emulex native driver for BE3 devices, attaches to the device layer (PCI, ACPI) and to the uplink module of the networking IO subsystem in the kernel; vmkdevmgr, vmkctl, and the elxnet_devices.py device descriptor live at user level.]
Native Networking Driver Module Interactions
 Module layer - Register/Unregister driver with module layer interface
• init_module()
• cleanup_module()
 Device Driver layer – Register with device driver interface
• Provide vmk_DriverProps and vmk_DriverOps
• Callbacks for:
 DriverAttachDevice(), DriverDetachDevice()
 DriverScanDevice(), DriverForgetDevice()
 DriverStartDevice(), DriverQuiesceDevice()
 PCI layer – Needed for PCI Config access, BAR mapping, SR-IOV, etc.
• vmk_PCIReadConfig(), vmk_PCIWriteConfig()
• vmk_PCIMapIOResource(), vmk_PCIUnmapIOResource()
 Uplink layer - Provides the access to the networking stack
• Driver has to interact with uplink directly for all operations
• Uplink registration results in logical child (vmnicX) creation
• Register networking HW capabilities and provide appropriate callbacks
 Management CLI - Support only via esxcli, not ethtool!
Module Layer
Module Layer: init_module()
 Key steps:
1. Register module with vmkernel via vmk_ModuleRegister().
2. Initialize driver name via vmk_NameInitialize().
3. Create heap via vmk_HeapCreate() and memory pool via
vmk_MemPoolCreate().
4. Register for driver logging via vmk_LogRegister().
5. Create lock domain for the module via vmk_LockDomainCreate().
6. Register the driver with the driver database via vmk_DriverRegister().
 This is where you register the driver properties, i.e. the device layer CB handlers.
static vmk_DriverOps elxnetDrvOps = {
.attachDevice = elxnet_attachDevice,
.detachDevice = elxnet_detachDevice,
.scanDevice = elxnet_scanDevice,
.startDevice = elxnet_startDevice,
.quiesceDevice = elxnet_quiesceDevice,
.forgetDevice = elxnet_forgetDevice,
};
static vmk_DriverProps elxnetDrvProps = {
.ops = &elxnetDrvOps,
};
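A minimal sketch of how these steps might be strung together, reusing the elxnetDrvOps/elxnetDrvProps definitions above. The saved handles, the VMKAPI_REVISION constant, and the exact argument lists are assumptions for illustration rather than the DDK's authoritative prototypes, and most error unwinding is omitted.

static vmk_ModuleID elxnetModuleID;   /* saved for later VMKAPI calls */
static vmk_Driver elxnetDriver;       /* handle filled in by vmk_DriverRegister() */

int
init_module(void)
{
   VMK_ReturnStatus status;

   /* 1. Register the module with the vmkernel. */
   status = vmk_ModuleRegister(&elxnetModuleID, VMKAPI_REVISION /* assumed macro */);
   if (status != VMK_OK) {
      return status;
   }

   /* 2.-5. Driver name, heap, memory pool, log component, and lock domain
    *       would be created here via vmk_NameInitialize(), vmk_HeapCreate(),
    *       vmk_MemPoolCreate(), vmk_LogRegister(), and vmk_LockDomainCreate(). */

   /* 6. Register the driver and its device-layer callbacks. */
   status = vmk_DriverRegister(&elxnetDrvProps, &elxnetDriver);
   if (status != VMK_OK) {
      vmk_ModuleUnregister(elxnetModuleID);
   }
   return status;
}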
Module Layer: cleanup_module()
The init_module() steps are undone in reverse order:
1. Unregister driver via vmk_DriverUnregister().
2. Destroy created lock domain via vmk_LockDomainDestroy().
3. Unregister driver log via vmk_LogUnregister().
4. Destroy heap via vmk_HeapDestroy().
5. Destroy memory pool via vmk_MemPoolDestroy().
6. Unregister module via vmk_ModuleUnregister().
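The matching teardown, again as an illustrative sketch: the handles are the ones created in init_module(), and the single-argument call forms are assumptions.

void
cleanup_module(void)
{
   vmk_DriverUnregister(elxnetDriver);          /* 1. */
   vmk_LockDomainDestroy(elxnetLockDomain);     /* 2. (hypothetical handle) */
   vmk_LogUnregister(elxnetLogHandle);          /* 3. (hypothetical handle) */
   vmk_HeapDestroy(elxnetHeapID);               /* 4. (hypothetical handle) */
   vmk_MemPoolDestroy(elxnetMemPool);           /* 5. (hypothetical handle) */
   vmk_ModuleUnregister(elxnetModuleID);        /* 6. */
}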
Device Layer
How does Native Driver claim its devices?
1. PCI bus drv scans PCI bus, detects PCI NICs, produces PCI NIC dev object.
2. Device Layer notifies Device Manager of device existence.
 Device Manager consults the PCI bus plugin to locate the driver
3. NIC Drv registers with Dev layer, providing CBs to claim PCI NIC dev object.
4. Device Manager binds NIC driver module to PCI NIC device object.
5. Device Layer calls NIC driver's AttachDevice callback:
 NIC driver claims PCI NIC device object
 NIC driver initializes hardware
6. Device Layer calls NIC driver's StartDevice callback
 NIC driver leaves quiesced state
7. Device Layer calls NIC driver's ScanDevice callback:
 NIC driver produces logical uplink device object.
8. Device Layer notifies Device Manager of logical device existence.
 Device Manager consults the Logical bus plugin to locate the driver
 Device manager binds the uplink device to the uplink driver
 attach, start, scan callbacks invoked for uplink device
9. NIC driver registers uplink capabilities in Uplink Registration callback.
10. NIC driver can start RX and the networking subsystem can start TX on this NIC.
Flow to claim NIC and make it IO-able
Sequence between the Device Layer, the NIC driver, and the networking subsystem:
1. Device Layer invokes vmk_DriverAttachDevice(vmk_PCIDevice); the NIC driver initializes the HW for IO.
2. Device Layer invokes vmk_DriverStartDevice().
3. Device Layer invokes vmk_DriverScanDevice(); the driver creates and registers the uplink child device via vmk_DeviceRegister(vmk_DeviceProps, vmkDev, &uplinkDev).
4. vmk_UplinkAssociate() asynchronously notifies the driver of the uplink for the device; the driver then calls vmk_UplinkCapRegister() to register each capability.
5. The networking subsystem calls vmk_UplinkStartIO(); the driver (1) arms interrupts in HW, (2) enables interrupts in the vmkernel, and (3) updates the uplink link status.
6. The uplink is now ready for Tx/Rx processing.
Device Layer: DriverAttachDevice()
 The attachDevice callback registered in vmk_DriverRegister() is invoked.
• The driver should start driving this device, get it ready for IO.
• If not capable of driving, return error and restore device to original state.
 What is done in this routine?
1. Allocate memory for driver data structures.
2. Invoke vmk_DeviceGetRegistrationData() to get PCI Device handle.
3. Invoke vmk_PCIQueryDeviceID() to validate driver can support this device.
4. Invoke vmk_PCIQueryDeviceAddr() to get the PCI Device Address.
5. Create the DMA engine via vmk_DMAEngineCreate() with right properties.
6. Map the bars via vmk_PCIMapIOResource() calls.
7. Initialize the HW and ensure that it comes up fine, else error out.
8. Setup stats collections, other driver specific stuff, etc.
9. Allocate interrupt vectors via vmk_PCIAllocIntrCookie() (w/ typeVec, numVec).
10. Create UplinkData – fills up the registration data ops and sharedData fields.
11. Do other controller setup and any other needed configurations.
12. Call vmk_DeviceSetAttachedDriverData() to associate drvPrivDataPtr with
vmk_Device handle.
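A skeleton of the attach callback named in elxnetDrvOps, loosely following the steps above. The elxnet_adapter type, the helper function, and the argument lists are illustrative assumptions (vmk_AddrCookie conversions and most error unwinding are elided); this is not Emulex's actual code.

static VMK_ReturnStatus
elxnet_attachDevice(vmk_Device device)
{
   elxnet_adapter *adapter;            /* hypothetical per-device state */
   vmk_PCIDevice pciDevice;
   VMK_ReturnStatus status;

   /* 1. Allocate the per-device structures from the driver heap. */
   adapter = vmk_HeapAlloc(elxnetHeapID, sizeof(*adapter));
   if (adapter == NULL) {
      return VMK_NO_MEMORY;
   }

   /* 2.-4. Fetch and validate the underlying PCI device. */
   status = vmk_DeviceGetRegistrationData(device, (vmk_AddrCookie *)&pciDevice);
   status = vmk_PCIQueryDeviceID(pciDevice, &adapter->pciID);
   status = vmk_PCIQueryDeviceAddr(pciDevice, &adapter->pciAddr);
   if (status != VMK_OK) {
      vmk_HeapFree(elxnetHeapID, adapter);
      return status;
   }

   /* 5.-11. DMA engine (vmk_DMAEngineCreate), BAR mappings
    *        (vmk_PCIMapIOResource), HW init, stats, interrupt cookies
    *        (vmk_PCIAllocIntrCookie), and the uplink registration data
    *        are set up here (driver-specific). */
   elxnet_setupHw(adapter, pciDevice);               /* hypothetical helper */

   /* 12. Remember the private data (drvPrivDataPtr) on the vmk_Device. */
   return vmk_DeviceSetAttachedDriverData(device, adapter);
}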
Device Layer: DriverStartDevice()
 Callback after successful attachDevice:
 Device will not be ready, i.e. not in IO-able state until this callback is done.
 Puts the device in an IO-able state.
 Can be invoked to place a device back in an IO-able state any time after
vmk_DriverQuiesceDevice() has explicitly put device in quiesced state.
 What does it do?
1. Get drvPrivDataPtr using vmk_DeviceGetAttachedDriverData().
2. Post rx fragments for all the Rx queues it supports.
3. Register interrupts allocated during uplink shared data creation:
• Register interrupts via vmk_IntrRegister().
• Set affinity via vmk_NetPollInterruptSet().
4. Create any worker threads as worlds via vmk_WorldCreate().
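A sketch of the start callback wiring up interrupts per Rx queue. The adapter fields, loop bounds, and the replenish helper are hypothetical; only the vmk_* names come from the steps above.

static VMK_ReturnStatus
elxnet_startDevice(vmk_Device device)
{
   elxnet_adapter *adapter;
   vmk_uint32 q;

   /* 1. Recover the private data stored during attach. */
   vmk_DeviceGetAttachedDriverData(device, (vmk_AddrCookie *)&adapter);

   for (q = 0; q < adapter->numRxQueues; q++) {
      /* 2. Replenish Rx descriptors for this queue (driver-specific). */
      elxnet_postRxFragments(adapter, q);               /* hypothetical */

      /* 3. Register the interrupt and bind it to the queue's netpoll. */
      vmk_IntrRegister(elxnetModuleID, adapter->intrCookies[q],
                       &adapter->intrProps[q]);
      vmk_NetPollInterruptSet(adapter->rxQueue[q].netPoll,
                              adapter->intrCookies[q]);
   }

   /* 4. Optional worker worlds would be spawned here via vmk_WorldCreate(). */
   return VMK_OK;
}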
Device Layer: DriverScanDevice()
 Invoked at least once after a device has been attached to a driver.
 May be invoked at other device hotplug events as appropriate.
 New devices may be registered from this callback only.
 Main Steps:
1. Find bus type of the PCI Device via vmk_BusTypeFind().
2. Create logical address via vmk_LogicalCreateBusAddress().
3. Register device with vmkernel via vmk_DeviceRegister() passing in the
vmk_DeviceProps structure.
typedef struct {
vmk_Driver registeringDriver;
vmk_DeviceID *deviceID; // VMK_UPLINK_DEVICE_IDENTIFIER
vmk_DeviceOps *deviceOps; // has callback .removeDevice
vmk_AddrCookie registeringDriverData; // holds drvPrivDataPtr
vmk_AddrCookie registrationData;
} vmk_DeviceProps;
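A sketch of the scan callback registering the logical uplink child, following the vmk_DeviceProps layout above. The elxnetUplinkDevOps instance, the vmk_DeviceID field assignments, and the argument lists (especially for vmk_LogicalCreateBusAddress()) are illustrative assumptions.

static VMK_ReturnStatus
elxnet_scanDevice(vmk_Device device)
{
   elxnet_adapter *adapter;
   vmk_DeviceProps props;
   vmk_DeviceID devID;

   vmk_DeviceGetAttachedDriverData(device, (vmk_AddrCookie *)&adapter);

   /* 1.-2. Build a logical bus address underneath the physical device. */
   vmk_BusTypeFind(&logicalBusName, &devID.busType);
   vmk_LogicalCreateBusAddress(elxnetDriver, device, 0,
                               &devID.busAddress, &devID.busAddressLen);
   devID.busIdentifier = VMK_UPLINK_DEVICE_IDENTIFIER;

   /* 3. Describe and register the uplink device with the vmkernel. */
   props.registeringDriver = elxnetDriver;
   props.deviceID = &devID;
   props.deviceOps = &elxnetUplinkDevOps;         /* supplies .removeDevice */
   props.registeringDriverData.ptr = adapter;     /* drvPrivDataPtr */
   props.registrationData.ptr = &adapter->uplinkRegData;

   return vmk_DeviceRegister(&props, device, &adapter->uplinkDevice);
}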
Device Layer: DriverForgetDevice()
 Notification callback from vmkernel
 To indicate device is no longer accessible
 Driver should no longer wait indefinitely on any device operation
 Driver must return success for any subsequent device callbacks
 vmk_DriverQuiesceDevice()
 vmk_DriverDetachDevice()
 Case-specific callback, surprise removal only, not always called
Device Layer: DriverQuiesceDevice()
 Callback places the device in a quiesced state:
 Prepare for operations like device removal, driver unload, or system shutdown.
 This callback indicates that driver should:
• Complete any IO on the device
• Flush any device caches to quiesce device
 Steps (reverse of StartDevice):
1. Get drvPrivDataPtr via vmk_DeviceGetAttachedDriverData().
2. Halt and destroy any worker threads created during StartDevice.
3. Handle all Tx completions.
4. Cleanup all Rx queues.
5. Unregister interrupts for all Rx queues:
• Invoke vmk_NetPollInterruptUnSet() to remove affinity.
• Invoke vmk_IntrUnregister() to unregister previously registered interrupt.
Device Layer: DriverDetachDevice()
 This is another handler passed in during vmk_DriverRegister() call.
• Driver should stop driving this device, and release its resources.
• Driver should not touch the device after this.
 Steps:
1. Get drvPrivDataPtr via vmk_DeviceGetAttachedDriverData().
2. Cleanup all the resources allocated for your interface:
 Destroy any queues allocated
 Notify HW that you are stopping all access
3. Cleanup UplinkData created and setup in DriverAttachDevice().
4. Release all interrupt vectors via vmk_PCIFreeIntrCookie().
5. Cleanup any memory allocated for driver structures from memory pool or heap.
6. Any other control path cleanup, e.g. destroying spinlocks or semaphores.
7. Unmap BARs via vmk_PCIUnmapIOResource().
8. Destroy created DMA Engine via vmk_DMAEngineDestroy().
9. Free up and clean out any other resources allocated.
Logical Uplink Layer
Uplink Layer Major Data Structures
 vmk_UplinkRegData – uplink registration data
 Driver responsible for allocating and populating this structure
 Pointer to this struct is stored in vmk_DeviceProps->registrationData
 vmk_UplinkOps – handler for basic uplink operations
 vmk_UplinkSharedData – data shared between uplink layer and NIC driver
 Allocated and initialized by driver
 Driver readable and writable
 Uplink layer readable only
 vmk_UplinkSharedQueueInfo – shared info for all queues between uplink
layer and driver
 vmk_UplinkSharedQueueData – shared data for a single queue
Uplink Layer: vmk_UplinkRegData
 Driver associates the following registration data to the vmk_Device
when creating the logical uplink:
typedef struct vmk_UplinkRegData {
vmk_revnum apiRevision; // VMKAPI version
vmk_ModuleID moduleID; // module ID of NIC drv
vmk_UplinkOps ops;
vmk_UplinkSharedData *sharedData; // Runtime data shared
// b/w kernel & driver
vmk_AddrCookie driverData; // Driver context data
} vmk_UplinkRegData;
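An illustrative population of this structure during DriverAttachDevice. The field names come from the struct above, while the elxnet-side variables and the VMKAPI_REVISION constant are assumptions.

static void
elxnet_initUplinkRegData(elxnet_adapter *adapter)
{
   vmk_UplinkRegData *regData = &adapter->uplinkRegData;

   regData->apiRevision    = VMKAPI_REVISION;      /* assumed revision macro */
   regData->moduleID       = elxnetModuleID;       /* NIC driver's module ID */
   regData->ops            = elxnetUplinkOps;      /* vmk_UplinkOps instance */
   regData->sharedData     = &adapter->sharedData; /* driver-owned, see below */
   regData->driverData.ptr = adapter;              /* handed back in callbacks */
}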
Uplink Layer: vmk_UplinkOps
 Structure containing function pointers for required driver operations.
 The functions are callbacks from the vmkernel into the NIC driver.
typedef struct vmk_UplinkOps {
vmk_UplinkTxCB uplinkTx; // Tx packet list CB
vmk_UplinkMTUSetCB uplinkMTUSet; // modify MTU CB
vmk_UplinkStateSetCB uplinkStateSet; // modify state CB
vmk_UplinkStatsGetCB uplinkStatsGet; // get stats CB
vmk_UplinkAssociateCB uplinkAssociate; // notify drv about assoc uplink
vmk_UplinkDisassociateCB uplinkDisassociate; // notify drv of disassoc uplink
vmk_UplinkCapEnableCB uplinkCapEnable; // cap enable CB
vmk_UplinkCapDisableCB uplinkCapDisable; // cap disable CB
vmk_UplinkStartIOCB uplinkStartIO; // start IO CB
vmk_UplinkQuiesceIOCB uplinkQuiesceIO; // quiesce all IO
vmk_UplinkResetCB uplinkReset; // reset issued uplink
} vmk_UplinkOps;
Uplink Layer: vmk_UplinkSharedData
 The vmk_UplinkRegData->sharedData points to a driver allocated data
structure shared between vmkernel and NIC driver:
typedef struct vmk_UplinkSharedData {
vmk_VersionedAtomic lock; // ensure snapshot consistency
vmk_UplinkFlags flags; // uplink flags
vmk_UplinkState state; // uplink state
vmk_LinkStatus link; // uplink link status
vmk_uint32 mtu; // uplink mtu
vmk_EthAddress macAddr; // current logical MAC
vmk_EthAddress hwMacAddr; // permanent HW MAC
vmk_UplinkSupportedMode *supportedModes;
vmk_uint32 supportedModesArraySz;
vmk_UplinkDriverInfo driverInfo; // driver info
vmk_UplinkSharedQueueInfo *queueInfo; // shared qinfo
} vmk_UplinkSharedData;
Uplink Layer: vmk_UplinkSharedQueueInfo
 Defines uplink level shared queue info for all queues
 For the queueData field, drivers need to populate at least one queue
even if they do not support multiple queues.
vmk_UplinkSharedQueueInfo {
vmk_UplinkQueueType supportedQueueTypes;
vmk_UplinkQueueFilterClass supportedRxQueueFilterClasses;
vmk_UplinkQueueID defaultRxQueueID;
vmk_UplinkQueueID defaultTxQueueID;
vmk_uint32 maxRxQueues;
vmk_uint32 maxTxQueues;
vmk_uint32 activeRxQueues;
vmk_uint32 activeTxQueues;
vmk_BitVector *activeQueues;
vmk_uint32 maxTotalDeviceFilters;
vmk_UplinkSharedQueueData *queueData;
} vmk_UplinkSharedQueueInfo;
Uplink Layer: vmk_UplinkSharedQueueData
 Contains all the info about one specific Tx or Rx queue.
 This struct is shared with uplink layer.
typedef struct vmk_UplinkSharedQueueData {
volatile vmk_UplinkQueueFlags flags;
vmk_UplinkQueueType type;
vmk_UplinkQueueID qid;
volatile vmk_UplinkQueueState state;
vmk_UplinkQueueFeature supportedFeatures;
vmk_UplinkQueueFeature activeFeatures;
vmk_uint32 maxFilters;
vmk_uint32 activeFilters;
vmk_NetPoll poll; // associated netPoll context
vmk_DMAEngine dmaEngine; // associated dma engine
vmk_UplinkQueuePriority priority; // tx queue priority
vmk_UplinkCoalesceParams coalesceParams;
} vmk_UplinkSharedQueueData;
Creation of UplinkSharedData during DriverAttachDevice
 Create/initialize sharedData area:
 sharedData has a versioned atomic (not a spinlock)
 Uplink layer can only read from this area
 Driver can read/write to this area
 Driver needs to define its own spinlock for writer serialization
 Shared Data:
1. Supported speed/duplex modes to be advertised to uplink.
2. Current MTU setting, and link/speed/duplex states.
3. Queue info (numQ, supported queue types, supported filter classes).
4. Rx and Tx queue fields (flags, type, state, supportedFeatures, dmaEngine,
maxFilters).
5. netPoll for each Rx queue via vmk_NetPollCreate().
6. Allocated default Rx and Tx queues (not yet activated).
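Because the uplink layer reads this area under the versioned atomic while only the driver writes it, driver-side updates are typically wrapped as below. The vmk_VersionedAtomicBeginWrite()/EndWrite() names, the spinlock calls, and the adapter fields are assumptions used purely to illustrate the serialization just described.

static void
elxnet_updateSharedLink(elxnet_adapter *adapter, vmk_LinkStatus *newLink)
{
   /* Driver-private spinlock serializes concurrent writers. */
   vmk_SpinlockLock(adapter->sharedDataLock);

   /* The versioned atomic lets readers detect a torn snapshot and retry. */
   vmk_VersionedAtomicBeginWrite(&adapter->sharedData.lock);
   adapter->sharedData.link = *newLink;
   vmk_VersionedAtomicEndWrite(&adapter->sharedData.lock);

   vmk_SpinlockUnlock(adapter->sharedDataLock);
}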
Uplink Layer: uplinkStartIO() Callback
1. Arm the interrupts (link, multiQ, etc) in the HW.
2. Configure for VLAN filtering as needed
3. Change internal driver state to IO-able.
4. Set configured flow control.
5. Now, enable interrupts in vmkernel via vmk_IntrEnable().
6. Check for link status changes, update sharedData and invoke
vmk_UplinkUpdateLinkState() as needed.
Uplink Layer: uplinkQuiesceIO() Callback
1. Check if IO is already quiesced due to possible failures.
2. Disarm interrupts.
3. Disable netpoll via vmk_NetPollDisable() and vmk_NetPollFlushRx().
4. Mark link state as down via vmk_UplinkUpdateLinkState().
5. Stop all Tx queues
6. Sync all vectors via vmk_IntrSync().
7. Disable all vectors via vmk_IntrDisable().
8. Change internal driver state to quiesced.
Register NIC capabilities to Uplink Layer
 Handled when uplinkAssociateCB() is invoked to associate uplink with the
device.
 Call vmk_UplinkCapRegister() to register each capability.
 Two capability types:
 No callbacks needed:
 VMK_UPLINK_CAP_IPV4_CSO
 VMK_UPLINK_CAP_VLAN_RX_STRIP
 Capabilities that require callbacks:
 VMK_UPLINK_CAP_MULTI_QUEUE
 VMK_UPLINK_CAP_COALESCE_PARAMS
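A sketch of the associate callback registering both kinds of capability. The ops instances are hypothetical, the third argument may be wrapped in a vmk_AddrCookie in the real DDK, and error handling is omitted.

static VMK_ReturnStatus
elxnet_uplinkAssociate(vmk_AddrCookie driverData, vmk_Uplink uplink)
{
   elxnet_adapter *adapter = driverData.ptr;

   adapter->uplink = uplink;      /* keep the handle for later link updates */

   /* Capabilities that need no callbacks take no ops data. */
   vmk_UplinkCapRegister(uplink, VMK_UPLINK_CAP_IPV4_CSO, NULL);
   vmk_UplinkCapRegister(uplink, VMK_UPLINK_CAP_VLAN_RX_STRIP, NULL);

   /* Capabilities with callbacks hand in their ops structures. */
   vmk_UplinkCapRegister(uplink, VMK_UPLINK_CAP_MULTI_QUEUE, &elxnetQueueOps);
   vmk_UplinkCapRegister(uplink, VMK_UPLINK_CAP_COALESCE_PARAMS,
                         &elxnetCoalesceOps);
   return VMK_OK;
}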
Examples of Capabilities with Callbacks
 Callback Ops for VMK_UPLINK_CAP_COALESCE_PARAMS:
typedef struct vmk_UplinkCoalesceParamsOps {
vmk_UplinkCoalesceParamsGetCB getParams;
vmk_UplinkCoalesceParamsSetCB setParams;
} vmk_UplinkCoalesceParamsOps;
 Callback Ops for VMK_UPLINK_CAP_PRIV_STATS:
typedef struct vmk_UplinkPrivStatsOps {
vmk_UplinkPrivStatsLengthGetCB privStatsLengthGet;
vmk_UplinkPrivStatsGetCB privStatsGet;
} vmk_UplinkPrivStatsOps;
Interrupt/Netpoll Handling
 Registering interrupts:
 vmk_IntrProps is populated and passed to vmkernel in DriverStartDevice().
 Driver Ack handler:
 Ack interrupt to HW if needed (INTx)
 Increment interrupt counter
 Driver Isr handler:
 Handle any queue notifications as needed
 Activate the netpoll for the particular queue via vmk_NetPollActivate()
 Driver NetPoll Callback Handler:
 Handle any Tx, Rx, or Ctrl events
 If there is work but budget exceeded, remain in poll mode & return VMK_TRUE
 If no more work, go back to interrupt mode and return VMK_FALSE
typedef struct vmk_IntrProps {
vmk_Device device;
vmk_Name deviceName;
vmk_IntrAcknowledge acknowledgeInterrupt; // driver ack handler
vmk_IntrHandler handler; // driver isr handler
void *handlerData;
vmk_uint64 attrs;
} vmk_IntrProps;
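A sketch of the interrupt and netpoll handlers referenced by vmk_IntrProps. The handler signatures, the budget accounting, and the ring-processing helper are assumptions; the activate call and the VMK_TRUE/VMK_FALSE return convention follow the bullets above.

static void
elxnet_intrHandler(void *handlerData, vmk_IntrCookie intrCookie)
{
   elxnet_rxQueue *rxq = handlerData;   /* hypothetical per-queue context */

   /* Defer the real work to the queue's netpoll context. */
   vmk_NetPollActivate(rxq->netPoll);
}

static vmk_Bool
elxnet_netPollCB(void *priv, vmk_uint32 budget)
{
   elxnet_rxQueue *rxq = priv;
   vmk_uint32 done = elxnet_processRxRing(rxq, budget);  /* hypothetical */

   if (done >= budget) {
      return VMK_TRUE;    /* work remains: stay in poll mode */
   }
   return VMK_FALSE;      /* ring drained: go back to interrupt mode */
}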
Packet Management VMKAPIs in the Tx/Rx Path
 Basic allocation, release and field manipulation:
• vmk_PktAlloc()
• vmk_PktRelease()
• vmk_PktReleasePanic()
• vmk_PktFrameLenGet()
• vmk_PktFrameLenSet()
• vmk_PktTrim()
• vmk_PktPartialCopy()
 SG Handling:
• vmk_PktSgArrayGet()
• vmk_PktSgElemGet()
• vmk_PktFrameMappedPointerGet()
• vmk_PktIsBufDescWritable()
 Processing the sent down packet list:
• vmk_PktListIterStart()
• vmk_PktListIterIsAtEnd()
• vmk_PktListGetFirstPkt()
• vmk_PktListIterInsertPktBefore()
• vmk_PktListIterRemovePkt()
• vmk_PktListAppendPkt()
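A small Rx-side example using the allocation and frame-length APIs above. vmk_PktAlloc()'s exact argument order is an assumption, and the surrounding descriptor handling is driver-specific and omitted.

static vmk_PktHandle *
elxnet_buildRxPkt(elxnet_rxQueue *rxq, vmk_uint32 frameLen)
{
   vmk_PktHandle *pkt = NULL;

   /* Allocate a packet large enough for the received frame. */
   if (vmk_PktAlloc(frameLen, &pkt) != VMK_OK) {
      return NULL;
   }

   /* Copy/DMA the frame into the packet's buffer (driver-specific), then
    * record the actual frame length before handing it up the stack. */
   vmk_PktFrameLenSet(pkt, frameLen);
   return pkt;
}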
Packet Management VMKAPIs in the Tx/Rx Path
 Parse/Find the different layer headers:
• vmk_PktHeaderL2Find()
• vmk_PktHeaderL3Find()
• vmk_PktHeaderEntryGet()
• vmk_PktHeaderDataGet()
• vmk_PktHeaderDataRelease()
• vmk_PktHeaderLength()
 Offload Handling:
• vmk_PktIsMustCsum()
• vmk_PktSetCsumVfd()
• vmk_PktIsLargeTcpPacket()
• vmk_PktGetLargeTcpPacketMss()
 VLAN Handling:
• vmk_PktMustVlanTag()
• vmk_PktVlanIDGet()
• vmk_PktVlanIDSet()
• vmk_PktPriorityGet()
• vmk_PktPrioritySet()
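A Tx-side example of the offload and VLAN predicates above being folded into a hardware descriptor. The elxnet_txDesc layout and the 3-bit priority shift are illustrative assumptions.

static void
elxnet_fillTxOffloads(vmk_PktHandle *pkt, elxnet_txDesc *desc)
{
   /* Checksum offload requested by the stack? */
   if (vmk_PktIsMustCsum(pkt)) {
      desc->csumOffload = VMK_TRUE;
   }

   /* TSO: program the MSS carried on this large TCP packet. */
   if (vmk_PktIsLargeTcpPacket(pkt)) {
      desc->mss = vmk_PktGetLargeTcpPacketMss(pkt);
   }

   /* Hardware VLAN insertion using the packet's VLAN ID and priority. */
   if (vmk_PktMustVlanTag(pkt)) {
      desc->vlanTag = vmk_PktVlanIDGet(pkt) |
                      (vmk_PktPriorityGet(pkt) << 13);
   }
}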
Advanced Features
 MultiQueue Handling
 SR-IOV
 VXLAN Offload
 Dynamic Load Balancing
Multi-Queue Support
 Register multi-queue support via VMK_UPLINK_CAP_MULTI_QUEUE
 The following callbacks are passed to the uplink layer when registering this capability:
typedef struct vmk_UplinkQueueOps {
vmk_UplinkQueueAllocCB queueAlloc;
vmk_UplinkQueueAllocWithAttrCB queueAllocWithAttr;
vmk_UplinkQueueReallocWithAttrCB queueReallocWithAttr;
vmk_UplinkQueueFreeCB queueFree;
vmk_UplinkQueueQuiesceCB queueQuiesce;
vmk_UplinkQueueStartCB queueStart;
vmk_UplinkQueueFilterApplyCB queueApplyFilter;
vmk_UplinkQueueFilterRemoveCB queueRemoveFilter;
vmk_UplinkQueueStatsGetCB queueGetStats;
vmk_UplinkQueueFeatureToggleCB queueToggleFeature;
vmk_UplinkQueueTxPrioritySetCB queueSetPriority;
vmk_UplinkQueueCoalesceParamsSetCB queueSetCoalesceParams;
} vmk_UplinkQueueOps;
Multi-Queue VMKAPIs in the Tx/Rx path
 Refer to vmkapi_net_queue.h
 Main list of APIs for implementing multi-queue support:
• vmk_UplinkQueueMkFilterID()
• vmk_UplinkQueueMkTxQueueID()
• vmk_UplinkQueueMkRxQueueID()
• vmk_UplinkQueueIDVal()
• vmk_UplinkQueueIDType()
• vmk_UplinkQueueFilterIDVal()
• vmk_UplinkQueueIDUserVal()
• vmk_UplinkQueueSetQueueIDUserVal()
• vmk_UplinkQueueIDQueueDataIndex()
• vmk_UplinkQueueSetQueueIDQueueDataIndex()
• vmk_UplinkQueueGetNumQueuesSupported()
• vmk_UplinkQueueStart()
• vmk_UplinkQueueStop()
• vmk_PktQueueIDGet()
• vmk_PktQueueIDSet()
SR-IOV Support
 Setup VFs:
• During DriverAttachDevice(), if SR-IOV is supported by device:
 Enable VFs via vmk_PCIEnableVFs()
• During DriverScanDevice(), driver
 Registers its VFs via vmk_PCIRegisterVF() passing along its .removeVF callback
static vmk_PCIVFDeviceOps elxnetVFDevOps = {
.removeVF = elxnet_removeVFDevice
};
 Sets control callback for VF w/ vmkernel via vmk_PCISetVFPrivateData()
 Cleanup VFs:
• The .removeVF callback supplied at VF registration is called:
 vmk_PCIUnregisterVF() invoked to unregister particular VF from vmkernel
• DriverDetachDevice() should call vmk_PCIDisableVFs() to disable all its VFs.
 Misc VF vmkapi:
• vmk_PCIGetVFPCIDevice() should be used during VF registration to get the
vmk_PCIDevice handle of a PCI VF given its parent PF and VF index.
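The attach/scan-time VF handling condensed into one illustrative routine. The argument lists, the requestedVFs handling, and the adapter fields are assumptions; in a real driver the enable and register steps live in the two callbacks noted in the comments.

static VMK_ReturnStatus
elxnet_setupVFs(elxnet_adapter *adapter, vmk_uint16 requestedVFs)
{
   vmk_uint16 vf;

   /* DriverAttachDevice(): enable the requested number of VFs. */
   if (vmk_PCIEnableVFs(adapter->pciDevice, &requestedVFs) != VMK_OK) {
      return VMK_FAILURE;
   }

   /* DriverScanDevice(): register each VF and set its private data. */
   for (vf = 0; vf < requestedVFs; vf++) {
      vmk_PCIDevice vfDev;
      vmk_AddrCookie cookie;

      vmk_PCIGetVFPCIDevice(adapter->pciDevice, vf, &vfDev);
      vmk_PCIRegisterVF(vfDev, adapter->pciDevice, elxnetDriver,
                        &elxnetVFDevOps);
      cookie.ptr = adapter;
      vmk_PCISetVFPrivateData(vfDev, cookie);
   }
   return VMK_OK;
}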
VXLAN Offload Support
 Register vxlan offload capability via VMK_UPLINK_CAP_ENCAP_OFFLOAD
 Callback Ops for VMK_UPLINK_CAP_ENCAP_OFFLOAD:
 If supporting RX_VXLAN filter, indicate in supportedRxQueueFilterClasses
vmk_UplinkSharedQueueInfo->supportedRxQueueFilterClasses |=
VMK_UPLINK_QUEUE_FILTER_CLASS_VXLAN;
 Packet parser APIs to get information on inner encapsulated headers:
• vmk_PktHeaderEncapFind()
• vmk_PktHeaderEncapL2Find()
• vmk_PktHeaderEncapL3Find()
• vmk_PktHeaderEncapL4Find()
typedef struct vmk_UplinkEncapOffloadOps {
/** Handler used by vmkernel to notify VXLAN port number updated */
vmk_UplinkVXLANPortUpdateCB vxlanPortUpdate;
} vmk_UplinkEncapOffloadOps;
Dynamic Load Balancing
 New NetQ feature introduced in ESXi 5.5 release:
• VMKNETDDI_QUEUEOPS_QUEUE_FEAT_DYNAMIC
 NIC requirements to support this feature:
• Device able to support different NetQ "features" on any particular NetQ
• Adding or removing support for a particular NetQ feature must not require any critical operations
 If NIC driver registers DYNAMIC feature support, load balancer can/will
• Move filters between queues (i.e. bin-packing of filters), hence reducing the number
of queues in use
• Unpack filters to more queues, either for latency-sensitive VMs or to reduce the burden on oversaturated queues
Performance

Throughput in Gbps on a 16VM Configuration
          Tx (256B)   Rx (256B)   Tx (64KB)   Rx (64KB)
be2net      3.00        2.97        9.40        9.40
elxnet      3.03        3.02        9.41        9.40
Overall CPU Gains on a 16VM Configuration
          Tx CPU Util (256B)   Rx CPU Util (256B)   Tx CPU Util (64KB)   Rx CPU Util (64KB)
be2net         320.89               335.32                29.45                55.40
elxnet         282.56               307.15                29.34                52.04
Savings         12%                   8%                    –                    6%
Vmkernel Cost Savings on a 16VM Configuration
          Tx CPU Util (256B)   Rx CPU Util (256B)   Tx CPU Util (64KB)   Rx CPU Util (64KB)
be2net         137.92               132.75                 8.06                26.17
elxnet          89.50                96.29                 7.03                21.34
Savings         35%                  27%                   13%                  18%
Total Mean Ping Response Time (usec) on a 16VM Config
          128b      256b      512b
be2net   134.19    126.82    130.04
elxnet   133.39    116.23    122.41
Reduced by 1%, 8%, and 6% respectively
Getting Started on the Native Driver…
 Go to https://developercenter.vmware.com/group/iovp/certs/5.5/dev-kits for
1. Native DDK Developer Guide
2. Needed toolchain RPMs
 vmware-esx-common-toolchain
 vmware-esx-kmdk-psa-toolchain
3. Vib-Suite RPM
 vmware-esx-vib-suite-5.5.0-0.0.xxxxxxx.i386.rpm
4. Vmkapi DDK RPM:
 vmware-esx-vmkapiddk-devtools-5.5.0-0.0.xxxxxxx.i386.rpm
Summary
 A layered model approach with easy extensibility for new features
 Overview of native model
 Interaction of driver with different layers
 Basic structs and handlerOps for different layers
 Native model does not use the vmklinux compatibility layer
 A layer of indirection completely removed
 Translations (e.g. pkt<->skb) avoided
o Allocation of skbs is not needed
o Savings in avoiding slab allocation (esp. at high packet rates)
 Driver communicates directly with various layers
 Performance boost from CPU savings
 New IO features for ESXi will only be developed for native model.
Questions?
Contact your VMware PM for more details on native model support and for the devkits.
Other VMware Activities Related to This Session
 HOL:
HOL-SDC-1302
vSphere Distributed Switch from A to Z
TAP Membership Renewal – Great Benefits
• TAP Access membership includes:
• New TAP Access NFR Bundle
• Access to NDA Roadmap sessions at VMworld, PEX and Onsite/Online
• VMware Solution Exchange (VSX) and Partner Locator listings
• VMware Ready logo (ISVs)
• Partner University and other resources in Partner Central
• TAP Elite includes all of the above plus:
• 5X the number of licenses in the NFR Bundle
• Unlimited product technical support
• 5 instances of SDK Support
• Services Software Solutions Bundle
• Annual Fees
• TAP Access - $750
• TAP Elite - $7,500
• Send email to tapalliance@vmware.com
TAP Resources
 TAP
• TAP support: 1-866-524-4966
• Email: tapalliance@vmware.com
• Partner Central: http://www.vmware.com/partners/partners.html
 TAP Team
• Kristen Edwards – Sr. Alliance Program Manager
• Sheela Toor – Marketing Communication Manager
• Michael Thompson – Alliance Web Application Manager
• Audra Bowcutt –
• Ted Dunn –
• Dalene Bishop – Partner Enablement Manager, TAP
 VMware Solution Exchange
• Marketplace support – vsxalliance@vmware.com
• Partner Marketplace @ VMware booth pod TAP1
THANK YOU
ESXi Native Networking Driver Model - Delivering on Simplicity and Performance
Margaret Petrus, VMware
TEX4759