Copyright © 2018, Oracle and/or its affiliates. All rights reserved.
Xenwatch Multithreading
Dongli Zhang
Principal Member of Technical Staff
Oracle Linux
http://donglizhang.org
domU creation failure: problem
Reported by: https://lists.xenproject.org/archives/html/xen-devel/2016-06/msg00195.html
# xl create hvm.cfg
Parsing config from hvm.cfg
libxl: error: libxl_device.c:1080:device_backend_callback: Domain 2:unable to add device with path /local/domain/0/backend/vbd/2/51712
libxl: error: libxl_create.c:1290:domcreate_launch_dm: Domain 2:unable to add disk devices
libxl: error: libxl_device.c:1080:device_backend_callback: Domain 2:unable to remove device with path /local/domain/0/backend/vbd/2/51712
libxl: error: libxl_domain.c:1097:devices_destroy_cb: Domain 2:libxl__devices_destroy failed
libxl: error: libxl_domain.c:1000:libxl__destroy_domid: Domain 2:Non-existant domain
libxl: error: libxl_domain.c:959:domain_destroy_callback: Domain 2:Unable to destroy guest
libxl: error: libxl_domain.c:886:domain_destroy_cb: Domain 2:Destruction of domain failed
Reproduced by: http://donglizhang.org/xenwatch-stall-vif.patch
domU creation failure: observation
dom0# xl list
Name         ID   Mem VCPUs   State    Time(s)
Domain-0      0   799     4   r-----      50.2
(null)        2     0     2   --p--d      24.8
● incomplete prior domU destroy
● stalled xenwatch thread in ‘D’ state
● xenwatch hangs at kthread_stop()
dom0# ps 38
PID TTY      STAT   TIME COMMAND
 38 ?        D      0:00 [xenwatch]
dom0# cat /proc/38/stack
[<0>] kthread_stop
[<0>] xenvif_disconnect_data
[<0>] set_backend_state
[<0>] frontend_changed
[<0>] xenwatch_thread
[<0>] kthread
[<0>] ret_from_fork
[<0>] 0xffffffff
domU creation failure: cause
# ethtool -S vif1.0
NIC statistics:
rx_gso_checksum_fixup: 0
tx_zerocopy_sent: 72518
tx_zerocopy_success: 0
tx_zerocopy_fail: 72517
tx_frag_overflow: 0
static bool
xenvif_dealloc_kthread_should_stop(struct xenvif_queue *queue)
{
	/* Dealloc thread must remain running until all inflight
	 * packets complete. */
	return kthread_should_stop() &&
		!atomic_read(&queue->inflight_packets);
}
● vif1.0-q0-dealloc thread cannot stop
● remaining inflight packets on netback vif
● vif1.0 statistics: sent > success + fail
● sk_buff on hold by other kernel components!
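The "sent > success + fail" observation can be checked with a little arithmetic. The following is a minimal userspace sketch, not netback code: the struct and both function names are hypothetical, and it assumes every counted zerocopy send eventually completes as either a success or a fail.

```c
#include <stdbool.h>

/* Zerocopy TX counters as printed by `ethtool -S vifX.Y` (hypothetical struct). */
struct zerocopy_stats {
	unsigned long sent;
	unsigned long success;
	unsigned long fail;
};

/* Sends that have not completed either way yet (assumed to be still inflight). */
unsigned long zerocopy_outstanding(const struct zerocopy_stats *s)
{
	return s->sent - (s->success + s->fail);
}

/* Same shape as the stop condition on the slide:
 * the thread may stop only if a stop was requested AND nothing is inflight. */
bool dealloc_may_stop(bool stop_requested, unsigned long inflight)
{
	return stop_requested && inflight == 0;
}
```

With the counters shown on this slide (sent 72518, success 0, fail 72517), one send is still outstanding, so the stop condition stays false and kthread_stop() keeps waiting.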
xen-netback zerocopy
[Diagram: data in DomU (xen-netfront) is mapped into a Dom0 sk_buff (xen-netback) and forwarded to the NIC driver; xenwatch runs in Dom0]
1. Data is mapped from domU to dom0
2. Increment inflight packet count and forward to NIC
3. NIC driver does not release the grant mapping correctly!
4. xenwatch stalls due to the remaining inflight packet (grant not yet unmapped) when removing the xen-netback vif interface
domU creation failure: workaround?
Workaround mentioned at xen-devel:
https://lists.xenproject.org/archives/html/xen-devel/2016-06/msg00195.html
dom0# ifconfig ethX down
dom0# ifconfig ethX up
Reset DMA buffer and unmap inflight memory page from domU netfront
xenwatch stall extra case prerequisite
[Diagram: DomU stack (application, file system, device mapper, xen-blkfront) connected through an event channel, across the Xen Hypervisor, to the xvda-0 kthread in Dom0, which sits on top of a dom0 block device: loop block (on nfs, iscsi, glusterfs or more), iscsi, nvme]
Dom0 with xen-blkback:
1. Map data from blkfront
2. Encapsulate request as new bio
3. Submit bio to dom0 block device
xenwatch stall extra case 1
xenwatch:
[<0>] kthread_stop
[<0>] xen_blkif_disconnect
[<0>] xen_blkbk_remove
[<0>] xenbus_dev_remove
[<0>] __device_release_driver
[<0>] device_release_driver
[<0>] bus_remove_device
[<0>] device_del
[<0>] device_unregister
[<0>] frontend_changed
[<0>] xenbus_otherend_changed
[<0>] frontend_changed
[<0>] xenwatch_thread
[<0>] kthread
[<0>] ret_from_fork
3.xvda-0 hang and waiting for idle block mq tag:
[<0>] bt_get
[<0>] blk_mq_get_tag
[<0>] __blk_mq_alloc_request
[<0>] blk_mq_map_request
[<0>] blk_sq_make_request
[<0>] generic_make_request
[<0>] submit_bio
[<0>] dispatch_rw_block_io
[<0>] __do_block_io_op
[<0>] xen_blkif_schedule
[<0>] kthread
[<0>] ret_from_fork
Lack of free mq tag due to:
● loop device
● nfs
● iscsi
● ocfs2
● more block/fs/storage issue...
xenwatch stall extra case 2
[<0>] gnttab_unmap_refs_sync
[<0>] free_persistent_gnts
[<0>] xen_blkbk_free_caches
[<0>] xen_blkif_disconnect
[<0>] xen_blkbk_remove
[<0>] xenbus_dev_remove
[<0>] device_release_driver
[<0>] bus_remove_device
[<0>] device_unregister
[<0>] frontend_changed
[<0>] xenbus_otherend_changed
[<0>] frontend_changed
[<0>] xenwatch_thread
[<0>] kthread
[<0>] ret_from_fork
xenwatch
static void
__gnttab_unmap_refs_async(...)
{
	… ...
	for (pc = 0; pc < item->count; pc++) {
		if (page_count(item->pages[pc]) > 1) {
			// delay grant unmap operation
			… ...
		}
	}
	… ...
}
When disconnecting xen-blkback device, wait until all inflight persistent grant pages are reclaimed
page_count is invalid as the page is erroneously on-hold due to iscsi or storage driver
storage
xenwatch stall symptom
● ‘(null)’ domU in ‘xl list’
● xenwatch stall at xenstore update callback
● DomU creation/destroy failure
● Device hotplug failure
● Incomplete live migration on source dom0
● Reboot dom0 as only option (if workaround is not available)
More Impacts
The problem is much more severe...
NFV
DomU = application
More domU running concurrently
To quickly set up and tear down NF
Let’s give up xen!
Xen developers are fired!
xen paravirtual driver framework
[Diagram: DomainU Guest (Application, Networking Stack, xen-netfront driver) and Domain 0 Guest (Networking Stack, Bridging/Routing, xen-netback driver, Physical NIC Driver) run on the Xen Hypervisor above the Hardware (Physical NIC); xen-netfront and xen-netback are connected by Grant Table, Event Channel and Xenbus/Xenstore]
Paravirtual vs. PCI
                      PCI Driver              Xen Paravirtual Driver
device discovery      pci bus                 xenstore
device abstraction    pci_dev / pci_driver    xenbus_device / xenbus_driver
device configuration  pci bar/capability      xenstore
shared memory         N/A or IOMMU            grant table
notification          interrupt               event channel
device init and config
Motherboard (hardware with many slots): pci bus
● struct pci_dev
● struct pci_driver
plug into slots: Physical NIC, Physical Disk, Physical CPU, Physical DIMM
Xenstore (Dom0 software daemon and database for all guests): xenbus bus
● struct xenbus_device
● struct xenbus_driver
insert/update entries: Virtual NIC, Virtual Disk, Virtual CPU, Virtual Memory
dom0# xenstore-ls
local = ""
 domain = ""
  0 = ""
   name = "Domain-0"
   device-model = ""
    0 = ""
     state = "running"
   memory = ""
    target = "524288"
    static-max = "524288"
    freemem-slack = "1254331"
   libxl = ""
    disable_udev = "1"
vm = ""
libxl = ""
xenstore and xenwatch
● watch at xenstore node with callback
● callback triggered when xenstore node is updated
● both dom0/domU kernel and toolstack can watch/update xenstore
[Diagram: toolstack, Dom0 and DomU all connected to xenstore]
1. Dom0: watch at /local/domain/0/backend/
1. DomU: watch at /local/domain/7/device
2. Insert entries to xenstore:
● /local/domain/0/backend/<device>/<domid>/…
● /local/domain/7/device/<device>/<domid>/...
3. Notification to Dom0: create backend device
3. Notification to DomU: create frontend device
xenwatch with single thread
[Diagram: xenstore signals over the event channel; the xenbus kthread reads watch event details from the xenstore ring buffer, appends each event to the global list, and wakes up the xenwatch kthread, which processes the queued events one by one: frontend_changed(), handle_vcpu_hotplug_event(), backend_changed, … ...]
● xenbus_thread appends new watch event to the list
● xenwatch_thread processes watch event from the list
struct xenbus_watch be_watch = {
	.node = "backend",
	.callback = backend_changed
};
Xenwatch Multithreading Solution
To create a per-domU xenwatch kernel thread on dom0 for each domid
solution: challenges
● When to create/destroy per-domU xenwatch thread?
● How to calculate the domid given xenstore path?
● Split global locks into per-thread locks
xenwatch event path                    watched node
/local/domain/1/device/vif/0/state     /local/domain/1/device/vif/0/state
backend/vif/1/0/hotplug-status         backend/vif/1/0/hotplug-status
backend/vif/1/0/state                  backend
backend                                backend
solution: domU create/destroy 1/2
xl create vm.cfg:
dom0# xenstore-watch /
/
/local/domain/7
/local/domain
/vm/612c6d38-fd87-4bb3-a3f5-53c546e83674
/vm
/libxl/7
… …
@introduceDomain
/libxl/7/dm-version
/libxl/7/device/vbd/51712
/libxl/7/device/vbd
/libxl/7/device
/libxl/7/device/vbd/51712/frontend
/libxl/7/device/vbd/51712/backend
/local/domain/7/device/vbd/51712
… ...
xl destroy 7:
dom0# xenstore-watch /
/
/local/domain/0/device-model/7
/local/domain/7/device/vbd/51712
… ...
/local/domain/0/backend/vif/7/0/frontend-id
/local/domain/0/backend/vif/7/0/online
/local/domain/0/backend/vif/7/0/state
/local/domain/0/backend/vif/7/0/script
/local/domain/0/backend/vif/7/0/mac
… ...
/local/domain/0/backend/vkbd
/vm/612c6d38-fd87-4bb3-a3f5-53c546e83674
/local/domain/7
/libxl/7
@releaseDomain
solution: domU create/destroy 2/2
● creation: watch at “@introduceDomain”
● destroy: watch at “@releaseDomain”
● list “/local/domain” via XS_DIRECTORY
[Diagram: dom0 watches @introduceDomain; on a xenstore watch event, list /local/domain to identify which is created]
[Diagram: dom0 watches @releaseDomain; on a xenstore watch event, list /local/domain to identify which is removed]
Suggested by Juergen Gross
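Since @introduceDomain and @releaseDomain carry no domid, the created or removed domain has to be found by comparing two listings of /local/domain. A minimal userspace sketch of that comparison follows; the function names are hypothetical and the listings are assumed to be already parsed into arrays of domids.

```c
#include <stdbool.h>
#include <stddef.h>

/* True if id occurs in set[0..n). */
static bool contains(const unsigned *set, size_t n, unsigned id)
{
	for (size_t i = 0; i < n; i++)
		if (set[i] == id)
			return true;
	return false;
}

/* Store in out[] every domid present in a[] but missing from b[].
 * Returns the number of domids stored; out must have room for na entries. */
size_t domid_diff(const unsigned *a, size_t na,
		  const unsigned *b, size_t nb, unsigned *out)
{
	size_t n = 0;

	for (size_t i = 0; i < na; i++)
		if (!contains(b, nb, a[i]))
			out[n++] = a[i];
	return n;
}
```

On @introduceDomain, domid_diff(new, old) yields the domains that need a thread; on @releaseDomain, domid_diff(old, new) yields the domains whose thread can be torn down.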
solution: domid calculation
● Xenwatch subscriber should know the pattern of node path
● New callback for ‘struct xenbus_watch’: get_domid()
● Xenwatch subscriber should implement the callback
struct xenbus_watch
{
	struct list_head list;
	const char *node;
	void (*callback)(struct xenbus_watch *,
			 const char *path, const char *token);
	domid_t (*get_domid)(struct xenbus_watch *watch,
			     const char *path, const char *token);
};
be_watch callback:
/* path: backend/<pvdev>/<domid>/... */
static domid_t be_get_domid(struct xenbus_watch *watch,
			    const char *path,
			    const char *token)
{
	const char *p = path;
	if (char_count(path, '/') < 2)
		return 0;
	p = strchr(p, '/') + 1;
	p = strchr(p, '/') + 1;
	return path_to_domid(p);
}
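The slide does not show char_count() or path_to_domid(). The userspace sketch below fills them in with assumed behaviour (count a character; parse the leading decimal digits) so the path parsing can be tried outside the kernel. The helper bodies and the name be_get_domid_sketch are illustrative guesses, not the code from the patch.

```c
#include <stdlib.h>
#include <string.h>

typedef unsigned short domid_t;	/* Xen's domid_t is a 16-bit unsigned integer */

/* Assumed behaviour: number of occurrences of c in s. */
static unsigned int char_count(const char *s, char c)
{
	unsigned int n = 0;

	for (; *s; s++)
		if (*s == c)
			n++;
	return n;
}

/* Assumed behaviour: parse the leading decimal number of p as a domid,
 * returning 0 when there is no number or it does not fit in 16 bits. */
static domid_t path_to_domid(const char *p)
{
	char *end;
	unsigned long v = strtoul(p, &end, 10);

	if (end == p || v > 0xFFFF)
		return 0;
	return (domid_t)v;
}

/* path: backend/<pvdev>/<domid>/... */
domid_t be_get_domid_sketch(const char *path)
{
	const char *p = path;

	if (char_count(path, '/') < 2)
		return 0;
	p = strchr(p, '/') + 1;	/* skip "backend/" */
	p = strchr(p, '/') + 1;	/* skip "<pvdev>/" */
	return path_to_domid(p);
}
```

For example, "backend/vif/7/0/state" maps to domid 7, while the bare node "backend" has fewer than two slashes and maps to 0, which leaves the event with the default thread.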
Xenwatch Multithreading Framework
[Diagram: three event lists, each processed by its own kthread: the default xenwatch kthread, the domid=2 xenwatch kthread and the domid=3 xenwatch kthread]
1. use .get_domid() callback to calculate domid
2. run callback if domid==0
3. otherwise, submit the event to per-domU event list
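The three routing steps can be sketched in userspace as follows. Everything here is hypothetical (fixed-size arrays instead of kernel lists, the names lists, demo_get_domid and dispatch_event), and for simplicity a domid of 0 is queued on list 0 for the default thread rather than having its callback run in place.

```c
#include <stddef.h>
#include <stdio.h>

typedef unsigned short domid_t;

#define NR_LISTS 8	/* hypothetical: list 0 = default thread, list d = domid d */
#define LIST_CAP 16

struct event_list {
	const char *paths[LIST_CAP];
	size_t len;
};

struct watch {
	/* NULL when the watch is not tied to a single domU */
	domid_t (*get_domid)(const char *path);
};

struct event_list lists[NR_LISTS];

/* Demo get_domid for paths shaped like backend/<pvdev>/<domid>/... */
domid_t demo_get_domid(const char *path)
{
	domid_t domid = 0;

	if (sscanf(path, "backend/%*[^/]/%hu", &domid) != 1)
		return 0;
	return domid;
}

/* Route one event; returns the index of the list it was queued on,
 * or -1 if that list is full. */
int dispatch_event(const struct watch *w, const char *path)
{
	/* step 1: calculate the domid */
	domid_t domid = w->get_domid ? w->get_domid(path) : 0;
	/* steps 2 and 3: domid 0 (or an unknown domid) stays with the default list */
	size_t idx = domid < NR_LISTS ? domid : 0;
	struct event_list *l = &lists[idx];

	if (l->len == LIST_CAP)
		return -1;
	l->paths[l->len++] = path;
	return (int)idx;
}
```

Because each list has its own consumer, an event that blocks while tearing down domid 3 only delays later events on list 3.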
xenbus_watch unregistration optimization
[Diagram: pending event lists of the domid=1, domid=9 and domid=11 xenwatch threads and of the default xenwatch thread]
● By default, traverse ALL lists to remove pending xenwatch events
● .get_owner() is implemented if xenwatch is for a specific domU
● Only traverse a single list for per-domU xenwatch
struct xenbus_watch
{
	struct list_head list;
	const char *node;
	void (*callback)(struct xenbus_watch *,
			 const char *path, const char *token);
	domid_t (*get_domid)(struct xenbus_watch *watch,
			     const char *path, const char *token);
	domid_t (*get_owner)(struct xenbus_watch *watch);
};
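The saving from .get_owner() can be illustrated with a small userspace model. This is a sketch under assumptions: arrays stand in for the kernel lists, and the names (event_list, purge_list, unregister_watch, demo_get_owner, owner_hint) are made up for the example.

```c
#include <stddef.h>

typedef unsigned short domid_t;

#define NR_LISTS 4	/* hypothetical: list 0 = default, list d = domid d */
#define LIST_CAP 8

struct watch {
	domid_t owner_hint;	/* read by demo_get_owner() below */
	domid_t (*get_owner)(const struct watch *w);	/* NULL: no single owner */
};

struct event_list {
	const struct watch *events[LIST_CAP];	/* pending events, keyed by watch */
	size_t len;
};

domid_t demo_get_owner(const struct watch *w)
{
	return w->owner_hint;
}

/* Drop every pending event of w from one list; returns how many were dropped. */
static size_t purge_list(struct event_list *l, const struct watch *w)
{
	size_t kept = 0, removed = 0;

	for (size_t i = 0; i < l->len; i++) {
		if (l->events[i] == w)
			removed++;
		else
			l->events[kept++] = l->events[i];
	}
	l->len = kept;
	return removed;
}

/* Returns the number of lists traversed; *removed receives the dropped events. */
size_t unregister_watch(struct event_list lists[NR_LISTS],
			const struct watch *w, size_t *removed)
{
	*removed = 0;
	if (w->get_owner && w->get_owner(w) < NR_LISTS) {
		/* per-domU watch: only its owner's list can hold its events */
		*removed = purge_list(&lists[w->get_owner(w)], w);
		return 1;
	}
	for (size_t d = 0; d < NR_LISTS; d++)
		*removed += purge_list(&lists[d], w);
	return NR_LISTS;
}
```

A watch with an owner costs one list traversal on unregistration, while a watch without one still costs a traversal of every list.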
Switch to xenwatch multithreading
Step 1: implement .get_domid()
Step 2: implement .get_owner() for per-domU xenbus_watch
 // e.g., /local/domain/1/device/vbd/51712/state
 static int watch_otherend(struct xenbus_device *dev)
 {
 	struct xen_bus_type *bus =
 		container_of(dev->dev.bus, struct xen_bus_type, bus);

+	dev->otherend_watch.get_domid = otherend_get_domid;
+	dev->otherend_watch.get_owner = otherend_get_owner;
+
 	return xenbus_watch_pathfmt(dev, &dev->otherend_watch,
 				    bus->otherend_changed,
 				    "%s/%s", dev->otherend, "state");
 }

+static domid_t otherend_get_domid(struct xenbus_watch *watch,
+				  const char *path,
+				  const char *token)
+{
+	struct xenbus_device *xendev =
+		container_of(watch, struct xenbus_device, otherend_watch);
+
+	return xendev->otherend_id;
+}
+
+
+static domid_t otherend_get_owner(struct xenbus_watch *watch)
+{
+	struct xenbus_device *xendev =
+		container_of(watch, struct xenbus_device, otherend_watch);
+
+	return xendev->otherend_id;
+}
Test Setup
● Patch for implementation:
  ● http://donglizhang.org/xenwatch-multithreading.patch
● Patch to reproduce:
  ● http://donglizhang.org/xenwatch-stall-vif.patch
  ● Intercept sk_buff (with fragments) sent out from vifX.Y
  ● Control when intercepted sk_buff is reclaimed
Test Result
dom0# xl list
Name         ID   Mem VCPUs   State    Time(s)
Domain-0      0   799     4   r-----      50.2
(null)        2     0     2   --p--d      29.9
1) sk_buff from vifX.Y is intercepted by xenwatch-stall-vif.patch
2) [xen-mtwatch-2] is stalled during VM shutdown
3) [xen-mtwatch-2] goes back to normal once sk_buff is released
dom0# ps -x | egrep "mtwatch|xen-xenwatch"
 PID TTY      STAT   TIME COMMAND
  39 ?        S      0:00 [xenwatch]
2196 ?        D      0:00 [xen-mtwatch-2]
dom0# cat /proc/2196/stack
[<0>] kthread_stop
[<0>] xenvif_disconnect_data
[<0>] set_backend_state
[<0>] frontend_changed
[<0>] xenwatch_thread
[<0>] kthread
[<0>] ret_from_fork
[<0>] 0xffffffff
Current Status
● Total LOC: ~600
● Feature can be enabled only on dom0
● Xenwatch Multithreading is enabled only when:
  ● xen_mtwatch kernel param
  ● xen_initial_domain()
● Feedback for [Patch RFC] from xen-devel
Future work
● Extend XS_DIRECTORY to XS_DIRECTORY_PART
● To list 1000+ domU from xenstore
● Port d4016288ab from Xen to Linux
● Watch at parent node only (excluding descendants)
● Only parent node’s update is notified
● Watch at “/local/domain” for thread create/destroy
Take-Home Message
● There is a limitation in single-threaded xenwatch
● It is imperative to address this limitation
● Xenwatch Multithreading can solve the problem
● Only OS kernel is modified with ~600 LOC
● Easy to apply to existing xenbus_watch
Question?

XPDDS18: Xenwatch Multithreading - Dongli Zhang, Oracle

  • 1. Copyright © 2018, Oracle and/or its affiliates. All rights reserved. Xenwatch Multithreading Dongli Zhang Principal Member of Technical Staf Oracle Linux http://donglizhang.org
  • 2. Copyright © 2018, Oracle and/or its affiliates. All rights reserved. domU creation failure: problem Reported by: https://lists.xenproject.org/archives/html/xen-devel/2016-06/msg00195.html # xl create hvm.cfg Parsing config from hvm.cfg libxl: error: libxl_device.c:1080:device_backend_callback: Domain 2:unable to add device with path /local/domain/0/backend/vbd/2/51712 libxl: error: libxl_create.c:1290:domcreate_launch_dm: Domain 2:unable to add disk devices libxl: error: libxl_device.c:1080:device_backend_callback: Domain 2:unable to remove device with path /local/domain/0/backend/vbd/2/51712 libxl: error: libxl_domain.c:1097:devices_destroy_cb: Domain 2:libxl__devices_destroy failed libxl: error: libxl_domain.c:1000:libxl__destroy_domid: Domain 2:Non-existant domain libxl: error: libxl_domain.c:959:domain_destroy_callback: Domain 2:Unable to destroy guest libxl: error: libxl_domain.c:886:domain_destroy_cb: Domain 2:Destruction of domain failed Reproduced by: http://donglizhang.org/xenwatch-stall-vif.patch
  • 3. Copyright © 2018, Oracle and/or its affiliates. All rights reserved. domU creation failure: observation dom0# xl list Name ID Mem VCPUs State Time(s) Domain-0 0 799 4 r----- 50.2 (null) 2 0 2 --p--d 24.8 ● incomplete prior domU destroy ● stalled xenwatch thread in ‘D’ state ● xenwatch hangs at kthread_stop() dom0# ps 38 PID TTY STAT TIME COMMAND 38 ? D 0:00 [xenwatch] dom0# cat /proc/38/stack [<0>] kthread_stop [<0>] xenvif_disconnect_data [<0>] set_backend_state [<0>] frontend_changed [<0>] xenwatch_thread [<0>] kthread [<0>] ret_from_fork [<0>] 0xffffffff
  • 4. Copyright © 2018, Oracle and/or its affiliates. All rights reserved. domU creation failure: cause # ethtool -S vif1.0 NIC statistics: rx_gso_checksum_fixup: 0 tx_zerocopy_sent: 72518 tx_zerocopy_success: 0 tx_zerocopy_fail: 72517 tx_frag_overflow: 0 static bool xenvif_dealloc_kthread_should_stop(struct xenvif_queue *queue) { /* Dealloc thread must remain running until all inflight * packets complete. */ return kthread_should_stop() && !atomic_read(&queue->inflight_packets); } ● vif1.0-q0-dealloc thread cannot stop ● remaining inflight packets on netback vif ● vif1.0 statistics: sent > success + fail ● sk_buf on hold by other kernel components!
  • 5. Copyright © 2018, Oracle and/or its affiliates. All rights reserved. xen-netback zerocopy DomU Dom0 data sk_buf Data mapped from DomU xen-netfront xen-netback NIC driver xenwatch 1.mapped from domU to dom0 2. increment infligh packet and forward to NIC
  • 6. Copyright © 2018, Oracle and/or its affiliates. All rights reserved. xen-netback zerocopy DomU Dom0 data sk_buf Data mapped from DomU xen-netfront xen-netback NIC driver xenwatch 1.mapped from domU to dom0 2. increment infligh packet and forward to NIC 3. NIC driver does not release the grant mapping correctly! 4. xenwatch stall due to remaining inflight packet (unmapped grant) when removing xen-netback vif interface
  • 7. Copyright © 2018, Oracle and/or its affiliates. All rights reserved. domU creation failure: workaround? Workaround mentioned at xen-devel: https://lists.xenproject.org/archives/html/xen-devel/2016-06/msg00195.html dom0# ifconfig ethX down dom0# ifconfig ethX up Reset DMA bufer and unmap inflight memory page from domU netfront
  • 8. Copyright © 2018, Oracle and/or its affiliates. All rights reserved. xenwatch stall extra case prerequisite application file system device mapper xen-blkfront xvda-0 kthread Xen Hypervisor DomU loop block (on nfs, iscsi glusterfs or more) iscsi nvmeDom0 with xen-blkback 1. Map data from blkfront 2. Encapsulate request as new bio 3. Submit bio to dom0 block device event channel
  • 9. Copyright © 2018, Oracle and/or its affiliates. All rights reserved. xenwatch stall extra case 1 [<0>] kthread_stop [<0>] xen_blkif_disconnect [<0>] xen_blkbk_remove [<0>] xenbus_dev_remove [<0>] __device_release_driver [<0>] device_release_driver [<0>] bus_remove_device [<0>] device_del [<0>] device_unregister [<0>] frontend_changed [<0>] xenbus_otherend_changed [<0>] frontend_changed [<0>] xenwatch_thread [<0>] kthread [<0>] ret_from_fork [<0>] bt_get [<0>] blk_mq_get_tag [<0>] __blk_mq_alloc_request [<0>] blk_mq_map_request [<0>] blk_sq_make_request [<0>] generic_make_request [<0>] submit_bio [<0>] dispatch_rw_block_io [<0>] __do_block_io_op [<0>] xen_blkif_schedule [<0>] kthread [<0>] ret_from_fork xenwatch 3.xvda-0 hang and waiting for idle block mq tag Lack of free mq tag due to: ● loop device ● nfs ● iscsi ● ocfs2 ● more block/fs/storage issue...
  • 10. xenwatch stall extra case 2 [<0>] gnttab_unmap_refs_sync [<0>] free_persistent_gnts [<0>] xen_blkbk_free_caches [<0>] xen_blkif_disconnect [<0>] xen_blkbk_remove [<0>] xenbus_dev_remove [<0>] device_release_driver [<0>] bus_remove_device [<0>] device_unregister [<0>] frontend_changed [<0>] xenbus_otherend_changed [<0>] frontend_changed [<0>] xenwatch_thread [<0>] kthread [<0>] ret_from_fork xenwatch static void __gnttab_unmap_refs_async(...) { … ... for (pc = 0; pc < item->count; pc++) { if (page_count(item->pages[pc]) > 1) { // delay grant unmap operation … ... } } … ... } When disconnecting xen-blkback device, wait until all inflight persistent grant pages are reclaimed page_count is invalid as the page is erroneously on-hold due to iscsi or storage driver storage Copyright © 2018, Oracle and/or its affiliates. All rights reserved.
  • 11. xenwatch stall symptom
      ● ‘(null)’ domU in ‘xl list’
      ● xenwatch stalls at a xenstore update callback
      ● DomU creation/destroy failure
      ● Device hotplug failure
      ● Incomplete live migration on source dom0
      ● Rebooting dom0 is the only option (if no workaround is available)
  • 12. More Impacts
      The problem is much more severe...
      NFV: DomU = application
      More domU running concurrently
      To quickly set up and tear down NF
      Let’s give up xen! Xen developers are fired!
  • 13. xen paravirtual driver framework
      [Diagram: DomainU Guest (Application, Networking Stack, xen-netfront driver) and Domain 0 Guest (Networking Stack, Bridging/Routing, xen-netback driver, Physical NIC Driver) run on the Xen Hypervisor above the Hardware (Physical NIC); frontend and backend are connected via Grant Table, Event Channel and Xenbus/Xenstore]
  • 14. Paravirtual vs. PCI
                               PCI Driver               Xen Paravirtual Driver
      device discovery         pci bus                  xenstore
      device abstraction       pci_dev / pci_driver     xenbus_device / xenbus_driver
      device configuration     pci bar/capability       xenstore
      shared memory            N/A or IOMMU             grant table
      notification             interrupt                event channel
  • 15. device init and config
      Motherboard (hardware with many slots)
        pci bus
        ● struct pci_dev
        ● struct pci_driver
        plug into slots: Physical NIC, Physical Disk, Physical CPU, Physical DIMM
      Xenstore (Dom0 software daemon and database for all guests)
        xenbus bus
        ● struct xenbus_device
        ● struct xenbus_driver
        insert/update entries: Virtual NIC, Virtual Disk, Virtual CPU, Virtual Memory
      dom0# xenstore-ls
      local = ""
       domain = ""
        0 = ""
         name = "Domain-0"
         device-model = ""
          0 = ""
           state = "running"
         memory = ""
          target = "524288"
          static-max = "524288"
          freemem-slack = "1254331"
         libxl = ""
          disable_udev = "1"
      vm = ""
      libxl = ""
  • 16. xenstore and xenwatch
      ● watch at xenstore node with callback
      ● callback triggered when xenstore node is updated
      ● both dom0/domU kernel and toolstack can watch/update xenstore
      [Diagram: toolstack, Dom0 and DomU around xenstore]
      1. Dom0: watch at /local/domain/0/backend/
         DomU: watch at /local/domain/7/device
      2. toolstack: insert entries to xenstore:
         ● /local/domain/0/backend/<device>/<domid>/…
         ● /local/domain/7/device/<device>/<domid>/...
      3. Notification to Dom0: create backend device
         Notification to DomU: create frontend device
  • 17. xenwatch with single thread
      [Diagram: xenstore signals the event channel, which wakes up the xenbus kthread; the xenbus kthread reads watch event details from the xenstore ring buffer and appends each event to the global list; the xenwatch kthread processes the list: event → frontend_changed(), event → handle_vcpu_hotplug_event(), event → backend_changed, … ...]
      ● xenbus_thread appends new watch event to the list
      ● xenwatch_thread processes watch event from the list
      struct xenbus_watch be_watch = {
          .node = "backend",
          .callback = backend_changed
      };
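The head-of-line blocking implied by this single-list design can be illustrated with a small userspace sketch. It is not kernel code: `struct event`, `struct event_list`, `event_list_append()` and `xenwatch_process()` are hypothetical names, and a stalled callback is only simulated with a flag.

```c
#include <stddef.h>

#define MAX_EVENTS 16

/* Hypothetical stand-in for one pending xenwatch event. */
struct event {
    int domid;   /* domU the event belongs to */
    int stalls;  /* non-zero: the callback would block, e.g. in kthread_stop() */
};

/* The single global list shared by all domains. */
struct event_list {
    struct event ev[MAX_EVENTS];
    size_t head;  /* next event to process */
    size_t tail;  /* next free slot */
};

/* Producer side (the role of xenbus_thread): append to the global list. */
static int event_list_append(struct event_list *l, struct event e)
{
    if (l->tail == MAX_EVENTS)
        return -1;
    l->ev[l->tail++] = e;
    return 0;
}

/*
 * Consumer side (the role of xenwatch_thread): handle events in order.
 * The single consumer cannot get past an event whose callback stalls,
 * so every later event waits, whichever domU it belongs to.
 * Returns the number of events handled.
 */
static size_t xenwatch_process(struct event_list *l)
{
    size_t done = 0;

    while (l->head < l->tail) {
        if (l->ev[l->head].stalls)
            break;
        l->head++;
        done++;
    }
    return done;
}
```

With events for domid 1, 2 (stalling) and 3 queued in that order, only the first is handled; the event for domid 3 stays pending behind the stalled one.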
  • 18. Xenwatch Multithreading Solution
      To create a per-domU xenwatch kernel thread on dom0 for each domid
  • 19. solution: challenges
      ● When to create/destroy per-domU xenwatch thread?
      ● How to calculate the domid given xenstore path?
      ● Split global locks into per-thread locks
      xenwatch event path                     watched node
      /local/domain/1/device/vif/0/state      /local/domain/1/device/vif/0/state
      backend/vif/1/0/hotplug-status          backend/vif/1/0/hotplug-status
      backend/vif/1/0/state                   backend
      backend                                 backend
  • 20. solution: domU create/destroy 1/2
      xl create vm.cfg:
        dom0# xenstore-watch /
        /
        /local/domain/7
        /local/domain
        /vm/612c6d38-fd87-4bb3-a3f5-53c546e83674
        /vm
        /libxl/7
        … …
        @introduceDomain
        /libxl/7/dm-version
        /libxl/7/device/vbd/51712
        /libxl/7/device/vbd
        /libxl/7/device
        /libxl/7/device/vbd/51712/frontend
        /libxl/7/device/vbd/51712/backend
        /local/domain/7/device/vbd/51712
        … ...
      xl destroy 7:
        dom0# xenstore-watch /
        /
        /local/domain/0/device-model/7
        /local/domain/7/device/vbd/51712
        … ...
        /local/domain/0/backend/vif/7/0/frontend-id
        /local/domain/0/backend/vif/7/0/online
        /local/domain/0/backend/vif/7/0/state
        /local/domain/0/backend/vif/7/0/script
        /local/domain/0/backend/vif/7/0/mac
        … ...
        /local/domain/0/backend/vkbd
        /vm/612c6d38-fd87-4bb3-a3f5-53c546e83674
        /local/domain/7
        /libxl/7
        @releaseDomain
  • 21. solution: domU create/destroy 2/2
      ● creation: watch at “@introduceDomain”
      ● destroy: watch at “@releaseDomain”
      ● list “/local/domain” via XS_DIRECTORY
      [Diagram: dom0 watches at @introduceDomain; on the xenstore watch event, list /local/domain to identify which domain is created. dom0 watches at @releaseDomain; on the xenstore watch event, list /local/domain to identify which domain is removed.]
      Suggested by Juergen Gross
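Since "@introduceDomain" and "@releaseDomain" events do not say which domain changed, the listing of "/local/domain" has to be compared against the set of domids that already have a thread. A minimal userspace sketch of that comparison follows; `contains()` and `domid_diff()` are hypothetical helpers, and the listing is assumed to be already parsed into an array of domids.

```c
#include <stdbool.h>
#include <stddef.h>

typedef unsigned short domid_t; /* Xen's domid_t is a 16-bit unsigned integer */

/* Linear membership test; adequate for a sketch. */
static bool contains(const domid_t *set, size_t n, domid_t id)
{
    for (size_t i = 0; i < n; i++)
        if (set[i] == id)
            return true;
    return false;
}

/*
 * Store in out[] every domid that is in a[] but not in b[] and return how
 * many were stored. out[] must have room for na entries.
 *
 * On @introduceDomain: created = domid_diff(listed, known)
 * On @releaseDomain:   removed = domid_diff(known, listed)
 */
static size_t domid_diff(const domid_t *a, size_t na,
                         const domid_t *b, size_t nb, domid_t *out)
{
    size_t n = 0;

    for (size_t i = 0; i < na; i++)
        if (!contains(b, nb, a[i]))
            out[n++] = a[i];
    return n;
}
```

A per-domU thread would then be created for each domid in the first difference and stopped for each domid in the second.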
  • 22. solution: domid calculation
      ● Xenwatch subscriber should know the pattern of node path
      ● New callback for ‘struct xenbus_watch’: get_domid()
      ● Xenwatch subscriber should implement the callback
      struct xenbus_watch {
          struct list_head list;
          const char *node;
          void (*callback)(struct xenbus_watch *,
                           const char *path, const char *token);
          domid_t (*get_domid)(struct xenbus_watch *watch,
                               const char *path, const char *token);
      };
      be_watch callback:
      /* path: backend/<pvdev>/<domid>/... */
      static domid_t be_get_domid(struct xenbus_watch *watch,
                                  const char *path, const char *token)
      {
          const char *p = path;

          if (char_count(path, '/') < 2)
              return 0;

          p = strchr(p, '/') + 1;
          p = strchr(p, '/') + 1;

          return path_to_domid(p);
      }
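The slide does not show `char_count()` or `path_to_domid()`. The following userspace sketch fills them in with assumed implementations (count a character, parse a leading decimal number) so the path parsing can be tried outside the kernel; the `watch` and `token` parameters are dropped because the parsing does not use them.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

typedef unsigned short domid_t; /* Xen's domid_t is a 16-bit unsigned integer */

/* Assumed helper: number of occurrences of c in s. */
static int char_count(const char *s, char c)
{
    int n = 0;

    for (; *s; s++)
        if (*s == c)
            n++;
    return n;
}

/* Assumed helper: parse the decimal domid at the start of p, 0 on failure. */
static domid_t path_to_domid(const char *p)
{
    unsigned long v;

    if (!isdigit((unsigned char)*p))
        return 0;
    v = strtoul(p, NULL, 10);
    return v > 0xFFFF ? 0 : (domid_t)v;
}

/* path: backend/<pvdev>/<domid>/... */
static domid_t be_get_domid(const char *path)
{
    const char *p = path;

    /* Fewer than two '/' means the path carries no domid component. */
    if (char_count(path, '/') < 2)
        return 0;

    p = strchr(p, '/') + 1; /* skip "backend/" */
    p = strchr(p, '/') + 1; /* skip "<pvdev>/" */

    return path_to_domid(p);
}
```

For example, `be_get_domid("backend/vif/1/0/state")` yields 1, while `be_get_domid("backend")` yields 0, which routes the event to the default thread.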
  • 23. Xenwatch Multithreading Framework
      [Diagram: each incoming event is placed on one of several event lists, and each list is processed by its own kthread: the default xenwatch kthread, the domid=2 xenwatch kthread and the domid=3 xenwatch kthread]
      1. use .get_domid() callback to calculate domid
      2. run callback if domid==0
      3. otherwise, submit the event to per-domU event list
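The three dispatch steps can be sketched in userspace as follows. All names here (`struct watch_sketch`, `struct mtwatch_sketch`, `fe_get_domid()`, `mtwatch_dispatch()`) are hypothetical, the lists are reduced to counters, and falling back to the default thread when no `get_domid` is set or the domid is out of range is an assumption of this sketch, not something the slides state.

```c
#include <stdio.h>

typedef unsigned short domid_t; /* Xen's domid_t is a 16-bit unsigned integer */

#define NR_LISTS 8 /* index 0 unused; 1..7 are per-domU lists in this sketch */

/* Reduced watch: only the domid calculation callback. */
struct watch_sketch {
    domid_t (*get_domid)(const char *path);
};

/* Event counters standing in for the default and per-domU event lists. */
struct mtwatch_sketch {
    unsigned int default_events;
    unsigned int domu_events[NR_LISTS];
};

/* Example callback for paths of the form /local/domain/<domid>/... */
static domid_t fe_get_domid(const char *path)
{
    domid_t id = 0;

    if (sscanf(path, "/local/domain/%hu", &id) != 1)
        return 0;
    return id;
}

static void mtwatch_dispatch(struct mtwatch_sketch *m,
                             const struct watch_sketch *w, const char *path)
{
    /* Step 1: calculate the domid (assumed 0 when no callback is set). */
    domid_t id = w->get_domid ? w->get_domid(path) : 0;

    if (id == 0 || id >= NR_LISTS)
        m->default_events++;  /* Step 2: handled by the default thread */
    else
        m->domu_events[id]++; /* Step 3: queued for the per-domU thread */
}
```

Because each per-domU list has its own consumer, a stalled callback for one domU only delays events on that domU's list.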
  • 24. xenbus_watch unregistration optimization
      [Diagram: pending events on four lists: default xenwatch, domid=1 xenwatch, domid=9 xenwatch and domid=11 xenwatch]
      ● By default, traverse ALL lists to remove pending xenwatch events
      ● .get_owner() is implemented if xenwatch is for a specific domU
      ● Only traverse a single list for per-domU xenwatch
      struct xenbus_watch {
          struct list_head list;
          const char *node;
          void (*callback)(struct xenbus_watch *,
                           const char *path, const char *token);
          domid_t (*get_domid)(struct xenbus_watch *watch,
                               const char *path, const char *token);
          domid_t (*get_owner)(struct xenbus_watch *watch);
      };
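A userspace sketch of the optimization: when a watch reports an owner, only that domU's list is purged; otherwise every list is. The types and functions below (`struct watch_sketch`, `struct pending_list`, `purge_list()`, `unregister_watch()`) are hypothetical, with fixed-size arrays standing in for the kernel's linked lists and list 0 standing in for the default list.

```c
#include <stddef.h>

typedef unsigned short domid_t; /* Xen's domid_t is a 16-bit unsigned integer */

#define NR_LISTS 4 /* list 0 is the default list, 1..3 are per-domU lists */
#define LIST_CAP 8

struct watch_sketch {
    /* NULL unless the watch belongs to one specific domU. */
    domid_t (*get_owner)(const struct watch_sketch *w);
    domid_t owner;
};

/* Pending events, each recorded as a pointer to the watch it is for. */
struct pending_list {
    const struct watch_sketch *ev[LIST_CAP];
    size_t n;
};

/* Example .get_owner() implementation. */
static domid_t owner_of(const struct watch_sketch *w)
{
    return w->owner;
}

/* Drop every pending event that refers to w, keeping the others in order. */
static void purge_list(struct pending_list *l, const struct watch_sketch *w)
{
    size_t kept = 0;

    for (size_t i = 0; i < l->n; i++)
        if (l->ev[i] != w)
            l->ev[kept++] = l->ev[i];
    l->n = kept;
}

/* Remove pending events for w; returns how many lists were traversed. */
static size_t unregister_watch(struct pending_list lists[NR_LISTS],
                               const struct watch_sketch *w)
{
    if (w->get_owner) {
        domid_t id = w->get_owner(w);

        if (id < NR_LISTS) {
            purge_list(&lists[id], w); /* per-domU watch: one list only */
            return 1;
        }
    }
    for (size_t i = 0; i < NR_LISTS; i++) /* default: traverse ALL lists */
        purge_list(&lists[i], w);
    return NR_LISTS;
}
```

The saving grows with the number of running domU, since the default path has to visit one list per thread.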
  • 25. Switch to xenwatch multithreading
      Step 1: implement .get_domid()
      Step 2: implement .get_owner() for per-domU xenbus_watch
      +static domid_t otherend_get_domid(struct xenbus_watch *watch,
      +                                  const char *path,
      +                                  const char *token)
      +{
      +    struct xenbus_device *xendev =
      +        container_of(watch, struct xenbus_device, otherend_watch);
      +
      +    return xendev->otherend_id;
      +}
      +
      +
      +static domid_t otherend_get_owner(struct xenbus_watch *watch)
      +{
      +    struct xenbus_device *xendev =
      +        container_of(watch, struct xenbus_device, otherend_watch);
      +
      +    return xendev->otherend_id;
      +}
       // e.g., /local/domain/1/device/vbd/51712/state
       static int watch_otherend(struct xenbus_device *dev)
       {
           struct xen_bus_type *bus =
               container_of(dev->dev.bus, struct xen_bus_type, bus);

      +    dev->otherend_watch.get_domid = otherend_get_domid;
      +    dev->otherend_watch.get_owner = otherend_get_owner;
      +
           return xenbus_watch_pathfmt(dev, &dev->otherend_watch,
                                       bus->otherend_changed,
                                       "%s/%s", dev->otherend, "state");
       }
  • 26. Test Setup
      ● Patch for implementation:
          ● http://donglizhang.org/xenwatch-multithreading.patch
      ● Patch to reproduce:
          ● http://donglizhang.org/xenwatch-stall-vif.patch
          ● Intercept sk_buff (with fragments) sent out from vifX.Y
          ● Control when intercepted sk_buff is reclaimed
  • 27. Test Result
      dom0# xl list
      Name          ID   Mem  VCPUs  State    Time(s)
      Domain-0       0   799      4  r-----      50.2
      (null)         2     0      2  --p--d      29.9
      1) sk_buff from vifX.Y is intercepted by xenwatch-stall-vif.patch
      2) [xen-mtwatch-2] is stalled during VM shutdown
      3) [xen-mtwatch-2] goes back to normal once sk_buff is released
      dom0# ps -x | egrep "mtwatch|xen-xenwatch"
        PID TTY      STAT   TIME COMMAND
         39 ?        S      0:00 [xenwatch]
       2196 ?        D      0:00 [xen-mtwatch-2]
      dom0# cat /proc/2196/stack
      [<0>] kthread_stop
      [<0>] xenvif_disconnect_data
      [<0>] set_backend_state
      [<0>] frontend_changed
      [<0>] xenwatch_thread
      [<0>] kthread
      [<0>] ret_from_fork
      [<0>] 0xffffffff
  • 28. Current Status
      ● Total LOC: ~600
      ● Feature can be enabled only on dom0
      ● Xenwatch Multithreading is enabled only when:
          ● xen_mtwatch kernel param is set
          ● xen_initial_domain() is true
      ● Feedback for [Patch RFC] from xen-devel
  • 29. Future work
      ● Extend XS_DIRECTORY to XS_DIRECTORY_PART
          ● To list 1000+ domU from xenstore
          ● Port d4016288ab from Xen to Linux
      ● Watch at parent node only (excluding descendants)
          ● Only parent node’s update is notified
          ● Watch at “/local/domain” for thread create/destroy
  • 30. Take-Home Message
      ● Single-threaded xenwatch has a limitation
      ● It is imperative to address this limitation
      ● Xenwatch Multithreading can solve the problem
      ● Only the OS kernel is modified, with ~600 LOC
      ● Easy to apply to existing xenbus_watch
      Questions?