Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Why use Xen for large scale Enterprise
Deployments?
Konrad Rzeszutek Wilk
Software Developer Manager
Safe Harbor Statement
The following is intended to outline our general product direction. It is intended for
information purposes only, and may not be incorporated into any contract. It is not a
commitment to deliver any material, code, or functionality, and should not be relied upon
in making purchasing decisions. The development, release, and timing of any features or
functionality described for Oracle’s products remains at the sole discretion of Oracle.
• A bit of history
• Where does the code come from?
• Distributions and kernels
• Features
• The end result
Unbreakable Enterprise Kernel and Oracle Linux purpose:
• Red Hat and Oracle split:
– Oracle supports a distribution based on RHEL, but with our own kernel: the
Unbreakable Enterprise Kernel (UEK).
We want better performance for customers. The kernel is updated more often,
with features and benefits that take advantage of Oracle products.
– As such, an Oracle Linux distribution is offered along with UEK kernels.
The UEK kernel is also used in other products, such as OVM.
Oracle’s virtualization product (OVM):
We use Xen as the hypervisor. For the kernel we use UEK; in the past (OVM 2) we
had a SLES-based kernel.
• OVM 2 (Xen 3.4)
– Linux 2.6.32 based on SLES Xen Patches (classic)
While the newer ones are based on paravirt (pvops):
• OVM 3 (Xen 4.1)
– UEK2 kernel (2.6.39)
• OVM 3.3 (Xen 4.3)
– UEK3 kernel (3.8.13)
Kernels (UEK: 2.6.39, 3.8).
• Oracle’s approach is:
– Available to anybody (https://oss.oracle.com/git/).
– Make features available for everybody.
• The best way is to have the code upstream so every distribution can have it.
• The end goal is for applications to run as well as they can.
• A large set of patches (a big divergence from upstream) inhibits this, as the
patches carry a lot of complexity. The classic Xen patches are an example of this.
Developers' approach to patches:
• We forget what we did after 6 months (more or less).
• Want the code in one place (one repository).
• Want to develop new features against code to make it better and faster.
Don’t want to retouch the old code over and over.
• Want to fix new bugs in new shiny code.
• Big patches are scary.
Quality Assurance approach to patches:
• Want to find the bug and have it fixed.
– Don't want bugs to reappear later in a new version of the kernel (aka regressions).
• Want to catch new bugs, expose new scenarios, not find old bugs.
• Ideal situation:
– new hardware = new bugs
– not new hardware = old bugs.
Unbreakable Enterprise Kernel origin
Linux kernel:       2.6.32 (…) 3.0 (…) 3.8 (…) 3.11 (…) 3.15
Linux stable tree:  2.6.32 LT  3.0 LT  3.8 LT
Unbreakable Linux:  UEK1       UEK2    UEK3
New functionality is backported from upstream (Linus's tree).
Long-term kernels are where the community puts the fixes and features
deemed necessary by maintainers. The version number gives only an idea of
origin: for example, 2.6.39 was really 3.0, but some of the code is from 3.11.
The process to make this work:
• Patches MUST go upstream (Linus’s tree).
• New functionality developed against upstream kernel.
• Bug fixes also developed against upstream kernel (where applicable, as
some code has been reworked).
• In some instances, where it does not make sense for patches to go upstream,
we keep them in our tree.
• The problem we had with OVM 2 was that it had a huge patchset of Xen
code that was not in any way easy to review.
Upstreaming Xen in Linus’s tree
We started by slowly integrating piece after piece, each one on top of the
last.
• Linux 3.0 had the initial domain support (but no backend drivers).
• Later versions gained different backend drivers (block, network, etc.).
• For Xen (the hypervisor) we did not have a huge patch set, so this was much easier.
• What we ended up doing was:
Linus tree → UEK tree → OVM and Oracle Linux
Xen upstream → OVM
The “problem” with Linus’s tree and Xen tree:
• High quality of code.
– Code has to go through numerous reviews before it is accepted. It takes time.
• The end result is:
– High quality and beautiful code.
– Performance driven (no maintainer wants code that slows things down).
– Improves the existing code.
• A fantastic side effect is that other distributions and users gain these
features right out of the box (such as Fedora Core, Debian, Red Hat, etc.).
Linux features that we are developing:
• Data safety
– DIF and DIX (Data Integrity), hardening ext4 and XFS against fuzzing attacks and
corrupt filesystems.
– DIRECT_IO: bypass caches so that data goes directly to the disk.
– Expose this via the AIO system call for applications.
• Better use of CPU and memory for:
– Making fsck work faster.
– Deduplication of various filesystems (btrfs).
– Faster snapshotting.
– Quota calculations on XFS.
• dtrace
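The DIRECT_IO item above can be illustrated from the application side. The following is a minimal sketch, not Oracle's implementation: it assumes a Linux host, a 4096-byte alignment (the `BLOCK` constant is an assumption; real code should query the device), and uses the hypothetical helper name `write_block`. It opens a file with `O_DIRECT` so the write bypasses the page cache, and falls back to ordinary buffered I/O where the filesystem refuses direct I/O. It is synchronous; the AIO path mentioned on the slide is not shown.

```python
import mmap
import os

BLOCK = 4096  # assumed alignment and transfer size; not taken from the slides


def write_block(path, payload):
    """Write one BLOCK-sized block, bypassing the page cache where possible.

    Returns True if O_DIRECT was used, False if buffered I/O was the fallback.
    """
    if len(payload) > BLOCK:
        raise ValueError("payload larger than one block")

    # O_DIRECT needs an aligned buffer; an anonymous mmap is page-aligned.
    buf = mmap.mmap(-1, BLOCK)
    buf.write(payload.ljust(BLOCK, b"\0"))

    base = os.O_WRONLY | os.O_CREAT | os.O_TRUNC
    direct = getattr(os, "O_DIRECT", 0)  # 0 on platforms without O_DIRECT
    candidates = [base | direct, base] if direct else [base]
    try:
        for flags in candidates:
            try:
                fd = os.open(path, flags, 0o600)
            except OSError:
                continue  # e.g. a filesystem that rejects O_DIRECT at open time
            try:
                os.write(fd, buf)
                os.fsync(fd)
                return flags != base
            except OSError:
                continue  # direct write rejected; try the buffered variant
            finally:
                os.close(fd)
        raise OSError("could not write " + path)
    finally:
        buf.close()
```

A caller would check the return value to learn whether the cache was actually bypassed, since some filesystems do not support direct I/O at all.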
Linux features we have been developing:
• NFS/RDMA (InfiniBand), NFS v4.0, support for NFS client using ZFS storage
and Solaris NFS.
• Security fixes before Linux gets released (and after, too).
• Xen:
– The initial domain support and hardware features to match classic Xen support.
– Features in block and frontend to improve I/O.
– Lower latency for PCI passthrough devices.
– Near bare metal performance of guests.
– Continuous upstream presence to catch and fix regressions during Linus's merge
window.
– 'perf' support for Xen and more.
In the Xen ecosystem (hypervisor and toolstack):
• Xen Advisory Board, where we collaborate with other companies using Xen
– To do more testing across all vendors' workloads.
– Get more developers.
– Companies work together on features (Xen block subsystem).
• OASIS VirtIO workgroup to define the VirtIO specification.
• Faster boot, faster deallocation/allocation for huge guests.
• Faster performance on NUMA machines.
• Faster guests – replacing PV with PVH.
In the Xen ecosystem (hypervisor and toolstack)
• 'perf' support
– A full-stack (hypervisor, guests, etc.) view of what is running and where the
performance bottlenecks are.
• Xen hypervisor debugger – to troubleshoot in the field.
• Lower interrupt latencies for PCI passthrough.
• Transcendent memory (cooperative memory ballooning with benefits)
– An answer to memory overcommit: Linux balloons out pages it does not think
it will use often but which can take up a lot of memory. The hypervisor can
deduplicate and compress those across different guests. The end result is that we can fit
more guests on a machine and still have good performance (sometimes even a 4%
benefit!)
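The deduplicate-and-compress idea behind transcendent memory can be sketched as a toy model. This is only an illustration under stated assumptions, not Xen's tmem code: the class `ToyPagePool`, its method names, the use of SHA-256 for content matching, and zlib for compression are all hypothetical choices made for this example.

```python
import hashlib
import zlib


class ToyPagePool:
    """Hypothetical hypervisor-side pool shared by several guests.

    Identical pages are stored once (deduplication) and every stored
    page is kept compressed.
    """

    def __init__(self):
        self._blobs = {}  # content digest -> compressed page
        self._refs = {}   # content digest -> number of (guest, key) users
        self._index = {}  # (guest, key) -> content digest

    def put(self, guest, key, page):
        self.flush(guest, key)  # replace any earlier page under this key
        digest = hashlib.sha256(page).digest()
        if digest not in self._blobs:
            self._blobs[digest] = zlib.compress(page)
            self._refs[digest] = 0
        self._refs[digest] += 1
        self._index[(guest, key)] = digest

    def get(self, guest, key):
        digest = self._index.get((guest, key))
        if digest is None:
            return None  # the guest must fall back to its own storage
        return zlib.decompress(self._blobs[digest])

    def flush(self, guest, key):
        digest = self._index.pop((guest, key), None)
        if digest is None:
            return
        self._refs[digest] -= 1
        if self._refs[digest] == 0:  # last user gone: drop the stored copy
            del self._refs[digest]
            del self._blobs[digest]

    def stored_bytes(self):
        return sum(len(blob) for blob in self._blobs.values())
```

Two guests that hand over the same page cost the pool one compressed copy, which is the effect that lets more guests fit on a machine.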
Exadata Database Machine (have X4-2, X4-4, X4-8).
X4-8:
From Sun Server X4-8 Service Manual
Under the hood we have:
• NUMA
– 2, 4 or 8 sockets (CPU)
– Each socket has its own local memory.
– PCIe slots off sockets (I/O NUMA) with InfiniBand or flash in them.
– All sockets connected via QuickPath Interconnect (QPI).
• For best performance we don’t want to use QPI excessively. One solution is:
– Partitioning per socket.
– We have guests of various sizes that reside within their NUMA node.
• Combined with intelligent software (GRID, Oracle RAC), this gives top-notch
performance.
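Partitioning per socket can be sketched as a placement problem: each guest must fit entirely inside one NUMA node so that it never reaches across QPI for memory. The sketch below is an illustration only, using a simple first-fit-decreasing heuristic; the `Node` class, the `place_guests` function, and all sizes are hypothetical and not taken from OVM.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One NUMA node (socket) with its free local resources."""
    node_id: int
    free_cpus: int
    free_mem_gb: int
    guests: list = field(default_factory=list)


def place_guests(nodes, guests):
    """Place each guest (name, cpus, mem_gb) wholly inside a single node.

    Largest guests are placed first. Returns the names of guests that do
    not fit in any one node; those would have to span sockets.
    """
    unplaced = []
    ordered = sorted(guests, key=lambda g: (g[1], g[2]), reverse=True)
    for name, cpus, mem_gb in ordered:
        for node in nodes:
            if node.free_cpus >= cpus and node.free_mem_gb >= mem_gb:
                node.free_cpus -= cpus
                node.free_mem_gb -= mem_gb
                node.guests.append(name)
                break
        else:
            unplaced.append(name)
    return unplaced
```

For instance, with two nodes of 15 CPUs and 256 GB each, a 20-CPU guest is reported as unplaced rather than silently split across sockets.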
Networking – 40G and more:
• Multiple ways of having better performance:
– PCIe passthrough (InfiniBand or Network Interface Cards) using SR-IOV: what we
concentrate on for best performance for Engineered Systems. But no migration!
– Intel Data Plane Development Kit (DPDK). Low latency, but no migration!
– Improving Xen netback and netfront (Citrix driven; they are the maintainers of the Linux
Xen netback driver).
• Want the guest to run without invoking the hypervisor for privileged
operations (aka fewer VMEXITs):
– Interrupts go directly to the guest (posted interrupts). Improvement in Linux to use
vAPIC instead of event channels for PCIe interrupts.
– Lower the latency of interrupt delivery if we have to go through the hypervisor.
Storage: More IOPS!
• Classic OVM deployment is OCFS2 shared across different hosts.
• We have SSDs, now PCIe flash, and in the future NVMe.
• For better performance we do:
– Improve Xen block frontend and backend. Joint projects with Citrix on increasing
throughput and lowering latency.
– SR-IOV for even higher throughput and low latency (but no migration) for Engineered
Systems.
Guest improvements:
• ParaVirtualized (PV) guests problem:
– Page updates and syscalls require a context switch to the hypervisor.
– ParaVirtualized Hardware (PVH) uses the hardware to do page updates and syscalls instead
of requiring the guest to do the hypercalls. The end result is the removal of these bottlenecks in PV.
Xen hypervisor bottlenecks:
• Identify them using 'perf' to visualize the full system stack (hypervisor
and guests).
Xen transcendent memory.
• Memory is becoming a bottleneck in virtualized systems, so we want
more! However, we have memory-inefficient workloads.
End goal
• Performance, high quality, stability and security for all different workloads.
• Push patches upstream to benefit everybody.
Oracle is hiring!
konrad.wilk@oracle.com


LCNA14: Why Use Xen for Large Scale Enterprise Deployments? - Konrad Rzeszutek Wilk , Oracle
