5 NEW HIGH-PERFORMANCE
FEATURES IN RED HAT OPENSHIFT
Patterns and technology to run critical, high
performance line-of-business applications on Red
Hat OpenShift Container Platform
Derek Carr, Jeremy Eder
Red Hat Product Engineering
May 2018
A SHORT STORY
I WANT TO RUN ON
OPENSHIFT
BUT MY WORKLOAD IS
SPECIAL
IT IS LATENCY SENSITIVE
IT HAS A HUGE CACHE
IT REQUIRES HOST
LEVEL TUNING
IT NEEDS A SPECIAL
DEVICE
IT’S A SNOWFLAKE
ACTUALLY, I HAVE A
LOT OF WORKLOADS
LIKE THIS!
Red Hat Summit 2018 5 New High Performance Features in OpenShift
WHAT HAS CHANGED
COMMUNITY
sig-node, sig-scheduling, wg-resource-mgmt
● Expand the set of workloads runnable on the platform
● Maintain reliability
● Keep it simple
WORKLOAD TYPES
Going beyond generic web hosting workloads
Big Data, NFV, FSI, Animation, ISVs, HPC, Machine Learning
● Identify requirement overlap across verticals
● Plumb enhancements generically
● Allow flexibility
DRILL DOWN ON OVERLAP
| Feature                             | FSI | NFV | ISV | BD/ML | ANIM  | HPC   |
|-------------------------------------|-----|-----|-----|-------|-------|-------|
| CPU pinning (cpuset)                | Yes | Yes | Yes | Maybe | Maybe | Yes   |
| Device passthrough (GPU, NIC, etc.) | Yes | Yes | Yes | Yes   | Maybe | Yes   |
| Sysctl support                      | Yes | Yes | Yes | Yes   | Yes   | Yes   |
| HugePages                           | Yes | Yes | Yes | Yes   | Maybe | Maybe |
| NUMA                                | Yes | Yes | Yes | Maybe | Maybe | Yes   |
| Separate control and data plane     | Yes | Yes | Yes | Yes   | Yes   | Yes   |
| Kernel module loading               | Yes | Yes | Yes | Maybe | Yes   | Maybe |
PROGRESS REPORT
What has been done in the last year?
● CPU manager (static pinning)
● HugePages
● Device Plugins (GPU, etc.)
● Sysctl support
● Extended Resources
RESOURCE MANAGEMENT: PRIMER
RESOURCES AND TUNING OPTIONS
Natively understood
● CPU
● Memory
● Ephemeral storage
● Persistent storage
● HugePages
● Device Plugins
● Extended Resources
Control knobs available
● CPU Manager policy (none, static)
● sysctls (safe / unsafe)
RESOURCE REQUIREMENTS
Describes the compute resources needed by a pod
Limits
● Maximum burst (if available)
Requests
● Minimum amount
(guaranteed)
Overcommit
● Ratio of limit to request
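Requests and limits are declared per container in the pod spec. A minimal sketch (container name, image, and values are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
  - name: app
    image: registry.example.com/app:latest
    resources:
      requests:          # minimum, used by the scheduler
        cpu: "500m"
        memory: "256Mi"
      limits:            # maximum burst, enforced at runtime
        cpu: "1"
        memory: "512Mi"
```

With requests below limits, the CPU overcommit ratio (limit / request) here is 2.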
QUALITY OF SERVICE
Keep the end-user API simple, let the platform optimize for SLA guarantees
Burstable
● Requests < Limits
Best Effort
● No requests or limits
Guaranteed
● Requests = Limits
Beware
● Workload SLA
● Eviction
Future
● QoS is an abstraction that lets kubelets support different tuning
options for particular resource types in the future while keeping the
API simple
CLUSTER TOPOLOGY
Control Plane: three masters, each co-located with etcd
Infrastructure: load balancer (LB) fronting three registry-and-router nodes
Compute Nodes and Storage Tier
NODE BOOTSTRAPPING
Compute nodes fetch configuration from the server via per-group Config Maps
(node-compute, node-cpu-bound, node-master, node-highmem), each carrying a
node-config.yaml:
● Default kubelet arguments
● Default labels
● Default taints
● Changes are kept in sync
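Each node group's Config Map embeds a node-config.yaml. A minimal sketch for a hypothetical cpu-bound group, following the OpenShift 3.x kubeletArguments format (the group name, label, and reserved-CPU value are illustrative):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: node-cpu-bound
  namespace: openshift-node
data:
  node-config.yaml: |
    kubeletArguments:
      cpu-manager-policy:
      - "static"
      kube-reserved:          # the static policy needs CPU reserved for the system
      - "cpu=500m"
      node-labels:
      - "workload=cpu-bound"
```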
FEATURE 1: CPU MANAGER
CPU
How is it exposed?
● Compressible
● Measured in cores
● Not normalized for clock speed
● Use node labels to differentiate
● Assigned per container
CPU
How is it enforced?
Completely Fair Scheduler (CFS)
● Requests via cpu.shares
● Limits via cpu.cfs_quota_us
Result
● Distributed across all cores
● Throttling
Challenge
● CPU-bound workloads (cache affinity and scheduling latency) are impacted
CPU MANAGEMENT POLICY
Tuning the node for cpu-bound workloads
Supported policies
● none is the default policy (just integrates with CFS)
● static grants containers in Guaranteed pods with integer CPU requests
exclusive CPUs on the node, enforced via the cpuset cgroup controller
Benefits
● End-user API is simple (kubelet option)
● Increases CPU affinity and decreases context switches
● Kubelet manages local node topology (important when doing devices)
● More dynamic policies could be introduced in the future
DEMO 1 - CPU Pinning
Enable cpu pinning via dynamic node config: Demo
● Inspect node config map, see kubeletArguments for --cpu-manager-policy=static
● Inspect cpuset.cpus of pod containers assigned either shared or exclusive cores
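With the static policy enabled, a Guaranteed pod with an integer CPU request receives exclusive cores. A minimal sketch (pod name, image, and sizes are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pinned
spec:
  containers:
  - name: worker
    image: registry.example.com/worker:latest
    resources:
      requests:
        cpu: "2"          # integer CPU count -> eligible for pinning
        memory: "1Gi"
      limits:
        cpu: "2"          # equal to requests -> Guaranteed QoS
        memory: "1Gi"
```

The assigned cores can be checked from the container's cpuset.cpus file under the cpuset cgroup hierarchy.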
FEATURE 2: HUGE PAGES
HUGE PAGES
Supports the allocation and consumption of pre-allocated huge pages
Scenario
● Large memory working set sensitive to TLB misses
(RDBMS, JVM, cache, packet processors)
HUGE PAGES
Example Pod
Usage
● Pod request
● Node must pre-allocate
● EmptyDir (medium: HugePages)
● shmget w/ SHM_HUGETLB
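An example pod combining a hugepages resource request with a HugePages-backed EmptyDir (pod name, image, and sizes are illustrative; the node must have 2Mi pages pre-allocated):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hugepages-example
spec:
  containers:
  - name: app
    image: registry.example.com/app:latest
    resources:
      limits:
        hugepages-2Mi: "100Mi"
        memory: "256Mi"     # regular memory must still be specified
    volumeMounts:
    - name: hugepage
      mountPath: /hugepages
  volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages
```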
DEMO 2 - Pod that requires huge pages
Dynamically pre-allocate huge pages and schedule a pod: Demo
● Deploy DaemonSet to pre-allocate huge pages
● Inspect node allocatable
● Deploy a pod that consumes huge pages
FEATURES 3, 4:
EXTENDED RESOURCES
and
DEVICE PLUGINS
DEMO 3 - Extended Resources
Counting dongles: Demo
● Implementation detail
○ For device plugins
● Industry leading UX!
○ (PATCH via curl)
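Advertising an extended resource is a JSON-Patch against the node status, which can be issued with curl through a `kubectl proxy` (the resource name and count are illustrative; `<node-name>` is a placeholder):

```shell
# In another terminal: kubectl proxy
# "~1" encodes "/" in a JSON-Pointer path, so
# example.com~1dongle means example.com/dongle.
curl --header "Content-Type: application/json-patch+json" \
     --request PATCH \
     --data '[{"op": "add", "path": "/status/capacity/example.com~1dongle", "value": "4"}]' \
     http://localhost:8001/api/v1/nodes/<node-name>/status
```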
DEVICE PLUGINS
gRPC service to expose devices to kubelet
Initialization
● Is the device healthy?
Registration
● Register with kubelet
Serving mode
● Monitor device health
● Allocate device
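The registration and serving phases map onto a pair of gRPC services. A trimmed sketch of the v1beta1 device plugin API (message fields and optional RPCs omitted):

```protobuf
// Plugin calls Register on the kubelet's socket to announce itself.
service Registration {
  rpc Register(RegisterRequest) returns (Empty) {}
}

// Kubelet then talks to the plugin over these RPCs.
service DevicePlugin {
  // Stream the device list and health updates to the kubelet.
  rpc ListAndWatch(Empty) returns (stream ListAndWatchResponse) {}
  // Called during container creation to assign specific devices.
  rpc Allocate(AllocateRequest) returns (AllocateResponse) {}
}
```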
DEMO 4, 5 - GPUs
Consume a GPU in OpenShift: Infrastructure Demo, Multi-GPU Jupyter/Caffe Demo
● Deploy nvidia-device-plugin DaemonSet
● Inspect node allocatable
● Deploy a pod that consumes a GPU
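Once the plugin DaemonSet is running, a pod consumes a GPU through the nvidia.com/gpu extended resource. A minimal sketch (pod name and image tag are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
  - name: cuda
    image: nvidia/cuda:9.0-base    # image tag illustrative
    resources:
      limits:
        nvidia.com/gpu: 1   # advertised by the nvidia-device-plugin
```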
FEATURE 5: SYSCTL
SYSCTL
The three types
Unsafe
● Experimental Kubelet Flag
● kernel.sem*
● kernel.shm*
● kernel.msg*
● fs.mqueue.*
● net.*
Safe
● Enabled by default
● kernel.shm_rmid_forced
● net.ipv4.ip_local_port_range
● net.ipv4.tcp_syncookies
Node-level
● Can’t set from a pod
● Potentially affects other pods
● Many interesting sysctls
● Use TuneD
Red Hat is working to graduate this feature to Beta in the Kubernetes 1.11 release
● KEP: Promote sysctl annotations to fields
● Feedback welcome!
DEMO 6 - SYSCTL
Usage in a pod: Demo
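Under the KEP, sysctls move from the security.alpha.kubernetes.io/sysctls annotation to a field in the pod's securityContext. A minimal sketch of the field form (pod name and image are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sysctl-example
spec:
  securityContext:
    sysctls:
    - name: kernel.shm_rmid_forced   # a safe sysctl, enabled by default
      value: "1"
  containers:
  - name: app
    image: registry.example.com/app:latest
```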
WHAT’S NEXT
ROADMAP
Red Hat continues to invest in evolving support
Topic areas
● NUMA
● Co-located device scheduling
● External device monitoring
● Resource API V2
WORKLOAD COVERAGE: Metal, KVM, Containers
Code Path Coverage
● CPU – linpack, lmbench
● Memory – lmbench, McCalpin STREAM
● Disk IO – iozone, fio
● Networks – netperf – 10/40Gbit, Infiniband/RoCE, Bypass
Application Performance
● Linpack MPI, HPC workloads
● Database: DB2, Oracle 11/12, Sybase 15.x, MySQL, MariaDB, Postgres, MongoDB
● OLTP – TPC-C, TPC-VMS
● DSS – TPC-H/xDS
● Big Data – TPCx-HS, Bigbench
● SPEC cpu, jbb, sfs, virt, cloud
● SAP – SLCS, SD
● STAC – FSI (STAC-N, A2)
● SAS mixed Analytic, SAS grid (gfs2)
QUESTIONS?
plus.google.com/+RedHat
linkedin.com/company/red-hat
youtube.com/user/RedHatVideos
facebook.com/redhatinc
twitter.com/RedHat
THANK YOU
