cgroupv2: Linux’s new unified control group system
Chris Down (cdown@fb.com)
Production Engineer, Web Foundation
About me
■ Chris Down
■ Production Engineer at Facebook
■ Working in Web Foundation
■ Working on cgroupv2 ↔ OS integration, especially inside Facebook
■ Dealing with web server and backend service reliability
■ Managing >100,000 machines at scale
■ Building tools to manage and maintain Facebook’s fleet of machines
■ Debugging production incidents, with a focus on Linux internals
In this talk
■ A short intro to control groups and where/why they are used
■ cgroup(v1), what went well, what didn’t
■ Why a new major version/API break was needed
■ Fundamental design decisions in cgroupv2
■ New features and improvements in cgroupv2
■ State of cgroupv2, what’s ready, what’s not
What are cgroups?
■ cgroup == control group
■ System for resource management on Linux
■ Provides a directory hierarchy as the main point of interaction (at /sys/fs/cgroup)
■ Limit, throttle, manage, and account for resource usage per control group
■ Each resource interface is provided by a controller
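Interaction is plain filesystem operations. A minimal sketch (assuming a v1-style memory hierarchy and a hypothetical cgroup name):
% mkdir /sys/fs/cgroup/memory/mygroup                       # create a cgroup
% echo $$ > /sys/fs/cgroup/memory/mygroup/cgroup.procs      # move this shell into it
% cat /sys/fs/cgroup/memory/mygroup/memory.usage_in_bytes   # read its memory accounting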
Practical uses
■ Isolating core workload from background resource needs
■ Web server vs. system processes (eg. Chef, metric collection, etc)
■ Time critical work vs. long-term asynchronous jobs
■ Ensuring machines with multiple workloads (eg. containers) do not allow one workload to overpower the others
How did this work in cgroupv1?
cgroupv1 has a hierarchy per resource, for example:
% ls /sys/fs/cgroup
cpu/ cpuacct/ cpuset/ devices/ freezer/
memory/ net_cls/ pids/
Each resource hierarchy contains cgroups for this resource:
% find /sys/fs/cgroup/pids -type d
/sys/fs/cgroup/pids/background.slice
/sys/fs/cgroup/pids/background.slice/async.slice
/sys/fs/cgroup/pids/workload.slice
How did this work in cgroupv1?
■ Separate hierarchy/cgroups for each resource
■ Even if they have the same name, cgroups for each resource are distinct
■ cgroups can be nested inside each other
/sys/fs/cgroup
├── resource A
│   ├── cgroup 1
│   ├── cgroup 2
│   └── ...
├── resource B
│   ├── cgroup 3
│   ├── cgroup 4
│   └── ...
└── resource C
    ├── cgroup 5
    ├── cgroup 6
    └── ...
How did this work in cgroupv1?
■ Limits and accounting are performed per cgroup
■ For example, you might set memory.limit_in_bytes in cgroup 3, if resource B is “memory”
/sys/fs/cgroup
├── resource A
│   └── cgroup 1: pid 1, pid 2
├── resource B
│   └── cgroup 3: pid 3, pid 4
└── resource C
    └── cgroup 5: pid 2, pid 3
How did this work in cgroupv1?
■ One PID is in exactly one cgroup per resource
■ For example, PID 2 is in separate cgroups for resource A and C, but in the root cgroup for resource B since it’s not explicitly assigned (see the sketch after the diagram)
/sys/fs/cgroup
├── resource A
│   └── cgroup 1: pid 1, pid 2
├── resource B
│   └── cgroup 3: pid 3, pid 4
└── resource C
    └── cgroup 5: pid 2, pid 3
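You can see this per-resource membership from /proc (a sketch; hierarchy IDs and paths are illustrative):
% cat /proc/self/cgroup
8:memory:/workload.slice
4:pids:/background.slice/async.slice
2:cpu,cpuacct:/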
Hierarchy in cgroupv1
/sys/fs/cgroup
├── memory/
│   └── service 1: memory.limit_in_bytes
├── cpu/
│   └── service 1: cpu.shares
└── blkio/
    └── service 2: blkio.throttle.read_iops_device
How does this work in cgroupv2?
cgroupv2 has a unified hierarchy, for example:
% ls /sys/fs/cgroup
background.slice/ workload.slice/
Each cgroup can support multiple resource domains:
% ls /sys/fs/cgroup/background.slice
async.slice/ foo.mount/ cgroup.subtree_control
memory.high memory.max pids.current pids.max
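Resources are delegated to child cgroups by writing to cgroup.subtree_control (a sketch using the slice above):
% echo '+memory +pids' > /sys/fs/cgroup/background.slice/cgroup.subtree_control
% cat /sys/fs/cgroup/background.slice/cgroup.subtree_control
memory pids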
How does this work in cgroupv2?
■ cgroups are “global” now —
not limited to one resource
■ Resources are now opt-in for
cgroups
/sys/fs/cgroup
├── cgroup 1
│   ├── cgroup 2
│   └── ...
├── cgroup 3
│   ├── cgroup 4
│   └── ...
└── cgroup 5
    ├── cgroup 6
    └── ...
Hierarchy in cgroupv1
/sys/fs/cgroup
├── memory/
│   └── service 1: memory.limit_in_bytes
├── cpu/
│   └── service 1: cpu.shares
└── blkio/
    └── service 2: blkio.throttle.read_iops_device
Hierarchy in cgroupv2
/sys/fs/cgroup
├── service 1
│   ├── memory.high ∗
│   ├── cpu.weight
│   └── cgroup.subtree_control: memory cpu
└── service 2
    ├── io.max (riops key)
    └── cgroup.subtree_control: io
∗ we’ll discuss this vs. memory.max later
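A sketch of what these control files accept (paths, device numbers, and values are hypothetical):
% echo 4G > '/sys/fs/cgroup/service 1/memory.high'           # start reclaim/throttling above 4G
% echo 200 > '/sys/fs/cgroup/service 1/cpu.weight'           # default weight is 100
% echo '8:16 riops=120' > '/sys/fs/cgroup/service 2/io.max'  # cap reads on device 8:16 to 120 IOPS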
Fundamental differences between v1 and v2
■ Unified hierarchy — resources apply to cgroups now
■ Granularity at PID, not TID level
■ Focus on simplicity/clarity over ultimate flexibility
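The granularity change shows up directly in the interface (a sketch; <tid> and <pid> are placeholders):
% echo <tid> > /sys/fs/cgroup/cpu/workload.slice/tasks     # v1: individual threads
% echo <pid> > /sys/fs/cgroup/workload.slice/cgroup.procs  # v2: whole processes only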
v2 improvements: Tracking of non-immediate charges
Some types of resources are not immediately chargeable (eg. page cache writeback, network packets going in/out)
In v1:
■ These resources cannot be tied to a cgroup
■ Charged to the root cgroup for each resource type, so are essentially unlimited
In v2:
■ Resources spent on page cache writeback and networking are charged to the correct cgroup
■ These charges can be counted against cgroup limits
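For example, per-cgroup writeback now shows up in accounting (a sketch; field names follow the cgroupv2 memory docs, values are illustrative):
% grep -E '^file_(dirty|writeback)' /sys/fs/cgroup/background.slice/memory.stat
file_dirty 1810432
file_writeback 135168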
v2 improvements: Communication with backing subsystems
In v1:
■ Most actions for non-share-based resources reacted violently to hitting thresholds
■ For example, in the memory cgroup, the only option was to OOM kill or freeze
In v2:
■ Many cgroup controllers inform subsystems before problems occur
■ Subsystems can take remedial action to avoid violent outcomes (eg. forced direct reclaim with memory.high, window size manipulation on reclaim failure)
■ Much easier to deal with temporary spikes in a resource’s usage
v2 improvements: Consistency between controllers
In v1:
■ Controllers often have inconsistent interfaces
■ Some controller hierarchies inherit values from parents, some don’t
■ Controllers that have similar limiting methods (eg. io/cpu) have inconsistent APIs
In v2:
■ Our crack team of API Design Experts™ have ensured your sheer delight∗
∗ well, at least we can pray we didn’t screw it up too badly
v2 improvements: Some reasonable configurations are now possible
In v1:
■ Some limitations could not be fixed due to backwards compatibility (eg. memory limit types)
■ memory.{,kmem.,kmem.tcp.,memsw.,[...]}limit_in_bytes
In v2:
■ Less iterative, more designed up front
■ In the case of memory limit types, we now have universal thresholds (memory.{high,max})
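A sketch of the two thresholds in use (path and values are hypothetical):
% echo 8G > /sys/fs/cgroup/workload.slice/memory.high   # reclaim/throttle above this
% echo 10G > /sys/fs/cgroup/workload.slice/memory.max   # hard limit; OOM beyond this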
Current support
■ Controllers merged into mainline kernel
  ■ I/O
  ■ Memory
  ■ PID
■ Controllers still pending merge
  ■ CPU (thread: https://bit.ly/cgroupv2cpu)
    Disagreements: process granularity, constraints around PID placement in cgroups
■ Controllers still being worked on
  ■ Freezer
Chris, is Facebook really using cgroupv2 in production?
Yes! Really!
■ Currently rolled out to a non-negligible percentage of web servers
■ Running managed with systemd (see Davide Cavalca’s “Deploying systemd at scale” talk at systemd.conf 2016)
■ Already getting better results on spiky workloads
■ Better separation of workload services from system services (eg. service routing, metric collection)
How can I get it?
cgroupv2 has been stable since Linux 4.5. Here are some useful kernel command-line flags:
■ systemd.unified_cgroup_hierarchy=1
  ■ systemd will mount /sys/fs/cgroup as cgroupv2
  ■ Available from systemd v226 onwards
■ cgroup_no_v1=all
  ■ The kernel will disable all v1 cgroup controllers
  ■ Available from Linux 4.6 onwards
Mounting (if your init system doesn’t do it for you):
% mount -t cgroup2 none /sys/fs/cgroup
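Once mounted, check which controllers the kernel offers (output is illustrative):
% cat /sys/fs/cgroup/cgroup.controllers
io memory pids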
People at Facebook working on cgroupv2
■ FB kernel team working on upstreaming improvements and new features
■ FB operating systems team and Web Foundation working on OS integration & production testing
■ Tupperware (scheduler & containerisation) team working on rollout to all container users
Still have questions?
■ cgroupv2 has great user-facing docs: https://bit.ly/cgroupv2
■ I’ll be around for questions over pizza 😃