Confidential + Proprietary
Task Migration at Scale Using CRIU
Linux Plumbers Conference 2018
Victor Marmol Andy Tucker
vmarmol@google.com agtucker@google.com
2018-11-15
Who we are
Outside of Google, we’ve worked on open source cluster management and
containers
lmctfy
Inside Google: we’re part of the Borg team
● Manages all compute jobs
● Runs on every server
Images by Connie Zhou
What is Borg?
Google’s cluster management system
● Borgmaster: Cluster control and main API
entrypoint
● Borglet: On-machine management daemon
● Suite of tools and UIs for managing jobs
● Many purpose-built platforms created on top of
Borg
● Everything runs on Borg and everything runs in
containers
Borg basics
Base compute primitive: Task
● A priority signals how quickly a task should be scheduled
● Its appclass describes a task as either serving (latency-sensitive) or batch
● Static content/binaries provided by
packages
● A container isolates a task’s resources
● Native Linux processes
● Tasks share the machine’s IP; ports are allocated for each task
[Diagram: a Borg machine running several tasks; each task is a container holding processes, packages, and allocated ports]
Borg basics: evictions
When a task is forcefully terminated by Borg
● Tasks typically receive advance notice: 1-5 min
● Our SLO allows for quite a few evictions
● Applications must handle them
Reasons for evictions
● Preemption: a higher priority task needs the resources
● Software upgrades (e.g.: kernel, firmware)
● Re-balancing for availability or performance
Evictions are impactful and hard to handle
Technical Complexity
● Handling evictions requires state management
○ How and what state to serialize and where to store it
● Application-specific and not very reusable
Lost Compute
● Batch jobs run at lower priorities and get preempted often
● Even platforms that handle evictions for users don’t do a great job
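To make the state-management burden concrete, here is a minimal sketch of the application-level checkpointing a batch task would otherwise have to write for itself. Everything in it is an assumption made for illustration: the function names (`save_checkpoint`, `load_checkpoint`, `run_shard`), the JSON file format, and the `evict_at` parameter that simulates an eviction. None of it is Borg's API.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Atomically write the task's state so a restart can resume from it."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)  # atomic rename: readers never see a partial file


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if not os.path.exists(path):
        return {"next_item": 0, "total": 0}
    with open(path) as f:
        return json.load(f)


def run_shard(items, path, evict_at=None):
    """Sum `items`, checkpointing after each one.

    `evict_at` is a hypothetical stand-in for an eviction: the function
    returns None when it reaches that index, as if the task had been killed.
    """
    state = load_checkpoint(path)
    for i in range(state["next_item"], len(items)):
        if evict_at is not None and i == evict_at:
            return None  # evicted: any work since the last checkpoint is lost
        state["total"] += items[i]
        state["next_item"] = i + 1
        save_checkpoint(path, state)
    return state["total"]
```

Even this toy version has to decide what to serialize, where to store it, and how often, and each of those choices is specific to one application.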
[Diagram: a sharded job with tasks 0-4; one task is evicted and its compute is lost]
Migrations to avoid evictions
Transparently replace evictions with migration
Native task migration offering in Borg
● Borg controls the eviction → always knows when to migrate
● Native management of state allows reuse for all workloads
Various possible mechanisms
● Checkpoint/restore
○ Pause application, transfer state, resume
○ Long blackout period, no brownout
● Live
○ Very short blackout, but with a longer brownout
○ Very low impact to applications
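The blackout/brownout trade-off between the two mechanisms can be sketched with a small model. This is the generic iterative pre-copy scheme known from VM live migration, not something the talk presents as Borg's implementation. The function names and all parameters (`bandwidth`, `dirty_rate`, `threshold_bytes`, `max_rounds`) are assumptions for illustration.

```python
def stop_and_copy(state_bytes, bandwidth):
    """Checkpoint/restore: the task is paused for the whole transfer."""
    return {"brownout_s": 0.0, "blackout_s": state_bytes / bandwidth}


def pre_copy(state_bytes, bandwidth, dirty_rate, threshold_bytes, max_rounds=10):
    """Live migration modeled as iterative pre-copy.

    Each round copies the outstanding bytes while the task keeps running
    (brownout); meanwhile the task dirties `dirty_rate` bytes per second.
    Once the remainder is small enough, or the rounds run out, the task is
    paused and the remainder is copied (blackout).
    """
    remaining = float(state_bytes)
    brownout = 0.0
    for _ in range(max_rounds):
        if remaining <= threshold_bytes:
            break
        round_time = remaining / bandwidth
        brownout += round_time
        # Bytes dirtied during this round, capped at the total state size.
        remaining = min(float(state_bytes), dirty_rate * round_time)
    return {"brownout_s": brownout, "blackout_s": remaining / bandwidth}
```

With 1000 bytes of state, a bandwidth of 100 bytes/s, and a dirty rate of 10 bytes/s, stop-and-copy pauses the task for 10 s, while pre-copy with a 20-byte threshold pauses it for about 0.1 s after an 11 s brownout.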
Challenges with task migration
● Migrating network connections → Drop the connection, user handles reconnections
● Port collisions and port use → NET namespaces and IPv6 per-container
● Storage migration is slow → Little to no local storage
● Must virtualize machine-local resources → Linux namespaces
● Linux process state hard to migrate → CRIU!
Migration Workflow
Task Migration: Checkpoint/Restore
Isolated Task Environment (Machine A)
● Linux namespaces
● Little local storage
● IPv6
● Google libraries (e.g.: Stubby/gRPC)
Checkpoint (Machine A)
● Pause task
● Serialize state
● Upload to distributed storage (Colossus)
Migration
● Borgmaster chooses a new machine (Machine B) to schedule the task
Restore (Machine B)
● Download from distributed storage (Colossus)
● Deserialize state
● Continue running task
Isolated Task Environment (Machine B)
● Machine is opaque to the task
● Your local data travels with the task
● Your IP changes
● Google libraries re-establish connections
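The checkpoint, migration, and restore steps can be sketched end to end as follows. All names are hypothetical stand-ins: a plain dict plays the role of distributed storage (Colossus), `pickle` plays the role of CRIU's serialization, and picking the first idle machine stands in for Borgmaster's scheduling decision.

```python
import pickle


class Machine:
    """Hypothetical stand-in for a Borg machine that can host one task."""

    def __init__(self, name):
        self.name = name
        self.task = None     # live task state, if a task is running here
        self.paused = False


def checkpoint(machine, storage, key):
    """Pause the task, serialize its state, and upload it to `storage`."""
    machine.paused = True                # 1. pause task
    blob = pickle.dumps(machine.task)    # 2. serialize state
    storage[key] = blob                  # 3. upload to distributed storage
    machine.task = None                  # the task no longer lives here
    machine.paused = False


def restore(machine, storage, key):
    """Download the checkpoint, deserialize it, and resume the task."""
    blob = storage[key]                  # 1. download from distributed storage
    machine.task = pickle.loads(blob)    # 2. deserialize state
    machine.paused = False               # 3. continue running task


def migrate(source, candidates, storage, key):
    """Checkpoint on `source`, pick an idle machine, restore there."""
    checkpoint(source, storage, key)
    target = next(m for m in candidates if m is not source and m.task is None)
    restore(target, storage, key)
    return target
```

For example, `migrate(a, [a, b], {}, "task-0")` leaves the task's state on `b` and nothing on `a`. The task is down for the whole span between pause and resume, which is the long blackout of the checkpoint/restore mechanism.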
Networking
Networking @ Google
● Standardized RPC implementation: Stubby/gRPC
● Nearly all communication is RPC
● Unique IPv6 address per task
● BNS: Borg DNS, used by RPC layer
Task Migration
● Stubby/gRPC automatically reconnects
● Reconnect is transparent to users
● IP address changes, but this is rarely a problem
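The reconnect behavior described above can be sketched as a client that looks the task's name up again whenever a connection fails. This is not the Stubby or gRPC API. The class `ReconnectingClient`, the `resolve` callback (standing in for a BNS lookup), and the `connect` callback are all hypothetical.

```python
class ReconnectingClient:
    """Hypothetical RPC client that re-resolves a task name on failure."""

    def __init__(self, name, resolve, connect, max_attempts=3):
        self.name = name
        self.resolve = resolve    # name -> address (stand-in for a BNS lookup)
        self.connect = connect    # address -> callable(request) -> response
        self.max_attempts = max_attempts
        self._channel = None

    def call(self, request):
        last_error = None
        for _ in range(self.max_attempts):
            try:
                if self._channel is None:
                    # Look the name up again: the task may have moved.
                    address = self.resolve(self.name)
                    self._channel = self.connect(address)
                return self._channel(request)
            except ConnectionError as err:
                last_error = err
                self._channel = None  # drop the broken connection and retry
        raise last_error
```

Because callers address the task by name rather than by IP, the address change after a migration shows up only as one failed attempt inside `call`.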
[Diagram: a task with its own IPv6 address and NET namespace; its process reaches the network through the Stubby/gRPC library]
Storage
Storage @ Google: Minimized local storage
● Most tasks are stateless, few require local
SSD/HDD
● Those that require state use our remote storage
stacks (e.g.: Colossus, Spanner)
● Small local storage is offered via tmpfs
Task migration
● Lack of local storage greatly simplifies work
● Remote storage stacks use RPC and thus recover
gracefully
● Small local storage is migrated with task
[Diagram: a task whose process uses the Stubby/gRPC library to reach Colossus, Spanner, and PD, with a tmpfs scratch space inside the task]
Task environment
Container
● Primarily used for resource isolation
● Full namespaces applied
Security
● Root is not mapped into user namespace
● Capabilities are strictly limited
Root filesystem
● Separate from the host machine’s
● Built and bundled by the task as a package
[Diagram: a task in a container (full cgroups + full namespaces) running a process (no root + limited caps) on an isolated filesystem]
CRIU
Checkpoint/Restore in User Space
● Used to serialize/deserialize the task’s process
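For readers who have not used CRIU, the sketch below builds the basic upstream command lines for dumping and restoring a process tree. The options shown (`--tree`, `--images-dir`, `--leave-running`) are standard CRIU CLI options. The helper names are hypothetical, and the talk does not say which options Borg's tooling actually passes.

```python
def criu_dump_argv(pid, images_dir, leave_running=False):
    """Build the argv for checkpointing the process tree rooted at `pid`."""
    argv = ["criu", "dump", "--tree", str(pid), "--images-dir", images_dir]
    if leave_running:
        # Keep the task running after the dump instead of killing it.
        argv.append("--leave-running")
    return argv


def criu_restore_argv(images_dir):
    """Build the argv for restoring a process tree from `images_dir`."""
    return ["criu", "restore", "--images-dir", images_dir]
```

The helpers only build argument lists. Executing them, for example with `subprocess.run(argv, check=True)`, requires CRIU to be installed and sufficient privileges on the host.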
Security and isolation
● Run inside a task’s container
● Run with minimal privileges
The Migrator
● Injected into task during a migration, orchestrates
the migration
● Manages execution of CRIU
● Encrypts and compresses checkpoint on the fly
○ Pretends to be a CRIU pageserver
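Processing a checkpoint "on the fly" means transforming it chunk by chunk as it streams past, without first writing the full image to disk. A minimal sketch of the compression half follows. The choice of `zlib`, the function names, and the chunking are assumptions, since the talk does not name the algorithms used. Encryption is omitted because the Python standard library has no suitable cipher.

```python
import zlib


def compress_stream(chunks, level=6):
    """Yield compressed bytes for an iterable of byte chunks, incrementally."""
    compressor = zlib.compressobj(level)
    for chunk in chunks:
        out = compressor.compress(chunk)
        if out:  # the compressor may buffer input and emit nothing yet
            yield out
    yield compressor.flush()  # emit whatever is still buffered


def decompress_stream(chunks):
    """Inverse of compress_stream, also operating chunk by chunk."""
    decompressor = zlib.decompressobj()
    for chunk in chunks:
        out = decompressor.decompress(chunk)
        if out:
            yield out
    tail = decompressor.flush()
    if tail:
        yield tail
```

Both generators hold only one chunk plus the codec's internal buffer in memory, so they can sit between a producer of memory pages and an uploader.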
[Diagram: Borglet [root] on the machine; inside the task’s container, the Migrator [root], CRIU [user], and the task’s process [user]; checkpoint data goes to Colossus]
In practice today
Migrations take 1-2min and succeed 90%+ of the time
Where the time goes
● Checkpoint/restore is relatively fast for well-behaved tasks
● Writing/reading to remote storage dominates checkpoint/restore
● Scheduling delays are also a large source of latency
Causes of failures
● Timeouts from high task resource usage (e.g.: threads, memory)
● Different host environments
● Misc failures in serialization (e.g.: unsupported features)
Users
Works well for batch jobs
● Latency tolerant, longer-running, and lower priority
● Some are highly sharded and see many evictions
● Long pipelines suffer when some parts are evicted
User feedback
● They love it! Super simple to adopt
● Desire for advanced features
○ Migration notifications
○ User-controlled pause/resume
Not a great offering for latency-sensitive jobs
[Figure: time scale from 1s to 10m (1s, 10s, 1m, 3m, 10m), labeled Batch/LTLS]
Adoption challenges
Handling connection failures
● In theory: users are taught to expect failures
● In practice: users don’t handle failures well
○ Expect them not to occur and reset their state when they do
Isolating task environment
● Users make assumptions about the underlying host
○ Services are available via localhost
○ Expecting host:port to work
● Users don’t expect the underlying host to change at runtime
○ Certain features detected at startup and never refreshed (e.g.: kernel, CPU, location)
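One way an application can avoid the "detected at startup and never refreshed" trap is to re-probe host-derived values periodically rather than caching them forever. The sketch below is an illustration only: the class `HostFacts`, its `probe` callback, and the `ttl` parameter are hypothetical, and the talk does not prescribe this approach.

```python
import time


class HostFacts:
    """Cache host-derived values, but re-probe them after `ttl` seconds."""

    def __init__(self, probe, ttl=60.0, clock=time.monotonic):
        self._probe = probe    # callable returning a dict of host facts
        self._ttl = ttl
        self._clock = clock    # injectable so the behavior can be tested
        self._value = None
        self._fetched_at = None

    def get(self):
        now = self._clock()
        if self._fetched_at is None or now - self._fetched_at >= self._ttl:
            # First call or stale cache: ask the host again.
            self._value = self._probe()
            self._fetched_at = now
        return self._value
```

A real probe might return `{"kernel": platform.release(), "cpus": os.cpu_count()}`. After a migration to a different machine, the next refresh picks up the new values instead of the ones seen at startup.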
Experience with CRIU
In one word? AMAZING!
● Mostly worked out of the box with few changes
● Reliability and performance have been great in production
● Community has been helpful and quick to fix issues
Our changes
● Performance improvements for checkpoint/restore
● Easing some limitations (see next slide)
● Most patches sent upstream
CRIU security
CRIU is suggested to be run as root
● Security auditing found a series of bugs
● A malicious task can hijack a CRIU process
Recommendation
● Run CRIU as the task’s user
● Run in user namespace without root mapped in
● Trim privileges to minimal set
[Diagram: Borglet [root] on the machine; inside the task’s container, the Migrator [root], CRIU [user], and the task’s process [user]]
What could use improvement
Performance
● Some expensive operations remain, some have kernel limitations
○ e.g.: waitpid on all threads is O(n²)
Security
● Reducing need for root and elevated capabilities
● Not well tested in this setup
Misc
● Contributing patches back is a bit hard
Live migration
● Parts of incremental restore are very, very difficult
● Lots of work ahead to do the type of brownout used in VM live migration today
Handling time
● Hard to abstract away many of the hardware time counters
● Time namespaces to the rescue?
Future work
Increasing adoption internally
● Reduce lost compute and simplify user tasks
● Targeting on-by-default for large batch workloads
Machine-to-machine migration
● Skip the distributed storage of the checkpoint
● Reduces migration times to ~30s
Live migration
● Able to address latency sensitive workloads
● Will require some work in our stack and in CRIU
Questions?
Native task migration offering in Borg
● Reduces compute lost to evictions
● Simplifies task handling of preemptions
● Addresses most batch workloads
● Serving workloads need live migration
CRIU
● Works amazingly well out of the box
● Security an area of investment
● We are excited about and look forward to live migration!
Victor Marmol
vmarmol@google.com
Andy Tucker
agtucker@google.com