This document discusses Google's use of CRIU (Checkpoint/Restore In Userspace) for task migration at scale. It gives background on Borg, Google's cluster management system, and on how tasks run in isolated containers. CRIU checkpoints and restores task state, so a task can be migrated transparently instead of being evicted. Migrations currently take 1-2 minutes; work is ongoing to improve performance and to implement live migration in support of latency-sensitive tasks. Security around CRIU's use of privileges is also an area of focus. Overall, CRIU has worked well, but continued collaboration is needed to address the remaining challenges.