Containers on Baremetal
And Preemptible Servers
At CERN and SKA
Belmiro Moreira, CERN
@belmoreira
John Garbutt, StackHPC
@johnthetubaguy
CERN - Large Hadron Collider (LHC)
CERN: Compact Muon Solenoid (CMS)
CERN - Cloud resources status board - 11/05/2018@09:23
What is SKA?
Image courtesy of CSIRO
Containers on Baremetal
SKA’s Science Data Processor
ALaSKA
SKA Performance Prototype
ALaSKA
Why Baremetal? Why Containers?
● Single security zone
● No need for virt, so target baremetal
● 30 seconds to switch ingest to Supernova, Fast radio burst, …
● Easier development and deployment cycles
Magnum with Ironic
● Magnum used extensively at CERN
● Docker Swarm and Kubernetes are supported
● Historically a separate driver for baremetal, badly maintained
● Queens moves to using Fedora Atomic for VM and baremetal
System Integration Prototype
Lessons learned
● Extra network ports added after initial setup
● Updated Docker version in Fedora Atomic 27
● Updating Atomic image with RDMA drivers was tricky
● Root disk wasn’t resized by cloud-init
http://www.stackhpc.com/magnum-queens.html
Preemptible Instances
Resource Utilization
● Public Clouds give the illusion of infinite capacity
○ Users pay for resources that they use
● Private Clouds
○ Resource management is usually based on project quotas
○ Prevent resources being exhausted
○ Prevent “over-committing” resources/quota
○ Manage individual projects’ requirements
○ Reserve resources for operations with higher priority
○ Scientific Clouds
■ Projects have different funding models
■ They expect a predefined number of resources available
■ But these resources are not always used full time
Idle Resources with quotas
[Figure: quota utilization for Project 1 and Project 2]
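The idle-resources picture can be made concrete with a little arithmetic. The following is a minimal sketch, not taken from the slides: the `Project` class, the core counts, and the function names are all hypothetical, chosen only to show how hard quotas can leave capacity reserved but unused.

```python
from dataclasses import dataclass


@dataclass
class Project:
    name: str
    quota_cores: int   # cores reserved for the project (hard limit)
    used_cores: int    # cores currently consumed by its instances


def idle_cores(projects):
    """Cores that are reserved by quota but not running anything."""
    return sum(p.quota_cores - p.used_cores for p in projects)


def utilization(projects):
    """Fraction of the total quota that is actually in use."""
    total_quota = sum(p.quota_cores for p in projects)
    total_used = sum(p.used_cores for p in projects)
    return total_used / total_quota if total_quota else 0.0


# Hypothetical numbers, chosen only to illustrate the effect.
projects = [
    Project("Project 1", quota_cores=100, used_cores=40),
    Project("Project 2", quota_cores=100, used_cores=90),
]
print(idle_cores(projects))    # 70
print(utilization(projects))   # 0.65
```

In this made-up example, 70 of 200 cores sit idle even though no other project is allowed to use them.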
Maximize Resource Utilization
● Public Clouds
○ Based on different pricing/SLA considering resource availability
○ Reserved instances vs spot-market
● Private Clouds
○ Quotas are hard limits, which leads to reduced resource utilization
○ Preemptible instances
■ Projects that exhausted their quota can continue to create instances
● Opportunistic workloads
● Low SLA
Preemptible Instances
● Proposal to implement Preemptible Instances into OpenStack
○ Build a prototype
○ Minimise changes required in OpenStack nova
● Starting simple
○ Use dedicated projects for Preemptible Instances
■ Avoids tagging individual instances
○ Introduce a “Reaper” service
■ Orchestrator to manage preemptible instances
● Removes preemptible instances when resources are required for non-preemptible instances
● Applies a maximum TTL to preemptible instances
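The maximum-TTL rule could be sketched roughly as below. This is an illustrative assumption, not the Reaper prototype's code: the `Instance` class, the `expired_instances` helper, and the 24-hour limit are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Instance:
    uuid: str
    created_at: datetime


def expired_instances(instances, max_ttl, now):
    """Return the preemptible instances that have outlived max_ttl."""
    return [i for i in instances if now - i.created_at > max_ttl]


# Hypothetical data: one instance older than the limit, one younger.
now = datetime(2018, 5, 11, 12, 0)
instances = [
    Instance("a", now - timedelta(hours=30)),
    Instance("b", now - timedelta(hours=2)),
]
expired = expired_instances(instances, timedelta(hours=24), now)
print([i.uuid for i in expired])   # ['a']
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the check easy to test.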
Workflow
[Diagram: nova-api and nova-scheduler return “No Valid Host”; the instance is set into PENDING state; a “No Valid Host” notification reaches the notifications consumer of the Nova “Reaper” service; the Reaper selects preemptible instance(s) to delete, then 1) deletes the selected preemptible(s) and 2) rebuilds the instance, or resets the instance to ERROR state]
Workflow
● The creation of a non-preemptible VM fails because there aren’t resources available
● Instances that fail with “No Valid Host” go to “PENDING” state instead of “ERROR”
● The Reaper service is notified and it tries to free the requested resources
○ Rebuild the instance
○ Or change instance state to “ERROR”
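The workflow above can be approximated with a small in-memory model. This is only a sketch under stated assumptions: it does not call Nova, it ignores which host each instance runs on, and the `Instance` class, the largest-first selection policy, and the `DELETED` and `REBUILDING` state labels are hypothetical choices for illustration.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    uuid: str
    vcpus: int
    state: str = "ACTIVE"


def select_victims(preemptibles, needed_vcpus):
    """Greedily pick preemptible instances until enough vCPUs are freed.

    Returns an empty list if the request cannot be satisfied.
    """
    victims, freed = [], 0
    # Largest first, so fewer instances are deleted (an arbitrary policy).
    for inst in sorted(preemptibles, key=lambda i: i.vcpus, reverse=True):
        if freed >= needed_vcpus:
            break
        victims.append(inst)
        freed += inst.vcpus
    return victims if freed >= needed_vcpus else []


def handle_no_valid_host(pending, preemptibles):
    """React to a "No Valid Host" notification for a PENDING instance."""
    victims = select_victims(preemptibles, pending.vcpus)
    if not victims:
        pending.state = "ERROR"      # nothing can be freed
        return []
    for v in victims:
        v.state = "DELETED"          # 1) delete selected preemptible(s)
    pending.state = "REBUILDING"     # 2) rebuild the instance
    return victims


pool = [Instance("p1", 2), Instance("p2", 4), Instance("p3", 1)]
vm = Instance("vm", 5, state="PENDING")
deleted = handle_no_valid_host(vm, pool)
print([d.uuid for d in deleted], vm.state)   # ['p2', 'p1'] REBUILDING
```

A real service would have to free resources on a host that can actually fit the pending instance; summing vCPUs across the whole pool is a simplification.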
Current work on Preemptible Instances
● Add instance state PENDING (spec)
○ https://review.openstack.org/#/c/554212/
● Allow rebuild instances in cell0 (spec)
○ https://review.openstack.org/#/c/554218/
● Add scheduling notification
○ https://review.openstack.org/#/c/566470/
● Implement instance state PENDING
○ https://review.openstack.org/#/c/566473/
● Reaper prototype:
○ https://gitlab.cern.ch/ttsiouts/ReaperServicePrototype
Join the Scientific SIG and...
Get involved!
https://www.openstack.org/science/