Multi-Cloud Federated Kubernetes at CERN
Clenimar Filemon
clenimar@lsd.ufcg.edu.br
Ricardo Rocha
ricardo.rocha@cern.ch
Founded in 1954
Fundamental Science
What is 96% of the universe made of?
Why isn’t there anti-matter in the universe?
What was the state of matter just after the Big Bang?
Collisions: ~40 MHz, ~1 PB/s (Huge Data)
L1 Trigger (Hardware Filter): ~100 kHz (Still Big)
HL Trigger (Software Filter): ~1 kHz (Still Big)
Raw Data: ~1-10 GB/s
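Taking the approximate rates on this slide at face value, the reduction achieved by each filter stage can be worked out directly. The sketch below is only back-of-the-envelope arithmetic; the stage labels follow the slide, and the function and variable names are illustrative.

```python
# Approximate event rates quoted on the slide, in Hz (order-of-magnitude figures).
STAGE_RATES_HZ = [
    ("collisions", 40_000_000),      # ~40 MHz
    ("after L1 trigger", 100_000),   # ~100 kHz, hardware filter
    ("after HL trigger", 1_000),     # ~1 kHz, software filter
]


def reduction_factors(stages):
    """Return (stage name, factor relative to the previous stage) pairs."""
    factors = []
    for (_, prev_rate), (name, rate) in zip(stages, stages[1:]):
        factors.append((name, prev_rate / rate))
    return factors


def overall_reduction(stages):
    """Ratio between the first and the last rate in the chain."""
    return stages[0][1] / stages[-1][1]


if __name__ == "__main__":
    for name, factor in reduction_factors(STAGE_RATES_HZ):
        print(f"{name}: rate reduced by a factor of ~{factor:,.0f}")
    print(f"overall: ~{overall_reduction(STAGE_RATES_HZ):,.0f}x fewer events kept")
```

With these figures the hardware filter keeps roughly 1 in 400 events and the software filter roughly 1 in 100 of those, about 1 in 40 000 overall.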
320 000 Cores
3 300 Users
4 300 Projects
10 000 Hypervisors
210 Kubernetes Clusters
250 Petabytes
Distributed Computing
200+ Sites
700 000 Cores
~400 000 Jobs
~30 GiB/s
CERN, T1, T2
Reconstruction, Calibration, Simulation, Analysis
CMS Higgs Event, May 2012
ATLAS Higgs Analysis, May 2012
Motivation for Federation
Periodic load spikes: International Conferences, Reconstruction Campaigns
Simplification: Monitoring, Lifecycle, Alarms
Deployment: Uniform API, Replication, Load Balancing
Use Cases: CERN Batch System, RECAST Analysis
Sched, Collector, Negotiator, StartD
Job ClassAd (request):
AcctGroup = "ATLAS"
JobPrio = 0
RequestCpus = 2
RequestMemory = 4260
...
Machine ClassAd (resource):
CERNEnvironment = "production"
Datacenter = "meyrin"
HasMPI = true
TotalCpus = 8
TotalMemory = 22500
...
Matchmaking with ClassAds
Fair Share
Preemption
Running Virtualized
Extensive Experience in HEP
External Storage and Networking
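The request and resource attributes shown on this slide can be used to illustrate the basic idea of matchmaking. What follows is a deliberately simplified sketch, not how HTCondor implements it: it only compares the requested CPUs and memory against what a machine advertises, and ignores Requirements and Rank expressions, fair share and preemption. The second machine and all function names are hypothetical.

```python
# Attributes copied from the slide; the elided ones ("...") are left out.
job_ad = {
    "AcctGroup": "ATLAS",
    "JobPrio": 0,
    "RequestCpus": 2,
    "RequestMemory": 4260,
}

machine_ad = {
    "CERNEnvironment": "production",
    "Datacenter": "meyrin",
    "HasMPI": True,
    "TotalCpus": 8,
    "TotalMemory": 22500,
}

# Hypothetical second machine, added only to show a failed match.
small_machine_ad = {"TotalCpus": 1, "TotalMemory": 2000}


def fits(job, machine):
    """True if the machine advertises at least the CPUs and memory requested."""
    return (
        job["RequestCpus"] <= machine["TotalCpus"]
        and job["RequestMemory"] <= machine["TotalMemory"]
    )


def match(job, machines):
    """Return the machines that can host the job, in their original order."""
    return [m for m in machines if fits(job, m)]


if __name__ == "__main__":
    candidates = match(job_ad, [small_machine_ad, machine_ad])
    print(f"{len(candidates)} of 2 machines can run the job")
```

In the real system the Negotiator performs this pairing using the ads gathered by the Collector, with considerably richer expressions than the two comparisons used here.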
Host: Sched, Negotiator, Collector
kubefed init fed --host-cluster-context=condor-host ...
kind: DaemonSet
...
hostNetwork: true
containers:
- name: condor-startd
  image: .../cloud/condor-startd
  command: ["/usr/sbin/condor_startd", "-f"]
  securityContext:
    privileged: true
  livenessProbe:
    exec:
      command:
      - condor_who
Host: Sched, Negotiator, Collector
Federated clusters: StartD ...
kubefed init fed --host-cluster-context=condor-host ...
kubefed join --context fed tsystems --host-cluster-context condor-host --cluster-context tsystems
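Since every additional cluster is joined with the same command, the invocation is easy to template. The sketch below only builds the argument list and does not run anything; it reuses the flags exactly as they appear on the slide and assumes, as in the tsystems example, that the kubeconfig context has the same name as the cluster. The cluster name "cluster-b" is hypothetical.

```python
import shlex

FEDERATION = "fed"            # federation name used on the slide
HOST_CONTEXT = "condor-host"  # host cluster context used on the slide


def kubefed_join_command(cluster, federation=FEDERATION, host_context=HOST_CONTEXT):
    """Build the `kubefed join` argument list for one cluster.

    Assumes the kubeconfig context is named after the cluster.
    """
    return [
        "kubefed", "join",
        "--context", federation, cluster,
        "--host-cluster-context", host_context,
        "--cluster-context", cluster,
    ]


if __name__ == "__main__":
    # "tsystems" comes from the slide; "cluster-b" is a made-up example.
    for cluster in ["tsystems", "cluster-b"]:
        print(shlex.join(kubefed_join_command(cluster)))
```

Keeping the command as a list rather than a single string means it can be handed to `subprocess.run` without shell quoting issues, should one choose to execute it.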
REANA / RECAST
Reusable Analysis Platform
Workflow Engine (Yadage)
Each step a Kubernetes Job
Integrated Monitoring & Logging
Centralized Log Collection
https://github.com/reanahub
https://github.com/recast-hep
https://github.com/diana-hep/yadage
https://www.youtube.com/watch?v=jNyd97LiTXk
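To make "each step a Kubernetes Job" concrete, the sketch below turns a list of workflow steps into `batch/v1` Job manifests. It is an illustration only and does not reflect how REANA or Yadage generate their Jobs; the workflow name, step names, images and commands are all hypothetical placeholders.

```python
import json


def step_to_job(workflow, step_name, image, command):
    """Build a minimal batch/v1 Job manifest for one workflow step."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": f"{workflow}-{step_name}"},
        "spec": {
            "backoffLimit": 0,  # do not retry a failed step automatically
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {"name": step_name, "image": image, "command": command}
                    ],
                }
            },
        },
    }


# Hypothetical two-step workflow; images and commands are placeholders.
STEPS = [
    ("generate", "example.org/analysis/generate", ["./generate.sh"]),
    ("fit", "example.org/analysis/fit", ["./fit.sh"]),
]


def workflow_to_jobs(workflow, steps):
    """One Job manifest per step, in workflow order."""
    return [step_to_job(workflow, name, image, cmd) for name, image, cmd in steps]


if __name__ == "__main__":
    for job in workflow_to_jobs("demo", STEPS):
        print(json.dumps(job, indent=2))
```

A workflow engine would additionally have to submit each Job only after the steps it depends on have completed, which this sketch leaves out.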
Thank You
Great Community, Amazing Tools
Credits
CERN OpenStack Cloud and Batch teams (Spyros Trigazis and all)
Lukas Heinrich, REANA / RECAST
Kelsey Hightower


Editor's Notes

  • #3: Dark energy + dark matter. Quark-gluon plasma, moments after the Big Bang.
  • #4: Location: the lake, the Alps, Mont Blanc, the Swiss-French border. Complex of accelerators, reaching higher and higher energy. 27 km circumference. Two beams of protons travelling in opposite directions, close to the speed of light.
  • #5: Almost 10 000 magnets. The beams are kept in the ring thanks to these superconducting magnets. Temperature kept at 1.9 K (-271 °C) to preserve superconducting properties.
  • #6: Sibling of ATLAS, with similar goals but a different design. 14 000 tons. 20 meters long, 15 x 15 meters.
  • #7: Anti-matter Decelerator, creates anti-atoms to better understand the properties of antimatter. AMS experiment, launched on mission STS-134 (the penultimate shuttle mission), measures antimatter in cosmic rays.
  • #9: 2 floors. Historical building, 50 years old, from mainframes to racks. Internet backbone, biggest in the late 80s and early 90s.
  • #10: Hierarchical system, few big T1s and many smaller T2s.
  • #11: Not a physicist, but learned to look for patterns in plots.
  • #13: Based on HTCondor, which HEP has decades of experience operating. Component description: requests and resources are published as ClassAds. Advanced functionality (fair share, preemption). Currently running mostly on virtualized resources. Important: HTCondor relies on an established storage and network infrastructure and handles compute only, which is the part we try to federate.
  • #14: StartD is our first containerization goal, deployed at scale.
  • #15: Host cluster, with the Condor control plane. One command only to establish the federation.
  • #16: StartD is deployed as a DaemonSet, meaning we get one instance on every host. Clusters are again added with a single command. CVMFS caching speeds up software distribution. Data access and networking are outside the scope of this exercise.