Tim Bell
CERN
@noggin143
OpenStack UK Days
26th September 2017
Understanding the Universe
through Clouds
Universe and Clouds - 26th September 2017 Tim.Bell@cern.ch 1
2
CERN: founded in 1954: 12 European States
“Science for Peace”
Today: 22 Member States
Member States: Austria, Belgium, Bulgaria, Czech Republic, Denmark, Finland,
France, Germany, Greece, Hungary, Israel, Italy, Netherlands, Norway, Poland,
Portugal, Romania, Slovak Republic, Spain, Sweden, Switzerland and
United Kingdom
Associate Member States: Pakistan, India, Ukraine, Turkey
States in accession to Membership: Cyprus, Serbia
Applications for Membership or Associate Membership:
Brazil, Croatia, Lithuania, Russia, Slovenia
Observers to Council: India, Japan, Russia, United States of America;
European Union, JINR and UNESCO
~ 2300 staff
~ 1400 other paid personnel
~ 12500 scientific users
Budget (2017) ~1000 MCHF
Universe and Clouds - 26th September 2017 Tim.Bell@cern.ch 2
The Large Hadron Collider (LHC)
Universe and Clouds - 26th September 2017 Tim.Bell@cern.ch 3
[Data rates shown on the LHC diagram: ~700 MB/s, ~10 GB/s, >1 GB/s, >1 GB/s]
Universe and Clouds - 26th September 2017 Tim.Bell@cern.ch 4
Universe and Clouds - 26th September 2017 Tim.Bell@cern.ch 5
6
Tier-0 (CERN and Hungary): data recording, reconstruction and distribution
Tier-1: permanent storage, re-processing, analysis
Tier-2: simulation, end-user analysis
> 2 million jobs/day
~750k CPU cores
600 PB of storage
~170 sites,
42 countries
10-100 Gb links
WLCG:
An international collaboration to distribute and analyse LHC data
Integrates computer centres worldwide that provide computing and storage
resources into a single infrastructure accessible by all LHC physicists
The Worldwide LHC Computing Grid
Universe and Clouds - 26th September 2017 Tim.Bell@cern.ch
Universe and Clouds - 26th September 2017 Tim.Bell@cern.ch 7
LHCOne: overlay network connecting Europe, North America, South America and Asia
Allows national network providers to manage HEP traffic on general purpose networks
[Chart: monthly traffic volume, January through May of the following year]
A big data problem
Tim.Bell@cern.ch 8
2016: 49.4 PB LHC data / 58 PB all experiments / 73 PB total
ALICE: 7.6 PB
ATLAS: 17.4 PB
CMS: 16.0 PB
LHCb: 8.5 PB
11 PB in July
180 PB on tape
800 M files
Universe and Clouds - 26th September 2017
Public Procurement Cycle
Step | Time (Days) | Elapsed (Days)
User expresses requirement | - | 0
Market Survey prepared | 15 | 15
Market Survey for possible vendors | 30 | 45
Specifications prepared | 15 | 60
Vendor responses | 30 | 90
Test systems evaluated | 30 | 120
Offers adjudicated | 10 | 130
Finance committee | 30 | 160
Hardware delivered | 90 | 250
Burn in and acceptance | 30 typical (380 worst case) | 280
Total | | 280+ days
Universe and Clouds - 26th September 2017 Tim.Bell@cern.ch 9
OpenStack London July 2011 Vinopolis
Universe and Clouds - 26th September 2017 Tim.Bell@cern.ch 10
CERN Tool Chain
Universe and Clouds - 26th September 2017 Tim.Bell@cern.ch 11
CERN OpenStack Service Timeline
(*) Pilot (?) Trial Retired

OpenStack releases and CERN service components:
• ESSEX (5 April 2012): Nova (*), Glance (*), Horizon (*), Keystone (*)
• FOLSOM (27 September 2012): Nova (*), Glance (*), Horizon (*), Keystone (*), Quantum, Cinder
• GRIZZLY (4 April 2013): Nova, Glance, Horizon, Keystone, Quantum, Cinder, Ceilometer (*)
• HAVANA (17 October 2013): Nova, Glance, Horizon, Keystone, Neutron, Cinder, Ceilometer (*), Heat
• ICEHOUSE (17 April 2014): Nova, Glance, Horizon, Keystone, Neutron, Cinder, Ceilometer, Heat, Ironic, Trove
• JUNO (16 October 2014): Nova, Glance, Horizon, Keystone, Neutron, Cinder, Ceilometer, Heat (*), Rally (*)
• KILO (30 April 2015): Nova, Glance, Horizon, Keystone, Neutron, Cinder, Ceilometer, Heat, Rally, Manila
• LIBERTY (15 October 2015): Nova, Glance, Horizon, Keystone, Neutron (*), Cinder, Ceilometer, Heat, Rally, EC2API, Magnum (*), Barbican (*)
• MITAKA (7 April 2016): Nova, Glance, Horizon, Keystone, Neutron, Cinder, Ceilometer, Heat, Rally, EC2API, Magnum, Barbican, Ironic (?), Mistral (?), Manila (?)
• NEWTON (6 October 2016): Nova, Glance, Horizon, Keystone, Neutron, Cinder, Ceilometer, Heat, Rally, EC2API, Magnum, Barbican, Ironic (?), Mistral (?), Manila (?)
• OCATA (22 February 2017): Nova, Glance, Horizon, Keystone, Neutron, Cinder, Ceilometer, Heat, Rally, EC2API, Magnum, Barbican, Ironic (?), Mistral (?), Manila (*)
• PIKE (28 August 2017)

CERN cloud milestones:
• July 2013: CERN OpenStack Production
• February 2014: CERN OpenStack Havana
• October 2014: CERN OpenStack Icehouse
• March 2015: CERN OpenStack Juno
• September 2015: CERN OpenStack Kilo
• September 2016: CERN OpenStack Liberty
• March 2017: CERN OpenStack Mitaka
• June 2017: CERN OpenStack Newton
Universe and Clouds - 26th September 2017 Tim.Bell@cern.ch 12
Universe and Clouds - 26th September 2017 Tim.Bell@cern.ch 13
Currently >8000 hypervisors, 281K cores running 33,000 VMs
• From ~200 TB total to ~450 TB of RBD + 50 TB RGW
OpenStack Glance + Cinder
Example: ~25 Puppet masters reading node configurations at up to 40k IOPS
Universe and Clouds - 26th September 2017 Tim.Bell@cern.ch 14
• Scale tests with Ceph Luminous up to 65 PB in a block storage pool
http://ceph.com/community/new-luminous-scalability/
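For a flavour of how this Ceph capacity is consumed through Glance and Cinder, a minimal sketch with the standard OpenStack CLI; the volume size, image and flavor names are placeholders, not CERN defaults:
# Hypothetical example: Ceph (RBD) backed volume and image usage
$ openstack volume create --size 100 data-vol            # thin-provisioned in the Ceph pool
$ openstack server create --image CentOS-7-x86_64 --flavor m2.medium rbd-demo
$ openstack server add volume rbd-demo data-vol          # attach the RBD volume to the VM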
Software Deployment
Tim.Bell@cern.ch 15
• Deployment based on CentOS and RDO
- Upstream, only patched where necessary (e.g. nova/neutron for CERN networks)
- A few customizations
- Works well for us
• Puppet for configuration management
- Introduced with the adoption of the Agile Infrastructure (AI) paradigm
• We submit upstream whenever possible
- openstack, openstack-puppet, RDO, …
• Updates done service-by-service over several months
- Running services on dedicated (virtual) servers helps (exception: ceilometer and nova on compute nodes)
- Aim to be around 6-9 months behind trunk
• Upgrade testing done with packstack and devstack (sketch below)
- Depends on service: from simple DB upgrades to full shadow installations
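As an illustration of the CentOS + RDO basis and the packstack-based upgrade testing, a rough sketch; the release name is chosen as an example, and production nodes are configured by Puppet, not packstack:
# Throwaway RDO test deployment on CentOS 7 (illustrative only)
$ yum install -y centos-release-openstack-newton   # enable the RDO repository for one release
$ yum install -y openstack-packstack
$ packstack --allinone                             # all-in-one installation used for upgrade tests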
Universe and Clouds - 26th September 2017
Community Experience
• Open source collaboration sets the model for in-house teams
• External recognition by the community is highly rewarding for contributors
• Reviewing and being reviewed is a constant learning experience
• Productive for staff in the job market
• Working groups, like the Scientific and Large Deployment teams, discuss a wide range of topics
• Effective knowledge transfer mechanisms, consistent with the CERN mission
• Dojos at CERN bring good attendance
- Ceph, CentOS, Elastic, OpenStack CH, …
Universe and Clouds - 26th September 2017 Tim.Bell@cern.ch 16
Scaling Nova
Top level cell
• Runs API service
• Top cell scheduler
~50 child cells run
• Compute nodes
• Scheduler
• Conductor
• Decided not to use HA
Cells Version 2 coming
• Default for all
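For illustration, a minimal sketch of how the cells v1 split appears in nova.conf; names and values are placeholders, not CERN's actual configuration:
# Illustrative cells v1 configuration fragment (assumed layout, not CERN's)
$ cat >> /etc/nova/nova.conf <<'EOF'
[cells]
enable = True
name = api-cell      # each child cell uses its own name
cell_type = api      # 'compute' on the child cells
EOF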
Universe and Clouds - 26th September 2017 Tim.Bell@cern.ch 17
Rally
Universe and Clouds - 26th September 2017 Tim.Bell@cern.ch 18
What’s new? Magnum
• Container Engine as a Service
• Kubernetes, Docker, Mesos…
$ magnum cluster-create --name myswarmcluster --cluster-template swarm --node-count 100
$ magnum cluster-list
+------+----------------+------------+--------------+-----------------+
| uuid | name | node_count | master_count | status |
+------+----------------+------------+--------------+-----------------+
| .... | myswarmcluster | 100 | 1 | CREATE_COMPLETE |
+------+----------------+------------+--------------+-----------------+
$ $(magnum cluster-config myswarmcluster --dir magnum/myswarmcluster)
$ docker info / ps / ...
$ docker run --volume-driver cvmfs -v atlas.cern.ch:/cvmfs/atlas -it centos /bin/bash
[root@32f4cf39128d /]#
Universe and Clouds - 26th September 2017 Tim.Bell@cern.ch 19
Scaling Magnum to 7M req/s
Rally drove the tests
1000 node clusters (4000 cores)
Cluster Size (Nodes) | Concurrency | Deployment Time (min)
2 | 50 | 2.5
16 | 10 | 4
32 | 10 | 4
128 | 5 | 5.5
512 | 1 | 14
1000 | 1 | 23
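The Rally side of such a run looks roughly like this; the task file name and its contents are hypothetical, not the actual benchmark definition behind these numbers:
# Sketch of driving the benchmark with Rally (task file is hypothetical)
$ rally deployment create --fromenv --name cern-cloud   # register the target cloud from OS_* variables
$ rally task start magnum-scale.yaml                    # scenarios and concurrency live in the task file
$ rally task report --out magnum-scale.html             # per-iteration timings as an HTML report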
Universe and Clouds - 26th September 2017 Tim.Bell@cern.ch 20
What’s new? Mistral
• Workflow-as-a-Service used for multi-step actions, triggered by users or events
• Horizon dashboard for visualising results
• Examples
- Multi-step project creation
- Scheduled snapshot of VMs
- Expire personal resources after 6 months
• Code at https://gitlab.cern.ch/cloud-infrastructure/mistral-workflows
• Some more complex cases coming in the pipeline
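For illustration, registering and triggering such workflows with the Mistral plugin for the OpenStack client looks roughly like this; the workflow and trigger names are hypothetical, not the actual ones in the repository above:
# Hypothetical Mistral CLI usage (names are placeholders)
$ openstack workflow create instance_snapshot.yaml
$ openstack workflow execution create instance_snapshot '{"instance": "my-vm"}'
$ openstack cron trigger create nightly-snapshot instance_snapshot '{"instance": "my-vm"}' --pattern "0 3 * * *"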
Universe and Clouds - 26th September 2017 Tim.Bell@cern.ch 21
Universe and Clouds - 26th September 2017 Tim.Bell@cern.ch 22
Automate provisioning
23
Automate routine procedures
- Common place for workflows
- Clean web interface
- Scheduled jobs, cron-style
- Traceability and auditing
- Fine-grained access control
- …
Procedures for
- OpenStack project creation
- OpenStack quota changes
- Notifications of VM owners
- Usage and health reports
- …
Example: "disable compute node" workflow
• Disable compute node: disable nova-service, switch alarms OFF, update Service-Now ticket
• Notifications: send e-mail to VM owners
• Other tasks: post new message broker, add remote AT job, save intervention details, send calendar invitation
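A sketch of how such a procedure could be expressed in the Mistral v2 DSL; the task structure and action names are illustrative assumptions, not CERN's actual workflow:
# Illustrative Mistral v2 workflow for the "disable compute node" procedure
$ cat > disable_compute_node.yaml <<'EOF'
version: '2.0'
disable_compute_node:
  type: direct
  input:
    - hostname
  tasks:
    disable_nova_service:
      # assumed name of the generated action wrapping novaclient services.disable
      action: nova.services_disable host=<% $.hostname %> binary=nova-compute
      on-success: notify_vm_owners
    notify_vm_owners:
      action: std.email    # standard Mistral action; addresses below are placeholders
      input:
        to_addrs: ['vm-owners@example.org']
        subject: 'Planned intervention on <% $.hostname %>'
        body: 'Your VMs run on a hypervisor scheduled for maintenance.'
        from_addr: 'cloud-team@example.org'
        smtp_server: 'smtp.example.org'
EOF
$ openstack workflow create disable_compute_node.yaml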
Universe and Clouds - 26th September 2017 Tim.Bell@cern.ch
Manila: Overview
24
• File Share Project in OpenStack
- Provisioning of shared file systems to VMs
- ‘Cinder for file shares’
• APIs for tenants to request shares
- Fulfilled by backend drivers
- Accessed from instances
• Support for variety of NAS protocols
- NFS, CIFS, MapR-FS, GlusterFS, CephFS, …
• Supports the notion of share types
- Map features to backends
[Diagram: user instances, Manila and the backend: 1. request share, 2. create share, 3. provide handle, 4. access share]
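The tenant-facing flow in the diagram maps onto a few CLI calls; the share type, size and client network below are placeholders and depend on the chosen backend:
# Hypothetical Manila usage (names and values are placeholders)
$ manila create NFS 10 --name myshare --share-type default   # steps 1-2: request and create a 10 GB share
$ manila show myshare                                        # step 3: the export location (handle) appears here
$ manila access-allow myshare ip 10.0.0.0/24                 # allow the client network
# step 4: inside the instance, mount the export path reported by 'manila show'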
Universe and Clouds - 26th September 2017 Tim.Bell@cern.ch
25
LHC Incident in April 2016
Universe and Clouds - 26th September 2017 Tim.Bell@cern.ch
Manila testing: #fouinehammer
26
[Diagram: 1 … 500 nodes running 1 … 10k pods drive several m-api instances, backed by the DB, RabbitMQ, m-sched and m-share with its backend driver]
Universe and Clouds - 26th September 2017 Tim.Bell@cern.ch
Commercial Clouds
Universe and Clouds - 26th September 2017 Tim.Bell@cern.ch 27
Development areas going forward
• Spot Market
• Cells V2
• Neutron scaling – no Cells equivalent yet
• Magnum rolling upgrades
• Collaborations with Industry
Universe and Clouds - 26th September 2017 Tim.Bell@cern.ch 28
Operations areas going forward
• Further automate migrations
- Around 5,000 VMs / year
- First campaign in 2016 needed some additional scripting, such as pausing very active VMs
- Newton live migration includes most use cases (illustrative commands below)
• Software Defined Networking
- Nova network to Neutron migration to be completed
- In addition to the flat network in use currently
- Introduce higher-level functions such as LBaaS
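Draining a hypervisor for such a migration campaign boils down to commands along these lines (host and instance names are placeholders):
# Illustrative hypervisor drain with live migration (names are placeholders)
$ nova service-disable compute-node-042 nova-compute   # stop new VMs from landing on the node
$ nova host-evacuate-live compute-node-042             # live-migrate all instances away
$ nova live-migration my-vm                            # or per instance, letting the scheduler pick a target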
Universe and Clouds - 26th September 2017 Tim.Bell@cern.ch 29
Future Challenges
[Chart: data estimates for the 1st year of HL-LHC (PB), raw and derived, per experiment (ALICE, ATLAS, CMS, LHCb)]
[Chart: CPU needs for the 1st year of HL-LHC (kHS06), per experiment (ALICE, ATLAS, CMS, LHCb)]
[Timeline 2010-2030?: First run, LS1, Second run, LS2, Third run, LS3, HL-LHC, … FCC?]
CPU:
• x60 from 2016
Data:
• Raw 2016: 50 PB → 2027: 600 PB
• Derived (1 copy): 2016: 80 PB → 2027: 900 PB
Universe and Clouds - 26th September 2017 Tim.Bell@cern.ch 30
• Raw data volume for LHC increases exponentially, and with it the processing and analysis load
• Technology improvements at ~20%/year will bring x6-10 in 10-11 years
• Estimates of resource needs at HL-LHC are x10 above what is realistic to expect from technology at reasonably constant cost
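As a rough sanity check on that growth figure (our own arithmetic, not from the slide), compounding 20-25% per year over ten years gives
\[ 1.20^{10} \approx 6.2, \qquad 1.25^{10} \approx 9.3, \]
which is where the x6-10 range comes from, roughly an order of magnitude short of the x60 CPU growth estimated above.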
Summary
• OpenStack has provided a strong base for scaling resources over the past 4 years without a significant increase in CERN staff
• Additional functionality on top of pure Infrastructure-as-a-Service is now coming to production
• Community and industry collaboration has been productive and inspirational for the CERN team
• Some big computing challenges up ahead…
Universe and Clouds - 26th September 2017 Tim.Bell@cern.ch 31
Further Information
Technical details on the CERN cloud at
http://openstack-in-production.blogspot.fr
Custom CERN code is at https://github.com/cernops
Scientific Working Group at
https://wiki.openstack.org/wiki/Scientific_working_group
Helix Nebula details at http://www.helix-nebula.eu/
http://cern.ch/IT ©CERN CC-BY-SA 4.0 Universe and Clouds - 19th June 2017 Tim.Bell@cern.ch 32
Backup
Universe and Clouds - 26th September 2017 Tim.Bell@cern.ch 34
WLCG MoU Signatures
2017:
- 63 MoU’s
- 167 sites; 42 countries
Partners
Contributors
Associates
Research
Universe and Clouds - 26th September 2017 Tim.Bell@cern.ch 35
Universe and Clouds - 26th September 2017 Tim.Bell@cern.ch 36
How do we monitor?
37
[Diagram: data sources (data centres, WLCG) → transport → processing (Kafka) → storage/search → data access]
Universe and Clouds - 26th September 2017 Tim.Bell@cern.ch
Tuning
38
• Many hypervisors are configured for compute optimisation
• CPU passthrough so the VM sees an identical CPU
• Extended Page Tables so memory page mapping is done in hardware
• Core pinning so the scheduler keeps guest cores on the underlying physical cores
• Huge pages to improve memory page cache utilisation
• Flavors are set to be NUMA aware
• Improvements of up to 20% in performance
• Impact is that the VMs cannot be live migrated, so service machines are not configured this way (flavor sketch below)
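The pinning, huge-page and NUMA settings map onto flavor extra specs plus a hypervisor-side CPU mode; a minimal sketch, where the flavor name and sizes are placeholders:
# Illustrative compute-optimised flavor (name and sizes are placeholders)
$ openstack flavor create --vcpus 8 --ram 16384 --disk 40 m2.tuned
$ openstack flavor set m2.tuned \
    --property hw:cpu_policy=dedicated \
    --property hw:numa_nodes=1 \
    --property hw:mem_page_size=large
# CPU passthrough is set on the hypervisor in nova.conf: [libvirt] cpu_mode = host-passthrough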
Universe and Clouds - 26th September 2017 Tim.Bell@cern.ch
Provisioning services
Moving towards an Elastic Hybrid IaaS model:
• In-house resources at full occupation
• Elastic use of commercial & public clouds
• Assume "spot-market" style pricing
[Diagram: OpenStack resource provisioning (>1 physical data centre) offering bare metal and HPC (LSF), containers and VMs; feeding HTCondor batch, public cloud and volunteer computing; consumed by IT & experiment services, end users and CI/CD through APIs, CLIs and GUIs, and by experiment pilot factories]
Universe and Clouds - 26th September 2017 Tim.Bell@cern.ch 39
Simulating Elasticity
• Deliveries are around 1-2 times per year
• Resources are for
- Batch compute … immediately needed … compute optimised
- Services … needed as projects request quota … support live migration with generic CPU definition
• Elasticity is simulated by
- Creating opportunistic batch projects running on resources available for services in the future
- Draining opportunistic batch as needed
• End result is
- High utilisation of 'spare' resources
- Simulation of an elastic cloud
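In OpenStack terms the opportunistic batch projects are just ordinary projects whose quota is grown and shrunk; a sketch, with the project name and quota values as placeholders:
# Illustrative opportunistic batch project (name and quota values are placeholders)
$ openstack project create --description "Opportunistic batch" batch-spare
$ openstack quota set --instances 2000 --cores 16000 --ram 32000000 batch-spare
# To "drain", shrink the quota and delete idle workers so the capacity
# becomes available again for service VM requests:
$ openstack quota set --instances 0 --cores 0 batch-spare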
Universe and Clouds - 26th September 2017 Tim.Bell@cern.ch 40
Pick the interesting events
• 40 million per second
- Fast, simple information
- Hardware trigger in a few microseconds
• 100 thousand per second
- Fast algorithms in a local computer farm
- Software trigger in <1 second
• Few 100 per second
- Recorded for study
41
[Event display: muon tracks and energy deposits]
Universe and Clouds - 26th September 2017 Tim.Bell@cern.ch
Editor's Notes
  • #4: Largest scientific apparatus ever built, 27 km around. 2 general-purpose detectors: huge microscopes – to explore the very small – using a long lever arm. 2 specialized detectors.
  • #5: Over 1,600 magnets lowered down shafts and cooled to -271 °C to become superconducting. Two beam pipes, with a vacuum 10 times emptier than on the Moon.
  • #10: However, CERN is a publicly funded body with strict purchasing rules to make sure that the contributions from our member states also flow back to them: our hardware purchases should be distributed to the countries in ratio to their contributions. So we have a public procurement cycle that takes 280 days in the best case… we define the specifications 6 months before we actually have the hardware available, and that is in the best case. Worst case, we find issues when the servers are delivered. We've had cases such as swapping out 7,000 disk drives, where you stop tracking by the drive and measure it by the pallet of disks. With these constraints, we needed to find an approach that allows us to be flexible for the physicists while still being compliant with the rules.
  • #13: We started looking at OpenStack in 2011, at an event in London at the Vinopolis, and started pilots. We are gradually expanding the functionality of the CERN cloud through the releases. We experiment with some new technology; some makes it to production within a release or so, while others, such as Ironic, we have a look at and then come back to a year or so later. The service catalog functions allow us to easily expose selected functions to early users.
  • #20: Should take around 10-15 minutes to execute the first command
  • #22: Should take around 10-15 minutes to execute the first command