Toward 10,000 Containers on OpenStack
Ricardo Rocha
Spyros Trigazis
(CERN)
Ton Ngo
Winnie Tsang
(IBM)
Talk outline
1. Introduction
2. Benchmarks
3. CERN Cloud results
4. CNCF Cloud results
5. Conclusion
• Acknowledgements:
• CERN cloud team
• CNCF Lab
• IBM team: Douglas Davis, Simeon Monov
• Rackspace team: Adrian Otto, Chris Hultin, Drago Rosson
• Many thanks to the Magnum team for all the progress
About OpenStack Magnum
• Mission: management service for container infrastructure
• Create / configure nodes (VM/baremetal), networking, storage
• Deep integration with OpenStack services
• Lifecycle operations on clusters
• Native container API
• Current support:
• Kubernetes
• Swarm
• Mesos
Newton and Upcoming Release
• Newton features:
• Cluster and driver refactoring
• Documentation: user guide, installation guide
• Baremetal: Kubernetes clusters
• Storage: Cinder volumes, Docker storage
• Networking: decoupled LBaaS, floating IPs, Flannel overlay network
• Distro: openSUSE
• Internal: asynchronous operations, certificate DB storage, notifications, rollback
• Upcoming release:
• Heterogeneous clusters
• Cluster upgrades
• Advanced container networking
• Additional drivers: DC/OS, further baremetal support
Benchmarks
Rally
An OpenStack benchmark test tool
• Easily extended with plugins
• Test results in HTML reports
• Used by many projects
• Context: set up the environment
• Scenario: run the benchmark
• Recommended for a production service
to verify that the service behaves as
expected at all times
[Diagram: Rally drives a Kubernetes cluster, creating pods and containers, and produces an HTML report]
Rally Plugin for Magnum
Scenarios for clusters:
• Create and list clusters (supports k8s, swarm and mesos)
• Create and list cluster templates
Scenarios for containers:
• Create and list pods (k8s)
• Create and list replication controllers (k8s)
• Create and list containers (swarm)
• Create and list apps (mesos)
Sample Rally input task files

---
MagnumClusters.create_and_list_clusters:
  -
    args:
      node_count: 4
    runner:
      type: "constant"
      times: 10
      concurrency: 2
    context:
      users:
        tenants: 1
        users_per_tenant: 1
      cluster_templates:
        image_id: "fedora-atomic-latest"
        external_network_id: "public"
        dns_nameserver: "8.8.8.8"
        flavor_id: "m1.small"
        docker_volume_size: 5
        network_driver: "flannel"
        coe: "kubernetes"

---
K8sPods.create_and_list_pods:
  -
    args:
      manifest: "artifacts/nginx.yaml.k8s"
    runner:
      type: "constant"
      times: 20
      concurrency: 2
    context:
      users:
        tenants: 1
        users_per_tenant: 1
      cluster_templates:
        image_id: "fedora-atomic-latest"
        external_network_id: "public"
        dns_nameserver: "8.8.8.8"
        flavor_id: "m1.small"
        docker_volume_size: 5
        network_driver: "flannel"
        coe: "kubernetes"
      clusters:
        node_count: 2
      ca_certs:
        directory: "/home/stack"
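These task files can be fed to Rally in the usual way; a minimal sketch, assuming Rally already has a deployment registered for the target cloud (the task file name is illustrative):

$ rally task start magnum-clusters.yaml
$ rally task report --out report.html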
Google/Kubernetes benchmark
Steady-state performance
in a large Kubernetes cluster
• Create a Kubernetes cluster with 800 vCPUs
(e.g. 200 nodes x 4 vCPUs)
• Requires a DNS service: SkyDNS for k8s <= 1.2,
embedded in newer releases
• Launch nginx pods serving millions of
HTTP requests per second
• The load bots and the service pods can be
scaled as needed (see the sketch below)
• Google has published the configuration and
result data, so we can compare with their results
[Diagram: a load driver generates millions of requests/sec against nginx pods running in the Kubernetes cluster]
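As a rough sketch of how the load is dialed up during a run: the serving pods and the load generators are separate replication controllers that can be resized independently with kubectl. The RC names below ("nginx", "loadbots") are assumptions based on the published Google configuration, not taken from this deck.

# Resize the serving pods and the load generators independently
$ kubectl scale rc nginx --replicas=300
$ kubectl scale rc loadbots --replicas=300
# Check how the pods are spread across the cluster
$ kubectl get pods -o wide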
CERN Cloud Results
CERN OpenStack Infrastructure
In production since 2013
~190,000 cores, ~4 million VMs created, ~200 VMs created / hour
CERN Container Use Cases
• Batch processing
• End user analysis / Jupyter Notebooks
• Machine Learning / TensorFlow / Keras
• Infrastructure Services
• Data Movement, Web Servers, PaaS, ...
• Continuous Integration / Deployment
• And many others...
CERN Magnum Deployment
• Integrate containers in the CERN cloud
• Shared identity, networking integration, storage access, ...
• Agnostic to container orchestration engines
• Docker Swarm, Kubernetes, Mesos
• Fast, easy to use
[Timeline: container investigations and Magnum tests from 11/2015; pilot service deployed 02/2016; Mesos support and production service 10/2016; ongoing CERN / HEP service integration (networking, CVMFS, EOS) and upstream development]
CERN Magnum Deployment
• Clusters are described by cluster templates
• Shared/public templates for most common setups,
customizable by users
$ magnum cluster-template-list
+------+---------------------------+
| uuid | name |
+------+---------------------------+
| .... | swarm |
| .... | swarm-ha |
| .... | kubernetes |
| .... | kubernetes-ha |
| .... | mesos |
| .... | mesos-ha |
+------+---------------------------+
CERN Magnum Deployment
• Clusters are described by cluster templates
• Shared/public templates for most common setups,
customizable by users
$ magnum cluster-create --name myswarmcluster --cluster-template swarm --node-count 100
$ magnum cluster-list
+------+----------------+------------+--------------+-----------------+
| uuid | name | node_count | master_count | status |
+------+----------------+------------+--------------+-----------------+
| .... | myswarmcluster | 100 | 1 | CREATE_COMPLETE |
+------+----------------+------------+--------------+-----------------+
$ $(magnum cluster-config myswarmcluster --dir magnum/myswarmcluster)
$ docker info / ps / ...
$ docker run --volume-driver cvmfs -v atlas.cern.ch:/cvmfs/atlas -it centos /bin/bash
[root@32f4cf39128d /]#
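The workflow for Kubernetes templates is analogous; a minimal sketch (the cluster name is illustrative, and the exact environment exported by cluster-config depends on the COE):

$ magnum cluster-create --name myk8scluster --cluster-template kubernetes --node-count 100
$ $(magnum cluster-config myk8scluster --dir magnum/myk8scluster)
$ kubectl get nodes
$ kubectl run nginx --image=nginx --replicas=4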
CERN Benchmark Setup
• Setup in one dedicated cell
• 240 hypervisors
• Each 32 cores, 64 GB RAM, 10Gb links
• Container images stored in Cinder volumes, in our Ceph cluster
• Default today in Magnum
• Deployed / configured using Puppet (as all our production setup)
• Magnum / Heat setup
• Dedicated controller(s), in VMs
• Dedicated RabbitMQ, clustered, in VMs
• Dropped explicit Neutron resource creation
• Floating IPs, ports, private networks, LBaaS
CERN Results
• Several iterations before arriving at a reliable setup
• First run: 2 million requests / sec
• Bay of 200 nodes (400 cores, 800 GB RAM)
[Graphs: first tests with ~100/200 node bays; large tests with up to 1000 node bays]
CERN Results
• Services coped with the request increase
• x4 in Nova, x8 in Cinder, no change in Keystone
• Almost business as usual... though
• Keystone stores a revocation tree (memcache)
• Populated on every project/user/trustee creation
• And checked on every token validation
• -> Network traffic concentrated on one cache node (shard)
• -> >12 seconds average request time vs the usual
average of 3 ms
[Graphs: first tests (~100/200 node bays) vs large tests (up to 1000 node bays)]
CERN Results
• Second run: Rally and 7 million requests / sec
• Lots of iterations! Example fixes: scale the Magnum
conductor, deploy Barbican
CERN Results
• Second go: Rally and 7 million requests / sec
• Kubernetes: 7 million requests / sec
• 1000 node clusters
(4000 cores, 8000 GB RAM)
Cluster Size (Nodes) | Concurrency | Deployment Time (min)
                   2 |          50 | 2.5
                  16 |          10 | 4
                  32 |          10 | 4
                 128 |           5 | 5.5
                 512 |           1 | 14
                1000 |           1 | 23
CERN Tuning
• Heat
• Timeouts when contacting RabbitMQ
• Large stack deletion sometimes needs multiple tries
• Magnum
• 'Too many files opened'
• 503s: scale the conductor
• RabbitMQ instabilities
• Flannel network config
• Keystone
• Revocation tree can cause scalability issues
Applied settings:
ulimit -n 4096
max_stacks_per_tenant: 10000 (was 100)
max_template_size: 5242880 (10x previous)
max_nested_stack_depth: 10 (was 5)
engine_life_check_timeout: 10 (was 2)
rpc_poll_timeout: 600 (was 1)
rpc_response_timeout: 600 (was 60)
rpc_queue_expiration: 600 (was 60)
disabled memcache
deployed Barbican
downgraded RabbitMQ to 3.3.5
--labels flannel_network_cidr=10.0.0.0/8,
         flannel_network_subnetlen=22,
         flannel_backend=vxlan
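For reference, a sketch of where some of the Heat-side values above might live in heat.conf; treating them all as [DEFAULT] options is an assumption here, and the exact option groups can vary by release:

# heat.conf
[DEFAULT]
# allow very large, deeply nested stacks (Magnum clusters are nested stacks)
max_stacks_per_tenant = 10000
max_template_size = 5242880
max_nested_stack_depth = 10
# tolerate slow RPC round-trips under load
engine_life_check_timeout = 10
rpc_response_timeout = 600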
CERN Tuning (continued)
• Cinder
• Slow deletion triggering Heat stack deletion timeouts
• Heat engine issues (too many retries, timeouts)
• Make Cinder optional? Lots of traffic with high load apps!
• Heat stack deployment scaling linearly
• For large stacks > 128 nodes
• Summary of a 1000 node cluster: 1003 stacks, 22000 resources, 47000 events
• That's ~70000 records in the Heat DB for one stack
• Heat: Performance Scalability Improvements - Thu 27th 11:50 am
• Flannel backend tests
• udp: ~450 Mbit/s, vxlan: ~920 Mbit/s, host-gw: ~950 Mbit/s
• Change the default? We set vxlan at CERN right now
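The Flannel backend is selected per cluster template through labels, as in the vxlan example on the previous slide; a minimal sketch for host-gw (the elided flags stand for whatever other template options are in use):

$ magnum cluster-template-create ... \
    --network-driver flannel \
    --labels flannel_backend=host-gw,flannel_network_subnetlen=22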
CNCF Cloud Results
CNCF Benchmark Setup
• Granted access 1 month ago; built with OpenStack-Ansible
on the Newton release
• Ongoing scalability study for Magnum, Heat and the COEs
• Hardware configuration:
• 2x Intel E5-2680v3 12-core
• 128 GB RAM
• 2x Intel S3610 400 GB SSD
• 10x Intel 2 TB NL-SAS HDD
• 1x QP Intel X710
• Cinder configured with the LVM driver,
disabled later
• Neutron configured with Linux bridge
[Diagram: HAProxy in front of 5 controllers, 3 Neutron controllers and 90 computes]
CNCF results
Two rounds of tests:
• 35 node cluster with one master, 24 cores
and 120 GB of RAM per node (840 cores)
• 80 node cluster with one master, 24 cores
and 120 GB of RAM per node (1920 cores)
Flannel backend configuration (host-gw or
udp) vs vxlan at CERN
Nodes | Containers | Reqs/sec | Latency | Flannel
   35 |       1100 |       1M | 83.2 ms | udp
   80 |       1100 |       1M | 1.33 ms | host-gw
   80 |       3100 |       3M | 26.1 ms | host-gw
Rally data at CNCF

Cluster creation
Cluster Size (Nodes) | Concurrency | Number of Clusters | Deployment Time (min)
                   2 |          10 |                100 | 3.02
                   2 |          10 |               1000 | able to create 219 clusters
                  32 |           5 |                100 | able to create 28 clusters
                 512 |           1 |                  1 | *
                4000 |           1 |                  1 | *

Container creation
COE   | Cluster Size (Nodes) | Concurrency | Number of Containers | Deployment Time (sec)
K8S   |                    2 |           4 |                    8 | 2.3
Swarm |                    2 |           4 |                    8 | 6.2
Mesos |                    2 |           4 |                    8 | 122.0
Tuning at CNCF
• Apply the same improvements discovered at CERN
• Heat tuning
• Cinder decoupling
• Disabled floating IPs to create many large clusters
concurrently
• But we need floating IPs for the master node or the load balancer
• Still working on tuning RabbitMQ, adding separate clusters for
each service (like at CERN)
• Consider this option in OpenStack-Ansible for large deployments
• Using the database for certificates didn't impact the overall
performance:
• Reasonable alternative to Barbican
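A sketch of the corresponding Magnum setting, assuming the Newton x509keypair certificate manager that keeps cluster certificates in the Magnum database (check the Magnum configuration reference for your release):

# magnum.conf
[certificates]
# store cluster TLS certificates in the Magnum DB instead of Barbican
cert_manager_type = x509keypair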
Conclusion
Conclusions
• Scalability:
• Deploy clusters
• Deploy containers
• Steady state: application load
• Good:
• Nova and Neutron were solid
• Once the infrastructure is in place, we can match the performance published by Google
• Magnum itself is not a bottleneck: many tuning knobs for building complex clusters
• Needs work:
• Really an OpenStack scaling and stability problem
• Linear scaling in Heat and Keystone (when creating a large number of clusters and
using UUID tokens, token validation in Keystone becomes too slow)
• Did we hit 10,000 containers?
• YES
Best practices
How to avoid the bottlenecks for now
• Tune your OpenStack
• RabbitMQ, Heat
• Consider trade-offs when deploying clusters:
• Local storage or Cinder volumes
• Fewer larger nodes or more smaller nodes
• Floating IP per node or not
• Load balancer
• Networking: udp, host-gw
Next steps
• Rerun tests focusing on cluster lifecycle operations
• Rolling upgrades, node retirement / replacement, ...
• Summarize best practices in the Magnum documentation
• Run similar application scaling tests for other COEs
• Swarm 3K, Mesos 50,000 containers in real time
• Decouple Cinder for container storage
• Bugs:
• Floating IP handling, client, state synchronization with Heat
• Long term issue:
• Developers use devstack
• How can we discover bottlenecks and scaling problems in a systematic way?
Thank You
Ricardo Rocha
ricardo.rocha@cern.ch
Spyros Trigazis
spyridon.trigazis@cern.ch
@strigazi
Ton Ngo
ton@us.ibm.com
@tango245
Winnie Tsang
wtsang@us.ibm.com
