Toward 10,000 Containers on OpenStack
Ricardo Rocha
Spyros Trigazis
(CERN)
Ton Ngo
Winnie Tsang
(IBM)
Talk outline
1. Introduction
2. Benchmarks
3. CERN Cloud results
4. CNCF Cloud results
5. Conclusion
• Acknowledgements:
• CERN cloud team
• CNCF Lab
• IBM team: Douglas Davis, Simeon Monov
• Rackspace team: Adrian Otto, Chris Hultin, Drago Rosson
• Many thanks to the Magnum team for all the progress
About OpenStack Magnum
• Mission: management service for container infrastructure
• Create / configure nodes (VM/baremetal), networking, storage
• Deep integration with OpenStack services
• Lifecycle operations on clusters
• Native container API
• Current support:
• Kubernetes
• Swarm
• Mesos
Newton and Upcoming Release
• Newton features:
• Cluster and driver refactoring
• Documentation: user guide, installation guide
• Baremetal: Kubernetes clusters
• Storage: Cinder volumes, Docker storage
• Networking: decoupled LBaaS, floating IPs, Flannel overlay network
• Distro: openSUSE
• Internal: asynchronous operations, certificate DB storage, notifications, rollback
• Upcoming release:
• Heterogeneous clusters
• Cluster upgrades
• Advanced container networking
• Additional drivers: DC/OS, further baremetal support
Benchmarks
Rally
An OpenStack benchmark test tool
• Easily extended with plugins
• Test results in HTML reports
• Used by many projects
• Context: set up the environment
• Scenario: run the benchmark
• Recommended for a production service
to verify that the service behaves as
expected at all times
[Diagram: Rally drives a Kubernetes cluster, creating pods and containers, and produces an HTML report]
Rally Plugin for Magnum
Scenarios for clusters:
• Create and list clusters (supports k8s, swarm and mesos)
• Create and list cluster templates
Scenarios for containers:
• Create and list pods (k8s)
• Create and list replication controllers (k8s)
• Create and list containers (swarm)
• Create and list apps (mesos)
Sample Rally input task files

---
MagnumClusters.create_and_list_clusters:
  -
    args:
      node_count: 4
    runner:
      type: "constant"
      times: 10
      concurrency: 2
    context:
      users:
        tenants: 1
        users_per_tenant: 1
      cluster_templates:
        image_id: "fedora-atomic-latest"
        external_network_id: "public"
        dns_nameserver: "8.8.8.8"
        flavor_id: "m1.small"
        docker_volume_size: 5
        network_driver: "flannel"
        coe: "kubernetes"

---
K8sPods.create_and_list_pods:
  -
    args:
      manifest: "artifacts/nginx.yaml.k8s"
    runner:
      type: "constant"
      times: 20
      concurrency: 2
    context:
      users:
        tenants: 1
        users_per_tenant: 1
      cluster_templates:
        image_id: "fedora-atomic-latest"
        external_network_id: "public"
        dns_nameserver: "8.8.8.8"
        flavor_id: "m1.small"
        docker_volume_size: 5
        network_driver: "flannel"
        coe: "kubernetes"
      clusters:
        node_count: 2
      ca_certs:
        directory: "/home/stack"
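These task files can be fed to Rally in the usual way; a minimal sketch, assuming Rally already has a deployment registered for the target cloud (the task file name is illustrative):

$ rally task start magnum-clusters.yaml
$ rally task report --out report.html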
Google/Kubernetes benchmark
Steady-state performance
in a large Kubernetes cluster
• Create a Kubernetes cluster with 800 vCPUs
(e.g. 200 nodes x 4 vCPUs)
• Requires a DNS service: SkyDNS for k8s <= 1.2,
embedded in newer releases
• Launch nginx pods serving millions of
HTTP requests per second
• The load bots and the service pods can be
scaled as needed (see the sketch below)
• Google has published the configuration and
result data, so we can compare with their results
[Diagram: a load driver generates millions of requests/sec against nginx pods running in the Kubernetes cluster]
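As a rough sketch of how the load is dialed up during a run: the serving pods and the load generators are separate replication controllers that can be resized independently with kubectl. The RC names below ("nginx", "loadbots") are assumptions based on the published Google configuration, not taken from this deck.

# Resize the serving pods and the load generators independently
$ kubectl scale rc nginx --replicas=300
$ kubectl scale rc loadbots --replicas=300
# Check how the pods are spread across the cluster
$ kubectl get pods -o wide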
CERN Cloud Results
CERN OpenStack Infrastructure
In production since 2013
~190,000 cores, ~4 million VMs created, ~200 VMs created / hour
CERN Container Use Cases
• Batch processing
• End user analysis / Jupyter Notebooks
• Machine Learning / TensorFlow / Keras
• Infrastructure Services
• Data Movement, Web Servers, PaaS, ...
• Continuous Integration / Deployment
• And many others...
CERN Magnum Deployment
• Integrate containers in the CERN cloud
• Shared identity, networking integration, storage access, ...
• Agnostic to container orchestration engines
• Docker Swarm, Kubernetes, Mesos
• Fast, easy to use
[Timeline: container investigations and Magnum tests from 11/2015; pilot service deployed 02/2016; Mesos support and production service 10/2016; ongoing CERN / HEP service integration (networking, CVMFS, EOS) and upstream development]
CERN Magnum Deployment
• Clusters are described by cluster templates
• Shared/public templates for most common setups,
customizable by users
$ magnum cluster-template-list
+------+---------------------------+
| uuid | name |
+------+---------------------------+
| .... | swarm |
| .... | swarm-ha |
| .... | kubernetes |
| .... | kubernetes-ha |
| .... | mesos |
| .... | mesos-ha |
+------+---------------------------+
CERN Magnum Deployment
• Clusters are described by cluster templates
• Shared/public templates for most common setups,
customizable by users
$ magnum cluster-create --name myswarmcluster --cluster-template swarm --node-count 100
$ magnum cluster-list
+------+----------------+------------+--------------+-----------------+
| uuid | name | node_count | master_count | status |
+------+----------------+------------+--------------+-----------------+
| .... | myswarmcluster | 100 | 1 | CREATE_COMPLETE |
+------+----------------+------------+--------------+-----------------+
$ $(magnum cluster-config myswarmcluster --dir magnum/myswarmcluster)
$ docker info / ps / ...
$ docker run --volume-driver cvmfs -v atlas.cern.ch:/cvmfs/atlas -it centos /bin/bash
[root@32f4cf39128d /]#
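The workflow for Kubernetes templates is analogous; a minimal sketch (the cluster name is illustrative, and the exact environment exported by cluster-config depends on the COE):

$ magnum cluster-create --name myk8scluster --cluster-template kubernetes --node-count 100
$ $(magnum cluster-config myk8scluster --dir magnum/myk8scluster)
$ kubectl get nodes
$ kubectl run nginx --image=nginx --replicas=4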
CERN Benchmark Setup
• Setup in one dedicated cell
• 240 hypervisors
• Each 32 cores, 64 GB RAM, 10Gb links
• Container images stored in Cinder volumes, in our Ceph cluster
• Default today in Magnum
• Deployed / configured using Puppet (as all our production setup)
• Magnum / Heat setup
• Dedicated controller(s), in VMs
• Dedicated RabbitMQ, clustered, in VMs
• Dropped explicit Neutron resource creation
• Floating IPs, ports, private networks, LBaaS
CERN Results
• Several iterations before arriving at a reliable setup
• First run: 2 million requests / sec
• Bay of 200 nodes (400 cores, 800 GB RAM)
[Graphs: first tests with ~100/200 node bays; large tests with up to 1000 node bays]
CERN Results
• Services coped with the request increase
• x4 in Nova, x8 in Cinder, no change in Keystone
• Almost business as usual... though
• Keystone stores a revocation tree (memcache)
• Populated on every project/user/trustee creation
• And checked on every token validation
• -> Network traffic concentrated on one cache node (shard)
• -> >12 seconds average request time vs the usual
average of 3 ms
[Graphs: first tests (~100/200 node bays) vs large tests (up to 1000 node bays)]
CERN Results
• Second run: Rally and 7 million requests / sec
• Lots of iterations! Example fixes: scale the Magnum
conductor, deploy Barbican
CERN Results
• Second go: Rally and 7 million requests / sec
• Kubernetes: 7 million requests / sec
• 1000 node clusters
(4000 cores, 8000 GB RAM)
Cluster Size (Nodes) | Concurrency | Deployment Time (min)
                   2 |          50 | 2.5
                  16 |          10 | 4
                  32 |          10 | 4
                 128 |           5 | 5.5
                 512 |           1 | 14
                1000 |           1 | 23
CERN Tuning
• Heat
• Timeouts when contacting RabbitMQ
• Large stack deletion sometimes needs multiple tries
• Magnum
• 'Too many files opened'
• 503s: scale the conductor
• RabbitMQ instabilities
• Flannel network config
• Keystone
• Revocation tree can cause scalability issues
Applied settings:
ulimit -n 4096
max_stacks_per_tenant: 10000 (was 100)
max_template_size: 5242880 (10x previous)
max_nested_stack_depth: 10 (was 5)
engine_life_check_timeout: 10 (was 2)
rpc_poll_timeout: 600 (was 1)
rpc_response_timeout: 600 (was 60)
rpc_queue_expiration: 600 (was 60)
disabled memcache
deployed Barbican
downgraded RabbitMQ to 3.3.5
--labels flannel_network_cidr=10.0.0.0/8,
         flannel_network_subnetlen=22,
         flannel_backend=vxlan
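For reference, a sketch of where some of the Heat-side values above might live in heat.conf; treating them all as [DEFAULT] options is an assumption here, and the exact option groups can vary by release:

# heat.conf
[DEFAULT]
# allow very large, deeply nested stacks (Magnum clusters are nested stacks)
max_stacks_per_tenant = 10000
max_template_size = 5242880
max_nested_stack_depth = 10
# tolerate slow RPC round-trips under load
engine_life_check_timeout = 10
rpc_response_timeout = 600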
CERN Tuning (continued)
• Cinder
• Slow deletion triggering Heat stack deletion timeouts
• Heat engine issues (too many retries, timeouts)
• Make Cinder optional? Lots of traffic with high load apps!
• Heat stack deployment scaling linearly
• For large stacks > 128 nodes
• Summary of a 1000 node cluster: 1003 stacks, 22000 resources, 47000 events
• That's ~70000 records in the Heat DB for one stack
• Heat: Performance Scalability Improvements - Thu 27th 11:50 am
• Flannel backend tests
• udp: ~450 Mbit/s, vxlan: ~920 Mbit/s, host-gw: ~950 Mbit/s
• Change the default? We set vxlan at CERN right now
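The Flannel backend is selected per cluster template through labels, as in the vxlan example on the previous slide; a minimal sketch for host-gw (the elided flags stand for whatever other template options are in use):

$ magnum cluster-template-create ... \
    --network-driver flannel \
    --labels flannel_backend=host-gw,flannel_network_subnetlen=22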
CNCF Cloud Results
CNCF Benchmark Setup
• Granted access 1 month ago; built with OpenStack-Ansible
on the Newton release
• Ongoing scalability study for Magnum, Heat and the COEs
• Hardware configuration:
• 2x Intel E5-2680v3 12-core
• 128 GB RAM
• 2x Intel S3610 400 GB SSD
• 10x Intel 2 TB NL-SAS HDD
• 1x QP Intel X710
• Cinder configured with the LVM driver,
disabled later
• Neutron configured with Linux bridge
[Diagram: HAProxy in front of 5 controllers, 3 Neutron controllers and 90 computes]
CNCF results
Two rounds of tests:
• 35 node cluster with one master, 24 cores
and 120 GB of RAM per node (840 cores)
• 80 node cluster with one master, 24 cores
and 120 GB of RAM per node (1920 cores)
Flannel backend configuration (host-gw or
udp) vs vxlan at CERN
Nodes | Containers | Reqs/sec | Latency | Flannel
   35 |       1100 |       1M | 83.2 ms | udp
   80 |       1100 |       1M | 1.33 ms | host-gw
   80 |       3100 |       3M | 26.1 ms | host-gw
Rally data at CNCF

Cluster creation
Cluster Size (Nodes) | Concurrency | Number of Clusters | Deployment Time (min)
                   2 |          10 |                100 | 3.02
                   2 |          10 |               1000 | able to create 219 clusters
                  32 |           5 |                100 | able to create 28 clusters
                 512 |           1 |                  1 | *
                4000 |           1 |                  1 | *

Container creation
COE   | Cluster Size (Nodes) | Concurrency | Number of Containers | Deployment Time (sec)
K8S   |                    2 |           4 |                    8 | 2.3
Swarm |                    2 |           4 |                    8 | 6.2
Mesos |                    2 |           4 |                    8 | 122.0
Tuning at CNCF
• Apply the same improvements discovered at CERN
• Heat tuning
• Cinder decoupling
• Disabled floating IPs to create many large clusters
concurrently
• But we need floating IPs for the master node or the load balancer
• Still working on tuning RabbitMQ, adding separate clusters for
each service (like at CERN)
• Consider this option in OpenStack-Ansible for large deployments
• Using the database for certificates didn't impact the overall
performance:
• Reasonable alternative to Barbican
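A sketch of the corresponding Magnum setting, assuming the Newton x509keypair certificate manager that keeps cluster certificates in the Magnum database (check the Magnum configuration reference for your release):

# magnum.conf
[certificates]
# store cluster TLS certificates in the Magnum DB instead of Barbican
cert_manager_type = x509keypair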
Conclusion
Conclusions
• Scalability:
• Deploy clusters
• Deploy containers
• Steady state: application load
• Good:
• Nova and Neutron were solid
• Once the infrastructure is in place, we can match the performance published by Google
• Magnum itself is not a bottleneck: many tuning knobs for building complex clusters
• Needs work:
• Really an OpenStack scaling and stability problem
• Linear scaling in Heat and Keystone (when creating a large number of clusters and
using UUID tokens, token validation in Keystone becomes too slow)
• Did we hit 10,000 containers?
• YES
Best practices
How to avoid the bottlenecks for now
• Tune your OpenStack
• RabbitMQ, Heat
• Consider trade-offs when deploying clusters:
• Local storage or Cinder volumes
• Fewer larger nodes or more smaller nodes
• Floating IP per node or not
• Load balancer
• Networking: udp, host-gw
Next steps
• Rerun tests focusing on cluster lifecycle operations
• Rolling upgrades, node retirement / replacement, ...
• Summarize best practices in the Magnum documentation
• Run similar application scaling tests for other COEs
• Swarm 3K, Mesos 50,000 containers in real time
• Decouple Cinder for container storage
• Bugs:
• Floating IP handling, client, state synchronization with Heat
• Long term issue:
• Developers use devstack
• How can we discover bottlenecks and scaling problems in a systematic way?
Thank You
Ricardo Rocha
ricardo.rocha@cern.ch
Spyros Trigazis
spyridon.trigazis@cern.ch
@strigazi
Ton Ngo
ton@us.ibm.com
@tango245
Winnie Tsang
wtsang@us.ibm.com
