To Build My Own Cloud with Blackjack…
Sergey Dzyuban
SBTech
LAMP, Windows admin
2003 2007 2010 2014 2016
Web Developer, .NET Developer, Technical Account Manager, Team Lead, DevOps Tech Lead
Sergey Dzyuban
DevOps Tech Lead at SBTech
R&D offices
Datacenters
1000+
2500
700
200
(1K+ instances)
300 CPU/3Tb RAM 28 nodes
2 Tb
100K+
3
Windows and Data Centers
Platform as a Service Unit – place in SBTech geography
Infrastructure Team efforts
Platform as a Service Team
How a resource request works
DEV, IT, Infra Team
1. The Dev Team asks for a separate Kafka instance.
2. IT is asked to provide a Linux VM and configure access.
3. The Infrastructure Team is asked to set up Kafka, add monitoring, and add a health check.
4. Monitoring is configured, and access is configured for the requester.
Maintain infrastructure manually - issues
• The number of requests keeps increasing.
• After some time, services and machines go offline, die, or end up in a broken state.
• Some services require continuous maintenance.
Diagram: several DEV teams, one Infra Team
PaaS for Microservices
Building a hosting platform for microservices orchestration
Who are you, Mr. Microservice?
HashiCorp Vault is good enough for securing configuration settings, but it also requires some initial infrastructure setup.
Mr. Microservice: Consul, Fabio-lb, logs, metrics, trace, cache, state, events, discovery, config
Who are you, Mr. Microservice?
But sometimes it's something specific, even for experienced DevOps engineers.
So what is a Dev ENV?
EXPECTATION vs. REALITY
Maintain infrastructure manually - IaC (Infrastructure as Code)
• Deploying each component requires its own deployment activities and scripts.
• It does not solve the problem of health and status checks.
Infra Team
Maintain infrastructure manually - Docker
• Docker makes it possible to standardize deployment.
• A Docker image for each component needs to be prepared and preconfigured for our needs.
Infra Team
Maintain infrastructure manually - Docker
… lots of SSH connections
Infra Team
Maintain infrastructure manually
Manual resource assignment is complex and requires an operational workflow to be followed.
Maintain infrastructure manually - Cloud
• Cloud engineers have plenty of good examples of what component deployment automation should look like.
• The next step was to provide a simple user experience that keeps this process as simple as possible (KISS).
Cloud Management Experience
• Browser based
• Clear configuration and deployment
• Simple scaling
• Built-in monitoring
• Services catalog
• Self-documenting
It’s all about Jenkins
The short story of how we started to use DC/OS
Jenkins for CI/CD
• ~ 700 repositories
• ~ 500 builds per day
• ~ 1 build per minute
• ~ 40 Windows slaves
Jenkins for CI/CD
It requires some additional interfaces to provide and process information for developers and build engineers.
Jenkins for CI/CD
• Elastic + Kibana
• Go APIs
• Angular SPAs
• MySQL and Redis
• Zabbix for monitoring Jenkins slaves
• Sonar
Jenkins for CI/CD
And here comes the moment when you realize that having many pets in your datacenter may not be a good idea.
Jenkins for CI/CD
The good news: the Simpsons already did it.
Jenkins for CI/CD
Requirements for infrastructure management:
• Works on-premise
• Needs to be distributed
• Supports health checks
• Self-healing
• Easy to deploy and maintain
• Can scale
• Has persistent storage support
• User friendly
DC/OS overview
A short story about the evolution from Mesos to DC/OS
DC/OS Overview
• Cluster of physical master nodes
• Any number of worker nodes
Diagram: Masters; Private Zone with two Private Nodes; DMZ with one Public Node
Master-Slave Architecture
Master Node:
• Zookeeper
• Marathon
• Mesos-master
Worker Node:
• Mesos-slave
Master-Slave Architecture
Core: Mesos, Marathon, Zookeeper
Frameworks and services: Cassandra, Elastic, Jenkins, Spark, Storm, Hadoop, Marathon, Aurora, Metronome
DC/OS: GUI, Networks, Security, Logs, Metrics, Packages, Storage, Orchestration
DC/OS Components
Master-Slave
Architecture
To provide Software as
a Service, better way to
share full access to the
services.
Zookeeper
Mesos
Marathon
47
Master-Slave Architecture
Mesos resource management makes it possible to share resources across different components automatically.
Zookeeper, Mesos, Marathon
Resource Management in Mesos
When a new deployment starts, the task is started on a random Agent Node that has enough free resources.
If resources are lacking, the deployment is put into the Pending state.
workerA: 4 CPU, 8 Gb MEM, 10 Gb HDD
workerB: 2 CPU, 4 Gb MEM, 10 Gb HDD
Requested: 2 CPU, 6 Gb RAM, 5 Gb HDD
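The placement rule on this slide can be sketched in a few lines of Python. This is only an illustration of the idea, not how Mesos or Marathon are implemented: the `Resources` class and the `place_task` function are hypothetical names, and the worker sizes are taken from the slide's diagram.

```python
import random
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Resources:
    cpus: float
    mem_gb: float
    disk_gb: float

    def fits(self, need: "Resources") -> bool:
        """True if this free capacity covers every dimension of the request."""
        return (self.cpus >= need.cpus
                and self.mem_gb >= need.mem_gb
                and self.disk_gb >= need.disk_gb)

    def take(self, need: "Resources") -> None:
        """Subtract an accepted request from the free capacity."""
        self.cpus -= need.cpus
        self.mem_gb -= need.mem_gb
        self.disk_gb -= need.disk_gb


def place_task(free: Dict[str, Resources], need: Resources) -> Optional[str]:
    """Pick a random agent with enough free resources; None means Pending."""
    candidates = [name for name, res in free.items() if res.fits(need)]
    if not candidates:
        return None  # nothing fits: the deployment waits in the Pending state
    chosen = random.choice(candidates)
    free[chosen].take(need)
    return chosen


# Free resources per agent, as drawn on the slide (hypothetical helper structure).
free = {
    "workerA": Resources(cpus=4, mem_gb=8, disk_gb=10),
    "workerB": Resources(cpus=2, mem_gb=4, disk_gb=10),
}
request = Resources(cpus=2, mem_gb=6, disk_gb=5)

first = place_task(free, request)   # only workerA has 6 Gb of free memory
second = place_task(free, request)  # no agent has 6 Gb left, so it stays Pending
print(first, second)
```

With these numbers the first request can only land on workerA, and a second identical request finds no agent with enough memory, which corresponds to the Pending state described above.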
Resource Management in Mesos
The killer feature of DC/OS is better resource utilization.
Health checks, deployment, distribution, and resource management are handled on the DC/OS (Mesos) side.
workerA: 4 CPU, 8 Gb MEM, 10 Gb HDD
workerB: 2 CPU, 4 Gb MEM, 10 Gb HDD
Requested: 2 x (1 CPU, 1 Gb RAM, 0 Gb HDD)
Resource Management in Mesos
DC/OS takes care of health checks for the Agent Nodes and for each running service instance.
If a service or a node becomes unhealthy, the service, or all services on the broken node, will be redeployed to other nodes.
workerA: 4 CPU, 8 Gb MEM, 10 Gb HDD
workerB: 2 CPU, 4 Gb MEM, 10 Gb HDD
Running: 2 instances
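The redeployment behaviour can be pictured with a small Python sketch. It is a simplification under stated assumptions, not the DC/OS algorithm: the `reschedule` function and the service names are hypothetical, and only CPU is tracked to keep the example short.

```python
from typing import Dict, List, Tuple


def reschedule(
    assignments: Dict[str, str],
    healthy: Dict[str, bool],
    free_cpus: Dict[str, float],
    task_cpus: Dict[str, float],
) -> Tuple[Dict[str, str], List[str]]:
    """Move tasks off unhealthy nodes; tasks that fit nowhere become pending."""
    result = dict(assignments)
    free = dict(free_cpus)
    pending: List[str] = []
    for task, node in sorted(assignments.items()):
        if healthy.get(node, False):
            continue  # the node is fine, the task stays where it is
        need = task_cpus[task]
        targets = [n for n in sorted(free)
                   if healthy.get(n, False) and free[n] >= need]
        if targets:
            result[task] = targets[0]
            free[targets[0]] -= need
        else:
            del result[task]
            pending.append(task)
    return result, pending


# Hypothetical cluster state: workerA has failed its health check.
assignments = {"svc-1": "workerA", "svc-2": "workerA", "svc-3": "workerB"}
healthy = {"workerA": False, "workerB": True}
free_cpus = {"workerA": 2.0, "workerB": 1.0}
task_cpus = {"svc-1": 1.0, "svc-2": 1.0, "svc-3": 1.0}

new_assignments, pending = reschedule(assignments, healthy, free_cpus, task_cpus)
print(new_assignments, pending)
```

In this example one service moves to workerB, and the other waits as pending because workerB has no CPU left, which is why spare capacity matters for self-healing.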
DEMO – DC/OS cluster in AWS
Hiding Pets behind the Cattle
Managing hardware for Docker orchestration private cloud
Install Deploy Configure Observe
How we started
• Vagrant + VirtualBox
• Mini PC
• 8 TVs with Mini PC
• Ubuntu 16.04
• Daily usage – for Scrum and Monitoring Dashboards
• 8 * 2 CPU = 16 CPU
• 8 * 4 Gb RAM = 32 Gb RAM
• PC
• VMware
• Google Cloud
Install Deploy Configure Observe
DC/OS Initial Setup
We started with a simple configuration: one master node and one slave node dedicated to service deployments.
Elastic was the first try: it aggregates a lot of logs and breaks from time to time.
Diagram: Master cluster; DevOps node with 4 CPU, 8 Gb
DC/OS Initial Setup – first customer
When we need some service to be running, we request a machine for it in VMware and add it as a DC/OS Agent.
Behind the scenes, all VMs became DC/OS nodes.
Diagram: Master cluster; DevOps node with 4 CPU, 8 Gb
DC/OS Initial Setup – TV boxes
The best start was to use the TV boxes installed for scrum meetings in all office rooms.
They give a lot of free resources just for fun.
10 * 2 CPU = 20 CPU
10 * 4 Gb = 40 Gb
Diagram: Master cluster; total 24 CPU, 48 Gb
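The totals on this slide can be checked with a short calculation. It assumes that the 24 CPU / 48 Gb figure is the sum of the 4 CPU / 8 Gb DevOps node from the previous slide and the ten TV boxes; the `NodeGroup` class and the group names are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class NodeGroup:
    name: str
    count: int
    cpus_each: int
    mem_gb_each: int


def cluster_capacity(groups: Iterable[NodeGroup]) -> Tuple[int, int]:
    """Return the total (CPU, memory in Gb) across all agent groups."""
    groups = list(groups)
    cpus = sum(g.count * g.cpus_each for g in groups)
    mem_gb = sum(g.count * g.mem_gb_each for g in groups)
    return cpus, mem_gb


groups = [
    NodeGroup("devops-vm", count=1, cpus_each=4, mem_gb_each=8),
    NodeGroup("tv-box", count=10, cpus_each=2, mem_gb_each=4),
]
total_cpus, total_mem_gb = cluster_capacity(groups)
print(f"{total_cpus} CPU, {total_mem_gb} Gb")
```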
DC/OS Initial Setup – internal services
This setup allows us to run all the services required for the internal needs of the infrastructure teams.
... and a little more, such as bots for Slack, Sonar, etc.
Diagram: total 24 CPU, 48 Gb; Slack Bot
DC/OS Initial Setup – issues
The main issues at this stage were:
• Master Node performance
Master nodes lacked resources, which often caused DC/OS UI failures or Marathon failures. Temporary solution: restart the master node.
• Agent Node failures
Running out of free disk space, machine shutdowns, high CPU load, and running out of memory were the most common causes of failure.
DC/OS Initial Setup – Cluster
With more than one Agent, the single master became the weak point:
• In case of failure, the whole system goes down
• Lack of performance
Diagram: Master cluster with multiple Masters; total 24 CPU, 48 Gb
DC/OS Initial Setup – more VMs
With this setup, DC/OS became ready to serve external requests.
Diagram: Master cluster with multiple Masters; Dev; total 40 CPU, 60 Gb
DC/OS Initial Setup – Hardware PCs
A few dedicated PCs became cluster members in the worker Agent role.
Diagram: Master cluster with multiple Masters; DevTeam; total 60 CPU, 90 Gb
DC/OS Initial Setup – Google Cloud
Creating a scaling group in the cloud means the DC/OS cluster is no longer limited by resources.
Diagram: Master cluster with multiple Masters; Unit; total 100 CPU, 160 Gb
Summary
• The number of nodes increased gradually.
• At first, the Infra Team used DC/OS only for its own needs.
• Monitoring and bootstrapping the cluster required some additional resources: Zabbix, Grafana.
• Different types of nodes increase flexibility and speed up growth.
• Adding Google Cloud instances eliminated the cluster size limit. With a hybrid cloud, DC/OS can grow much more quickly.
Install Deploy Configure Observe
Add services quickly
The service catalog lets you choose a service from a predefined list and deploy it in one click.
If needed, your own repository can be added.
Add services flexibly
For all other cases, manual service deployment is available:
1. Single Container (Docker)
2. Bash runtime
3. Multi-container
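For the single-container case, a Marathon app definition is a JSON document. The sketch below builds a minimal one in Python; the `docker_app` helper, the app id, the image name, and the health-check command are hypothetical, and the exact field layout can differ between Marathon versions, so treat it as an illustration rather than a definition to copy.

```python
import json


def docker_app(app_id: str, image: str, cpus: float = 0.5,
               mem_mb: int = 512, instances: int = 1) -> dict:
    """Build a minimal Marathon-style app definition for one Docker container."""
    return {
        "id": app_id,            # the path-like id also defines the folder
        "cpus": cpus,            # CPU shares requested per instance
        "mem": mem_mb,           # memory per instance, in MB
        "instances": instances,
        "container": {
            "type": "DOCKER",
            "docker": {"image": image},
        },
        # Command-based health check; the command itself is an assumption.
        "healthChecks": [{
            "protocol": "COMMAND",
            "command": {"value": "curl -f http://localhost:8080/health"},
            "gracePeriodSeconds": 60,
            "intervalSeconds": 30,
            "maxConsecutiveFailures": 3,
        }],
    }


app = docker_app("/dev-team/kafka-ui", "example/kafka-ui:latest", instances=2)
print(json.dumps(app, indent=2))
```

The printed JSON is the kind of document that would be pasted into the DC/OS UI or sent to Marathon when deploying a service manually.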
Be structural
Services can be organized in a folder structure.
This feature makes it possible to isolate environments for different dev teams.
Be discoverable
Mesos DNS, integrated with the company DNS server, makes it possible to access each service directly by Agent IP/port.
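As an illustration of how a folder structure can end up in a DNS name, the sketch below assumes the usual Mesos-DNS convention for Marathon apps: the path segments of the app id are reversed, joined with dashes, and suffixed with `.marathon.mesos`. The `mesos_dns_name` function and the app id are hypothetical, and the convention should be checked against the Mesos-DNS version in use.

```python
def mesos_dns_name(app_id: str, framework: str = "marathon",
                   domain: str = "mesos") -> str:
    """Derive the assumed Mesos-DNS name for a Marathon app id such as /team/app."""
    parts = [p for p in app_id.strip("/").split("/") if p]
    if not parts:
        raise ValueError("app id must contain at least one path segment")
    # Assumed convention: innermost segment first, folders after it.
    return "-".join(reversed(parts)) + f".{framework}.{domain}"


name = mesos_dns_name("/dev-team/kafka")
print(name)
```

Under this assumption, a service deployed as `/dev-team/kafka` would be reachable as `kafka-dev-team.marathon.mesos`, so two teams can run a service with the same name in different folders without a DNS clash.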
Summary
• DC/OS makes it possible to build complex DEV/UAT environments based on Docker infrastructure.
• The simplest way to deploy is the Universe Catalog, with well-known services deployed in one click.
• Each service can be placed in a separate folder.
• Mesos DNS includes the full folder structure in the service DNS name.
• Marathon-LB can proxy any external call through HAProxy to the target service instance (translating IP and port).
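Marathon-LB decides which apps to expose by reading labels on the app definition. The sketch below adds the two labels commonly used for this, `HAPROXY_GROUP` and `HAPROXY_0_VHOST`; the `expose_via_marathon_lb` helper, the app, and the host name are hypothetical, and a real app would also need its ports defined for the proxy to route to.

```python
def expose_via_marathon_lb(app: dict, vhost: str, group: str = "external") -> dict:
    """Return a copy of an app definition with Marathon-LB routing labels added."""
    exposed = dict(app)
    labels = dict(app.get("labels", {}))
    labels["HAPROXY_GROUP"] = group    # which Marathon-LB group picks the app up
    labels["HAPROXY_0_VHOST"] = vhost  # host name routed to the app's first port
    exposed["labels"] = labels
    return exposed


base_app = {"id": "/dev-team/kafka-ui", "instances": 2}
public_app = expose_via_marathon_lb(base_app, "kafka-ui.example.com")
print(public_app["labels"])
```

The helper copies the definition instead of changing it in place, so the same base app can be deployed both with and without external exposure.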
Is there life after delivery?
Sure, service life only begins here …
Install Deploy Configure Observe
Master Nodes Migration
Cluster breath
Nodes health uncertainty
Service mobility
Mesos lost tasks
Frequent DC/OS releases
Be careful with Zookeeper
Scale your services
Afterword
Short summary
What is DC/OS?
Distributed System, Cluster Manager, Container Platform, Operating System, Service Catalog, Network
Initial cluster setup
Jenkins Infrastructure
Platform as a Service
Microservices Infrastructure
Hybrid Cloud
References
• DC/OS official page: https://dcos.io/
• DC/OS Documentation: https://docs.mesosphere.com/1.11/overview/
• Marathon GitHub: https://github.com/mesosphere/marathon
• Mesos: http://mesos.apache.org/documentation/latest/
• Zookeeper: https://zookeeper.apache.org/
• Exhibitor: https://github.com/soabase/exhibitor/wiki
Thank You
Q&A

Sergey Dzyuban "To Build My Own Cloud with Blackjack…"