@LostInBrittany
Amsterdam | April 2 - 3, 2019
Monitoring OVH: 300k servers, 28 DCs...
and one Metrics platform
Horacio Gonzalez
@LostInBrittany
Who are we?
Introducing myself and
introducing OVH
Horacio Gonzalez
Spaniard lost in Brittany, developer,
dreamer and all-around geek
Flutter
OVH: Key Figures
1.3M Customers worldwide in 138 Countries
1.5 Billion euros investment over five years
28 Datacenters (growing)
350k Dedicated Servers
200k Private cloud VMs running
650k Public cloud Instances created in a month
20 Tbps bandwidth capacity
35 Points of presence
4 Tbps Anti-DDoS capacity
Hosting capacity: 1.3M Physical Servers
2 500+ Employees in 19 countries
18 Years of Innovation
OVH: A Global Leader in Cloud
Own 15 Tbps Network with 35 PoPs
> 1.3M Customers in 138 Countries
Hosting capacity: 1.3M Physical Servers
360k Servers already deployed
2018: 27 Datacenters
2020: 50 Datacenters
#1 Dedicated IaaS in Europe
200k Private cloud VMs running
Ranking & Recognition
1st European Cloud Provider*
1st Hosting provider in Europe
1st Microsoft Exchange provider
Certified vCloud Datacenter
Certified Kubernetes platform (CNCF)
VMware Global Service Provider 2013-2016
Veeam Best Cloud Partner of the year (2018)
* Netcraft 2017
OVH: Our solutions
Cloud
Web
Hosting
▪ Dedicated Server
▪ Data Storage
▪ Network and Security
▪ Licences
Mobile
Hosting Telecom
VoIP
SMS/Fax
Virtual desktop
Cloud HubiC
OverTheBox
Containers
Compute
Database
Object Storage
Security
Messaging
VPS
Public Cloud
Private Cloud
Dedicated server
Cloud Desktop
Hybrid Cloud
Domain names
Email
CDN
Web hosting
MS Office
MS solutions
Once upon a time...
Because I love telling tales
This talk is about a tale...
A true one nevertheless
And as in most tales
It begins with a mission
And a band of heroes
Engulfed in the adventure
They fight against mishaps
And all kinds of foes
They build mighty fortresses
Pushing the limits of the possible
And defend them day after day
Against all odds
But we don't know the end yet
Because this tale isn't finished yet
It begins with a mission
Build a metrics platform for OVH
Why do we need metrics?
To make better decisions by using numbers.
We want our code to add value.
We need to make better decisions about our code.
Code adds value when it runs, not when we write it.
We need to know what our code does when it runs.
We can’t do this unless we measure it.
We have a mental model of what our code does.
This representation can be wrong.
We can’t know until we measure it.
Find the bottleneck
“The app is slow.” - User
“The page takes 500ms!” - Ops
SQL Query? Template Rendering? Session Storage?
We don't know.
With observability:
SQL Query: 53 ms
Template Rendering: 1 ms
Session Storage: 315 ms
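Getting that per-stage breakdown can be as simple as timing each stage of a request and reporting the slowest one. A minimal sketch in plain Python: the stage names mirror the slide, and the `time.sleep` calls are hypothetical stand-ins for the real SQL, template and session work.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(stage, timings):
    """Record the wall-clock duration of the enclosed block, in milliseconds, under `stage`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = (time.perf_counter() - start) * 1000.0


def handle_request():
    """Hypothetical request handler; the sleeps simulate each stage's work."""
    timings = {}
    with timed("sql_query", timings):
        time.sleep(0.005)
    with timed("template_rendering", timings):
        time.sleep(0.0001)
    with timed("session_storage", timings):
        time.sleep(0.03)
    return timings


timings = handle_request()
bottleneck = max(timings, key=timings.get)
for stage, ms in sorted(timings.items(), key=lambda kv: -kv[1]):
    print(f"{stage:<20}{ms:8.1f} ms")
print("bottleneck:", bottleneck)
```

In a real service these durations would be pushed to a metrics backend on every request instead of printed, so the breakdown is available before anyone reports that the app is slow.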
Why do we need metrics?
We improve our mental model by measuring what our code does.
We use our mental model to decide what to do.
A better mental model makes us better at deciding what to do.
Better decisions make us better at generating value.
Measuring makes your app better.
It began with a mission
Build a metrics platform for OVH
A metrics platform for OVH
For all OVH
Building OVH Metrics
One Platform to unify them all,
One Platform to find them,
One Platform to bring them all
and in the Metrics monitor them
What is OVH Metrics?
Managed Cloud Platform
for Time Series
OVH monitoring story
We had lots of partial solutions...
OVH monitoring story
One Platform to unify them all
What should we build it on?
OVH monitoring story
Including a really big
OpenTSDB drawbacks
OpenTSDB RowKey Design
OpenTSDB rowkey design flaws
● .*regex.* => full table scans
● High cardinality issues (query latencies)
We needed something able to manage hundreds of millions of time series.
OpenTSDB didn't scale for us.
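Modelling the row key shows why a tag regex turns into a large scan. In OpenTSDB's documented HBase schema (ignoring optional salting), a row key is the metric UID, then an hour-aligned base timestamp, then sorted tag-key/tag-value UID pairs, with 3-byte UIDs by default. The sketch below is a simplified in-memory model, not OpenTSDB code; the metric, host names and UID assignments are hypothetical. Because tags come after the timestamp, a query can only narrow the scanned key range by metric and time, and a filter on a tag value has to visit every row in that range.

```python
import re
import struct
from bisect import bisect_left

UID_WIDTH = 3  # OpenTSDB's default UID width, in bytes


def uid(n):
    return n.to_bytes(UID_WIDTH, "big")


def row_key(metric_uid, ts, tags):
    """Simplified key: metric UID | hour-aligned base timestamp | sorted (tagk, tagv) UIDs."""
    base = ts - ts % 3600
    key = uid(metric_uid) + struct.pack(">I", base)
    for tagk, tagv in sorted(tags):
        key += uid(tagk) + uid(tagv)
    return key


def scan(table, metric_uid, start, end, host_regex, uid_to_host):
    """Scan the contiguous key range of one metric/time window, then filter on the host tag."""
    lo = uid(metric_uid) + struct.pack(">I", start - start % 3600)
    hi = uid(metric_uid) + struct.pack(">I", end - end % 3600 + 1)  # exclusive bound
    pattern = re.compile(host_regex)
    scanned = matched = 0
    for key in table[bisect_left(table, lo):bisect_left(table, hi)]:
        scanned += 1
        # First tag value follows metric (3) + timestamp (4) + tag key (3) bytes.
        host_uid = int.from_bytes(key[10:13], "big")
        if pattern.fullmatch(uid_to_host[host_uid]):
            matched += 1
    return scanned, matched


# Hypothetical data: one metric, 1,000 hosts, one row per host per hour for a day.
METRIC, TAGK_HOST = 1, 1
hosts = [f"web-{i}" for i in range(1000)]
host_to_uid = {h: i + 1 for i, h in enumerate(hosts)}
uid_to_host = {u: h for h, u in host_to_uid.items()}
DAY_START = 1_554_163_200  # an hour-aligned Unix timestamp

table = sorted(
    row_key(METRIC, DAY_START + hour * 3600, [(TAGK_HOST, host_to_uid[h])])
    for h in hosts
    for hour in range(24)
)

scanned, matched = scan(table, METRIC, DAY_START, DAY_START + 23 * 3600, r"web-42", uid_to_host)
print(f"rows scanned: {scanned}, rows matched: {matched}")  # rows scanned: 24000, rows matched: 24
```

Selecting one host out of 1,000 still reads every row of the metric for the day; with hundreds of millions of series, that ratio is where the query latency comes from.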
OpenTSDB other flaws
● Compaction (or append writes)
● /api/query: one endpoint per function?
● Asynchronous
● Unauthenticated
● ...
Scaling OpenTSDB
Metrics needs
First need:
To be massively scalable
Analytics is the key to success
Fetching data is only the tip of the iceberg
Analysing metrics data
To be scalable, analysis must be done in the database, not on the user's computer
Metrics needs
Second need:
To have rich query capabilities
Enter Warp 10...
Open-source time series database
More than a Time Series DB
Warp 10 is a software platform that
● Ingests and stores time series
● Manipulates and analyzes time series
Manipulating Time Series with Warp 10
A true Time Series analysis toolbox
○ Hundreds of functions
○ Manipulation frameworks
○ Analysis workflow
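Warp 10's manipulation frameworks, such as BUCKETIZE and REDUCE, align raw points into fixed time buckets and then aggregate across series, so that only a compact result has to leave the database. The sketch below imitates that idea in plain Python. It is not WarpScript, and the function names, the one-hour bucket span and the sample data are assumptions made for illustration.

```python
from collections import defaultdict


def bucketize_max(points, span):
    """Align (timestamp, value) points to `span`-second buckets, keeping each bucket's max.

    Buckets are keyed by their start time here, a simplification for this sketch.
    """
    buckets = {}
    for ts, value in points:
        start = ts - ts % span
        buckets[start] = max(value, buckets.get(start, value))
    return dict(sorted(buckets.items()))


def reduce_sum(series_by_name):
    """Sum already-aligned buckets across series, one value per bucket timestamp."""
    total = defaultdict(float)
    for buckets in series_by_name.values():
        for ts, value in buckets.items():
            total[ts] += value
    return dict(sorted(total.items()))


# Hypothetical raw data: two hosts, one point every 10 s for one day.
DAY = 86_400
raw = {
    "web-1": [(ts, (ts // 60) % 60) for ts in range(0, DAY, 10)],
    "web-2": [(ts, 2 * ((ts // 60) % 60)) for ts in range(0, DAY, 10)],
}
hourly_max = {name: bucketize_max(points, 3600) for name, points in raw.items()}
fleet = reduce_sum(hourly_max)

raw_points = sum(len(points) for points in raw.values())
print(f"raw points: {raw_points}, points returned: {len(fleet)}")  # raw points: 17280, points returned: 24
```

Returning 24 aggregated points instead of 17,280 raw ones is the whole argument for running the analysis next to the data rather than on the client.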
Manipulating Time Series with Warp 10
A Time Series manipulation language
WarpScript
Did you say scalability?
From the smallest to the largest...
More Warp 10 goodness
● Secured & multi-tenant
● In-memory index
● No cardinality issues
● Lock-free ingestion
● WarpScript query language
● Supports more data types
● Synchronous (transactions)
● Better performance
● Better scalability
● Versatile (standalone, distributed)
OVH Observability Metrics Platform
Metrics Data Platform
Building an ecosystem
From Warp 10 to OVH Metrics
Multi-protocol
Why choose? We need them all!
Open source monitoring tools
Why choose?
Let’s support all of them!
Metrics Platform
https://<protocol>.<region>.metrics.ovh.net, where <protocol> is graphite, influx, opentsdb, prometheus, warp10, ...
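Supporting several protocols mostly means accepting the same datapoint in different wire formats. The sketch below renders one hypothetical point (metric `os.cpu`, label `host=web-1`) in the plain-text formats of Graphite (1.1+ tagged series), OpenTSDB's telnet `put`, InfluxDB line protocol, the Prometheus text exposition format and Warp 10's GTS input format, and builds the per-protocol host from the endpoint pattern above. It only formats strings: escaping of special characters, authentication and the HTTP push are left out, and `<region>` stays a placeholder.

```python
def to_graphite(name, labels, ts_s, value):
    # Graphite tagged-series plaintext: path;tag=value value timestamp(s)
    tags = "".join(f";{k}={v}" for k, v in sorted(labels.items()))
    return f"{name}{tags} {value} {ts_s}"


def to_opentsdb(name, labels, ts_s, value):
    # OpenTSDB telnet-style put: put metric timestamp value tagk=tagv ...
    tags = " ".join(f"{k}={v}" for k, v in sorted(labels.items()))
    return f"put {name} {ts_s} {value} {tags}"


def to_influx(name, labels, ts_s, value):
    # InfluxDB line protocol; the field key "value" is this sketch's choice, timestamp in ns
    tags = "".join(f",{k}={v}" for k, v in sorted(labels.items()))
    return f"{name}{tags} value={value} {ts_s * 1_000_000_000}"


def to_prometheus(name, labels, ts_s, value):
    # Prometheus text format: dots are not valid in metric names, timestamp in ms
    tags = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name.replace('.', '_')}{{{tags}}} {value} {ts_s * 1000}"


def to_warp10(name, labels, ts_s, value):
    # Warp 10 GTS input format without geo fields: TS// class{labels} value, TS in microseconds
    tags = ",".join(f"{k}={v}" for k, v in sorted(labels.items()))
    return f"{ts_s * 1_000_000}// {name}{{{tags}}} {value}"


FORMATTERS = {
    "graphite": to_graphite,
    "opentsdb": to_opentsdb,
    "influx": to_influx,
    "prometheus": to_prometheus,
    "warp10": to_warp10,
}


def endpoint(protocol, region):
    """Host pattern from the slide; the region is supplied by the caller."""
    return f"https://{protocol}.{region}.metrics.ovh.net"


point = ("os.cpu", {"host": "web-1"}, 1_554_163_200, 42.5)
for protocol, fmt in FORMATTERS.items():
    print(endpoint(protocol, "<region>"), "|", fmt(*point))
```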
Metrics Live
In-memory, high-performance Metrics instances
In-memory: Metrics live
120+ million writes/s
Monitoring is only the beginning
OVH Metrics addresses many other use cases
Use case families
• Billing (e.g. bill on monthly max consumption)
• Monitoring (APM, infrastructure, appliances, ...)
• IoT (manage devices, operator integration, ...)
• Geo Location (manage localized fleets)
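Billing on monthly maximum consumption reduces each customer's usage series to one value per calendar month. A minimal sketch, assuming samples arrive as (Unix timestamp, value) pairs and that months are counted in UTC; the samples and what they measure are hypothetical.

```python
from datetime import datetime, timezone


def monthly_max(samples):
    """Return {(year, month): max value} for (unix_ts, value) samples, using UTC months."""
    result = {}
    for ts, value in samples:
        dt = datetime.fromtimestamp(ts, tz=timezone.utc)
        key = (dt.year, dt.month)
        if key not in result or value > result[key]:
            result[key] = value
    return result


# Hypothetical usage samples for one customer (e.g. number of running instances).
samples = [
    (1_551_398_400, 3),  # 2019-03-01 00:00 UTC
    (1_552_608_000, 7),  # 2019-03-15 00:00 UTC
    (1_553_990_400, 4),  # 2019-03-31 00:00 UTC
    (1_554_163_200, 5),  # 2019-04-02 00:00 UTC
]
print(monthly_max(samples))  # {(2019, 3): 7, (2019, 4): 5}
```

Billing on the peak rather than the average means a single short spike sets the month's price, which is why the raw series has to be kept at a fine enough resolution to capture it.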
Use cases
• DC temperature/electricity/cooling map
• Pay-as-you-go billing (PCI/IPLB)
• GSCAN
• Monitoring
• ML model scoring (anti-fraud)
• Pattern detection for medical applications
SREing Metrics
With great power
comes great responsibility
Metrics' own metrics
432 000 000 000 datapoints / day
10 Tb / day
5 000 000 dp/s
500 000 000 series
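The daily datapoint count and the per-second rate describe the same throughput: sustaining an average of 5 million datapoints per second over the 86 400 seconds of a day gives the daily total.

```latex
5 \times 10^{6}\ \tfrac{\text{dp}}{\text{s}} \times 86\,400\ \tfrac{\text{s}}{\text{day}}
  = 4.32 \times 10^{11}\ \tfrac{\text{dp}}{\text{day}}
  = 432\,000\,000\,000\ \tfrac{\text{dp}}{\text{day}}
```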
Our cluster sizes
BHS:
● 30 nodes
● 400 TB
● 120 Mbps
GRA:
● 150 nodes
● 2 PB
● 1.1 Gbps
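Dividing each cluster's storage by its node count (decimal units, 1 PB = 1 000 TB) gives roughly the same storage per node in both regions:

```latex
\text{BHS: } \frac{400\ \text{TB}}{30\ \text{nodes}} \approx 13.3\ \text{TB/node}
\qquad
\text{GRA: } \frac{2\ \text{PB}}{150\ \text{nodes}} = \frac{2\,000\ \text{TB}}{150\ \text{nodes}} \approx 13.3\ \text{TB/node}
```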
Our cluster architecture
Detecting errors
Before it's too late
Extract errors from logs
Forward logs and extract metrics!
Tailor
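Turning logs into metrics usually means matching error lines and counting them per time bucket, so an alert can fire on a count instead of someone reading logs. The sketch below shows only that idea and does not describe how Tailor works; the log line format, the regex, the set of error levels and the one-minute bucket are all assumptions.

```python
import re
from collections import Counter

# Hypothetical log format: "<ISO timestamp> <LEVEL> <component> <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}):\d{2}\S*\s+(?P<level>[A-Z]+)\s+(?P<component>\S+)"
)


def error_counts(lines, levels=("ERROR", "FATAL")):
    """Count error lines per (minute, component); lines that don't parse are skipped."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m and m.group("level") in levels:
            counts[(m.group("ts"), m.group("component"))] += 1
    return counts


logs = [
    "2019-04-02T10:15:03Z INFO  regionserver Flushing memstore",
    "2019-04-02T10:15:07Z ERROR regionserver RegionTooBusyException",
    "2019-04-02T10:15:42Z ERROR regionserver RegionTooBusyException",
    "2019-04-02T10:16:01Z FATAL master Aborting",
    "garbage line",
]
for (minute, component), n in sorted(error_counts(logs).items()):
    print(minute, component, n)
```

Each (minute, component) count is then a datapoint like any other and can be pushed to the metrics platform and alerted on.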
Monitoring the JVM
Documentation
JVM GC
The good, the bad
and the ugly
The good
The bad
#java #jdk11 #zgc
… and the ugly
Monitoring HBase
Number of open regions
Queues length
Number of read and write requests
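Request counts like these are usually exposed as cumulative counters, so what is graphed is a per-second rate derived from consecutive samples. A minimal sketch, assuming (timestamp, cumulative count) samples and assuming that a drop in the counter means the region server restarted and the count started again from zero; the sample values are hypothetical.

```python
def counter_rates(samples):
    """Per-second rates from (unix_ts, cumulative_count) samples.

    A decrease is treated as a counter reset (e.g. a restart): the new value is
    taken as the increase since the reset.
    """
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        if t1 <= t0:
            continue  # skip duplicate or out-of-order timestamps
        delta = c1 - c0 if c1 >= c0 else c1
        rates.append((t1, delta / (t1 - t0)))
    return rates


# Hypothetical write-request counter sampled every 10 s, with a restart before t=30.
writes = [(0, 1000), (10, 1500), (20, 2500), (30, 200), (40, 700)]
print(counter_rates(writes))  # [(10, 50.0), (20, 100.0), (30, 20.0), (40, 50.0)]
```

Without the reset handling, the restart at t=30 would show up as a large negative rate and hide the real traffic.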
Preserve data locality
Host health
Pokédex
Inventory all animals.
Merging all data sources
Global visualization
Correlate information
Sacha
The best tamer
An awesome CLI
Retrieving basic information
Create region map
Move region to another region server
Drain the regions of a region server
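Draining a region server means reassigning each of its regions to another server before taking it out of service. A minimal planning sketch, assuming the region map is a dict from server name to region list and that the least-loaded remaining server is the right target for each region; this is not Sacha's actual logic, and the server and region names are hypothetical. Executing the plan would go through HBase's own tooling, for example the `move` command of the HBase shell.

```python
import heapq


def drain_plan(region_map, server):
    """Plan moves of every region on `server` to the least-loaded remaining servers.

    Returns a list of (region, target_server) moves; `region_map` is not modified.
    """
    others = [(len(regions), name) for name, regions in region_map.items() if name != server]
    if not others:
        raise ValueError("no other region server to drain to")
    heapq.heapify(others)  # min-heap on (region count, server name)
    moves = []
    for region in region_map[server]:
        load, target = heapq.heappop(others)
        moves.append((region, target))
        heapq.heappush(others, (load + 1, target))
    return moves


region_map = {
    "rs1": ["r1", "r2", "r3", "r4"],
    "rs2": ["r5", "r6"],
    "rs3": ["r7"],
}
for region, target in drain_plan(region_map, "rs1"):
    print(f"move {region} -> {target}")
```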
Managing multiple hardware profiles
Balance the cluster
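With several hardware profiles, an equal region count per server is not a balanced cluster: bigger machines should carry more regions. One way to express that, assumed here rather than taken from Sacha, is to give each profile a weight, compute each server's target region count in proportion to its weight, and move regions from servers above target to servers below it. The profiles, weights, servers and regions are hypothetical.

```python
# Hypothetical hardware profiles: a "big" node is assumed to take twice the load of a "small" one.
PROFILE_WEIGHT = {"small": 1, "big": 2}


def weighted_targets(weights, total_regions):
    """Split `total_regions` across servers in proportion to their weights (largest remainder)."""
    total_weight = sum(weights.values())
    exact = {s: total_regions * w / total_weight for s, w in weights.items()}
    targets = {s: int(x) for s, x in exact.items()}
    leftover = total_regions - sum(targets.values())
    # Hand the remaining regions to the servers with the largest fractional parts.
    by_fraction = sorted(exact, key=lambda s: exact[s] - targets[s], reverse=True)
    for s in by_fraction[:leftover]:
        targets[s] += 1
    return targets


def balance_plan(region_map, weights):
    """Return (region, source, destination) moves bringing each server to its weighted target."""
    total = sum(len(regions) for regions in region_map.values())
    targets = weighted_targets(weights, total)
    # Regions leaving overloaded servers (the last ones listed: an arbitrary choice).
    to_move = []
    for server, regions in sorted(region_map.items()):
        extra = len(regions) - targets[server]
        if extra > 0:
            to_move.extend((region, server) for region in regions[-extra:])
    # One free slot per missing region on underloaded servers.
    slots = []
    for server, regions in sorted(region_map.items()):
        slots.extend([server] * max(targets[server] - len(regions), 0))
    return [(region, src, dst) for (region, src), dst in zip(to_move, slots)]


servers = {"small-1": "small", "small-2": "small", "big-1": "big"}
weights = {s: PROFILE_WEIGHT[p] for s, p in servers.items()}
region_map = {
    "small-1": ["r1", "r2", "r3", "r4", "r5", "r6"],
    "small-2": ["r7", "r8", "r9", "r10"],
    "big-1": ["r11", "r12"],
}
targets = weighted_targets(weights, sum(len(r) for r in region_map.values()))
moves = balance_plan(region_map, weights)
print(targets)
for region, src, dst in moves:
    print(f"move {region}: {src} -> {dst}")
```

The largest-remainder rounding keeps the targets summing exactly to the number of regions, so every region leaving an overloaded server has a free slot waiting for it.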
Conclusion
That's all folks!

Horacio Gonzalez - Monitoring OVH - Codemotion Amsterdam 2019