   
   
Credit: user niteroi @ panoramio.com
   
vimeo.com/43800150
   
   
   
   
   
   
   
   
1  Metrics 2.0 concepts
2  Implementation
3  Advanced stuff
   
“Dieter” ?
   
Peter → Deter
   
Terminology sync
   
(1234567890, 82)
(1234567900, 123)
(1234567910, 109)
(1234567920, 77)
db15.mysql.queries_running
host=db15 mysql.queries_running
   
   
How many page requests/s is vimeo.com doing?
   
● stats.hits.vimeo_com
● stats_counts.hits.vimeo_com
   
   
stats.<host>.requesthostport.vimeo_com_443
   
stats.timers.dfs5.proxy-server.object.GET.200.timing.upper_90
   
O(X*Y*Z)
X = # apps                
Y = # people             
Z = # aggregators     
   
How long does it take to retrieve an object from swift?
   
stats.timers.<host>.proxy-server.<swift_type>.<http_method>.<http_code>.timing.<stat>
stats.timers.<host>.object-server.<http_method>.timing.<stat>
target=stats.timers.dfs*.object*GET*timing.mean ?
target=groupByNode(stats.timers.dfs*.proxy-server.object.GET.*.timing.mean,2,"avg")
target=stats.timers.dfs*.object-server.GET.timing.mean
   
swift_type=object stat=mean timing GET avg by http_code
   
   
   
O((DxV)^2)
D = # dimensions             
V = # values per dim             
   
collectd.db.disk.sda1.disk_time.write
   
   
   
What should I name my metric?
   
10
100
1000
10000
100000
1000000
   
   
Metrics 2.0
   
Old:
● information lacking
● fields unclear & inconsistent
● cumbersome strings / trees
● forbidden characters
New:
● Self-describing
● Standardized
● All dimensions in orthogonal tag-space
● Allow some useful characters
   
stats.timers.dfs5.proxy-server.object.GET.200.timing.upper_90
{
    "server": "dfvimeodfsproxy5",
    "http_method": "GET",
    "http_code": "200",
    "unit": "ms",
    "target_type": "gauge",
    "stat": "upper_90",
    "swift_type": "object",
    "plugin": "swift_proxy_server"
}
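A minimal sketch of how such a translation could be done for this one metric family. The function name is hypothetical, the field order is assumed from the swift proxy-server pattern shown earlier, and the "ms" unit is an assumption about statsd timers. The host is copied as-is (dfs5); mapping it to a full hostname like the one on the slide is site-specific and not attempted here.

```python
# Assumed field layout (from the proxy-server pattern on the earlier slide):
# stats.timers.<host>.proxy-server.<swift_type>.<http_method>.<http_code>.timing.<stat>
def swift_proxy_timer_to_tags(name):
    """Turn a dotted statsd timer name into a metrics 2.0 style tag dict."""
    parts = name.split(".")
    if (len(parts) != 9 or parts[:2] != ["stats", "timers"]
            or parts[3] != "proxy-server" or parts[7] != "timing"):
        raise ValueError("not a swift proxy-server timer: %r" % name)
    host, swift_type, http_method, http_code, stat = (
        parts[2], parts[4], parts[5], parts[6], parts[8])
    return {
        "server": host,           # copied as-is; no hostname expansion
        "http_method": http_method,
        "http_code": http_code,
        "unit": "ms",             # assumption: statsd timers are in milliseconds
        "target_type": "gauge",
        "stat": stat,
        "swift_type": swift_type,
        "plugin": "swift_proxy_server",
    }


tags = swift_proxy_timer_to_tags(
    "stats.timers.dfs5.proxy-server.object.GET.200.timing.upper_90")
print(tags["stat"], tags["http_code"], tags["swift_type"])
```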
   
Main advantages:
● Immediate understanding of metric meaning (ideally)
● Minimize time to graphs, dashboards, alerting rules 
   
github.com/vimeo/graph-explorer/wiki
   
SI + IEC
B   Err   Warn   Conn   Job   File   Req    ...
MB/s   Err/d   Req/h   ...
   
{
    "site": "vimeo.com",
    "port": 80,
    "unit": "Req/s",
    "direction": "in",
    "service": "webapp_php",
    "server": "webxx"
}
   
   
Carbon-tagger:
...
service=foo.instance=host.target_type=gauge.type=calculation.unit=B 123 1234567890
…
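A rough sketch of reading a line in the format above: a dot-separated list of key=value tags, then the value, then the timestamp. The function name is hypothetical, and the parser assumes tag values contain no dots or whitespace, which holds for the example line but would not hold for a value such as vimeo.com.

```python
def parse_tagged_line(line):
    """Split 'k1=v1.k2=v2 <value> <timestamp>' into (tags, value, timestamp)."""
    key, value, timestamp = line.split()
    # Assumption: no tag value contains a dot, so '.' only separates tags.
    tags = dict(pair.split("=", 1) for pair in key.split("."))
    return tags, float(value), int(timestamp)


tags, value, ts = parse_tagged_line(
    "service=foo.instance=host.target_type=gauge.type=calculation.unit=B"
    " 123 1234567890")
print(tags["unit"], value, ts)
```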
Statsdaemon:
..unit=B..unit=B...     → unit=B/s
..unit=ms..unit=ms..    → unit=ms stat=mean
                        → unit=ms stat=upper_90
                        → ...
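The tag rewriting above could look roughly like this. It is a sketch, not statsdaemon's actual code: the function names are hypothetical, inputs must be non-empty, and upper_90 uses a simplified definition (largest value among the lowest 90% of samples).

```python
def flush_counts(tags, values, interval_s):
    """Sum the increments seen in one interval and emit a per-second rate.

    The unit tag changes from X to X/s; all other tags are kept.
    """
    out_tags = dict(tags, unit=tags["unit"] + "/s")
    return out_tags, sum(values) / interval_s


def flush_timer(tags, values):
    """Emit one output metric per statistic, each tagged with stat=<name>."""
    ordered = sorted(values)
    # Simplified upper_90: keep the lowest 90% of samples, take their maximum.
    keep = max(1, int(round(len(ordered) * 90 / 100)))
    stats = {
        "mean": sum(ordered) / len(ordered),
        "upper_90": ordered[keep - 1],
    }
    return [(dict(tags, stat=name), value) for name, value in stats.items()]


print(flush_counts({"unit": "B"}, [100, 200, 300], 10))
print(flush_timer({"unit": "ms"}, [12, 15, 11, 40, 13]))
```

The point is that the aggregator knows what it did to the data, so it can update unit and stat itself instead of leaving that to a naming convention.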
   
   
   
Graph-Explorer queries 101
site:api.vimeo.com unit=Req/s
requesthostport api_vimeo_com
   
   
Smoothing
avg over 10M
avg over ...
   
   
Aggregation, compare port 80 vs 443
avg by <dimension>
sum by <dimension>
sum by server
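A sketch of what "sum by server" or "avg by server" could do under the hood, assuming it means: merge all series whose tags are identical except for the named dimension. This is not Graph-Explorer's implementation; the function name and sample data are made up, and the series are assumed to be aligned and of equal length.

```python
from collections import defaultdict


def aggregate_by(series, dimension, how="sum"):
    """Merge series that differ only in `dimension`.

    series: list of (tags dict, list of values) pairs.
    Returns a list of (remaining tags, merged values) pairs.
    """
    groups = defaultdict(list)
    for tags, values in series:
        rest = tuple(sorted((k, v) for k, v in tags.items() if k != dimension))
        groups[rest].append(values)
    result = []
    for rest, members in groups.items():
        columns = list(zip(*members))  # one tuple per point in time
        if how == "sum":
            merged = [sum(col) for col in columns]
        elif how == "avg":
            merged = [sum(col) / len(col) for col in columns]
        else:
            raise ValueError("unknown aggregation: %r" % how)
        result.append((dict(rest), merged))
    return result


series = [
    ({"port": "80", "server": "web1"}, [1, 2]),
    ({"port": "80", "server": "web2"}, [3, 4]),
    ({"port": "443", "server": "web1"}, [10, 20]),
    ({"port": "443", "server": "web2"}, [30, 40]),
]
for tags, values in aggregate_by(series, "server", "sum"):
    print(tags, values)
```

With the server dimension summed away, one series per port remains, which is what the port 80 vs 443 comparison needs.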
   
   
Compare port 80 traffic among servers
site:api.vimeo.com unit=Req/s port=80 group by none avg over 10M
   
   
Graph-Explorer queries 201
proxy-server swift server:regex upper_90 unit=ms from <datetime> to <datetime> avg over <timespec>
   
   
   
   
   
Compare object put/get
Stack .. http_method:(PUT|GET) swift_type=object avg by http_code,server
   
   
Comparing servers
http_method:(PUT|GET) avg by http_code,swift_type,http_method group by none
   
   
Compare http codes for GET, per swift type
http_method=GET avg by server group by swift_type
   
   
transcode unit=Job/s avg over <time> from <datetime> to <datetime>
    Note: data is obfuscated
   
Bucketing
!queue sum by zone:ap-southeast|eu-west|us-east|us-west|sa-east|vimeo-df|vimeo-lv group by state
    Note: data is obfuscated
   
Compare job states per region (zones bucket)
group by zone
    Note: data is obfuscated
   
Unit conversion
unit=Mb/s network dfvimeorpc sum by server
   
   
   
unit=MB
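Because the unit is a tag, a query can ask for a different prefix than the one the data was stored in. A sketch of the arithmetic, using standard SI and IEC prefix values. The function names are hypothetical; only bit and byte units are handled, and time bases (for example /s vs /d) are not converted.

```python
# Standard SI (decimal) and IEC (binary) prefixes.
PREFIXES = {
    "": 1, "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
}
BITS_PER_BASE = {"b": 1, "B": 8}  # b = bit, B = byte


def _parse(unit):
    """Return (size of one unit in bits, time denominator) for e.g. 'Mb/s'."""
    numerator, _, denominator = unit.partition("/")
    if not numerator:
        raise ValueError("empty unit")
    prefix, base = numerator[:-1], numerator[-1]
    if prefix not in PREFIXES or base not in BITS_PER_BASE:
        raise ValueError("unsupported unit: %r" % unit)
    return PREFIXES[prefix] * BITS_PER_BASE[base], denominator


def convert(value, from_unit, to_unit):
    """Convert between bit/byte units that share the same time denominator."""
    from_bits, from_den = _parse(from_unit)
    to_bits, to_den = _parse(to_unit)
    if from_den != to_den:
        raise ValueError("time base differs: %r vs %r" % (from_unit, to_unit))
    return value * from_bits / to_bits


print(convert(125000, "B/s", "Mb/s"))
print(convert(5 * 10**9, "B", "GB"))
```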
   
   
   
{
    server=dfvimeodfs1
    plugin=diskspace
    mountpoint=_srv_node_dfs5
    unit=B
    type=used
    target_type=gauge
}
   
server:dfvimeodfs unit=GB type=free srv node
   
   
unit=GB/d group by mountpoint
   
   
   
   
   
   
   
Dashboard definition
 queries = [
   'cpu usage sum by core',
   'mem unit=B !total group by type:swap',
   'stack network unit=b/s',
   'unit=B (free|used) group by =mountpoint'
 ]
   
   
stats.dfvimeocliapp2.twitter.error
{
    "n1": "dfvimeocliapp2",
    "n2": "twitter",
    "n3": "error",
    "plugin": "catchall_statsd",
    "source": "statsd",
    "target_type": "rate",
    "unit": "unknown/s"
}
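A sketch of a catch-all rule that produces the tags above for any statsd counter name that no specific plugin recognizes. The function name is hypothetical; the tag names and fixed values are the ones from the slide.

```python
def catchall_statsd_tags(name):
    """Tag an unrecognized 'stats.<a>.<b>...' name with positional n1, n2, ..."""
    parts = name.split(".")
    if len(parts) < 2 or parts[0] != "stats":
        raise ValueError("not a statsd counter name: %r" % name)
    # Each node after the 'stats' prefix becomes n1, n2, n3, ...
    tags = {"n%d" % i: node for i, node in enumerate(parts[1:], start=1)}
    tags.update({
        "plugin": "catchall_statsd",
        "source": "statsd",
        "target_type": "rate",
        "unit": "unknown/s",  # the real unit cannot be recovered from the name
    })
    return tags


print(catchall_statsd_tags("stats.dfvimeocliapp2.twitter.error"))
```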
   
Two hard things in computer science
   
stats.gauges.files.id_boundary_7day
stats.gauges.files.id_boundary_ceil
   
unit=File id_boundary_7d 
{
    "unit": "File",
    "n1": "id_boundary_7d"
}
   
{
    "intrinsic": {
        "site": "vimeo.com",
        "unit": "Req/s"
    },
    "extrinsic": {
        "agent": "diamond",
        "processed_by": "statsd1",
        "src": "index.php:135",
        "replaces": "vimeo_com_reqps"
    }
}
   
site=vimeo.com unit=Req/s  processed_by=statsd1 src=index.php:135 added_by=dieter 123 1234567890
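A sketch of the intrinsic/extrinsic split: only intrinsic tags make up the metric's identity, so extrinsic metadata can change without creating a new series. The function names are hypothetical, tags are sorted for a stable key, and separating the two groups with two spaces is an assumption about the wire format.

```python
def _join(tags):
    """Render tags as sorted 'key=value' pairs separated by single spaces."""
    return " ".join("%s=%s" % (key, tags[key]) for key in sorted(tags))


def metric_id(intrinsic):
    """Identity of a metric: derived from intrinsic tags only."""
    return _join(intrinsic)


def format_line(intrinsic, extrinsic, value, timestamp):
    """Build '<intrinsic>  <extrinsic> <value> <timestamp>'."""
    # Assumption: two spaces separate intrinsic tags from extrinsic tags.
    return "%s  %s %s %d" % (_join(intrinsic), _join(extrinsic), value, timestamp)


intrinsic = {"site": "vimeo.com", "unit": "Req/s"}
extrinsic = {"processed_by": "statsd1", "src": "index.php:135",
             "added_by": "dieter"}
print(format_line(intrinsic, extrinsic, 123, 1234567890))
print(metric_id(intrinsic))
```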
   
   
Equivalence
servers.host.cpu.total.iowait → "core": "_sum_"
servers.host.cpu.<core-number>.iowait
servers.host.loadavg.15
   
Rollups & aggregation
   
/etc/carbon/storage-aggregation.conf
[min]
pattern = \.min$
aggregationMethod = min
[max]
pattern = \.max$
aggregationMethod = max
[sum]
pattern = \.count$
aggregationMethod = sum
[default_average]
pattern = .*
aggregationMethod = average
   
   
2 kinds of graphite users
   
Self-describing metrics
stat=upper/lower/mean/...
target_type=counter..
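With stat and target_type available as tags, the rollup method could be chosen from the metric itself rather than from regex rules on the name. A sketch only: the function name is hypothetical and the mapping below is an assumption, not something the slides or carbon prescribe.

```python
def rollup_method(tags):
    """Pick a rollup (consolidation) method from a metric's own tags."""
    stat = tags.get("stat", "")
    target_type = tags.get("target_type", "")
    if stat.startswith("upper") or stat == "max":
        return "max"      # keep the worst case when downsampling
    if stat.startswith("lower") or stat == "min":
        return "min"
    if target_type == "count":
        return "sum"      # per-interval counts add up over longer intervals
    if target_type == "counter":
        return "max"      # assumption: counter only ever increases
    return "average"      # gauges, rates, means and anything unknown


print(rollup_method({"unit": "ms", "stat": "upper_90"}))
print(rollup_method({"unit": "Req", "target_type": "count"}))
print(rollup_method({"unit": "B", "target_type": "gauge"}))
```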
   
● stats.timers.render_time.histogram.bin_0.01
● stats.timers.render_time.histogram.bin_0.1
● stats.timers.render_time.histogram.bin_1 → unit=Freq_abs bin_upper=1
● stats.timers.render_time.histogram.bin_10
● stats.timers.render_time.histogram.bin_50
● stats.timers.render_time.histogram.bin_inf
● stats.timers.render_time.lower → unit=ms stat=lower
● stats.timers.render_time.mean → unit=ms stat=mean
● stats.timers.render_time.mean_90 → ...
● stats.timers.render_time.median
● stats.timers.render_time.std
● stats.timers.render_time.upper
● stats.timers.render_time.upper_90
   
Also..
● Graphite API functions such as "cumulative", "summarize" and "smartSummarize"
● Graph renderers
   
   
From: dygraphs.com
   
   
   
   
   
   
Facet based suggestions
   
   
Metric types
● gauge
● count & rate
● counter
● timer
   
   
   
   
   
gauge
● Multiple values in same interval
● “sticky”
   
   
Count & Rate
   
Counter
   
Timer..
   
   
http://janabeck.com/blog/2012/10/12/lessons-learned-from-100/
   
Timer..
   
● What should a metric be?
● Stickiness?
● Behavior on no packets received
● Behavior on multiple packets received
   
My personal takeaways
   
Conclusion
● Building graphs and setting up alerting is cumbersome
● Esp. with changing information needs (troubleshooting, exploring, ..)
● Esp. with complicated information needs
  → PAIN
● Structuring metrics
● Self-describing metrics
● Standardized metrics
● Native metrics 2.0
  → BREEZE
   
Conclusion
● Metrics can be so much more usable and useful. Let's talk about 
tagging, standardisation, retaining information throughout the 
pipeline.
● Converting information needs into graph defs, alerting rules
● Graph-Explorer, carbon-tagger, statsdaemon, …
● Graphite-ng (native metrics 2.0)
● Metrics 2.0 in your apps, agents, aggregators?
● Build out structured metrics library
   
github.com/vimeo
github.com/Dieterbe
twitter.com/Dieter_be
dieter.plaetinck.be
   

Metrics stack 2.0