Debug production server by counter
羅仲成 Roy Lou
17 Media
2016 July
About me
- 17 Media Architect
- Past
  - HTC: cloud backend
  - Google: Google Fiber, embedded systems
  - NVIDIA: VLSI hardware
- roylou@gmail.com
About HTC CSI Project
- Cloud service infrastructure for mobile apps (similar to Parse.com)
- Backed 5+ apps and 3M+ users
- 50 < # of VMs < 200 (Autoscaled)
- ~15 microservices
- Team of 15 engineers
Example apps: One Gallery, Umadeit, Fun Fit
[Diagram: what can go wrong in production]
- External outage
- Internet connectivity
- Intranet connectivity
- ZooKeeper down
- Redis down
- DB down
- Application errors
Problems to Solve
Need a utility to monitor, alert on, and debug production cluster issues:
- Infrastructure outage
- Application outage
What choices do I have?
- Infrastructure monitoring
- Application monitoring (for weakly typed languages)
- Counter
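Counter Example - Read Cache
[Diagram: Client, Cache, DB]

// Get reads key from the cache first and falls back to the DB on a miss.
func (s *Store) Get(key string) ([]byte, error) {
	// Arguments of a deferred call are evaluated here, so time.Now() is the start time.
	defer ctr.Time("get.proc_time", time.Now())
	if val, err := s.Cache.Get(key); err == nil {
		ctr.Event("get.cache_hit", 1)
		return val, nil
	}
	val, err := s.DB.Get(key)
	if err != nil {
		ctr.Event("get.db.err", 1)
		return nil, err
	}
	return val, nil
}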
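The slides do not show the ctr package itself. As a rough idea of what sits behind calls like ctr.Event and ctr.Time, here is a minimal in-memory sketch. The Counters type, its Count method, the ".calls" suffix, and the lookup helper are all hypothetical names made up for this illustration, not the talk's actual library.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Counters is a hypothetical stand-in for the ctr package used on the slides.
type Counters struct {
	mu      sync.Mutex
	counts  map[string]uint64        // event totals by counter name
	elapsed map[string]time.Duration // accumulated time by counter name
}

func NewCounters() *Counters {
	return &Counters{
		counts:  make(map[string]uint64),
		elapsed: make(map[string]time.Duration),
	}
}

// Event adds n to the named counter.
func (c *Counters) Event(name string, n uint64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[name] += n
}

// Time adds the time elapsed since start to the named timer.
// It is meant to be called as: defer c.Time(name, time.Now()).
func (c *Counters) Time(name string, start time.Time) {
	d := time.Since(start)
	c.mu.Lock()
	defer c.mu.Unlock()
	c.elapsed[name] += d
	c.counts[name+".calls"]++ // assumed convention: also count timed calls
}

// Count returns the current total of the named counter.
func (c *Counters) Count(name string) uint64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.counts[name]
}

// lookup mirrors the read-cache example, with a plain map as the cache.
func lookup(c *Counters, cache map[string]string, key string) (string, bool) {
	defer c.Time("get.proc_time", time.Now())
	val, ok := cache[key]
	if ok {
		c.Event("get.cache_hit", 1)
	}
	return val, ok
}

func main() {
	c := NewCounters()
	cache := map[string]string{"a": "1"}
	lookup(c, cache, "a") // hit
	lookup(c, cache, "b") // miss
	fmt.Println("hits:", c.Count("get.cache_hit"), "calls:", c.Count("get.proc_time.calls"))
	// prints "hits: 1 calls: 2"
}
```

The point of the design is that instrumenting a code path costs one line, and the hit ratio falls out of two counters without any logging.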
Counter Example - HTTP Roundtrip
[Diagram: Client, Counter Roundtripper, Server]

// RoundTrip wraps the underlying transport t.rt and counts every request.
func (t *RoundTripper) RoundTrip(req *http.Request) (*http.Response, error) {
	ctr.Event("qps", 1)
	// ContentLength is -1 when unknown; skip it rather than wrap around in uint64.
	if req.ContentLength > 0 {
		ctr.Event("send.bytes", uint64(req.ContentLength))
	}
	defer ctr.Time("latency", time.Now())
	res, err := t.rt.RoundTrip(req)
	if err == nil {
		ctr.Event(fmt.Sprintf("status.%d", res.StatusCode), 1)
	} else {
		ctr.Err("internal.err", 1)
	}
	return res, err
}
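A counting round tripper only helps once it is installed as the Transport of an http.Client. The sketch below shows that wiring against a local httptest server. The countingTransport type and the probe helper are hypothetical, and a plain map stands in for the ctr package, which the slides do not show.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
)

// countingTransport is a hypothetical wrapper that tallies requests by outcome.
type countingTransport struct {
	next   http.RoundTripper
	mu     sync.Mutex
	counts map[string]int
}

func (t *countingTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	res, err := t.next.RoundTrip(req)
	t.mu.Lock()
	defer t.mu.Unlock()
	t.counts["qps"]++
	if err != nil {
		t.counts["internal.err"]++
	} else {
		t.counts[fmt.Sprintf("status.%d", res.StatusCode)]++
	}
	return res, err
}

// probe sends n GET requests to a throwaway local server and returns the tallies.
func probe(n int) (map[string]int, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusNoContent)
	}))
	defer srv.Close()

	ct := &countingTransport{next: http.DefaultTransport, counts: map[string]int{}}
	client := &http.Client{Transport: ct} // every request now passes through ct
	for i := 0; i < n; i++ {
		res, err := client.Get(srv.URL)
		if err != nil {
			return nil, err
		}
		res.Body.Close()
	}
	return ct.counts, nil
}

func main() {
	counts, err := probe(3)
	if err != nil {
		fmt.Println("probe failed:", err)
		return
	}
	fmt.Println(counts["qps"], counts["status.204"])
}
```

Because the wrapper implements http.RoundTripper, callers keep using the client as usual and pick up per-status counters without code changes.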
Counter Pipeline
[Diagram: App in a Container, Fluentd Agent on the VM]
Elasticsearch (ES) alternative: Prometheus
How Frequently Should I Send Counters?
- Option 1: Forward every count to Elasticsearch
  1,000 counters/container * 100 counts/second = 100k qps
- Option 2: Aggregate locally before forwarding
For us: aggregate and send every 30 sec
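Local aggregation can be sketched as a map of running totals that is swapped out on every flush, so each counter produces one record per interval instead of one per event. The Aggregator type and its send callback are hypothetical names, and the code says nothing about how the talk's Fluentd agent actually does this.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Aggregator accumulates counts in memory between flushes (hypothetical type).
type Aggregator struct {
	mu     sync.Mutex
	totals map[string]uint64
}

func NewAggregator() *Aggregator {
	return &Aggregator{totals: make(map[string]uint64)}
}

// Event adds n to the named counter without any network I/O.
func (a *Aggregator) Event(name string, n uint64) {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.totals[name] += n
}

// Flush returns the totals accumulated since the previous flush and resets them.
func (a *Aggregator) Flush() map[string]uint64 {
	a.mu.Lock()
	defer a.mu.Unlock()
	out := a.totals
	a.totals = make(map[string]uint64)
	return out
}

// Run flushes once per interval until stop is closed, handing each batch to send.
func (a *Aggregator) Run(interval time.Duration, stop <-chan struct{}, send func(map[string]uint64)) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			send(a.Flush())
		case <-stop:
			send(a.Flush()) // forward the final partial window
			return
		}
	}
}

func main() {
	agg := NewAggregator()
	for i := 0; i < 3000; i++ {
		agg.Event("qps", 1)
	}
	// 3000 events collapse into a single record for this window.
	fmt.Println(agg.Flush()["qps"]) // prints 3000
}
```

In a service, Run would be started in a goroutine with a 30-second interval and a send function that writes to the agent.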
How Long Can I Store Counters?
- 50,000 counters
- 1 record every 30 seconds
To save counters for 1 year:
50,000 (counters) * 4 (bytes) * 2 (records/minute) * 525,600 (mins/year)
= 210,240,000,000 bytes
= 210.24 GB
Need to aggregate for long-term storage
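The storage estimate above is easy to check in code. This sketch only multiplies the figures from the slide; it counts the 4-byte values alone, so whatever per-record overhead a real store adds (names, timestamps, indexes) would come on top. The function name is made up for the example.

```go
package main

import "fmt"

// bytesPerYear estimates raw value storage for one year of counter records.
func bytesPerYear(counters, bytesPerRecord, recordsPerMinute int64) int64 {
	const minutesPerYear = 525600 // 365 days * 24 hours * 60 minutes
	return counters * bytesPerRecord * recordsPerMinute * minutesPerYear
}

func main() {
	b := bytesPerYear(50000, 4, 2)
	fmt.Printf("%d bytes = %.2f GB\n", b, float64(b)/1e9)
	// prints "210240000000 bytes = 210.24 GB"
}
```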
[Diagram: App in a Container, Fluentd Agent on the VM, Counter Aggregator]
Counter Granularity:
- Past 10 days: 30 sec
- Past 1 month: 5 min
- Past 3 months: 30 min
- Past year: 1 hr
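One way to read the granularity table is as a function from record age to retained interval, plus a rollup step that merges fine-grained records into coarser ones. The sketch below assumes records are merged by summing, which suits event totals; the slides do not say how the Counter Aggregator merges them, and both function names are hypothetical. Data older than a year is simply kept at 1 hr here, which is also an assumption.

```go
package main

import (
	"fmt"
	"time"
)

// granularityFor returns the record interval retained for data of the given age.
func granularityFor(age time.Duration) time.Duration {
	const day = 24 * time.Hour
	switch {
	case age <= 10*day:
		return 30 * time.Second
	case age <= 30*day:
		return 5 * time.Minute
	case age <= 90*day:
		return 30 * time.Minute
	default:
		return time.Hour
	}
}

// rollup sums each run of factor consecutive samples into one coarser sample.
// A trailing partial run is summed as well.
func rollup(samples []uint64, factor int) []uint64 {
	if factor <= 0 {
		return nil
	}
	var out []uint64
	for i := 0; i < len(samples); i += factor {
		end := i + factor
		if end > len(samples) {
			end = len(samples)
		}
		var sum uint64
		for _, v := range samples[i:end] {
			sum += v
		}
		out = append(out, sum)
	}
	return out
}

func main() {
	// Twenty 30-second samples become two 5-minute samples (factor 10).
	samples := make([]uint64, 20)
	for i := range samples {
		samples[i] = 3
	}
	fmt.Println(rollup(samples, 10), granularityFor(20*24*time.Hour))
	// prints "[30 30] 5m0s"
}
```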
Time series counter
Topology View
Deploy with Counters
[Diagram: git push -> code review -> CI -> docker push -> Docker Registry -> deploy]
- Mon night: Code freeze
- Tue morning: Deploy to staging
- If okay, deploy to production: 30% => 50% => 100%
Rolling to X%:
- Health check
- Manually inspect counters
- Minimal e2e test
- Compare counters with last deploy
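The last rollout check, comparing counters with the previous deploy, could be automated along these lines. This is a sketch under assumptions: counter values are taken to be comparable rates over equal windows, the 50% tolerance is arbitrary, and the changed function is hypothetical rather than anything shown in the talk.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// changed returns the names of counters whose value moved by more than tol
// (a fraction, e.g. 0.5 for 50%) between the last deploy and the current one.
// Counters that were absent or zero before and are nonzero now are reported too.
func changed(last, current map[string]float64, tol float64) []string {
	var out []string
	for name, cur := range current {
		base := last[name]
		if base == 0 {
			if cur != 0 {
				out = append(out, name)
			}
			continue
		}
		if math.Abs(cur-base)/base > tol {
			out = append(out, name)
		}
	}
	sort.Strings(out) // stable order for reports
	return out
}

func main() {
	last := map[string]float64{"qps": 1000, "status.200": 990, "status.500": 10}
	current := map[string]float64{"qps": 1020, "status.200": 950, "status.500": 70}
	fmt.Println(changed(last, current, 0.5)) // prints [status.500]
}
```

Anything the function reports is a candidate for the manual inspection step before rolling to the next percentage.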
Monitor / Alert with Counters
[Diagram: App in a Container, Fluentd Agent on the VM, Counter Aggregator, Cron Server]
eQstr = 'host:"prod-cg-docvcs-group" AND pkg:docvcs_worker AND name:overall.err'
rQstr = 'host:"prod-cg-docvcs-group" AND pkg:docvcs_worker AND name:overall.request'
errors = esq_scalar('sum', 'total', eQstr, 'now-5m', 'now')
requests = esq_scalar('sum', 'total', rQstr, 'now-5m', 'now')
error_rate = errors * 100 / requests
-- Error rate is a percentage of requests; it should stay below 10%
alert_p2('docvcs fail_rate', error_rate, '>', 10, '15m')
alert_p0('docvcs fail_rate', error_rate, '>', 10, '45m')
Alarm when high error rate
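The alert script boils down to two steps: compute an error percentage, then escalate depending on how long it has stayed above the threshold. The Go sketch below reproduces that logic on in-memory samples. It assumes one sample every 5 minutes and treats the last arguments of alert_p2 and alert_p0 (15m and 45m) as required breach durations, which is my reading of the slide rather than documented behavior; all function names are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// errorRate returns errors as a percentage of requests (0 when there is no traffic).
func errorRate(errors, requests float64) float64 {
	if requests == 0 {
		return 0
	}
	return errors * 100 / requests
}

// breachDuration reports how long the most recent samples have stayed above
// threshold, given one sample per interval with the oldest sample first.
func breachDuration(rates []float64, threshold float64, interval time.Duration) time.Duration {
	n := 0
	for i := len(rates) - 1; i >= 0 && rates[i] > threshold; i-- {
		n++
	}
	return time.Duration(n) * interval
}

// severity maps a breach duration to the alert levels named on the slide.
func severity(d time.Duration) string {
	switch {
	case d >= 45*time.Minute:
		return "p0"
	case d >= 15*time.Minute:
		return "p2"
	default:
		return "" // no alert
	}
}

func main() {
	rates := []float64{2, 12, 15, 11} // percent, one sample every 5 minutes
	d := breachDuration(rates, 10, 5*time.Minute)
	fmt.Println(d, severity(d)) // prints "15m0s p2"
}
```

Guarding the division matters in practice: a worker group with no requests in the window would otherwise produce NaN or Inf and either never alert or always alert.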
Debug with Counters
- GDB
- Bisect with log
- Bisect with counters
Autoscale with Counters
[Diagram: App in a Container, Fluentd Agent on the VM, Counter Aggregator, Cron Server, gcloud cli]
qstr = 'name: docvcs.jobs.min.outstanding'
outstanding = esq_scalar(qstr, 'now-10m', 'now')
workload = outstanding / 200
autoscale(workload, 'docvcs', 6, 30, 6, 'diff', 0.65, 0.2, 2/3)
Arguments annotated on the slide:
- 6: minimum # of instances
- 30: maximum # of instances
- 6: maximum # of VMs to be scaled
- 0.65: target workload
- 0.2: safeguard
[Plot: ΔInstance vs. workload, with workload marks at 0.45, 0.65, and 0.85, the safeguard band labeled, and a ΔInstance mark at 6]
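The slides do not show how autoscale turns these numbers into a scaling decision. One plausible reading of the plot is a dead band: no change while the workload is within the safeguard distance of the target, otherwise a step toward the target that is capped per run and clamped to the instance limits. The sketch below implements that reading with a proportional rule, which is my assumption; the 'diff' and 2/3 arguments are not modeled, and scaleDelta is a hypothetical name.

```go
package main

import (
	"fmt"
	"math"
)

// scaleDelta returns how many instances to add (positive) or remove (negative).
// It returns 0 while workload stays within safeguard of target.
func scaleDelta(workload float64, current, minN, maxN, maxStep int, target, safeguard float64) int {
	if math.Abs(workload-target) <= safeguard {
		return 0 // inside the safeguard band: leave the cluster alone
	}
	// Assumed proportional rule: size the cluster so workload returns to target.
	desired := int(math.Ceil(float64(current) * workload / target))
	if desired < minN {
		desired = minN
	}
	if desired > maxN {
		desired = maxN
	}
	delta := desired - current
	if delta > maxStep {
		delta = maxStep // never add more than maxStep in one run
	}
	if delta < -maxStep {
		delta = -maxStep // never remove more than maxStep in one run
	}
	return delta
}

func main() {
	// Slide values: min 6, max 30, max step 6, target 0.65, safeguard 0.2.
	fmt.Println(scaleDelta(0.70, 10, 6, 30, 6, 0.65, 0.2)) // prints 0
	fmt.Println(scaleDelta(1.30, 10, 6, 30, 6, 0.65, 0.2)) // prints 6
	fmt.Println(scaleDelta(0.10, 10, 6, 30, 6, 0.65, 0.2)) // prints -4
}
```

The band keeps the cron job from resizing the cluster on every small fluctuation, and the step cap limits the damage if a counter is wrong.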
Business Logic with Counters
What else can counters do?
What can't counters do?
- Counters can solve a problem that affects 90% of users.
- Counters can't solve a problem that affects 1 user.
- For that, you need logs.
Summary of Counter
A line of code. Can be used for:
- Rolling update
- Monitor / alert
- Debug cluster
- Autoscale cluster
- Simple business logic
- And many others (use your imagination)
Thank You
roylou@gmail.com
