Debug production server by counter

羅仲成 Roy Lou
17 Media
2016 July

About me
- 17 Media Architect
- Past
- HTC: cloud backend
- Google: Google Fiber, embedded system
- NVIDIA: VLSI hardware
- roylou@gmail.com

About HTC CSI Project
- Cloud service infrastructure for
mobile apps (similar to Parse.com)
- Backed 5+ apps and 3M+ users
- 50 < # of VMs < 200 (Autoscaled)
- ~15 microservices
- Team of 15 engineers
Apps: One Gallery, Umadeit, (Fun Fit)

[Diagram: sources of outage]
- External outage
- Internet connectivity
- ZooKeeper down
- Application errors
- Intranet connectivity
- Redis down
- DB down

Problems to Solve
Need a utility to monitor, alert on, and debug production cluster issues:
- Infrastructure outage
- Application outage

What choices do I have?
Infrastructure monitoring
Application monitoring (for weakly typed languages)

Counter Example - Read Cache
[Diagram: Client, Cache, DB]
func (s *Store) Get(key string) ([]byte, error) {
	// Record the total processing time when Get returns.
	defer ctr.Time("get.proc_time", time.Now())
	// Serve from the cache when possible.
	if val, err := s.Cache.Get(key); err == nil {
		ctr.Event("get.cache_hit", 1)
		return val, nil
	}
	// Cache miss: fall back to the database.
	val, err := s.DB.Get(key)
	if err != nil {
		ctr.Event("get.db.err", 1)
		return nil, err
	}
	return val, nil
}

Counter Example - HTTP Roundtrip
[Diagram: Client, Counter Roundtripper, Server]
func (t *RoundTripper) RoundTrip(req *http.Request) (*http.Response, error) {
	ctr.Event("qps", 1)
	// ContentLength is -1 when the length is unknown; count bytes only when known.
	if req.ContentLength > 0 {
		ctr.Event("send.bytes", uint64(req.ContentLength))
	}
	defer ctr.Time("latency", time.Now())
	res, err := t.rt.RoundTrip(req)
	if err == nil {
		// One counter per HTTP status code, e.g. status.200, status.500.
		ctr.Event(fmt.Sprintf("status.%d", res.StatusCode), 1)
	} else {
		ctr.Err("internal.err", 1)
	}
	return res, err
}

Counter Pipeline
[Diagram: App Container, Fluentd Agent, VM]

Counter Pipeline
[Diagram: App Container, Fluentd Agent, VM]
ES alternative: Prometheus


How Frequently Should I Send Counters?
Option 1: Forward every counter to Elasticsearch
Option 2: Aggregate locally before forwarding
Forwarding every count: 1,000 counters/container * 100 counts/second = 100k qps
For us: aggregate locally and send every 30 sec
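Option 2 can be sketched as an in-memory aggregator that sums increments and hands back one batch per flush. This is an illustration under assumptions, not the pipeline from the talk: the `Aggregator` type is hypothetical, and the forwarding step (Fluentd, Elasticsearch) is left out.

```go
package main

import (
	"fmt"
	"sync"
)

// Aggregator sums counter increments locally between flushes (hypothetical type).
type Aggregator struct {
	mu     sync.Mutex
	totals map[string]uint64
}

func NewAggregator() *Aggregator {
	return &Aggregator{totals: make(map[string]uint64)}
}

// Event adds n to the named counter without any network traffic.
func (a *Aggregator) Event(name string, n uint64) {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.totals[name] += n
}

// Flush returns the totals accumulated since the previous flush and resets them.
func (a *Aggregator) Flush() map[string]uint64 {
	a.mu.Lock()
	defer a.mu.Unlock()
	out := a.totals
	a.totals = make(map[string]uint64)
	return out
}

func main() {
	agg := NewAggregator()
	for i := 0; i < 3000; i++ {
		agg.Event("qps", 1)
	}
	// In a real service a time.Ticker firing every 30 seconds would call
	// Flush and forward the batch; here it is called once by hand.
	batch := agg.Flush()
	fmt.Println(len(batch), batch["qps"])
}
```

With this shape, 3,000 increments become a single record per counter per interval, so the write rate depends on the number of counters and the flush period rather than on traffic.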


How Long Can I Store Counters?
- 50,000 counters
- 1 record every 30 seconds
To save counters for 1 year:
50,000 * 4 (bytes) * 2 (records/minute) * 525,600 (mins/year)
= 210,240,000,000 bytes
= 210.24 GB
Need to aggregate for long-term storage
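The estimate can be checked by multiplying out the slide's figures. The sketch below counts raw 4-byte values only and ignores any index or document overhead in the storage backend; the function name `rawBytes` is made up for this example. The product is 210,240,000,000 bytes, which is about 210 GB in decimal units.

```go
package main

import "fmt"

// rawBytes returns the bytes needed to store one fixed-size value
// per counter per record over the given number of minutes.
func rawBytes(counters, bytesPerValue, recordsPerMinute, minutes int64) int64 {
	return counters * bytesPerValue * recordsPerMinute * minutes
}

func main() {
	const minutesPerYear = 365 * 24 * 60 // 525,600
	b := rawBytes(50000, 4, 2, minutesPerYear)
	fmt.Printf("%d bytes = %.2f GB\n", b, float64(b)/1e9)
}
```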

[Diagram: App Container, Fluentd Agent, VM, Counter Aggregator]
Counter Granularity:
- Past 10 days: 30 sec
- Past 1 month: 5 min
- Past 3 months: 30 min
- Past year: 1 hr
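Moving a counter from one granularity to the next amounts to merging fine-grained buckets into coarser ones. A minimal sketch, assuming each record holds a bucket start time and a summed count; the `Sample` type and `Downsample` function are hypothetical and do not describe the actual Counter Aggregator.

```go
package main

import "fmt"

// Sample is one aggregated counter value for a time bucket.
type Sample struct {
	Start int64  // bucket start, Unix seconds (non-negative)
	Total uint64 // sum of counts within the bucket
}

// Downsample merges samples into buckets of the given width (seconds)
// by summing totals. The input must be sorted by Start.
func Downsample(in []Sample, width int64) []Sample {
	var out []Sample
	for _, s := range in {
		start := s.Start - s.Start%width // align to the coarser bucket
		if n := len(out); n > 0 && out[n-1].Start == start {
			out[n-1].Total += s.Total
			continue
		}
		out = append(out, Sample{Start: start, Total: s.Total})
	}
	return out
}

func main() {
	// Twenty 30-second samples (10 minutes), one count each.
	var fine []Sample
	for i := int64(0); i < 20; i++ {
		fine = append(fine, Sample{Start: i * 30, Total: 1})
	}
	coarse := Downsample(fine, 300) // 5-minute buckets
	fmt.Println(len(coarse), coarse[0].Total, coarse[1].Total)
}
```

Summing suits event counters; timers or gauges would need a different merge rule (for example min, max, or a weighted mean), which the slides do not specify.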

Deploy with Counters
[Diagram: git push, code review, CI, docker push, Docker Registry, deploy]
- Mon night: Code freeze
- Tue morning: Deploy to staging
- If okay, deploy to production
30% => 50% => 100%

Rolling to X%
- Health check
- Manually inspect counters
- Minimal e2e test
- Compare counters with last deploy

Monitor/Alert with Counters
[Diagram: App Container, Fluentd Agent, VM, Counter Aggregator]

[Diagram: App Container, Fluentd Agent, VM, Counter Aggregator, Cron Server]

Alarm when high error rate
eQstr = 'host:"prod-cg-docvcs-group" AND pkg:docvcs_worker AND name:overall.err'
rQstr = 'host:"prod-cg-docvcs-group" AND pkg:docvcs_worker AND name:overall.request'
errors = esq_scalar('sum', 'total', eQstr, 'now-5m', 'now')
requests = esq_scalar('sum', 'total', rQstr, 'now-5m', 'now')
error_rate = errors * 100 / requests
-- Error rate (%) should stay below 10
alert_p2('docvcs fail_rate', error_rate, '>', 10, '15m')
alert_p0('docvcs fail_rate', error_rate, '>', 10, '45m')

Debug with Counters
- GDB
- Bisect with log
- Bisect with counters

Autoscale with Counters
[Diagram: App Container, Fluentd Agent, VM, Counter Aggregator, Cron Server]

[Diagram: App Container, Fluentd Agent, VM, Counter Aggregator, Cron Server, gcloud CLI]

qstr = 'name: docvcs.jobs.min.outstanding'
outstanding = esq_scalar(qstr, 'now-10m', 'now')
workload = outstanding / 200
autoscale(workload, 'docvcs', 6, 30, 6, 'diff', 0.65, 0.2, 2/3)

autoscale(workload, 'docvcs', 6, 30, 6, 'diff', 0.65, 0.2, 2/3)
- 6: minimum # of instances
- 30: maximum # of instances
- 6: maximum # of VMs to be scaled
- 0.65: target workload
- 0.2: safeguard
[Chart: ΔInstance against workload; axis marks at 0.45, 0.65, 0.85 and 6; region labeled safeguard]
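One way to read the chart is as a dead band: no scaling while the workload stays within target ± safeguard (0.45 to 0.85), and a bounded change in instance count outside it. The sketch below follows that reading. The proportional sizing rule is an assumption, the `Policy` type is hypothetical, and the `'diff'` and `2/3` arguments are not modeled because the slides do not explain them.

```go
package main

import (
	"fmt"
	"math"
)

// Policy holds the labeled arguments of the autoscale call (hypothetical type).
type Policy struct {
	Min, Max  int     // minimum and maximum number of instances
	MaxStep   int     // maximum number of VMs added or removed per run
	Target    float64 // target workload
	Safeguard float64 // half-width of the no-scaling band around Target
}

// Next returns the instance count to run given the current count and workload.
func (p Policy) Next(current int, workload float64) int {
	// Inside the safeguard band: leave the cluster size alone.
	if math.Abs(workload-p.Target) <= p.Safeguard {
		return clamp(current, p.Min, p.Max)
	}
	// Assumed rule: size the cluster so that workload returns to Target.
	want := int(math.Ceil(float64(current) * workload / p.Target))
	delta := clamp(want-current, -p.MaxStep, p.MaxStep)
	return clamp(current+delta, p.Min, p.Max)
}

func clamp(v, lo, hi int) int {
	if v < lo {
		return lo
	}
	if v > hi {
		return hi
	}
	return v
}

func main() {
	p := Policy{Min: 6, Max: 30, MaxStep: 6, Target: 0.65, Safeguard: 0.2}
	fmt.Println(p.Next(10, 0.70), p.Next(10, 1.30), p.Next(10, 0.20))
}
```

The step limit and the band both damp oscillation: a noisy counter near the target changes nothing, and a large spike grows the cluster by at most six VMs per run.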

What can’t counters do?
Counters help solve problems that affect 90% of users.
Counters can’t solve a problem that affects 1 user.

Summary of Counter
A line of code. Can be used for:
- Rolling update
- Monitor / alert
- Debug cluster
- Autoscale cluster
- Simple business logic
- And many others (use your imagination)
