Metrics-driven development,
an observability perspective
Huy Do
LINE Corp
Introduction
• Huy Do
• Software Engineer on the Observability Team
• Founded kipalog.com & Ruby Vietnam group
Agenda
• Metrics driven culture at LINE
• Introduction to our observability stack
LINE
• A lot of end users (~170M active)
• A lot of traffic
• A lot of services (delivery, taxi, games,
manga…)
What we care about
• User Experience
• One important aspect of User Experience is Reliability
RELIABILITY
• No Downtime
• Low MTTR (Mean Time To Repair)
• Fast Response
• Fair response time
• Fair percentile latency: p99, p95, p50
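As a rough illustration of the percentile latencies listed above, the sketch below computes p50, p95 and p99 from a list of response times with the nearest-rank method. The function name and the sample data are assumptions made for this example; the talk does not say how LINE computes its percentiles.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]

# Hypothetical response times in milliseconds: 1, 2, ..., 100
latencies_ms = list(range(1, 101))
p50, p95, p99 = (percentile(latencies_ms, p) for p in (50, 95, 99))
print(p50, p95, p99)
```

Tail percentiles such as p99 expose the slow requests that an average would hide, which is why they are tracked alongside the median.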
HOW
CULTURE
• EVERY engineer MUST care about the status of their applications
• EVERY engineer MUST take part in the on-call rotation
• NO "application engineers" who only write code
• We have a dedicated team that provides stable tools so engineers can keep track of their application status
APPLICATION STATUS?
OBSERVABILITY
“Observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs”
– Wikipedia
METRICS
LOGGING
TRACING
https://speakerdeck.com/adriancole/observability-3-ways-logging-metrics-and-tracing
METRICS
• Metrics
• The simplest form is a triple
• (name, value, timestamp)
• Can be represented as a graph
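A minimal sketch of the (name, value, timestamp) triple, assuming Unix timestamps in seconds. The class, the helper and the metric names are hypothetical and only show how a list of triples turns into a plottable series.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Metric:
    name: str
    value: float
    timestamp: int  # Unix time in seconds (an assumption for this sketch)

def to_series(points, name):
    """Return (timestamp, value) pairs for one metric name, ordered by time, ready to plot."""
    return sorted((m.timestamp, m.value) for m in points if m.name == name)

points = [
    Metric("queue.size", 12.0, 1_700_000_060),
    Metric("cpu.user", 0.42, 1_700_000_000),
    Metric("queue.size", 7.0, 1_700_000_000),
]
series = to_series(points, "queue.size")
print(series)
```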
METRICS
• System Metrics
• CPU / disk I/O / network / disk usage...
• MUST: have alerts on critical metrics by default (users don't know what to monitor, or what a good threshold is)
• Application Metrics
• Internal queue size, endpoint latency percentiles (p50, p95, p99), request size, request count
METRICS
• At LINE we care A LOT about Application Metrics
• We try to instrument every newly added piece of logic
• Some of our heavy servers export over 10,000 metrics per server
LOGGING
Warn / Error / Fatal logs for alerting
LOGGING
• At LINE, all error / warning logs MUST be
• Permanently stored (for troubleshooting later)
• Used for alerting
• Easy to query (you should not have to go to each host and grep the access log)
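One way to picture the requirements above: a log handler that lets only WARNING and higher through and hands those records to a central sink, so nobody has to grep individual hosts. This is a sketch using Python's standard `logging` module; the handler class, the logger name and the in-memory `shipped` list (standing in for a network sink) are assumptions, not LINE's implementation.

```python
import logging

class AlertingHandler(logging.Handler):
    """Collects WARNING-and-above records so they can be shipped to central storage."""

    def __init__(self):
        super().__init__(level=logging.WARNING)
        self.shipped = []  # stand-in for a network sink (assumption)

    def emit(self, record):
        self.shipped.append({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "timestamp": record.created,
        })

logger = logging.getLogger("demo.app")  # hypothetical logger name
logger.setLevel(logging.DEBUG)
logger.propagate = False
handler = AlertingHandler()
logger.addHandler(handler)

logger.info("request served")            # below WARNING: not shipped
logger.warning("queue is almost full")   # shipped
logger.error("upstream timed out")       # shipped
```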
LOGGING
Real-time error/warn log analysis with the help of Elasticsearch / Kibana
LOGGING
Daily report on error trends
TRACING
• Not a common concept in conventional services
• Very helpful in microservice or fully asynchronous systems, where a response can come from multiple services or multiple async threads
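To make the idea concrete, here is a small sketch of how one request can be followed across services: every unit of work is a span that carries the shared trace ID plus a pointer to its parent span. It is loosely modelled on the trace ID / span ID / parent ID idea used by tracers such as Zipkin; the class, the helpers and the service names are hypothetical.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    trace_id: str                    # shared by every span of one request
    name: str
    parent_id: Optional[str] = None  # None marks the root span
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])

def start_trace(name):
    """Create the root span of a new trace."""
    return Span(trace_id=uuid.uuid4().hex, name=name)

def child_of(parent, name):
    """Create a span for downstream work, inheriting the trace ID."""
    return Span(trace_id=parent.trace_id, name=name, parent_id=parent.span_id)

root = start_trace("GET /timeline")
auth = child_of(root, "auth-service")
feed = child_of(root, "feed-service")
db = child_of(feed, "feed-db query")
spans = [root, auth, feed, db]
```

Grouping spans by `trace_id` and following `parent_id` links rebuilds the call tree of a single request, even when the work ran in different services or threads.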
TRACING
OpenZipkin
LINE OBSERVABILITY STACK
• We call it IMON
• IMON can
• Aggregate metrics from tens of thousands of hosts and alert on them
• Aggregate warn/error logs from applications and alert on them
• (Ongoing) Trace requests across services
HOW BIG?
• ~20 million metrics per minute
• And growing every day
• ~500k log entries received per minute (peaks can reach a few million)
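For a sense of scale, the per-minute figures above can be converted to per-second rates. This is plain arithmetic on the numbers from the slide and assumes load is spread evenly across each minute, which real traffic is not.

```python
METRICS_PER_MINUTE = 20_000_000  # from the slide
LOGS_PER_MINUTE = 500_000        # from the slide (off-peak)

metrics_per_second = METRICS_PER_MINUTE / 60
logs_per_second = LOGS_PER_MINUTE / 60
print(f"{metrics_per_second:,.0f} metrics/s, {logs_per_second:,.0f} logs/s")
```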
ARCHITECTURE
Metrics driven development with dedicated Observability Team
DETAILS
METRICS DATABASE
• Sharded MySQL cluster (~50 servers)
• Partitioned by “customer”
• Batched writes for better throughput
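A sketch of the two ideas above, partitioning by customer and batching writes. Hashing the customer name to pick a shard is an assumption (the talk only says the data is partitioned by customer, which could equally be a lookup table), and `flush_fn` is a hypothetical stand-in for a multi-row INSERT against one MySQL shard.

```python
import zlib
from collections import defaultdict

NUM_SHARDS = 50  # roughly the cluster size mentioned on the slide

def shard_for(customer: str) -> int:
    """Stable shard index for a customer (crc32 gives the same result in every process)."""
    return zlib.crc32(customer.encode("utf-8")) % NUM_SHARDS

class BatchingWriter:
    """Buffers rows per shard and flushes them in groups instead of one by one."""

    def __init__(self, flush_fn, batch_size=3):
        self.flush_fn = flush_fn
        self.batch_size = batch_size
        self.pending = defaultdict(list)  # shard index -> buffered rows

    def write(self, customer, row):
        shard = shard_for(customer)
        self.pending[shard].append(row)
        if len(self.pending[shard]) >= self.batch_size:
            self.flush(shard)

    def flush(self, shard):
        rows = self.pending.pop(shard, [])
        if rows:
            self.flush_fn(shard, rows)  # e.g. one multi-row INSERT

flushed = []
writer = BatchingWriter(lambda shard, rows: flushed.append((shard, rows)))
for i in range(7):
    writer.write("team-a", ("cpu.user", 0.1 * i, 1_700_000_000 + i))
```

With a batch size of 3, seven writes produce two flushes and leave one row buffered. A real writer would also flush on a timer so that a quiet shard does not hold data indefinitely.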
METRICS DATABASE
• MySQL is not a good fit as a time series database
• What makes a "good TSDB"?
• Compression
• Optimized for writes, but reads MUST be fast enough
• Flexible queries (topK, rate, delta)
• Fast aggregation
• We're moving to OpenTSDB
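The three query types named above can be sketched on plain in-memory lists of (timestamp, value) pairs. These helpers only illustrate what the operations mean; they are not OpenTSDB's API, and `rate` deliberately ignores counter resets.

```python
import heapq

def delta(series):
    """Difference between consecutive values."""
    return [(t1, v1 - v0) for (_, v0), (t1, v1) in zip(series, series[1:])]

def rate(series):
    """Per-second change between consecutive points (no counter-reset handling)."""
    return [(t1, (v1 - v0) / (t1 - t0))
            for (t0, v0), (t1, v1) in zip(series, series[1:])]

def top_k(latest_by_name, k):
    """Names of the k series with the largest latest value, largest first."""
    return heapq.nlargest(k, latest_by_name, key=latest_by_name.get)

# Hypothetical request counter sampled once a minute: (timestamp, value)
requests_total = [(0, 100), (60, 400), (120, 1000)]
deltas = delta(requests_total)
rates = rate(requests_total)
busiest = top_k({"api": 3, "feed": 9, "auth": 5}, 2)
```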
LOGGING DATABASE
• Elasticsearch stores warn/error logs
• Elasticsearch is very good at writes (helped by write batching in the application layer)
• However, a few bad read queries can kill the server
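Write batching in the application layer typically means collecting many log documents and sending them in one request to Elasticsearch's `_bulk` endpoint. The sketch below only builds the newline-delimited request body and sends nothing. The index name and documents are hypothetical, and the action line without a `_type` matches recent Elasticsearch versions rather than whatever version was in use at the time of the talk.

```python
import json

def build_bulk_body(index, docs):
    """Serialize documents into the newline-delimited body expected by the _bulk endpoint."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))  # action line
        lines.append(json.dumps(doc))                            # document source
    return "\n".join(lines) + "\n"  # the body must end with a newline

logs = [
    {"level": "ERROR", "message": "upstream timed out"},
    {"level": "WARN", "message": "queue is almost full"},
]
body = build_bulk_body("app-logs", logs)  # hypothetical index name
print(body)
```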
TELEMETRY AGENT
• We wrote our own in Go
• Similar architecture to Telegraf (but with a buffer)
• Fully managed
• Monitor CPU / memory usage of all agents
• Monitor errors from all agents
• Automatic rollout
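The value of a buffer in the agent can be shown with a small sketch: points are kept in a bounded queue and survive a failed send, so a short collector outage does not lose data. The real agent is written in Go and its design is not described in the talk; this Python class, its drop-oldest policy and the `send` function are assumptions for illustration.

```python
from collections import deque

class BufferedAgent:
    def __init__(self, send_fn, capacity=1000):
        self.send_fn = send_fn
        self.buffer = deque(maxlen=capacity)  # oldest points are dropped when full

    def collect(self, point):
        self.buffer.append(point)

    def flush(self):
        """Try to send everything buffered; keep the data if sending fails."""
        batch = list(self.buffer)
        if not batch:
            return 0
        try:
            self.send_fn(batch)
        except ConnectionError:
            return 0  # keep buffered points for the next attempt
        self.buffer.clear()
        return len(batch)

state = {"backend_up": False}  # simulated collector availability
received = []

def send(batch):
    if not state["backend_up"]:
        raise ConnectionError("collector unreachable")
    received.extend(batch)

agent = BufferedAgent(send, capacity=3)
for i in range(5):
    agent.collect(("cpu.user", i))
first = agent.flush()    # collector down: nothing is sent, data stays buffered
state["backend_up"] = True
second = agent.flush()   # collector up: the 3 newest points are sent
```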
ROUTING GATEWAY
• Flexible routing rules
• Dedicated collectors for big customers
• Drop requests via dynamic configuration
• Built with Armeria and Central Dogma
https://github.com/line/armeria
https://github.com/line/centraldogma
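The routing behaviour listed above can be sketched as a lookup against a configuration object that may change at runtime. In the real gateway such updates presumably arrive through Central Dogma; here the configuration is a plain in-memory object, and all collector and customer names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class RoutingConfig:
    default_collector: str = "collector-shared"
    dedicated: dict = field(default_factory=dict)  # customer -> dedicated collector
    dropped: set = field(default_factory=set)      # customers whose requests are dropped

def route(config, customer):
    """Return the collector for a customer, or None if the request should be dropped."""
    if customer in config.dropped:
        return None
    return config.dedicated.get(customer, config.default_collector)

config = RoutingConfig(dedicated={"big-customer": "collector-big"})
before = route(config, "noisy-team")
config.dropped.add("noisy-team")   # simulates a dynamic configuration update
after = route(config, "noisy-team")
big = route(config, "big-customer")
```

Because the drop list is consulted on every request, a misbehaving sender can be cut off without redeploying the gateway.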
FUTURE
• A faster, more stable TSDB
• Wire everything together
• For every alert, see the big picture with metrics / logs / traces in the same place
• Autonomous alerting
• With the help of machine learning
FINALLY
• How you monitor reflects your engineering culture
• Data-driven culture
• Stability-driven culture
• Monitoring is NOT only for DevOps engineers or sysadmins, but for EVERY ENGINEER
Thank you for listening
