Argus Production Monitoring at Salesforce
Service Health & Observability at Scale
Tom Valine
Director, Infrastructure Engineering
tvaline@salesforce.com
in/tvaline
Bhinav Sura
Software Engineer, Infrastructure Engineering
bhinav.sura@salesforce.com
in/bhinavsura
What is Argus?
● Time Series Data & Events
● Inbuilt Service Protection
● Alerting
● Flexible Dashboarding
● Full REST API
● High Throughput
● Low Latency
● Horizontally Scalable
● In Use By
○ Capacity Planning
○ Search
○ Feature Teams
○ Site Reliability
○ Customer Success
But Why Another Monitoring System?
● Technology changes frequently!
● Insulate our customers
● Performance
● Trust
● Programmatic access for everything
● Multi-tenancy
● Correlation with non-timeseries data
● Highly dimensional
I’ve seen this somewhere before...
Metrics
● Transforms
● Namespace
● Scope
● Name
● Tags
● Aggregator
● Downsampler
Events
● Namespace
● Scope
● Name
● Tags
● Type
● User
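Metric expression: SCALE(-2d:-1d:dva:argus:freemem{host=*}:min:1d-min, $1e-6)
● TRANSFORM: SCALE
● START: -2d
● END: -1d
● NAMESPACE: dva
● SCOPE: argus
● METRIC: freemem
● TAGS: {host=*}
● AGG: min
● DS: 1d-min
● PARAMS: $1e-6
Event expression: -2d:-1d:dva:argus:release{host=*}:major:admin
● START: -2d
● END: -1d
● NAMESPACE: dva
● SCOPE: argus
● NAME: release
● TAGS: {host=*}
● TYPE: major
● USER: admin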
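The anatomy above can be sketched as a small parser. This is an illustration only, not Argus's own parser: the field order simply follows the labels on the slide, and `MetricQuery`, `parse_metric_query`, `split_top_level`, and `parse_transform` are hypothetical names.

```python
import re
from dataclasses import dataclass

# Field order follows the slide labels (assumption, not the Argus grammar):
# START:END:NAMESPACE:SCOPE:METRIC{TAGS}:AGG:DS
_QUERY = re.compile(
    r"^(?P<start>[^:]+):(?P<end>[^:]+):(?P<namespace>[^:]+):(?P<scope>[^:]+):"
    r"(?P<metric>[^:{]+)(?:\{(?P<tags>[^}]*)\})?:(?P<agg>[^:]+):(?P<ds>[^:]+)$"
)


@dataclass
class MetricQuery:
    start: str
    end: str
    namespace: str
    scope: str
    metric: str
    tags: dict
    aggregator: str
    downsampler: str


def parse_metric_query(text):
    """Parse one metric query into its labelled parts."""
    m = _QUERY.match(text.strip())
    if m is None:
        raise ValueError(f"not a metric query: {text!r}")
    tags = {}
    if m["tags"]:
        for pair in m["tags"].split(","):
            key, _, value = pair.partition("=")
            tags[key.strip()] = value.strip()
    return MetricQuery(m["start"], m["end"], m["namespace"], m["scope"],
                       m["metric"], tags, m["agg"], m["ds"])


def split_top_level(text):
    """Split on commas that are not inside a {tag} block."""
    parts, depth, current = [], 0, []
    for ch in text:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    parts.append("".join(current).strip())
    return parts


def parse_transform(text):
    """Split 'NAME(query, param, ...)' into (name, MetricQuery, params)."""
    name, paren, rest = text.strip().partition("(")
    if not paren or not rest.endswith(")"):
        raise ValueError(f"not a transform call: {text!r}")
    query, *params = split_top_level(rest[:-1])
    return name, parse_metric_query(query), params
```

Usage: `parse_transform("SCALE(-2d:-1d:dva:argus:freemem{host=*}:min:1d-min, $1e-6)")` yields the transform name, the parsed query, and the parameter list.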
Events
● First Class Data
● Decoupled from Time Series
● Multiple Events Per Timestamp
● Event Categories
● Identifiable per User
● Overlay on Any Time Series
Alerting
● CRON Format
● Alert on Missing Data
● Single Ended & Range Comparisons
● Inertia
● Cooldown
● Multiple Triggers
● Multiple Notifications
○ Audit
○ Email
○ GOC++
○ Salesforce Chatter
○ PagerDuty
● Event Backannotation
Warden
● Policy Driven Suspension Mechanism
● Per User
● Application & Subsystem
● Progressively Punitive
● Indefinite Suspension Supported
● Customizable
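A minimal sketch of what a progressively punitive, per-user and per-subsystem suspension policy could look like. Everything here is an assumption made for illustration: the doubling schedule, the `Policy` and `Warden` classes, and their fields are hypothetical and do not describe Argus's actual Warden implementation.

```python
from dataclasses import dataclass, field

INDEFINITE = float("inf")


@dataclass
class Policy:
    base_suspension_s: float = 60.0   # first suspension length (hypothetical)
    max_infractions: int = 3          # beyond this, suspend indefinitely


@dataclass
class Warden:
    policy: Policy
    infractions: dict = field(default_factory=dict)
    suspended_until: dict = field(default_factory=dict)

    def record_violation(self, user, subsystem, now):
        """Count a policy violation and return when the suspension ends."""
        key = (user, subsystem)
        count = self.infractions.get(key, 0) + 1
        self.infractions[key] = count
        if count > self.policy.max_infractions:
            until = INDEFINITE
        else:
            # Each repeat offence doubles the suspension length.
            until = now + self.policy.base_suspension_s * 2 ** (count - 1)
        self.suspended_until[key] = until
        return until

    def is_suspended(self, user, subsystem, now):
        return now < self.suspended_until.get((user, subsystem), 0.0)
```

Keying on `(user, subsystem)` keeps one user's violations in one subsystem from suspending that user everywhere, which is one way to read "Per User" and "Application & Subsystem" together.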
Dashboarding
● Maintaining dashboards is a horrible business to be in
● Empower the users, get out of their way
● Markup based
● Custom tags for visualization elements
● HTML for everything else
REST
● API First
● All functionality exposed via services
● Decoupled UI
● Authenticated
○ Login
○ Do stuff
○ Logout
● Get out of User's Way!
○ Orchestra Client
○ ArgusPoke
○ Dashboard Creation Tool
How does it work?
METRICS ANNOTATION USER ENTITY
ALERTS MAIL SCHEDULING MONITORING
WEB SERVICES
AUTH ORM MQ TSDB
WEB UI CUSTOM APPS OTHER CLIENTS
DASHBOARD MANAGEMENT WARDEN NAMESPACE
SCHEMA WILDCARDING CACHING INTERLOCK
Okay, but how does it REALLY work?
[Diagram: repeated UI / WS / CORE stacks and CL / CORE stacks, one MESSAGE BUS, and one HBASE/TSDB/RDBMS/CACHING layer.]
Cool, how will it evolve going forward?
[Diagram: repeated UI / WS / CORE stacks, CORE nodes, and CL nodes, with five MESSAGE BUS instances, five HBASE/TSDB/RDBMS/CACHE layers, and three ROUTE/FORK/JOIN+M/R layers.]
Alert Evaluation Data Flows
Message Queue:
1. Scheduling Service updates the alert schedule every 10 minutes.
2. Scheduler submits scheduled jobs to the queue.
3. Minimum interval of 1 minute.
Alert Client:
1. Dequeues from the alert queue.
2. Query ranges are adjusted for scheduling latency.
3. Triggers are evaluated.
4. Notifications are sent.
5. Cooldowns are updated.
[Diagram: ARGUS WS, ALERT DATA STORE, SCHEDULING SERVICE, ALERT CACHE, and a queue of alerts (ALERT 8713, ..., ALERT 4141, ALERT 9810).]
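The Alert Client steps above can be sketched roughly as follows. This is a simplified illustration under stated assumptions: relative query ranges are resolved against the time the job was scheduled rather than the time it was dequeued, inertia means the condition must hold continuously for a minimum duration, and `Trigger`, `resolve_range`, `fires`, and `evaluate` are hypothetical names, not Argus code.

```python
from dataclasses import dataclass


@dataclass
class Trigger:
    name: str
    threshold: float          # fires when value > threshold (single-ended)
    inertia_s: float          # condition must hold at least this long
    cooldown_s: float         # silence period after a notification
    cooldown_until: float = 0.0


def resolve_range(rel_start_s, rel_end_s, scheduled_at):
    """Resolve relative offsets against the scheduled time, so time spent
    waiting in the queue does not shift the evaluated window."""
    return scheduled_at + rel_start_s, scheduled_at + rel_end_s


def fires(trigger, datapoints):
    """datapoints: (timestamp, value) pairs sorted by timestamp."""
    held_since = None
    for ts, value in datapoints:
        if value > trigger.threshold:
            if held_since is None:
                held_since = ts
        else:
            held_since = None          # condition broken, restart inertia
    if held_since is None:
        return False
    return datapoints[-1][0] - held_since >= trigger.inertia_s


def evaluate(trigger, datapoints, now, notify):
    """Evaluate one trigger, notify, then update its cooldown."""
    if not fires(trigger, datapoints):
        return False
    if now < trigger.cooldown_until:
        return False                   # still cooling down, stay quiet
    notify(f"{trigger.name} fired")
    trigger.cooldown_until = now + trigger.cooldown_s
    return True
```

`notify` is any callable taking a message, which leaves room for several notification channels per trigger.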
Metric & Event Data Flows
Message Queue:
1. Writes are asynchronous with a high degree of parallelism.
2. The queue is used as a shock absorber. It tolerates lower-level failures and downtime.
3. Kafka for scalability. One topic each for metrics and annotations. The number of partitions is on the order of hundreds.
ArgusMetricsQueue:
1. Consumed by 2 types of clients: MetricCommit and SchemaCommit.
2. The MetricCommit client commits the actual time series data to persistent storage (using OpenTSDB or Phoenix).
3. The SchemaCommit client uses only the metric metadata to create metric schema records and commits them to HBase (using AsyncHBase).
[Diagram: ARGUS WS, METRIC SERVICE, a queue of METRIC messages, TIMESERIES STORE, and SCHEMA STORE.]
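The two-consumer flow above can be sketched with an in-memory stand-in for the queue. This is a toy model, not the Kafka client API: `Topic` is a hypothetical class in which each consumer group keeps its own offset, and it assumes MetricCommit and SchemaCommit each read every message independently.

```python
class Topic:
    """In-memory stand-in for one queue topic with per-group offsets."""

    def __init__(self):
        self.log = []
        self.offsets = {}

    def publish(self, message):
        self.log.append(message)       # producers never wait for consumers

    def poll(self, group, max_records=100):
        start = self.offsets.get(group, 0)
        batch = self.log[start:start + max_records]
        self.offsets[group] = start + len(batch)
        return batch


def metric_commit(batch, timeseries_store):
    """Commit the datapoints themselves, keyed by series identity."""
    for m in batch:
        series = (m["namespace"], m["scope"], m["metric"],
                  tuple(sorted(m["tags"].items())))
        timeseries_store.setdefault(series, {}).update(m["datapoints"])


def schema_commit(batch, schema_store):
    """Commit metadata only: one schema record per tag pair, no datapoints."""
    for m in batch:
        for tagk, tagv in m["tags"].items():
            schema_store.add(
                (m["namespace"], m["scope"], m["metric"], tagk, tagv))
```

Because the log is kept until each group has read it, a slow or temporarily unavailable consumer simply falls behind and catches up later, which is the shock-absorber behaviour the slide describes.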
TSDB Service Implementation - OpenTSDB
● Uses HBase underneath
● RowKey: <metric_uid><timestamp><tagk1><tagv1>[...<tagkn><tagvn>].
● Stores actual time series values on hourly boundaries (all values within an hour are stored in the same cell)
● Pros:
○ Extremely fast when you query using complete metric name.
○ 5M datapoints/min write throughput per write daemon.
● Cons:
○ Tag Cardinality - Total number of tags per metric is limited to 8
○ Tag Cardinality - As the product of tag values across all tag keys increases, performance decreases drastically
○ UID Exhaustion - 16M UIDs each for metric, tagk and tagv names by default. Once these are exhausted, no new metrics, tagk or tagv can be created.
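A rough sketch of how a row key of this shape can be assembled, and where the 16M limit comes from. It assumes 3-byte UIDs (2^24 = 16,777,216, matching the "16M" default) and a 4-byte hour-aligned base timestamp; the function names are hypothetical, and sorting tag pairs by UID is a simplification for illustration.

```python
import struct

UID_WIDTH = 3            # bytes per UID (assumed default width)
MAX_UIDS = 2 ** (8 * UID_WIDTH)
HOUR = 3600


def uid_bytes(uid):
    """Encode a UID; fails once the UID space is exhausted."""
    if not 0 <= uid < MAX_UIDS:
        raise ValueError(f"UID {uid} does not fit in {UID_WIDTH} bytes")
    return uid.to_bytes(UID_WIDTH, "big")


def row_key(metric_uid, timestamp, tag_uids):
    """<metric_uid><hour-aligned timestamp><tagk1><tagv1>...<tagkn><tagvn>"""
    base = timestamp - timestamp % HOUR      # all points in an hour share a row
    key = uid_bytes(metric_uid) + struct.pack(">I", base)
    for tagk, tagv in sorted(tag_uids):
        key += uid_bytes(tagk) + uid_bytes(tagv)
    return key


def offset_in_hour(timestamp):
    """Position of a datapoint within its hourly row."""
    return timestamp % HOUR
```

Because the metric UID leads the key, a query that names the complete metric becomes a narrow range scan, which fits the first item under "Pros"; every extra tag pair lengthens the key and multiplies the number of distinct rows.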
TSDB Service Implementation - Phoenix
● Uses HBase underneath
● RowKey: <metric_uid><timestamp><tagv1>[...<tagvn>].
● Metric modeled as Phoenix VIEW
○ Schema is introspectable and managed outside of data
○ Supports secondary indexes on value and/or tag(s)
● Parallelizes query and pushes computation to server
○ Server-side aggregation conserves network bandwidth
○ Allows SKIP_SCAN filter optimization for minimizing data scanned
○ Leverages ROW_TIMESTAMP optimization for filtering HFiles
● Performance on par or better than OpenTSDB
● Ad hoc SQL query capability
○ Join against other Phoenix tables
● Longer term: leverage Drillix (Phoenix + Drill)
○ Cross-cluster queries
○ Joins to other non-HBase data sources
Schema Service Motivation
● Discover Metrics
○ Which metrics exist within a scope?
○ For a given <scope, metric> combination, which tags exist?
○ Given a metric, which scopes contain it?
○ Which tag values exist for a given tag key?
● Support Wildcard Queries
○ Non-wildcard query
■ -1h:system.myDatacenter.myPod:Cpu.perc:avg:1m-avg
○ Wildcard query
■ -1h:system.myDatacenter.*:Cpu.perc:avg:1m-avg
■ -1h:system.myDatacenter.myPod:Cpu*:avg:1m-avg
■ -1h:system.myDatacenter.myPod:Cpu.perc{device=*app*}:avg:1m-avg
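Wildcard expansion of the kind shown above can be sketched with Python's standard `fnmatch` module. This is a simplified illustration: `expand` is a hypothetical helper, the known schema records are passed in as plain `(scope, metric, tagk, tagv)` tuples, only a single tag filter is handled, and `fnmatchcase` also treats `?` and `[...]` as special, which may differ from Argus's wildcard rules.

```python
from fnmatch import fnmatchcase


def expand(query, records):
    """Expand one wildcard query into sorted non-wildcard queries.

    query   -- 'START:SCOPE:METRIC{tagk=tagv}:AGG:DS', tag block optional
    records -- iterable of (scope, metric, tagk, tagv) schema records
    """
    start, scope_pat, metric_part, agg, ds = query.split(":")
    metric_pat, brace, tag_part = metric_part.partition("{")
    tag_filters = {}
    if brace:
        for pair in tag_part.rstrip("}").split(","):
            key, _, value = pair.partition("=")
            tag_filters[key] = value

    matches = set()
    for scope, metric, tagk, tagv in records:
        if not (fnmatchcase(scope, scope_pat)
                and fnmatchcase(metric, metric_pat)):
            continue
        if not tag_filters:
            matches.add((scope, metric, ""))
        elif tagk in tag_filters and fnmatchcase(tagv, tag_filters[tagk]):
            matches.add((scope, metric, f"{{{tagk}={tagv}}}"))
    return sorted(f"{start}:{s}:{m}{t}:{agg}:{ds}" for s, m, t in matches)
```

Each returned string is a concrete query that can be sent to the time series store without any further schema lookup.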
Schema Service Implementation
● AsyncHBase Schema Service:
○ Uses HBase underneath
○ SchemaRecord: namespace, scope, metricname, tagk, tagv. No data points.
○ Each record indexed in 2 ways in 2 different tables.
○ MetricIndexed schema table:
■ RowKey: <metricname><scope><namespace><tagk><tagv>
○ ScopeIndexed schema table:
■ RowKey: <scope><metricname><namespace><tagk><tagv>
○ Decide what table to use based on the type of query.
○ Pros:
■ Efficient retrieval for schema records for most types of queries
○ Cons:
■ Storage duplication
● DiscoveryService:
○ Uses SchemaService internally
○ Ability to filter records by type
■ For example, filter all unique scopes that match *myScope*
○ Expands a wildcard query and returns a collection of non-wildcard queries
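One way to picture the two index tables and the "decide what table to use" step is sketched below. The selection heuristic is an assumption, not the documented Argus rule: prefer the table whose leading field is fully specified, otherwise the one with the longer literal prefix. The separator byte and all function names are hypothetical.

```python
import bisect

SEP = "\x00"   # hypothetical field separator inside a row key


def metric_indexed_key(namespace, scope, metric, tagk, tagv):
    return SEP.join([metric, scope, namespace, tagk, tagv])


def scope_indexed_key(namespace, scope, metric, tagk, tagv):
    return SEP.join([scope, metric, namespace, tagk, tagv])


def literal_prefix(pattern):
    """Part of a pattern before its first wildcard character."""
    for i, ch in enumerate(pattern):
        if ch in "*?":
            return pattern[:i]
    return pattern


def choose_table(scope_pat, metric_pat):
    """Pick the table that allows the narrowest prefix scan."""
    scope_prefix = literal_prefix(scope_pat)
    metric_prefix = literal_prefix(metric_pat)
    if scope_prefix == scope_pat:
        return "ScopeIndexed"
    if metric_prefix == metric_pat:
        return "MetricIndexed"
    if len(scope_prefix) >= len(metric_prefix):
        return "ScopeIndexed"
    return "MetricIndexed"


def prefix_scan(sorted_keys, prefix):
    """Return every key starting with prefix from a sorted key list."""
    lo = bisect.bisect_left(sorted_keys, prefix)
    hi = bisect.bisect_left(sorted_keys, prefix + "\uffff")
    return sorted_keys[lo:hi]
```

Storing every record under both key orders is the storage duplication listed under "Cons"; in exchange, both "which metrics exist in this scope" and "which scopes contain this metric" become prefix scans.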
Caching
● CachedTSDB Service:
○ Uses the RedisCache service and the configured TSDBService implementation (OpenTSDB or PhoenixTSDB)
○ Query Level Caching (caches synthetic data)
○ Caches data for query windows that extend beyond the last 24 hours.
○ Data is cached by fracturing it on day boundaries.
■ For example, a query spanning 5 days is stored using 5 keys in the cache.
○ Support for partial hits
○ Cache expiry time of an hour (can be increased by running a separate cache update process)
● CachedDiscovery Service:
○ Uses RedisCache service and the configured DiscoveryService implementation
○ Cache queries already expanded
○ Cache expiry time of a day
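Day-boundary fracturing with partial hits can be sketched as follows. A plain dict stands in for the Redis cache, expiry is not modelled, and `day_buckets`, `cached_query`, and the `fetch` callback are hypothetical names chosen for this illustration.

```python
DAY = 86400


def day_buckets(start, end):
    """Day-aligned (bucket_start, bucket_end) windows covering [start, end)."""
    first = start - start % DAY
    return [(d, d + DAY) for d in range(first, end, DAY)]


def cached_query(expr, start, end, cache, fetch):
    """Serve [start, end) from per-day cache entries, fetching only the
    days that are missing (a partial hit).

    cache -- dict keyed by (expr, bucket_start)
    fetch -- callable(expr, bucket_start, bucket_end) -> {timestamp: value}
    """
    result = {}
    for bucket_start, bucket_end in day_buckets(start, end):
        key = (expr, bucket_start)
        if key not in cache:
            cache[key] = fetch(expr, bucket_start, bucket_end)
        for ts, value in cache[key].items():
            if start <= ts < end:        # trim partial first and last days
                result[ts] = value
    return result
```

With day-aligned bounds, a 5-day query creates 5 cache keys, and a later overlapping query only goes back to the time series store for the days it has not seen.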
Developed By
● Anand Subramanian
● Bhinav Sura
● Tom Valine
● Jigna Bhatt
● Ruofan Zhang
● Dilip Devaraj
● Raj Sarkapally
● Kiran Gowdru
More Information
https://github.com/SalesforceEng/Argus
Thank you
