Seattle 2018
@danielhochman / Engineer / Lyft
Instrumenting and Scaling
Cloud-Native Databases with Envoy
Database outage
1. Disk I/O wait spikes briefly
2. Client opens more connections
3. Slowdown due to auth overhead of new connections
4. Client opens more connections
5. Hit max connection limit
Databases in the cloud
Instantly provision resilient, high-throughput infrastructure
No access to underlying VM and/or shared hardware
Limited access to telemetry
Limited access to configuration
Closed source or no ability to run custom binary
Cloud Native
Cloud native technologies empower organizations to build and run
scalable applications in modern, dynamic environments such as
public, private, and hybrid clouds.
Service Mesh topology
Service mesh
Edge
Discovery
Envoy Proxy is deployed at every hop
Instance topology
Application communicates locally with Envoy,
which proxies all traffic
localhost:6001
localhost:6101
localhost:7000
…
(internal services)
(third-party services)
(cloud services)
and more!
Layer 3 / 4: Proxying TCP
Stats
cx_active
cx_connect_fail
cx_idle_timeout
cx_total
cx_tx_bytes_total
cx_rx_bytes_total
Other benefits
- DNS aware
- Load balancing: round robin, least request, ring hash, random, etc.
- Impose an idle timeout
- Health checking
- Access logging
localhost:7000
iot.us-east-1.amazonaws.com
174.217.14.202
174.217.14.234
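Two of the load balancing policies listed above can be sketched in a few lines. This is a simplified illustration, not Envoy's implementation: the class names are hypothetical, and the least-request picker scans every host, whereas a production proxy may sample a subset instead.

```python
import itertools


class RoundRobin:
    """Cycle through the hosts in order."""

    def __init__(self, hosts):
        self._cycle = itertools.cycle(hosts)

    def pick(self):
        return next(self._cycle)


class LeastRequest:
    """Pick the host with the fewest in-flight requests."""

    def __init__(self, hosts):
        self.active = {host: 0 for host in hosts}

    def pick(self):
        # Ties resolve to the host that was listed first.
        host = min(self.active, key=self.active.get)
        self.active[host] += 1
        return host

    def done(self, host):
        # Call when a request to `host` completes.
        self.active[host] -= 1


# The two addresses shown on the slide for the resolved DNS name.
hosts = ["174.217.14.202", "174.217.14.234"]
rr = RoundRobin(hosts)
print([rr.pick() for _ in range(3)])
```

Round robin needs no feedback from the upstreams; least request needs the completion callback so that a slow host accumulates in-flight requests and receives less new traffic.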
Layer 5 / 6: Offloading SSL
Stats
handshakes
tls_session_reused
fail_verify_no_cert
fail_verify_ca_error
fail_verify_san
cipher.<cipher>
days_until_cert_expires
Other benefits
- Efficient
- Up-to-date and secure (TLS 1.3)
- SNI, cert pinning, session resumption, etc.
- Easier to upgrade
localhost:7000 172.217.14.202:443
Layer 7: Managing HTTP
Stats
cx_http1_total
cx_http2_total
cx_protocol_error
rq_2xx
rq_4xx
rq_5xx
rq_retry
rq_time_ms (hist)
rq_timeout
Other benefits
- Transparent upgrade from HTTP/1 to HTTP/2 (multiplexed)
- Manage request retries and timeouts
- Access logging
- Offload GZIP decompression
HTTP/1
HTTP/2
Statistics
TCP (L3/L4)
cx_active
cx_connect_fail
cx_idle_timeout
cx_total
cx_tx_bytes_total
cx_rx_bytes_total
cx_length_ms (hist)
SSL (L5/L6)
handshakes
tls_session_reused
fail_verify_no_cert
fail_verify_ca_error
fail_verify_san
cipher.<cipher>
days_until_cert_expires
HTTP (L7)
cx_http1_total
cx_http2_total
cx_protocol_error
rq_2xx
rq_4xx
rq_5xx
rq_retry
rq_time_ms (hist)
rq_timeout
and more!
Dashboards
Live templating
or {% macro envoy_stats(origin, destination) %}
Observability
Homogeneous telemetry data makes it easier
to observe and correlate behavior in large
systems.
Observability
Libraries are heterogeneous!
SSL ciphers? Status code metrics? Retry?
import pynamodb
use Aws\DynamoDb\DynamoDbClient;
import "github.com/aws/dynamodb"
e.g.
&aws.Config{
    Endpoint: aws.String("http://localhost:8000"),
}
Envoy provides standard access logs, stats,
alarms, retry, etc.
Layer 7: Beyond HTTP
Envoy supports three other database-specific L7 protocols today
DynamoDB
- Protocol: JSON over HTTP
- CloudWatch telemetry
  - min, avg, max latency
  - per-table capacity unit throughput
  - per-minute
- Benefits of Envoy:
  - Histogram of latency (percentiles)
  - Custom windowing of metrics
  - Per-host, per-zone, and per-cluster statistics
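To illustrate why percentiles and custom windows matter, here is a minimal sketch that groups latency samples into fixed windows and computes a nearest-rank percentile per window. The function names and the sample data are hypothetical; a real metrics pipeline would typically work from pre-aggregated histograms rather than raw samples.

```python
import math
from collections import defaultdict


def percentile(samples, p):
    """Nearest-rank percentile (p in 0..100) of a non-empty list."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def windowed_percentile(events, window_s, p):
    """Group (timestamp_s, latency_ms) events into fixed windows and
    return {window_start: percentile} for each window."""
    buckets = defaultdict(list)
    for ts, latency_ms in events:
        buckets[ts - ts % window_s].append(latency_ms)
    return {start: percentile(vals, p) for start, vals in sorted(buckets.items())}


# Hypothetical latencies: mostly fast, with one slow request in the first window.
events = [(0, 4), (5, 5), (12, 6), (29, 250), (31, 5), (45, 7)]
print(windowed_percentile(events, 30, 99))        # per-30s p99
print(sum(l for _, l in events) / len(events))    # one overall average
```

The per-window p99 surfaces the single slow request, while an average taken over a longer period blends it with the fast ones.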
DynamoDB with codec
POST / HTTP/1.1
X-Amz-Target: DynamoDB_20120810.GetItem

{
  "TableName": "pets",
  "Key": {
    "Name": {"S": "Patty"}
  }
}
dynamodb.table.pets.GetItem.upstream_rq_time
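The mapping from request to stat name can be sketched as follows. This is an illustration of the idea, not the codec's actual code: the function name is hypothetical, and it only handles operations that carry a top-level TableName in the body.

```python
import json


def dynamodb_stat_name(headers, body):
    """Build a per-table, per-operation stat name from a DynamoDB request."""
    # X-Amz-Target looks like "DynamoDB_20120810.GetItem":
    # the API version, a dot, then the operation.
    operation = headers["X-Amz-Target"].split(".", 1)[1]
    table = json.loads(body)["TableName"]
    return f"dynamodb.table.{table}.{operation}.upstream_rq_time"


headers = {"X-Amz-Target": "DynamoDB_20120810.GetItem"}
body = '{"TableName": "pets", "Key": {"Name": {"S": "Patty"}}}'
print(dynamodb_stat_name(headers, body))
```

Because both the operation and the table are visible in the HTTP request, the proxy can emit this stat without any change to the client library.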
DynamoDB
What was the per-30s p99 for write requests from the
users-streamlistener canary to the pets table?
ts(
envoy.dynamodb.pets.PutItem.upstream_rq_time.p99,
window=30,
group=users-streamlistener,
canary=true,
)
MongoDB
- Protocol: Binary JSON (BSON)
- Benefits of Envoy in TCP mode:
  - Per-host, per-cluster, per-zone network I/O
- Benefits of Envoy with Mongo codec:
  - Per-operation latency
  - Count size and number of documents
  - Count scattered gets in sharded cluster
How did the number of documents returned by queries
change in us-east-1a after the 3pm deploy of my service?
MongoDB at scale
Help! My Mongo database is experiencing outages:
- Disk I/O wait spikes briefly
- Client opens more connections
- Slowdown due to auth overhead of new connections
- Open more connections
- Hit max connection limit
Envoy will rate limit new connections to apply backpressure so that query
times can recover.
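One common way to limit the rate of new connections is a token bucket. The sketch below is a local, single-process illustration with hypothetical names and parameters; it does not show how Envoy coordinates a limit across hosts.

```python
import time


class ConnectionRateLimiter:
    """Token bucket: allow about `rate` new connections per second,
    with bursts of up to `burst`."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = rate
        self.burst = burst
        self.clock = clock
        self.tokens = float(burst)
        self.last = clock()

    def allow(self):
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # the caller should reject or delay the new connection


limiter = ConnectionRateLimiter(rate=50, burst=10)
print(limiter.allow())
```

Rejecting connection attempts early breaks the feedback loop from the outage list: clients cannot pile authentication work onto a database that is already slow.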
MongoDB at scale
Help! I deleted an index. I read the code but it was in a 3,000-line class.
The index was still in use and everything fell over until we could
recreate it.
Envoy will efficiently log all Mongo queries in JSON format so that a week
of logs can be audited for usage of the index's fields.
Have you tried the built-in query profiler?
Yes, it caused a serious outage because it's expensive and results in 3x
CPU usage.
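An audit of logged queries for index usage could look roughly like the sketch below. It assumes one JSON object per log line with the query under message.query, and a filter stored under either "query" or "filter"; those key names and the helper names are assumptions for illustration. The field matching is deliberately loose (it also collects keys of nested sub-documents), so it may over-report.

```python
import json


def referenced_fields(value):
    """Collect field names used anywhere in a filter, skipping $operators."""
    fields = set()
    if isinstance(value, dict):
        for key, sub in value.items():
            if not key.startswith("$"):
                fields.add(key)
            fields |= referenced_fields(sub)
    elif isinstance(value, list):
        for item in value:
            fields |= referenced_fields(item)
    return fields


def queries_using(log_lines, index_fields):
    """Return the log entries whose filter mentions any of the index's fields."""
    wanted = set(index_fields)
    hits = []
    for line in log_lines:
        entry = json.loads(line)
        query = entry["message"]["query"]
        # Assumption: the filter lives under "query" or "filter".
        flt = query.get("query") or query.get("filter") or {}
        if referenced_fields(flt) & wanted:
            hits.append(entry)
    return hits


sample = ['{"message": {"query": {"findAndModify": "user", "query": {"_id": 903730}}}}']
print(len(queries_using(sample, ["_id"])))
```

Running a check like this over a week of logs before dropping an index is far cheaper than discovering the dependency in production.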
MongoDB at scale
Envoy will:
- globally rate limit new connections
- efficiently log all Mongo queries
- track the number of queries with no timeout set
- parse the $comment field of a query so we can time and count queries of
individual application methods, log how many records they returned, etc.
… for applications in 3 different languages across 8 clusters.
… 6 months and several outages later ...
/var/log/envoy/mongo/0.log
{
  "time": "2018-10-13T21:17:08.483Z",
  "upstream_host": "172.18.3.19:27817",
  "message": {
    "opcode": "OP_QUERY",
    "query": {
      "findAndModify": "user",
      "query": {"_id": 903730},
      "update": {"$set": {"stats.rating": 4.9}},
      "$comment": "{\"hostname\": \"users-3ae3r\", \"httpUniqueId\": \"91aaaaaf-4c3d-9400-bcbf-c4aaaaaaadb7\", \"callingFunction\": \"users.UpdateRating\"}"
    }
  }
}
envoy.mongo.callsite.users.UpdateRating.reply_time_ms
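The step from log entry to call-site stat can be sketched as below. It assumes $comment carries a JSON document with a callingFunction key, encoded either as a string or as a nested object; the function name is hypothetical and this is not the codec's actual code.

```python
import json


def callsite_stat(log_entry):
    """Return the per-call-site stat name for one query log entry, or None."""
    comment = log_entry["message"]["query"].get("$comment")
    if comment is None:
        return None
    try:
        # Accept either a JSON-encoded string or an already-parsed object.
        meta = json.loads(comment) if isinstance(comment, str) else comment
    except json.JSONDecodeError:
        return None  # free-form comment, nothing to extract
    if not isinstance(meta, dict) or "callingFunction" not in meta:
        return None
    return f"envoy.mongo.callsite.{meta['callingFunction']}.reply_time_ms"


entry = {
    "message": {
        "query": {
            "findAndModify": "user",
            "$comment": '{"callingFunction": "users.UpdateRating"}',
        }
    }
}
print(callsite_stat(entry))
```

The application only has to attach the comment to each query; timing and counting per method then happen in the proxy, identically for every client language.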
Redis partitioning proxy
Consistent hashing + Redis protocol = Redis partitioning proxy
Redis at scale
localhost:6379
…
SET msg hello
INCR comm
MGET lyft hello
SET msg hello
GET hello
INCR comm
GET lyft
OK
1
nil
To the application, the proxy looks like a single instance of Redis.
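How a partitioning proxy might route keys and split a multi-key command can be sketched with a small consistent hash ring. The host names, the hash function, and the number of virtual nodes are assumptions for illustration; this is not Envoy's hashing scheme.

```python
import bisect
import hashlib


class HashRing:
    """Consistent hash ring with virtual nodes."""

    def __init__(self, hosts, vnodes=100):
        # Several points per host even out the key distribution.
        self._ring = sorted(
            (self._hash(f"{host}#{i}"), host) for host in hosts for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(key):
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def host_for(self, key):
        # First ring point clockwise from the key's hash, wrapping at the end.
        idx = bisect.bisect(self._points, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]


def split_mget(ring, keys):
    """Group the keys of an MGET by owning host. A proxy would read from
    each host and merge the replies back into the original key order."""
    by_host = {}
    for key in keys:
        by_host.setdefault(ring.host_for(key), []).append(key)
    return by_host


ring = HashRing(["redis-a:6379", "redis-b:6379", "redis-c:6379"])  # hypothetical hosts
print(split_mget(ring, ["lyft", "hello"]))
```

With this scheme, removing a host only remaps the keys that host owned; keys on the remaining hosts stay where they are.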
Approaches
TCP
HTTP
…
Bump-in-the-wire vs Fully routing
Future codecs
Roadmap
- More codecs
- Full L7 capability vs bump-in-the-wire
- Better integration of tracing
- More fault injection coverage
- Role-based access control
Thanks!
@danielhochman
