Introducing Envoy-based service mesh at Booking.com
Ivan Kruglov
KubeCon Europe 2018
02.05.2018
based in Amsterdam
28M listings
1.5M room nights per day
1700 IT staff
Agenda
• what is service mesh?
• why did we start the project?
• our setup
• learnings
• conclusion
What is service mesh*?
* the way I understand it
[Diagram: an application on the service consumer side calls a provider on the service provider side]
[Diagram: the same call, now going through a communication infrastructure]
[Diagram: the same call, now with a proxy on each side and a control plane]
Why did we start the service mesh project?
[Diagram: a monolith surrounded by many services]
Consistency & Visibility
in communications
Our setup
[Diagram: data plane. A proxy sits between the application (service consumer) and the provider (service provider); link labels: HTTP(S) and HTTP]
• graceful restart
• TCP proxy
[Diagram: the same data plane with Envoy as the proxy, plus a control plane]
control plane
• routing rules
• timeout/retry/etc policies
• service discovery
control plane
• in-house (v1/v2 API)
• started in August 2017; Istio was around 0.2 at the time
• one complex system at a time
• start with bare-metal support
• minimal abstraction
• yup, just write Envoy config (partially)
• fine with Envoy’s set of features
• self-service for service owners
[Diagram: the same setup; the control plane is now connected to ZooKeeper (service discovery), Kubernetes (service discovery) and ZooKeeper (configuration)]
configuration
• Envoy configuration is quite rich
• power vs. usability:
• cluster specification
• virtual host specification
kind: VirtualHostSpec
spec:
  name: api.vhost
  domains: ["api.service"]
  routes:
  - match:
      prefix: /
    route:
      cluster: api.prod.cluster
      timeout: 10s
      retry_policy:
        retry_on: 5xx
        num_retries: 3
        per_try_timeout: 3s

kind: ClusterSpec
metadata:
  annotations:
    watcher_name: zookeeper.watcher
    paths: ["/pools/api/dc"]
spec:
  name: api.prod.cluster
  lb_policy: LEAST_REQUEST
  connect_timeout: 1s
  lb_subset_config:
    subset_selectors:
    - keys: ["DC"]
    - keys: ["IsLocal"]
    fallback_policy: DEFAULT_SUBSET
    default_subset:
      IsLocal: true
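A rough way to sanity-check the timeout and retry values in the VirtualHostSpec example is to compare the worst-case time spent across all attempts with the overall route timeout. The Python sketch below is not from the talk. It assumes Envoy's documented semantics that `timeout` covers the whole request including retries while `per_try_timeout` applies to each attempt, and it ignores retry back-off. The names `RetryBudget` and `retry_budget` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RetryBudget:
    attempts: int            # initial try plus retries
    worst_case_s: float      # attempts * per_try_timeout, back-off ignored
    capped_by_timeout: bool  # True if the overall timeout fires first


def retry_budget(timeout_s: float, num_retries: int,
                 per_try_timeout_s: float) -> RetryBudget:
    """Compare the summed per-try timeouts with the overall route timeout."""
    attempts = num_retries + 1
    worst_case = attempts * per_try_timeout_s
    return RetryBudget(attempts, worst_case, worst_case > timeout_s)


# Values from the api.vhost example: timeout 10s, 3 retries, 3s per try.
budget = retry_budget(timeout_s=10, num_retries=3, per_try_timeout_s=3)
print(budget)
```

With the example's values, four attempts of 3 s each add up to 12 s, so under these assumptions the 10 s route timeout would cut the last attempt short.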
[Slide annotations on the two specs: "envoy cluster spec", "envoy virtual host spec"]
bootstrap wizard
control plane
ZooKeeper
VC VC VC
service A: expose to service D
service B
service C: expose to service D
envoy: "I'm service D"
control plane: "here are the service A and service C specs"
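The exchange on this slide can be sketched as a lookup: an Envoy identifies itself as a consumer, and the control plane returns only the specs of services that are exposed to it. The Python below is a minimal illustration, not the in-house control plane. The `EXPOSED_TO` dictionary stands in for whatever is stored in ZooKeeper, the function name `specs_for` is hypothetical, and treating service B as exposed to no one is an assumption read off the diagram.

```python
from typing import Dict, List, Set

# Hypothetical in-memory stand-in for the stored configuration:
# each service lists the consumers it is exposed to.
EXPOSED_TO: Dict[str, Set[str]] = {
    "service A": {"service D"},
    "service B": set(),          # assumption: not exposed to anyone
    "service C": {"service D"},
}


def specs_for(consumer: str, exposed_to: Dict[str, Set[str]]) -> List[str]:
    """Return the services whose specs should be sent to this consumer's Envoy."""
    return sorted(name for name, consumers in exposed_to.items()
                  if consumer in consumers)


# "I'm service D" -> specs of service A and service C
print(specs_for("service D", EXPOSED_TO))
```

Filtering on the control plane side keeps each Envoy's configuration limited to the services it is actually allowed to call.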
[Diagram: the same setup, with configuration changes added]
Deploying changes
• integration with existing infrastructure
• git-deploy
• vanguard (canary) deployment
• syntax and semantic validation
• fast rollback
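One kind of semantic check that could run before a change is deployed is a cross-reference check: every route must point at a cluster that is actually defined. The Python sketch below is an assumption about what such a check might look like, not the validation used at Booking.com. It expects specs that are already parsed into dictionaries shaped like the VirtualHostSpec and ClusterSpec examples. The function name `validate` and the cluster name `api.canary.cluster` are made up for illustration.

```python
from typing import Dict, List


def validate(specs: List[Dict]) -> List[str]:
    """Return human-readable errors; an empty list means the check passed."""
    errors = []
    clusters = {s["spec"]["name"] for s in specs if s.get("kind") == "ClusterSpec"}
    for s in specs:
        if s.get("kind") != "VirtualHostSpec":
            continue
        vhost = s["spec"]["name"]
        for route in s["spec"].get("routes", []):
            target = route.get("route", {}).get("cluster")
            if target not in clusters:
                errors.append(f"{vhost}: route points at unknown cluster {target!r}")
    return errors


specs = [
    {"kind": "ClusterSpec", "spec": {"name": "api.prod.cluster"}},
    {"kind": "VirtualHostSpec", "spec": {"name": "api.vhost", "routes": [
        {"match": {"prefix": "/"}, "route": {"cluster": "api.prod.cluster"}},
        # Hypothetical route to a cluster that was never defined.
        {"match": {"prefix": "/v2"}, "route": {"cluster": "api.canary.cluster"}},
    ]}},
]
print(validate(specs))
```

A check like this catches dangling references that a pure syntax check would let through.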
[Diagram: the same setup, with metrics added]
Monitoring
• approach #1 – graphite
• excellent in-house facilities
• aggregation is too heavy at scale
• approach #2 – prometheus
• statsd support
• statsd_exporter
• still iterating
• standard dashboard
envoy
statsd_exporter
envoy…
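To give a feel for the statsd-to-Prometheus step, the Python sketch below turns one statsd line into a labeled metric line in Prometheus text format. It only illustrates the name mapping. The real statsd_exporter is driven by its own mapping configuration and also accumulates counter increments, which this sketch does not do. The input line, the `envoy_cluster_` metric prefix and the function name `to_prometheus` are assumptions chosen for illustration.

```python
import re
from typing import Optional

# name:value|type, e.g. "some.metric:1|c"
STATSD_LINE = re.compile(r"^(?P<name>[^:]+):(?P<value>[^|]+)\|(?P<type>\w+)$")
# Assumed naming scheme: envoy.cluster.<cluster name>.<stat>.
# The cluster name may itself contain dots, so the stat is the last segment.
CLUSTER_STAT = re.compile(r"^envoy\.cluster\.(?P<cluster>.+)\.(?P<stat>[^.]+)$")


def to_prometheus(line: str) -> Optional[str]:
    """Map one statsd line to a labeled metric line, or None if it does not match."""
    parsed = STATSD_LINE.match(line.strip())
    if parsed is None:
        return None
    stat = CLUSTER_STAT.match(parsed.group("name"))
    if stat is None:
        return None
    return 'envoy_cluster_{}{{cluster="{}"}} {}'.format(
        stat.group("stat"), stat.group("cluster"), parsed.group("value"))


print(to_prometheus("envoy.cluster.api.prod.cluster.upstream_rq_5xx:1|c"))
```

Moving the cluster name out of the metric name and into a label is what makes one standard dashboard reusable across services.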
Production
Some numbers
• ~ 6 months in production
• ~ 30 projects
• ~ 10K servers
• hundreds of thousands of RPS
• overheads…
[Chart: HTTP perceived latency at p50, p75, p90, p95 and p99; y-axis 0 to 15 ms; annotation: + 1ms]
[Chart: HTTPS perceived latency at p50, p75, p90, p95 and p99; y-axis 10 to 40 ms; annotation: + 1ms]
Some findings
• graceful restart works, but not always!
• (major) Envoy upgrades may not be graceful
• "x-envoy-retry-on = connect-failure" – too few retries
• "x-envoy-retry-on = 5xx" – too many retries
• gateway-error – HTTP 502, 503, 504
• absence of TCP keepalive => stale connections
• coming in 1.7
• idle_timeout in TCP proxy in 1.6
• cluster_name (1.5) -> cluster_names (1.6)
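The difference between the three retry policies mentioned above can be sketched as a function of the upstream response status. The Python below is a simplified model, not Envoy's implementation. It only looks at cases where an upstream response was received, so connection-level failures and resets are deliberately left out. The function name `retries_on_status` is hypothetical.

```python
def retries_on_status(policy: str, status: int) -> bool:
    """Would this retry policy retry a request that got this upstream status?"""
    if policy == "5xx":
        return 500 <= status <= 599
    if policy == "gateway-error":
        return status in (502, 503, 504)
    if policy == "connect-failure":
        # A received upstream status means the connection was established,
        # so this policy never applies to a response.
        return False
    raise ValueError(f"unknown policy: {policy}")


for status in (500, 502, 503, 504):
    row = {p: retries_on_status(p, status)
           for p in ("connect-failure", "gateway-error", "5xx")}
    print(status, row)
```

In this model `gateway-error` sits between the other two: it retries 502, 503 and 504 but leaves a plain 500 from the application alone.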
Conclusion
service mesh* is
a building block on the path to SOA
technically & organizationally
* the way I understand it
Andrei Vereha
Envoy production stories
Friday, May 4th 14:00
Envoy Deep Dive
Thank you!
Ivan Kruglov
ivan.kruglov@booking.com
