Pulsar Summit
San Francisco
Hotel Nikko
August 18 2022
Use Case
Blue-green deploys
with Pulsar & Envoy
in an event-driven
microservice
ecosystem
Kai Levy & Zach Walsh
Toast, Inc.
Kai and Zach both work on Toast’s
Scale team, building shared
infrastructure and solving problems of
messaging, routing and persistence at
scale.
Kai Levy
Senior Software Engineer
Toast
Zach Walsh
Senior Software Engineer
Toast
Agenda
Toast’s microservice ecosystem + Pulsar
Blue/green deployments at Toast
Driving Pulsar adoption
Our Envoy Proxy control plane
“The Pulsar Toggle”
We empower the restaurant
community to delight their guests,
do what they love, and thrive
Toast’s technology platform
Toast’s microservice ecosystem
How it started How it’s going
How it’s going (with Pulsar)
2018 Asynchronous messaging with RabbitMQ
● Order syncing between devices
● Change Data Capture (CDC)
A History of
Pulsar at
Toast
2019 Pulsar pilot
● Initial exploration & testing
● Cluster productionalization
● First features, such as migrating change data
capture
Persistence & Stability
Seamless Pulsar
failover
● RabbitMQ: potential stability issues + in-memory data storage = lost messages
○ Manual maintenance was a big burden
● Pulsar’s data replication & automatic topic balancing eliminated these concerns
Horizontal Scalability
● Supports adding more topics without manual provisioning
● Throughput has grown more than 5x without any change in architecture
[Diagram: broker 0, broker 1, broker 2, broker 3, …, broker n]
2020 Full-fledged adoption
● Teams across Toast rapidly built features on top of
Pulsar to help restaurants survive the pandemic
● Decorated streams built on Pulsar, which enabled
more scalable consumers
Full-fledged adoption
[Diagram: Domain service (Source of Truth), CDC, notify-topic, service1, service2, service3, …, serviceN]
Full-fledged adoption
[Diagram: Domain service (Source of Truth), CDC, notify-topic, data decorator service, decorated-stream, service1, …, serviceN]
Order status notifications
Delivery & curbside arrival notifications for consumers
- helping restaurants pivot to digital
Full-fledged adoption
Tip pool tracking
Tip pooling information is kept up to date with order
information
Loyalty points accrual
Consumer-facing loyalty programs help Toast
restaurants thrive
Restaurant availability
Third party platforms are notified when a restaurant
goes offline
2022 Next-gen order processing
● Critical replatforming projects in development will
help Toast reach the next level of scale
● Event-driven architecture being widely used for new
features
Agenda
Toast’s microservice ecosystem + Pulsar
Blue/green deployments at Toast
Driving Pulsar adoption
Our Envoy Proxy control plane
“The Pulsar Toggle”
Pulsar adoption has grown steadily
[Chart: user adoption (linear)]
Toast client libraries
Providing Toast-specific functionality for free
1. Out-of-the-box authentication
2. Dead-letter topic guidance (+ topic registries)
3. Metric instrumentation
4. Message parsing
5. Pulsar client configuration
Authentication & authorization
● Automatic service authentication provided by the client libraries
○ Easy to use with any of our supported application frameworks
● Contributed a patch into the public Java client library
Dead-Letter Topics
● Standards for undeliverable messages
○ Per-subscription DLQs, or automatic
acknowledgement after redelivery
○ Integrated with service configuration
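The two standards above can be sketched as a small policy switch: once a message has failed a fixed number of deliveries, it is either moved to a dead-letter destination or acknowledged automatically so it stops blocking the subscription. This is only an illustration using the JDK, not the Toast client library or the Pulsar client; `OnExhausted`, `maxDeliveries`, `Outcome` and `consume` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class DeadLetterPolicySketch {
    /** Hypothetical switch between the two standards named on the slide. */
    enum OnExhausted { DEAD_LETTER, AUTO_ACK }

    /** Where each message ended up after processing. */
    static final class Outcome {
        final List<String> processed = new ArrayList<>();
        final List<String> deadLettered = new ArrayList<>();
        final List<String> autoAcked = new ArrayList<>();
    }

    /**
     * Delivers each message up to maxDeliveries times. The handler returns
     * true on success; a message that never succeeds is handled per policy.
     */
    static Outcome consume(List<String> messages, Predicate<String> handler,
                           int maxDeliveries, OnExhausted policy) {
        Outcome outcome = new Outcome();
        for (String message : messages) {
            boolean ok = false;
            for (int attempt = 1; attempt <= maxDeliveries && !ok; attempt++) {
                ok = handler.test(message);
            }
            if (ok) {
                outcome.processed.add(message);
            } else if (policy == OnExhausted.DEAD_LETTER) {
                outcome.deadLettered.add(message); // stand-in for a per-subscription DLQ
            } else {
                outcome.autoAcked.add(message);    // acknowledged and dropped
            }
        }
        return outcome;
    }

    public static void main(String[] args) {
        List<String> messages = List.of("ok-1", "bad", "ok-2");
        Outcome outcome = consume(messages, m -> !m.equals("bad"), 3, OnExhausted.DEAD_LETTER);
        System.out.println("processed=" + outcome.processed
                + " deadLettered=" + outcome.deadLettered);
    }
}
```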
Topic registries with Terraform
● Started with in-house provider
○ Now migrating to StreamNative provider
● Lets us manage namespace authorization
● Provide defaults for retention & persistence
● Central place for discovering events
Developers write infrastructure as code
Metrics
● Automatically report over 2 dozen
metrics
○ Consistent across services
● Critical for operations & monitoring
● Added our own custom metrics
● Adding APM integrations
ackLatency
ackTimeouts
auto-acknowledgements
Message Parsing
We parse Protobuf messages into friendly Kotlin data classes
● Our open-source, Kotlin-first
protocol buffer compiler
● One-line usage for engineers
building on our client
Configuration recommendations
Providing guidance around client settings
● Producer batching
● Acknowledgement timeout
● Receiver queue size
● Redelivery delay
● Unique consumer & producer names
Starting Pulsar consumer status recorder with config: {
"topicNames" : [ "persistent://…" ],
"topicsPattern" : null,
"subscriptionName" : "...",
"subscriptionType" : "Shared",
"subscriptionMode" : "Durable",
"receiverQueueSize" : 1000,
"acknowledgementsGroupTimeMicros" : 100000,
"negativeAckRedeliveryDelayMicros" : 500000,
"maxTotalReceiverQueueSizeAcrossPartitions" : 50000,
"consumerName" : null,
"ackTimeoutMillis" : 30000,
"tickDurationMillis" : 1000,
"priorityLevel" : 0,
"maxPendingChuckedMessage" : 10,
"autoAckOldestChunkedMessageOnQueueFull" : false,
"expireTimeOfIncompleteChunkedMessageMillis" : 60000,
"cryptoFailureAction" : "FAIL",
"properties" : { },
"readCompacted" : false,
"subscriptionInitialPosition" : "Latest",
"patternAutoDiscoveryPeriod" : 60,
"regexSubscriptionMode" : "PersistentOnly",
But something is still missing…
Agenda
Toast’s microservice ecosystem + Pulsar
Blue/green deployments at Toast (the problem)
Driving Pulsar adoption
Our Envoy Proxy control plane
“The Pulsar Toggle”
Deployment and elevation practices
service v1 service v2
HTTP ingress control plane
shared pulsar subscription
Deploying changes to Pulsar consumers is risky
service v1 service v2
service v1 service v2
Mismatch in tooling
Our platform for request-driven service deploys was well ahead of our Pulsar
platform, causing developer frustration
User frustration
Principle of least surprise
“In interface design, always do
the least surprising thing.”
- Basics of the Unix Philosophy
Elevations & deploys should
be safe, easy, uneventful!
Agenda
Toast’s microservice ecosystem + Pulsar
Blue/green deployments at Toast (the solution)
Driving Pulsar adoption
Our Envoy Proxy control plane
“The Pulsar Toggle”
Pulsar operational tooling
Elevations & deploys weren’t easy on Pulsar
Contrast: REST services & Pulsar (in 2019)

Question                                            | REST services | Pulsar consumers
Can I validate my deploy before prod traffic?       | ✅            | ❌
Can I validate with a small amount of prod traffic? | ✅            | ❌
Can I easily roll back?                             | ✅            | ❌
Can I easily roll forward?                          | ✅            | ❌
Pulsar Consumer Elevation Requirements
1. Elevate traffic to new consumers as they are set to “active” in the control plane.
2. Avoid building a single point of failure.
3. Make this reusable for other background processes at Toast.
4. No performance hit or extra infrastructure.
Some options we considered
Message Router Pattern
[Diagram: incoming topic, Router, Control Plane, blue topic, green topic, Deploy N, Deploy N + 1]
Some options we considered
Message Router Pattern - Problems
● But, the router is a single
point of failure
● More infrastructure to
monitor
● Two hops per message
Some options we considered
Feature Flags
● Apps use a feature flag to
know whether to connect
● But, not integrated with our
control plane
● Requires more setup for
each consumer
[Diagram: incoming topic, Deploy N, Deploy N + 1, FF Off, FF On]
Some options we considered
Pausing Inactive Consumers
● The Feature Flag approach
is close
○ No extra infrastructure
○ No extra hops
● But, we’d need to integrate
it into our control plane
● Is this possible with Pulsar?
[Diagram: incoming topic, Deploy N, Deploy N + 1, inactive, active]
Let’s see what the Pulsar source code has to say about pausing consumers.
What does Pulsar provide?
In Consumer.java:
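The code screenshot from this slide is not part of the text. In the Apache Pulsar Java client, the Consumer interface declares pause(), which stops requesting new messages from the broker until resume() is called. The toy model below is a sketch of that behavior using only the JDK; `Pausable`, `ToyConsumer` and its prefetch loop are hypothetical stand-ins, not Pulsar code.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

public class PauseSemanticsSketch {
    /** Hypothetical stand-in for the two Consumer methods discussed here. */
    interface Pausable {
        void pause();
        void resume();
    }

    /** Toy consumer that prefetches up to receiverQueueSize messages locally. */
    static final class ToyConsumer implements Pausable {
        private final Queue<String> broker;
        private final Queue<String> receiverQueue = new ArrayDeque<>();
        private final int receiverQueueSize;
        private boolean paused;

        ToyConsumer(Queue<String> broker, int receiverQueueSize, boolean startPaused) {
            this.broker = broker;
            this.receiverQueueSize = receiverQueueSize;
            this.paused = startPaused;
        }

        @Override public void pause() { paused = true; }

        @Override public void resume() { paused = false; refill(); }

        /** Requests more messages from the "broker" only while not paused. */
        private void refill() {
            while (!paused && receiverQueue.size() < receiverQueueSize && !broker.isEmpty()) {
                receiverQueue.add(broker.poll());
            }
        }

        /** Returns the next locally available message, or null if there is none. */
        String poll() {
            refill();
            return receiverQueue.poll();
        }
    }

    /** Polls until the consumer has nothing left to hand out. */
    static List<String> drain(ToyConsumer consumer) {
        List<String> out = new ArrayList<>();
        for (String m = consumer.poll(); m != null; m = consumer.poll()) {
            out.add(m);
        }
        return out;
    }

    public static void main(String[] args) {
        Queue<String> broker = new ArrayDeque<>(List.of("m1", "m2", "m3", "m4", "m5"));
        ToyConsumer consumer = new ToyConsumer(broker, 2, false);
        System.out.println(consumer.poll()); // prints m1 (m2 is prefetched too)
        consumer.pause();
        System.out.println(drain(consumer)); // prints [m2]: only what was already queued
        consumer.resume();
        System.out.println(drain(consumer)); // prints [m3, m4, m5]
    }
}
```

In this model, a message that was already prefetched is still delivered after pause(). That is one plausible reason the size of the receiver queue matters when consumers are paused and resumed.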
Will pause() and resume() work?
Question                                            | Pulsar consumers | Pulsar consumers with pause()
Can I validate my deploy before prod traffic?       | ❌               | ✅
Can I validate with a small amount of prod traffic? | ❌               | ❌
Can I easily roll back?                             | ❌               | ✅
Can I easily roll forward?                          | ❌               | ✅
What do operations look like if inactive consumers call pause()?
How do we get each consumer to call pause() or resume() at the right time?
How Would You Solve This?
● Pausing pulsar consumers is
easy. Knowing when to pause is
hard.
● Central control plane component
owns this data
● Let’s just poll that service
● What would that look like?
control plane
service Z
What’s Wrong With This?
● Used to be the pattern for
service discovery at Toast
● Subject to thundering herd
● Now, we leverage Envoy
control plane
service Z
Agenda
Toast’s microservice ecosystem + Pulsar
Blue/green deployments at Toast
Driving Pulsar adoption
Our Envoy Proxy control plane
“The Pulsar Toggle”
How We Leverage Envoy
Envoy at Toast
Envoy is a reverse proxy
Deployed as a sidecar, Envoy forwards requests upstream to their destination.
[Diagram: my-service sends GET /menus/v2/menuItems to envoy, which forwards GET /v2/menuItems to menus]
Envoy is eventually consistent
Routing changes are pushed asynchronously
Envoy sidecars across the fleet receive updates within ~1-2 min of the
change.
Control Plane
…
Envoy knows service status
It gets a push each time any deploy goes active or inactive
We can leverage this to pause() or resume() consumers.
Envoy direct responses
Using an interesting Envoy feature to avoid single points of failure
Envoy can intercept requests and reply with a direct response! This gets
the status info into the process where the Consumer is running.
*magic config*
GET /sidecar/v1/elevation/active
{ "active": true }
my-service envoy
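The Envoy configuration itself is not shown in the slides. From the service's side, though, the exchange is a plain HTTP GET against the local sidecar. The sketch below imitates it with the JDK's built-in HTTP server standing in for Envoy's direct response. The path and JSON body come from the slide; the loopback address, the ephemeral port and all class and method names are assumptions made for illustration.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class DirectResponseSketch {
    static final String STATUS_PATH = "/sidecar/v1/elevation/active";

    /** Starts a local stand-in for the sidecar that answers with a fixed body. */
    static HttpServer startFakeSidecar(boolean active) throws IOException {
        // Port 0 asks the OS for any free port (an assumption for this demo).
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        byte[] body = ("{ \"active\": " + active + " }").getBytes(StandardCharsets.UTF_8);
        server.createContext(STATUS_PATH, exchange -> {
            exchange.getResponseHeaders().add("Content-Type", "application/json");
            exchange.sendResponseHeaders(200, body.length);
            try (var out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        return server;
    }

    /** Asks the local sidecar whether this deploy is active. */
    static boolean fetchActive(int port) throws IOException, InterruptedException {
        HttpRequest request = HttpRequest
                .newBuilder(URI.create("http://127.0.0.1:" + port + STATUS_PATH))
                .GET()
                .build();
        HttpResponse<String> response =
                HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
        // Deliberately naive parsing; a real client would use a JSON library.
        return response.statusCode() == 200
                && response.body().replace(" ", "").contains("\"active\":true");
    }

    public static void main(String[] args) throws Exception {
        HttpServer sidecar = startFakeSidecar(true);
        try {
            System.out.println("active = " + fetchActive(sidecar.getAddress().getPort()));
        } finally {
            sidecar.stop(0);
        }
    }
}
```

Because the answer comes from a process on the same host, a status check like this never crosses the network to a central service.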
Agenda
Toast’s microservice ecosystem + Pulsar
Blue/green deployments at Toast
Driving Pulsar adoption
Our Envoy Proxy control plane
“The Pulsar Toggle”
The Pulsar Toggle
“Pulsar Toggle” implementation
Leveraging our Envoy Control Plane to toggle Pulsar consumers
A thread polls the locally-running Envoy instance and
toggles the Pulsar consumer as needed
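A minimal sketch of such a polling thread, using only the JDK, might look like the following. The status source is passed in as a `BooleanSupplier` (in practice, a call to the local Envoy endpoint) and the consumer is reduced to a hypothetical `Pausable` interface. The class names, the polling period, the choice to start paused and the choice to keep the last known state when a poll fails are all assumptions, not details from the talk.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

public class PulsarToggleSketch {
    /** Hypothetical stand-in for a consumer that supports pause() and resume(). */
    interface Pausable {
        void pause();
        void resume();
    }

    /** One polling step: compare the reported status with the current state. */
    static final class Toggle implements Runnable {
        private final BooleanSupplier isActive;
        private final Pausable consumer;
        private boolean running = false; // assumption: the consumer starts paused

        Toggle(BooleanSupplier isActive, Pausable consumer) {
            this.isActive = isActive;
            this.consumer = consumer;
        }

        @Override
        public synchronized void run() {
            boolean active;
            try {
                active = isActive.getAsBoolean();
            } catch (RuntimeException e) {
                return; // assumption: keep the last known state if the poll fails
            }
            if (active && !running) {
                consumer.resume();
                running = true;
            } else if (!active && running) {
                consumer.pause();
                running = false;
            }
        }
    }

    /** Runs the toggle on a daemon thread with a fixed delay between polls. */
    static ScheduledExecutorService start(Toggle toggle, long periodMillis) {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "pulsar-toggle");
            t.setDaemon(true);
            return t;
        });
        executor.scheduleWithFixedDelay(toggle, 0, periodMillis, TimeUnit.MILLISECONDS);
        return executor;
    }

    public static void main(String[] args) {
        boolean[] status = {false};
        Pausable consumer = new Pausable() {
            @Override public void pause() { System.out.println("pause"); }
            @Override public void resume() { System.out.println("resume"); }
        };
        Toggle toggle = new Toggle(() -> status[0], consumer);
        toggle.run();          // inactive and already paused: nothing happens
        status[0] = true;
        toggle.run();          // prints "resume"
        toggle.run();          // no change: nothing happens
        status[0] = false;
        toggle.run();          // prints "pause"
    }
}
```

The example drives `run()` by hand so the transitions are easy to follow; `start(toggle, periodMillis)` would schedule the same step on a background thread.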
Some “gotchas”
Eventually consistent
Consumers don’t pause immediately - updates
propagate with some latency
Start paused
There wasn’t a way to subscribe in a paused state, so we made a
patch to the Java client
More advanced elevation patterns
Currently we can’t support percent elevations of Pulsar
traffic onto new deploys
Receiver queue size
Critically important to tune this parameter of consumers
Results
~30 toggle users in prod, across Pulsar consumers & background workers
0 outages: no added load on any critical systems
2 contributions to open source: the Java client & the Camel integration
Increased adoption
2x new topics: developers are adding topics at twice the rate since the Pulsar toggle was released
[Chart: user adoption (linear)]
Users Love it!
65% increase in reported ease of use when deploying Pulsar consumer changes
46% decrease in reported risk associated with deploying Pulsar consumer changes
Positive feedback from satisfaction surveys with our users
Key Takeaways
Integration
Strong integration
with existing systems
is critical for org-wide
adoption.
Ease of Use
As we make our
Pulsar platform easier
to use, we see more
and more adoption.
Stability
Pulsar’s stability
through big growth
has been a killer
feature for us.
Kai Levy & Zach Walsh
Thank you!
klevy@toasttab.com
zachary.walsh@toasttab.com
We’re Hiring!
careers.toasttab.com


Blue-green deploys with Pulsar & Envoy in an event-driven microservice ecosystem - Pulsar Summit SF 2022

  • 1. Pulsar Summit San Francisco Hotel Nikko August 18 2022 Use Case Blue-green deploys with Pulsar & Envoy in an event-driven microservice ecosystem Kai Levy & Zach Walsh Toast, Inc.
  • 2. Kai and Zach both work on Toast’s Scale team, building shared infrastructure and solving problems of messaging, routing and persistence at scale. Kai Levy Senior Software Engineer Toast Zach Walsh Senior Software Engineer Toast
  • 3. Agenda Toast’s microservice ecosystem + Pulsar Blue/green deployments at Toast Driving Pulsar adoption Our Envoy Proxy control plane “The Pulsar Toggle”
  • 4. We empower the restaurant community to delight their guests, do what they love, and thrive
  • 6. Toast’s microservice ecosystem How it started How it’s going
  • 7. How it’s going (with Pulsar)
  • 8. 2018 Asynchronous messaging with RabbitMQ ● Order syncing between devices ● Change Data Capture (CDC) A History of Pulsar at Toast
  • 9. 2018 Asynchronous messaging with RabbitMQ ● Order syncing between devices ● Change Data Capture (CDC) A History of Pulsar at Toast 2019 Pulsar pilot ● Initial exploration & testing ● Cluster productionalization ● First features, such as migrating change data capture
  • 10. Persistence & Stability Seamless Pulsar failover ● RabbitMQ: potential stability issues + in-memory data-storage = lost messages ○ Manual maintenance was a big burden ● Pulsar’s data replication & automatic topic balancing eliminated these concerns
  • 11. Horizontal Scalability broker 0 … ● Supports adding more topics without manual provisioning ● Throughput has grown more than 5x without any change in architecture broker 1 broker 2 broker 3 broker n
  • 12. 2018 Asynchronous messaging with RabbitMQ ● Order syncing between devices ● Change Data Capture (CDC) A History of Pulsar at Toast 2019 Pulsar pilot ● Initial exploration & testing ● Cluster productionalization ● First features, such as migrating change data capture 2020 Full-fledged adoption ● Teams across Toast rapidly built features on top of Pulsar to help restaurants survive the pandemic ● Decorated streams built on Pulsar, which enabled more scalable consumers
  • 13. CDC notify-topic Domain service (Source of Truth) service2 service1 service3 Full-fledged adoption … serviceN
  • 14. CDC data decorator service notify-topic decorated-stream Domain service (Source of Truth) service1 … serviceN Full-fledged adoption
  • 15. Order status notifications Delivery & curbside arrival notifications for consumers - helping restaurants pivot to digital Full-fledged adoption Tip pool tracking Tip pooling information is kept up-to-date with orders information Loyalty points accrual Consumer-facing loyalty programs help Toast restaurants thrive Restaurant availability Third party platforms are notified when a restaurant goes offline
  • 16. 2018 Asynchronous messaging with RabbitMQ ● Order syncing between devices ● Change data capture (CDC) A History of Pulsar at Toast 2019 Pulsar pilot ● Initial exploration & testing ● Cluster productionalization ● First features, such as migrating change data capture 2020 Full-fledged adoption ● Teams across Toast rapidly built features on top of Pulsar to help restaurants survive the pandemic ● Decorated streams built on Pulsar, which enabled more scalable consumers 2022 Next-gen order processing ● Critical replatforming projects in development will help Toast reach the next level of scale ● Event-driven architecture being widely used for new features
  • 17. Agenda Toast’s microservice ecosystem + Pulsar Blue/green deployments at Toast Driving Pulsar adoption Our Envoy Proxy control plane “The Pulsar Toggle”
  • 18. Pulsar adoption has grown steadily user adoption (linear)
  • 19. Toast client libraries Providing Toast-specific functionality for free 1. Out-of-box authentication 2. Dead-letter topic guidance (+ topic registries) 3. Metric instrumentation 4. Message parsing 5. Pulsar client configuration +
  • 20. Authentication & authorization ● Automatic service authentication provided by the client libraries ○ Easy to use with any of our supported application frameworks ● Contributed a patch into the public Java client library
  • 21. Dead-Letter Topics ● Standards for undeliverable messages ○ Per-subscription DLQs, or automatic acknowledgement after redelivery ○ Integrated with service configuration
  • 22. Topic registries with terraform ● Started with in-house provider ○ Now migrating to StreamNative provider ● Lets us manage namespace authorization ● Provide defaults for retention & persistence ● Central place for discovering events Developers write infrastructure as code
  • 23. Metrics ● Automatically report over 2 dozen metrics ○ Consistent across services ● Critical for operations & monitoring ● Added our own custom metrics ● Adding APM integrations ackLatency ackTimeouts auto-acknowledgements
  • 24. Message Parsing We parse Protobuf messages into friendly Kotlin data classes ● Our open-source, Kotlin-first protocol buffer compiler ● One-line usage for engineers building on our client
  • 25. Configuration recommendations Providing guidance around client settings ● Producer batching ● Acknowledgement timeout ● Receiver queue size ● Redelivery delay ● Unique consumer & producer names Starting Pulsar consumer status recorder with config: { "topicNames" : [ "persistent://…" ], "topicsPattern" : null, "subscriptionName" : "...", "subscriptionType" : "Shared", "subscriptionMode" : "Durable", "receiverQueueSize" : 1000, "acknowledgementsGroupTimeMicros" : 100000, "negativeAckRedeliveryDelayMicros" : 500000, "maxTotalReceiverQueueSizeAcrossPartitions" : 50000, "consumerName" : null, "ackTimeoutMillis" : 30000, "tickDurationMillis" : 1000, "priorityLevel" : 0, "maxPendingChuckedMessage" : 10, "autoAckOldestChunkedMessageOnQueueFull" : false, "expireTimeOfIncompleteChunkedMessageMillis" : 60000, "cryptoFailureAction" : "FAIL", "properties" : { }, "readCompacted" : false, "subscriptionInitialPosition" : "Latest", "patternAutoDiscoveryPeriod" : 60, "regexSubscriptionMode" : "PersistentOnly",
  • 26. But something is still missing…
  • 27. Agenda Toast’s microservice ecosystem + Pulsar Blue/green deployments at Toast (the problem) Driving Pulsar adoption Our Envoy Proxy control plane “The Pulsar Toggle”
  • 28. Deployment and elevation practices service v1 service v2 HTTP ingress control plane
  • 29. Deployment and elevation practices service v2 service v1 HTTP ingress control plane
  • 30. Deployment and elevation practices service v1 service v2 HTTP ingress control plane service v2
  • 31. Deployment and elevation practices service v1 service v2 HTTP ingress control plane service v2
  • 32. Deployment and elevation practices service v1 service v2 HTTP ingress control plane service v2
  • 33. Deployment and elevation practices service v1 service v2 HTTP ingress control plane service v2
  • 34. Deployment and elevation practices service v1 service v2 HTTP ingress control plane service v2
  • 35. service v1 Deployment and elevation practices service v2 HTTP ingress control plane service v2
  • 36. shared pulsar subscription Deploying changes to Pulsar consumers is risky service v1 service v2 service v1 service v2
  • 37. Mismatch in tooling Our platform for request-driven service deploys was well ahead of our Pulsar platform, causing developer frustration
  • 39. Principle of least surprise “In interface design, always do the least surprising thing.” - Basics of the Unix Philosophy
  • 40. Elevations & deploys should be safe, easy, uneventful!
  • 41. Agenda Toast’s microservice ecosystem + Pulsar Blue/green deployments at Toast (the solution) Driving Pulsar adoption Our Envoy Proxy control plane “The Pulsar Toggle”
  • 42. Pulsar operational tooling Elevations & deploys weren’t easy on Pulsar REST services Pulsar consumers Can I validate my deploy before prod traffic? ✅ ❌ Can I validate with a small amount of prod traffic? ✅ ❌ Can I easily roll back? ✅ ❌ Can I easily roll forward? ✅ ❌ Contrast: REST services & Pulsar (in 2019)
  • 43. Pulsar Consumer Elevation Requirements 1. Elevate traffic to new consumers as they are set to “active” in the control plane. 2. Avoid building a single point of failure. 3. Make this reusable for other background processes at Toast. 4. No performance hit or extra infrastructure.
  • 44. Some options we considered: Message Router Pattern [Diagram labels: incoming topic, Router, Control Plane, blue topic, green topic, Deploy N, Deploy N + 1]
  • 45. Some options we considered: Message Router Pattern, Problems ● But, the router is a single point of failure ● More infrastructure to monitor ● Two hops per message [Diagram labels: incoming topic, Router, Control Plane, blue topic, green topic, Deploy N, Deploy N + 1]
  • 46. Some options we considered: Feature Flags ● Apps use a feature flag to know whether to connect ● But, not integrated with our control plane ● Requires more setup for each consumer [Diagram labels: incoming topic, Deploy N (FF Off), Deploy N + 1 (FF On)]
  • 47. Some options we considered: Pausing Inactive Consumers ● The Feature Flag approach is close ○ No extra infrastructure ○ No extra hops ● But, we’d need to integrate it into our control plane ● Is this possible with Pulsar? [Diagram labels: incoming topic, Deploy N (inactive), Deploy N + 1 (active)]
  • 48. What does Pulsar provide? Let’s see what the Pulsar source code has to say about pausing consumers. In Consumer.java: [code screenshot not captured in this transcript]
  • 49. Will pause() and resume() work? What do operations look like if inactive consumers call pause()? ● Can I validate my deploy before prod traffic? Pulsar consumers ❌, Pulsar consumers with pause() ✅ ● Can I validate with a small amount of prod traffic? Pulsar consumers ❌, Pulsar consumers with pause() ❌ ● Can I easily roll back? Pulsar consumers ❌, Pulsar consumers with pause() ✅ ● Can I easily roll forward? Pulsar consumers ❌, Pulsar consumers with pause() ✅
  • 50. How Would You Solve This? How do we get each consumer to call pause() or resume() at the right time? ● Pausing Pulsar consumers is easy. Knowing when to pause is hard. ● A central control plane component owns this data ● Let’s just poll that service ● What would that look like? [Diagram labels: control plane, service Z]
  • 51. What’s Wrong With This? ● Polling used to be the pattern for service discovery at Toast ● It is subject to the thundering herd problem ● Now, we leverage Envoy [Diagram labels: control plane, service Z]
  • 52. Agenda Toast’s microservice ecosystem + Pulsar Blue/green deployments at Toast Driving Pulsar adoption Our Envoy Proxy control plane “The Pulsar Toggle”
  • 53. How We Leverage Envoy Envoy at Toast
  • 54. Envoy is a reverse proxy: Deployed as a sidecar, it forwards requests to their destination. Envoy acts as a proxy, forwarding requests upstream. [Diagram labels: my-service, envoy, menus; requests shown: GET /menus/v2/menuItems, GET /v2/menuItems]
  • 55. Envoy is eventually consistent: Routing changes are pushed asynchronously. Envoy sidecars across the fleet receive updates within ~1-2 min of the change. [Diagram labels: Control Plane, …]
  • 56. Envoy knows service status: It gets a push each time any deploy goes active or inactive. We can leverage this to pause() or resume() consumers.
  • 57. Envoy direct responses: Using an interesting Envoy feature to avoid single points of failure. It can intercept requests and reply with a direct response! This gets the status info into the process where the Consumer is running. [Diagram: my-service sends GET /sidecar/v1/elevation/active to envoy, which, via *magic config*, replies { "active": true }]
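To make the direct-response idea concrete, here is a minimal sketch of how an application might read that status from its local sidecar. Only the path `/sidecar/v1/elevation/active` and the body shape `{ "active": true }` come from the slide. The class name `ElevationStatus`, the sidecar base URI passed in by the caller, the one-second timeout, and the regex-based parsing (used here instead of a JSON library to keep the example dependency-free) are all assumptions made for illustration, not Toast's implementation.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: reads the elevation status served by the local Envoy sidecar. */
final class ElevationStatus {
    // Matches the "active" field of a body such as { "active": true }.
    private static final Pattern ACTIVE =
            Pattern.compile("\"active\"\\s*:\\s*(true|false)");

    private ElevationStatus() {}

    /** Extracts the boolean "active" field from the direct-response body. */
    static boolean parseActive(String body) {
        Matcher m = ACTIVE.matcher(body);
        if (!m.find()) {
            throw new IllegalArgumentException("no \"active\" field in: " + body);
        }
        return Boolean.parseBoolean(m.group(1));
    }

    /**
     * Issues the GET shown on the slide against the sidecar.
     * sidecarBase (for example http://localhost:<port>) is an assumption;
     * the transcript does not say where the sidecar listens.
     */
    static boolean fetchActive(HttpClient client, URI sidecarBase)
            throws IOException, InterruptedException {
        HttpRequest request = HttpRequest
                .newBuilder(sidecarBase.resolve("/sidecar/v1/elevation/active"))
                .timeout(Duration.ofSeconds(1)) // local call, so fail fast
                .GET()
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("unexpected status " + response.statusCode());
        }
        return parseActive(response.body());
    }
}
```

Because the request never leaves the host, each service instance asks only its own sidecar, which is what keeps this approach from adding load to the central control plane.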
  • 58. Agenda Toast’s microservice ecosystem + Pulsar Blue/green deployments at Toast Driving Pulsar adoption Our Envoy Proxy control plane “The Pulsar Toggle”
  • 60. “Pulsar Toggle” implementation: Leveraging our Envoy Control Plane to toggle Pulsar consumers. A thread polls the locally-running Envoy instance and toggles the Pulsar consumer as needed.
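The polling loop described on this slide could be sketched roughly as follows. This is an illustration under stated assumptions, not the actual Toggle code: `Pausable` is a hypothetical stand-in for the `pause()` and `resume()` methods of the Pulsar Java client's `Consumer`, so the example compiles without the Pulsar client; `PulsarToggle`, the `BooleanSupplier` status check (which in practice would wrap an HTTP GET to the local sidecar), and the polling period are likewise invented for the sketch.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

/** Hypothetical stand-in for the pause()/resume() pair on a Pulsar Consumer. */
interface Pausable {
    void pause();
    void resume();
}

/** Sketch: a background thread that toggles a consumer based on elevation status. */
final class PulsarToggle implements AutoCloseable {
    private final BooleanSupplier activeCheck; // assumed: asks the local sidecar
    private final Pausable consumer;
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "pulsar-toggle");
                t.setDaemon(true); // never keep the JVM alive just for polling
                return t;
            });
    private boolean paused;

    PulsarToggle(BooleanSupplier activeCheck, Pausable consumer, boolean startsPaused) {
        this.activeCheck = activeCheck;
        this.consumer = consumer;
        this.paused = startsPaused;
    }

    /** Starts polling; the period is a tuning choice, not a value from the talk. */
    void start(long period, TimeUnit unit) {
        scheduler.scheduleWithFixedDelay(this::pollOnce, 0, period, unit);
    }

    /** One poll: call pause() or resume() only when the status has changed. */
    synchronized void pollOnce() {
        try {
            boolean active = activeCheck.getAsBoolean();
            if (active && paused) {
                consumer.resume();
                paused = false;
            } else if (!active && !paused) {
                consumer.pause();
                paused = true;
            }
        } catch (RuntimeException e) {
            // Keep the last known state if the check fails; swallowing the
            // exception also keeps the scheduled task from being cancelled.
        }
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }
}
```

A caller would construct the toggle with its consumer, call `start(...)` once after subscribing, and `close()` it on shutdown. Acting only on state transitions keeps the loop cheap: most polls do nothing beyond the local status check.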
  • 61. Some “gotchas” ● Eventually consistent: Consumers don’t pause immediately; updates propagate with some latency ● Start paused: There wasn’t a way to subscribe in a paused state, so we made a patch to the Java client ● More advanced elevation patterns: Currently we can’t support percent elevations of Pulsar traffic onto new deploys ● Receiver queue size: It is critically important to tune this consumer parameter
  • 62. Results ● ~30 Toggle users in prod, across Pulsar consumers & background workers ● 0 outages: no added load on any critical systems ● 2 contributions to open source: the Java client & the Camel integration
  • 63. Increased adoption: 2x new topics. Developers are adding topics at twice the rate since the Pulsar Toggle was released. [Chart: user adoption (linear)]
  • 64. Users love it! ● 65% Increase: reported ease of use when deploying Pulsar consumer changes ● 46% Decrease: reported risk associated with deploying Pulsar consumer changes ● Positive feedback from satisfaction surveys with our users
  • 65. Key Takeaways ● Integration: Strong integration with existing systems is critical for org-wide adoption. ● Ease of Use: As we make our Pulsar platform easier to use, we see more and more adoption. ● Stability: Pulsar’s stability through big growth has been a killer feature for us.
  • 66. Thank you! Kai Levy & Zach Walsh (klevy@toasttab.com, zachary.walsh@toasttab.com). Pulsar Summit San Francisco, Hotel Nikko, August 18 2022. We’re Hiring! careers.toasttab.com