Pulsar Virtual Summit North America 2021
Apache Pulsar:
Why Unified Messaging
and Streaming Is the
Future
Matteo Merli, Sijie Guo
@ Pulsar PMC
Who are we?
● Sijie Guo (@sijieg)
● CEO, StreamNative
● PMC Member of Pulsar/BookKeeper
● Ex Co-Founder, Streamlio
● Ex Twitter
● Matteo Merli (@merlimat)
● CTO, StreamNative
● Co-creator and PMC chair of Pulsar
● Ex Co-Founder, Streamlio
● Ex Yahoo!
StreamNative
Founded by the creators of Apache Pulsar, StreamNative provides a
cloud-native, unified messaging and streaming platform powered by
Apache Pulsar to support multi-cloud and hybrid-cloud strategies
Announcing StreamNative Platform 1.0
✓ Pulsar Transactions
✓ Kafka-on-Pulsar
✓ Function Mesh for serverless streaming
✓ Enterprise-ready security
✓ Pulsar Operators
✓ Seamless StreamNative Cloud experience
Pulsar Trends
Kafka -> Pulsar
Scale
Cloud-Native
Pulsar + Flink
Pulsar at Scale
More companies in Production
Pulsar at Scale
Hit Trillion Messages Per Day
Cloud-Native
Kubernetes Drives Adoption of Pulsar
✓ 80% of Pulsar users deploy Pulsar in a cloud environment
✓ 62% of Pulsar users deploy Pulsar on Kubernetes
✓ 49% noted Pulsar’s Cloud-Native capabilities as one of the
top reasons they chose to adopt Pulsar
Cloud-Native
Built for Kubernetes
VM / Early Cloud Era:
● Single Cloud Provider
● Monolithic Architectures
● Single Tenant Systems
● No Geo-replication
Containers / Modern Cloud Era:
● Containers
● Cloud Native
● Microservices
● Hybrid & MultiCloud
Pulsar + Flink
Unified Stream and Batch
Kafka to Pulsar
More and More Kafka Users Adopt Pulsar
✓ 68% of respondents use Kafka in addition to Pulsar
✓ 34% of respondents use or plan to use Kafka-on-Pulsar
✓ Kafka and Pulsar serve different use cases
✓ Once adopted, Pulsar usage expands across organizations
Pulsar Adoption Use Cases
Adopted Pulsar to replace Kafka in their DSP (Data Streaming Platform):
● 1.5-2x lower capex cost
● 5-50x improvement in latency
● 2-3x lower opex
● 10 PB / day
Adopted Pulsar to power their billing platform, Midas, which processes hundreds of billions of financial transactions daily. Adoption then expanded to Tencent’s Federated Learning Platform and Tencent Gaming.
Use cases required a scalable message queue to replace RabbitMQ for serving mission-critical business applications. Now in the process of expanding use cases to build data streaming services.
Modern Data Needs
Messaging + Streaming
Messaging
● Queueing systems are ideal for work
queues that do not require tasks to
be performed in a particular order—
for example, sending one email
message to many recipients.
● RabbitMQ and Amazon SQS are
examples of popular queue-based
message systems.
Streaming
● Streaming works best in situations
where the order of messages is
important—for example, data
ingestion.
● Kafka and Amazon Kinesis are
examples of messaging systems that
use streaming semantics for
consuming messages.
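The distinction can be sketched in plain Python (a toy simulation, not any client API): a work queue delivers each message to exactly one of several competing consumers, while a stream preserves order and lets every subscriber replay the full sequence.

```python
from collections import deque
from itertools import cycle

def dispatch_queue(messages, consumers):
    """Queueing (messaging) semantics: each message goes to ONE consumer,
    distributed round-robin across competing workers."""
    received = {c: [] for c in consumers}
    targets = cycle(consumers)
    q = deque(messages)
    while q:
        received[next(targets)].append(q.popleft())
    return received

def dispatch_stream(messages, subscribers):
    """Streaming semantics: every subscriber independently replays
    the same ordered log of messages."""
    return {s: list(messages) for s in subscribers}

msgs = ["m0", "m1", "m2", "m3"]
# Work queue: messages are split among workers
assert dispatch_queue(msgs, ["worker-1", "worker-2"]) == {
    "worker-1": ["m0", "m2"], "worker-2": ["m1", "m3"]}
# Stream: both subscribers see all messages, in publish order
assert dispatch_stream(msgs, ["analytics", "ingest"]) == {
    "analytics": msgs, "ingest": msgs}
```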
Data in motion
Typical Architecture
E-Commerce w/o Pulsar
✓ Separate storage
✓ Tiering outside toolset
✓ Separate application and
data domains
✓ Different tech stacks
Why not a single system that supports both messaging and streaming?
E-Commerce with Pulsar
✓ Unified storage for in-motion data
✓ Native tiered storage
✓ Single system to
exchange data
✓ Teams share toolset
Build Apache Pulsar for
unified messaging and
streaming
Step 1: A scalable storage for streams of data
Step 2: Separate serving from storage
[Diagram: producers and consumers talk to an Apache Pulsar serving layer (Broker 0, Broker 1, Broker 2), which stores data in an Apache BookKeeper layer (Bookie 0 through Bookie 4)]
Step 3: Unified API
[Diagram: Producer 1 and Producer 2 publish messages m0–m4 to a Pulsar topic/partition, consumed through four subscription types spanning streaming and messaging semantics:
● Exclusive (Subscription A): a single consumer (A-0) receives all messages
● Failover (Subscription B): Consumer B-1 takes over in case of failure in Consumer B-0
● Shared (Subscription C): messages are distributed across Consumers C-1, C-2, C-3
● Key-Shared (Subscription D): messages with the same key always go to the same consumer]
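The four subscription types can be illustrated with a small dispatch simulation (toy code, not the Pulsar client API; consumer-selection policies are simplified, e.g. Shared is shown as strict round-robin):

```python
import zlib

MESSAGES = [("k1", "m0"), ("k2", "m1"), ("k1", "m2"), ("k3", "m3"), ("k2", "m4")]

def exclusive(messages, consumers):
    # Exclusive: only the single attached consumer receives messages.
    return {consumers[0]: [m for _, m in messages]}

def failover(messages, consumers, failed=None):
    # Failover: the first healthy consumer gets everything; others are standbys.
    active = next(c for c in consumers if c != failed)
    return {active: [m for _, m in messages]}

def shared(messages, consumers):
    # Shared: messages are distributed across consumers (round-robin here).
    out = {c: [] for c in consumers}
    for i, (_, m) in enumerate(messages):
        out[consumers[i % len(consumers)]].append(m)
    return out

def key_shared(messages, consumers):
    # Key-Shared: all messages with the same key go to the same consumer.
    out = {c: [] for c in consumers}
    for k, m in messages:
        out[consumers[zlib.crc32(k.encode()) % len(consumers)]].append(m)
    return out

assert exclusive(MESSAGES, ["A-0", "A-1"]) == {"A-0": ["m0", "m1", "m2", "m3", "m4"]}
assert failover(MESSAGES, ["B-0", "B-1"], failed="B-0") == {"B-1": ["m0", "m1", "m2", "m3", "m4"]}
assert shared(MESSAGES, ["C-1", "C-2"]) == {"C-1": ["m0", "m2", "m4"], "C-2": ["m1", "m3"]}
# Key-Shared preserves per-key ordering: k1's m0 precedes m2 on whichever consumer owns k1.
ks = key_shared(MESSAGES, ["D-1", "D-2", "D-3"])
owner = next(c for c, ms in ks.items() if "m0" in ms)
assert ks[owner].index("m0") < ks[owner].index("m2")
```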
Step 3: Unified API
[Diagram: a Pub/Sub API serving Publisher/Subscriber applications (microservices or event-driven architecture), alongside a Reader and Batch API serving stream processor applications]
Step 4: Schema API
[Diagram: the layered stack from Step 3, with a Schema API now spanning both the Pub/Sub API and the Reader and Batch API]
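A hedged sketch of the idea in plain Python (the real client attaches, e.g., an Avro or JSON schema to producers and consumers; here a trivial type check stands in for the broker's schema validation, and the `ORDER_SCHEMA` record is invented for illustration):

```python
# Toy schema: field name -> required Python type
ORDER_SCHEMA = {"order_id": int, "sku": str, "quantity": int}

def validates(record, schema):
    """Broker-side check: a published record must match the
    schema registered on the topic, field for field."""
    return set(record) == set(schema) and all(
        isinstance(record[f], t) for f, t in schema.items()
    )

good = {"order_id": 42, "sku": "SKU-1", "quantity": 3}
bad = {"order_id": "42", "sku": "SKU-1"}  # wrong type, missing field

assert validates(good, ORDER_SCHEMA)      # accepted for publish
assert not validates(bad, ORDER_SCHEMA)   # rejected by the broker
```

Because the schema lives with the topic, consumers can discover it and stay fully type safe without being hard-coded against one producer.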
Step 5: Functions and IO API
[Diagram: the layered stack from Step 4 extended with a Functions API and Pulsar IO/Connectors (prebuilt and custom connectors)]
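A Pulsar Function is essentially a per-message transformation: consume from an input topic, transform, publish to an output topic. A minimal sketch in plain Python (in the real Python SDK the class extends `pulsar.Function` and `process` receives a runtime `Context`; both are stubbed out here):

```python
class ExclamationFunction:
    """Sketch of a Pulsar Function: the runtime invokes process()
    once for every message arriving on the input topic, and publishes
    the return value to the output topic."""

    def process(self, input, context=None):
        # Simple enrichment/transformation of the message payload
        return input + "!"

fn = ExclamationFunction()
assert fn.process("hello pulsar") == "hello pulsar!"
```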
Step 6: Tiered Storage
[Diagram: the layered stack from Step 5 (Pub/Sub API, Reader and Batch API, Schema API, Functions API, Pulsar IO/Connectors) with Tiered Storage added underneath]
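Tiered storage is configured on the broker; a hedged example of offloading closed ledger segments to S3 (property names as found in Apache Pulsar's broker.conf, but check them against the docs for your version; bucket and region values are placeholders):

```properties
# Offload closed ledger segments to long-term object storage
managedLedgerOffloadDriver=aws-s3
s3ManagedLedgerOffloadBucket=my-pulsar-offload-bucket
s3ManagedLedgerOffloadRegion=us-west-2
# Trigger offload automatically once a topic's ledger data exceeds ~1 GB
managedLedgerOffloadAutoTriggerSizeThresholdBytes=1073741824
```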
Step 7: Protocol Handlers
[Diagram: Apache Pulsar with pluggable protocol handlers: the native Pulsar protocol handler serves Pulsar clients (queue + stream), while Kafka, AMQP, and MQTT protocol handlers serve Kafka, AMQP, and MQTT clients respectively]
Step 8: Transaction API
[Diagram: the full layered stack (Pub/Sub API, Reader and Batch API, Schema API, Functions API, Pulsar IO/Connectors, Tiered Storage) with a Transaction API added alongside the client APIs]
Pulsar 2.8: towards a complete vision of unified messaging and streaming
The future of Pulsar
Towards a self-adjusting
data platform
✓ Tuning data platforms to run at scale is hard
✓ Lots of configurations
✓ Requires in-depth knowledge of internals
✓ Workloads are constantly changing
Topic auto-partitioning
✓ Partitions are an artifact of implementation
✓ It’s not a natural property of the data
✓ Abstract the partitioning away from users
✓ Partitions are automatically split / merged
✓ Rethink how the API should look
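The point can be illustrated with a toy hash router (not Pulsar code): the application only ever names a topic and a key; which partition a key lands on is an internal detail, so the system is free to split or merge partitions without the caller noticing.

```python
import zlib

def route(key, num_partitions):
    """Internal detail: map a message key to one of N partitions by hash."""
    return zlib.crc32(key.encode()) % num_partitions

# The application API stays the same while the system splits partitions:
key = "customer-1138"
before = route(key, 4)   # partition under the old layout
after = route(key, 8)    # partition after an automatic split
assert 0 <= before < 4 and 0 <= after < 8
# The caller never named a partition, so the split is invisible to it.
```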
Self-Adjusting Storage
✓ Ensure optimal utilization of hardware
✓ No configuration
✓ Automatically adjust strategies based on changing conditions:
✓ Disk access
✓ Cache management
✓ Queue sizes
Pulsar Functions
✓ The foundation is now mature — UX is still poor
✓ Simpler tooling to create & manage functions
✓ CI/CD integration — Versioning — A/B testing
✓ Observability & Debuggability
✓ Improve support for Go and Python functions
✓ DSL — Provide higher level constructs to process data
Stream Storage
✓ Evolve the current state of Tiered Storage
✓ Integrate with data lake technologies
Working with the data
community
Editor's Notes
  • #6: Before diving into the “Unified Messaging and Streaming”, let’s take a look at the trends in Pulsar community.
  • #14: To understand what is happening behind the scenes, we need to rewind to the early days of Pulsar. Back in 2012, when we first set out to build Pulsar, we thought there should be a global geo-replicated infrastructure for all the messaging data. We didn’t start with the idea of making our own software, but started by observing the gaps in the existing technologies available at the time and realized how they were insufficient to serve the needs of a data-driven organization.
  • #15: Talking about these 2 different worlds. Messaging - read slide. These are like commands that represent changes that need to be made to the system. An example: we send a message that says “Process this order” or “change user to be deleted”, but we don’t actually perform that change, just notify. Messaging systems are selected when synchronous communication breaks down. In contrast, streaming systems deal with events - the state changes themselves. So instead of sending a message saying this user wants to update their email, we actually perform the update. Events interlinked together may be persisted, replayed or aggregated.
  • #17: Instructor Notes: What we have here is an example of what we might see in a modern organization that has run into both these issues. We have basically 2 different regimes or 2 different worlds - different teams. Historically, these worlds often seem very different, with entirely different tech stacks and entirely different teams. However, as data becomes more critical in informing applications, applications need to make more use of what data teams and data services are producing. Likewise, getting the data out of applications and into the data realm has forced organizations to get better at doing both of these things really well. This can be a real challenge. So on the left we have the application side: applications that interact via messages, deal with the aspects of running your systems, and provide capabilities focused on business concerns. On the right side we have services that deal with the data, in bulk and at scale. Sometimes the right side includes real-time or batch processes such as sending large amounts of data, putting it into data lakes, computing answers about it, sending data to other services, or providing that data to other orgs that need it. These 2 worlds generally use different technologies, different tools and different processes - all leading to more complexity and cost.
  • #18: Read slide. Separate storage/transport systems for messaging, streaming, and big-data; focus on separate ETL processes. Messaging helps decouple apps and provides reliable async communication and work queues in core applications. Streaming allows for “medium-term” storage of streams (~30 days), aggregating streams of data, and real-time processing for near real-time analytics. Batch processing and long-term object storage (S3, HDFS, etc.) allow for processing historical data to learn from the past. “Tiering” of data from messaging -> streaming -> object storage is outside the core toolset and is maintained explicitly. Application and data domains are separated; data is replicated into the data domain. Results from the data domain are loaded (ETL) back into the application domain. Multiple teams with very different technology stacks. ==== To show how Pulsar provides that ability to be transformative, here is a common example of an e-commerce system stack that contains both a streaming set of services and data processing. On the application side we have order services, inventory service and fulfillment. Talk to each service (think Amazon). On the data side we have Spark - some batch processing using Spark; Flink - real-time inventory analysis using Flink. Another use case may be some long-term storage needs versus short term (30 days), then a data warehouse layer. Imagine a person ordering something, then checking inventory and it isn’t there. Do you delete the order or put it on backorder? Once the inventory gets replenished, how do we notify the customers that their order is now coming? So we need to join both sides together.
  • #20: It is very natural to merge both. Talk about how the technologies have evolved in a way that is able to support both. Read slide and add more context: “Unified” storage/transport of messages and streams with access to underlying data. Messaging - decoupled applications with pub/sub: shared subscriptions for work queues, exclusive subscriptions for fanout, and point-to-point messaging with flexibly large numbers of non-partitioned topics. Streaming - ordered, scalable partitioned topics with failover and key-shared subscriptions; pub/sub (broker controlled) or reader API (client controlled) for advanced stream processing, replay, etc. Big-data batch access - underlying segments of topics can be read directly, allowing for scale-out parallelism. Tiered storage is core to Pulsar, no need for external tools. Application and data domains use a single system to exchange data, with converged “messaging” and “streaming”. One or many teams, with a shared toolset. Talk to the diagram: on the left side say how Pulsar can process real-time streams, and on the right how it can do batch processing, offload to tiered storage, read back in parallel batch fashion, and even provide a stream back to other systems for consumption. Order services, inventory service and fulfillment still work from the messaging domain (use cases not too different), but now can support processing at much higher scale; any messages they have are kept in Pulsar as a single source of truth, and these messages can be offloaded via Pulsar to long-term storage. Pulsar also provides the power to enable a unified batch and streaming job that can do batch processing by reading from underlying storage and combine that with real-time streams, all with a single technology.
  • #21: Let's take a retrospective look at how Pulsar has evolved through the years. When we started designing Pulsar as a new platform, we always had this idea of supporting both the Pub-Sub semantics as well as the data streaming pipelines, which at the time were a new and emerging thing. But it would be a lie to say we had everything pre-planned since the beginning. Instead, we spent a lot of time observing how people used these platforms and we tried to fill all the gaps we were seeing, evolving Pulsar with the changing needs of data applications.
  • #22: At the very core of Pulsar there has always been the concept of the "log". A distributed, replicated and immutable ledger where all the events are appended. BookKeeper has proved, throughout the years, to be the best storage solution for streams of data. It scales to a very large number of logs, it offers consistency, durability, low latency and high throughput and, more importantly, very convenient operational tooling. To summarize: using the log as a building block does a lot of the heavy lifting required to build a truly scalable system.
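The note's point that the log does the heavy lifting can be sketched as a toy append-only log with independent readers (a simulation of the concept, not BookKeeper's API):

```python
class Log:
    """Toy append-only log: writers append, readers keep their own offsets."""

    def __init__(self):
        self.entries = []

    def append(self, entry):
        self.entries.append(entry)
        return len(self.entries) - 1  # entry id; immutable once written

    def read(self, offset):
        # Readers replay from any position, independently of each other
        return self.entries[offset:]

log = Log()
for e in ["created", "paid", "shipped"]:
    log.append(e)

# Two readers at different positions see a consistent, ordered history
assert log.read(0) == ["created", "paid", "shipped"]
assert log.read(2) == ["shipped"]
```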
  • #23: Another architectural choice that came naturally from using BookKeeper has been the separation of the storage from the data serving layer. This comes from BookKeeper because BookKeeper requires a single writer for each log. In our case the Broker acts as that single writer. This multi-layer architecture was exactly what we needed because it allows Pulsar to have: 1. Stateless brokers - topics can be easily moved across brokers without copying any data, for example when expanding the cluster or adjusting topic assignments after changing conditions. 2. Data locality - Because of this broker layer, the data for a single topic or partition does not have to be stored in one single storage node. Instead we can fully utilize the resources of the entire cluster.
  • #24: We just said that the log is the building block of Pulsar... but the log on its own is a very low-level construct. Applications very often need much more sophisticated ways of interacting with the data than just reading through the log of events. Instead, we wanted to capture the right level of semantics needed to support a wide range of pub-sub and streaming use cases. The core idea was to leave the flexibility to consume data from topics in multiple different ways, depending on what the application needs. We ended up with 4 subscription types with different semantics and different properties, each one with its own merits.
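The four subscription types (Exclusive, Failover, Shared, Key_Shared) can be sketched as dispatch policies over the same log. This toy Python dispatcher is an assumption-laden illustration of the semantics, not how the broker implements them:

```python
import itertools

def dispatch(messages, consumers, mode):
    """Toy dispatcher mirroring Pulsar's four subscription types.
    messages are (key, payload) pairs; consumers is a list of names."""
    out = {c: [] for c in consumers}
    if mode in ("exclusive", "failover"):
        # a single active consumer receives everything; failover would
        # promote the next consumer if the active one disconnects
        for m in messages:
            out[consumers[0]].append(m)
    elif mode == "shared":
        rr = itertools.cycle(consumers)      # round-robin work queue
        for m in messages:
            out[next(rr)].append(m)
    elif mode == "key_shared":
        # the same key always lands on the same consumer,
        # preserving per-key ordering across the group
        for key, payload in messages:
            out[consumers[sum(key.encode()) % len(consumers)]].append((key, payload))
    return out

msgs = [("user-a", 1), ("user-b", 2), ("user-a", 3), ("user-c", 4)]
print(dispatch(msgs, ["c1", "c2"], "shared"))
print(dispatch(msgs, ["c1", "c2"], "key_shared"))
```

Shared trades ordering for throughput (a work queue); Key_Shared restores ordering per key while still spreading load.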
  • #25: After the Pub/Sub API, the next addition was the Reader API. You can think of it as the "unmanaged" way to consume data from a topic. While there are many reasons for using a reader, the main users are typically stream processing frameworks, because they tend to have their own checkpointing mechanisms, or, similarly, batch systems that want to do a scan of the historical data.
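The "unmanaged" nature of a reader comes down to who owns the position. A minimal sketch, assuming an illustrative `Reader` class (not the real client API): the client tracks and moves the position itself, with no broker-side acknowledgement involved.

```python
class Reader:
    """'Unmanaged' consumption: the client, not the broker, owns the cursor."""
    def __init__(self, topic, start_position=0):
        self.topic, self.position = topic, start_position

    def read_next(self):
        msg = self.topic[self.position]
        self.position += 1          # advanced by the client, never the broker
        return msg

    def seek(self, position):
        self.position = position    # e.g. rewind to a framework checkpoint

topic = ["e0", "e1", "e2", "e3"]
reader = Reader(topic, start_position=1)
print(reader.read_next())  # 'e1'
reader.seek(0)             # replay from the beginning
print(reader.read_next())  # 'e0'
```

This is exactly what a stream processor with its own checkpoints needs: on recovery it seeks back to the last checkpointed position and replays.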
  • #26: The common theme in the APIs exposed by Pulsar is the support for Schema. Having direct support for Schema inside Pulsar means that brokers can validate the schema of the data being published and that the expectations of consumers are matched as well. But it also means that it becomes very easy to "discover" the schema of the data. The discoverability of the schema means that you can write fully type-safe generic consumers that don't need to be aware of one specific schema.
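The broker-side validation idea can be sketched in a few lines. This is an illustrative stand-in for schema enforcement (real Pulsar schemas are richer: Avro, JSON, Protobuf, versioning), using a plain field-to-type mapping:

```python
def validate(schema, record):
    """Broker-side check: reject publishes that don't match the topic schema."""
    return set(record) == set(schema) and all(
        isinstance(record[field], type_) for field, type_ in schema.items()
    )

order_schema = {"id": int, "item": str}
print(validate(order_schema, {"id": 1, "item": "book"}))    # accepted
print(validate(order_schema, {"id": "oops", "item": "book"}))  # rejected

# Discoverability: a generic consumer can inspect the schema it is handed
# instead of hard-coding field names.
print(sorted(order_schema))  # ['id', 'item']
```

The last line is the discoverability point: a consumer that receives the schema alongside the data can process any topic generically.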
  • #27: Next we looked at what people were trying to do with messaging platforms, and the realization was that there was always some portion of computation involved. Applications very often need to do simple data transformations, enrichment and similar things. Functions were designed to provide the simplicity of the "serverless" model with a very tight integration into the Pulsar platform. One example of how powerful Pulsar Functions are is that we have created a connector framework, Pulsar IO, entirely based on Pulsar Functions. With Pulsar IO, you can choose from a large set of pre-built connectors, both sources and sinks, or build your own custom connectors.
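The function model described here is essentially "consume, transform, publish". A minimal sketch of that runtime loop, with an illustrative enrichment function (names and the `run_function` helper are hypothetical, not the Pulsar Functions SDK):

```python
def run_function(fn, source, sink):
    """Minimal function runtime: consume from source, transform, publish."""
    for msg in source:
        result = fn(msg)
        if result is not None:      # a function may also filter messages out
            sink.append(result)

def enrich(order):
    """Example user function: add a computed field to each event."""
    return {**order, "total": order["qty"] * order["price"]}

source = [{"qty": 2, "price": 5.0}, {"qty": 1, "price": 3.0}]
sink = []
run_function(enrich, source, sink)
print(sink)
```

A source connector in this model is simply a function whose input comes from an external system, and a sink connector one whose output goes to an external system, which is why Pulsar IO can be built entirely on Functions.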
  • #28: After that, the next trend we saw is that more and more users wanted to use the "stream" concept not just as a temporary buffer, as a way to isolate data ingestion from processing. Instead, they increasingly want to keep the stream as a permanent, or at least long-term, "storage of record". Tiered storage was the missing link to enable this. By offloading cold data to cloud storage providers, we can have large-scale data retention at a very effective cost, all while maintaining the stream view of the data and the same APIs.
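The "same APIs over hot and cold data" property can be sketched like this (an illustrative model, not the real offloader: in Pulsar whole segments, not message counts, are offloaded to object storage):

```python
class TieredTopic:
    """Keeps a single stream view while cold data lives in cheaper storage."""
    def __init__(self):
        self.hot, self.cold = [], []    # e.g. bookies vs. cloud object store

    def append(self, event):
        self.hot.append(event)

    def offload(self, keep_last):
        # move everything but the newest `keep_last` events to cold storage
        self.cold.extend(self.hot[:-keep_last])
        self.hot = self.hot[-keep_last:]

    def read_all(self):
        # the reader sees one stream regardless of where data physically lives
        return self.cold + self.hot

t = TieredTopic()
for i in range(5):
    t.append(i)
t.offload(keep_last=2)
print(t.cold, t.hot)   # [0, 1, 2] [3, 4]
print(t.read_all())    # [0, 1, 2, 3, 4]
```

Consumers keep reading one logical stream; only the cost and location of the bytes change.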
  • #29: Another realization was that, because of its nature, messaging is always the integration point between different applications and components. This makes migration from other platforms a bit harder: you often have to coordinate that migration across different teams or organizations. To make it easier, we extended the Pulsar brokers to be able to speak several protocols, in addition to the Pulsar native protocol. With Protocol Handlers, there is a pluggable way to add more ways to interact with the Pulsar service and the same topic data. We started with KoP, Kafka-on-Pulsar, then followed up with AMQP and MQTT. It is a very powerful mechanism for a few reasons: 1. Applications can use existing client libraries with no code or dependency changes 2. You can mix all sorts of different protocols to interact with the same topic 3. It's exposed directly in Pulsar brokers, data is stored only once and there is no "proxy overhead"
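The pluggable-handler idea can be sketched as a registry of decoders in front of one shared topic store. This is a conceptual illustration only; real protocol handlers implement the full wire protocol, not just a decode function:

```python
class Broker:
    """Topics are stored once; protocol handlers are pluggable front-ends."""
    def __init__(self):
        self.topics = {}
        self.handlers = {}

    def register_handler(self, protocol, decode):
        self.handlers[protocol] = decode

    def receive(self, protocol, raw):
        # each handler maps its own wire format onto (topic, payload)
        topic, payload = self.handlers[protocol](raw)
        self.topics.setdefault(topic, []).append(payload)

broker = Broker()
broker.register_handler("pulsar", lambda raw: (raw["topic"], raw["payload"]))
broker.register_handler("kafka", lambda raw: (raw[0], raw[1]))  # KoP-style

# Different client protocols, one shared topic, data stored only once.
broker.receive("pulsar", {"topic": "orders", "payload": "o1"})
broker.receive("kafka", ("orders", "o2"))
print(broker.topics["orders"])  # ['o1', 'o2']
```

Because the handlers live inside the broker, there is no translating proxy in the data path and no second copy of the data.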
  • #30: To really complete the full picture, in Pulsar 2.8 we introduced support for transactions. It's now possible to do very complex interactions and take advantage of the transactional properties, for example publishing messages atomically across multiple topics, or consuming and producing atomically.
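The atomic multi-topic publish mentioned here can be sketched with a staging buffer that only becomes visible on commit. This is a conceptual model (the real implementation uses a transaction coordinator and markers, not client-side buffering):

```python
class Transaction:
    """Stage publishes to several topics; commit makes them visible atomically."""
    def __init__(self, topics):
        self.topics, self.pending = topics, []

    def send(self, topic, msg):
        self.pending.append((topic, msg))   # invisible until commit

    def commit(self):
        for topic, msg in self.pending:
            self.topics[topic].append(msg)
        self.pending = []

    def abort(self):
        self.pending = []                   # nothing ever becomes visible

topics = {"payments": [], "shipments": []}
txn = Transaction(topics)
txn.send("payments", "charge-42")
txn.send("shipments", "ship-42")
txn.commit()
print(topics)  # both messages appear together, or neither would have

txn2 = Transaction(topics)
txn2.send("payments", "charge-43")
txn2.abort()
print(topics["payments"])  # ['charge-42']
```

Consume-then-produce atomicity works the same way: the acknowledgement and the output publish are both staged and committed as one unit.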
  • #31: We can say that Pulsar 2.8 is a big milestone in the journey, completing this vision of a unified messaging and streaming platform. We are very excited and very proud of this release. It is the culmination of months and months of work by a "larger than ever" group of committers and contributors. And while transactions support is the biggest new feature, it is certainly not the only one. We have features like Exclusive Producer support, which I will be talking about tomorrow in a dedicated session; a new API for package management, to improve the way we manage the functions and connectors code artifacts; and, finally, a simplified way to configure memory limits in Pulsar clients.
  • #32: After looking at the past, let's now take a look at some of the items that we want to focus on in the very near future.
  • #33: A problem that we're seeing overall in the data ecosystem is that these platforms can be very difficult to tune and operate when running at a large scale. This is not a problem specific to Pulsar, but it is something that we believe should be addressed. Typically, there are a lot of configuration options and each of them requires in-depth knowledge of the internals of the system. Worse, when integrating multiple systems, like a compute framework, it might be very hard to predict how a change in the configuration will affect the overall stability and performance. Finally, the workloads are increasingly dynamic and constantly changing. It's not possible to have a static configuration that will deliver "optimal" performance in every condition.
  • #34: The first item I want to discuss is partitioning. People are used to seeing partitioning and sharding, but these are really artifacts of how systems are implemented. Partitions are usually not a natural property of the data. Because of that, we want to abstract the partition concept away from the user's sight. Application developers should not be worried about partitions, and operators should not be thinking about how many partitions are needed for a certain use case. Instead, the system should be able to figure it out on its own, internally splitting and merging partitions while maintaining the fundamental ordering guarantees.
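One way to picture "splitting partitions while maintaining ordering" is key-range splitting: a partition owns a range of the key space, and a split divides the range so every key still lives in exactly one partition, keeping its messages in order. This is a speculative sketch of the idea, not a design from the talk:

```python
class AutoPartitionedTopic:
    """Partitions own key ranges; a split halves a range, and each key's
    messages move together, so per-key ordering is preserved."""
    def __init__(self):
        self.partitions = [(0, 256, [])]   # (lo, hi, messages) over key space

    def _hash(self, key):
        return sum(key.encode()) % 256     # toy hash over a 256-slot space

    def route(self, key, msg):
        h = self._hash(key)
        for lo, hi, messages in self.partitions:
            if lo <= h < hi:
                messages.append((key, msg))
                return

    def split(self, index):
        lo, hi, messages = self.partitions.pop(index)
        mid = (lo + hi) // 2
        left = [(k, m) for k, m in messages if self._hash(k) < mid]
        right = [(k, m) for k, m in messages if self._hash(k) >= mid]
        self.partitions[index:index] = [(lo, mid, left), (mid, hi, right)]

topic = AutoPartitionedTopic()
for i in range(4):
    topic.route("user-a", i)
topic.split(0)   # the hot partition is split; "user-a" stays in order
print(topic.partitions)
```

Merging is the inverse operation: adjacent ranges recombine, again without interleaving any single key's messages out of order.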
  • #35: Tuning a storage system can also be a very complex task. In particular, it can be very hard to predict the impact of configuration on the overall performance when we're crossing multiple layers: there is the operating system, the disk device and the disk controller. In a similar way, the idea we have is to make it work with no configuration, so that the storage system is able to automatically adjust its strategies based on the changing conditions of the traffic: all the aspects regarding disk access patterns, what kind of cache eviction strategy to use, and so on.
  • #36: When we introduced Pulsar Functions, we had the idea of making it a frictionless platform for developers to do data processing. Over the past few years, the foundation of the Pulsar Functions runtime has really matured into a solid platform, although the user experience is still not great. While it is very easy for developers to write functions, we should strive to make it much easier to actually deploy and manage functions: for example, having the functions tooling well integrated with CI/CD platforms, supporting versioning, and providing out-of-the-box support for A/B testing. Another aspect is observability and debuggability. The tooling and the platform need to make it super easy for users to discover issues in their own code or to detect performance issues. Finally, we are thinking about a higher-level DSL that can support higher-level constructs to further simplify writing data processing functions.
  • #37: We talked before about Tiered Storage and how it has enabled completely new use cases to be supported by Pulsar. The next step here is to make sure we can integrate with existing data lake technologies, like Delta Lake and Apache Hudi. The vision is to use the Data Lake as the tiered storage backend, so that the same data can be consumed as a stream or with the data lake tooling.
  • #38: As a final note, given the very nature of Pulsar, which sits between different systems and platforms and links all of them together, we want to reaffirm our commitment to work with the larger data community to ensure that Pulsar is supported everywhere, out of the box, as a first-class citizen. We have been partnering with many open source communities, like Trino, Druid, Pinot, Spark and Flink. We will continue to do so, and more, in the future. We believe that this will benefit Pulsar, its users and the overall data ecosystem.