APACHE PULSAR
Flexible Pub-Sub system for Internet scale
http://pulsar.apache.org
Content on this presentation is licensed under a
Creative Commons Attribution 4.0 International license
Pulsar graduates as a top-level project (TLP) today!
WHO AM I?
• Matteo Merli
• Apache Pulsar PMC Chair
• Member of Apache BookKeeper PMC
• Co-Founder of Streamlio
• Worked on Pulsar since its beginning at Yahoo
WHAT IS APACHE PULSAR?
“Pub-Sub messaging backed by durable log storage”
• Multi-tenancy: a single cluster can support many tenants and use cases
• Ordering: guaranteed ordering
• Durability: data replicated and synced to disk
• Delivery guarantees: at least once, at most once and effectively once
• Highly scalable: can support millions of topics
• Unified messaging model: supports both topic and queue semantics in a single model
• Geo-replication: out-of-the-box support for geographically distributed applications
• High throughput: can reach 1.8 M messages/s in a single partition
• Low latency: publish latency of 5 ms at the 99th percentile
WHY BUILD A NEW SYSTEM?
• No existing solution to satisfy requirements
• Multi tenant — 1M topics — Low latency — Durability — Geo replication
• Other systems don’t scale well with many topics:
• Storage model based on individual directory per topic partition
• Durability kills the performance
• Ability to manage large backlogs — Read old data without impacting writers
• Many other choking points: getting stats, access to metadata, flow-control
• Operations are not very convenient — replacing servers, expanding clusters, etc…
STATE OF THE PROJECT
• Project started at Yahoo around 2012 and went through various iterations
• Open-sourced in September 2016
• Entered Apache Incubator in June 2017
• Graduated as a TLP in September 2018
• 2249 commits, 22 Yahoo releases, 9 Apache releases
• 59 Contributors
ARCHITECTURAL VIEW
Separate layers for brokers and bookies
• Brokers and bookies can be added independently
• Traffic can be shifted very quickly across brokers
• New bookies ramp up on traffic quickly
APACHE BOOKKEEPER
Replicated log storage
• Low-latency durable writes
• Simple repeatable read consistency
• Highly available
• Store many logs per node
• I/O Isolation
SEGMENT-CENTRIC STORAGE
• In addition to partitioning, messages are stored in segments (based on time and size)
• Segments are independent of each other and spread across all storage nodes
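
A minimal Python sketch of the segment idea, not Pulsar's or BookKeeper's actual code: a log rolls over to a new segment once a size limit is reached, and each new segment is placed on a different subset of storage nodes. The class names, the round-robin placement and the size-only rollover rule are assumptions made for illustration (the slide also mentions time-based rollover).

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    nodes: tuple                      # storage nodes holding this segment's replicas
    entries: list = field(default_factory=list)
    size: int = 0


class SegmentedLog:
    """Toy log that rolls segments by size and spreads them over nodes."""

    def __init__(self, nodes, replicas=2, max_segment_bytes=1024):
        self.nodes = list(nodes)
        self.replicas = replicas
        self.max_segment_bytes = max_segment_bytes
        self.segments = []
        self._next_node = 0
        self._roll()

    def _roll(self):
        # Hypothetical placement policy: take the next `replicas` nodes round-robin.
        chosen = tuple(
            self.nodes[(self._next_node + i) % len(self.nodes)]
            for i in range(self.replicas)
        )
        self._next_node = (self._next_node + 1) % len(self.nodes)
        self.segments.append(Segment(nodes=chosen))

    def append(self, payload: bytes):
        current = self.segments[-1]
        # Start a new segment when the current one would exceed the size limit.
        if current.entries and current.size + len(payload) > self.max_segment_bytes:
            self._roll()
            current = self.segments[-1]
        current.entries.append(payload)
        current.size += len(payload)


log = SegmentedLog(["bookie-1", "bookie-2", "bookie-3"], max_segment_bytes=1024)
for _ in range(10):
    log.append(b"x" * 300)            # three 300-byte entries fit per segment
```

Because every segment picks its own set of nodes, a newly added node can start receiving segments right away without any existing data being moved.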
SEGMENTS VS PARTITIONS
DATA PATH
1. Publisher sends message to broker
2. Broker writes in parallel to N replicas
3. Wait for a quorum of acks from bookies
4. Send ack to producer and dispatch to consumer
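
The write path above can be sketched in Python as follows. This is a simulation under stated assumptions, not broker code: `write_to_bookie` is a hypothetical stand-in for a network write, the bookie names are made up, and failure handling is left out.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def write_to_bookie(bookie: str, entry: bytes) -> str:
    """Stand-in for a network write: sleeps briefly, then acks with the bookie name."""
    time.sleep(random.uniform(0.001, 0.01))
    return bookie


def quorum_write(bookies, entry: bytes, ack_quorum: int):
    """Send `entry` to all bookies in parallel and return once `ack_quorum` have acked."""
    pool = ThreadPoolExecutor(max_workers=len(bookies))
    futures = [pool.submit(write_to_bookie, b, entry) for b in bookies]
    acked = []
    for fut in as_completed(futures):
        acked.append(fut.result())
        if len(acked) >= ack_quorum:
            break                      # quorum reached: the producer can be acked now
    pool.shutdown(wait=False)          # remaining writes finish in the background
    return acked


acks = quorum_write(["bookie-1", "bookie-2", "bookie-3"], b"hello", ack_quorum=2)
```

Waiting for a quorum rather than for every replica means the slowest bookie does not set the publish latency.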
BOOKKEEPER INTERNAL
Storage optimized for sequential & immutable data
• I/O isolation between write and read operations
• Slow consumers won’t impact latency
• Very effective I/O patterns:
  • Journal: append only, no reads
  • Storage device: bulk writes and sequential reads
• Number of files is independent of the number of topics
MESSAGING MODEL
PULSAR CLIENT LIBRARY
• Java — C++ — Python — Go — WebSocket APIs
• Partitioned topics
• Apache Kafka compatibility wrapper API
• Transparent batching and compression
• TLS encryption and authentication
• End-to-end encryption
PYTHON CLIENT
import pulsar

client = pulsar.Client('pulsar://localhost:6650')
producer = client.create_producer('my-topic')

# Publish ten messages; payloads must be bytes
for i in range(10):
    producer.send(('Hello-%d' % i).encode('utf-8'))

client.close()
• pip install pulsar-client
GO CLIENT
• go get -u github.com/apache/pulsar/pulsar-client-go/pulsar
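client, err := pulsar.NewClient(pulsar.ClientOptions{
    URL: "pulsar://localhost:6650",
})
if err != nil {
    log.Fatal(err)
}

producer, err := client.CreateProducer(pulsar.ProducerOptions{
    Topic: "my-topic",
})
if err != nil {
    log.Fatal(err)
}

// Publish ten messages, stopping at the first failed send
for i := 0; i < 10; i++ {
    msg := pulsar.ProducerMessage{Payload: []byte(fmt.Sprintf("hello-%d", i))}
    if err := producer.Send(context.Background(), msg); err != nil {
        log.Fatal(err)
    }
}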
• Based on C++ client library — Pure Go client is being worked on
MULTI-TENANCY
• Authentication / Authorization / Namespaces / Admin APIs
• I/O isolation between writes and reads
  • Provided by BookKeeper: ensures readers draining backlog won’t affect publishers
• Soft isolation
  • Storage quotas, flow control, back-pressure, rate limiting
• Hardware isolation
  • Constrain some tenants to a subset of brokers or bookies
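
Tenants and namespaces show up directly in topic names, which in Pulsar 2.x take the form `persistent://tenant/namespace/topic`. The Python helper below is a hypothetical illustration of that naming scheme and is not part of any Pulsar client library.

```python
from typing import NamedTuple


class TopicName(NamedTuple):
    domain: str
    tenant: str
    namespace: str
    topic: str


def parse_topic(name: str) -> TopicName:
    """Split 'persistent://tenant/namespace/topic' into its parts."""
    domain, sep, rest = name.partition("://")
    parts = rest.split("/", 2)
    if not sep or len(parts) != 3 or not all(parts):
        raise ValueError(f"not a fully qualified topic name: {name!r}")
    return TopicName(domain, *parts)


t = parse_topic("persistent://my-tenant/my-namespace/my-topic")
```

Policies such as quotas and rate limits can then be attached at the tenant or namespace level instead of per topic.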
GEO REPLICATION
[Diagram: topic T1 replicated across Data Centers A, B and C, with producers P1, P2, P3 and consumers C1, C2 on subscription S1]
• Scalable asynchronous replication
• Integrated in the broker message flow
• Simple configuration to add/remove regions
SCHEMA REGISTRY
• Store information on the data structure — Stored in BookKeeper
• Enforce data types on topic
• Allow for compatible schema evolutions
TYPE-SAFE CLIENT API
Producer<MyClass> producer = client
        .newProducer(Schema.JSON(MyClass.class))
        .topic("my-topic")
        .create();
producer.send(new MyClass(1, 2));

Consumer<MyClass> consumer = client
        .newConsumer(Schema.JSON(MyClass.class))
        .topic("my-topic")
        .subscriptionName("my-subscription")
        .subscribe();
Message<MyClass> msg = consumer.receive();
• Integrated schema in API
• End-to-end type safety — Enforced in Pulsar broker
PULSAR FUNCTIONS
Managed lightweight compute framework
PULSAR FUNCTIONS / 1
• Simple compute against a consumed message
• Managed or manual deployment
• A function gets messages from 1 or more topics
• An instance of the function is invoked to process the event
• The output of the function is published on 1 or more topics
PULSAR FUNCTIONS / 2
• Super simple to use — no SDK
• Python example:
def process(input):
    return input + '!'
• Supports Java & Python — Go will come next
PULSAR FUNCTIONS / 3
• Good use cases for functions:
• ETL
• Data enrichment
• Data filtering
• Routing
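
Two of these use cases can be sketched in the same no-SDK style as the Python example earlier in the deck. The function names and the event fields (`price`, `quantity`) are hypothetical, and the filter assumes that returning `None` results in no output message being published.

```python
import json


def enrich(input):
    """Data enrichment: parse a JSON event and add a derived field."""
    event = json.loads(input)
    event["total"] = event["price"] * event["quantity"]
    return json.dumps(event, sort_keys=True)


def keep_errors(input):
    """Data filtering: pass through only lines that contain 'ERROR'."""
    return input if "ERROR" in input else None
```

Since these are plain Python functions, they can be unit-tested locally before being deployed.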
PULSAR FUNCTIONS / 4
• Deployment modes:
• Local run — Manually run a function, useful for dev mode
• Managed — Worker service is running instances of functions
PULSAR IO
Connector framework based on Pulsar Functions
• Source — Ingest data into a Pulsar topic
• Sink: read data from a topic and write it to an external system
• Pulsar provides a set of built-in connectors
• Users can submit customized connectors
TIERED STORAGE
Unlimited topic storage capacity
Achieves true “stream-storage”: keep the raw data forever in stream form
• Leverage cloud storage services to offload cold data, completely transparent to clients
• Extremely cost-effective. Backends: S3 (GCS and HDFS coming)
• Example: retain all data for 1 month, offload all messages older than 1 day to S3
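
The example policy can be written out as a small decision function. This is a sketch of the policy only, not of Pulsar's offloader: the function name, the placement labels and the reading of "1 month" as 30 days are assumptions.

```python
from datetime import datetime, timedelta

RETENTION = timedelta(days=30)       # "retain all data for 1 month" (assumed 30 days)
OFFLOAD_AFTER = timedelta(days=1)    # "offload all messages older than 1 day"


def placement(segment_closed_at: datetime, now: datetime) -> str:
    """Return where a closed segment should live under the example policy."""
    age = now - segment_closed_at
    if age > RETENTION:
        return "delete"
    if age > OFFLOAD_AFTER:
        return "s3"
    return "bookkeeper"


now = datetime(2018, 9, 25, 12, 0)
ages_in_days = [0.5, 3, 29, 45]
plan = [placement(now - timedelta(days=d), now) for d in ages_in_days]
```

Only the most recent day of data stays on the bookies, so local storage depends on the publish rate rather than on the retention period.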
PULSAR SQL
• Coming very soon in Pulsar 2.2
• Interactive SQL queries over data stored in Pulsar
• Query old and real-time data
PULSAR SQL / 2
• Based on Presto by Facebook: https://prestodb.io/
• Presto is a distributed query execution engine
• Fetches the data from multiple sources (HDFS, S3, MySQL, …)
• Full SQL compatibility
PULSAR SQL / 3
• Pulsar connector for Presto
• Reads data directly from BookKeeper, bypassing the Pulsar broker
• Many-to-many data reads
  • Data is split even within a single partition: multiple workers can read in parallel from a single Pulsar partition
• Time-based indexing: use “publishTime” in predicates to reduce the data read from disk
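
To see why a “publishTime” predicate reduces the data read, here is a Python sketch of time-range pruning. It does not show how the Presto connector is implemented; the `SegmentIndex` type and its fields are hypothetical.

```python
from typing import NamedTuple


class SegmentIndex(NamedTuple):
    segment_id: int
    min_publish_time: int   # epoch millis of the first entry in the segment
    max_publish_time: int   # epoch millis of the last entry in the segment


def segments_to_scan(index, start_ms: int, end_ms: int):
    """Keep only segments whose time range overlaps [start_ms, end_ms]."""
    return [
        s.segment_id
        for s in index
        if s.max_publish_time >= start_ms and s.min_publish_time <= end_ms
    ]


index = [
    SegmentIndex(0, 1_000, 1_999),
    SegmentIndex(1, 2_000, 2_999),
    SegmentIndex(2, 3_000, 3_999),
]
selected = segments_to_scan(index, start_ms=2_500, end_ms=3_200)
```

Segments outside the requested time window are skipped without being read, and each remaining segment could be handed to a different worker.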
OPENMESSAGING BENCHMARK
openmessaging.cloud
openmessaging.cloud/docs/benchmarks
BENCHMARK FRAMEWORK
• Designed to measure performance of distributed messaging systems
• Supports various “drivers” (Kafka, Pulsar, RocketMQ, RabbitMQ)
• Automated deployment in EC2
• Configure workloads through a YAML file
DISTRIBUTED EXECUTION
The coordinator takes the workload definition and propagates it to multiple workers, then collects and reports stats
[Chart: max throughput. 1 topic, 1 partition, 1 KB payload]
[Chart: latency at fixed throughput. 50K msg/s, 1 topic, 1 partition, 1 KB payload]
[Chart: latency at fixed throughput, including Kafka-sync. 50K msg/s, 1 topic, 1 partition, 1 KB payload]
[Chart: latency at fixed throughput, 99th percentile. 50K msg/s, 1 topic, 1 partition, 1 KB payload]
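
For readers unfamiliar with the "99pct" figures, here is a minimal Python sketch of a nearest-rank percentile over latency samples. The benchmark framework may compute percentiles differently, and the sample values below are made up.

```python
import math


def percentile(samples, pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


latencies_ms = [float(i) for i in range(1, 101)]   # hypothetical samples: 1.0 to 100.0 ms
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
```

A 99th-percentile latency of 5 ms means that 99 out of 100 publishes completed in 5 ms or less.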
Q & A

