1
UNIFYING MESSAGING, QUEUEING & LIGHT WEIGHT
COMPUTE USING APACHE PULSAR
KARTHIK RAMASAMY
CEO AND CO-FOUNDER
KARTHIK@STREAML.IO
APACHE PULSAR
2
Cloud Native Messaging + Queuing + Compute System
backed by a durable log storage
WHAT IS CLOUD NATIVE?
3
✦ Ability to separate storage and compute layers
✦ Dynamically scale the resources as and when needed
✦ Isolation among users in a shared environment
✦ Resource usage control - quotas and rate limiting
✦ Seamlessly span across multiple data centers
MESSAGING AND QUEUING?
4
✦ Messaging/Streaming
๏ Facilitates decoupling of components
๏ Associated with ordering and stateful processing (e.g., event A followed by event B)
✦ Queuing
๏ Facilitates master-worker queues (e.g., JMS, RabbitMQ, ActiveMQ)
๏ Associated with stateless processing (e.g., virus scanning of uploaded files)
COMPONENTS OF A MESSAGING/STREAMING SYSTEM
5
[Figure labels: Messaging, Compute, Storage; Data Ingestion, Data Processing, Data Storage, Results Storage, Data Serving]
6
Cloud Native + Messaging + Queuing + Storage
APACHE PULSAR - TENANTS/NAMESPACES/TOPICS
7
Apache Pulsar Cluster
✦ Tenant: Product Safety
๏ Namespace: ETL - Topic-1 (Account History), Topic-2 (User Clustering)
๏ Namespace: Fraud Detection - Topic-1 (Risk Classification)
✦ Tenant: Marketing
๏ Namespace: Campaigns - Topic-1 (Budgeted Spend), Topic-2 (Demographic Classification)
๏ Namespace: ETL - Topic-1 (Location Resolution)
✦ Tenant: Data Serving
๏ Namespace: Microservice - Topic-1 (Customer Authentication)
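The tenant / namespace / topic hierarchy on this slide corresponds to Pulsar's topic naming scheme, persistent://tenant/namespace/topic. The sketch below only assembles such a name; the TopicNames class and the lower-cased example names are hypothetical and chosen to mirror the diagram, not taken from the Pulsar client.

```java
import java.util.Objects;

// Hypothetical helper (not part of the Pulsar client): assembles a fully
// qualified topic name of the form persistent://tenant/namespace/topic.
final class TopicNames {
    private TopicNames() {}

    static String persistent(String tenant, String namespace, String topic) {
        return "persistent://" + part(tenant) + "/" + part(namespace) + "/" + part(topic);
    }

    // Reject empty parts and parts containing '/', which would change the hierarchy.
    private static String part(String value) {
        Objects.requireNonNull(value, "name part");
        if (value.isEmpty() || value.contains("/")) {
            throw new IllegalArgumentException("invalid name part: " + value);
        }
        return value;
    }
}
```

For example, TopicNames.persistent("marketing", "campaigns", "budgeted-spend") yields persistent://marketing/campaigns/budgeted-spend.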
APACHE PULSAR - TOPICS
8
[Figure: a topic with producers and consumers, drawn along a time axis]
APACHE PULSAR - TOPIC PARTITIONS
9
[Figure: a topic split into partitions P0, P1, and P2, with producers and consumers, drawn along a time axis]
APACHE PULSAR - SEGMENTS
10
[Figure: partitions P0, P1, and P2, each divided into segments (three, four, and three segments respectively), with producers and consumers, drawn along a time axis]
APACHE PULSAR
11
[Figure: producers and consumers connect to a layer of three brokers, which sits on top of a layer of three bookies]
✓ Layered Architecture
✓ Independent Scalability
✓ Fault Tolerance
✓ Instant Scalability
APACHE PULSAR - SEGMENT CENTRIC STORAGE
12
✓ Logical Partition
✓ Partition divided into Segments
✓ Segment rollover is size-based and time-based
✓ Segments are uniformly distributed across the cluster
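A toy model can make the segment idea concrete. The sketch below is an assumption-laden illustration, not BookKeeper's placement logic: it rolls over to a new segment after a fixed number of entries (time-based rollover is not modeled) and starts each segment on the next bookie so that segments rotate around the cluster. The class name, the threshold, and the rotation rule are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of segment-centric storage; not BookKeeper's actual placement logic.
final class SegmentedPartition {
    private final int maxEntriesPerSegment; // assumed size-based rollover threshold
    private final int bookieCount;          // number of storage nodes in the cluster
    private final int replicas;             // copies kept of each segment
    private final List<List<Integer>> segmentBookies = new ArrayList<>();
    private int entriesInCurrentSegment = 0;

    SegmentedPartition(int maxEntriesPerSegment, int bookieCount, int replicas) {
        if (maxEntriesPerSegment < 1 || replicas < 1 || replicas > bookieCount) {
            throw new IllegalArgumentException("invalid configuration");
        }
        this.maxEntriesPerSegment = maxEntriesPerSegment;
        this.bookieCount = bookieCount;
        this.replicas = replicas;
        openSegment();
    }

    // Append one entry, rolling over to a new segment when the current one is full.
    void append() {
        if (entriesInCurrentSegment == maxEntriesPerSegment) {
            openSegment();
        }
        entriesInCurrentSegment++;
    }

    // Each new segment starts on the next bookie, so segments rotate around the cluster.
    private void openSegment() {
        int first = segmentBookies.size() % bookieCount;
        List<Integer> ensemble = new ArrayList<>();
        for (int i = 0; i < replicas; i++) {
            ensemble.add((first + i) % bookieCount);
        }
        segmentBookies.add(ensemble);
        entriesInCurrentSegment = 0;
    }

    int segmentCount() {
        return segmentBookies.size();
    }

    List<Integer> bookiesFor(int segmentIndex) {
        return segmentBookies.get(segmentIndex);
    }
}
```

With a threshold of 2 entries, 4 bookies, and 3 replicas, five appends produce three segments placed on bookies [0, 1, 2], [1, 2, 3], and [2, 3, 0].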

APACHE PULSAR - BROKER FAILURE RECOVERY
13
✓ Topic is reassigned to an available broker based on load
✓ Can reconstruct the previous state consistently
✓ No data needs to be copied
✓ Failover handled transparently by client library
APACHE PULSAR - BOOKIE FAILURE RECOVERY
14
✓ After a write failure, BookKeeper immediately switches writes to a new bookie, within the same segment
✓ As long as any 3 bookies remain in the cluster, writes can continue
APACHE PULSAR - BOOKIE FAILURE RECOVERY
15
✓ In the background, BookKeeper starts a many-to-many recovery process to restore the configured replication factor
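The write-path behaviour on the two recovery slides can be sketched as a simple ensemble swap. This is a toy under stated assumptions: the WriteEnsemble class, the bookie names, and the "first spare wins" rule are hypothetical, and BookKeeper's real protocol (including the background re-replication) is considerably more involved.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Toy model of replacing a failed bookie in a write ensemble;
// it does not model BookKeeper's actual protocol or background recovery.
final class WriteEnsemble {
    private static final int ENSEMBLE_SIZE = 3; // mirrors the "any 3 bookies" bullet

    private WriteEnsemble() {}

    // Returns the ensemble after `failed` is swapped for a spare healthy bookie,
    // or empty if too few healthy bookies remain to keep writing.
    static Optional<List<String>> replaceFailed(List<String> ensemble,
                                                String failed,
                                                List<String> healthyBookies) {
        List<String> next = new ArrayList<>(ensemble);
        if (!next.remove(failed)) {
            return Optional.of(next); // the failed bookie was not in this ensemble
        }
        for (String candidate : healthyBookies) {
            if (!candidate.equals(failed) && !next.contains(candidate)) {
                next.add(candidate); // first spare bookie takes the vacant slot
                break;
            }
        }
        return next.size() == ENSEMBLE_SIZE ? Optional.of(next) : Optional.empty();
    }
}
```

If the ensemble is [b1, b2, b3], b2 fails, and b4 is healthy, the result is [b1, b3, b4]; with no spare bookie the result is empty, meaning writes cannot continue.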
APACHE PULSAR - SEAMLESS CLUSTER EXPANSION
16
[Figure: a message sequence (1, 2, 3, 4 … 20, 21, 22, 23 … 40, 41, 42, 43 … 60, 61, 62, 63 …) stored as Segments 1-4, each shown in three copies across bookies; Segments X, Y, and Z appear alongside them]
APACHE PULSAR - TIERED STORAGE
17
[Figure: the message sequence (1, 2, 3, 4 … 63 …) with some segment copies held on bookies and others held in low-cost storage]
PARTITIONS VS SEGMENTS - WHY SHOULD YOU CARE?
18
Legacy Architectures
✦ Storage co-resident with processing
✦ Partition-centric
✦ Cumbersome to scale: data redistribution, performance impact
Apache Pulsar
✦ Storage decoupled from processing
✦ Partitions stored as segments
✦ Flexible, easy scalability
[Figure: logical view of a partition as Segment 1, Segment 2, Segment 3 … Segment n. Legacy: each broker holds a whole partition (primary or copy) and handles both processing and storage. Apache Pulsar: brokers handle processing, while the segments are spread across a separate storage layer.]
APACHE PULSAR - DURABILITY
19
[Figure: a producer writes through a broker to three bookies; each bookie writes to its journal with fsync]
UNIFIED MODEL - MESSAGING
20
[Figure: Producers 1 and 2 publish M0-M4 to a Pulsar topic/partition. Subscription A (Exclusive) delivers M0-M4 to Consumer 1; Consumer 2 is marked with an X.]
UNIFIED MODEL - MESSAGING
21
[Figure: Producers 1 and 2 publish M0-M4 to a Pulsar topic/partition. Subscription B (Failover) delivers M0-M4 to Consumer 1; Consumer 2 receives them in case of failure in Consumer 1.]
UNIFIED MODEL - QUEUING
22
[Figure: Producers 1 and 2 publish M0-M4 to a Pulsar topic/partition. Subscription C (Shared) splits M0-M4 among Consumers 1, 2, and 3.]
Traffic is equally distributed across consumers
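The three subscription modes on slides 20-22 differ only in which consumer receives each message. The toy dispatcher below illustrates that difference under simplifying assumptions: it models delivery only, uses plain round-robin for the shared mode, and is not how a Pulsar broker is implemented. The SubscriptionSketch class and its method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy dispatcher illustrating the three subscription modes on the slides.
// It models delivery only; it is not how a Pulsar broker is implemented.
final class SubscriptionSketch {
    enum Mode { EXCLUSIVE, FAILOVER, SHARED }

    private final Mode mode;
    private final List<String> consumers = new ArrayList<>();
    private final Map<String, List<String>> delivered = new LinkedHashMap<>();
    private int next = 0; // round-robin cursor used in SHARED mode

    SubscriptionSketch(Mode mode) {
        this.mode = mode;
    }

    // Returns false when the consumer is rejected (a second consumer in EXCLUSIVE mode).
    boolean attach(String consumer) {
        if (mode == Mode.EXCLUSIVE && !consumers.isEmpty()) {
            return false;
        }
        consumers.add(consumer);
        delivered.put(consumer, new ArrayList<>());
        return true;
    }

    // Simulates a consumer failure; in FAILOVER mode the next consumer takes over.
    void fail(String consumer) {
        consumers.remove(consumer);
    }

    void publish(String message) {
        if (consumers.isEmpty()) {
            return; // a real subscription would retain the message; omitted here
        }
        String target = (mode == Mode.SHARED)
                ? consumers.get(next++ % consumers.size())
                : consumers.get(0); // EXCLUSIVE and FAILOVER: one active consumer
        delivered.get(target).add(message);
    }

    List<String> deliveredTo(String consumer) {
        return delivered.getOrDefault(consumer, new ArrayList<>());
    }
}
```

In SHARED mode with three consumers, publishing M0 through M4 gives the first consumer M0 and M3; in FAILOVER mode, messages published after the first consumer fails go to the second.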
DISASTER REPLICATION & RECOVERY
23
[Figure: Topic (T1) is present in Data Centers A, B, and C; the diagram also shows Producers P1, P2, and P3, Consumers C1 and C2, and Subscription S1]
✦ Integrated in the broker message flow
✦ Simple configuration to add/remove regions
✦ Asynchronous (default) and synchronous replication
ASYNCHRONOUS REPLICATION - REPLICATED SUBSCRIPTIONS
24
• Two independent clusters, primary and standby
• Configured tenants and namespaces replicate to standby
• Data published to primary is asynchronously replicated to standby
[Figure: Datacenter 1 hosts the primary Pulsar cluster with active producers and consumers; Datacenter 2 hosts the standby Pulsar cluster with standby producers and consumers; each cluster has its own ZooKeeper, and the two are linked by Pulsar replication]
Replicated subscriptions allow producers and consumers to restart close to where they left off in the second datacenter upon primary failure
SYNCHRONOUS REPLICATION
25
• Each topic is owned by one broker at a time, i.e., in one datacenter
• ZooKeeper cluster spread across multiple locations
• Broker commits writes to bookies in both datacenters
• In the event of a datacenter failure, a broker in the surviving datacenter assumes ownership of the topic
[Figure: a single Pulsar cluster and its ZooKeeper span Datacenter 1 and Datacenter 2, with producers and consumers in each datacenter]
EDGE TO CORE
26
• Remote clusters replicate data to clusters in the primary and standby datacenters concurrently
• Primary and standby datacenters asynchronously replicate to each other
• Consumers are restarted in the second datacenter upon primary datacenter failure
[Figure: edge clusters, including one in a remote office, feed the primary Pulsar cluster in Datacenter 1 (active consumers) and the standby Pulsar cluster in Datacenter 2 (standby consumers)]
APACHE PULSAR - MULTITENANCY
27
Apache Pulsar Cluster
[Figure: the cluster diagram from the Tenants/Namespaces/Topics slide (tenants Product Safety, Marketing, and Data Serving with their namespaces and topics), annotated with 10 TB, 7 TB, and 5 TB]
✦ Authentication
✦ Authorization
✦ Software isolation
๏ Storage quotas, flow control, back pressure, rate limiting
✦ Hardware isolation
๏ Constrain some tenants on a subset of brokers/bookies
28
Compute
HOW TO PROCESS DATA MODELED AS STREAMS
29
✦ Consume data as it is produced (pub/sub)
✦ Heavy weight compute - continuous data processing (DAG Processing)
✦ Light weight compute - transform and react to data as it arrives
✦ Interactive query of stored streams
LESSONS LEARNT - USE CASES
30
✦ Data transformations
✦ Data classification
✦ Data enrichment
✦ Data routing
✦ Data extraction and loading
✦ Real time aggregation
✦ Microservices
A significant set of processing tasks is exceedingly simple
LIGHT WEIGHT COMPUTE
31
[Figure: incoming messages pass through f(x) to become output messages]
ABSTRACT VIEW OF COMPUTE REPRESENTATION
STREAM NATIVE COMPUTE USING FUNCTIONS
32
✦ Simplest possible API - a function or a procedure
✦ Support for multiple languages
✦ Use of native API for each language
✦ Scale developers
✦ Use of message bus native concepts - input and output as topics
✦ Flexible runtime - simple standalone applications vs managed system applications
APPLYING INSIGHT GAINED FROM SERVERLESS
PULSAR FUNCTIONS
33
SDK-LESS API
import java.util.function.Function;

public class ExclamationFunction implements Function<String, String> {
    @Override
    public String apply(String input) {
        return input + "!";
    }
}
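Because the function implements only java.util.function.Function, it can be exercised in plain Java with no Pulsar dependency. The sketch below repeats the slide's function so that it compiles on its own; LocalRunner is a hypothetical stand-in for the input and output topics, not a Pulsar API.

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Same function as on the slide, repeated so the sketch compiles on its own.
class ExclamationFunction implements Function<String, String> {
    @Override
    public String apply(String input) {
        return input + "!";
    }
}

// Hypothetical local harness: applies a function to each message in order,
// standing in for the input and output topics.
final class LocalRunner {
    private LocalRunner() {}

    static List<String> run(Function<String, String> fn, List<String> inputs) {
        return inputs.stream().map(fn).collect(Collectors.toList());
    }
}
```

Running LocalRunner.run(new ExclamationFunction(), ...) over the inputs "hello" and "pulsar" returns "hello!" and "pulsar!", which makes such functions easy to unit test before deployment.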
PROCESSING GUARANTEES
34
✦ ATMOST_ONCE
๏ Message acked to Pulsar as soon as we receive it
✦ ATLEAST_ONCE
๏ Message acked to Pulsar after the function completes
๏ Default behavior - don't want people to lose data
✦ EFFECTIVELY_ONCE
๏ Uses Pulsar's built-in effectively-once semantics
✦ Controlled at runtime by user
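The difference between the first two guarantees comes down to when the acknowledgement is sent relative to running the function. The toy model below records that ordering in an event log; it is a sketch under assumptions, with a hypothetical AckTimingSketch class, and it leaves out EFFECTIVELY_ONCE because that relies on broker-side support that a local model cannot show.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Toy model of when a message is acknowledged under two of the guarantees.
// EFFECTIVELY_ONCE is left out because it relies on broker-side support.
final class AckTimingSketch {
    enum Guarantee { ATMOST_ONCE, ATLEAST_ONCE }

    private AckTimingSketch() {}

    // Returns an event log for processing one message with the given function.
    static List<String> process(Guarantee guarantee, String message,
                                Function<String, String> fn) {
        List<String> log = new ArrayList<>();
        if (guarantee == Guarantee.ATMOST_ONCE) {
            log.add("ack"); // acked on receipt, before the function runs
        }
        try {
            log.add("output:" + fn.apply(message));
            if (guarantee == Guarantee.ATLEAST_ONCE) {
                log.add("ack"); // acked only after the function completes
            }
        } catch (RuntimeException e) {
            // ATMOST_ONCE: already acked, so the message is lost.
            // ATLEAST_ONCE: never acked, so it remains eligible for redelivery.
            log.add("failed");
        }
        return log;
    }
}
```

With a function that throws, the ATMOST_ONCE log contains an ack followed by a failure (the message is gone), while the ATLEAST_ONCE log contains only the failure (the message can be redelivered).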
DEPLOYING FUNCTIONS - BROKER
35
[Figure: three nodes, each running a broker with an embedded worker. Node 1: functions wordcount-1 and transform-2. Node 2: functions transform-1 and dataroute-1. Node 3: functions wordcount-2 and transform-3.]
DEPLOYING FUNCTIONS - WORKER NODES
36
[Figure: workers run on their own nodes, separate from the brokers. Node 1: functions wordcount-1 and transform-2. Node 2: functions transform-1 and dataroute-1. Node 3: functions wordcount-2 and transform-3. Brokers 1-3 run on Nodes 4-6.]
DEPLOYING FUNCTIONS - KUBERNETES
37
[Figure: Pods 1-6 each run one function instance (wordcount-1, transform-1, transform-3, dataroute-1, wordcount-2, transform-2); Brokers 1-3 run in Pods 7-9]
INTERACTIVE QUERYING OF STREAMS - PULSAR SQL
38
[Figure: a coordinator and four segment readers reading Segments 1-4 of the message sequence (1, 2, 3, 4 … 63 …), with each segment shown in three copies]
Growing ecosystem of Apache Pulsar
39
Apache Pulsar as SaaS - Preview
40
https://sandbox.cloud.streamlio.com
Apache Pulsar SaaS - Demo - Sentiment Analysis
41
[Figure: a Twitter Firehose source feeds a Tweet topic; a sentiment analysis Pulsar Function writes to Positive, Neutral, and Negative Tweets topics and keeps the number of positive, neutral, and negative tweets in function state; PulsarSQL is also shown]
APACHE PULSAR COMMUNITY
42
✓ Twitter: @apache_pulsar
✓ WeChat Subscription: ApachePulsar
✓ Mailing Lists: dev@pulsar.apache.org, users@pulsar.apache.org
✓ Slack: https://apache-pulsar.slack.com
✓ Localization: https://crowdin.com/project/apache-pulsar
✓ GitHub: https://github.com/apache/pulsar, https://github.com/apache/bookkeeper
MORE READINGS
43
✓ Understanding How Pulsar Works: https://jack-vanlightly.com/blog/2018/10/2/understanding-how-apache-pulsar-works
✓ How To (Not) Lose Messages on Apache Pulsar Cluster: https://jack-vanlightly.com/blog/2018/10/21/how-to-not-lose-messages-on-an-apache-pulsar-cluster
MORE READINGS
44
✓ Unified queuing and streaming: https://streaml.io/blog/pulsar-streaming-queuing
✓ Segment centric storage: https://streaml.io/blog/pulsar-segment-based-architecture
✓ Messaging, Storage or Both: https://streaml.io/blog/messaging-storage-or-both
✓ Access patterns and tiered storage: https://streaml.io/blog/access-patterns-and-tiered-storage-in-apache-pulsar
✓ Tiered Storage in Apache Pulsar: https://streaml.io/blog/tiered-storage-in-apache-pulsar
QUESTIONS
45
46
@karthikz
