FLiP Into Pulsar Apps
Tim Spann | Developer Advocate
● Introduction
● What is Apache Pulsar?
● Pulsar Functions
● Apache NiFi
● Apache Flink
● Apache Spark
● Demo
● Q&A
In this session, Timothy will introduce you to the world of Apache Pulsar and how to build real-time messaging and streaming applications with a variety of OSS libraries, schemas, languages, frameworks, and tools.
Tim Spann
Developer Advocate at StreamNative
● FLiP(N) Stack = Flink, Pulsar and NiFi Stack
● Streaming Systems & Data Architecture Expert
● Experience:
○ 15+ years of experience with streaming technologies including Pulsar, Flink, Spark, NiFi, Big Data, Cloud, MXNet, IoT, Python and more.
○ Today, he helps to grow the Pulsar community by sharing rich technical knowledge and experience both at global conferences and through individual conversations.
This week in Apache Flink, Apache
Pulsar, Apache NiFi, Apache Spark
and open source friends.
https://bit.ly/32dAJft
FLiP Stack Weekly
● Apache Flink
● Apache Pulsar
● Apache NiFi
● Apache Spark
● Pulsar Functions
● Python, Java, Golang
FLiP(N) Stack
streamnative.io
Transit, Humidity, Air Quality, Energy, …
Apache Pulsar is built to support legacy applications, handle the
needs of modern apps, and support NextGen applications.
Support legacy workloads.
Compatible with popular
messaging and streaming tools.
Legacy
Built for today's real-time
event driven applications.
Modern
Scalable, adaptive architecture
ready for the future of real-time
streaming.
NextGen
bigdata 2022_ FLiP Into Pulsar Apps
Apache Pulsar has a vibrant community
560+
Contributors
10,000+
Commits
7,000+
Slack Members
1,000+
Organizations
Using Pulsar
It is often assumed that Pulsar and Kafka have equal capabilities. In reality,
Pulsar offers a superset of Kafka's capabilities.
● Pulsar is streaming and queuing together
● Pulsar is cloud-native with stateless brokers
● Natively includes geo-replication, multi-tenancy, and end-to-end
security out of the box
● Pulsar provides automated rebalancing
● Pulsar offers 100x lower latency with 2.5x greater throughput than Kafka
Advantages of Apache Pulsar
Apache Pulsar features
Cloud native with decoupled
storage and compute layers.
Built-in compatibility with your
existing code and messaging
infrastructure.
Geographic redundancy and high
availability included.
Centralized cluster management
and oversight.
Elastic horizontal and vertical
scalability.
Seamless and instant partition
rebalancing with no downtime.
Flexible subscription model
supports a wide array of use cases.
Compatible with the tools you use
to store, analyze, and process data.
● “Bookies”
● Stores messages and cursors
● Messages are grouped in
segments/ledgers
● A group of bookies forms an
“ensemble” to store a ledger
● “Brokers”
● Handles message routing and
connections
● Stateless, but with caches
● Automatic load-balancing
● Topics are composed of
multiple segments
● Metadata storage
● Stores metadata for both
Pulsar and BookKeeper
● Service discovery
[Diagram: Pulsar Cluster, showing message storage, metadata storage, and metadata & service discovery]
Component | Description
Value / Data payload | The data carried by the message. All Pulsar messages contain raw bytes, although message data can also conform to data schemas.
Key | Messages are optionally tagged with keys, which are used in partitioning and are also useful for things like topic compaction.
Properties | An optional key/value map of user-defined properties.
Producer name | The name of the producer that produces the message. If you do not specify a producer name, the default name is used.
Sequence ID | Each Pulsar message belongs to an ordered sequence on its topic. The sequence ID of the message is its order in that sequence.
Messages - the Basic Unit of Apache Pulsar
Different subscription modes
have different semantics:
Exclusive/Failover -
guaranteed order, single active
consumer
Shared - multiple active
consumers, no order
Key_Shared - multiple active
consumers, order for given key
Producer 1, Producer 2 -> Pulsar Topic
Subscription D (Key-Shared): Consumer D-1, Consumer D-2
<K1,V10> <K1,V11> <K1,V12> <K2,V20> <K2,V21> <K2,V22>
Subscription C (Shared): Consumer C-1, Consumer C-2
<K1,V10> <K2,V21> <K1,V12> <K2,V20> <K1,V11> <K2,V22>
Subscription A (Exclusive): Consumer A
Subscription B (Failover): Consumer B-1, Consumer B-2 (in case of failure in Consumer B-1)
Apache Pulsar Subscription Modes
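The difference between Shared and Key_Shared dispatch can be sketched with a small in-memory simulation. This is only an illustration under stated assumptions: the class and method names are hypothetical, and the round-robin and hash-modulo rules are simplifications chosen for the sketch, not the broker's actual dispatch logic.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of Shared vs. Key_Shared dispatch.
// It does not use the Pulsar client and does not mirror the broker's internals.
class SubscriptionDispatchSketch {

    // A keyed message, printed in the same form as the slide: <K1,V10>.
    static final class Msg {
        final String key;
        final int value;
        Msg(String key, int value) { this.key = key; this.value = value; }
        @Override public String toString() { return "<" + key + ",V" + value + ">"; }
    }

    // The six messages from the slide, in publish order.
    static List<Msg> sampleTopic() {
        return List.of(new Msg("K1", 10), new Msg("K1", 11), new Msg("K1", 12),
                       new Msg("K2", 20), new Msg("K2", 21), new Msg("K2", 22));
    }

    // Shared: messages are spread across consumers (round-robin here),
    // so messages with the same key can land on different consumers.
    static Map<String, List<Msg>> shared(List<Msg> messages, List<String> consumers) {
        Map<String, List<Msg>> out = emptyAssignments(consumers);
        for (int i = 0; i < messages.size(); i++) {
            out.get(consumers.get(i % consumers.size())).add(messages.get(i));
        }
        return out;
    }

    // Key_Shared: all messages with the same key go to the same consumer,
    // which preserves order for a given key.
    static Map<String, List<Msg>> keyShared(List<Msg> messages, List<String> consumers) {
        Map<String, List<Msg>> out = emptyAssignments(consumers);
        for (Msg m : messages) {
            int slot = Math.floorMod(m.key.hashCode(), consumers.size());
            out.get(consumers.get(slot)).add(m);
        }
        return out;
    }

    private static Map<String, List<Msg>> emptyAssignments(List<String> consumers) {
        Map<String, List<Msg>> out = new LinkedHashMap<>();
        for (String c : consumers) {
            out.put(c, new ArrayList<>());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println("Shared:     " + shared(sampleTopic(), List.of("C-1", "C-2")));
        System.out.println("Key_Shared: " + keyShared(sampleTopic(), List.of("D-1", "D-2")));
    }
}
```

In the Shared result the K1 messages are split across both consumers, while in the Key_Shared result each key stays on a single consumer in publish order.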
[Diagram: a Pulsar topic/partition with Shared, Failover, Exclusive, and Key-Shared subscriptions and their consumers, labeled Streaming and Messaging. The Failover subscription is annotated "In case of failure in Consumer B-0".]
Unified Messaging Model
Simplify your data infrastructure and
enable new use cases with queuing and
streaming capabilities in one platform.
Multi-tenancy
Enable multiple user groups to share the
same cluster, either via access control, or
in entirely different namespaces.
Scalability
Decoupled data computing and storage
enable horizontal scaling to handle data
scale and management complexity.
Geo-replication
Support for multi-datacenter replication
with both asynchronous and
synchronous replication for built-in
disaster recovery.
Tiered storage
Enable historical data to be offloaded to
cloud-native storage and store event
streams for indefinite periods of time.
Apache Pulsar Benefits
Messaging Use Cases | Streaming Use Cases
Service x commands service y to make some change. Example: order service removing item from inventory service | Moving large amounts of data to another service (real-time ETL). Example: logs to Elasticsearch
Distributing messages that represent work among n workers. Example: order processing not in main “thread” | Periodic jobs moving large amounts of data and aggregating to more traditional stores. Example: logs to S3
Sending “scheduled” messages. Example: notification service for marketing emails or push notifications | Computing a near real-time aggregate of a message stream, split among n workers, with order being important. Example: real-time analytics over page views
Messaging vs Streaming
 | Messaging Use Case | Streaming Use Case
Retention | The amount of data retained is relatively small - typically only a day or two of data at most. | Large amounts of data are retained, with higher ingest volumes and longer retention periods.
Throughput | Messaging systems are not designed to manage big “catch-up” reads. | Streaming systems are designed to scale and can handle use cases such as catch-up reads.
Differences in Consumption
byte[] msgIdBytes = // Some byte array
MessageId id = MessageId.fromByteArray(msgIdBytes);
Reader<byte[]> reader = pulsarClient.newReader()
    .topic(topic)
    .startMessageId(id)
    .create();
Create a reader that will read from
some message between earliest and
latest.
Reader
Apache Pulsar Reader Interface
● New Consumer type added in Pulsar 2.10 that provides a
continuously updated key-value map view of compacted topic data.
● An abstraction of a changelog stream from a primary-keyed table,
where each record in the changelog stream is an update on the
primary-keyed table with the record key as the primary key.
● READ ONLY DATA STRUCTURE!
Apache Pulsar TableView
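The idea of a key-value view over a changelog stream can be sketched in plain Java. This is a minimal illustration, not the Pulsar client API: the class TableViewSketch and its methods are hypothetical, and the sensor keys and values are made up for the example.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory model of the TableView idea: replay a changelog of
// (key, value) records and keep only the latest value for each key.
class TableViewSketch {
    private final Map<String, String> latest = new HashMap<>();

    // Apply one changelog record; the record key plays the role of the primary key.
    void apply(String key, String value) {
        latest.put(key, value);
    }

    // Callers only get a read-only view of the current table state.
    Map<String, String> view() {
        return Collections.unmodifiableMap(latest);
    }

    public static void main(String[] args) {
        TableViewSketch table = new TableViewSketch();
        table.apply("sensor-1", "20.5");
        table.apply("sensor-2", "19.0");
        table.apply("sensor-1", "21.0"); // a later update replaces the earlier value
        System.out.println(table.view());
    }
}
```

The view is wrapped as unmodifiable to reflect the read-only nature of the structure: updates arrive only through the changelog, never through the view itself.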
Schema Registry: schema-1 (value=Avro/Protobuf/JSON), schema-2 (value=Avro/Protobuf/JSON), schema-3 (value=Avro/Protobuf/JSON)
Producers (Schema ID + Data, Local Cache for Schemas):
● Send (register) schema (if not in local cache)
● Send schema-1 (value=Avro/Protobuf/JSON) data serialized per schema ID
Consumers (Schema ID + Data, Local Cache for Schemas):
● Get schema by ID (if not in local cache)
● Read schema-1 (value=Avro/Protobuf/JSON) data deserialized per schema ID
Schema Registry
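The "register if not in local cache" and "get schema by ID if not in local cache" steps can be sketched as follows. This is a simplified model under stated assumptions: Registry, ProducerSide, and ConsumerSide are hypothetical classes, schema definitions are plain strings, and IDs are simple counters. The real registry lives in the Pulsar cluster and is not shown here.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of client-side schema caching in front of a registry.
class SchemaCacheSketch {

    // Stand-in for the registry: maps schema IDs to schema definitions.
    static class Registry {
        private final Map<Integer, String> definitionById = new HashMap<>();
        private final Map<String, Integer> idByDefinition = new HashMap<>();
        int registerCalls = 0; // counts round trips made by producers
        int lookupCalls = 0;   // counts round trips made by consumers

        // Register a definition (idempotent) and return its ID.
        int register(String definition) {
            registerCalls++;
            Integer id = idByDefinition.get(definition);
            if (id == null) {
                id = definitionById.size() + 1;
                definitionById.put(id, definition);
                idByDefinition.put(definition, id);
            }
            return id;
        }

        String getById(int id) {
            lookupCalls++;
            return definitionById.get(id);
        }
    }

    // Producer side: contact the registry only when the schema is not cached.
    static class ProducerSide {
        private final Registry registry;
        private final Map<String, Integer> localCache = new HashMap<>();
        ProducerSide(Registry registry) { this.registry = registry; }

        int schemaIdFor(String definition) {
            return localCache.computeIfAbsent(definition, registry::register);
        }
    }

    // Consumer side: fetch a schema by ID only on a cache miss.
    static class ConsumerSide {
        private final Registry registry;
        private final Map<Integer, String> localCache = new HashMap<>();
        ConsumerSide(Registry registry) { this.registry = registry; }

        String schemaFor(int schemaId) {
            return localCache.computeIfAbsent(schemaId, registry::getById);
        }
    }

    public static void main(String[] args) {
        Registry registry = new Registry();
        ProducerSide producer = new ProducerSide(registry);
        ConsumerSide consumer = new ConsumerSide(registry);
        int id = producer.schemaIdFor("{\"type\":\"record\",\"name\":\"Reading\"}");
        producer.schemaIdFor("{\"type\":\"record\",\"name\":\"Reading\"}"); // served from cache
        System.out.println("schema " + id + " = " + consumer.schemaFor(id));
        System.out.println("register calls: " + registry.registerCalls);
    }
}
```

Messages then only need to carry the schema ID alongside the data, since both sides can resolve the ID locally after the first round trip.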
● Utilizing JSON Data with a JSON Schema
● Consistency, Contracts, Clean Data
● This enables easy SQL:
○ Pulsar SQL (Presto SQL)
○ Flink SQL
○ Spark Structured Streaming
Use Schemas
• Functions - Lightweight Stream
Processing (Java, Python, Go)
• Connectors - Sources & Sinks
(Cassandra, Kafka, …)
• Protocol Handlers - AoP (AMQP), KoP
(Kafka), MoP (MQTT)
• Processing Engines - Flink, Spark,
Presto/Trino via Pulsar SQL
• Data Offloaders - Tiered Storage - (S3)
Sources, Sinks and Processing
Kafka on Pulsar (KoP)
MQTT on Pulsar (MoP)
AMQP on Pulsar (AoP)
Use Apache Pulsar For Ingest
Use Apache Pulsar To Stream to Lakehouses
● Lightweight computation similar
to AWS Lambda.
● Specifically designed to use
Apache Pulsar as a message
bus.
● Function runtime can be
located within Pulsar Broker.
● Java Functions
A serverless event
streaming framework
Apache Pulsar Functions
● Consume messages from one or
more Pulsar topics.
● Apply user-supplied processing
logic to each message.
● Publish the results of the
computation to another topic.
● Support multiple programming
languages (Java, Python, Go)
● Can leverage 3rd-party libraries
to support the execution of ML
models on the edge.
Apache Pulsar Functions
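The consume, apply, publish pattern above can be sketched with the plain java.util.function.Function interface, which the Pulsar Functions Java runtime accepts as its language-native interface. The class name and the transformation are hypothetical, and deployment (input and output topics, packaging) is not shown. The class is package-private only to keep the snippet in one file; the examples in the Pulsar documentation declare such classes public.

```java
import java.util.function.Function;

// Hypothetical per-message logic: the runtime would call apply() for each
// message consumed from the input topic(s) and publish the return value
// to the output topic.
class ExclamationSketch implements Function<String, String> {

    @Override
    public String apply(String input) {
        // User-supplied processing logic for a single message.
        return input + "!";
    }

    public static void main(String[] args) {
        // Local check of the logic without a running Pulsar cluster.
        System.out.println(new ExclamationSketch().apply("hello"));
    }
}
```

Because the class depends only on the JDK, the processing logic can be unit tested without a broker.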
● Visual Question and Answer
● Natural Language Processing
● Sentiment Analysis
● Text Classification
● Named Entity Recognition
● Content-based
Recommendations
• Predictive
Maintenance
• Fault Detection
• Fraud Detection
• Time-Series
Predictions
• Naive Bayes
Apache Pulsar Functions for ML Models
● Libraries
● Functions
● Connectors
● AMQP, Kafka, MQTT
● Tiered Storage
Use Apache Pulsar to Route, Transform & Enrich
Building Real-Time Apps Requires a Team
https://www.influxdata.com/integration/mqtt-monitoring/
• Guaranteed delivery
• Data buffering
- Backpressure
- Pressure release
• Prioritized queuing
• Flow specific QoS
- Latency vs. throughput
- Loss tolerance
• Data provenance
• Supports push and pull
models
• Hundreds of processors
• Visual command and
control
• Over 300 components
• Flow templates
• Pluggable/multi-role
security
• Designed for extension
• Clustering
• Version Control
Apache NiFi Basics
Apache NiFi - Apache Pulsar Connector
https://github.com/streamnative/pulsar-nifi-bundle
● Unified computing engine
● Batch processing is a special case of stream processing
● Stateful processing
● Massive Scalability
● Flink SQL for queries, inserts against Pulsar Topics
● Streaming Analytics
● Continuous SQL
● Continuous ETL
● Complex Event Processing
● Standard SQL Powered by Apache Calcite
Apache Flink
Apache Flink Job Dashboard
https://pulsar.apache.org/docs/en/adaptors-spark/
val dfPulsar = spark.readStream.format("pulsar")
.option("service.url", "pulsar://pulsar1:6650")
.option("admin.url", "http://pulsar1:8080")
.option("topic", "persistent://public/default/airquality")
.load()
val pQuery = dfPulsar.selectExpr("*")
.writeStream.format("console")
.option("truncate", false)
.start()
Apache Spark + Apache Pulsar
val dfPulsar = spark.readStream.format("pulsar")
.option("service.url", "pulsar://pulsar1:6650")
.option("admin.url", "http://pulsar1:8080")
.option("topic", "persistent://public/default/pi-sensors")
.load()
dfPulsar.printSchema()
val pQuery = dfPulsar.selectExpr("*")
.writeStream.format("console")
.option("truncate", false)
.start()
https://github.com/tspannhw/FLiP-Pi-BreakoutGarden
Building Spark SQL View
● Java, Scala, Python Support
● Strong ETL/ELT
● Diverse ML support
● Scalable Distributed compute
● Apache Zeppelin and Jupyter Notebooks
● Fast connector for Apache Pulsar
Why Apache Spark?
NLP Streaming Architecture
IoT Streaming Architecture
● Buffer
● Batch
● Route
● Filter
● Aggregate
● Enrich
● Replicate
● Dedupe
● Decouple
● Distribute
Pulsar Ecosystem for Apps
Streaming FLiPN Java App
[Diagram labels: StreamNative Hub; StreamNative Cloud; Unified Batch and Stream COMPUTING (Batch + Stream); Unified Batch and Stream STORAGE (Queuing + Streaming); Offload; Tiered Storage; Edge Gateway Protocols: Pulsar, KoP, MoP, Websocket; Pulsar Sink; Streaming; Apps]
Streaming FLiPN Apps
[Diagram labels: StreamNative Hub; StreamNative Cloud; Unified Batch and Stream COMPUTING (Batch + Stream); Unified Batch and Stream STORAGE (Queuing + Streaming); Offload; Tiered Storage; Edge Gateway Protocols: Pulsar, KoP, MoP, Websocket, HTTP; Pulsar Sink; Streaming]
Streaming Edge Apps
● https://github.com/tspannhw/pulsar-pychat-function
● https://streamnative.io/apache-nifi-connector/
● https://nightlies.apache.org/flink/flink-docs-master/docs/connectors/datastream/pulsar/
● https://streamnative.io/en/blog/release/2021-04-20-flink-sql-on-streamnative-cloud
● https://github.com/streamnative/flink-example
● https://pulsar.apache.org/docs/en/adaptors-spark/
● https://www.unifiedstreaming.dev/
Apache Pulsar Links
● https://github.com/tspannhw/FLiP-Pi-BreakoutGarden
● https://github.com/tspannhw/FLiP-Pi-Thermal
● https://github.com/tspannhw/FLiP-Pi-Weather
● https://github.com/tspannhw/FLiP-RP400
● https://github.com/tspannhw/FLiP-Py-Pi-GasThermal
● https://github.com/tspannhw/FLiP-PY-FakeDataPulsar
● https://github.com/tspannhw/FLiP-Py-Pi-EnviroPlus
● https://github.com/tspannhw/PythonPulsarExamples
● https://github.com/tspannhw/pulsar-pychat-function
● https://github.com/tspannhw/FLiP-PulsarDevPython101
Apache Pulsar Examples
Apache Pulsar Training
● Instructor-led courses
○ Pulsar Fundamentals
○ Pulsar Developers
○ Pulsar Operations
● On-demand learning with labs
● 300+ engineers, admins and
architects trained!
Now Available
On-Demand
Pulsar Training
Academy.StreamNative.io
StreamNative Academy
Deploying AI With an
Event-Driven
Platform
https://dzone.com/trendreports/enterprise-ai-1
Apache Pulsar in Action
http://tinyurl.com/bdha5p4r
Please enjoy David’s complete book, which is the ultimate guide to Pulsar.
Tim Spann
Developer Advocate
@PaaSDev
https://www.linkedin.com/in/timothyspann
https://github.com/tspannhw
Let’s Keep in Touch

bigdata 2022_ FLiP Into Pulsar Apps

  • 1. FLiP Into Pulsar Apps Tim Spann | Developer Advocate
  • 2. ● Introduction ● What is Apache Pulsar? ● Pulsar Functions ● Apache NiFi ● Apache Flink ● Apache Spark ● Demo ● Q&A In this session, Timothy will introduce you to the world of Apache Pulsar and how to build real-time messaging and streaming applications with a variety of OSS libraries, schemas, languages, frameworks, and tools.
  • 3. Tim Spann Developer Advocate Tim Spann Developer Advocate at StreamNative ● FLiP(N) Stack = Flink, Pulsar and NiFi Stack ● Streaming Systems & Data Architecture Expert ● Experience: ○ 15+ years of experience with streaming technologies including Pulsar, Flink, Spark, NiFi, Big Data, Cloud, MXNet, IoT, Python and more. ○ Today, he helps to grow the Pulsar community sharing rich technical knowledge and experience at both global conferences and through individual conversations.
  • 4. This week in Apache Flink, Apache Pulsar, Apache NiFi, Apache Spark and open source friends. https://bit.ly/32dAJft FLiP Stack Weekly
  • 5. ● Apache Flink ● Apache Pulsar ● Apache NiFi ● Apache Spark ● Pulsar Functions ● Python, Java, Golang FLiP(N) Stack
  • 7. Apache Pulsar is built to support legacy applications, handle the needs of modern apps, and supports NextGen applications Support legacy workloads. Compatible with popular messaging and streaming tools. Legacy Built for today's real-time event driven applications. Modern Scalable, adaptive architecture ready for the future of real-time streaming. NextGen
  • 9. Apache Pulsar has a vibrant community 560+ Contributors 10,000+ Commits 7,000+ Slack Members 1,000+ Organizations Using Pulsar
  • 10. It is often assumed that Pulsar and Kafka have equal capabilities. In reality, Pulsar offers a superset of Kafka. ● Pulsar is streaming and queuing together ● Pulsar is cloud-native with stateless brokers ● Natively includes geo-replication, multi-tenancy, and end-to-end security out of the box ● Pulsar provides automated rebalancing ● Pulsar offers 100X lower latency w/ 2.5 greater throughput than Kafka Advantages of Apache Pulsar
  • 11. Apache Pulsar features Cloud native with decoupled storage and compute layers. Built-in compatibility with your existing code and messaging infrastructure. Geographic redundancy and high availability included. Centralized cluster management and oversight. Elastic horizontal and vertical scalability. Seamless and instant partitioning rebalancing with no downtime. Flexible subscription model supports a wide array of use cases. Compatible with the tools you use to store, analyze, and process data.
  • 12. ● “Bookies” ● Stores messages and cursors ● Messages are grouped in segments/ledgers ● A group of bookies form an “ensemble” to store a ledger ● “Brokers” ● Handles message routing and connections ● Stateless, but with caches ● Automatic load-balancing ● Topics are composed of multiple segments ● ● Stores metadata for both Pulsar and BookKeeper ● Service discovery Store Messages Metadata & Service Discovery Metadata & Service Discovery Pulsar Cluster Metadata Storage Pulsar Cluster
  • 13. Component Description Value / Data payload The data carried by the message. All Pulsar messages contain raw bytes, although message data can also conform to data schemas. Key Messages are optionally tagged with keys, used in partitioning and also is useful for things like topic compaction. Properties An optional key/value map of user-defined properties. Producer name The name of the producer who produces the message. If you do not specify a producer name, the default name is used. Sequence ID Each Pulsar message belongs to an ordered sequence on its topic. The sequence ID of the message is its order in that sequence. Messages - the Basic Unit of Apache Pulsar
  • 14. Different subscription modes have different semantics: Exclusive/Failover - guaranteed order, single active consumer Shared - multiple active consumers, no order Key_Shared - multiple active consumers, order for given key Producer 1 Producer 2 Pulsar Topic Subscription D Consumer D-1 Consumer D-2 Key-Shared < K 1, V 10 > < K 1, V 11 > < K 1, V 12 > < K 2 ,V 2 0 > < K 2 ,V 2 1> < K 2 ,V 2 2 > Subscription C Consumer C-1 Consumer C-2 Shared < K 1, V 10 > < K 2 ,V 2 1> < K 1, V 12 > < K 2 ,V 2 0 > < K 1, V 11 > < K 2 ,V 2 2 > Subscription A Consumer A Exclusive Subscription B Consumer B-1 Consumer B-2 In case of failure in Consumer B-1 Failover Apache Pulsar Subscription Modes
  • 15. Streaming Consumer Consumer Consumer Subscription Shared Failover Consumer Consumer Subscription In case of failure in Consumer B-0 Consumer Consumer Subscription Exclusive X Consumer Consumer Key-Shared Subscription Pulsar Topic/Partition Messaging
  • 16. Unified Messaging Model Simplify your data infrastructure and enable new use cases with queuing and streaming capabilities in one platform. Multi-tenancy Enable multiple user groups to share the same cluster, either via access control, or in entirely different namespaces. Scalability Decoupled data computing and storage enable horizontal scaling to handle data scale and management complexity. Geo-replication Support for multi-datacenter replication with both asynchronous and synchronous replication for built-in disaster recovery. Tiered storage Enable historical data to be offloaded to cloud-native storage and store event streams for indefinite periods of time. Apache Pulsar Benefits
  • 17. Messaging Use Cases Streaming Use Cases Service x commands service y to make some change. Example: order service removing item from inventory service Moving large amounts of data to another service (real-time ETL). Example: logs to elasticsearch Distributing messages that represent work among n workers. Example: order processing not in main “thread” Periodic jobs moving large amounts of data and aggregating to more traditional stores. Example: logs to s3 Sending “scheduled” messages. Example: notification service for marketing emails or push notifications Computing a near real-time aggregate of a message stream, split among n workers, with order being important. Example: real-time analytics over page views Messaging vs Streaming
  • 18. Messaging Use Case Streaming Use Case Retention The amount of data retained is relatively small - typically only a day or two of data at most. Large amounts of data are retained, with higher ingest volumes and longer retention periods. Throughput Messaging systems are not designed to manage big “catch-up” reads. Streaming systems are designed to scale and can handle use cases such as catch-up reads. Differences in Consumption
  • 19. byte[] msgIdBytes = // Some byte array MessageId id = MessageId.fromByteArray(msgIdBytes); Reader<byte[]> reader = pulsarClient.newReader() .topic(topic) .startMessageId(id) .create(); Create a reader that will read from some message between earliest and latest. Reader Apache Pulsar Reader Interface
  • 20. ● New Consumer type added in Pulsar 2.10 that provides a continuously updated key-value map view of compacted topic data. ● An abstraction of a changelog stream from a primary-keyed table, where each record in the changelog stream is an update on the primary-keyed table with the record key as the primary key. ● READ ONLY DATA STRUCTURE! Apache Pulsar TableView
  • 23. Schema Registry schema-1 (value=Avro/Protobuf/JSON) schema-2 (value=Avro/Protobuf/JSON) schema-3 (value=Avro/Protobuf/JSON) Schema Data ID Local Cache for Schemas + Schema Data ID + Local Cache for Schemas Send schema-1 (value=Avro/Protobuf/JSON) data serialized per schema ID Send (register) schema (if not in local cache) Read schema-1 (value=Avro/Protobuf/JSON) data deserialized per schema ID Get schema by ID (if not in local cache) Producers Consumers Schema Registry
  • 24. ● Utilizing JSON Data with a JSON Schema ● Consistency, Contracts, Clean Data ● This enables easy SQL: ○ Pulsar SQL (Presto SQL) ○ Flink SQL ○ Spark Structured Streaming Use Schemas
  • 25. • Functions - Lightweight Stream Processing (Java, Python, Go) • Connectors - Sources & Sinks (Cassandra, Kafka, …) • Protocol Handlers - AoP (AMQP), KoP (Kafka), MoP (MQTT) • Processing Engines - Flink, Spark, Presto/Trino via Pulsar SQL • Data Offloaders - Tiered Storage - (S3) Sources, Sinks and Processing
  • 27. MQTT on Pulsar (MoP)
  • 28. AMQP on Pulsar (AoP)
  • 29. Use Apache Pulsar For Ingest
  • 30. Use Apache Pulsar To Stream to Lakehouses
  • 31. ● Lightweight computation similar to AWS Lambda. ● Specifically designed to use Apache Pulsar as a message bus. ● Function runtime can be located within Pulsar Broker. ● Java Functions A serverless event streaming framework Apache Pulsar Functions
  • 32. ● Consume messages from one or more Pulsar topics. ● Apply user-supplied processing logic to each message. ● Publish the results of the computation to another topic. ● Support multiple programming languages (Java, Python, Go) ● Can leverage 3rd-party libraries to support the execution of ML models on the edge. Apache Pulsar Functions
  • 33. ● Visual Question and Answer ● Natural Language Processing ● Sentiment Analysis ● Text Classification ● Named Entity Recognition ● Content-based Recommendations • Predictive Maintenance • Fault Detection • Fraud Detection • Time-Series Predictions • Naive Bayes Apache Pulsar Functions for ML Models
  • 34. ● Libraries ● Functions ● Connectors ● AMQP, Kafka, MQTT ● Tiered Storage Use Apache Pulsar to Route, Transform & Enrich
  • 35. Building Real-Time Apps Requires a Team
  • 36. https://www.influxdata.com/integration/mqtt-monitoring/ • Guaranteed delivery • Data buffering - Backpressure - Pressure release • Prioritized queuing • Flow specific QoS - Latency vs. throughput - Loss tolerance • Data provenance • Supports push and pull models • Hundreds of processors • Visual command and control • Over 300 components • Flow templates • Pluggable/multi-role security • Designed for extension • Clustering • Version Control Apache NiFi Basics
  • 37. Apache NiFi - Apache Pulsar Connector
  • 39. Apache NiFi - Apache Pulsar Connector
  • 40. Apache NiFi - Apache Pulsar Connector
  • 41. ● Unified computing engine ● Batch processing is a special case of stream processing ● Stateful processing ● Massive Scalability ● Flink SQL for queries, inserts against Pulsar Topics ● Streaming Analytics ● Continuous SQL ● Continuous ETL ● Complex Event Processing ● Standard SQL Powered by Apache Calcite Apache Flink
  • 42. Apache Flink Job Dashboard
  • 43. https://pulsar.apache.org/docs/en/adaptors-spark/ val dfPulsar = spark.readStream.format("pulsar") .option("service.url", "pulsar://pulsar1:6650") .option("admin.url", "http://pulsar1:8080") .option("topic", "persistent://public/default/airquality").load() val pQuery = dfPulsar.selectExpr("*") .writeStream.format("console") .option("truncate", false).start() Apache Spark + Apache Pulsar
  • 44. val dfPulsar = spark.readStream.format("pulsar") .option("service.url", "pulsar://pulsar1:6650") .option("admin.url", "http://pulsar1:8080") .option("topic", "persistent://public/default/pi-sensors") .load() dfPulsar.printSchema() val pQuery = dfPulsar.selectExpr("*") .writeStream.format("console") .option("truncate", false) .start() https://github.com/tspannhw/FLiP-Pi-BreakoutGarden Building Spark SQL View
  • 45. ● Java, Scala, Python Support ● Strong ETL/ELT ● Diverse ML support ● Scalable Distributed compute ● Apache Zeppelin and Jupyter Notebooks ● Fast connector for Apache Pulsar Why Apache Spark?
  • 49. ● Buffer ● Batch ● Route ● Filter ● Aggregate ● Enrich ● Replicate ● Dedupe ● Decouple ● Distribute
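A few of the operations listed on this slide (filter, dedupe, enrich, route) can be illustrated with plain Python generators. This is a sketch over in-memory dictionaries, not Pulsar code: the event fields `id`, `sensor` and `value`, and the `locations` lookup table, are hypothetical.

```python
from typing import Dict, Iterable, Iterator, List


def filter_valid(events: Iterable[dict]) -> Iterator[dict]:
    """Filter: drop events that carry no reading."""
    for event in events:
        if event.get("value") is not None:
            yield event


def dedupe(events: Iterable[dict]) -> Iterator[dict]:
    """Dedupe: keep only the first event seen for each id."""
    seen = set()
    for event in events:
        if event["id"] not in seen:
            seen.add(event["id"])
            yield event


def enrich(events: Iterable[dict], locations: Dict[str, str]) -> Iterator[dict]:
    """Enrich: attach a location looked up by sensor name."""
    for event in events:
        yield {**event, "location": locations.get(event["sensor"], "unknown")}


def route(events: Iterable[dict]) -> Dict[str, List[dict]]:
    """Route: group events into one output list per sensor."""
    topics: Dict[str, List[dict]] = {}
    for event in events:
        topics.setdefault(event["sensor"], []).append(event)
    return topics
```

The stages compose as `route(enrich(dedupe(filter_valid(events)), locations))`, which mirrors how a chain of small processing steps can sit between input and output topics.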
  • 52. Streaming FLiPN Apps (diagram labels): StreamNative Hub; StreamNative Cloud; Unified Batch and Stream Computing (Batch + Stream); Unified Batch and Stream Storage (Queuing + Streaming) with Offload to Tiered Storage; Protocols: Pulsar, KoP, MoP, WebSocket; Pulsar Sink; Streaming; Edge Gateway; Apps
  • 53. Streaming Edge Apps (diagram labels): StreamNative Hub; StreamNative Cloud; Unified Batch and Stream Computing (Batch + Stream); Unified Batch and Stream Storage (Queuing + Streaming) with Offload to Tiered Storage; Protocols: Pulsar, KoP, MoP, WebSocket, HTTP; Pulsar Sink; Streaming; Edge Gateway
  • 54. ● https://github.com/tspannhw/pulsar-pychat-function ● https://streamnative.io/apache-nifi-connector/ ● https://nightlies.apache.org/flink/flink-docs-master/docs/connectors/datastream/pulsar/ ● https://streamnative.io/en/blog/release/2021-04-20-flink-sql-on-streamnative-cloud ● https://github.com/streamnative/flink-example ● https://pulsar.apache.org/docs/en/adaptors-spark/ ● https://www.unifiedstreaming.dev/ Apache Pulsar Links
  • 55. ● https://github.com/tspannhw/FLiP-Pi-BreakoutGarden ● https://github.com/tspannhw/FLiP-Pi-Thermal ● https://github.com/tspannhw/FLiP-Pi-Weather ● https://github.com/tspannhw/FLiP-RP400 ● https://github.com/tspannhw/FLiP-Py-Pi-GasThermal ● https://github.com/tspannhw/FLiP-PY-FakeDataPulsar ● https://github.com/tspannhw/FLiP-Py-Pi-EnviroPlus ● https://github.com/tspannhw/PythonPulsarExamples ● https://github.com/tspannhw/pulsar-pychat-function ● https://github.com/tspannhw/FLiP-PulsarDevPython101 Apache Pulsar Examples
  • 56. Apache Pulsar Training ● Instructor-led courses ○ Pulsar Fundamentals ○ Pulsar Developers ○ Pulsar Operations ● On-demand learning with labs ● 300+ engineers, admins and architects trained! Now Available On-Demand Pulsar Training Academy.StreamNative.io StreamNative Academy
  • 57. Deploying AI With an Event-Driven Platform https://dzone.com/trendreports/enterprise-ai-1
  • 58. Apache Pulsar in Action http://tinyurl.com/bdha5p4r Please enjoy David’s complete book which is the ultimate guide to Pulsar.