Welcome to
Apache Pulsar and Apache NiFi
for Cloud Data Lakes
In the meantime:
● (1) Use the chat to let us know
where you’re calling in from
● (2) Take part in our poll, under
the booth “poll” tab in the right
panel of Hopin
We’ll start in 5 minutes; we’re just waiting for
people to sign in.
[March sn meetup] apache pulsar + apache nifi for cloud data lake
Agenda
01 Intro to Apache Pulsar - Tim Spann
02 Intro to Apache NiFi - John Kuchmek
03 Demo
04 Key Takeaways + Resources
05 Additional Q&A
Tim Spann
Developer Advocate, StreamNative
● FLiP(N) Stack = Flink, Pulsar and NiFi Combined
● Streaming Systems & Data Architecture Expert
● Pulsar, Flink, Spark, NiFi, Big Data, Cloud, MXNet,
IoT, Java, Python, Sensors and more.
John Kuchmek
Principal Solutions Engineer, Cloudera
● Integration of OT & IT Data
● Cloudera Streaming SME
● NiFi, Spark, Flink, Kafka, Storm, Druid, Kudu,
Python, Sensors, PLCs, Private Cloud and Public
Cloud
Apache Pulsar is a Cloud-Native
Messaging and Event-Streaming Platform.
Why Apache Pulsar?
● Unified Messaging Platform
● Guaranteed Message Delivery
● Resiliency
● Infinite Scalability
Messages - the basic unit of Pulsar
Component - Description
Value / data payload - The data carried by the message. All Pulsar messages contain raw bytes, although message data can also conform to data schemas.
Key - Messages are optionally tagged with keys, which are used in partitioning and are also useful for things like topic compaction.
Properties - An optional key/value map of user-defined properties.
Producer name - The name of the producer that produced the message. If you do not specify a producer name, a default name is used. Used for message deduplication.
Sequence ID - Each Pulsar message belongs to an ordered sequence on its topic. The sequence ID of the message is its order in that sequence. Used for message deduplication.
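The producer name and sequence ID are the two components tied to message deduplication. As a rough illustration only, the sketch below models the message components as a Python dataclass and drops a resent message by remembering the highest sequence ID seen per producer. The `Message` and `DedupLog` names, the sensor payloads, and the in-memory log are all hypothetical; this is a simplified model, not Pulsar's broker implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Message:
    """Illustrative stand-in for the message components listed above."""
    value: bytes                       # raw payload bytes
    producer_name: str                 # name of the producer
    sequence_id: int                   # position in the producer's sequence
    key: Optional[str] = None          # optional partitioning / compaction key
    properties: Dict[str, str] = field(default_factory=dict)


class DedupLog:
    """Hypothetical append-only log that ignores replayed messages."""

    def __init__(self) -> None:
        self.entries: List[Message] = []
        self._highest_seen: Dict[str, int] = {}  # producer name -> last sequence ID

    def append(self, msg: Message) -> bool:
        last = self._highest_seen.get(msg.producer_name, -1)
        if msg.sequence_id <= last:
            return False  # already stored: treat as a duplicate
        self._highest_seen[msg.producer_name] = msg.sequence_id
        self.entries.append(msg)
        return True


log = DedupLog()
first = Message(b"temp=21.5", "sensor-a", 0, key="device-1")
retry = Message(b"temp=21.5", "sensor-a", 0, key="device-1")  # resend after a timeout
later = Message(b"temp=21.7", "sensor-a", 1, key="device-1")
results = [log.append(m) for m in (first, retry, later)]
print(results)
```

The resend carries the same producer name and sequence ID as the first attempt, which is what lets the log recognize it without inspecting the payload.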
Connectivity
• Libraries - (Java, Python, Go, NodeJS, WebSockets,
C++, C#, Scala, Rust,...)
• Functions - Lightweight Stream Processing (Java,
Python, Go)
• Connectors - Sources & Sinks (Cassandra, Kafka, …)
• Protocol Handlers - AoP (AMQP), KoP (Kafka), MoP
(MQTT)
• Processing Engines - Flink, Spark, Presto/Trino via
Pulsar SQL, NiFi
• Data Offloaders - Tiered Storage - (S3)
hub.streamnative.io
Apache NiFi Pulsar Connector
https://github.com/streamnative/pulsar-nifi-bundle
Apache NiFi is a GUI-based data flow
tool that runs anywhere.
Why NiFi
• Enables easy ingestion, routing, management, and delivery of any data anywhere (edge,
cloud, data center) to any downstream system, with built-in end-to-end security and
provenance
ACQUIRE PROCESS DELIVER
• Over 350 Prebuilt Processors
• Easy to build your own
• Parse, Enrich & Apply Schema
• Filter, Split, Merge & Route
• Throttle & Backpressure
• Guaranteed Delivery
• Full data provenance
• Eco-system integration
Advanced tooling to industrialize flow development
(Flow Development Life Cycle)
Acquire / Deliver: FTP, SFTP, HL7, UDP, XML, HTTP, EMAIL, HTML, IMAGE, SYSLOG
Process: HASH, MERGE, EXTRACT, DUPLICATE, SPLIT, ROUTE TEXT, ROUTE CONTENT, ROUTE CONTEXT, CONTROL RATE, DISTRIBUTE LOAD, GEOENRICH, SCAN, REPLACE, TRANSLATE, CONVERT, ENCRYPT, TAIL, EVALUATE, EXECUTE
Apache NiFi Capabilities
Data Ingest: HTTP, Syslog, HL7, UDP, SFTP, MQTT, WS
Data Transformation: Hash, Compress, Merge, Duplicate, Split, Encrypt
Data Enrichment: Syslog, REST, Mapcach, Enrich IP, GeoIP, XML
Flow Development Lifecycle
FLiP Stack Weekly
This week in Apache Flink, Apache Pulsar, Apache
NiFi, Apache Spark and open source friends.
https://bit.ly/32dAJft
Demo
Streaming NFTs
[Diagram: demo data feeds labeled NFTs, Thermal, Aggregates, Status, APIs, Crypto Feeds, Weather, and NFT++]
https://opensea.io/collection/tspannhw-collection
Key Takeaways
● Real-time ingest and data manipulation
● Easy to use/configure processors and
controller services
● Multiple ways to connect
Resources
Learn more about the NiFi + Pulsar integration
https://streamnative.io/apache-nifi-connector/
GitHub
https://github.com/tspannhw/awesome-nifi-pulsar
Blog post on the Apache Pulsar + NiFi integration
https://hubs.ly/Q015PNMd0
Additional Resources
Doug Cohen
Head of Sales, StreamNative
● StreamNative: Pulsar-as-a-Service
● AWS Certified Associate Solutions Architect
● Reach me at doug@streamnative.io
Let’s Keep
in Touch!
Tim Spann
Developer Advocate
@PaaSDev
linkedin.com/in/timothyspann
github.com/tspannhw
John Kuchmek
Principal Solutions Engineer
@K_Physics
linkedin.com/in/jkuchmek
github.com/johnkuch
Pulsar Subscription Modes
Different subscription modes
have different semantics:
Exclusive/Failover - guaranteed
order, single active consumer
Shared - multiple active
consumers, no order
Key_Shared - multiple active
consumers, order for given key
[Diagram: Producer 1 and Producer 2 publish to a Pulsar topic that has four subscriptions.
Subscription A (Exclusive): Consumer A.
Subscription B (Failover): Consumer B-1 and Consumer B-2; B-2 takes over in case of failure in Consumer B-1.
Subscription C (Shared): Consumer C-1 and Consumer C-2; the messages are split as <K1,V10>, <K2,V21>, <K1,V12> and <K2,V20>, <K1,V11>, <K2,V22>, regardless of key.
Subscription D (Key_Shared): Consumer D-1 and Consumer D-2; <K1,V10>, <K1,V11>, <K1,V12> stay together, as do <K2,V20>, <K2,V21>, <K2,V22>.]
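To make the ordering semantics of the subscription modes concrete, here is a small simulation in plain Python. It does not use the Pulsar client. The round-robin rule for Shared and the CRC32-modulo rule for Key_Shared are stand-ins chosen for illustration (Pulsar's actual key-to-consumer assignment works differently), and the consumer names simply mirror the diagram labels.

```python
import zlib
from collections import defaultdict
from itertools import cycle

# (key, value) pairs in publish order, mirroring the diagram.
messages = [("K1", 10), ("K1", 11), ("K1", 12), ("K2", 20), ("K2", 21), ("K2", 22)]


def dispatch_exclusive(msgs, consumers):
    """Exclusive/Failover: one active consumer receives everything in order."""
    return {consumers[0]: list(msgs)}


def dispatch_shared(msgs, consumers):
    """Shared: messages are spread across consumers; per-key order is not kept."""
    out = defaultdict(list)
    for consumer, msg in zip(cycle(consumers), msgs):
        out[consumer].append(msg)
    return dict(out)


def dispatch_key_shared(msgs, consumers):
    """Key_Shared: every message with a given key goes to the same consumer."""
    out = defaultdict(list)
    for key, value in msgs:
        index = zlib.crc32(key.encode()) % len(consumers)  # illustrative hash rule
        out[consumers[index]].append((key, value))
    return dict(out)


shared = dispatch_shared(messages, ["C-1", "C-2"])
key_shared = dispatch_key_shared(messages, ["D-1", "D-2"])
print("shared:", shared)
print("key_shared:", key_shared)
```

In the Shared result each consumer sees an interleaving of K1 and K2 values, while in the Key_Shared result all values for one key arrive at a single consumer in publish order.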
Schema Registry
[Diagram: the schema registry stores schema-1, schema-2, and schema-3 (value=Avro/Protobuf/JSON). Producers and consumers each keep a local cache for schemas, and each message carries a schema ID plus data.]
Producers:
● Send (register) schema (if not in local cache)
● Send schema-1 (value=Avro/Protobuf/JSON) data serialized per schema ID
Consumers:
● Get schema by ID (if not in local cache)
● Read schema-1 (value=Avro/Protobuf/JSON) data deserialized per schema ID
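The register-once, look-up-once flow from the schema registry slide can be sketched with an in-memory registry. Everything here is hypothetical and simplified: the class names, the `{"fields": [...]}` schema shape, and JSON as the wire encoding are assumptions for illustration, not Pulsar's schema API.

```python
import json


class SchemaRegistry:
    """Hypothetical in-memory registry that assigns an ID to each schema."""

    def __init__(self):
        self._by_id = {}
        self._ids = {}
        self.lookups = 0  # counts remote "get schema by ID" calls

    def register(self, schema):
        canonical = json.dumps(schema, sort_keys=True)
        if canonical not in self._ids:
            schema_id = len(self._ids) + 1
            self._ids[canonical] = schema_id
            self._by_id[schema_id] = schema
        return self._ids[canonical]

    def get(self, schema_id):
        self.lookups += 1
        return self._by_id[schema_id]


class Producer:
    def __init__(self, registry, schema):
        self.registry, self.schema = registry, schema
        self._schema_id = None  # local cache: filled on first send

    def send(self, record):
        if self._schema_id is None:  # register only if not cached locally
            self._schema_id = self.registry.register(self.schema)
        missing = set(self.schema["fields"]) - set(record)
        if missing:
            raise ValueError(f"record is missing fields: {sorted(missing)}")
        return self._schema_id, json.dumps(record).encode()


class Consumer:
    def __init__(self, registry):
        self.registry = registry
        self._cache = {}  # local cache: schema ID -> schema

    def receive(self, envelope):
        schema_id, payload = envelope
        if schema_id not in self._cache:  # fetch only if not cached locally
            self._cache[schema_id] = self.registry.get(schema_id)
        record = json.loads(payload)
        return {name: record[name] for name in self._cache[schema_id]["fields"]}


registry = SchemaRegistry()
producer = Producer(registry, {"fields": ["sensor", "temp_c"]})
consumer = Consumer(registry)
envelopes = [producer.send({"sensor": "s1", "temp_c": t}) for t in (20.5, 21.0)]
decoded = [consumer.receive(e) for e in envelopes]
print(decoded, "registry lookups:", registry.lookups)
```

Two messages are decoded but the registry is consulted only once, which is the point of the local caches on both sides.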
● Buffer
● Batch
● Route
● Filter
● Aggregate
● Enrich
● Replicate
● Dedupe
● Decouple
● Distribute
Messaging
Ideal for work queues that do not
require tasks to be performed in a
particular order—for example, sending
one email message to many recipients.
RabbitMQ and Amazon SQS are
examples of popular queue-based
message systems.
Pulsar: Unified Messaging + Data Streaming
... and Streaming
Works best in situations where the order
of messages is important—for example,
data ingestion.
Kafka and Amazon Kinesis are examples
of messaging systems that use streaming
semantics for consuming messages.
Pulsar Instance
Pulsar Cluster
A Unified Messaging Platform
Message Queuing
Data Streaming
Topics
[Diagram: a Pulsar instance containing a Pulsar cluster. Tenants (Compliance, Data Services, Marketing) contain namespaces (Microservices, ETL, Campaigns, Risk Assessment), which in turn contain topics (Cust Auth, Location Resolution, Demographics, Budgeted Spend, Acct History, Risk Detection).]
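The tenant, namespace, and topic levels show up directly in fully qualified Pulsar topic names, which take the form `persistent://tenant/namespace/topic` (or `non-persistent://...`). The helper below is a small sketch for building and parsing such names. The `TopicName` type and the example tenant/namespace/topic pairing are assumptions for illustration; the diagram does not make clear which namespace belongs to which tenant.

```python
from typing import NamedTuple


class TopicName(NamedTuple):
    """Illustrative holder for the parts of a fully qualified topic name."""
    domain: str      # "persistent" or "non-persistent"
    tenant: str
    namespace: str
    topic: str

    def __str__(self) -> str:
        return f"{self.domain}://{self.tenant}/{self.namespace}/{self.topic}"


def parse_topic(name: str) -> TopicName:
    """Split 'domain://tenant/namespace/topic' into its four parts."""
    domain, separator, rest = name.partition("://")
    if not separator or domain not in ("persistent", "non-persistent"):
        raise ValueError(f"unexpected topic domain in {name!r}")
    parts = rest.split("/")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected tenant/namespace/topic in {name!r}")
    return TopicName(domain, *parts)


# Hypothetical pairing of names taken from the diagram labels.
full_name = str(TopicName("persistent", "marketing", "campaigns", "budgeted-spend"))
parsed = parse_topic(full_name)
print(full_name, parsed.tenant, parsed.namespace, parsed.topic)
```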
Pulsar’s Publish-Subscribe model
[Diagram: Producer 1 and Producer 2 publish to a topic on a broker; a subscription delivers the messages to Consumers 1, 2, and 3]
● Producers send messages.
● A topic is an ordered, named channel that producers
use to transmit messages to subscribed consumers.
● Messages belong to a topic and contain an arbitrary
payload.
● Brokers handle connections and route
messages between producers and consumers.
● Subscriptions are named configuration rules
that determine how messages are delivered to
consumers.
● Consumers receive messages.
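A toy in-memory model can show how these roles fit together: the topic keeps an ordered log, and each named subscription tracks its own position in that log, so producers never need to know who is consuming. The `Topic` class, its method names, and the subscription names are hypothetical, and the model skips acknowledgements, brokers, and persistence entirely.

```python
class Topic:
    """Toy model: an ordered log plus one read position per subscription."""

    def __init__(self, name):
        self.name = name
        self.log = []       # messages in publish order
        self.cursors = {}   # subscription name -> index of next message

    def publish(self, payload):
        """Append a message and return its position in the log."""
        self.log.append(payload)
        return len(self.log) - 1

    def subscribe(self, subscription):
        """Create the subscription if it does not exist yet."""
        self.cursors.setdefault(subscription, 0)

    def receive(self, subscription):
        """Return the next message for this subscription, or None if caught up."""
        position = self.cursors[subscription]
        if position >= len(self.log):
            return None
        self.cursors[subscription] = position + 1
        return self.log[position]


topic = Topic("nft-events")
topic.subscribe("lake-loader")
topic.subscribe("alerts")
topic.publish(b"first")
topic.publish(b"second")

lake = [topic.receive("lake-loader"), topic.receive("lake-loader")]
alerts = [topic.receive("alerts")]
print(lake, alerts)
```

Both subscriptions see the same messages in the same order, and reading from one does not advance the other.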
Producer-Consumer
● Producer: the publisher sends data and doesn't know about the subscribers or their status.
● Pulsar: all interactions go through Pulsar, which handles all communication.
● Consumer: the subscriber receives data from the publisher and never interacts with it directly.
Kafka on Pulsar (KoP)
streamnative.io
MQTT on Pulsar (MoP)
Pulsar Functions
● Lightweight
computation similar to
AWS Lambda.
● Specifically designed to
use Apache Pulsar as a
message bus.
● Function runtime can
be located within
Pulsar Broker.
A serverless event streaming
framework
● Consume messages from one
or more Pulsar topics.
● Apply user-supplied
processing logic to each
message.
● Publish the results of the
computation to another topic.
● Support multiple
programming languages (Java,
Python, Go)
● Can leverage 3rd-party
libraries to support the
execution of ML models on
the edge.
Pulsar Functions
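As a minimal sketch of the consume, process, publish pattern, the snippet below follows the language-native Python style for Pulsar Functions, in which a module exposes a `process` function that is called once per message and whose return value is published to the output topic. The JSON message layout and the `sensor`, `temp_c`, and `temp_f` field names are assumptions for illustration; the input and output topics are set at deployment time and are not shown.

```python
import json


def process(input):
    """Called once per incoming message; the returned string is what the
    function runtime would publish to the output topic."""
    reading = json.loads(input)
    # Enrich the reading with a Fahrenheit value (hypothetical field names).
    reading["temp_f"] = round(reading["temp_c"] * 9 / 5 + 32, 2)
    return json.dumps(reading)


if __name__ == "__main__":
    # Local check without a Pulsar cluster.
    print(process('{"sensor": "thermal-1", "temp_c": 20.0}'))
```

Because the function is plain Python, it can be exercised locally like this before it is deployed.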
Pulsar SQL
Presto/Trino workers can read
segments directly from
bookies (or offloaded storage)
in parallel.
[Diagram: a producer and a consumer connect to Brokers 1-3, which serve Topic1-Part1, Topic1-Part2, and Topic1-Part3. Segments 1 through X are replicated across Bookies 1-3. A query goes to the query coordinator, which uses topic metadata and hands work to the SQL workers.]
Use Cases
Multi-Tenant Data
Infrastructure
AdTech
Fraud Detection
Connected Car
IoT Analytics
Data Lake Hydration
Apache NiFi
Apache NiFi Pulsar Connector
https://github.com/streamnative/pulsar-nifi-bundle
https://github.com/david-streamlio/pulsar-nifi-bundle
StreamNative
Cloud
streamnative.io
Passionate and dedicated team.
Founded by the original developers of
Apache Pulsar.
StreamNative helps teams to capture,
manage, and leverage data using Pulsar’s
unified messaging and streaming
platform.
Founded By The
Creators Of Apache Pulsar
Sijie Guo - ASF Member, Pulsar/BookKeeper PMC, Founder and CEO
Jia Zhai - Pulsar/BookKeeper PMC, Co-Founder
Matteo Merli - ASF Member, Pulsar/BookKeeper PMC, CTO
Data veterans with extensive industry experience
REST Feed Non-Fungible Token
{"date":"Thu, 24 Feb 2022 22:26:41 GMT","short_description":"","featured":"false",
"image_thumbnail_url":"htt","asset_contract_created_date":"2022-02-17T15:48:44.822206",
"asset_contract_owner":"50299352","image_preview_url":"https://lh3.googleus",
"asset_contract_symbol":"TD","twitter_username":"","description":"10,000metaverse-readyAvatars",
"asset_contract_address":"0xc7df86762ba83f2a6197e1ff9bb40ae0f696b9e6",
"external_url":"https://www.sandbox.game/en/snoopdogg/","token_id":"492",
"asset_contract_name":"Theoggies","asset_contract_nft_version":"3.0",
"asset_contract_description":"metaverse.",
"asset_contract_external_link":"https://www.sandbox.game/en/snoopdogg/",
"id":"307922619","featured_image_url":"https","slug":"snoop-dogg-doggies",
"token_metadata":"https://contracts.sandbox.game/unrevealed.json?tokenId=492",
"asset_contract_schema_name":"ERC721","animation_url":"https","num_sales":"1",
"image_url":"https://lh","asset_contract_default_to_fiat":"false","external_link":"",
"image_original_url":"https://contracts.sandbox.game/preview.png",
"asset_contract_payout_address":"0x4489590a116618b506f0efe885432f6a8ed998e9",
"animation_original_url":"https://con","background_color":"",
"asset_contract_asset_contract_type":"non-fungible","name":"The Doggies",
"asset_contract_image_url":"https","asset_contract_total_supply":"0"}
https://docs.opensea.io/reference/retrieving-bundles
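Every value in the feed record arrives as a string, so a typical step before landing it in a data lake is to pick the fields of interest and give them proper types. The sketch below does that for a trimmed copy of the record. The choice of fields, the output column names, and the `to_row` helper are assumptions for illustration, not the flow used in the demo.

```python
import json
from email.utils import parsedate_to_datetime

# Trimmed copy of the sample record; only fields used below are kept.
RAW = (
    '{"date":"Thu, 24 Feb 2022 22:26:41 GMT","featured":"false",'
    '"token_id":"492","id":"307922619","slug":"snoop-dogg-doggies",'
    '"asset_contract_schema_name":"ERC721","num_sales":"1","name":"The Doggies"}'
)


def to_row(raw):
    """Convert one all-strings feed record into a typed row (hypothetical layout)."""
    record = json.loads(raw)
    return {
        "id": int(record["id"]),
        "token_id": int(record["token_id"]),
        "name": record["name"],
        "slug": record["slug"],
        "schema_name": record["asset_contract_schema_name"],
        "num_sales": int(record["num_sales"]),
        "featured": record["featured"] == "true",
        # The feed uses an HTTP-style date; store it in ISO 8601 form.
        "event_time": parsedate_to_datetime(record["date"]).isoformat(),
    }


row = to_row(RAW)
print(row)
```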
Apache Pulsar - Apache NiFi <-> Events <-> Cloud Data Stores
[Architecture diagram. Labels: Data Gateway protocols (Pulsar, KoP, MoP, WebSocket, HTTP); StreamNative Cloud (Queuing + Streaming); Micro Service; Pulsar Sink; StreamNative Hub; Unified Batch and Stream Storage; Offload to Tiered Storage; Data to Cloud Data Lake]