© 2023 Cloudera, Inc. All rights reserved.
Apache Pulsar Development 101
(Cloud Native Streaming With Java)
Tim Spann
Principal Developer Advocate
5-April-2023
TOPICS
Topics
● Introduction to Streaming
● Introduction to Apache Pulsar
● Developing Pulsar Apps in Spring
● Introduction to Apache Kafka
● Developing Spring Apps Against Kafka
● FLiPN/FLaNK Stack
● Demos
FLiPN-FLaNK Stack
Tim Spann
@PaasDev // Blog: www.datainmotion.dev
Principal Developer Advocate.
Princeton Future of Data Meetup.
ex-Pivotal, ex-Hortonworks, ex-StreamNative, ex-PwC
https://github.com/tspannhw/EverythingApacheNiFi
https://medium.com/@tspann
Apache NiFi x Apache Kafka x Apache Flink x Java
FLiP Stack Weekly
This week in Apache NiFi, Apache Kafka, Apache
Flink, Apache Pulsar, Apache Iceberg, Python,
Java and Open Source friends.
https://bit.ly/32dAJft
STREAMING
STREAMING FROM … TO … WHILE …
Data distribution as a first-class citizen
[Diagram: Universal Data Distribution (Ingest, Transform, Deliver)]
Sources: IoT devices, log data sources, on-prem data sources, app logs, laptops/servers, mobile apps, security agents
Flow: ingest processors and ingest gateway; router, filter & transform processors; destination processors
Destinations: big data cloud services, cloud business process services, cloud data analytics/service (Cloudera DW), cloud warehouse
Streaming for Java Developers
Multiple users, frameworks, languages, devices, data sources & clusters
CAT
• Expert in ETL (Eating, Ties and Laziness)
• Deep SME in Buzzwords
• No Coding Skills
• R&D into Lasers
AI
• Will Drive your Car?
• Will Fix Your Code?
• Will Beat You At Q-Bert
• Will Write my Next Talk
STREAMING ENGINEER
• Coding skills in Python, Java
• Experience with Apache Kafka or Pulsar
• Knowledge of database query languages such as SQL
• Knowledge of tools such as Apache Flink, Apache Spark and Apache NiFi
JAVA DEVELOPER
• Frameworks like Spring, Quarkus and Micronaut
• Relational Databases, SQL
• Cloud
• Dev and Build Tools
APACHE PULSAR
DevNexus:  Apache Pulsar Development 101 with Java
101
Unified Messaging Platform
Guaranteed Message Delivery
Resiliency
Infinite Scalability
[Diagram: one Pulsar topic/partition feeding four subscriptions, covering both streaming and messaging]
• Exclusive subscription: a single consumer; a second consumer is refused (X)
• Failover subscription: a standby consumer takes over in case of failure in Consumer B-0
• Shared subscription: multiple consumers
• Key-Shared subscription: multiple consumers
Tenants / Namespaces / Topics
[Diagram: a Pulsar cluster divided into tenants; each tenant holds namespaces, and each namespace holds topics]
Tenants: Compliance, Data Services, Marketing
Namespaces: Microservices, ETL, Campaigns, ETL, Risk Assessment
Topics: Cust Auth, Location Resolution, Demographics, Budgeted Spend, Acct History, Risk Detection
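The tenant / namespace / topic hierarchy shows up directly in fully qualified topic names such as persistent://public/default/airquality, which is used in the Spring configuration later in this deck. The following is a minimal sketch in plain Java of how such a name is put together and taken apart. The class TopicNames and the sample tenant, namespace and topic values are hypothetical and chosen only for illustration; this is not the Pulsar client's own naming API.

```java
import java.util.Objects;

/** Minimal sketch of Pulsar's topic naming scheme: domain://tenant/namespace/topic. */
final class TopicNames {
    private TopicNames() {}

    /** Builds a fully qualified name for a persistent topic. */
    static String persistent(String tenant, String namespace, String topic) {
        Objects.requireNonNull(tenant, "tenant");
        Objects.requireNonNull(namespace, "namespace");
        Objects.requireNonNull(topic, "topic");
        return "persistent://" + tenant + "/" + namespace + "/" + topic;
    }

    /** Splits a fully qualified name into {domain, tenant, namespace, topic}. */
    static String[] parse(String fullName) {
        int sep = fullName.indexOf("://");
        if (sep < 0) {
            throw new IllegalArgumentException("missing domain: " + fullName);
        }
        String domain = fullName.substring(0, sep);
        // Limit of 3 keeps any further slashes inside the topic part.
        String[] parts = fullName.substring(sep + 3).split("/", 3);
        if (parts.length != 3) {
            throw new IllegalArgumentException("expected tenant/namespace/topic: " + fullName);
        }
        return new String[] {domain, parts[0], parts[1], parts[2]};
    }

    public static void main(String[] args) {
        // Hypothetical names, loosely based on the labels in the diagram.
        System.out.println(persistent("marketing", "campaigns", "budgeted-spend"));
        System.out.println(String.join(" | ", parse("persistent://public/default/airquality")));
    }
}
```

Because the tenant and namespace are part of the name, access policies and quotas can be reasoned about per prefix rather than per topic.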
Messages - The Basic Unit of Pulsar
Component: Description
Value / data payload: The data carried by the message. All Pulsar messages contain raw bytes, although message data can also conform to data schemas.
Key: Messages are optionally tagged with keys, which are used in partitioning and are also useful for things like topic compaction.
Properties: An optional key/value map of user-defined properties.
Producer name: The name of the producer that produced the message. If you do not specify a producer name, a default name is used.
Sequence ID: Each Pulsar message belongs to an ordered sequence on its topic. The sequence ID of the message is its order in that sequence.
Pulsar Subscription Modes
Different subscription modes have different semantics:
• Exclusive/Failover - guaranteed order, single active consumer
• Shared - multiple active consumers, no order
• Key_Shared - multiple active consumers, order for given key
[Diagram: Producer 1 and Producer 2 publish to one Pulsar topic, which serves four subscriptions]
Subscription A (Exclusive): Consumer A
Subscription B (Failover): Consumer B-1, Consumer B-2 (takes over in case of failure in Consumer B-1)
Subscription C (Shared): Consumer C-1, Consumer C-2; <K1,V10>, <K2,V21>, <K1,V12> and <K2,V20>, <K1,V11>, <K2,V22> are spread across both consumers
Subscription D (Key-Shared): Consumer D-1, Consumer D-2; <K1,V10>, <K1,V11>, <K1,V12> go to one consumer and <K2,V20>, <K2,V21>, <K2,V22> to the other
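To make the difference between Shared and Key_Shared concrete, here is a small simulation in plain Java using the same six key/value messages as the diagram. It is only a sketch under simplifying assumptions: the class DispatchSketch is hypothetical, Shared is modeled as plain round-robin, and Key_Shared is modeled as a hash of the key modulo the number of consumers, which is simpler than what a real broker does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Simulates how Shared and Key_Shared subscriptions spread messages over consumers. */
final class DispatchSketch {
    private DispatchSketch() {}

    /** Messages as {key, value} pairs, in the order the topic received them. */
    static final List<String[]> MESSAGES = Arrays.asList(
            new String[] {"K1", "V10"}, new String[] {"K1", "V11"}, new String[] {"K1", "V12"},
            new String[] {"K2", "V20"}, new String[] {"K2", "V21"}, new String[] {"K2", "V22"});

    /** Shared (simplified): round-robin, so one key can end up on several consumers. */
    static Map<String, List<String>> shared(List<String[]> messages, List<String> consumers) {
        Map<String, List<String>> out = emptyAssignments(consumers);
        int next = 0;
        for (String[] m : messages) {
            String consumer = consumers.get(next % consumers.size());
            next++;
            out.get(consumer).add(m[0] + "=" + m[1]);
        }
        return out;
    }

    /** Key_Shared (simplified): the key's hash picks the consumer, so per-key order is kept. */
    static Map<String, List<String>> keyShared(List<String[]> messages, List<String> consumers) {
        Map<String, List<String>> out = emptyAssignments(consumers);
        for (String[] m : messages) {
            int slot = Math.floorMod(m[0].hashCode(), consumers.size());
            out.get(consumers.get(slot)).add(m[0] + "=" + m[1]);
        }
        return out;
    }

    private static Map<String, List<String>> emptyAssignments(List<String> consumers) {
        Map<String, List<String>> out = new LinkedHashMap<>();
        for (String c : consumers) {
            out.put(c, new ArrayList<>());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println("Shared:     " + shared(MESSAGES, Arrays.asList("C-1", "C-2")));
        System.out.println("Key_Shared: " + keyShared(MESSAGES, Arrays.asList("D-1", "D-2")));
    }
}
```

In the Shared run, messages for K1 land on both consumers, so no consumer sees K1 in full order. In the Key_Shared run, every message for a given key goes to the same consumer.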
Flexible Pub/Sub API for Pulsar - Shared
// SubscriptionType is org.apache.pulsar.client.api.SubscriptionType
Consumer<byte[]> consumer = client.newConsumer()
        .topic("my-topic")
        .subscriptionName("work-q-1")
        .subscriptionType(SubscriptionType.Shared)
        .subscribe();
Flexible Pub/Sub API for Pulsar - Failover
// SubscriptionType is org.apache.pulsar.client.api.SubscriptionType
Consumer<byte[]> consumer = client.newConsumer()
        .topic("my-topic")
        .subscriptionName("stream-1")
        .subscriptionType(SubscriptionType.Failover)
        .subscribe();
Apache Pulsar Ecosystem
hub.streamnative.io
• Data Offloaders (Tiered Storage)
• Client Libraries
• Connectors (Sources & Sinks)
• Protocol Handlers
• Pulsar Functions (Lightweight Stream Processing)
• Processing Engines
… and more!
Kafka on Pulsar (KoP)
MQTT on Pulsar (MoP)
AMQP on Pulsar (AoP)
Schema Registry
[Diagram: producers and consumers each keep a local cache of schemas (schema data + ID) and talk to the schema registry]
Registry contents: schema-1, schema-2, schema-3 (value=Avro/Protobuf/JSON)
Producers:
• Send (register) schema (if not in local cache)
• Send schema-1 (value=Avro/Protobuf/JSON) data serialized per schema ID
Consumers:
• Get schema by ID (if not in local cache)
• Read schema-1 (value=Avro/Protobuf/JSON) data deserialized per schema ID
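The "if not in local cache" step can be sketched in a few lines of plain Java. This is an illustration of the caching idea only, not Pulsar's schema registry client: the classes Registry and CachingClient and the sample schema string are hypothetical stand-ins.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Sketch of a consumer-side schema cache that contacts the registry only on a miss. */
final class SchemaCacheSketch {
    private SchemaCacheSketch() {}

    /** Stand-in for the remote registry: maps schema IDs to definitions and counts lookups. */
    static final class Registry {
        private final Map<Integer, String> schemasById = new ConcurrentHashMap<>();
        final AtomicInteger lookups = new AtomicInteger();

        void register(int id, String definition) {
            schemasById.put(id, definition);
        }

        String fetch(int id) {
            lookups.incrementAndGet();
            String definition = schemasById.get(id);
            if (definition == null) {
                throw new IllegalArgumentException("unknown schema id " + id);
            }
            return definition;
        }
    }

    /** Client that asks the registry only when the ID is not cached locally. */
    static final class CachingClient {
        private final Registry registry;
        private final Map<Integer, String> localCache = new ConcurrentHashMap<>();

        CachingClient(Registry registry) {
            this.registry = registry;
        }

        String schemaFor(int id) {
            return localCache.computeIfAbsent(id, registry::fetch);
        }
    }

    public static void main(String[] args) {
        Registry registry = new Registry();
        registry.register(1, "{\"type\":\"record\",\"name\":\"Observation\"}");
        CachingClient consumer = new CachingClient(registry);
        consumer.schemaFor(1); // miss: fetched from the registry
        consumer.schemaFor(1); // hit: served from the local cache
        System.out.println("registry lookups: " + registry.lookups.get());
    }
}
```

Each message then only needs to carry the schema ID, and the registry is contacted once per schema rather than once per message.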
Building Real-Time Requires a Team
Pulsar - Spring
https://docs.spring.io/spring-pulsar/docs/current-SNAPSHOT/reference/html/
Pulsar - Spring - Code
@Autowired
private PulsarTemplate<Observation> pulsarTemplate;

// Producer side: send the observation as JSON, keyed by a UUID
this.pulsarTemplate.setSchema(Schema.JSON(Observation.class));
MessageId msgid = pulsarTemplate.newMessage(observation)
        .withMessageCustomizer((mb) -> mb.key(uuidKey.toString()))
        .send();

// Consumer side: Shared subscription reading JSON observations
@PulsarListener(subscriptionName = "aq-spring-reader", subscriptionType = SubscriptionType.Shared,
        schemaType = SchemaType.JSON, topics = "persistent://public/default/aq-pm25")
void echoObservation(Observation message) {
    this.log.info("PM2.5 Message received: {}", message);
}
Pulsar - Spring - Configuration
spring:
  pulsar:
    client:
      service-url: pulsar+ssl://sn-academy.sndevadvocate.snio.cloud:6651
      auth-plugin-class-name: org.apache.pulsar.client.impl.auth.oauth2.AuthenticationOAuth2
      authentication:
        issuer-url: https://auth.streamnative.cloud/
        private-key: file:///scr/sndevadvocate-tspann.json
        audience: urn:sn:pulsar:sndevadvocate:my-instance
    producer:
      batching-enabled: false
      send-timeout-ms: 90000
      producer-name: airqualityjava
      topic-name: persistent://public/default/airquality
Kafka - Spring
https://docs.spring.io/spring-kafka/reference/html/
Spring - Kafka
https://www.baeldung.com/spring-kafka
@Bean
public KafkaTemplate<String, Observation> kafkaTemplate() {
    KafkaTemplate<String, Observation> kafkaTemplate =
            new KafkaTemplate<>(producerFactory());
    return kafkaTemplate;
}

ProducerRecord<String, Observation> producerRecord =
        new ProducerRecord<>(topicName, uuidKey.toString(), message);
kafkaTemplate.send(producerRecord);
Spring - MQTT
https://roytuts.com/publish-subscribe-message-onto-mqtt-using-spring/
@Bean
public IMqttClient mqttClient(
        @Value("${mqtt.clientId}") String clientId,
        @Value("${mqtt.hostname}") String hostname,
        @Value("${mqtt.port}") int port)
        throws MqttException {
    IMqttClient mqttClient = new MqttClient(
            "tcp://" + hostname + ":" + port, clientId);
    mqttClient.connect(mqttConnectOptions());
    return mqttClient;
}

MqttMessage mqttMessage = new MqttMessage();
mqttMessage.setPayload(DataUtility.serialize(payload));
mqttMessage.setQos(0);
mqttMessage.setRetained(true);
mqttClient.publish(topicName, mqttMessage);
Spring - AMQP
https://www.baeldung.com/spring-amqp
rabbitTemplate.convertAndSend(topicName,
        DataUtility.serializeToJSON(observation));

@Bean
public CachingConnectionFactory connectionFactory() {
    CachingConnectionFactory ccf = new CachingConnectionFactory();
    ccf.setAddresses(serverName);
    return ccf;
}
Reactive Spring - Pulsar
Demo
REST + Spring Boot + Pulsar + Friends
APACHE PULSAR JAVA FUNCTION
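https://github.com/tspannhw/pulsar-airquality-function
import java.nio.charset.StandardCharsets;
import java.util.UUID;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.pulsar.client.impl.schema.JSONSchema;
import org.apache.pulsar.functions.api.Context;
import org.apache.pulsar.functions.api.Function;

public class AirQualityFunction implements Function<byte[], Void> {
    @Override
    public Void process(byte[] input, Context context) {
        if (input == null || context == null) {
            return null;
        }
        //context.getInputTopics().toString()
        if (context.getLogger() != null && context.getLogger().isDebugEnabled()) {
            context.getLogger().debug("LOG:" + new String(input, StandardCharsets.UTF_8));
        }
        try {
            // The slide does not show how `observation` is built; parsing the JSON
            // payload with Jackson is an assumption made to complete the example.
            Observation observation = new ObjectMapper().readValue(input, Observation.class);
            context.newOutputMessage("NewTopicName", JSONSchema.of(Observation.class))
                    .key(UUID.randomUUID().toString())
                    .property("Language", "java")
                    .value(observation)
                    .send();
        } catch (Exception e) {
            if (context.getLogger() != null) {
                context.getLogger().error("Failed to process message", e);
            }
        }
        return null;
    }
}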
APACHE KAFKA
Apache Kafka
• Highly reliable distributed messaging system
• Decouples applications and enables many-to-many patterns
• Publish-subscribe semantics
• Horizontal scalability
• Efficient implementation to operate at speed with big data volumes
• Organized by topic to support several use cases
[Diagram: point-to-point, request-response wiring between three source systems and three consumers (Fraud Detection, Security Systems, Real-Time Monitoring), compared with many-to-many publish-subscribe through Kafka]
STREAM TEAM
CSP Community Edition
• Kafka, KConnect, SMM, SR, Flink, and SSB in Docker
• Runs in Docker
• Try new features quickly
• Develop applications locally
● Docker compose file of CSP to run from the command line without any dependencies, including Flink, SQL Stream Builder, Kafka, Kafka Connect, Streams Messaging Manager and Schema Registry
  ○ $> docker compose up
● Licensed under the Cloudera Community License
● Unsupported
● Community Group Hub for CSP
● Find it on docs.cloudera.com under Applications
ENABLING ANALYTICS AND INSIGHTS ANYWHERE
Driving enterprise business value
[Diagram: streaming and enterprise data sources feed a real-time streaming engine and a centralized data platform, which drive analytics and insights]
Streaming data sources: clickstream, market data, machine logs, social
Enterprise data sources: CRM, customer history, research, compliance data, risk data, lending
Platform: real-time streaming engine (stream ingest); centralized data platform for storage & processing (ingest of data at rest); analytics & data warehouse; data science / machine learning (model building, model training, model scoring; deploy models)
Analytics & insights: BI solutions, SQL, predictive analytics, actions & alerts, real-time apps
EVENT-DRIVEN ORGANIZATION
Modernize your data and applications
CDF Event Streaming Platform: Integration - Processing - Management - Cloud
[Diagram labels: Stream ETL, Cloud Storage, Application, Data Lake, Data Stores, Make Payment, µServices, Streams, Edge - IoT, Dashboard]
DATAFLOW
APACHE NIFI
WHAT IS APACHE NIFI?
Apache NiFi is a scalable, real-time streaming data
platform that collects, curates, and analyzes data so
customers gain key insights for immediate
actionable intelligence.
CLOUDERA FLOW AND EDGE MANAGEMENT
Enable easy ingestion, routing, management and delivery of any data anywhere (edge, cloud, data center) to any downstream system with built-in end-to-end security and provenance
Advanced tooling to industrialize flow development (Flow Development Life Cycle)
• Over 300 Prebuilt Processors
• Easy to build your own
• Parse, Enrich & Apply Schema
• Filter, Split, Merge & Route
• Throttle & Backpressure
• Guaranteed Delivery
• Full data provenance from acquisition to delivery
• Diverse, Non-Traditional Sources
• Eco-system integration
ACQUIRE: FTP, SFTP, HL7, UDP, XML, HTTP, EMAIL, HTML, IMAGE, SYSLOG
PROCESS: HASH, MERGE, EXTRACT, DUPLICATE, SPLIT, ENCRYPT, TAIL, EVALUATE, EXECUTE, GEOENRICH, SCAN, REPLACE, TRANSLATE, CONVERT, ROUTE TEXT, ROUTE CONTENT, ROUTE CONTEXT, CONTROL RATE, DISTRIBUTE LOAD
DELIVER: FTP, SFTP, HL7, UDP, XML, HTTP, EMAIL, HTML, IMAGE, SYSLOG
APACHE NIFI
Enable easy ingestion, routing, management and delivery of any data anywhere (edge, cloud, data center) to any downstream system with built-in end-to-end security and provenance
Advanced tooling to industrialize flow development (Flow Development Life Cycle)
• Over 300 Prebuilt Processors
• Easy to build your own
• Parse, Enrich & Apply Schema
• Filter, Split, Merge & Route
• Throttle & Backpressure
• Guaranteed Delivery
• Full data provenance from acquisition to delivery
• Diverse, Non-Traditional Sources
• Eco-system integration
ACQUIRE: FTP, SFTP, HL7, UDP, XML, HTTP, EMAIL, HTML, IMAGE, SYSLOG
PROCESS: HASH, MERGE, EXTRACT, DUPLICATE, SPLIT, ROUTE TEXT, ROUTE CONTENT, ROUTE CONTEXT, CONTROL RATE, DISTRIBUTE LOAD, GEOENRICH, SCAN, REPLACE, TRANSLATE, CONVERT, ENCRYPT, TAIL, EVALUATE, EXECUTE
DELIVER: FTP, SFTP, HL7, UDP, XML, HTTP, EMAIL, HTML, IMAGE, SYSLOG
UNIVERSAL DATA DISTRIBUTION WITH CLOUDERA DATAFLOW (CDF)
Connect to Any Data Source Anywhere then Process and Deliver to Any Destination
[Diagram: Cloudera DataFlow: Universal Data Distribution Service]
Ingest: active (connectors; connect & pull) or passive (gateway endpoint; send), for data born in the cloud and data born outside the cloud
Process: route, filter, enrich, transform
Distribute: connectors deliver to any destination
JAVA DEV FOR NIFI
Custom Processors in Java
https://www.datainmotion.dev/2019/03/getting-started-with-custom-processor.html
https://www.datainmotion.dev/2019/03/posting-images-to-imgur-via-apache-nifi.html
https://github.com/tspannhw/nifi-tensorflow-processor
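public class TensorFlowProcessor extends AbstractProcessor {
    // MODEL_DIR_NAME is a String constant defined elsewhere in the class (not shown on the slide).
    public static final PropertyDescriptor MODEL_DIR = new PropertyDescriptor.Builder()
            .name(MODEL_DIR_NAME)
            .description("Model Directory").required(true).expressionLanguageSupported(true)
            .addValidator(StandardValidators.NON_EMPTY_VALIDATOR).build();

    @Override
    public void onTrigger(final ProcessContext context, final ProcessSession session)
            throws ProcessException {
        // Use the incoming FlowFile, or create one if the processor runs without input.
        FlowFile flowFile = session.get();
        if (flowFile == null) {
            flowFile = session.create();
        }
        try {
            flowFile.getAttributes();
            // The slide truncates the method here; see the repository above for the full body.
        } catch (final Exception e) {
            // Assumed placeholder error handling, added only to close the block.
            throw new ProcessException(e);
        }
    }
}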
APACHE SPARK
APACHE SPARK
Data Engineering at Scale
• Multi-language support with Java, Scala, Python and R
• Batch and Microbatch
• Strong SQL support
• Machine Learning
• Jupyter and Apache Zeppelin notebook support
APACHE ICEBERG
A Flexible, Performant & Scalable Table Format
• Donated by Netflix to the Apache Software Foundation in 2018
• Flexibility
– Hidden partitioning
– Full schema evolution
• Data Warehouse Operations
– Atomic, Consistent, Isolated, Durable (ACID) transactions
– Time travel and rollback
• Supports best-in-class SQL performance
– High performance at petabyte scale
APACHE FLINK
Flink SQL
https://www.datainmotion.dev/2021/04/cloudera-sql-stream-builder-ssb-updated.html
● Streaming Analytics
● Continuous SQL
● Continuous ETL
● Complex Event Processing
● Standard SQL Powered by Apache Calcite
Flink SQL
-- specify Kafka partition key on output
SELECT foo AS _eventKey FROM sensors

-- use event time timestamp from kafka
-- exactly once compatible
SELECT eventTimestamp FROM sensors

-- nested structures access
SELECT foo.`bar` FROM table; -- must quote nested column

-- timestamps
SELECT * FROM payments
WHERE eventTimestamp > CURRENT_TIMESTAMP - INTERVAL '10' SECOND;

-- unnest
SELECT b.*, u.*
FROM bgp_avro b,
UNNEST(b.path) AS u(pathitem)

-- aggregations and windows
SELECT card,
  MAX(amount) AS theamount,
  TUMBLE_END(eventTimestamp, INTERVAL '5' MINUTE) AS ts
FROM payments
WHERE lat IS NOT NULL
  AND lon IS NOT NULL
GROUP BY card,
  TUMBLE(eventTimestamp, INTERVAL '5' MINUTE)
HAVING COUNT(*) > 4 -- >4==fraud

-- try to do this ksql!
SELECT us_west.user_score + ap_south.user_score
FROM kafka_in_zone_us_west us_west
FULL OUTER JOIN kafka_in_zone_ap_south ap_south
ON us_west.user_id = ap_south.user_id;
Key Takeaway: Rich SQL grammar with advanced time and aggregation tools
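For Java developers, it can help to see what the "aggregations and windows" query computes. The sketch below reproduces the idea in plain Java: payments are grouped by card and by 5-minute tumbling window, and only groups with more than 4 payments are kept, together with their maximum amount. The class TumbleSketch, the Payment type and the sample data are hypothetical; the lat/lon filter from the query is left out, and event times are assumed to be non-negative epoch milliseconds.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java sketch of a 5-minute tumbling-window aggregation over payments. */
final class TumbleSketch {
    private TumbleSketch() {}

    static final long WINDOW_MS = 5 * 60 * 1000L;

    /** One payment event: card, amount, event time in epoch milliseconds. */
    static final class Payment {
        final String card;
        final double amount;
        final long eventTimeMs;

        Payment(String card, double amount, long eventTimeMs) {
            this.card = card;
            this.amount = amount;
            this.eventTimeMs = eventTimeMs;
        }
    }

    /** Exclusive end of the tumbling window that contains the given event time. */
    static long tumbleEnd(long eventTimeMs) {
        return (eventTimeMs / WINDOW_MS + 1) * WINDOW_MS;
    }

    /** Maps "card@windowEnd" to the max amount, for groups with more than 4 payments. */
    static Map<String, Double> suspicious(List<Payment> payments) {
        Map<String, Integer> counts = new TreeMap<>();
        Map<String, Double> maxes = new TreeMap<>();
        for (Payment p : payments) {
            String group = p.card + "@" + tumbleEnd(p.eventTimeMs);
            counts.merge(group, 1, Integer::sum);
            maxes.merge(group, p.amount, Double::max);
        }
        Map<String, Double> result = new TreeMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > 4) { // same threshold as HAVING COUNT(*) > 4
                result.put(e.getKey(), maxes.get(e.getKey()));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Payment> payments = new ArrayList<>();
        for (int i = 0; i < 5; i++) {
            payments.add(new Payment("A", 10.0 * (i + 1), i * 1000L)); // 5 payments, one window
        }
        payments.add(new Payment("B", 99.0, 0L));
        payments.add(new Payment("B", 1.0, WINDOW_MS)); // falls into the next window
        System.out.println(suspicious(payments));
    }
}
```

Unlike this batch-style loop, the streaming query emits a result each time a window closes, so the application never has to hold the full history in memory.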
SQL STREAM BUILDER (SSB)
SQL Stream Builder allows developers, analysts, and data scientists to write streaming applications with industry-standard SQL.
No Java or Scala code development required.
Simplifies access to data in Kafka & Flink. Connectors to batch data in HDFS, Kudu, Hive, S3, JDBC, CDC and more.
Enrich streaming data with batch data in a single tool.
Democratize access to real-time data with just SQL
DEMO
RESOURCES AND WRAP-UP
Spring Things
● https://spring.io/guides/gs/spring-boot/
● https://spring.io/projects/spring-amqp/
● https://spring.io/projects/spring-kafka/
● https://github.com/spring-projects/spring-integration-kafka
● https://github.com/spring-projects/spring-integration
● https://github.com/spring-projects/spring-data-relational
● https://github.com/spring-projects/spring-kafka
● https://github.com/spring-projects/spring-amqp
Streaming Resources
• https://dzone.com/articles/real-time-stream-processing-with-hazelcast-and-streamnative
• https://flipstackweekly.com/
• https://www.datainmotion.dev/
• https://www.flankstack.dev/
• https://github.com/tspannhw
• https://medium.com/@tspann
• https://medium.com/@tspann/predictions-for-streaming-in-2023-ad4d7395d714
• https://www.apachecon.com/acna2022/slides/04_Spann_Tim_Citizen_Streaming_Engineer.pdf
Later Today
https://devnexus.com/presentations/introducing-spring-for-apache-pulsar
Introducing Spring for Apache Pulsar
Time to Learn Apache NiFi
https://www.catscloudsanddata.com/
https://attend.cloudera.com/nificommitters0503
Upcoming Events
April 26 May 10
Resources
THANK YOU

More Related Content

PDF
Let's keep it simple and streaming.pdf
PDF
Let's keep it simple and streaming
PDF
Living the Stream Dream with Pulsar and Spring Boot
PDF
Living the Stream Dream with Pulsar and Spring Boot
PDF
PhillyJug Getting Started With Real-time Cloud Native Streaming With Java
PDF
Princeton Dec 2022 Meetup_ StreamNative and Cloudera Streaming
PDF
Timothy Spann: Apache Pulsar for ML
PDF
[March sn meetup] apache pulsar + apache nifi for cloud data lake
Let's keep it simple and streaming.pdf
Let's keep it simple and streaming
Living the Stream Dream with Pulsar and Spring Boot
Living the Stream Dream with Pulsar and Spring Boot
PhillyJug Getting Started With Real-time Cloud Native Streaming With Java
Princeton Dec 2022 Meetup_ StreamNative and Cloudera Streaming
Timothy Spann: Apache Pulsar for ML
[March sn meetup] apache pulsar + apache nifi for cloud data lake

Similar to DevNexus: Apache Pulsar Development 101 with Java (20)

PDF
bigdata 2022_ FLiP Into Pulsar Apps
PDF
Deep Dive into Building Streaming Applications with Apache Pulsar
PDF
The Dream Stream Team for Pulsar and Spring
PDF
Why Spring Belongs In Your Data Stream (From Edge to Multi-Cloud)
PDF
Python Web Conference 2022 - Apache Pulsar Development 101 with Python (FLiP-Py)
PDF
OSS EU: Deep Dive into Building Streaming Applications with Apache Pulsar
PDF
Python web conference 2022 apache pulsar development 101 with python (f li-...
PDF
Machine Intelligence Guild_ Build ML Enhanced Event Streaming Applications wi...
PDF
JConf.dev 2022 - Apache Pulsar Development 101 with Java
PDF
Princeton Dec 2022 Meetup_ NiFi + Flink + Pulsar
PDF
Devfest uk & ireland using apache nifi with apache pulsar for fast data on-r...
PDF
Apache Pulsar Development 101 with Python
PDF
(Current22) Let's Monitor The Conditions at the Conference
PDF
Let’s Monitor Conditions at the Conference With Timothy Spann & David Kjerrum...
PDF
[AerospikeRoadshow] Apache Pulsar Unifies Streaming and Messaging for Real-Ti...
PDF
Apache Pulsar in Action MEAP V04 David Kjerrumgaard
PDF
Preview of Apache Pulsar 2.5.0
PDF
Apache Pulsar in Action MEAP V04 David Kjerrumgaard
PDF
Big mountain data and dev conference apache pulsar with mqtt for edge compu...
PDF
ApacheCon2022_Deep Dive into Building Streaming Applications with Apache Pulsar
bigdata 2022_ FLiP Into Pulsar Apps
Deep Dive into Building Streaming Applications with Apache Pulsar
The Dream Stream Team for Pulsar and Spring
Why Spring Belongs In Your Data Stream (From Edge to Multi-Cloud)
Python Web Conference 2022 - Apache Pulsar Development 101 with Python (FLiP-Py)
OSS EU: Deep Dive into Building Streaming Applications with Apache Pulsar
Python web conference 2022 apache pulsar development 101 with python (f li-...
Machine Intelligence Guild_ Build ML Enhanced Event Streaming Applications wi...
JConf.dev 2022 - Apache Pulsar Development 101 with Java
Princeton Dec 2022 Meetup_ NiFi + Flink + Pulsar
Devfest uk & ireland using apache nifi with apache pulsar for fast data on-r...
Apache Pulsar Development 101 with Python
(Current22) Let's Monitor The Conditions at the Conference
Let’s Monitor Conditions at the Conference With Timothy Spann & David Kjerrum...
[AerospikeRoadshow] Apache Pulsar Unifies Streaming and Messaging for Real-Ti...
Apache Pulsar in Action MEAP V04 David Kjerrumgaard
Preview of Apache Pulsar 2.5.0
Apache Pulsar in Action MEAP V04 David Kjerrumgaard
Big mountain data and dev conference apache pulsar with mqtt for edge compu...
ApacheCon2022_Deep Dive into Building Streaming Applications with Apache Pulsar
Ad

More from Timothy Spann (20)

PDF
14May2025_TSPANN_FromAirQualityUnstructuredData.pdf
PDF
Streaming AI Pipelines with Apache NiFi and Snowflake NYC 2025
PDF
2025-03-03-Philly-AAAI-GoodData-Build Secure RAG Apps With Open LLM
PDF
Conf42_IoT_Dec2024_Building IoT Applications With Open Source
PDF
2024 Dec 05 - PyData Global - Tutorial Its In The Air Tonight
PDF
2024Nov20-BigDataEU-RealTimeAIWithOpenSource
PDF
TSPANN-2024-Nov-CloudX-Adding Generative AI to Real-Time Streaming Pipelines
PDF
2024-Nov-BuildStuff-Adding Generative AI to Real-Time Streaming Pipelines
PDF
14 November 2024 - Conf 42 - Prompt Engineering - Codeless Generative AI Pipe...
PDF
2024 Nov 05 - Linux Foundation TAC TALK With Milvus
PPTX
tspann06-NOV-2024_AI-Alliance_NYC_ intro to Data Prep Kit and Open Source RAG
PDF
tspann08-Nov-2024_PyDataNYC_Unstructured Data Processing with a Raspberry Pi ...
PDF
2024-10-28 All Things Open - Advanced Retrieval Augmented Generation (RAG) Te...
PDF
10-25-2024_BITS_NYC_Unstructured Data and LLM_ What, Why and How
PDF
2024-OCT-23 NYC Meetup - Unstructured Data Meetup - Unstructured Halloween
PDF
DBTA Round Table with Zilliz and Airbyte - Unstructured Data Engineering
PDF
17-October-2024 NYC AI Camp - Step-by-Step RAG 101
PDF
11-OCT-2024_AI_101_CryptoOracle_UnstructuredData
PDF
2024-10-04 - Grace Hopper Celebration Open Source Day - Stefan
PDF
01-Oct-2024_PES-VectorDatabasesAndAI.pdf
14May2025_TSPANN_FromAirQualityUnstructuredData.pdf
Streaming AI Pipelines with Apache NiFi and Snowflake NYC 2025
2025-03-03-Philly-AAAI-GoodData-Build Secure RAG Apps With Open LLM
Conf42_IoT_Dec2024_Building IoT Applications With Open Source
2024 Dec 05 - PyData Global - Tutorial Its In The Air Tonight
2024Nov20-BigDataEU-RealTimeAIWithOpenSource
TSPANN-2024-Nov-CloudX-Adding Generative AI to Real-Time Streaming Pipelines
2024-Nov-BuildStuff-Adding Generative AI to Real-Time Streaming Pipelines
14 November 2024 - Conf 42 - Prompt Engineering - Codeless Generative AI Pipe...
2024 Nov 05 - Linux Foundation TAC TALK With Milvus
tspann06-NOV-2024_AI-Alliance_NYC_ intro to Data Prep Kit and Open Source RAG
tspann08-Nov-2024_PyDataNYC_Unstructured Data Processing with a Raspberry Pi ...
2024-10-28 All Things Open - Advanced Retrieval Augmented Generation (RAG) Te...
10-25-2024_BITS_NYC_Unstructured Data and LLM_ What, Why and How
2024-OCT-23 NYC Meetup - Unstructured Data Meetup - Unstructured Halloween
DBTA Round Table with Zilliz and Airbyte - Unstructured Data Engineering
17-October-2024 NYC AI Camp - Step-by-Step RAG 101
11-OCT-2024_AI_101_CryptoOracle_UnstructuredData
2024-10-04 - Grace Hopper Celebration Open Source Day - Stefan
01-Oct-2024_PES-VectorDatabasesAndAI.pdf
Ad

Recently uploaded (20)

PDF
Which alternative to Crystal Reports is best for small or large businesses.pdf
PPTX
L1 - Introduction to python Backend.pptx
PDF
medical staffing services at VALiNTRY
PDF
Internet Downloader Manager (IDM) Crack 6.42 Build 42 Updates Latest 2025
PPTX
VVF-Customer-Presentation2025-Ver1.9.pptx
PDF
Design an Analysis of Algorithms I-SECS-1021-03
PDF
2025 Textile ERP Trends: SAP, Odoo & Oracle
PDF
EN-Survey-Report-SAP-LeanIX-EA-Insights-2025.pdf
PDF
How to Migrate SBCGlobal Email to Yahoo Easily
PPTX
CHAPTER 2 - PM Management and IT Context
PDF
top salesforce developer skills in 2025.pdf
PPTX
ai tools demonstartion for schools and inter college
PDF
AI in Product Development-omnex systems
PPTX
Introduction to Artificial Intelligence
PPTX
Transform Your Business with a Software ERP System
PPTX
Essential Infomation Tech presentation.pptx
PDF
System and Network Administraation Chapter 3
PPTX
Agentic AI Use Case- Contract Lifecycle Management (CLM).pptx
PDF
Internet Downloader Manager (IDM) Crack 6.42 Build 41
PDF
Upgrade and Innovation Strategies for SAP ERP Customers
Which alternative to Crystal Reports is best for small or large businesses.pdf
L1 - Introduction to python Backend.pptx
medical staffing services at VALiNTRY
Internet Downloader Manager (IDM) Crack 6.42 Build 42 Updates Latest 2025
VVF-Customer-Presentation2025-Ver1.9.pptx
Design an Analysis of Algorithms I-SECS-1021-03
2025 Textile ERP Trends: SAP, Odoo & Oracle
EN-Survey-Report-SAP-LeanIX-EA-Insights-2025.pdf
How to Migrate SBCGlobal Email to Yahoo Easily
CHAPTER 2 - PM Management and IT Context
top salesforce developer skills in 2025.pdf
ai tools demonstartion for schools and inter college
AI in Product Development-omnex systems
Introduction to Artificial Intelligence
Transform Your Business with a Software ERP System
Essential Infomation Tech presentation.pptx
System and Network Administraation Chapter 3
Agentic AI Use Case- Contract Lifecycle Management (CLM).pptx
Internet Downloader Manager (IDM) Crack 6.42 Build 41
Upgrade and Innovation Strategies for SAP ERP Customers

DevNexus: Apache Pulsar Development 101 with Java

  • 1. © 2023 Cloudera, Inc. All rights reserved. Apache Pulsar Development 101 (Cloud Native Streaming With Java) Tim Spann Principal Developer Advocate 5-April-2023
  • 2. © 2023 Cloudera, Inc. All rights reserved. TOPICS
  • 3. © 2023 Cloudera, Inc. All rights reserved. 3 Topics ● Introduction to Streaming ● Introduction to Apache Pulsar ● Developing Pulsar Apps in Spring ● Introduction to Apache Kafka ● Developing Spring Apps Against Kafka ● FLiPN/FLaNK Stack ● Demos
  • 4. © 2023 Cloudera, Inc. All rights reserved.
  • 5. © 2023 Cloudera, Inc. All rights reserved. 5 FLiPN-FLaNK Stack Tim Spann @PaasDev // Blog: www.datainmotion.dev Principal Developer Advocate. Princeton Future of Data Meetup. ex-Pivotal, ex-Hortonworks, ex-StreamNative, ex-PwC https://github.com/tspannhw/EverythingApacheNiFi https://medium.com/@tspann Apache NiFi x Apache Kafka x Apache Flink x Java
  • 6. © 2023 Cloudera, Inc. All rights reserved. 6 FLiP Stack Weekly This week in Apache NiFi, Apache Kafka, Apache Flink, Apache Pulsar, Apache Iceberg, Python, Java and Open Source friends. https://bit.ly/32dAJft
  • 7. © 2023 Cloudera, Inc. All rights reserved. STREAMING
  • 8. © 2023 Cloudera, Inc. All rights reserved. 8 STREAMING FROM … TO .. WHILE .. Data distribution as a first class citizen IOT Devices LOG DATA SOURCES ON-PREM DATA SOURCES BIG DATA CLOUD SERVICES CLOUD BUSINESS PROCESS SERVICES * CLOUD DATA* ANALYTICS /SERVICE (Cloudera DW) App Logs Laptops /Servers Mobile Apps Security Agents CLOUD WAREHOUSE UNIVERSAL DATA DISTRIBUTION (Ingest, Transform, Deliver) Ingest Processors Ingest Gateway Router, Filter & Transform Processors Destination Processors
  • 9. © 2023 Cloudera, Inc. All rights reserved. 9
  • 10. Streaming for Java Developers Multiple users, frameworks, languages, devices, data sources & clusters • Expert in ETL (Eating, Ties and Laziness) • Deep SME in Buzzwords • No Coding Skills • R&D into Lasers CAT AI • Will Drive your Car? • Will Fix Your Code? • Will Beat You At Q-Bert • Will Write my Next Talk STREAMING ENGINEER • Coding skills in Python, Java • Experience with Apache Kafka or Pulsar • Knowledge of database query languages such as SQL • Knowledge of tools such as Apache Flink, Apache Spark and Apache NiFi JAVA DEVELOPER • Frameworks like Spring, Quarkus and micronaut • Relational Databases, SQL • Cloud • Dev and Build Tools
  • 11. © 2023 Cloudera, Inc. All rights reserved. APACHE PULSAR
  • 14. Streaming Consumer Consumer Consumer Subscription Shared Failover Consumer Consumer Subscription In case of failure in Consumer B-0 Consumer Consumer Subscription Exclusive X Consumer Consumer Key-Shared Subscription Pulsar Topic/Partition Messaging
  • 15. Tenants / Namespaces / Topics Tenants (Compliance) Tenants (Data Services) Namespace (Microservices) Topic-1 (Cust Auth) Topic-1 (Location Resolution) Topic-2 (Demographics) Topic-1 (Budgeted Spend) Topic-1 (Acct History) Topic-1 (Risk Detection) Namespace (ETL) Namespace (Campaigns) Namespace (ETL) Tenants (Marketing) Namespace (Risk Assessment) Pulsar Cluster
  • 16. Messages - The Basic Unit of Pulsar Component Description Value / data payload The data carried by the message. All Pulsar messages contain raw bytes, although message data can also conform to data schemas. Key Messages are optionally tagged with keys, used in partitioning and also is useful for things like topic compaction. Properties An optional key/value map of user-defined properties. Producer name The name of the producer who produces the message. If you do not specify a producer name, the default name is used. Sequence ID Each Pulsar message belongs to an ordered sequence on its topic. The sequence ID of the message is its order in that sequence.
  • 17. Pulsar Subscription Modes Different subscription modes have different semantics: Exclusive/Failover - guaranteed order, single active consumer Shared - multiple active consumers, no order Key_Shared - multiple active consumers, order for given key Producer 1 Producer 2 Pulsar Topic Subscription D Consumer D-1 Consumer D-2 Key-Shared < K 1, V 10 > < K 1, V 11 > < K 1, V 12 > < K 2 ,V 2 0 > < K 2 ,V 2 1> < K 2 ,V 2 2 > Subscription C Consumer C-1 Consumer C-2 Shared < K 1, V 10 > < K 2, V 21 > < K 1, V 12 > < K 2 ,V 2 0 > < K 1, V 11 > < K 2 ,V 2 2 > Subscription A Consumer A Exclusive Subscription B Consumer B-1 Consumer B-2 In case of failure in Consumer B-1 Failover
  • 18. Flexible Pub/Sub API for Pulsar - Shared
      // Shared subscription: messages are spread across all consumers (work queue)
      Consumer<byte[]> consumer = client.newConsumer()
          .topic("my-topic")
          .subscriptionName("work-q-1")
          .subscriptionType(SubscriptionType.Shared)
          .subscribe();
  • 19. Flexible Pub/Sub API for Pulsar - Failover
      // Failover subscription: one active consumer, with ordered delivery (stream)
      Consumer<byte[]> consumer = client.newConsumer()
          .topic("my-topic")
          .subscriptionName("stream-1")
          .subscriptionType(SubscriptionType.Failover)
          .subscribe();
  • 20. Apache Pulsar Ecosystem (hub.streamnative.io): Connectors (Sources & Sinks), Protocol Handlers, Pulsar Functions (Lightweight Stream Processing), Processing Engines, Data Offloaders (Tiered Storage), Client Libraries, … and more!
  • 24. Schema Registry
      The registry stores schema-1, schema-2, and schema-3 (value=Avro/Protobuf/JSON). Producers and consumers each hold schema data, a schema ID, and a local cache for schemas.
      • Producers: send (register) the schema if it is not in the local cache, then send schema-1 (value=Avro/Protobuf/JSON) data serialized per schema ID.
      • Consumers: get the schema by ID if it is not in the local cache, then read schema-1 (value=Avro/Protobuf/JSON) data deserialized per schema ID.
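The "if not in local cache" steps on the schema registry slide can be sketched in plain Java. This is a toy model under stated assumptions, not the Pulsar client: schemas are plain strings, the registry is an in-memory map, and the calls counter exists only to show when a registry round trip happens. All class and method names here are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of schema caching; an illustration, not the Pulsar schema registry.
final class SchemaRegistrySketch {

    // Stand-in for the remote registry: assigns an ID to each distinct schema.
    static final class Registry {
        private final Map<String, Integer> idsBySchema = new HashMap<>();
        private final Map<Integer, String> schemasById = new HashMap<>();
        int calls = 0; // counts round trips, for demonstration only

        int register(String schema) {
            calls++;
            Integer existing = idsBySchema.get(schema);
            if (existing != null) {
                return existing;
            }
            int id = idsBySchema.size() + 1;
            idsBySchema.put(schema, id);
            schemasById.put(id, schema);
            return id;
        }

        String getById(int id) {
            calls++;
            return schemasById.get(id);
        }
    }

    // Producer side: register the schema only if it is not in the local cache.
    static final class ProducerSide {
        private final Registry registry;
        private final Map<String, Integer> cache = new HashMap<>();

        ProducerSide(Registry registry) { this.registry = registry; }

        int schemaIdFor(String schema) {
            Integer id = cache.get(schema);
            if (id == null) {
                id = registry.register(schema);
                cache.put(schema, id);
            }
            return id;
        }
    }

    // Consumer side: fetch the schema by ID only if it is not in the local cache.
    static final class ConsumerSide {
        private final Registry registry;
        private final Map<Integer, String> cache = new HashMap<>();

        ConsumerSide(Registry registry) { this.registry = registry; }

        String schemaFor(int id) {
            String schema = cache.get(id);
            if (schema == null) {
                schema = registry.getById(id);
                cache.put(id, schema);
            }
            return schema;
        }
    }

    public static void main(String[] args) {
        Registry registry = new Registry();
        ProducerSide producer = new ProducerSide(registry);
        ConsumerSide consumer = new ConsumerSide(registry);
        int id = producer.schemaIdFor("schema-1");
        producer.schemaIdFor("schema-1"); // served from the producer's cache
        consumer.schemaFor(id);
        consumer.schemaFor(id);           // served from the consumer's cache
        System.out.println("registry round trips: " + registry.calls);
    }
}
```

Each side pays for one registry round trip per schema and serves every later message from its own cache.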
  • 27. Pulsar - Spring - Code
      @Autowired
      private PulsarTemplate<Observation> pulsarTemplate;
      this.pulsarTemplate.setSchema(Schema.JSON(Observation.class));
      MessageId msgid = pulsarTemplate.newMessage(observation)
          .withMessageCustomizer((mb) -> mb.key(uuidKey.toString()))
          .send();
      @PulsarListener(subscriptionName = "aq-spring-reader", subscriptionType = Shared,
          schemaType = SchemaType.JSON, topics = "persistent://public/default/aq-pm25")
      void echoObservation(Observation message) {
          this.log.info("PM2.5 Message received: {}", message);
      }
  • 28. Pulsar - Spring - Configuration
      spring:
        pulsar:
          client:
            service-url: pulsar+ssl://sn-academy.sndevadvocate.snio.cloud:6651
            auth-plugin-class-name: org.apache.pulsar.client.impl.auth.oauth2.AuthenticationOAuth2
            authentication:
              issuer-url: https://auth.streamnative.cloud/
              private-key: file:///scr/sndevadvocate-tspann.json
              audience: urn:sn:pulsar:sndevadvocate:my-instance
          producer:
            batching-enabled: false
            send-timeout-ms: 90000
            producer-name: airqualityjava
            topic-name: persistent://public/default/airquality
  • 30. Spring - Kafka
      https://www.baeldung.com/spring-kafka
      @Bean
      public KafkaTemplate<String, Observation> kafkaTemplate() {
          KafkaTemplate<String, Observation> kafkaTemplate =
              new KafkaTemplate<String, Observation>(producerFactory());
          return kafkaTemplate;
      }
      ProducerRecord<String, Observation> producerRecord =
          new ProducerRecord<>(topicName, uuidKey.toString(), message);
      kafkaTemplate.send(producerRecord);
  • 31. Spring - MQTT
      https://roytuts.com/publish-subscribe-message-onto-mqtt-using-spring/
      @Bean
      public IMqttClient mqttClient(
              @Value("${mqtt.clientId}") String clientId,
              @Value("${mqtt.hostname}") String hostname,
              @Value("${mqtt.port}") int port) throws MqttException {
          IMqttClient mqttClient = new MqttClient("tcp://" + hostname + ":" + port, clientId);
          mqttClient.connect(mqttConnectOptions());
          return mqttClient;
      }
      MqttMessage mqttMessage = new MqttMessage();
      mqttMessage.setPayload(DataUtility.serialize(payload));
      mqttMessage.setQos(0);
      mqttMessage.setRetained(true);
      mqttClient.publish(topicName, mqttMessage);
  • 32. Spring - AMQP
      https://www.baeldung.com/spring-amqp
      rabbitTemplate.convertAndSend(topicName, DataUtility.serializeToJSON(observation));
      @Bean
      public CachingConnectionFactory connectionFactory() {
          CachingConnectionFactory ccf = new CachingConnectionFactory();
          ccf.setAddresses(serverName);
          return ccf;
      }
  • 35. Demo
  • 36. REST + Spring Boot + Pulsar + Friends
  • 37. APACHE PULSAR JAVA FUNCTION
  • 38. https://github.com/tspannhw/pulsar-airquality-function
      public class AirQualityFunction implements Function<byte[], Void> {
          @Override
          public Void process(byte[] input, Context context) throws Exception {
              if (input == null || context == null) {
                  return null;
              }
              // context.getInputTopics().toString()
              if (context.getLogger() != null && context.getLogger().isDebugEnabled()) {
                  // Decode the payload; byte[].toString() would only print the array identity
                  context.getLogger().debug("LOG:" + new String(input, StandardCharsets.UTF_8));
              }
              // observation: an Observation built from input (construction not shown on the slide)
              context.newOutputMessage("NewTopicName", JSONSchema.of(Observation.class))
                  .key(UUID.randomUUID().toString())
                  .property("Language", "java")
                  .value(observation)
                  .send();
              return null;
          }
      }
  • 39. APACHE KAFKA
  • 40. Apache Kafka
      • Highly reliable distributed messaging system
      • Decouples applications and enables many-to-many patterns
      • Publish-subscribe semantics
      • Horizontal scalability
      • Efficient implementation to operate at speed with big data volumes
      • Organized by topic to support several use cases
      Diagram: with Kafka between them, source systems and consumers (Fraud Detection, Security Systems, Real-Time Monitoring) connect many-to-many through publish-subscribe; without it, the same systems are wired point-to-point through request-response.
  • 41. STREAM TEAM
  • 42. CSP Community Edition
      • Kafka, KConnect, SMM, SR, Flink, and SSB in Docker
      • Runs in Docker
      • Try new features quickly
      • Develop applications locally
      ● Docker Compose file of CSP to run from the command line without any dependencies, including Flink, SQL Stream Builder, Kafka, Kafka Connect, Streams Messaging Manager and Schema Registry
          ○ $> docker compose up
      ● Licensed under the Cloudera Community License
      ● Unsupported
      ● Community Group Hub for CSP
      ● Find it on docs.cloudera.com under Applications
  • 43. ENABLING ANALYTICS AND INSIGHTS ANYWHERE
      Driving enterprise business value
      Diagram: streaming data sources (clickstream, market data, machine logs, social) feed a REAL-TIME STREAMING ENGINE through stream ingest; enterprise data sources (CRM, customer history, research, compliance data, risk data, lending) feed a CENTRALIZED DATA PLATFORM (storage & processing) through ingest of data at rest. ANALYTICS & DATA WAREHOUSE and DATA SCIENCE / MACHINE LEARNING (model building, model training, model scoring, deploy models) produce ANALYTICS & INSIGHTS: BI solutions, SQL, predictive analytics, actions & alerts, [SQL] real-time apps.
  • 44. EVENT-DRIVEN ORGANIZATION
      Modernize your data and applications
      Diagram: the CDF Event Streaming Platform (Integration - Processing - Management - Cloud) connects Edge - IoT, µServices, Streams, Stream ETL, Applications, Data Stores, Data Lake, Cloud Storage, Dashboard, and Make Payment.
  • 45. DATAFLOW APACHE NIFI
  • 46. WHAT IS APACHE NIFI? Apache NiFi is a scalable, real-time streaming data platform that collects, curates, and analyzes data so customers gain key insights for immediate actionable intelligence.
  • 47. CLOUDERA FLOW AND EDGE MANAGEMENT
      Enable easy ingestion, routing, management and delivery of any data anywhere (edge, cloud, data center) to any downstream system with built-in end-to-end security and provenance.
      Advanced tooling to industrialize flow development (Flow Development Life Cycle)
      ACQUIRE
      • Over 300 Prebuilt Processors
      • Easy to build your own
      • Parse, Enrich & Apply Schema
      • Filter, Split, Merge & Route
      • Throttle & Backpressure
      FTP, SFTP, HL7, UDP, XML, HTTP, EMAIL, HTML, IMAGE, SYSLOG
      PROCESS
      HASH, MERGE, EXTRACT, DUPLICATE, SPLIT, ENCRYPT, TALL, EVALUATE, EXECUTE, GEOENRICH, SCAN, REPLACE, TRANSLATE, CONVERT, ROUTE TEXT, ROUTE CONTENT, ROUTE CONTEXT, ROUTE RATE, DISTRIBUTE LOAD
      DELIVER
      • Guaranteed Delivery
      • Full data provenance from acquisition to delivery
      • Diverse, Non-Traditional Sources
      • Eco-system integration
      FTP, SFTP, HL7, UDP, XML, HTTP, EMAIL, HTML, IMAGE, SYSLOG
  • 48. APACHE NIFI
      Enable easy ingestion, routing, management and delivery of any data anywhere (edge, cloud, data center) to any downstream system with built-in end-to-end security and provenance.
      ACQUIRE / PROCESS / DELIVER
      • Over 300 Prebuilt Processors
      • Easy to build your own
      • Parse, Enrich & Apply Schema
      • Filter, Split, Merge & Route
      • Throttle & Backpressure
      • Guaranteed Delivery
      • Full data provenance from acquisition to delivery
      • Diverse, Non-Traditional Sources
      • Eco-system integration
      Advanced tooling to industrialize flow development (Flow Development Life Cycle)
      Acquire and deliver: FTP, SFTP, HL7, UDP, XML, HTTP, EMAIL, HTML, IMAGE, SYSLOG
      Process: HASH, MERGE, EXTRACT, DUPLICATE, SPLIT, ROUTE TEXT, ROUTE CONTENT, ROUTE CONTEXT, CONTROL RATE, DISTRIBUTE LOAD, GEOENRICH, SCAN, REPLACE, TRANSLATE, CONVERT, ENCRYPT, TALL, EVALUATE, EXECUTE
  • 49. Cloudera DataFlow: Universal Data Distribution Service
      UNIVERSAL DATA DISTRIBUTION WITH CLOUDERA DATAFLOW (CDF)
      Connect to Any Data Source Anywhere then Process and Deliver to Any Destination
      Diagram: data born in the cloud and data born outside the cloud flow through Ingest (Active and Passive; Connectors, Gateway Endpoint; Connect & Pull, Send), then Process (Route, Filter, Enrich, Transform), then Distribute (Connectors, Deliver) to any destination.
  • 50. JAVA DEV FOR NIFI
  • 51. Custom Processors in Java https://www.datainmotion.dev/2019/03/getting-started-with-custom-processor.html https://www.datainmotion.dev/2019/03/posting-images-to-imgur-via-apache-nifi.html
  • 52. https://github.com/tspannhw/nifi-tensorflow-processor
      public class TensorFlowProcessor extends AbstractProcessor {
          public static final PropertyDescriptor MODEL_DIR = new PropertyDescriptor.Builder()
              .name(MODEL_DIR_NAME)
              .description("Model Directory")
              .required(true)
              .expressionLanguageSupported(true)
              .addValidator(StandardValidators.NON_EMPTY_VALIDATOR)
              .build();
          @Override
          public void onTrigger(final ProcessContext context, final ProcessSession session)
                  throws ProcessException {
              FlowFile flowFile = session.get();
              if (flowFile == null) {
                  flowFile = session.create();
              }
              try {
                  flowFile.getAttributes();
              }
              // ... (the slide cuts the listing off here)
  • 53. APACHE SPARK
  • 54. APACHE SPARK
      Data Engineering at Scale
      • Multi-language support with Java, Scala, Python and R
      • Batch and Microbatch
      • Strong SQL support
      • Machine Learning
      • Jupyter and Apache Zeppelin notebook support
  • 55. APACHE ICEBERG
      A Flexible, Performant & Scalable Table Format
      • Donated by Netflix to the Apache Software Foundation in 2018
      • Flexibility
          - Hidden partitioning
          - Full schema evolution
      • Data Warehouse Operations
          - Atomic, Consistent, Isolated, Durable (ACID) transactions
          - Time travel and rollback
      • Supports best-in-class SQL performance
          - High performance at petabyte scale
  • 56. APACHE FLINK
  • 57. Flink SQL
      https://www.datainmotion.dev/2021/04/cloudera-sql-stream-builder-ssb-updated.html
      ● Streaming Analytics
      ● Continuous SQL
      ● Continuous ETL
      ● Complex Event Processing
      ● Standard SQL Powered by Apache Calcite
  • 58. Flink SQL
      -- specify Kafka partition key on output
      SELECT foo AS _eventKey FROM sensors
      -- use event time timestamp from kafka
      -- exactly once compatible
      SELECT eventTimestamp FROM sensors
      -- nested structures access
      SELECT foo.`bar` FROM table; -- must quote nested column
      -- timestamps
      SELECT * FROM payments
      WHERE eventTimestamp > CURRENT_TIMESTAMP - interval '10' second;
      -- unnest
      SELECT b.*, u.* FROM bgp_avro b, UNNEST(b.path) AS u(pathitem)
      -- aggregations and windows
      SELECT card, MAX(amount) as theamount,
             TUMBLE_END(eventTimestamp, interval '5' minute) as ts
      FROM payments
      WHERE lat IS NOT NULL AND lon IS NOT NULL
      GROUP BY card, TUMBLE(eventTimestamp, interval '5' minute)
      HAVING COUNT(*) > 4 -- >4==fraud
      -- try to do this ksql!
      SELECT us_west.user_score + ap_south.user_score
      FROM kafka_in_zone_us_west us_west
      FULL OUTER JOIN kafka_in_zone_ap_south ap_south
      ON us_west.user_id = ap_south.user_id;
      Key Takeaway: Rich SQL grammar with advanced time and aggregation tools
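To make the "aggregations and windows" query concrete, here is a minimal sketch in plain Java of what a 5-minute tumbling window with MAX(amount) and HAVING COUNT(*) > 4 computes. It assumes a finite, in-memory list of payments, so it ignores watermarks, late events, and the lat/lon filter; it is not how Flink executes the query. TumbleSketch, Payment, and suspicious are hypothetical names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Batch illustration of a tumbling-window aggregation; not Flink code.
final class TumbleSketch {
    static final long WINDOW_MS = 5 * 60 * 1000L; // interval '5' minute

    static final class Payment {
        final String card;
        final double amount;
        final long eventTimeMs;

        Payment(String card, double amount, long eventTimeMs) {
            this.card = card;
            this.amount = amount;
            this.eventTimeMs = eventTimeMs;
        }
    }

    // Returns "card,windowEndMs,maxAmount" for each (card, window) group with more than 4 payments.
    static List<String> suspicious(List<Payment> payments) {
        Map<String, double[]> groups = new LinkedHashMap<>(); // key -> {count, max}
        for (Payment p : payments) {
            // Each event falls into exactly one fixed, non-overlapping window.
            long windowStart = p.eventTimeMs - Math.floorMod(p.eventTimeMs, WINDOW_MS);
            long windowEnd = windowStart + WINDOW_MS; // exclusive end, like TUMBLE_END
            String key = p.card + "," + windowEnd;
            double[] agg = groups.get(key);
            if (agg == null) {
                agg = new double[] {0, Double.NEGATIVE_INFINITY};
                groups.put(key, agg);
            }
            agg[0]++;
            agg[1] = Math.max(agg[1], p.amount);
        }
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, double[]> e : groups.entrySet()) {
            if (e.getValue()[0] > 4) { // HAVING COUNT(*) > 4
                out.add(e.getKey() + "," + e.getValue()[1]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Payment> payments = new ArrayList<>();
        for (int i = 0; i < 5; i++) {
            payments.add(new Payment("A", 10 * (i + 1), i * 1000L));
        }
        payments.add(new Payment("B", 500, 0L));
        System.out.println(suspicious(payments));
    }
}
```

Grouping by card and window end mirrors the GROUP BY card, TUMBLE(...) clause: a card is reported only when more than four of its payments land in the same 5-minute window.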
  • 59. SQL STREAM BUILDER (SSB)
      SQL STREAM BUILDER allows developers, analysts, and data scientists to write streaming applications with industry-standard SQL. No Java or Scala code development required.
      • Simplifies access to data in Kafka & Flink
      • Connectors to batch data in HDFS, Kudu, Hive, S3, JDBC, CDC and more
      • Enrich streaming data with batch data in a single tool
      • Democratize access to real-time data with just SQL
  • 60. DEMO
  • 62. RESOURCES AND WRAP-UP
  • 63. Spring Things
      ● https://spring.io/guides/gs/spring-boot/
      ● https://spring.io/projects/spring-amqp/
      ● https://spring.io/projects/spring-kafka/
      ● https://github.com/spring-projects/spring-integration-kafka
      ● https://github.com/spring-projects/spring-integration
      ● https://github.com/spring-projects/spring-data-relational
      ● https://github.com/spring-projects/spring-kafka
      ● https://github.com/spring-projects/spring-amqp
  • 64. Streaming Resources
      • https://dzone.com/articles/real-time-stream-processing-with-hazelcast-and-streamnative
      • https://flipstackweekly.com/
      • https://www.datainmotion.dev/
      • https://www.flankstack.dev/
      • https://github.com/tspannhw
      • https://medium.com/@tspann
      • https://medium.com/@tspann/predictions-for-streaming-in-2023-ad4d7395d714
      • https://www.apachecon.com/acna2022/slides/04_Spann_Tim_Citizen_Streaming_Engineer.pdf
  • 65. Later Today: Introducing Spring for Apache Pulsar https://devnexus.com/presentations/introducing-spring-for-apache-pulsar
  • 66. Time to Learn Apache NiFi https://www.catscloudsanddata.com/
  • 68. Upcoming Events: April 26, May 10
  • 69. Resources
  • 70. THANK YOU