Flink SQL on Pulsar Made Easy
● Staff Software Engineer @ StreamNative
● Apache Pulsar Committer & Apache Incubator Heron Committer
● Leading compute team @ StreamNative
● Founder & CEO @ StreamNative
● Apache Pulsar Committer & PMC
● Leading Apache Pulsar & StreamNative
Flink SQL on Pulsar made easy
Agenda
● Apache Pulsar & Apache Flink
● Flink-Pulsar SQL Connector
● Flink-Pulsar Catalog
Apache Pulsar - The cloud-native messaging and streaming platform
Unified Messaging Model
Simplify your data infrastructure and enable new use cases with queuing and streaming
capabilities in one platform.
Multi-tenancy
Enable multiple user groups to share the same platform, either via access control, or in
entirely different tenants or namespaces.
Scalability
Decoupled compute and storage enable horizontal scaling to handle growing data volumes and
management complexity.
Geo-replication
Support for multi-datacenter replication, with both asynchronous and synchronous replication,
provides built-in disaster recovery for multi-cloud and hybrid-cloud deployments.
Tiered storage
Enable historical data to be offloaded to cloud-native storage and store event streams for
indefinite periods of time. Unlock new use cases for unified batch and stream processing.
[Diagram: the Pulsar API surfaces. The Pub/Sub API connects publishers and subscribers in microservices or event-driven architectures; the Reader and Batch API serves stream processor applications; Pulsar IO/Connectors provide prebuilt and custom connectors; and the Admin API is used by operators and administrators to manage teams and tenants.]
Subscription Modes
Different subscription modes have different semantics:
● Exclusive/Failover - guaranteed order, single active consumer
● Shared - multiple active consumers, no order
● Key_Shared - multiple active consumers, order for a given key
[Diagram: Producer 1 and Producer 2 publish keyed messages (e.g. <K1,V10>, <K2,V20>) to a Pulsar topic, consumed through four subscriptions: Subscription A (Exclusive, single Consumer A), Subscription B (Failover, Consumer B-2 takes over in case of failure in Consumer B-1), Subscription C (Shared, Consumers C-1 and C-2 receive messages with no key affinity), and Subscription D (Key_Shared, Consumers D-1 and D-2 each receive all messages for a given key).]
Apache Pulsar – Adoption
Apache Pulsar + Apache Flink
Apache Pulsar + Apache Flink
Streaming-first, unified data processing
Pulsar Summit San Francisco
Hotel Nikko, August 18, 2022
5 Keynotes, 12 Breakout Sessions, 1 Amazing Happy Hour
Save your spot now: use code FLINK+PULSAR to get 50% off your ticket.
Pulsar Summit San Francisco – Sponsorship Prospectus
Sponsorships available: help engage and connect the Apache Pulsar community by becoming an official sponsor for Pulsar Summit San Francisco 2022! Learn more about the requirements and benefits of becoming a sponsor.
Flink-Pulsar SQL Connector – What
● Flink has 2 APIs:
○ DataStream API
○ SQL (& Table API)
● Lets Flink SQL jobs talk to Pulsar clusters
● Built on top of the Pulsar DataStream connector
● Built-in PulsarCatalog
○ Provides metadata for Flink SQL tables
Flink-Pulsar SQL Connector – Why
• Flink SQL is becoming more and more popular
• Easy to use for SQL-only users
• Ad-hoc queries against Pulsar topics
• CDC use cases with CDC formats
• Goal: provide seamless integration between Flink and Pulsar
Flink-Pulsar SQL Connector – Example
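(The example from this slide is not reproduced here. Below is a minimal sketch of a Flink SQL table backed by a Pulsar topic; the topic, URLs, and column names are illustrative, and option names may differ slightly between connector versions.)

CREATE TABLE Orders (
  order_id   STRING,
  price      DOUBLE,
  order_time TIMESTAMP(3)
) WITH (
  'connector'   = 'pulsar',
  'topics'      = 'persistent://public/default/orders',
  'service-url' = 'pulsar://localhost:6650',
  'admin-url'   = 'http://localhost:8080',
  'format'      = 'json'
);

SELECT order_id, price FROM Orders;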
Flink-Pulsar SQL Connector – Message Structure
Flink-Pulsar SQL Connector – Metadata
● Each Pulsar message is associated with metadata
● Users can declare columns mapped from Pulsar message metadata
● VIRTUAL means the column is only available at the source (read-only; not written by the sink)
Flink-Pulsar SQL Connector – Metadata
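(A sketch of declaring metadata columns, assuming `publish_time` and `message_id` are among the metadata keys exposed by the connector; the exact key names and types depend on the connector version.)

CREATE TABLE OrdersWithMeta (
  order_id     STRING,
  publish_time TIMESTAMP_LTZ(3) METADATA VIRTUAL,  -- read-only: available at the source only
  message_id   BYTES METADATA VIRTUAL
) WITH (
  'connector'   = 'pulsar',
  'topics'      = 'persistent://public/default/orders',
  'service-url' = 'pulsar://localhost:6650',
  'admin-url'   = 'http://localhost:8080',
  'format'      = 'json'
);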
Flink-Pulsar SQL Connector – Pulsar Schema
● Serializes/deserializes raw message bytes into typed objects
● SchemaInfo is the data structure that defines a Pulsar schema
● Schema types
○ Primitive: BOOLEAN, INT, STRING, …
○ Complex: KeyValue, Struct (json, avro, protobuf_native)
● byte[] if no schema is defined
● AUTO schema to produce/consume generic records to/from brokers
Flink-Pulsar SQL Connector – Flink Format
● A format “defines how to map binary data onto table columns”
● Existing formats: avro / json / csv / raw
● The Pulsar SQL connector manages ser/de using Flink formats
Flink-Pulsar SQL Connector – Schema & Format
● Case 1: interact with a topic whose messages are serialized by the Pulsar client
● Problem: which Flink format should we use?
○ Pulsar schemas and Flink formats are two different serialization frameworks used by two different systems
○ How can we make sure Flink formats understand the binary data produced by the Pulsar client using a Pulsar schema?
● Fortunately, they follow the same binary format protocols: json / avro
● So they should be compatible.
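For instance (a sketch, assuming the producers registered an AVRO schema on the topic), picking the Flink format that matches the topic's Pulsar schema type lets Flink decode the bytes the Pulsar client wrote:

CREATE TABLE UserEvents (
  user_id STRING,
  action  STRING
) WITH (
  'connector'   = 'pulsar',
  'topics'      = 'persistent://public/default/user-events',
  'service-url' = 'pulsar://localhost:6650',
  'admin-url'   = 'http://localhost:8080',
  'format'      = 'avro'  -- matches the topic's Pulsar AVRO schema
);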
Flink-Pulsar SQL Connector – Schema & Format
● But there might be tiny differences due to implementation details.
● The ideal solution would be to implement a Flink format for each Pulsar schema.
○ This is not available yet.
● As a workaround, we use the existing Flink formats and test them thoroughly to make sure they work with Pulsar schemas.
What if we don’t use Pulsar Schema?
● Case 2: interact with a topic whose messages are serialized by Flink formats
● Flink SQL jobs are the only clients of the topic
● Then any Flink format works fine, since the messages are serialized and deserialized only by Flink SQL formats
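A sketch of this case (illustrative names): the same Flink format handles both writing and reading, so no Pulsar schema compatibility question arises.

CREATE TABLE Metrics (
  name   STRING,
  metric DOUBLE
) WITH (
  'connector'   = 'pulsar',
  'topics'      = 'persistent://public/default/metrics',
  'service-url' = 'pulsar://localhost:6650',
  'admin-url'   = 'http://localhost:8080',
  'format'      = 'csv'  -- any Flink format works when Flink is the only client
);

INSERT INTO Metrics VALUES ('cpu_load', 0.42);  -- serialized by the Flink csv format
SELECT * FROM Metrics;                          -- deserialized by the same format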
Flink-Pulsar SQL Connector – Schema Summary
● Interacting with a topic whose messages are serialized by the Pulsar client:
○ users must choose a correct, compatible Flink format
● Interacting with a topic whose messages are serialized by Flink formats:
○ Flink SQL takes over serialization and deserialization, so any Flink format works
● More Flink formats to support more Pulsar schemas? We are working on it!
PulsarCatalog – What
● A Flink Catalog implementation that uses Pulsar as a metadata store
○ Flink defaults to GenericInMemoryCatalog, so table definitions are not persisted by default
○ PulsarCatalog persists them in the Pulsar cluster instead
● No other components needed
● Views and UDFs are not supported yet
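A minimal sketch of registering and using the catalog (the catalog type and option names are assumptions based on the StreamNative connector documentation; check your connector version for the exact names):

CREATE CATALOG pulsar_catalog WITH (
  'type'                = 'pulsar-catalog',
  'catalog-service-url' = 'pulsar://localhost:6650',
  'catalog-admin-url'   = 'http://localhost:8080'
);

USE CATALOG pulsar_catalog;
SHOW DATABASES;  -- lists tenant/namespace combinations as databases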
PulsarCatalog – Pulsar Multi Tenancy
persistent://tenant/namespace/topic
● Tenant
○ authorization and authentication scheme
○ configurations including storage quota, message TTL, etc.
○ the set of clusters to which the tenant's configuration applies
● Namespace
○ the administrative unit within a tenant
○ policies including retention, dispatch throttling, etc.
PulsarCatalog – Tables
● Pulsar-Native Table: existing Pulsar topic -> Flink table
○ Easy to use for simple queries
○ No need to link the topic to a Flink table via a `CREATE` statement
○ Can't specify watermarks, metadata columns, or a primary key
● Explicit Table: Flink table declared via a `CREATE` statement
○ Supports all Flink SQL features: watermarks, primary key, metadata columns, etc.
○ Better control over Pulsar configs: regex-pattern topics, client option tuning, etc.
○ Requires additional setup and configuration
You can create multiple tables against a topic, so a native table and an explicit table can refer to the same topic.
PulsarCatalog – Native Table
● Maps a Pulsar `tenant/namespace` combination to a Flink database
○ e.g. persistent://public/default/topicA is under database `public/default` with table name `topicA`
● PulsarCatalog derives the table columns from the Pulsar schema
● PulsarCatalog automatically decides which format to use
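A sketch of querying a native table (illustrative names); note the backticks, since the database name contains a slash:

USE CATALOG pulsar_catalog;
USE `public/default`;    -- the tenant/namespace pair acts as the database
SELECT * FROM `topicA`;  -- an existing topic queried directly, no CREATE TABLE needed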
PulsarCatalog – Native Table
● For structured Pulsar schemas, the Flink table schema is derived from the Pulsar schema
● Primitive types map to a single-column table schema with the field name “value”
● Limitations
○ Requires the Pulsar topic to use a valid schema
○ Some Pulsar schema auto-mappings are not supported
PulsarCatalog – Explicit Table
● Creates “placeholder” topics under a system tenant
● No data is stored in the placeholder topic
● The Flink table schema is stored in the placeholder topic's schema definition
PulsarCatalog – Explicit Table
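(The example from this slide is not reproduced here. Below is a minimal sketch of declaring an explicit table through the catalog, with illustrative names; the database maps to a placeholder topic under the system tenant as described above.)

USE CATALOG pulsar_catalog;
CREATE DATABASE flink_forward;  -- not a tenant/namespace pair, so it becomes an explicit-table database

CREATE TABLE flink_forward.Orders (
  order_id   STRING,
  price      DOUBLE,
  order_time TIMESTAMP(3),
  WATERMARK FOR order_time AS order_time - INTERVAL '5' SECOND
) WITH (
  'connector' = 'pulsar',
  'topics'    = 'persistent://public/default/orders',
  'format'    = 'json'
);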
PulsarCatalog – Tables Summary
● Pulsar-Native Table
○ Topic metadata -> Flink table definition
■ tenant/namespace -> database
■ topic -> table
■ Pulsar schema -> Flink schema
● Explicit Table
○ Flink table definition persisted in the Pulsar cluster
■ <catalog_tenant>/<database_name>/table_<table_name>
■ e.g. __flink_catalog/flink_forward/table_Orders
○ Table schema serialized and persisted in the Pulsar schema store
Future work
● Improvements and enhancements
● protobuf_native format
● Upsert mode: CDC scenarios and adapt CDC formats
● Cookbooks and migration guide
Resources
● Apache Pulsar
● Apache Pulsar Newsletter
● GitHub repository: streamnative/flink
● SQL connector image (Flink 1.15 and later)
● Examples: streamnative/flink-examples
● StreamNative Hub Documentation: SN Hub
Acknowledgements
Yufei Zhang is a StreamNative engineer
working on the integration of Pulsar and Flink.
He is an Apache RocketMQ Committer &
Apache Flink Contributor.
Yufan Sheng is a software engineer at StreamNative, where he works on integrating
Flink and other streaming platforms with Apache Pulsar. Before that, he was a
senior software engineer at Tencent Cloud.
LEARN MORE ABOUT APACHE PULSAR WITH:
StreamNative Academy
Academy.StreamNative.io
➔ Pulsar expert instructor-led courses
➔ On-demand learning with labs
➔ 300+ engineers, admins and architects trained!
Thank You!