Flink SQL on Pulsar Made Easy
● Staff Software Engineer @ StreamNative
● Apache Pulsar Committer & Apache Incubator Heron Committer
● Leading compute team @ StreamNative
● Founder & CEO @ StreamNative
● Apache Pulsar Committer & PMC
● Leading Apache Pulsar & StreamNative
Flink SQL on Pulsar made easy
Agenda
● Apache Pulsar & Apache Flink
● Flink-Pulsar SQL Connector
● Flink-Pulsar Catalog
Apache Pulsar - The cloud-native messaging and streaming platform
Unified Messaging Model
Simplify your data infrastructure and enable new use cases with queuing and streaming
capabilities in one platform.
Multi-tenancy
Enable multiple user groups to share the same platform, either via access control, or in
entirely different tenants or namespaces.
Scalability
Decoupled compute and storage enable horizontal scaling to handle growing data volumes and
management complexity.
Geo-replication
Support for multi-datacenter replication, with both asynchronous and synchronous replication,
provides built-in disaster recovery for multi-cloud and hybrid-cloud deployments.
Tiered storage
Enable historical data to be offloaded to cloud-native storage and store event streams for
indefinite periods of time. Unlock new use cases for unified batch and stream processing.
[Diagram: the Pulsar API surfaces. The Pub/Sub API connects publishers and subscribers in microservices or event-driven architectures; the Reader and Batch API serves stream processor applications; Pulsar IO/Connectors provide prebuilt and custom connectors; and the Admin API is used by operators and administrators to manage teams and tenants.]
Subscription Modes
Different subscription modes have different semantics:
● Exclusive/Failover - guaranteed order, single active consumer
● Shared - multiple active consumers, no order
● Key_Shared - multiple active consumers, order for a given key
[Diagram: Producer 1 and Producer 2 publish keyed messages (e.g. <K1,V10>, <K2,V20>) to a Pulsar topic, consumed through four subscriptions: Subscription A (Exclusive, single Consumer A), Subscription B (Failover, Consumer B-2 takes over in case of failure in Consumer B-1), Subscription C (Shared, Consumers C-1 and C-2 receive messages with no key affinity), and Subscription D (Key_Shared, Consumers D-1 and D-2 each receive all messages for a given key).]
Apache Pulsar – Adoption
Apache Pulsar + Apache Flink
Apache Pulsar + Apache Flink
Streaming-first, unified data processing
Pulsar Summit San Francisco
Hotel Nikko, August 18, 2022
5 Keynotes, 12 Breakout Sessions, 1 Amazing Happy Hour
Save your spot now: use code FLINK+PULSAR to get 50% off your ticket.
Pulsar Summit San Francisco – Sponsorship Prospectus
Sponsorships available: help engage and connect the Apache Pulsar community by becoming an official sponsor for Pulsar Summit San Francisco 2022! Learn more about the requirements and benefits of becoming a sponsor.
Flink-Pulsar SQL Connector – What
● Flink has 2 APIs:
○ DataStream API
○ SQL (& Table API)
● Lets Flink SQL jobs talk to Pulsar clusters
● Built on top of the Pulsar DataStream connector
● Built-in PulsarCatalog
○ Provides metadata for Flink SQL tables
Flink-Pulsar SQL Connector – Why
• Flink SQL is becoming more and more popular
• Easy to use for SQL-only users
• Ad-hoc queries against Pulsar topics
• CDC use cases with CDC formats
• Goal: provide seamless integration between Flink and Pulsar
Flink-Pulsar SQL Connector – Example
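(The example from this slide is not reproduced here. Below is a minimal sketch of a Flink SQL table backed by a Pulsar topic; the topic, URLs, and column names are illustrative, and option names may differ slightly between connector versions.)

CREATE TABLE Orders (
  order_id   STRING,
  price      DOUBLE,
  order_time TIMESTAMP(3)
) WITH (
  'connector'   = 'pulsar',
  'topics'      = 'persistent://public/default/orders',
  'service-url' = 'pulsar://localhost:6650',
  'admin-url'   = 'http://localhost:8080',
  'format'      = 'json'
);

SELECT order_id, price FROM Orders;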
Flink-Pulsar SQL Connector – Message Structure
Flink-Pulsar SQL Connector – Metadata
● Each Pulsar message is associated with metadata
● Users can declare columns mapped from Pulsar message metadata
● VIRTUAL means the column is only available at the source (read-only; not written by the sink)
Flink-Pulsar SQL Connector – Metadata
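(A sketch of declaring metadata columns, assuming `publish_time` and `message_id` are among the metadata keys exposed by the connector; the exact key names and types depend on the connector version.)

CREATE TABLE OrdersWithMeta (
  order_id     STRING,
  publish_time TIMESTAMP_LTZ(3) METADATA VIRTUAL,  -- read-only: available at the source only
  message_id   BYTES METADATA VIRTUAL
) WITH (
  'connector'   = 'pulsar',
  'topics'      = 'persistent://public/default/orders',
  'service-url' = 'pulsar://localhost:6650',
  'admin-url'   = 'http://localhost:8080',
  'format'      = 'json'
);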
Flink-Pulsar SQL Connector – Pulsar Schema
● Serializes/deserializes raw message bytes into typed objects
● SchemaInfo is the data structure that defines a Pulsar schema
● Schema types
○ Primitive: BOOLEAN, INT, STRING, …
○ Complex: KeyValue, Struct (json, avro, protobuf_native)
● byte[] if no schema is defined
● AUTO schema to produce/consume generic records to/from brokers
Flink-Pulsar SQL Connector – Flink Format
● A format “defines how to map binary data onto table columns”
● Existing formats: avro / json / csv / raw
● The Pulsar SQL connector manages ser/de using Flink formats
Flink-Pulsar SQL Connector – Schema & Format
● Case 1: interact with a topic whose messages are serialized by the Pulsar client
● Problem: which Flink format should we use?
○ Pulsar schemas and Flink formats are two different serialization frameworks used by two different systems
○ How can we make sure Flink formats understand the binary data produced by the Pulsar client using a Pulsar schema?
● Fortunately, they follow the same binary format protocols: json / avro
● So they should be compatible.
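For instance (a sketch, assuming the producers registered an AVRO schema on the topic), picking the Flink format that matches the topic's Pulsar schema type lets Flink decode the bytes the Pulsar client wrote:

CREATE TABLE UserEvents (
  user_id STRING,
  action  STRING
) WITH (
  'connector'   = 'pulsar',
  'topics'      = 'persistent://public/default/user-events',
  'service-url' = 'pulsar://localhost:6650',
  'admin-url'   = 'http://localhost:8080',
  'format'      = 'avro'  -- matches the topic's Pulsar AVRO schema
);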
Flink-Pulsar SQL Connector – Schema & Format
● But there might be tiny differences due to implementation details.
● The ideal solution would be to implement a Flink format for each Pulsar schema.
○ This is not available yet.
● As a workaround, we use the existing Flink formats and test them thoroughly to make sure they work with Pulsar schemas.
What if we don’t use Pulsar Schema?
● Case 2: interact with a topic whose messages are serialized by Flink formats
● Flink SQL jobs are the only clients of the topic
● Then any Flink format works fine, since the messages are serialized and deserialized only by Flink SQL formats
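A sketch of this case (illustrative names): the same Flink format handles both writing and reading, so no Pulsar schema compatibility question arises.

CREATE TABLE Metrics (
  name   STRING,
  metric DOUBLE
) WITH (
  'connector'   = 'pulsar',
  'topics'      = 'persistent://public/default/metrics',
  'service-url' = 'pulsar://localhost:6650',
  'admin-url'   = 'http://localhost:8080',
  'format'      = 'csv'  -- any Flink format works when Flink is the only client
);

INSERT INTO Metrics VALUES ('cpu_load', 0.42);  -- serialized by the Flink csv format
SELECT * FROM Metrics;                          -- deserialized by the same format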
Flink-Pulsar SQL Connector – Schema Summary
● Interacting with a topic whose messages are serialized by the Pulsar client:
○ users must choose a correct, compatible Flink format
● Interacting with a topic whose messages are serialized by Flink formats:
○ Flink SQL takes over serialization and deserialization, so any Flink format works
● More Flink formats to support more Pulsar schemas? We are working on it!
PulsarCatalog – What
● A Flink Catalog implementation that uses Pulsar as a metadata store
○ Flink defaults to GenericInMemoryCatalog, so table definitions are not persisted by default
○ PulsarCatalog persists them in the Pulsar cluster instead
● No other components needed
● Views and UDFs are not supported yet
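A minimal sketch of registering and using the catalog (the catalog type and option names are assumptions based on the StreamNative connector documentation; check your connector version for the exact names):

CREATE CATALOG pulsar_catalog WITH (
  'type'                = 'pulsar-catalog',
  'catalog-service-url' = 'pulsar://localhost:6650',
  'catalog-admin-url'   = 'http://localhost:8080'
);

USE CATALOG pulsar_catalog;
SHOW DATABASES;  -- lists tenant/namespace combinations as databases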
PulsarCatalog – Pulsar Multi Tenancy
persistent://tenant/namespace/topic
● Tenant
○ authorization and authentication scheme
○ configurations including storage quota, message TTL, etc.
○ the set of clusters to which the tenant's configuration applies
● Namespace
○ the administrative unit within a tenant
○ policies including retention, dispatch throttling, etc.
PulsarCatalog – Tables
● Pulsar-Native Table: existing Pulsar topic -> Flink table
○ Easy to use for simple queries
○ No need to link the topic to a Flink table via a `CREATE` statement
○ Can't specify watermarks, metadata columns, or a primary key
● Explicit Table: Flink table declared via a `CREATE` statement
○ Supports all Flink SQL features: watermarks, primary key, metadata columns, etc.
○ Better control over Pulsar configs: regex-pattern topics, client option tuning, etc.
○ Requires additional setup and configuration
You can create multiple tables against a topic, so a native table and an explicit table can refer to the same topic.
PulsarCatalog – Native Table
● Maps a Pulsar `tenant/namespace` combination to a Flink database
○ e.g. persistent://public/default/topicA is under database `public/default` with table name `topicA`
● PulsarCatalog derives the table columns from the Pulsar schema
● PulsarCatalog automatically decides which format to use
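A sketch of querying a native table (illustrative names); note the backticks, since the database name contains a slash:

USE CATALOG pulsar_catalog;
USE `public/default`;    -- the tenant/namespace pair acts as the database
SELECT * FROM `topicA`;  -- an existing topic queried directly, no CREATE TABLE needed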
PulsarCatalog – Native Table
● For structured Pulsar schemas, the Flink table schema is derived from the Pulsar schema
● Primitive types map to a single-column table schema with the field name “value”
● Limitations
○ Requires the Pulsar topic to use a valid schema
○ Some Pulsar schema auto-mappings are not supported
PulsarCatalog – Explicit Table
● Creates “placeholder” topics under a system tenant
● No data is stored in the placeholder topic
● The Flink table schema is stored in the placeholder topic's schema definition
PulsarCatalog – Explicit Table
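(The example from this slide is not reproduced here. Below is a minimal sketch of declaring an explicit table through the catalog, with illustrative names; the database maps to a placeholder topic under the system tenant as described above.)

USE CATALOG pulsar_catalog;
CREATE DATABASE flink_forward;  -- not a tenant/namespace pair, so it becomes an explicit-table database

CREATE TABLE flink_forward.Orders (
  order_id   STRING,
  price      DOUBLE,
  order_time TIMESTAMP(3),
  WATERMARK FOR order_time AS order_time - INTERVAL '5' SECOND
) WITH (
  'connector' = 'pulsar',
  'topics'    = 'persistent://public/default/orders',
  'format'    = 'json'
);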
PulsarCatalog – Tables Summary
● Pulsar-Native Table
○ Topic metadata -> Flink table definition
■ tenant/namespace -> database
■ topic -> table
■ Pulsar schema -> Flink schema
● Explicit Table
○ Flink table definition persisted in the Pulsar cluster
■ <catalog_tenant>/<database_name>/table_<table_name>
■ e.g. __flink_catalog/flink_forward/table_Orders
○ Table schema serialized and persisted in the Pulsar schema store
Future work
● Improvements and enhancements
● protobuf_native format
● Upsert mode: CDC scenarios and adapt CDC formats
● Cookbooks and migration guide
Resources
● Apache Pulsar
● Apache Pulsar Newsletter
● GitHub repository: streamnative/flink
● SQL connector image (Flink 1.15 and later)
● Examples: streamnative/flink-examples
● StreamNative Hub Documentation: SN Hub
Acknowledgements
Yufei Zhang is a StreamNative engineer
working on the integration of Pulsar and Flink.
He is an Apache RocketMQ Committer &
Apache Flink Contributor.
Yufan Sheng is a software engineer at StreamNative, where he works on integrating
Flink and other streaming platforms with Apache Pulsar. Before that, he was a
senior software engineer at Tencent Cloud.
LEARN MORE ABOUT APACHE PULSAR WITH:
StreamNative Academy
Academy.StreamNative.io
➔ Pulsar expert instructor-led courses
➔ On-demand learning with labs
➔ 300+ engineers, admins and architects trained!
Thank You!