Streaming Data Analytics with ksqlDB and Superset | Robert Stolz, Preset

Streaming Data Analytics with ksqlDB and Superset
w/ Robert Stolz
Email: robert@preset.io
GitHub: garden-of-delete
Find me on the Superset Slack!

Who am I?
● Data Engineer and Developer Advocate @ Preset
● Background in scientific research, computational
biology, mathematics, open-source software
● Data architecture and best practices nerd
● New(ish) to Kafka

Agenda
• The history and anatomy of Apache Superset
• What Superset offers a streaming data architecture
• Streaming analytics w/ Kafka: paths and challenges
Feel free to ask questions as they come up
Keep an eye out for this series on the Preset Blog!

Apache Superset
Timeline: 2015, 2019, 2021 (Version 1.0, ASF incubator graduation)

Apache Superset
Dynamic Dashboards
Dashboard filters and Jinja templating enable end-users to drill deeper into data
No Code Exploration
Create beautiful, complex charts from your data without having to write any code
SQL Lab
State-of-the-art SQL IDE with a rich metadata browser for deeper analysis
Rich Visualizations
Beautiful array of interactive visualizations, including geospatial
Granular Permissions
Row-level security, configurable data policies
Semantic Layer
Support for virtual columns, virtual tables, view creation, and more
Caching
Reduce load on the database: faster queries, faster results
Modern Datastack Support
Connect to any SQL-speaking database, including popular cloud data warehouses and SQL engines
Alerts & Reports
Get notified via Slack or email when dips or spikes happen in your data
Custom Viz Plugins
Build your own custom visualization plug-in or connect to popular 3rd party plug-ins

Superset speaks SQL via SQLAlchemy

Who uses Apache Superset?
and hundreds more...

Value proposition of open-source BI
● Extensibility: custom analytics, embedding, piecemeal
● Control: avoid vendor lock-in
● Cost: free to use and modify, but can be expensive to maintain an
enterprise deployment
● Quality: open-source is a better process for making software

Superset’s lightweight semantic layer
SQL-speaking datasources
React front-end
Python back-end + semantic layer

Dashboard: Drag and Drop Editing

Why connect streaming data to the BI layer?
● BI is one of the primary sensory organs of modern organizations
● Faster, well-informed decision-making is generally desirable
● Many more specific business use-cases require fast response to external events
○ Anomaly detection
○ Location and time-sensitive services
○ Extreme event monitoring
○ Visualizing and analyzing a real-world process that is constantly evolving

The Question
Want to understand: what paths exist for getting streaming data from
Kafka into Superset? (and more generally into the BI/analytics layer)
Distinct from wanting to analyze metadata from a Kafka deployment

Best practice: Intermediate datastore

Direct connection
- Connect Kafka directly to Superset
- The most naive approach

Direct connection
- Superset would need to consume data from Kafka topics directly
- Undesirable to have data live in the BI/Analytics layer

Streaming Analytics w/ Superset + ksqlDB
- ksqlDB provides a SQL speaking interface for data in Kafka topics
- Powered by Kafka’s stream processing framework
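
To make the "SQL speaking interface" concrete, here is a minimal sketch of how a client might phrase a push query against ksqlDB's REST API. The server address, the `pageviews` stream, and its columns are assumptions made for illustration; they are not from the talk. The request is only built, not sent, so the sketch runs without a ksqlDB server.

```python
import json
import urllib.request

# Assumed values for illustration only: a local ksqlDB server and a
# hypothetical stream named `pageviews` with `user_id` and `page` columns.
KSQLDB_URL = "http://localhost:8088"
STATEMENT = "SELECT user_id, page FROM pageviews EMIT CHANGES LIMIT 5;"


def build_query_request(base_url: str, ksql: str) -> urllib.request.Request:
    """Build (but do not send) a POST to ksqlDB's /query endpoint."""
    body = json.dumps({
        "ksql": ksql,
        # Read the underlying topic from the beginning, not only new records.
        "streamsProperties": {"ksql.streams.auto.offset.reset": "earliest"},
    }).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/query",
        data=body,
        headers={"Content-Type": "application/vnd.ksql.v1+json; charset=utf-8"},
        method="POST",
    )


request = build_query_request(KSQLDB_URL, STATEMENT)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing `request` to `urllib.request.urlopen` would submit the query to a running server. Note that this is an HTTP interface rather than a database driver, which is a different shape from the SQLAlchemy connections Superset uses.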

Streaming Analytics w/ Superset + ksqlDB
- No SQLAlchemy dialect for ksqlDB (as of today)
- Probably undesirable to have historical data, complex aggregates,
etc. accessible only through Kafka’s stream-processing framework

Best-practice: Intermediate datastore
- Desirable properties: high write-volume, robust support for event
data, low read-after-write latency, integrated Kafka consumer

Best-practice: Intermediate datastore
- Desirable properties: high write-volume, robust support for
time-series data, low read-after-write latency, integrated Kafka consumer
- Druid, ClickHouse, Rockset, Pinot, Cassandra, etc.

How to choose the right datastore?

Path 1: Integrated consumer
- Integrated consumers ingest event data directly from Kafka topics
- Transformation can be handled by the datastore or by Kafka Streams
- Best performance, limited flexibility in choice of datastore
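
As one illustration of an integrated consumer, the sketch below builds a Kafka ingestion supervisor spec of the kind Apache Druid accepts. The topic, datasource name, broker address, and column names are hypothetical, and the field layout should be checked against the Druid documentation for your version before use.

```python
import json
from typing import List


def kafka_supervisor_spec(
    topic: str,
    datasource: str,
    brokers: str,
    timestamp_column: str,
    dimensions: List[str],
) -> dict:
    """Return a Druid-style Kafka supervisor spec as a plain dict."""
    return {
        "type": "kafka",
        "spec": {
            "dataSchema": {
                "dataSource": datasource,
                "timestampSpec": {"column": timestamp_column, "format": "iso"},
                "dimensionsSpec": {"dimensions": dimensions},
                "granularitySpec": {
                    "segmentGranularity": "hour",
                    "queryGranularity": "none",
                    "rollup": False,
                },
            },
            "ioConfig": {
                "topic": topic,
                "inputFormat": {"type": "json"},
                "consumerProperties": {"bootstrap.servers": brokers},
                "useEarliestOffset": True,
            },
            "tuningConfig": {"type": "kafka"},
        },
    }


# Hypothetical names, chosen only for this example.
spec = kafka_supervisor_spec(
    topic="pageviews",
    datasource="pageviews",
    brokers="localhost:9092",
    timestamp_column="ts",
    dimensions=["user_id", "page"],
)
print(json.dumps(spec, indent=2))
```

With this path the datastore itself reads the topic, so no separate consumer service has to be written or operated. Superset then queries the resulting datasource through the datastore's SQL interface.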

Path 2: ksqlDB connection
- Some transformation tasks are handled by ksqlDB (Kafka Streams)
- Expands the list of possible intermediate datastores

Path 3: Ad-hoc consumers
- Maximum flexibility around choice of datastore
- Comes at the expense of performance
- Can be harder to maintain
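
A rough sketch of what an ad-hoc consumer does: read messages, batch them, and write each batch to the datastore. To keep it runnable with only the standard library, a plain list of JSON-encoded bytes stands in for a Kafka topic and an in-memory SQLite database stands in for the intermediate datastore. Both are stand-ins for illustration, as are the table and field names.

```python
import json
import sqlite3
from typing import Iterable, Iterator, List


def batched(messages: Iterable[bytes], size: int) -> Iterator[List[dict]]:
    """Decode JSON messages and group them into lists of at most `size`."""
    batch: List[dict] = []
    for raw in messages:
        batch.append(json.loads(raw))
        if len(batch) >= size:
            yield batch
            batch = []
    if batch:
        yield batch


def run_consumer(messages: Iterable[bytes], db: sqlite3.Connection,
                 batch_size: int = 2) -> int:
    """Write every message to the `pageviews` table; return rows written."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS pageviews (ts TEXT, user_id TEXT, page TEXT)"
    )
    written = 0
    for batch in batched(messages, batch_size):
        db.executemany(
            "INSERT INTO pageviews VALUES (:ts, :user_id, :page)", batch
        )
        # A real consumer would commit Kafka offsets only after this succeeds.
        db.commit()
        written += len(batch)
    return written


# Stand-in for a Kafka topic: three hypothetical page-view events.
fake_topic = [
    b'{"ts": "2022-01-01T00:00:00Z", "user_id": "u1", "page": "/home"}',
    b'{"ts": "2022-01-01T00:00:01Z", "user_id": "u2", "page": "/docs"}',
    b'{"ts": "2022-01-01T00:00:02Z", "user_id": "u1", "page": "/blog"}',
]
db = sqlite3.connect(":memory:")
written = run_consumer(fake_topic, db)
rows = db.execute("SELECT COUNT(*) FROM pageviews").fetchone()[0]
print(written, rows)
```

Batching, error handling, and offset management all become your code to own here, which is where the performance and maintenance costs listed above come from.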

Superset fits into batch and streaming data architectures
Source: Designing Cloud Data Platforms by Danil Zburivsky and Lynda Partner

Three ways to run Superset
Manual Setup
• Complex set-up
• Maximum control over configuration
• Good for enterprise deployments
• Advanced features require additional set-up (Async Queries, Query Caching, Prophet integration, Dashboard thumbnails, Alerts and Reports)
Docker-compose
• Easiest set-up
• Great for trying out Superset and local development
• Some features are part of the stack by default (caching) and some aren’t (alerts and reports, Prophet integration)
Preset Cloud
• No set-up
• Good for individual evaluation all the way up to enterprise needs
• All advanced Superset features available
• Still FREE for small teams!
