© 2021 Google LLC. All rights reserved.
Hybrid Streaming Analytics
for Apache Kafka Users
Firat Tekiner (ftekiner@google.com)
EMEA Data Analytics Practice Lead
On-premises or Other Cloud
Hybrid Kafka Reference Architecture
Dataflow
BigQuery
Cloud
Storage
Data
Studio
Cloud
Functions
AI
Platform
Bigtable
Confluent Replicator
KSQL
App App DataStore
MySQL HDFS Teradata,
Netezza
Mainframe
App App
Business is transforming
Businesses have to anticipate and
act on risks and opportunities faster
than ever before
The data and events needed for
analysis are increasing in velocity,
volume, and type
Companies that are able to quickly identify and capitalize on insights within this
changing landscape have a strategic advantage.
Why Enterprises
choose Google Cloud
for Streaming
Analytics
Serverless Architecture
Robust ingestion services
Unified batch and stream processing
Comprehensive set of analysis tools
Flexibility for users
Serverless data analytics
From infrastructure to platform for insights
Performance tuning
Monitoring
Reliability
Deployment &
configuration
Utilization
improvements
The traditional data analytics platform
Analysis and insights
Resource provisioning
Handling growing scale
Analysis and
insights
The serverless data
analytics model
Right-time Action
Dashboard
Visualize and share anomalous events in
your data.
Alerts
Manage by exception through condition-based notifications.
Actions
Automatically trigger workflows in other
systems using conditions.
Looker
Blocks
Comprehensive set of analysis tools
BigQuery
Cloud Data
Warehouse
Easy setup
Directly integrated with
streaming Dataflow and
Confluent Cloud
Real time
Fast insights and action
powered by BigQuery’s
Streaming API
Intelligent
Built-in ML for out-of-the-box predictive insights
Cloud AI
Platform
AI & ML Tools
Plug-and-play
Easily experiment and
collaborate with Google’s
AI Hub
Building blocks
Tools for sight, language,
conversation, and
structured data
Fast deployment
Code-based AI platform
quickly moves ML ideas
to deployment
TensorFlow
Extended (TFX)
Improve the customer experience with Real-time AI
TFX uses Dataflow and Apache Beam as the distributed data processing engine to enable
several aspects of the ML life cycle, all supported with CI/CD for ML through Kubeflow pipelines.
Predictive
Analytics
Fraud
Detection
Real-time
Personalization
More!
Data Analytics
& Management
Google
Cloud
Smart
Analytics
& AI
Prebuilt
ML APIs
Foundation
AI Platform
AutoML
AI Solutions
Language Conversation
Horizontal solutions
Structured Data
Language
Frameworks Compute
Contact
Center AI
Ingestion and Processing Storage and Analytics
Orchestration
Notebooks
Industry solutions
Data
Labeling
Training Prediction Continuous
evaluation
Explainability Pipelines
Compute
Engine
Cloud TPU
Cloud GPU Cloud
Scheduler
Cloud
Composer
Instrumentation
Cloud Build Container
Registry
Cloud
Pub/Sub
Cloud
Dataflow
Cloud
Dataproc
Data
Fusion
Cloud
Storage BigQuery
Cloud
Bigtable
Cloud SQL
Data
Catalog
Data
Studio
Data Science and Machine Learning
Sight
Sight
Vision Video Translate Natural
Language
Tables
Video
Intelligence
Vision Natural
Language
Translate Speech-to-Text Text-to-Speech
Document AI
Dialogflow Talent Solution
Recommendation AI
Flexibility for users
Apache Beam
Open-source,
unified model and
set of SDKs for
defining and
executing data
processing
Open source programming
model
Serves as the SDK for
creating Cloud Dataflow jobs;
community development
increases flexibility
Choose your language
Java, Python, Scala, and Go are available;
join DA Spotlight
for news on languages
Portability
Program in Beam, and gain the ability to
move between
Spark, Flink, Dataflow, and more
Dataflow
Simplified stream and
batch data processing
Batch and Stream
Reduce complexity and reuse code
by driving batch and stream
workloads from the same tool
Reliable and consistent processing
Exactly-once processing with built-in
support for fault-tolerant execution
Simplified operations & management
Performance, scaling, availability,
security, and compliance
handled automatically
Integrated
Integration with Kafka/Confluent Cloud,
the Google Data Analytics suite,
and GCP broadly
Unified stream and batch
processing
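The "batch and stream from the same tool" point can be illustrated without any framework. The following is a minimal plain-Python sketch and is not the Apache Beam API: the function name `count_per_window`, the 60-second window size, and the sample events are all assumptions made for illustration. It applies one fixed-window counting transform to a bounded list (batch) and to a generator standing in for an unbounded source (stream).

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple

Event = Tuple[int, str]  # (event timestamp in seconds, key); hypothetical shape


def count_per_window(events: Iterable[Event], window_secs: int = 60) -> Counter:
    """Count events per (window start, key) using fixed, non-overlapping windows."""
    counts: Counter = Counter()
    for ts, key in events:
        # Align the timestamp to the start of its fixed window.
        window_start = ts - (ts % window_secs)
        counts[(window_start, key)] += 1
    return counts


# Batch input: a bounded, already-collected list of events (sample data).
batch_events = [(5, "login"), (42, "login"), (61, "purchase"), (119, "login")]


def event_stream() -> Iterator[Event]:
    """Stand-in for an unbounded source such as a Kafka topic (sample data)."""
    yield from batch_events


# The same transform is applied to both inputs, unchanged.
batch_result = count_per_window(batch_events)
stream_result = count_per_window(event_stream())
```

A real streaming engine would emit each window's result as the window closes rather than after the whole input is consumed; the sketch only shows that the transform logic itself does not need to differ between the two modes.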
Ingest Transform Analyze
Ingest and distribute
data reliably
Fast, correct computations
quickly and simply
Machine learning &
data warehouse
Cloud Dataflow
Cloud ML
Pub/Sub BigQuery
Dataflow
Flexible stream analytics with OSS
KSQL
Google Cloud has an
end-to-end, fully managed
Stream Analytics offering
Pub/Sub
(Messaging)
Confluent Kafka
(Messaging)*
BigQuery Streaming
API
IoT Core
Collect
Data Catalog (Metadata Management) & Composer (Workflow Orchestration)
Dataflow
(Beam Streaming)
Dataproc
(Spark Streaming and Flink)
Dataform
Kubernetes
Process
BigQuery
Bigtable
AI Platform + TFX
Integration
Databases (e.g.
Cloud SQL, Spanner)
Store and Analyze
Looker
Apigee
Firebase
Activate
Cloud Functions
* Partner Solution
A platform for all users and intents throughout the data lifecycle
Fine-grained
access control
Cloud IAM
Metadata
management
Data Catalog
Always
encrypted
Data at rest and
in transit
Redact sensitive
data
Cloud DLP
Security Admin
Protecting data
Messaging
Pub/Sub
Data Processing
Dataflow
Data Apps
Looker
(LookML)
OSS Engines
Dataproc
(Spark, Flink)
Developer
Intelligent apps
DW & DB
BigQuery,
Bigtable
Data processing
(OSS) pipelines
Dataproc
(Spark, Presto, Flink)
Data Processing
(Native) pipelines
Dataflow
Orchestration
Composer
Data engineer
Get clean, useful data
Messaging
Pub/Sub or
Confluent Kafka
CDW
BigQuery
CDW &
Orchestration
BigQuery
Visual data
Integration
Data Fusion
ML in SQL
BigQuery ML
Data models,
catalog
Looker, Data
Catalog
Data analyst
Query and analyze
Ingestion
BigQuery
Streaming &
DTS
Governed BI
Looker
CDW in a
Spreadsheet
Connected
Sheets
Natural Language
Query
Data QnA
Business User
Insights Everywhere
Data models,
catalog
Looker, Data
Catalog
CDW
BigQuery
Portable
notebooks
AI Platform
Notebooks
Simplified ML
BigQuery ML &
AutoML
Collaboration
Feature Store,
AI Platform
Pipelines
Spark
Dataproc
Data scientist
Models that work
CDW
BigQuery
Secure data
sharing
BigQuery
Real-time Analytics GCP Approach
Event Collect Process Store and Analyze Activate
BigQuery Looker
Event stream / Integration
Pub/Sub Dataflow
IoT Core
Analytics
Low Latency,
Time Series Bigtable
Apigee
Firebase
Apigee
Firebase
Monetization
Cloud Logging
...
Templates
AI Platform
Continuous
Intelligence
Edge Manager
for ML
ML at the Edge
App Activation
Real-time Analytics GCP Simplified Approach
Event Collect Process Store and Analyze Activate
BigQuery
Looker
Streaming API
ELT
(Dataform)
Materialized
Views
BQML
BI Engine
Data Studio
Apigee
Connected
Sheets
Event stream / Integration
Real-time Analytics Open and Partner Approach
Event Collect Process Store and Analyze Activate
Dataproc
Streaming
BigQuery
3rd Party BI and
activation tools
...
...
How do we deploy Kafka or integrate it with the rest of the GCP stack?
Options
Hybrid
● Accessing Kafka on-prem directly from GCP
● Kafka replication (on-prem to GCE or Confluent Cloud’s GCP marketplace offering)
Lift and Shift
● Confluent Cloud’s fully managed Kafka (Marketplace offering)
– Connectors available to BigQuery, Cloud Storage, Pub/Sub, MongoDB Atlas, etc.
– Clustering, SLAs, etc.
● Self-managing Kafka on GCE
GCP Integration
● Pre-built Dataflow Flex Template
– Kafka to BigQuery template
● Using Kafka Connect
– To push to Google BigQuery (supported by Confluent and WePay)
– To push to Google Cloud Pub/Sub (supported by Google)
● Fivetran, Confluent ...
On-prem
Hybrid: Access Kafka on-prem from GCP
Gateway
Google Cloud
Interconnect
& VPN
Gateway
Kafka
Cluster
Analysis
Cloud Dataflow
Analysis
Compute Engine
Analysis
Cloud Dataproc
On-prem
Hybrid: Replicate Kafka on-prem to GCP
Gateway
Google Cloud
Interconnect
& VPN
Gateway
Kafka
Cluster
Kafka
Self Managed
Cluster
Compute
Engine
Analysis
Cloud
Dataflow
Analysis
Compute
Engine
Kafka Connect
Kafka Connect
Replicator
Analysis
Cloud
Dataproc
On-prem
Lift and Shift: Confluent Cloud’s Kafka on GCP
Analysis
Cloud
Dataflow
Analysis
Compute
Engine
Analysis
Cloud
Dataproc
Confluent Cloud
Managed by Confluent
Kafka Cluster
Customer Project
Internet
Private network
On-prem
Lift and Shift: Self-managing Kafka on GCP
Gateway
Google Cloud
Interconnect
& VPN
Gateway
Kafka
Self Managed
Cluster
Compute
Engine
Analysis
Cloud
Dataflow
Analysis
Compute
Engine
Analysis
Cloud
Dataproc
GCP Integration: Using Dataflow Template
Kafka to BQ
Dataflow Template
Table
BigQuery
Kafka
Compute Engine
On-prem
GCP Integration: Using Kafka Connect
Gateway
Google Cloud
Interconnect
& VPN
Gateway
Analysis
Cloud
Dataflow
Kafka Connect
Cloud Pub/Sub
Connector
Kafka Topic
Cloud Pub/Sub
Kafka Topic Dest.
BigQuery
Kafka Connect
BigQuery Connector
Internet
Private Network
Supported by Google
Supported by Confluent and WePay
Analysis
Cloud
BigQuery
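The Kafka Connect integration shown on this slide is typically set up by registering a connector with Kafka Connect's REST API (`POST /connectors` with a JSON body of `name` and `config`). The sketch below only builds those JSON bodies; it does not contact any server. The connector class names and property keys (`cps.project`, `cps.topic`, `defaultDataset`) are recalled from the connectors' public documentation and can differ between versions, so treat them as assumptions to verify. The project, topic, and dataset names are hypothetical placeholders.

```python
import json


def connector_payload(name: str, config: dict) -> str:
    """Build the JSON body for Kafka Connect's REST endpoint POST /connectors."""
    # Kafka Connect treats connector config values as strings.
    body = {"name": name, "config": {k: str(v) for k, v in config.items()}}
    return json.dumps(body, indent=2)


# Kafka topic -> Cloud Pub/Sub topic (connector supported by Google).
# Property names are assumptions based on the connector's documentation.
pubsub_sink = connector_payload(
    "pubsub-sink-example",  # hypothetical connector name
    {
        "connector.class": "com.google.pubsub.kafka.sink.CloudPubSubSinkConnector",
        "tasks.max": 1,
        "topics": "orders",                # hypothetical Kafka topic
        "cps.project": "my-gcp-project",   # hypothetical GCP project ID
        "cps.topic": "orders-from-kafka",  # hypothetical Pub/Sub topic
    },
)

# Kafka topic -> BigQuery table (connector supported by Confluent and WePay).
# Property names vary across connector versions; verify before use.
bigquery_sink = connector_payload(
    "bigquery-sink-example",  # hypothetical connector name
    {
        "connector.class": "com.wepay.kafka.connect.bigquery.BigQuerySinkConnector",
        "tasks.max": 1,
        "topics": "orders",
        "project": "my-gcp-project",
        "defaultDataset": "kafka_ingest",  # hypothetical BigQuery dataset
    },
)
```

Each string could then be sent as the request body to a Kafka Connect worker; authentication settings (for example, a service account key) are omitted here and would be required in practice.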
Comparing self-managed Kafka to Google Cloud Pub/Sub
Self-managed Kafka
● Open source
● Set up your own auth to protect your Kafka
● You must provision and plan for load isolation
● You must support it
● You must infer costs based on a variety of capacity
and availability patterns, and buy components (rather
than pay for usage): CPU, disk, network
● You must design and maintain your own replication
and backup setup
● Can be used as a system of record: messages can be
re-read from the beginning, so new subscribers can read
from the start (depending on retention policy)
● Order guarantees within a partition
● Large platform of streaming tools — KSQL, Schema
Registry, Connectors to/from data sources
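The "order guarantees within a partition" bullet can be made concrete with a small simulation. This is a plain-Python sketch, not Kafka client code: `assign_partition` uses `zlib.crc32` as a stand-in hash (Kafka's default partitioner for keyed records uses murmur2), and the keys, values, and partition count are hypothetical. The point it shows is that records sharing a key land in one partition and keep their relative order there, while no order is implied across partitions.

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Tuple

NUM_PARTITIONS = 3  # hypothetical partition count


def assign_partition(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition. Stand-in hash, NOT Kafka's murmur2 partitioner."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Records in the order the producer sent them: (key, value). Sample data.
records = [
    ("user-a", "cart"),
    ("user-b", "cart"),
    ("user-a", "checkout"),
    ("user-b", "checkout"),
    ("user-a", "paid"),
]

# Each partition is an append-only log, so per-partition order is send order.
partitions: Dict[int, List[Tuple[str, str]]] = defaultdict(list)
for key, value in records:
    partitions[assign_partition(key)].append((key, value))


def values_for(key: str) -> List[str]:
    """Read back one key's values from the single partition that holds them."""
    return [v for k, v in partitions[assign_partition(key)] if k == key]
```

Reading `values_for("user-a")` returns that user's events in the order they were produced, regardless of which partition the stand-in hash selects.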
Cloud Pub/Sub
● GCP only; however, the API can be emulated on a
Kafka server on-prem
● GCP IAM integration
● 24-hour on-call support, SLAs from Google, and
integrated monitoring with Stackdriver
● Transparent replication and backups for high
availability and durability
● Predictable bandwidth-based billing
● Global presence: Pub/Sub is already deployed in all
GCP data centers for consistent latency and high
availability. Today, only global is possible.
● Single service: You only worry about managing
topics and subscribers, rather than clusters
● At-least-once delivery
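At-least-once delivery means a subscriber may occasionally receive the same message more than once, so handlers are usually written to tolerate duplicates. The following is a minimal sketch of one common approach, deduplicating by message ID. It is plain Python rather than Pub/Sub client code; the function name `process_once`, the `(message_id, payload)` shape, and the sample messages are assumptions for illustration.

```python
from typing import Callable, Iterable, Optional, Set, Tuple

Message = Tuple[str, str]  # (message_id, payload); hypothetical shape


def process_once(
    messages: Iterable[Message],
    handler: Callable[[str], None],
    seen: Optional[Set[str]] = None,
) -> Set[str]:
    """Call handler for each distinct message ID, skipping redeliveries."""
    seen = set() if seen is None else seen
    for message_id, payload in messages:
        if message_id in seen:
            continue  # duplicate delivery: already handled, skip it
        handler(payload)
        seen.add(message_id)  # record only after the handler succeeded
    return seen


# Simulated delivery in which message "m1" arrives twice (sample data).
delivered = [("m1", "order-created"), ("m2", "order-paid"), ("m1", "order-created")]
handled: list = []
seen_ids = process_once(delivered, handled.append)
```

In a real deployment the set of seen IDs would need to be bounded and stored durably (for example, in a database with a time-to-live), since an in-memory set is lost on restart.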
Thank you
