Bringing Streaming Data To The Masses
Lowering The “Cost Of Admission” For Your Streaming Data Platform
San Francisco, CA
October 17th, 2018
What is Kafka?
About Me
• Bob Lehmann
• Started life as an Electrical Engineer, switched to IT 20 years ago
• Have worked with data in many capacities – sensors and controls, manufacturing process data, ERP systems, enterprise data, etc.
• Architect and manage the Enterprise DataHub at Bayer
• Live in St. Louis, MO
Who are we?
Who are we?
[Figure labels: Inbred (Parent 1), Inbred (Parent 2), Hybrid]
The Corn “Galaxy”
The Journey Starts Here
Circa 2014…
• Siloed IT org with different tech stacks (Bayer IT org > 4000)
• MANY legacy systems and platforms
• Bayer adopted “cloud first” philosophy
• Embraced open source (finally 🙂)
• Cross functional team of architects was established to define strategies and architectures
DIRECTIVE: Develop a strategy for cloud-based enterprise wide analytics
Houston, We Have A Data Problem
Legacy
• Data sprawl
• Data inconsistency
• Difficult to find data
• Can’t propagate changes fast enough
Cloud
• Increased data sprawl
• Can’t forklift applications to cloud
• Cloud apps need on-prem data and vice versa
Volume, Variety, Velocity, Veracity
Let’s Clean Up This Mess!
[Diagrams, courtesy Jay Kreps. First: point-to-point pipelines among relational databases, caches and derived stores, a relational data warehouse, ODS, Hadoop, Splunk, ActiveMQ, log aggregation, monitoring, and apps and services, connected by polling, CSV dumps, transforms, HTTP, NFS, and rsync. Second: Kafka as a central log feeding search, monitoring, real-time analytics, social graph, newsfeed, OLAP, Samza, key-value storage, Oracle, security and fraud, Hadoop, Teradata, and apps.]
The Enterprise DataHub – Original Concept
- Kafka clusters on prem and in AWS
- Datacenter agnostic
- Establish cross-datacenter connection
- Replicate across datacenters
- Apps only interact with local Kafka cluster
- Use AVRO schemas
[Diagram notes: VPN between datacenters; use Schema Registry; MirrorMaker? Maybe GCP in the future?]
Enterprise DataHub POC
Circa 2015
Confluent 1.0 / Kafka 0.8
[Architecture diagram: two Kafka clusters (Topics 1-6 on each) connected over a VPN tunnel by six MirrorMaker processes on an EC2 instance; Zookeeper, Schema Registry, and REST Proxy appear on both sides. Systems shown: Oracle, SQL Server, Postgres RDS, Cloud Foundry producers and consumers, a network monitor application, a ticketing application, and other monitoring apps.]
What a Long, Strange Trip It’s Been!
First Phase: The Launch
Second Phase: Reaching Orbit
Third Phase: Escaping Gravity
Current Phase: To Infinity And Beyond…
[Roadmap graphic labels: Kafka, Kafka Streams, KSQL, Kafka Connect, Portal, Documentation Site]
First Phase - The Launch
September, 2016
• Confluent 2.0 / Kafka 0.9
• Security via SSL certs – developed patch to dynamically load broker certs
• Replicant - Process to replace MirrorMaker
• Basic platform monitoring
• Most user interaction via command line tools
[Architecture diagram: Replicant processes on EC2 Container Service replicate Topics 1-6 between two Kafka clusters over a VPN tunnel; each cluster has a REST Proxy and Schema Registry. Platform services: core platform monitoring, documentation portal, Kafka Manager.]
Phase 1 Results
• Java/Scala Developers
• AWS skills
• Linux command line
• Data movement from on-prem to cloud
• Customer 360 – bidirectional movement
Phase 2 - Reaching Orbit
• Self Service User Portal
• Improved replication process - Replikant
• Security Improvements
• Infrastructure automation
• Monitoring for topics and consumers
• Slack integration for alerts
• Initial evaluation of Kafka Connect
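The consumer monitoring and Slack alerting bullets above can be illustrated with a minimal sketch. It is not the DataHub implementation: the offsets are hard-coded rather than read from a cluster, the group and topic names and the threshold are hypothetical, and the only external convention assumed is that a Slack incoming webhook accepts a JSON body with a `text` field. Lag is computed per partition as the log end offset minus the last committed offset.

```python
import json


def consumer_lag(end_offsets, committed_offsets):
    """Return per-partition lag: log end offset minus last committed offset."""
    lag = {}
    for partition, end in end_offsets.items():
        # A partition with no commit yet is treated as fully unread.
        committed = committed_offsets.get(partition, 0)
        lag[partition] = max(end - committed, 0)
    return lag


def slack_alert(group, topic, lag, threshold):
    """Build a Slack webhook payload, or None when total lag is within threshold."""
    total = sum(lag.values())
    if total <= threshold:
        return None
    worst = max(lag, key=lag.get)
    text = (
        f"Consumer group '{group}' is {total} messages behind on topic "
        f"'{topic}' (worst partition: {worst}, lag {lag[worst]})."
    )
    return json.dumps({"text": text})


# Hypothetical offsets for a three-partition topic.
end_offsets = {0: 1500, 1: 1480, 2: 1510}
committed = {0: 1500, 1: 900, 2: 1400}

lag = consumer_lag(end_offsets, committed)
payload = slack_alert("orders-loader", "orders", lag, threshold=500)
print(lag)
print(payload)
```

Posting the payload to a webhook URL is left out so the sketch runs without network access; in practice the threshold would be configurable per consumer group.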
[Architecture diagram: two Replikant clusters replicate Topics 1-6 between two Kafka clusters over a VPN tunnel; each Kafka cluster has a REST Proxy and Schema Registry. Platform services: core platform monitoring, documentation portal, Kafka Manager, user self-service portal, consumer and Replikant monitoring, Slack integration.]
DataHub Portal
Topic and Consumer Monitoring
Phase 2 Results
• Java/Scala Developers
• AWS skills
• Linux command line
• Data movement from on-prem to cloud
• Customer 360 – bidirectional movement
• Python, Node, etc.
• Some Analytics
• Some BI
• Event sourcing
• Company360
• Exadata CDC
• Incremental migration to the cloud
Phase 3 - Escaping Gravity
• Kubernetes / Kafka Connect
• Expansion to Google Cloud
• CDC from SAP using Informatica Data Replication
• Integration with Data Historian
• Detailed training class
[Architecture diagram: the Phase 2 platform (two Replikant clusters, two Kafka clusters with REST Proxy and Schema Registry, VPN tunnel, monitoring, portals, Kafka Manager, Slack integration) plus two Kubernetes clusters running Kafka Connect connectors (JDBC, S3, JMS, Elasticsearch) and Data Historian ingestion into EMR (Hive, Spark, Presto) over S3/Parquet.]
Kafka Connect and Kubernetes
Kafka Connect
• Code-free, simple
• Connector universe is expanding rapidly
• Secure - SSL connection
• AVRO support
Connectors In Use
• JDBC (Oracle, Postgres, MySQL, SQL Server, Teradata, Redshift)
• JMS
• S3
• File
• Elasticsearch
Kubernetes
• Highly scalable
• Cluster in each environment
• Keeps processing local to the environment
• Efficient use of resources
• Increased security
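To make the "code-free" point concrete, here is a hedged sketch of what a JDBC source connector definition can look like. Kafka Connect connectors are defined by configuration, typically submitted as JSON to the Connect REST API (`POST /connectors`). The property names below are from the Confluent JDBC source connector; the connector name, JDBC URL, table, and column are hypothetical, and credentials and SSL settings are omitted. The slides do not show the actual DataHub configurations.

```python
import json


def jdbc_source_connector(name, jdbc_url, table, id_column, topic_prefix):
    """Build the JSON body for POST /connectors on a Kafka Connect worker."""
    return {
        "name": name,
        "config": {
            "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
            "connection.url": jdbc_url,
            "table.whitelist": table,
            # Poll for new rows using a strictly increasing id column.
            "mode": "incrementing",
            "incrementing.column.name": id_column,
            # Rows from table T are written to topic <topic_prefix>T.
            "topic.prefix": topic_prefix,
            "tasks.max": "1",
        },
    }


# Hypothetical values for illustration only.
body = jdbc_source_connector(
    name="orders-jdbc-source",
    jdbc_url="jdbc:postgresql://db.example.internal:5432/sales",
    table="orders",
    id_column="order_id",
    topic_prefix="pg.",
)
print(json.dumps(body, indent=2))
```

No application code is needed beyond this definition, which is what makes connectors practical to offer through a self-service portal.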
Expansion To Other Datacenters
[Diagram labels: North America Datacenter, Greenhouse Datacenter, Future Datacenter, Different Region]
DataHub Tech Stack
Phase 3 Results
• Java/Scala Developers
• AWS skills
• Linux command line
• Data movement from on-prem to cloud
• Customer 360 – bidirectional movement
• Data movement between all datacenters on central platform
• LIMS Integration
• Serverless apps in AWS
• SAP/Oracle CDC
• Product360 in GCP - Go
• BI / Reporting
• Analytics platform
• Python, Node, etc.
• Some Analytics
• Some BI
• Event sourcing
• Company360
• Exadata CDC
• Incremental migration to the cloud
Current Phase – To Infinity And Beyond
• Bring stream processing to the masses!
• Data validation across the pipeline
• SQL interface for Kafka (using Presto)
• Improve topic discoverability and reuse
• Expose consumer metrics to end users
[Architecture diagram: two Replikant clusters and two Kafka clusters (Topics 1-6, REST Proxy, Schema Registry) joined by a VPN tunnel; two Kubernetes clusters, each grouping I/O connectors (File, JDBC, JMS, S3, Elasticsearch) and stream processing (KSQL, Kafka Streams, custom stream processors); Data Historian ingestion into EMR (Hive, Spark, Presto) over S3/Parquet; Presto SQL engine; Haystack metadata platform.]
Many Clients In Many Places
[Architecture diagram: the core platform managed by the DataHub team (Replikant clusters, Kafka clusters with REST Proxy and Schema Registry, VPN tunnel, monitoring, portals, Kafka Manager, Slack integration) surrounded by clients: GoldenGate CDC, Oracle, SQL Server, Teradata, Exadata, Neo4J, Postgres, MySQL, Cassandra, Redshift, EMR with S3/Parquet, legacy apps (WebLogic), Cloud Foundry producers, consumers, and applications, and an integration with salesforce.com.]
“First Mile” Processing
Automatic ingestion
Use Case – CDC From SAP to Data Historian
Schema Splitter converts an input stream with a “generic” schema (many different tables flowing through one topic) to individual table streams with table-specific schemas.
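A minimal sketch of the splitting idea follows. The slides do not show the generic envelope or the Schema Splitter's code, so everything here is assumed for illustration: the record layout (`table`, `op`, `columns`), the `sap.` topic prefix, and the type inference from Python values. The real component presumably registers Avro schemas with the Schema Registry rather than building dictionaries.

```python
from collections import defaultdict


def split_by_table(generic_records, topic_prefix="sap."):
    """Route generic change records to per-table streams and infer each table's fields."""
    streams = defaultdict(list)  # output topic -> table-specific records
    schemas = {}                 # output topic -> {field name: type name}
    for record in generic_records:
        topic = topic_prefix + record["table"].lower()
        row = dict(record["columns"])
        row["_op"] = record["op"]  # keep the change type with the row
        streams[topic].append(row)
        fields = schemas.setdefault(topic, {})
        for name, value in row.items():
            # First value seen for a field decides its type in this sketch.
            fields.setdefault(name, type(value).__name__)
    return dict(streams), schemas


# Hypothetical change records for two SAP tables sharing one input topic.
generic = [
    {"table": "KNA1", "op": "I", "columns": {"KUNNR": "0001", "NAME1": "Acme"}},
    {"table": "MARA", "op": "U", "columns": {"MATNR": "M-42", "BRGEW": 1.5}},
    {"table": "KNA1", "op": "D", "columns": {"KUNNR": "0002", "NAME1": "Beta"}},
]

streams, schemas = split_by_table(generic)
for topic in sorted(streams):
    print(topic, schemas[topic], len(streams[topic]))
```

Each output stream now carries only one table's fields, which is what lets downstream filters, aggregations, and sinks work with a fixed schema per topic.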
[Architecture diagram. Components shown: SAP, Informatica Data Replication, two Kafka clusters with Schema Registry and Zookeeper, Replikant instances across a VPN tunnel, a Kubernetes cluster running the Schema Splitter, KSQL filter, KSQL aggregation, and a Kafka Streams processor, Data Historian ingestion into EMR (Hive, Spark, Presto) over S3/Parquet, Oracle ETL, Teradata staging and final, and per-table outputs (DB Schemas 1-4, Tables 1-3).]
Topic Reuse
• Not as good as we would like. Why?
• Discoverability
• Developers are not altruistic when creating schemas
MetaData Platform
• Haystack is our enterprise metadata platform
• Kafka topic metadata is automatically synced to Haystack
• Haystack links back to the DataHub portal
• Will be able to search for topics in Haystack and immediately find the topic in the DataHub portal
Presto
• Presto is being implemented as an enterprise data virtualization solution, not just for the DataHub
• Will also be used to provide data validation across the pipeline via SQL
  Example: Join a topic in Kafka to a table in Postgres to confirm that all data has transferred correctly
• Will also be used to provide a SQL interface in the DataHub portal to allow querying for specific messages
• Developed a patch to the Presto Kafka connector to connect with SSL and deserialize AVRO
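The validation example above amounts to an anti-join: find every key present in the topic that is missing from the target table. The sketch below shows the query shape using Python's built-in SQLite as a stand-in so it runs anywhere. In Presto the two tables would be referenced through their Kafka and Postgres catalogs instead; the table and column names here are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE kafka_orders (order_id INTEGER, amount REAL);    -- stands in for the Kafka topic
    CREATE TABLE postgres_orders (order_id INTEGER, amount REAL); -- stands in for the Postgres table
    INSERT INTO kafka_orders VALUES (1, 10.0), (2, 20.0), (3, 30.0);
    INSERT INTO postgres_orders VALUES (1, 10.0), (3, 30.0);
""")

# Anti-join: messages in the topic with no matching row in the target table.
MISSING_SQL = """
    SELECT k.order_id
    FROM kafka_orders AS k
    LEFT JOIN postgres_orders AS p ON p.order_id = k.order_id
    WHERE p.order_id IS NULL
    ORDER BY k.order_id
"""

missing = [row[0] for row in conn.execute(MISSING_SQL)]
print("order_ids missing from the target table:", missing)
```

An empty result means every message key reached the table; comparing row counts alone would miss the case where a dropped row is offset by a duplicate.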
Future Tech Stack
Phase 4 – Current Phase
• Product360 in GCP - Go
• BI / Reporting
• Analytics platform
• Java/Scala Developers
• AWS skills
• Linux command line
• Data movement from on-prem to cloud
• Customer 360 – bidirectional movement
• Data Stewards
• Everyone else!!
• Global streaming
• IOT data
• SAP Hana
• Data movement between all datacenters on central platform
• LIMS Integration
• Serverless apps in AWS
• SAP/Oracle CDC
• Python, Node, etc.
• Some Analytics
• Some BI
• Event sourcing
• Company360
• Exadata CDC
• Incremental migration to the cloud
Future
• Move as much ETL as possible to the streaming layer
• Monitoring and auditing of data flow across the pipeline
• Consumer monitoring and configurable alerting
• Improved data governance
• Integrate with enterprise security platform
The Enterprise DataHub is a living, scalable, robust central nervous system for data that facilitates the seamless acquisition, transport and processing of information in real time across multiple datacenter and cloud environments.
THANK YOU!
Bob Lehmann
robert.lehmann@bayer.com

