Arcadia Data. Proprietary and Confidential
New Approaches for Fraud Detection on Apache Kafka and KSQL
September 20, 2018
Featured Speakers
Dale Kim, Sr. Director, Products/Solutions, Arcadia Data
Chong Yan, Solutions Architect, Confluent
Before We Begin Our Presentation
● If you have any questions along the way, please type them into the chat window.
● If you have audio problems, please chat us for help.
● A recording of this presentation will be sent to you in a few days.
● Please live tweet! @arcadiadata @confluentinc
First, Let’s Review the Goals of Fraud Detection
● Primary goals include
• Reduce losses due to fraud
• Reduce rate of fraudulent activity versus legitimate activity
• Cost of fraud often goes beyond the cost of the transaction
• Retain high rate of approved transactions
• Reduce false positives (in this case, legitimate activities flagged as potentially fraudulent)
• Customer experience could be impacted by false positives
● What can be done?
• Enable a larger user base for monitoring for fraud
• Identify risky transactions sooner (i.e., in real time)
• Evolve “better” algorithms (beyond scope of this talk)
Example Fraud Signals
● Fraud is largely about anomaly detection
• Outlier or unexpected events that signal potential fraud
• Anomalies across a population, not only for individuals
● Examples
• Unusually large transactions
• Unusual timing of transactions
• Consistent groupings of transactions
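As a rough illustration of the "unusually large transactions" signal, the sketch below flags amounts that sit far from the median of a population. This is not a method from the talk: the function name, the sample amounts, and the 3.5 cutoff are all assumptions, and the modified z-score based on the median absolute deviation is just one common way to spot outliers.

```python
from statistics import median

def flag_large_transactions(amounts, threshold=3.5):
    """Return the indices of amounts whose modified z-score exceeds threshold.

    Hypothetical helper for illustration; threshold=3.5 is an assumed cutoff.
    """
    med = median(amounts)
    # Median absolute deviation: a spread measure that one huge value cannot distort much.
    mad = median(abs(a - med) for a in amounts)
    if mad == 0:
        return []  # at least half the amounts equal the median; nothing to compare against
    # 0.6745 rescales the MAD so the score is comparable to a standard z-score.
    return [i for i, a in enumerate(amounts)
            if 0.6745 * (a - med) / mad > threshold]

# Made-up sample data: seven ordinary purchases and one very large one.
amounts = [12.5, 40.0, 18.2, 25.0, 31.9, 22.4, 950.0, 15.0]
print(flag_large_transactions(amounts))
```

Only the large values are flagged here because the comparison is one-sided; the same idea can be applied to timing (for example, hour of day) instead of amount.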
What about Trends/Patterns?
● Does a rise in transactions in the past 30 minutes look suspicious?
● Can this be captured in a batch environment?
● Fraud detection goes beyond just a fraud team
● Analyzing marketing data might lead to insights into fraud
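The "rise in the past 30 minutes" question can be sketched as a trailing-window counter that is updated on every event rather than once per batch. The class name, the baseline, and the alert factor below are hypothetical, and the sketch assumes timestamps (in seconds) arrive in order.

```python
from collections import deque

class RecentActivityMonitor:
    """Count events in a trailing time window and compare against a baseline.

    Illustrative only: baseline_per_window and factor are assumed tuning knobs.
    """

    def __init__(self, window_seconds=1800, baseline_per_window=10.0, factor=3.0):
        self.window_seconds = window_seconds
        self.baseline = baseline_per_window  # expected number of events per window
        self.factor = factor                 # how many times the baseline counts as a "rise"
        self.events = deque()

    def record(self, ts):
        """Register an event at time ts; return True if the window looks suspicious."""
        self.events.append(ts)
        # Drop events that are no longer inside the trailing window.
        while self.events and self.events[0] <= ts - self.window_seconds:
            self.events.popleft()
        return len(self.events) > self.baseline * self.factor

# Made-up timestamps: a burst within 30 minutes, then a quiet period.
monitor = RecentActivityMonitor(window_seconds=1800, baseline_per_window=2, factor=2)
alerts = [monitor.record(ts) for ts in [0, 600, 1200, 1500, 1700, 1750, 4000]]
print(alerts)
```

A nightly batch job would only see this burst hours later, which is the gap the slide is pointing at.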
Key Requirements for Fraud Detection
● Your approach should not be limited to static data and only known patterns and signatures
● Fraud detection must be holistic across all data, and requires exploration
● Your system should provide BI-style dashboards and reports
● Goals must be tied in with customer acquisition and revenue strategies
Credit Card Transaction Analysis
Get an Overview of User Behavior
Visualize Data Points in Any Manner
Monitor Click Activity
Drill Down to Known Malicious Users
Traditional Streaming Architectures for Transactions
[Diagram 1 components: Data Sources; Kafka Cluster (Source Topics); Stream Processing System (Job Version N, Job Version N+1); Serving DB (Output Table N, Output Table N+1); Analytics App (Queries/Responses, Future Queries/Responses)]
[Diagram 2 components: Data Sources; Kafka Cluster (Source Topics); Analytics App (Stream Processing Framework, Custom End User Interface); Responses]
Native Streaming Visualizations Architecture
[Diagram components: Data Sources; Kafka Cluster (Source Topics); KSQL Cluster (SQL engine); Visual Analytics / BI App (Queries/Responses)]
Quick Poll
New Approaches for Fraud Detection on Apache Kafka and KSQL
Chong Yan, Solutions Architect, Confluent
What is Apache Kafka®?
Way More Than Messaging
● True Storage
● Real-Time Processing
● Scalability
● Messaging done right.
Kafka is much more than messaging
Kafka is a blend of messaging, stream processing, ETL and modern database designs built around a distributed log.
● Pub/Sub Messaging: IBM MQ, TIBCO, RabbitMQ
● ETL Connectors: Mulesoft, Talend, Informatica
● Stream Processing: Spark, Flink, Beam
+ Distributed clustered storage
+ Streaming platform
+ Exactly once
+ Designed for the cloud
+ Inter-DC replication
+ Schema evolution
We are challenging old assumptions...
Big Data was: The More the Better [chart: Value of Data vs. Volume of Data]
Stream Data is: The Faster the Better [chart: Value of Data vs. Age of Data]
KSQL from Confluent
KSQL
An Open Source Streaming SQL Engine for Apache Kafka®
KSQL: The streaming SQL engine for Apache Kafka from Confluent
KSQL is the simplest way to process streams of data in real time
✓ All you need is SQL
✓ No separate processing cluster required
✓ Powered by Kafka: elastic, scalable, distributed, battle tested
✓ Perfect for streaming ETL, anomaly detection, event monitoring and more
✓ Part of Confluent Open Source
https://github.com/confluentinc/ksql
CREATE TABLE possible_fraud AS
  SELECT card_number, count(*)
  FROM authorization_attempts
  WINDOW TUMBLING (SIZE 5 SECONDS)
  GROUP BY card_number
  HAVING count(*) > 3;
CREATE STREAM vip_actions AS
  SELECT userid, page, action
  FROM clickstream c
  LEFT JOIN users u
  ON c.userid = u.userid
  WHERE u.level = 'Platinum';
KSQL: The streaming SQL engine for Apache Kafka from Confluent
• Enables stream processing with SQL-like syntax
• The simplest way to process streams of data in real time
• Powered by Kafka: scalable, distributed, battle tested
• All you need is Kafka: no complex deployments of bespoke systems for stream processing
ksql>
KSQL: The simplest way to do stream processing
CREATE TABLE possible_fraud AS
  SELECT card_number, count(*)
  FROM authorization_attempts
  WINDOW TUMBLING (SIZE 5 SECONDS)
  GROUP BY card_number
  HAVING count(*) > 3;
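To make the semantics of the windowed `possible_fraud` query concrete, here is a small Python sketch of the same logic: assign each authorization attempt to a 5-second tumbling window, count per card, and keep counts above 3. It is a batch simulation over a finite list, not how KSQL executes (KSQL updates the result continuously as events arrive), and the function name and sample events are made up.

```python
from collections import Counter

def possible_fraud(attempts, window_seconds=5, min_count=3):
    """Simulate the windowed count over (timestamp_seconds, card_number) pairs.

    Returns {(window_start, card_number): count} for counts above min_count.
    """
    counts = Counter()
    for ts, card in attempts:
        # Tumbling windows are fixed-size and non-overlapping,
        # so every event falls into exactly one window.
        window_start = (ts // window_seconds) * window_seconds
        counts[(window_start, card)] += 1
    # Mirrors HAVING count(*) > 3 in the query.
    return {key: n for key, n in counts.items() if n > min_count}

# Hypothetical events: card A is tried four times within the first 5 seconds.
attempts = [(0, "A"), (1, "A"), (2, "A"), (4, "A"),
            (5, "A"), (6, "B"), (7, "B"), (8, "B")]
print(possible_fraud(attempts))
```

Card B's three attempts do not qualify because the condition is strictly greater than 3, and card A's fifth attempt lands in the next window.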
KSQL: The simplest way to do stream processing
1. Streaming ETL
CREATE STREAM vip_actions AS
  SELECT userid, page, action
  FROM clickstream c
  LEFT JOIN users u ON c.userid = u.user_id
  WHERE u.level = 'Platinum';
2. Anomaly detection
CREATE TABLE possible_fraud AS
  SELECT card_number, count(*)
  FROM authorization_attempts
  WINDOW TUMBLING (SIZE 5 SECONDS)
  GROUP BY card_number
  HAVING count(*) > 3;
3. Monitoring
CREATE TABLE error_counts AS
  SELECT error_code, count(*)
  FROM monitoring_stream
  WINDOW TUMBLING (SIZE 1 MINUTE)
  WHERE type = 'ERROR'
  GROUP BY error_code;
KSQL Concepts
● STREAM and TABLE as first-class citizens
• Interpretations of topic content
● STREAM – data in motion
● TABLE – collected state of a stream
• One record per key (per window)
• Current values (compacted topic) ← Not yet in KSQL
● STREAM – TABLE Joins
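The STREAM/TABLE distinction and the stream-table join can be sketched in a few lines of Python. This is a conceptual model only: the function names and sample records are hypothetical, and the table is fully built before the join runs, whereas in a real deployment the table keeps changing while events flow.

```python
def materialize_table(changelog):
    """Fold a stream of (key, value) records into a table: the latest value per key."""
    table = {}
    for key, value in changelog:
        table[key] = value  # a later record for the same key overwrites the earlier one
    return table

def stream_table_left_join(stream, table):
    """Pair each stream event with the table row for its key (None if there is no row)."""
    for key, event in stream:
        yield key, event, table.get(key)

# Hypothetical user updates: u1 is upgraded from Gold to Platinum.
user_updates = [("u1", {"level": "Gold"}),
                ("u2", {"level": "Platinum"}),
                ("u1", {"level": "Platinum"})]
users = materialize_table(user_updates)

# Hypothetical click events keyed by user id; u3 has no row in the table.
clicks = [("u1", "/home"), ("u3", "/cart"), ("u2", "/pay")]
vip_actions = [(uid, page)
               for uid, page, row in stream_table_left_join(clicks, users)
               if row is not None and row["level"] == "Platinum"]
print(vip_actions)
```

The stream keeps every event, while the table holds one record per key, which is why the three user updates collapse to two rows.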
Window Aggregations
Three types supported (same as Kafka Streams):
● TUMBLING: Fixed-size, non-overlapping, gapless windows
  SELECT ip, count(*) AS hits
  FROM clickstream WINDOW TUMBLING (SIZE 1 MINUTE)
  GROUP BY ip;
● HOPPING: Fixed-size, overlapping windows
  SELECT ip, SUM(bytes) AS bytes_per_ip_and_bucket
  FROM clickstream WINDOW HOPPING (SIZE 20 SECONDS, ADVANCE BY 5 SECONDS)
  GROUP BY ip;
● SESSION: Dynamically-sized, non-overlapping, data-driven windows
  SELECT ip, SUM(bytes) AS bytes_per_ip
  FROM clickstream WINDOW SESSION (20 SECONDS)
  GROUP BY ip;
More: http://docs.confluent.io/current/streams/developer-guide.html#windowing
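The difference between hopping and session windows is easier to see with numbers. The sketch below is plain Python, not KSQL or Kafka Streams code; both function names are hypothetical. It assumes windows are aligned to time zero and treats an event exactly one gap away from the previous one as part of the same session, which is an assumption about the boundary rather than a documented guarantee.

```python
def hopping_windows(ts, size, advance):
    """Return the start times of every window [start, start + size) that contains ts."""
    start = (ts // advance) * advance  # latest window start at or before ts
    starts = []
    while start > ts - size and start >= 0:
        starts.append(start)
        start -= advance
    return sorted(starts)

def session_windows(timestamps, gap):
    """Group timestamps into sessions that are separated by more than gap of inactivity."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)   # still within the inactivity gap: extend the session
        else:
            sessions.append([ts])     # gap exceeded: start a new session
    return sessions

# With size 20 and advance 5, an event at t=23 is counted in four overlapping windows.
print(hopping_windows(23, size=20, advance=5))
# With a gap of 20, these made-up events form three sessions of different lengths.
print(session_windows([0, 5, 18, 50, 60, 100], gap=20))
```

A tumbling window is the special case of a hopping window where the advance equals the size, so each event lands in exactly one window.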
1) How to run KSQL: Standalone aka “local mode”
• Starts a CLI, an engine and a REST server all in the same JVM
• Ideal for laptop development
• Start with default settings:
> bin/ksql-cli local
• Or with customized settings:
> bin/ksql-cli local --properties-file foo/bar/ksql.properties
2) How to run KSQL: Client-Server
• Start any number of server nodes
> bin/ksql-server-start
• Start any number of CLIs and specify “remote” server address
> bin/ksql-cli remote http://myserver:8090
• All running engines share the processing load
• Technically, instances of the same Kafka Streams applications
• Scale up/down without restart
3) How to run KSQL: As an application
• Start any number of engine instances
• Pass a file of KSQL statements to execute
> bin/ksql-node query-file=foo/bar.sql
• Ideal for streaming ETL application deployment
• Version control your queries and transformations as code
• All running engines share the processing load
• Technically, instances of the same Kafka Streams applications
• Scale up/down without restart
Arcadia Data – A Visualization Layer for KSQL
Demo
Try Out the Software Yourself
Go to: https://www.arcadiadata.com/product/streaming-visualizations/
San Francisco – October 16-17, 2018
Presented by
Kafka Community Discount Code
KS18COMM25 for 25% off
www.kafka-summit.org
Thank You!
Be sure to also visit:
Arcadia Data (@arcadiadata)
• Try Arcadia Instant (free download): https://www.arcadiadata.com/instant
• Get started with Arcadia Data on KSQL: https://www.arcadiadata.com/resources/knowledge-base/
• Read Arcadia Data blog posts: https://www.arcadiadata.com/blog
Confluent (@confluentinc)
• Try Confluent KSQL (free download): https://cnfl.io/ksql
• Sign up for the Confluent community Slack (#ksql): https://cnfl.io/slack
• Read Confluent blog posts: https://cnfl.io/blog
