Real Time
Fraud Detection
Patterns and reference architectures
Ted Malaska // PSA
Gwen Shapira // Software Engineer
2
Overview
• Intro
• Review Problem
• Quick overview of key technology
• High level architecture
• Deep Dive into NRT Processing
• Completing the Puzzle – Micro-batch, Ingest and Batch
©2014 Cloudera, Inc. All rights reserved.
3
Gwen Shapira
• 15 years of moving data
• Formerly consultant
• Now Cloudera Engineer:
– Sqoop Committer
– Kafka
– Flume
• @gwenshap
4
Hello
• Ted Malaska (PSA at Cloudera)
• Hadoop for ~5 years
• Contributed to
– HDFS, MapReduce, YARN, HBase, Spark, Avro,
– Kite, Pig, Navigator, Cloudera Manager, Flume, Kafka, Sqoop, Accumulo
– And working on a Sentry patch
• Co-author of O’Reilly’s Hadoop Application Architectures
• Worked with about 70 companies in 8 countries
• Marvel Fan Boy
• Runner
5
The Problem
6
Credit Card Transaction Fraud
7
IKEA Meatballs
8
Coupon Fraud
9
Video Game Strategy
10
Health Insurance Fraud
11
Review of the Problem
• Typical Atomic Card Fraud Detection
• IKEA Meatballs
• Multi-Coupon Combinations
• OP or Negative Video Game Strategies
• Ad Serving
• Health Insurance Fraud
• Kid Coming Home From School
12
How Do We React?
• Human Brain at Tennis
– Muscle Memory
– Reaction Thought
– Reflective Meditation
13
Overview of Key Technologies
14
Kafka
15
The Basics
• Messages are organized into topics
• Producers push messages
• Consumers pull messages
• Kafka runs in a cluster. Nodes are called brokers
16
Topics, Partitions and Logs
17
Each partition is a log
18
Each Broker has many partitions
[Diagram: three brokers, each holding a copy of Partition 0, Partition 1, and Partition 2]
19
Producers load balance between partitions
[Diagram: a client and three brokers, each holding Partition 0, Partition 1, and Partition 2]
20
Producers load balance between partitions
[Diagram: same layout as slide 19]
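The load balancing on slides 19 and 20 can be pictured as a producer cycling through a topic's partitions. The following is a minimal sketch in plain Scala, not Kafka's client code: `RoundRobinBalancer` and `nextPartition` are hypothetical names, and round-robin is only one strategy a producer might use.

```scala
// Hypothetical illustration: cycle through partitions so load spreads evenly.
final class RoundRobinBalancer(numPartitions: Int) {
  require(numPartitions > 0, "a topic needs at least one partition")
  private var next = 0

  // Returns the partition for the next message and advances the cursor.
  def nextPartition(): Int = {
    val chosen = next
    next = (next + 1) % numPartitions
    chosen
  }
}
```

With three partitions, six consecutive calls to `nextPartition()` visit partitions 0, 1, 2, 0, 1, 2, so no single partition absorbs all the writes.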
21
Consumers
[Diagram: a Kafka cluster with one topic split into Partition A, Partition B, and Partition C (each a file). Consumers in Consumer Group X and Consumer Group Y read the partitions; each partition carries an OffsetX and an OffsetY.]
• Order is retained within a partition, but not across partitions
• Offsets are kept per consumer group
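The per-group offsets on the Consumers slide can be modelled in a few lines. This is a sketch under stated assumptions, not Kafka's consumer API: `PartitionLog`, `poll`, and `offsetOf` are hypothetical names, and a single in-memory vector stands in for one partition.

```scala
// Hypothetical model of one partition log read by independent consumer groups.
final class PartitionLog(messages: Vector[String]) {
  // One offset per consumer group; groups do not affect each other.
  private val offsets = scala.collection.mutable.Map.empty[String, Int]

  // Returns up to `max` messages for the group, in log order, and advances
  // only that group's offset.
  def poll(group: String, max: Int): Vector[String] = {
    val from = offsets.getOrElse(group, 0)
    val batch = messages.slice(from, from + max)
    offsets(group) = from + batch.size
    batch
  }

  def offsetOf(group: String): Int = offsets.getOrElse(group, 0)
}
```

Because the log itself is never modified by a read, group X can be far ahead of group Y without either one affecting what the other sees.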
22
Flume
23
Short Intro to Flume
Flume Agent
• Sources: Twitter, logs, JMS, webserver, Kafka
• Interceptors: Mask, re-format, validate…
• Selectors: DR, critical
• Channels: Memory, file, Kafka
• Sinks: HDFS, HBase, Solr
24
Flume and/or Kafka
[Diagram: Upstream → Flume Source → Interceptor → Selector → Flume Channel → Flume Sink → Downstream. Three of the stages are each labeled "Can Be Kafka".]
25
Interceptors
• Mask fields
• Validate information against external source
• Extract fields
• Modify data format
• Filter or split events
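Two of the interceptor duties listed on slide 25, masking a field and filtering events, can be sketched as small composable functions. This is plain Scala rather than the Flume interceptor API: `Event`, `mask`, `requireField`, and `chain` are hypothetical names chosen for illustration.

```scala
// Hypothetical event type and interceptor chain; not the Flume API.
final case class Event(fields: Map[String, String])

object Interceptors {
  // An interceptor either transforms an event or drops it (None).
  type Interceptor = Event => Option[Event]

  // Mask all but the last four characters of a field, if present.
  def mask(field: String): Interceptor = e =>
    Some(e.fields.get(field) match {
      case Some(v) =>
        val hidden = "*" * math.max(0, v.length - 4) + v.takeRight(4)
        Event(e.fields.updated(field, hidden))
      case None => e
    })

  // Filter out events that lack a required field.
  def requireField(field: String): Interceptor = e =>
    if (e.fields.contains(field)) Some(e) else None

  // Run interceptors in order, stopping as soon as one drops the event.
  def chain(steps: Interceptor*): Interceptor = e =>
    steps.foldLeft(Option(e))((acc, step) => acc.flatMap(step))
}
```

A pipeline such as `Interceptors.chain(Interceptors.requireField("card"), Interceptors.mask("card"))` drops events with no card field and masks the rest before they reach a channel.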
26
Spark Streaming
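27
Spark Streaming Example
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Two local threads: one for the receiver, one for processing
val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(1))  // 1-second micro-batches
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)  // counts within each batch
wordCounts.print()
ssc.start()
ssc.awaitTermination()  // keep the job running until it is stopped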
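28
Spark Streaming Example
import org.apache.spark.{SparkConf, SparkContext}

// Batch counterpart of the streaming example: same transformations on an RDD
val conf = new SparkConf().setMaster("local[2]").setAppName("WordCount")
val sc = new SparkContext(conf)
val lines = sc.textFile(path, 2)  // path: input location, defined elsewhere
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.collect().foreach(println)  // RDDs have no print(); collect to the driver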
29
Spark Streaming
[Diagram: a DStream shown across three batches (pre-first, first, second). In each batch a Source feeds a Receiver that produces an RDD; RDDs from earlier batches go through a single pass of Filter → Count → Print.]
30
Spark Streaming
[Diagram: the same three batches as slide 29 (pre-first, first, second), each with Source → Receiver → RDD and a single pass of Filter → Count → Print. The first batch also shows Stateful RDD 1; the second batch shows Stateful RDD 1 and Stateful RDD 2.]
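The stateful RDDs on slide 30 suggest state that is carried from one micro-batch to the next, as Spark Streaming's `updateStateByKey` does. The sketch below shows that idea with plain Scala collections standing in for RDDs; `RunningCounts` and its methods are hypothetical names, and this is not how Spark implements it internally.

```scala
// Hypothetical sketch: per-batch counts folded into running state, in the
// spirit of a stateful DStream. Plain collections stand in for RDDs.
object RunningCounts {
  type State = Map[String, Int]

  // Count words in one micro-batch (the stateless, single-pass part).
  def countBatch(batch: Seq[String]): State =
    batch.groupBy(identity).map { case (w, ws) => w -> ws.size }

  // Merge one batch's counts into the state carried from earlier batches.
  def update(state: State, batch: Seq[String]): State =
    countBatch(batch).foldLeft(state) { case (acc, (w, n)) =>
      acc.updated(w, acc.getOrElse(w, 0) + n)
    }

  // Emit the state after every batch: state 1, state 2, and so on.
  def run(batches: Seq[Seq[String]]): Seq[State] =
    batches.scanLeft(Map.empty[String, Int])(update).tail
}
```

Each emitted state depends on the previous state plus the new batch, which is why the second batch in the diagram shows both stateful RDDs.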
31
Spark Streaming and HBase
[Diagram: the Driver distributes configs to two worker nodes. On each worker node an Executor holds the configs and an HConnection in static space, shared by its tasks.]
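The "static space" in the slide 31 diagram maps naturally onto a Scala `object`, which exists once per JVM and is therefore shared by every task in an executor. The sketch below assumes a hypothetical `ExpensiveConnection` class in place of a real HBase connection; `ConnectionHolder` and `getOrCreate` are illustrative names, not part of Spark or HBase.

```scala
// Hypothetical stand-in for an expensive client such as an HBase connection.
final class ExpensiveConnection(val config: Map[String, String])

// A Scala `object` is a per-JVM singleton, so every task running in the same
// executor process shares whatever it holds.
object ConnectionHolder {
  private var connection: Option[ExpensiveConnection] = None
  private var created = 0

  // Create the connection on first use, then reuse it for all later tasks.
  def getOrCreate(config: Map[String, String]): ExpensiveConnection = synchronized {
    connection.getOrElse {
      created += 1
      val c = new ExpensiveConnection(config)
      connection = Some(c)
      c
    }
  }

  def timesCreated: Int = synchronized(created)
}
```

Tasks call `ConnectionHolder.getOrCreate(configs)` instead of opening their own connection, so the setup cost is paid once per executor rather than once per task or per record.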
32
High Level Architecture
33
Real-Time Event Processing Approach
[Architecture diagram. Recoverable labels:]
• Clients ("Swipe here!") and Web App
• Hadoop Cluster I and Hadoop Cluster II (Storage, Processing)
• Components: Flume Agents, Local Cache, HBase / Memory, Kafka, Spark Streaming, HDFS, Solr, Hive/Impala, MapReduce, Spark, Search
• Flow labels: Fetching & Updating Profiles, Adjusting NRT Stats, HDFSEventSink, Solr Sink, Batch Time Adjustments
• Automated & Manual Analytical Adjustments and Pattern Detection
• Automated & Manual Review of NRT Changes and Counters
34
NRT Processing
35
Focus on NRT First
[Same architecture diagram as slide 33, highlighting: NRT Event Processing with Context]
36
Streaming Architecture – NRT Event Processing
[Diagram: a KafkaConsumer reads the Kafka Initial Events Topic into Flume Sources. A Flume Interceptor runs the Event Processing Logic, using Local Memory and an HBase Client that talks to HBase. A KafkaProducer writes to the Kafka Answer Topic.]
• Able to respond within tens of milliseconds
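The event processing logic on slide 36 pairs local memory with an HBase client. One plausible reading is a read-through cache: check memory first and fall back to the store only on a miss. The sketch below assumes that reading; `ProfileStore` stands in for an HBase-backed lookup, and `Profile`, `Txn`, `EventProcessor`, and the threshold rule are all hypothetical.

```scala
// Hypothetical types; `ProfileStore` stands in for an HBase-backed lookup.
final case class Profile(cardId: String, avgAmount: Double)
final case class Txn(cardId: String, amount: Double)

trait ProfileStore { def fetch(cardId: String): Option[Profile] }

final class EventProcessor(store: ProfileStore, threshold: Double) {
  // Local memory: profiles already fetched by this processor instance.
  private val cache = scala.collection.mutable.Map.empty[String, Profile]

  private def profileFor(cardId: String): Option[Profile] =
    cache.get(cardId).orElse {
      val fetched = store.fetch(cardId) // slower remote lookup on a miss
      fetched.foreach(p => cache.update(cardId, p))
      fetched
    }

  // Flag a transaction far above the card's usual amount (illustrative rule).
  def isSuspicious(t: Txn): Boolean =
    profileFor(t.cardId).exists(p => t.amount > p.avgAmount * threshold)
}
```

Only the first event for a card pays for the remote fetch; later events for the same card are answered from memory, which is what makes a response within tens of milliseconds realistic.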
37
Partitioned NRT Event Processing
[Diagram: the slide 36 pipeline, with the Kafka topic split into Partition A, Partition B, and Partition C. Each of the three producers has its own partitioner.]
• Custom Partitioner
• Better use of local memory
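A custom partitioner improves local-memory use when it sends every event for the same key to the same partition, so the processor reading that partition already holds the relevant profile. The sketch below is a minimal key-hash partitioner in plain Scala; `CardPartitioner` is a hypothetical name and does not implement Kafka's partitioner interface.

```scala
// Hypothetical key-based partitioner; not Kafka's Partitioner interface.
object CardPartitioner {
  // Map a card ID to a partition. floorMod keeps the result non-negative
  // even when hashCode is negative.
  def partitionFor(cardId: String, numPartitions: Int): Int = {
    require(numPartitions > 0, "a topic needs at least one partition")
    java.lang.Math.floorMod(cardId.hashCode, numPartitions)
  }
}
```

The mapping is deterministic for a fixed partition count, so a given card's events always reach the same consumer and its cached profile.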
38
Completing the Puzzle
39
Micro Batching
[Same architecture diagram as slide 33, with three regions highlighted as Micro Batching]
40
Complex Topologies
1.3: Kafka Initial Events Topic → Spark Streaming (Kafka Direct Connection) → DAG Topologies
• Manages offsets
• Stores offsets in the RDD
• No longer needs HDFS for initial RDD checkpointing
1.2: Kafka Initial Events Topic → Spark Streaming (Kafka Receivers) → DAG Topologies
• Lets Kafka manage offsets
• Uses HDFS for initial RDD recovery
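When the streaming job manages offsets itself, as the 1.3 bullets describe, each micro-batch can be described by an explicit offset range per partition. The sketch below illustrates that bookkeeping in plain Scala; `BatchRange`, `DirectOffsets`, and the `maxPerPartition` cap are hypothetical and are not Spark's API.

```scala
// Hypothetical sketch of offset bookkeeping for one micro-batch.
final case class BatchRange(partition: Int, fromOffset: Long, untilOffset: Long) {
  def size: Long = untilOffset - fromOffset
}

object DirectOffsets {
  // For each partition, the next batch covers [last processed, latest),
  // capped at maxPerPartition messages. Unknown partitions start at offset 0.
  def nextBatch(
      processed: Map[Int, Long],
      latest: Map[Int, Long],
      maxPerPartition: Long
  ): Seq[BatchRange] =
    latest.toSeq.sortBy(_._1).map { case (partition, end) =>
      val from = processed.getOrElse(partition, 0L)
      BatchRange(partition, from, math.min(end, from + maxPerPartition))
    }

  // After a batch succeeds, its end offsets become the new starting points.
  def commit(processed: Map[Int, Long], batch: Seq[BatchRange]): Map[Int, Long] =
    processed ++ batch.map(r => r.partition -> r.untilOffset)
}
```

Because a batch is fully determined by its ranges, a failed batch can be re-read from Kafka using the same ranges instead of being recovered from a copy of the data stored elsewhere.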
41
MicroBatch Bad-Input Handling
[Diagram: DAG topologies connected to four Kafka topics, each drawn as a log with offsets 0 to 13: incoming events topic, bad events topic, resolved events topic, results topic]
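The four topics on slide 41 imply that records which cannot be processed are diverted rather than failing the whole micro-batch. The sketch below shows that split in plain Scala, assuming a hypothetical `cardId,amount` record format; `BadInputRouter`, `parse`, and `route` are illustrative names, and writing to the actual topics is left out.

```scala
import scala.util.Try

// Hypothetical record format "cardId,amount"; topic names mirror the slide.
object BadInputRouter {
  final case class Parsed(cardId: String, amount: Double)

  // Left keeps the raw record so it can be inspected and fixed later.
  def parse(raw: String): Either[String, Parsed] =
    raw.split(",", -1) match {
      case Array(id, amt) if id.nonEmpty =>
        Try(amt.trim.toDouble).toOption
          .map(a => Parsed(id, a))
          .toRight(raw)
      case _ => Left(raw)
    }

  // Split one micro-batch: unparseable records go to the bad events topic,
  // the rest continue toward the results topic.
  def route(batch: Seq[String]): (Seq[String], Seq[Parsed]) = {
    val parsed = batch.map(parse)
    (parsed.collect { case Left(raw) => raw }, parsed.collect { case Right(p) => p })
  }
}
```

Records that are later corrected can be fed back in through the resolved events topic and pass through the same `parse` step.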
42
Ingestion
[Same architecture diagram as slide 33, with two regions highlighted as Ingestion]
43
Ingestion
[Diagram: a Kafka cluster topic with Partition A, Partition B, and Partition C feeds three groups of Flume sinks: HDFS sinks writing to HDFS, Solr sinks writing to Solr, and HBase sinks writing to HBase]
44
Reflective Thoughts
[Same architecture diagram as slide 33, highlighting: Research and Searching]

Fraud Detection Architecture

Editor's Notes

  • #4: This gives me a lot of perspective regarding the use of Hadoop
  • #17: Topics are partitioned; each partition is ordered and immutable. Messages in a partition have an ID, called the offset. The offset uniquely identifies a message within a partition.
  • #18: Kafka retains all messages for a fixed amount of time, without waiting for acks from consumers. The only metadata retained per consumer is its position in the log: the offset. So adding many consumers is cheap. On the other hand, consumers have more responsibility and are more challenging to implement correctly. And “batching” consumers is not a problem.
  • #19: 3 partitions, each replicated 3 times.
  • #20: Producers choose how many replicas must ACK a message before it is considered committed. This is the tradeoff between speed and reliability.
  • #21: Producers choose how many replicas must ACK a message before it is considered committed. This is the tradeoff between speed and reliability.
  • #22: Consumers can read from one or more partition leaders. You can’t have two consumers in the same group reading the same partition. Leaders obviously do more work, but they are balanced between nodes. We reviewed the basic components of the system, and it may seem complex. In the next section we’ll see how simple it actually is to get started with Kafka.
  • #24: Does not require programming.