APACHE FLUME OR APACHE KAFKA? HOW ABOUT BOTH?
Apache North America Big Data Conference - May 10, 2016
Presented by Jayesh Thakrar (jthakrar@conversantmedia.com)
 Conversant (www.conversantmedia.com)
• Adserving - real-time bidding
• Intelligent messaging using online and offline activity without using
personally identifiable information (PII)
 Hadoop Engineering
• Designs, builds, and manages clusters running Hadoop, HBase, Spark,
Storm, Kafka, Cassandra, OpenTSDB, etc.
• Team: 4 people, 20+ clusters, 500+ servers, PBs of storage, etc.
AGENDA
 History and Evolution of Conversant's Data Pipeline
 Flume Customization
 Compare Flume and Kafka
 Metrics and Monitoring
Conversant Data Pipeline Overview
INTER-DATACENTER DATA PIPELINE
[Diagram: Internet traffic from ad exchanges, web sites (publishers), and users flows into four data centers: U.S. East Coast, U.S. West Coast, Europe, and Chicago (EDW)]
DATA PIPELINE VERSION 1
(PRIOR TO SEPT 2013)
 Home-grown log collection system in Perl, shell, and Python
 15-20 billion log lines
 Comma- or tab-separated log format, implicit schema
[Diagram: AdServer applications → Local Log Manager in each data center → Chicago Log Aggregator → Data Warehouse]
DATA PIPELINE VERSION 1
 Non-trivial operational and recovery effort during
• Network/WAN outages
• Planned/unplanned server maintenance
 Difficult file format/schema evolution
 Delayed reporting and metrics (2-3 hours)
 Scaling and storage utilization issues on the local log manager
DATA PIPELINE VERSION 2
(SEP 2013 - MAR 2015)
 Application logging in Avro format
 50-80+ billion daily log lines
 3-hop flume pipeline
 Flume event schema : event header, event payload
• Header key/value = log type, log version, server-id, UUID, timestamp, # of log lines
• Payload = byte array = Avro file
[Diagram: AdServer applications with local Flume agents → Local Compressor Flume agents in each data center → Chicago Deduping and Bifurcating Flume agents → Dedicated Hadoop Cluster]
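To make the event schema above concrete, here is a minimal sketch (not Conversant's actual code; header keys and the helper class are hypothetical) of wrapping one batch of Avro log lines into a single Flume event with Flume's EventBuilder:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;
import org.apache.flume.Event;
import org.apache.flume.event.EventBuilder;

public class LogBatchEvent {
    /** Wraps a serialized Avro file (one batch of log lines) as one Flume event. */
    public static Event build(byte[] avroFileBytes, String logType,
                              String logVersion, String serverId, int lineCount) {
        Map<String, String> headers = new HashMap<>();
        headers.put("log.type", logType);        // header keys are illustrative
        headers.put("log.version", logVersion);
        headers.put("server.id", serverId);
        headers.put("uuid", UUID.randomUUID().toString());  // used later for deduping
        headers.put("timestamp", String.valueOf(System.currentTimeMillis()));
        headers.put("line.count", String.valueOf(lineCount));
        // payload = byte array = the entire Avro file for this batch
        return EventBuilder.withBody(avroFileBytes, headers);
    }
}
```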
DATA PIPELINE VERSION 2
 Explicit application log schema
 Version tagged payload = easier log file schema evolution
 No manual recovery during network outages and server maintenance
 Detailed, explicit metrics in real-time
DATA PIPELINE VERSION 3
(MAR 2015-JUN 2015)
 Switch from dedicated MapR cluster to CDH cluster (new EDW)
[Diagram: same flow as version 2 - AdServer applications with local Flume agents → Local Compressor Flume agents in each data center → Chicago Deduping and Bifurcating Flume agents - now writing to both the Dedicated Hadoop Cluster and the new Enterprise Hadoop Cluster]
DATA PIPELINE VERSION 3
 About 4-5K file creations/sec by Flume - Namenode overwhelmed!
 Manual intervention needed for data recovery - a painful reminder of version 1
DATA PIPELINE VERSION 4
(JUNE 2015+)
 Embedded Flume agents in applications
 Kafka to "buffer/self-regulate" data flow
 Camus MapReduce framework to land data in HDFS (invocation sketch after the diagram below)
[Diagram: AdServer applications with embedded Flume agents → Local Compressor Flume agents in each data center → Kafka → Chicago Deduping and Bifurcating Flume agents → Enterprise Hadoop Cluster + Camus MapReduce]
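With Kafka buffering the flow, landing data becomes a periodic batch job. A hedged sketch of running Camus (the jar name and properties file are illustrative; CamusJob is the entry point from the LinkedIn Camus project):

```sh
# Pull new messages from Kafka and land them in HDFS as files
hadoop jar camus-example-shaded.jar com.linkedin.camus.etl.kafka.CamusJob \
    -P camus.properties
```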
DATA PIPELINE VERSION 4
 Kafka + Flume = Hadoop decoupling and data redundancy
 Additional metrics and visibility from Kafka
 In the future, allows for data sniffing/sampling and real-time stream processing of log data
Flume Customization
ADSTACK DATA CENTER BUILDING BLOCK
 Multi-threaded application flushes
batched Avro log lines through
embedded Flume agent based on
time and/or line count thresholds
 Compressor agent compresses
data and sends downstream to
Chicago
• Custom Flume interceptor =
compression and filtering
• Custom Flume selector = event
forwarding to specific channel
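A compressing/filtering interceptor fits in a few dozen lines against Flume's Interceptor API. The sketch below is hypothetical (GZIP shown for concreteness; this is not Conversant's actual implementation):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPOutputStream;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

public class CompressingInterceptor implements Interceptor {
    @Override public void initialize() { }

    @Override
    public Event intercept(Event event) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
                gz.write(event.getBody());
            }
            event.setBody(bos.toByteArray());  // replace payload with compressed bytes
            return event;
        } catch (IOException e) {
            return null;  // returning null drops (filters out) the event
        }
    }

    @Override
    public List<Event> intercept(List<Event> events) {
        List<Event> out = new ArrayList<>(events.size());
        for (Event e : events) {
            Event result = intercept(e);
            if (result != null) out.add(result);
        }
        return out;
    }

    @Override public void close() { }

    public static class Builder implements Interceptor.Builder {
        @Override public Interceptor build() { return new CompressingInterceptor(); }
        @Override public void configure(Context context) { }
    }
}
```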
CHICAGO DEDUPING AND BIFURCATING AGENTS
 Landing Flume Agent
• Custom Interceptor = Check
HBase for UUID, forward if
absent (check-and-forward)
and insert into HBase
• Custom selector = forward
every Nth event to QA flow
(dedicated channel and sink)
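The check-and-forward dedup maps naturally onto HBase's atomic checkAndPut. A hypothetical sketch (table, column family, and header names are illustrative) against the HBase 1.x client API:

```java
import java.io.IOException;
import org.apache.flume.Event;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseDeduper {
    private final Table table;

    public HBaseDeduper(Connection connection) throws IOException {
        this.table = connection.getTable(TableName.valueOf("event_uuids"));
    }

    /** Returns the event if its UUID is new (and records it); null if a duplicate. */
    public Event checkAndForward(Event event) throws IOException {
        byte[] uuid = Bytes.toBytes(event.getHeaders().get("uuid"));
        Put put = new Put(uuid).addColumn(Bytes.toBytes("d"), Bytes.toBytes("ts"),
                                          Bytes.toBytes(System.currentTimeMillis()));
        // Atomic insert-if-absent: succeeds only if no cell exists for this UUID yet
        boolean firstSighting = table.checkAndPut(uuid, Bytes.toBytes("d"),
                                                  Bytes.toBytes("ts"), null, put);
        return firstSighting ? event : null;
    }
}
```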
INTO THE DATA WAREHOUSE
KEY POINTS
 Batch of application log lines = "logical log file"
= 1 Flume event = Kafka message
 Application-created custom header key/value pairs in Flume events:
log type, server-id, UUID, log version, # of log lines, timestamp, etc.
 Events compressed at remote data center
 Events deduped using HBase lookup (check-and-forward) in Chicago
 Data pipeline resilient to server and network outages and system
maintenance
Flume and Kafka
or
Flume v/s Kafka
FLUME IN A NUTSHELL: ARCHITECTURE
[Diagram: inside a Flume agent, a Source (with an interceptor and a channel selector) feeds Channel 1 and Channel 2, which are drained by Sink 1 and Sink 2; another agent's source or sink sits on either side]
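This wiring is declared in a properties file. A minimal hedged example (agent, host, and path names are hypothetical) with one source multiplexing into two channels:

```properties
agent1.sources  = src1
agent1.channels = ch1 ch2
agent1.sinks    = sink1 sink2

# Avro source receiving events from upstream agents
agent1.sources.src1.type = avro
agent1.sources.src1.bind = 0.0.0.0
agent1.sources.src1.port = 4141
agent1.sources.src1.channels = ch1 ch2

# Selector: route events by a header value (conditional routing/multiplexing)
agent1.sources.src1.selector.type = multiplexing
agent1.sources.src1.selector.header = log.type
agent1.sources.src1.selector.mapping.impression = ch1
agent1.sources.src1.selector.default = ch2

agent1.channels.ch1.type = file
agent1.channels.ch2.type = memory

# Sink 1 lands events in HDFS; sink 2 daisy-chains to a downstream agent
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.channel = ch1
agent1.sinks.sink1.hdfs.path = /flume/events/%Y-%m-%d
agent1.sinks.sink1.hdfs.useLocalTimeStamp = true

agent1.sinks.sink2.type = avro
agent1.sinks.sink2.channel = ch2
agent1.sinks.sink2.hostname = downstream.example.com
agent1.sinks.sink2.port = 4141
```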
FLUME IN A NUTSHELL: ECOSYSTEM
Pre-canned Flume Sources
 Avro (pairs with the Avro sink for daisy-chaining agents)
 Thrift
 Exec (Unix pipe/stdout)
 Kafka
 Netcat
 HTTP
 Spooling Directory
 Custom Code

Pre-canned Flume Sinks
 HDFS
 Hive
 Avro (pairs with the Avro source for daisy-chaining agents)
 Thrift
 Kafka
 File Roll (output spooling directory)
 HBase
 Solr
 Elastic Search
 Custom Code

Pre-canned Channels
 Memory Channel
 File Channel
 Kafka Channel
 JDBC Channel
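The Kafka Channel is the key piece for the "how about both?" idea: it gives a Flume agent a broker-backed, replicated channel. A hedged sketch using Flume 1.6-era property names (broker/ZooKeeper addresses and topic are illustrative):

```properties
agent1.channels = kc
agent1.channels.kc.type = org.apache.flume.channel.kafka.KafkaChannel
agent1.channels.kc.brokerList = broker1:9092,broker2:9092
agent1.channels.kc.zookeeperConnect = zk1:2181
agent1.channels.kc.topic = flume-channel-topic
```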
KAFKA IN A NUTSHELL: ARCHITECTURE
[Diagram: a Kafka broker's partition as an append-only log running from oldest data to latest data - the Producer appends at the "latest data" end while Consumer A and Consumer B each read independently at their own offset]
Source: https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
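For concreteness, a minimal 0.9-era Java producer appending one message to a topic (broker address, topic, and key are hypothetical):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class LogProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.ByteArraySerializer");
        try (Producer<String, byte[]> producer = new KafkaProducer<>(props)) {
            // The key (e.g. server-id) determines the partition; the message is
            // appended at the "latest data" end of that partition's log.
            producer.send(new ProducerRecord<>("adserver-logs", "server-42",
                                               "one Avro batch".getBytes()));
        }
    }
}
```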
KAFKA IN A NUTSHELL: SCALABILITY
[Diagram: Producer 1 and Producer 2 writing in parallel to partitions of a topic spread across multiple brokers]
FLUME AND KAFKA: DATA PIPELINE BLOCKS
[Diagram: a "data pipeline block" - many data sources feeding a Flume or Kafka tier, which fans out to many data destinations; blocks can be chained, with one block's destinations acting as the next block's sources]
FLUME V/S KAFKA: DATA AND ROUTING
 Data pipeline block philosophy
• Flume = buffered pipeline => transfer and forget
• Kafka = buffered temporal log => transfer and remember (short-term)
 Data introspection, manipulation and conditional routing/multiplexing
• Flume = can intercept and manipulate events (source/sink interceptor), and
conditionally route/multiplex events (source/sink selector)
• Kafka = pass-through only
 Changes in data destination or data source
• Flume = requires re-configuration
• Kafka = N/A, source (producer) and destination (consumer) agnostic
FLUME V/S KAFKA: RELIABILITY, SCALABILITY, ECOSYSTEM
 Server Outage
• Flume = incoming (flume-to-flume) flow fails over to a backup Flume agent;
outgoing flow is buffered in the agent's channel
• Kafka = Producer/consumer failover to another broker (replica partition)
 Scalability
• Flume = add agents, re-configure (re-wire) data flows
• Kafka = add brokers, increase topic partitions and (re)distribute partitions
 Ecosystem
• Flume = Pre-canned sources, sinks and channels
• Kafka = Kafka Connect and Kafka Streams
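Scaling Kafka is largely an administrative operation. A hedged sketch using the stock CLI tools (hosts, topic, and plan file are illustrative):

```sh
# Grow a topic from 12 to 24 partitions...
bin/kafka-topics.sh --zookeeper zk1:2181 --alter \
    --topic adserver-logs --partitions 24

# ...then spread partitions over the new brokers with a reassignment plan
bin/kafka-reassign-partitions.sh --zookeeper zk1:2181 \
    --reassignment-json-file reassign.json --execute
```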
Administration, Metrics and Monitoring
ADMINISTRATION
 No built-in UI in either
 Flume: agent stop/start shell script
 Kafka
• Stop/start brokers
• Create/delete/view/manage topics, partitions and topic configuration
• Other utilities - e.g. view log data, stress testing, etc.
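These tasks are driven by the shell utilities that ship with Kafka (0.9/0.10-era flags; hosts and topic are illustrative):

```sh
# Create / list / inspect topics
bin/kafka-topics.sh --zookeeper zk1:2181 --create \
    --topic adserver-logs --partitions 12 --replication-factor 3
bin/kafka-topics.sh --zookeeper zk1:2181 --list
bin/kafka-topics.sh --zookeeper zk1:2181 --describe --topic adserver-logs

# View log data
bin/kafka-console-consumer.sh --zookeeper zk1:2181 \
    --topic adserver-logs --from-beginning

# Stress testing
bin/kafka-producer-perf-test.sh --topic adserver-logs \
    --num-records 100000 --record-size 1024 --throughput -1 \
    --producer-props bootstrap.servers=broker1:9092
```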
FLUME METRICS: JMX AND HTTP/JSON ENDPOINT
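The HTTP/JSON endpoint is enabled with two JVM properties when starting the agent, and the counters can then be scraped with curl (agent name and port are illustrative):

```sh
# Start the agent with the built-in HTTP metrics reporter
bin/flume-ng agent -n agent1 -c conf -f conf/agent1.properties \
    -Dflume.monitoring.type=http -Dflume.monitoring.port=41414

# Channel fill levels, source/sink event counts, etc., as JSON
curl http://localhost:41414/metrics
```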
METRICS
 Kafka - JMX and API
• Broker network traffic
• Topic and partition traffic
• Replication and consumer lag
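These are exposed as JMX MBeans; a few representative bean names from the Kafka documentation, readable with any JMX client:

```sh
# kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec
# kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec
# kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions
```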
MONITORING AND ALERTING
 Flume Key Health Indicators
• Flume listener port
• Incoming traffic rate and errors
• Outgoing traffic rate and errors
• Channel capacity utilization
 Kafka Key Health Indicators
• Broker listener port
• Under-replicated partitions (in-sync replica count and amount of replica lag)
• Consumer lag
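Both Kafka indicators can also be checked from the command line (0.9-era tooling; hosts and group name are illustrative):

```sh
# Any output here means under-replicated partitions - a key alert condition
bin/kafka-topics.sh --zookeeper zk1:2181 --describe --under-replicated-partitions

# Per-partition consumer lag for a consumer group
bin/kafka-consumer-groups.sh --bootstrap-server broker1:9092 --new-consumer \
    --describe --group camus-loader
```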
MONITORING & METRICS @ CONVERSANT
[Graph: TSDB graph of Flume events across data centers (legend: Chicago, East Coast, West Coast, Europe); the blip is a rolling restart of servers for a software deploy]
MONITORING & METRICS IN GRAFANA DASHBOARDS
MORE INFO ON CONVERSANT DATA PIPELINE
 Conversant Blog
http://engineering.conversantmedia.com/community/2015/06/01/conversant-big-data-everywhere
 Sample GitHub Project
https://github.com/mbkeane/BigDataTechCon
 Chicago Area Kafka Enthusiasts (CAKE)
http://www.meetup.com/Chicago-Area-Kafka-Enthusiasts/events/230867233
Questions?