Kafka & Hadoop
Gwen Shapira / Software Engineer
2©2014 Cloudera, Inc. All rights reserved.
About Me
• 15 years of moving data around
• Formerly a consultant
• Now Cloudera Engineer:
– Sqoop Committer
– Kafka
– Flume
3
There’s a book on that!
4
We are also blogging
6
Getting Data from Kafka to Hadoop
There are only bad options.
It's about finding the best one.
7
Batch
8
Camus
9
Camus
[Diagram: Camus data flow across three lanes (Processes, HDFS, Other Systems). Processes: Setup, three Tasks, Clean Up. HDFS: topic offsets, in-process Avro files, Avro files, audit counts. Other Systems: ZooKeeper, Kafka. Arrows are labeled A through I.]
10
Sqoop2
From (RDBMS, HDFS, Hive, HBase)
To (RDBMS, HDFS, HBase, Hive, Kafka)
Engine (Web server, REST API, Repository, MapReduce)
Client
11
NiFi!
12
HiveKa = Hive + Kafka
[Diagram: the Hive Storage Handler calls KafkaInputFormat.getSplits(), which gets the topic, partitions and offsets from Kafka. MapReduce setup then starts the mappers. Each mapper uses a KafkaRecordReader to get data from Kafka. An Avro SerDe is also shown.]
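The split-then-read idea on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not HiveKa's actual code: `KafkaSplit`, `get_splits`, `read_split`, and the `max_records_per_split` parameter are hypothetical names, and a plain dictionary stands in for Kafka.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KafkaSplit:
    """One unit of work for a mapper (hypothetical structure)."""
    topic: str
    partition: int
    start_offset: int  # inclusive
    end_offset: int    # exclusive


def get_splits(topic, offsets, max_records_per_split):
    """Turn per-partition offset ranges into splits.

    offsets maps partition -> (earliest, latest); in a real job these
    would come from Kafka, here they are supplied by the caller.
    """
    splits = []
    for partition, (earliest, latest) in sorted(offsets.items()):
        start = earliest
        while start < latest:
            end = min(start + max_records_per_split, latest)
            splits.append(KafkaSplit(topic, partition, start, end))
            start = end
    return splits


def read_split(split, log):
    """Stand-in for a record reader: yield the records in the split's range.

    log maps (partition, offset) -> record and replaces a real Kafka fetch.
    """
    for offset in range(split.start_offset, split.end_offset):
        yield log[(split.partition, offset)]
```

Each split carries its own offset range, so mappers can read their slices independently and in parallel.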
15
Streaming
16
Flume + Kafka = Flafka
17
How does it work?
Flume Agent
Sources: Twitter, logs, webserver, Kafka…
Interceptors: Mask, re-format, validate…
Selectors: DR, critical
Channels: Memory, file, Kafka
Sinks: HDFS, HBase, Solr, Kafka
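As a rough mental model of the agent stages named on this slide, the following Python sketch passes events through an interceptor, a selector, channels, and sinks. Flume itself is configured rather than programmed, so nothing here is Flume's API: the function names, the `user` and `critical` fields, and the `main` and `dr` channel names are all assumptions made for illustration.

```python
from collections import deque


def mask_interceptor(event):
    """Interceptor: mask a sensitive field (the 'user' field is hypothetical)."""
    masked = dict(event)
    if "user" in masked:
        masked["user"] = "***"
    return masked


def select_channels(event):
    """Selector: every event goes to 'main'; critical ones also go to 'dr'."""
    names = ["main"]
    if event.get("critical"):
        names.append("dr")
    return names


def run_agent(source_events, channels, sinks):
    """Move events source -> interceptor -> selector -> channel -> sink."""
    for event in source_events:
        event = mask_interceptor(event)
        for name in select_channels(event):
            channels[name].append(event)  # the channel buffers the event
    for name, channel in channels.items():
        while channel:
            sinks[name].append(channel.popleft())  # the sink drains its channel
```

The channel is the buffer between the two halves of the agent, which is why swapping a memory or file channel for Kafka changes durability without touching sources or sinks.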
18
But I just want to get data from Kafka to HBase / HDFS
19
Kafka Channel
Flume Agent
Channels: Kafka!
Sinks: HDFS, HBase, Solr
20
Kafka Channel
Flume Agent
Sources: Twitter, logs, webserver, Kafka…
Interceptors: Mask, re-format, validate…
Selectors: DR, critical
Channels: Memory, file, Kafka
21
Spark Streaming
[Diagram: three panels (pre-first batch, first batch, second batch). In each panel a Source feeds a RawInput DStream made up of RDDs, and each batch is processed in a single pass: Filter, Count, Print.]
22
Storm
[Diagram: Storm topology. Spout layer: spouts reading from the source. Fan-out layer 1: split-words bolts. Shuffle. Layer 2: count bolts.]
23
Retro Thoughts
24
Schema
• Data often has a schema
– At least it should
• Kafka is unaware of it, which is good
• Need a way to figure out the schema for events
– Without including it in every event
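One common way to meet the last requirement is to store each schema once in a shared place and put only a small schema id in every event. The sketch below is a toy, in-memory version of that idea. The `SchemaRegistry` class, the JSON encoding, and the field names are assumptions for illustration, not a description of any particular product.

```python
import json


class SchemaRegistry:
    """In-memory stand-in for a shared schema store (hypothetical)."""

    def __init__(self):
        self._by_id = {}
        self._ids = {}

    def register(self, schema):
        """Return the id for this schema, assigning a new one if needed."""
        key = json.dumps(schema, sort_keys=True)
        if key not in self._ids:
            schema_id = len(self._by_id) + 1
            self._ids[key] = schema_id
            self._by_id[schema_id] = schema
        return self._ids[key]

    def lookup(self, schema_id):
        return self._by_id[schema_id]


def encode(registry, schema, record):
    """Write only the schema id and the field values, in schema order."""
    schema_id = registry.register(schema)
    values = [record[name] for name in schema["fields"]]
    return json.dumps({"schema_id": schema_id, "values": values}).encode("utf-8")


def decode(registry, payload):
    """Use the schema id in the payload to rebuild the full record."""
    message = json.loads(payload.decode("utf-8"))
    schema = registry.lookup(message["schema_id"])
    return dict(zip(schema["fields"], message["values"]))
```

The message broker only ever sees opaque bytes, which matches the point that Kafka stays unaware of the schema. Producers and consumers resolve the id against the registry.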
25
Kafka in Cloudera Manager
Questions?

Kafka and Hadoop at LinkedIn Meetup

Editor's Notes

  • #3: This gives me a lot of perspective regarding the use of Hadoop
  • #7: https://gist.github.com/gwenshap/9699072
  • #9: Batch MapReduce job. Exactly once semantics. Run once every X minutes.
  • #10: Steps, matching the labels in the diagram:
      A - The setup stage fetches broker URLs and topic information from ZooKeeper.
      B - The setup stage persists information about topics and offsets in HDFS for the tasks to read.
      C - The tasks read the persisted information from the setup stage.
      D - The tasks get events from Kafka.
      E - The tasks write data to a temp location in HDFS in the format defined by the user-defined decoder, in this case Avro-formatted files.
      F - The tasks move the data in the temp location to a final location when the task is cleaning up.
      G - The task writes out audit counts on its activities.
      H - A clean up stage reads all the audit counts from all the tasks.
      I - The clean up stage reports back to Kafka what has been persisted.
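The ordering of these steps is what matters: data is written to a temp location, moved into place, and only then are the new offsets recorded. A small local simulation can make that concrete. It is a sketch under loud assumptions, not Camus code: a dictionary stands in for Kafka, a local directory for HDFS, JSON files for Avro, and `run_ingest` is a hypothetical name.

```python
import json
import os


def run_ingest(log, committed, base_dir):
    """Simulate one batch run.

    log: {partition: [events]} stands in for Kafka.
    committed: {partition: next offset to read} from the previous run.
    Returns (new_committed, audit_counts).
    """
    tmp_dir = os.path.join(base_dir, "tmp")
    final_dir = os.path.join(base_dir, "final")
    os.makedirs(tmp_dir, exist_ok=True)
    os.makedirs(final_dir, exist_ok=True)

    # Setup (A, B): decide which offset range each task should read.
    plan = {p: (committed.get(p, 0), len(events)) for p, events in log.items()}

    audit = {}
    for partition, (start, end) in plan.items():  # C: one task per partition
        if start == end:
            continue  # nothing new in this partition
        events = log[partition][start:end]        # D: get events
        name = "p%d-%d-%d.json" % (partition, start, end)
        tmp_path = os.path.join(tmp_dir, name)
        with open(tmp_path, "w") as handle:       # E: write to temp location
            json.dump(events, handle)
        os.replace(tmp_path, os.path.join(final_dir, name))  # F: move to final
        audit[partition] = len(events)            # G: audit counts

    # Clean up (H, I): read the audit counts, then record the new offsets.
    new_committed = dict(committed)
    for partition, count in audit.items():
        new_committed[partition] = plan[partition][0] + count
    return new_committed, audit
```

Running it a second time with the returned offsets writes nothing new, which is the behavior a periodic batch job relies on to avoid duplicates.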
  • #17: Kafka source + sink for Flume
  • #18: Does not require programming.
  • #20: Does not require programming.
  • #21: Does not require programming.
  • #22: Micro-batch stream processing framework. Basically Spark code executed in a slightly different context, plus some windowing functions.
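To illustrate what micro-batching means here, the plain-Python sketch below groups timestamped events into fixed-length batches and runs the same filter-and-count code on each one. It does not use Spark. The function names, the `ERROR` filter, and the batch length are assumptions chosen for the example.

```python
def micro_batches(events, batch_seconds):
    """Group (timestamp, value) pairs into consecutive fixed-length batches.

    Batches with no events are skipped here for brevity.
    """
    batches = {}
    for timestamp, value in events:
        batches.setdefault(int(timestamp // batch_seconds), []).append(value)
    return [batches[key] for key in sorted(batches)]


def process_batch(batch):
    """The same code runs on every batch: filter, then count."""
    errors = [line for line in batch if "ERROR" in line]
    return len(errors)
```

For example, with two-second batches, events at 0.5s and 1.2s land in the first batch and events at 2.1s, 2.9s and 3.0s in the second. Calling `process_batch` on each batch gives one count per batch rather than one result per event.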
  • #23: Stream processing framework. Quite popular. Can be event-based or micro-batching (with Trident). Requires low-level awareness of the API.
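The word-count topology on the Storm slide can be imitated in plain Python to show the event-at-a-time flow: a spout emits lines, split bolts emit words, and each word is routed to one of several counters. This is a sketch, not Storm's API. The function names are hypothetical, and the hash-based routing is an assumption that mirrors what Storm calls a fields grouping, so that each word is always counted in the same place.

```python
import zlib
from collections import Counter


def spout(lines):
    """Spout: emit one line at a time."""
    for line in lines:
        yield line


def split_bolt(line):
    """Split bolt: turn a line into words."""
    return line.lower().split()


def route(word, num_counters):
    """Send the same word to the same counter every time."""
    return zlib.crc32(word.encode("utf-8")) % num_counters


def run_topology(lines, num_counters=3):
    """Run lines through spout -> split -> count and return the counters."""
    counters = [Counter() for _ in range(num_counters)]
    for line in spout(lines):
        for word in split_bolt(line):
            counters[route(word, num_counters)][word] += 1
    return counters
```

Unlike the micro-batch model, each line is processed as soon as it arrives, and the running counts are spread across the counters.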