Big Data Day LA
Kafka
About Kafka
Kafka is a distributed publish/subscribe system
It uses a commit log to track changes
Kafka was originally created at LinkedIn
Open sourced in 2011
Graduated to a top-level Apache project in 2012
Many Big Data projects are open source implementations of closed source products
Unlike Hadoop, HBase or Cassandra, Kafka isn't a clone of an existing closed source product
What Is Kafka?
A publish/subscribe system is used to move data
Also known as a producer/consumer system
The publisher creates data
The data can come from any source
The data can be binary or text
The subscriber consumes the publisher's data
The subscriber will use the data for its algorithms
Pub/Sub
Decoupling removes a component's knowledge of how the rest of the system flows
A highly coupled system breaks when a simple change is made
A highly coupled system needs to know all configurations and destinations
A decoupled system is resilient to change
It does not break during a change
It does not need extensive knowledge about the rest of the system
Decoupling
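To make the decoupling idea concrete, here is a minimal in-memory sketch. `TinyBroker`, `subscribe` and `publish` are hypothetical names invented for this illustration and are not Kafka's API. Unlike Kafka, this toy pushes messages straight to subscribers and keeps no log. The point is only that the publisher and the subscribers share nothing except a topic name.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical in-memory broker for illustration only; this is not Kafka's API.
class TinyBroker {
    // Topic name -> callbacks registered by subscribers.
    private final Map<String, List<Consumer<String>>> subscribers = new HashMap<>();

    // A subscriber registers interest in a topic; it never learns who publishes.
    void subscribe(String topic, Consumer<String> handler) {
        subscribers.computeIfAbsent(topic, t -> new ArrayList<>()).add(handler);
    }

    // A publisher hands data to the broker; it never learns who subscribes.
    // Returns the number of subscribers the message was delivered to.
    int publish(String topic, String message) {
        List<Consumer<String>> handlers =
            subscribers.getOrDefault(topic, new ArrayList<>());
        for (Consumer<String> handler : handlers) {
            handler.accept(message);
        }
        return handlers.size();
    }

    public static void main(String[] args) {
        TinyBroker broker = new TinyBroker();
        List<String> audit = new ArrayList<>();
        // Two independent subscribers; adding or removing one changes nothing for the publisher.
        broker.subscribe("orders", audit::add);
        broker.subscribe("orders", m -> System.out.println("billing saw " + m));
        broker.publish("orders", "order-1");
        System.out.println("audit log: " + audit);
    }
}
```

Adding a third subscriber needs no change to the publishing code, which is the resilience to change described above.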
Kafka is proven with Big Data
Kafka decouples systems
It is becoming common in enterprise data flows
The same codebase has been used for years at LinkedIn, which answers the questions:
Does it scale?
Is it fast?
Is it robust?
Is it production ready?
Kafka supports the traditional publish/subscribe features
Why Use Kafka?
We will now demonstrate how Kafka works with Legos
Concepts shown:
Publish/Subscribe
Topics
Partitioning
Commit Logs
Log Compaction
DEMO: Kafka With Legos
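The demo itself is not captured in the slides, so as a stand-in here is a rough sketch of two of the listed concepts, partitioning and log compaction. Everything in it is an assumption made for illustration: `LegoLogSketch`, `Rec`, `partitionFor` and `compact` are hypothetical names, and the hash-modulo placement is a simplification (Kafka's own partitioner uses a different hash function). The sketch shows records with the same key landing in the same partition, and compaction keeping only the newest record per key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of partitioning and log compaction; not Kafka's implementation.
class LegoLogSketch {
    // A keyed record, like the key/value pairs producers send.
    static final class Rec {
        final String key;
        final String value;
        Rec(String key, String value) { this.key = key; this.value = value; }
        @Override public String toString() { return key + "=" + value; }
    }

    // Assumption: simple hash-modulo placement, so equal keys always share a partition.
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // Keep only the newest record for each key, preserving the order of the survivors.
    static List<Rec> compact(List<Rec> log) {
        // First pass: remember the last position at which each key appears.
        Map<String, Integer> lastIndex = new HashMap<>();
        for (int i = 0; i < log.size(); i++) {
            lastIndex.put(log.get(i).key, i);
        }
        // Second pass: keep a record only if it is the last one for its key.
        List<Rec> compacted = new ArrayList<>();
        for (int i = 0; i < log.size(); i++) {
            if (lastIndex.get(log.get(i).key) == i) {
                compacted.add(log.get(i));
            }
        }
        return compacted;
    }

    public static void main(String[] args) {
        List<Rec> log = Arrays.asList(
            new Rec("red", "1"), new Rec("blue", "2"), new Rec("red", "3"));
        System.out.println("partition for red: " + partitionFor("red", 3));
        System.out.println("compacted log: " + compact(log));
    }
}
```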
Kafka Internals
Producers publish or create the data sent to the cluster
All producer data is sent over the network to the Kafka cluster
All producer data is sent as keys and values
The keys and values can be binary or text
Publisher
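Because keys and values travel over the network, text and other types have to be turned into bytes on the way in and back again on the way out. The sketch below illustrates that round trip with the Java standard library only. `KeyValueBytesSketch` and its method names are hypothetical and not part of Kafka. The UTF-8 choice mirrors the default encoding of Kafka's `StringSerializer`, and the 8-byte big-endian layout for longs is an assumption made for this example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helpers showing how text and binary payloads both end up as bytes.
class KeyValueBytesSketch {
    // Text payload: encode the string as UTF-8 bytes.
    static byte[] textToBytes(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Decode UTF-8 bytes back into the original string.
    static String bytesToText(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Binary payload: a long written as 8 big-endian bytes (assumed layout).
    static byte[] longToBytes(long value) {
        return ByteBuffer.allocate(Long.BYTES).putLong(value).array();
    }

    // Read the 8 bytes back into a long.
    static long bytesToLong(byte[] bytes) {
        return ByteBuffer.wrap(bytes).getLong();
    }

    public static void main(String[] args) {
        byte[] key = textToBytes("mykey");
        byte[] value = longToBytes(42L);
        System.out.println("key bytes: " + key.length + ", value bytes: " + value.length);
        System.out.println("decoded: " + bytesToText(key) + " -> " + bytesToLong(value));
    }
}
```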
Consumers receive the producer's data
The consumers actually pull the data from the Kafka cluster
The consumers receive the keys and values sent by the producer
Subscriber
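The pull model can be sketched with an append-only list standing in for the cluster's log. This is a simplified assumption, not how Kafka is implemented: `PullLogSketch`, `append` and `poll` are hypothetical names, and the "cluster" here is a single in-memory list. What it shows is that the consumer, not the cluster, decides when to ask for data and which offset to read from next.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical append-only log; a record's position in the list is its offset.
class PullLogSketch {
    private final List<String> log = new ArrayList<>();

    // Producer side: append a value and return the offset it was stored at.
    long append(String value) {
        log.add(value);
        return log.size() - 1;
    }

    // Consumer side: pull up to maxRecords values starting at fromOffset.
    // Returns an empty list when the consumer has caught up.
    List<String> poll(long fromOffset, int maxRecords) {
        int start = (int) Math.max(0, fromOffset);
        if (start >= log.size() || maxRecords <= 0) {
            return Collections.emptyList();
        }
        int end = Math.min(log.size(), start + maxRecords);
        return new ArrayList<>(log.subList(start, end));
    }

    public static void main(String[] args) {
        PullLogSketch cluster = new PullLogSketch();
        cluster.append("brick-1");
        cluster.append("brick-2");
        cluster.append("brick-3");

        // The consumer tracks its own position and keeps pulling until nothing is left.
        long offset = 0;
        List<String> batch;
        while (!(batch = cluster.poll(offset, 2)).isEmpty()) {
            System.out.println("pulled " + batch + " starting at offset " + offset);
            offset += batch.size();
        }
    }
}
```

Because the log is never modified by reads, a second consumer could start from offset 0 and see the same records independently.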
Topics are a way of grouping data together
Publishers push data to a topic
Consumers receive all of their data from a topic
The topic name must match exactly on both the publisher and the consumer
Topics
Kafka API
There are various ways to access Kafka
The most common way is to use the Java API
It is the only first-class citizen
Other languages have API implementations, but they aren't part of the Apache Kafka project
The REST interface allows many languages to use Kafka
This requires access to the REST Server
Kafka Connect allows general purpose integrations
Data can be ingested into Hadoop
Data can be added to an RDBMS
Accessing Kafka
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
Properties props = new Properties();
// Configure brokers to connect to
props.put("bootstrap.servers", "broker1:9092");
// Configure serializer classes
props.put("key.serializer",
          "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer",
          "org.apache.kafka.common.serialization.StringSerializer");
KafkaProducer<String, String> producer =
    new KafkaProducer<String, String>(props);
// Create ProducerRecord and send it
String key = "mykey";
String value = "myvalue";
ProducerRecord<String, String> record =
    new ProducerRecord<String, String>("my_topic", key, value);
// send() is asynchronous; close() waits for outstanding sends to complete
producer.send(record);
producer.close();
Creating a Publisher
import java.time.Duration;
import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
public class MyConsumer {
    private KafkaConsumer<String, String> consumer;
    public void createConsumer() {
        String topic = "hello_topic";
        Properties props = new Properties();
        // Configure initial location bootstrap servers
        props.put("bootstrap.servers", "broker1:9092");
        // Configure consumer group
        props.put("group.id", "group1");
        // Configure key and value deserializers
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        // Create the consumer and subscribe to the topic
        consumer = new KafkaConsumer<String, String>(props);
        consumer.subscribe(Arrays.asList(topic));
Creating a Consumer (1/2)
        while (true) {
            // Poll for ConsumerRecords, blocking for up to 100 ms
            ConsumerRecords<String, String> records =
                consumer.poll(Duration.ofMillis(100));
            // Process the ConsumerRecords, if any, that came back
            for (ConsumerRecord<String, String> record : records) {
                String key = record.key();
                String value = record.value();
                // Do something with message
            }
        }
    }
    public void close() {
        consumer.close();
    }
    public static void main(String[] args) {
        MyConsumer consumer = new MyConsumer();
        consumer.createConsumer();
        consumer.close();
    }
}
Creating a Consumer (2/2)
Current: Instructor, Thought Leader, Monkey Tamer
Previously:
Curriculum Developer and Instructor @ Cloudera
Senior Software Engineer @ Intuit
Covered, Conferences and Published In:
GigaOM, Ars Technica, Pragmatic Programmers, Strata, OSCON, Wall Street Journal, CNN, BBC, NPR
See Me On:
@jessetanderson
http://tiny.smokinghand.com/linkedin
http://tiny.smokinghand.com/youtube
About Me


Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Introduction to Kafka - Jesse Anderson, CEO, Smoking Hand