Cassandra / Kafka Support in EC2/AWS. Kafka Training, Kafka
Consulting
Cassandra and Kafka Support on AWS/EC2
Cloudurable
Introduction to Kafka
Support around Cassandra
and Kafka running in EC2
Kafka growing
Why Kafka?
Kafka adoption is on the rise
but why
What is Kafka?
Kafka growth exploding
❖ Kafka growth exploding
❖ 1/3 of all Fortune 500 companies
❖ Top ten travel companies, 7 of top ten banks, 8 of top
ten insurance companies, 9 of top ten telecom
companies
❖ LinkedIn, Microsoft and Netflix process four-comma
message volumes a day with Kafka (1,000,000,000,000)
❖ Real-time streams of data, used to collect big data or to
do real time analysis (or both)
Why is Kafka Needed?
❖ Real time streaming data processed for real time
analytics
❖ Service calls, track every call, IOT sensors
❖ Apache Kafka is a fast, scalable, durable, and fault-
tolerant publish-subscribe messaging system
❖ Kafka is often used instead of JMS, RabbitMQ and
AMQP
❖ higher throughput, reliability and replication
Why is Kafka needed? 2
❖ Kafka can work in combination with
❖ Flume/Flafka, Spark Streaming, Storm, HBase and Spark
for real-time analysis and processing of streaming data
❖ Feed your data lakes with data streams
❖ Kafka brokers support massive message streams for follow-
up analysis in Hadoop or Spark
❖ Kafka Streams (subproject) can be used for real-time
analytics
Kafka Use Cases
❖ Stream Processing
❖ Website Activity Tracking
❖ Metrics Collection and Monitoring
❖ Log Aggregation
❖ Real time analytics
❖ Capture and ingest data into Spark / Hadoop
❖ CQRS, replay, error recovery
❖ Guaranteed distributed commit log for in-memory computing
Who uses Kafka?
❖ LinkedIn: Activity data and operational metrics
❖ Twitter: Uses it as part of Storm – stream processing
infrastructure
❖ Square: Kafka as bus to move all system events to various
Square data centers (logs, custom events, metrics, and so
on). Outputs to Splunk, Graphite, Esper-like alerting
systems
❖ Spotify, Uber, Tumblr, Goldman Sachs, PayPal, Box,
Cisco, CloudFlare, DataDog, LucidWorks, MailChimp,
Netflix, etc.
Why is Kafka Popular?
❖ Great performance
❖ Operational Simplicity, easy to setup and use, easy to reason about
❖ Stable, Reliable Durability
❖ Flexible Publish-subscribe/queue (scales with N-number of consumer groups)
❖ Robust Replication
❖ Producer Tunable Consistency Guarantees
❖ Ordering Preserved at shard level (Topic Partition)
❖ Works well with systems that have data streams to process, aggregate,
transform & load into other stores
Most important reason: Kafka’s great performance: throughput, latency, obtained through great engineering
Why is Kafka so fast?
❖ Zero Copy - calls the OS kernel directly to move data fast, rather than copying it through application buffers
❖ Batch Data in Chunks - Batches data into chunks
❖ end to end from Producer to file system to Consumer
❖ Provides More efficient data compression. Reduces I/O latency
❖ Sequential Disk Writes - Avoids Random Disk Access
❖ writes to immutable commit log. No slow disk seeking. No random I/O
operations. Disk accessed in sequential manner
❖ Horizontal Scale - uses 100s to thousands of partitions for a single topic
❖ spread out to thousands of servers
❖ handle massive load
Streaming Architecture
Why Kafka Review
❖ Why is Kafka so fast?
❖ How fast is Kafka usage growing?
❖ How is Kafka getting used?
❖ Where does Kafka fit in the Big Data Architecture?
❖ How does Kafka relate to real-time analytics?
❖ Who uses Kafka?
What is Kafka? Kafka messaging
Kafka Overview
What is Kafka?
❖ Distributed Streaming Platform
❖ Publish and Subscribe to streams of records
❖ Fault tolerant storage
❖ Replicates Topic Log Partitions to multiple servers
❖ Process records as they occur
❖ Fast, efficient IO, batching, compression, and more
❖ Used to decouple data streams
Kafka Decoupling Data
Streams
Kafka Polyglot clients / Wire
protocol
❖ Clients and servers communicate using a wire
protocol over TCP
❖ Protocol versioned
❖ Maintains backwards compatibility
❖ Many languages supported
❖ Kafka REST proxy allows easy integration
❖ Also provides Avro/Schema registry support via Kafka
ecosystem
Kafka Usage
❖ Build real-time streaming applications that react to streams
❖ Real-time data analytics
❖ Transform, react, aggregate, join real-time data flows
❖ Feed events to CEP for complex event processing
❖ Feed data lakes
❖ Build real-time streaming data pipelines
❖ Enable in-memory microservices (actors, Akka, Vert.x, Qbit,
RxJava)
Kafka Use Cases
❖ Metrics / KPIs gathering
❖ Aggregate statistics from many sources
❖ Event Sourcing
❖ Used with microservices (in-memory) and actor systems
❖ Commit Log
❖ External commit log for distributed systems. Replicated
data between nodes, re-sync for nodes to restore state
❖ Real-time data analytics, Stream Processing, Log
Aggregation, Messaging, Click-stream tracking, Audit trail,
etc.
Kafka Record Retention
❖ Kafka cluster retains all published records
❖ Time based – configurable retention period
❖ Size based - configurable based on size
❖ Compaction - keeps latest record per key
❖ Retention policy of three days or two weeks or a month
❖ It is available for consumption until discarded by time, size or
compaction
❖ Consumption speed not impacted by size
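The three retention policies above can be pictured with a toy in-memory log. This is a minimal sketch, not Kafka's implementation: the `Record` type and the function names are hypothetical, and a real broker applies retention to log files on disk rather than to a Python list.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Record:
    offset: int
    key: Optional[str]
    value: str
    timestamp: float  # seconds, any consistent clock
    size: int         # bytes


def retain_by_time(log: List[Record], now: float, max_age: float) -> List[Record]:
    """Time based: drop records older than max_age seconds."""
    return [r for r in log if now - r.timestamp <= max_age]


def retain_by_size(log: List[Record], max_bytes: int) -> List[Record]:
    """Size based: drop the oldest records until the log fits in max_bytes."""
    kept: List[Record] = []
    total = 0
    for r in reversed(log):          # walk from newest to oldest
        if total + r.size > max_bytes:
            break
        kept.append(r)
        total += r.size
    return list(reversed(kept))


def compact(log: List[Record]) -> List[Record]:
    """Compaction: keep only the latest record for each key."""
    latest: Dict[Optional[str], Record] = {}
    for r in log:
        latest[r.key] = r            # later records overwrite earlier ones
    return sorted(latest.values(), key=lambda r: r.offset)


log = [
    Record(0, "a", "v0", 0.0, 10),
    Record(1, "b", "v1", 50.0, 10),
    Record(2, "a", "v2", 100.0, 10),
]
```

In this sketch, reading speed does not depend on how many records are retained, because a reader only needs an offset to start from.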
Kafka scalable message
storage
❖ Kafka acts as a good storage system for records/messages
❖ Records written to Kafka topics are persisted to disk and replicated to
other servers for fault-tolerance
❖ Kafka Producers can wait on acknowledgment
❖ Write not complete until fully replicated
❖ Kafka disk structures scale well
❖ Writing in large streaming batches is fast
❖ Clients/Consumers can control read position (offset)
❖ Kafka acts like high-speed file system for commit log storage,
replication
Kafka Review
❖ How does Kafka decouple streams of data?
❖ What are some use cases for Kafka where you work?
❖ What are some common use cases for Kafka?
❖ How is Kafka like a distributed message storage
system?
❖ How does Kafka know when to delete old messages?
❖ Which programming languages does Kafka support?
Kafka Architecture
Kafka Fundamentals
❖ Records have a key (optional), value and timestamp; Immutable
❖ Topic a stream of records (“/orders”, “/user-signups”), feed name
❖ Log topic storage on disk
❖ Partition / Segments (parts of Topic Log)
❖ Producer API to produce a stream of records
❖ Consumer API to consume a stream of records
❖ Broker: Kafka server that runs in a Kafka Cluster. Brokers form a cluster.
Cluster consists of many Kafka Brokers on many servers.
❖ ZooKeeper: Does coordination of brokers/cluster topology. Consistent file
system for configuration information and leadership election for Broker Topic
Partition Leaders
Kafka: Topics, Producers, and
Consumers
[Figure: Producers send records to a Topic in the Kafka Cluster; Consumers read records from the Topic.]
Core Kafka
Kafka needs Zookeeper
❖ Zookeeper helps with leadership election of Kafka Broker and
Topic Partition pairs
❖ Zookeeper manages service discovery for Kafka Brokers that
form the cluster
❖ Zookeeper sends changes to Kafka
❖ New Broker join, Broker died, etc.
❖ Topic removed, Topic added, etc.
❖ Zookeeper provides in-sync view of Kafka Cluster configuration
Kafka Producer/Consumer
Details
❖ Producers write to and Consumers read from Topic(s)
❖ Topic associated with a log which is data structure on disk
❖ Producer(s) append Records at end of Topic log
❖ Topic Log consists of Partitions -
❖ Spread to multiple files on multiple nodes
❖ Consumers read from Kafka at their own cadence
❖ Each Consumer (Consumer Group) tracks offset from where they left off reading
❖ Partitions can be distributed on different machines in a cluster
❖ high performance with horizontal scalability and failover with replication
Kafka Topic Partition, Consumers,
Producers
[Figure: Partition 0 holding offsets 0 to 11. A Producer appends at the end of the log while Consumer Group A and Consumer Group B each read at their own offset.]
Consumer groups remember offset where they left off.
Consumer groups each have their own offset.
Producer writing to offset 12 of Partition 0 while…
Consumer Group A is reading from offset 6.
Consumer Group B is reading from offset 9.
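The offsets on this slide can be modeled as one append-only list plus one integer per consumer group. A minimal sketch under stated assumptions: `PartitionLog`, `append` and `poll` are illustrative names, not Kafka client APIs.

```python
from typing import Dict, List


class PartitionLog:
    """Toy model of one topic partition with per-group read positions."""

    def __init__(self) -> None:
        self.records: List[str] = []
        self.group_offsets: Dict[str, int] = {}

    def append(self, value: str) -> int:
        """Producer side: append at the end and return the new record's offset."""
        self.records.append(value)
        return len(self.records) - 1

    def poll(self, group: str, max_records: int = 1) -> List[str]:
        """Consumer side: read from the group's own offset, then advance it."""
        start = self.group_offsets.get(group, 0)
        batch = self.records[start:start + max_records]
        self.group_offsets[group] = start + len(batch)
        return batch


log = PartitionLog()
for i in range(12):                    # offsets 0..11 are already written
    log.append(f"record-{i}")
log.poll("A", max_records=6)           # Group A will read offset 6 next
log.poll("B", max_records=9)           # Group B will read offset 9 next
next_offset = log.append("record-12")  # the producer writes offset 12
```

Neither group's reads affect the other, and the producer never waits for either of them.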
Kafka Scale and Speed
❖ How can Kafka scale if multiple producers and consumers read/write to
same Kafka Topic log?
❖ Writes fast: Sequential writes to filesystem are fast (700 MB or more a
second)
❖ Scales writes and reads by sharding:
❖ Topic logs into Partitions (parts of a Topic log)
❖ Topic logs can be split into multiple Partitions on different
machines/different disks
❖ Multiple Producers can write to different Partitions of the same Topic
❖ Multiple Consumer Groups can read from different partitions efficiently
Kafka Brokers
❖ Kafka Cluster is made up of multiple Kafka Brokers
❖ Each Broker has an ID (number)
❖ Brokers contain topic log partitions
❖ Connecting to one broker bootstraps client to entire
cluster
❖ Start with at least three brokers; cluster can have 10,
100, or 1000 brokers if needed
Kafka Cluster, Failover, ISRs
❖ Topic Partitions can be replicated
❖ across multiple nodes for failover
❖ Topic should have a replication factor greater than 1
❖ (2, or 3)
❖ Failover
❖ if one Kafka Broker goes down then Kafka Broker with
ISR (in-sync replica) can serve data
ZooKeeper does coordination for Kafka
Cluster
[Figure: Producers and Consumers connect to a Topic hosted on three Kafka Brokers; ZooKeeper coordinates the brokers.]
Failover vs. Disaster Recovery
❖ Replication of Kafka Topic Log partitions allows for failure of
a rack or AWS availability zone
❖ You need a replication factor of at least 3
❖ Kafka Replication is for Failover
❖ Mirror Maker is used for Disaster Recovery
❖ Mirror Maker replicates a Kafka cluster to another data-center
or AWS region
❖ Called mirroring since replication happens within a cluster
Kafka Review
❖ How does Kafka decouple streams of data?
❖ What are some use cases for Kafka where you work?
❖ What are some common use cases for Kafka?
❖ What is a Topic?
❖ What is a Broker?
❖ What is a Partition? Offset?
❖ Can Kafka run without Zookeeper?
❖ How do you implement failover in Kafka?
❖ How do you implement disaster recovery in Kafka?
Kafka versus
Kafka vs JMS, SQS, RabbitMQ
Messaging
❖ Is Kafka a Queue or a Pub/Sub/Topic?
❖ Yes
❖ Kafka is like a Queue per consumer group
❖ Consumers in a consumer group share the load of a topic, like
consumers load balancing on a JMS or RabbitMQ queue
❖ Kafka is like Topics in JMS, RabbitMQ, MOM
❖ Topic/pub/sub by offering Consumer Groups which act like
subscriptions
❖ Broadcast to multiple consumer groups
❖ MOM = JMS, ActiveMQ, RabbitMQ, IBM MQ Series, Tibco, etc.
Kafka vs MOM
❖ By design, Kafka is better suited for scale than traditional MOM systems due
to its partitioned topic log
❖ Load divided among Consumers for read by partition
❖ Handles parallel consumers better than traditional MOM
❖ Also by moving location (partition offset) in log to client/consumer side of
equation instead of the broker, less tracking required by Broker and more
flexible consumers
❖ Kafka written with mechanical sympathy, modern hardware, cloud in mind
❖ Disks are faster
❖ Servers have tons of system memory
❖ Easier to spin up servers for scale out
Kinesis and Kafka are similar
❖ Kinesis Streams is like Kafka Core
❖ Kinesis Analytics is like Kafka Streams
❖ Kinesis Shard is like Kafka Partition
❖ Similar and get used in similar use cases
❖ In Kinesis, data is stored in shards. In Kafka, data is stored in partitions
❖ Kinesis Analytics allows you to perform SQL like queries on data streams
❖ Kafka Streams allows you to perform functional aggregations and mutations
❖ Kafka integrates well with Spark and Flink which allows SQL like queries on
streams
Kafka vs. Kinesis
❖ Data is stored in Kinesis for default 24 hours, and you can increase that up to 7 days.
❖ Kafka records are stored for 7 days by default
❖ can increase until you run out of disk space.
❖ Decide by the size of data or by date.
❖ Can use compaction with Kafka so it only stores the latest record per key
in the log
❖ With Kinesis data can be analyzed by lambda before it gets sent to S3 or RedShift
❖ With Kinesis you pay for use, by buying read and write units.
❖ Kafka is more flexible than Kinesis but you have to manage your own clusters, and requires
some dedicated DevOps resources to keep it going
❖ Kinesis is sold as a service and does not require a DevOps team to keep it going (can be
more expensive and less flexible, but much easier to setup and run)
Kafka Topics
Kafka Topics
Architecture
Replication
Failover
Parallel processing
Kafka Topic Architecture
Topics, Logs, Partitions
❖ Kafka Topic is a stream of records
❖ Topics stored in log
❖ Log broken up into partitions and segments
❖ Topic is a category or stream name or feed
❖ Topics are pub/sub
❖ Can have zero or many subscribers - consumer groups
❖ Topics are broken up and spread by partitions for speed and
size
Topic Partitions
❖ Topics are broken up into partitions
❖ Partitions decided usually by key of record
❖ Key of record determines which partition
❖ Partitions are used to scale Kafka across many servers
❖ Record sent to correct partition by key
❖ Partitions are used to facilitate parallel consumers
❖ Records are consumed in parallel up to the number of partitions
❖ Order guaranteed per partition
❖ Partitions can be replicated to multiple brokers
Topic Partition Log
❖ Order is maintained only in a single partition
❖ Partition is ordered, immutable sequence of records that is continually appended
to—a structured commit log
❖ Records in partitions are assigned sequential id number called the offset
❖ Offset identifies each record within the partition
❖ Topic Partitions allow Kafka log to scale beyond a size that will fit on a single server
❖ Topic partition must fit on servers that host it
❖ topic can span many partitions hosted on many servers
❖ Topic Partitions are unit of parallelism - a partition can only be used by one
consumer in group at a time
❖ Consumers can run in their own process or their own thread
❖ If a consumer stops, Kafka spreads partitions across remaining consumers in group
Kafka Topic Partitions Layout
[Figure: a topic split into Partitions 0 to 3. Each partition is an ordered log of offsets running from older to newer, and the partitions have different lengths. Writes append at the newer end.]
Replication: Kafka Partition
Distribution
❖ Each partition has leader server and zero or more follower
servers
❖ Leader handles all read and write requests for partition
❖ Followers replicate leader, and take over if leader dies
❖ Used for parallel consumer handling within a group
❖ Partitions of log are distributed over the servers in the Kafka cluster
with each server handling data and requests for a share of
partitions
❖ Each partition can be replicated across a configurable number of
Kafka servers - Used for fault tolerance
Replication: Kafka Partition
Leader
❖ One replica of each partition (a node/partition pair) is chosen as leader
❖ Leader handles all reads and writes of Records for
partition
❖ Writes to partition are replicated to followers
(node/partition pair)
❖ A follower that is in-sync is called an ISR (in-sync
replica)
❖ If a partition leader fails, one ISR is chosen as new
leader
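Leader failover can be sketched as bookkeeping over a replica list and an ISR set. This is an illustrative model only: the class and method names are hypothetical, and promoting the first surviving ISR in replica order is an assumption of this sketch, not a statement about how Kafka chooses.

```python
from typing import List, Set


class PartitionReplicas:
    """Toy model of one partition's leader, followers and in-sync replica set."""

    def __init__(self, replicas: List[int]) -> None:
        self.replicas = list(replicas)      # broker ids hosting this partition
        self.leader = self.replicas[0]
        self.isr: Set[int] = set(replicas)  # replicas caught up with the leader

    def mark_out_of_sync(self, broker_id: int) -> None:
        """A follower that falls behind is dropped from the ISR."""
        if broker_id != self.leader:
            self.isr.discard(broker_id)

    def broker_failed(self, broker_id: int) -> None:
        """Remove a dead broker; if it led the partition, promote an ISR."""
        self.isr.discard(broker_id)
        if broker_id == self.leader:
            candidates = [b for b in self.replicas if b in self.isr]
            if not candidates:
                raise RuntimeError("no in-sync replica left to lead the partition")
            self.leader = candidates[0]


partition = PartitionReplicas([0, 1, 2])
partition.mark_out_of_sync(1)   # broker 1 lags behind
partition.broker_failed(0)      # the leader dies; broker 2 is the only ISR left
```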
Kafka Replication to Partition 0
[Figure: Kafka Brokers 0, 1 and 2 host replicas of Partitions 0 to 4. (1) The client Producer writes a record to the leader of Partition 0; (2) the leader replicates the record to the followers. Leaders are shown in red, followers in blue.]
Record is considered "committed"
when all ISRs for partition
wrote to their log.
Only committed records are
readable from consumer
Kafka Replication to Partition 1
[Figure: the same three Kafka Brokers hosting Partitions 0 to 4. (1) The client Producer writes a record to the partition leader; (2) the leader replicates the record to the followers. Another partition can be owned by another leader on another Kafka broker. Leaders are shown in red, followers in blue.]
Topic Review
❖ What is an ISR?
❖ How does Kafka scale Consumers?
❖ What are leaders? followers?
❖ How does Kafka perform failover for Consumers?
❖ How does Kafka perform failover for Brokers?
Kafka Producers
Kafka Producers
Partition selection
Durability
Kafka Producer Architecture
Kafka Producers
❖ Producers send records to topics
❖ Producer picks which partition to send record to per topic
❖ Can be done in a round-robin
❖ Can be based on priority
❖ Typically based on key of record
❖ Kafka default partitioner for Java uses hash of keys to
choose partitions, or a round-robin strategy if no key
❖ Important: Producer picks partition
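Partition selection on the producer side can be sketched in a few lines. This is a minimal sketch assuming a CRC32 hash as a stand-in; the hash function Kafka's Java client actually uses is different, and `SimplePartitioner` is a hypothetical name, not a Kafka class.

```python
import itertools
import zlib
from typing import Optional


class SimplePartitioner:
    """Picks a partition from the record key, or round-robin when there is no key."""

    def __init__(self, num_partitions: int) -> None:
        self.num_partitions = num_partitions
        self._round_robin = itertools.cycle(range(num_partitions))

    def partition(self, key: Optional[bytes]) -> int:
        if key is None:
            # No key: spread records evenly across partitions.
            return next(self._round_robin)
        # Stand-in hash: CRC32 is an assumption of this sketch, not Kafka's hash.
        return zlib.crc32(key) % self.num_partitions


partitioner = SimplePartitioner(num_partitions=4)
first = partitioner.partition(b"employee-42")
second = partitioner.partition(b"employee-42")   # same key, same partition
keyless = [partitioner.partition(None) for _ in range(5)]
```

Because the partition is derived only from the key and the partition count, every record with the same key lands in the same partition as long as the partition count does not change.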
Kafka Producers and
Consumers
[Figure: Partition 0 holding offsets 0 to 11, with Producers appending at the end and Consumer Group A reading behind them.]
Producers are writing at Offset 12
Consumer Group A is Reading from Offset 9.
Kafka Producers
❖ Producers write at their own cadence so order of Records
cannot be guaranteed across partitions
❖ Producer configures consistency level (acks=0, acks=1,
acks=all)
❖ Producers pick the partition so that related Records/messages
go to the same partition based on the data
❖ Example: have all the events of a certain 'employeeId' go to
same partition
❖ If order within a partition is not needed, a 'Round Robin'
partition strategy can be used so Records are evenly
distributed across partitions.
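The producer's three consistency levels can be summarized as a small decision function. This is a simplified model with hypothetical parameter names; it ignores timeouts, retries and minimum-ISR settings.

```python
def producer_sees_success(acks: str, leader_wrote: bool,
                          isr_followers: int, isr_followers_wrote: int) -> bool:
    """Toy model of when a send counts as acknowledged for each acks setting."""
    if acks == "0":
        return True                      # fire and forget: no broker response awaited
    if acks == "1":
        return leader_wrote              # the leader's log only
    if acks == "all":
        # The leader and every in-sync follower must have the record.
        return leader_wrote and isr_followers_wrote == isr_followers
    raise ValueError(f"unsupported acks value: {acks!r}")
```

The trade-off is latency against durability: the stricter the setting, the longer a send waits and the fewer failures can lose an acknowledged record.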
Producer Review
❖ Can Producers occasionally write faster than
consumers?
❖ What is the default partition strategy for Producers
without using a key?
❖ What is the default partition strategy for Producers using
a key?
❖ What picks which partition a record is sent to?
Kafka Consumers
Load balancing consumers
Failover for consumers
Offset management per consumer
group
Kafka Consumer Architecture
Kafka Consumer Groups
❖ Consumers are grouped into a Consumer Group
❖ Consumer group has a unique id
❖ Each consumer group is a subscriber
❖ Each consumer group maintains its own offset
❖ Multiple subscribers = multiple consumer groups
❖ Each has different function: one might be delivering records to
microservices while another is streaming records to Hadoop
❖ A Record is delivered to one Consumer in a Consumer Group
❖ Each consumer in consumer groups takes records and only one consumer
in group gets same record
❖ Consumers in Consumer Group load balance record consumption
Kafka Consumer Load Share
❖ Kafka Consumer consumption divides partitions over consumers in a
Consumer Group
❖ Each Consumer is exclusive consumer of a "fair share" of partitions
❖ This is Load Balancing
❖ Consumer membership in Consumer Group is handled by the Kafka
protocol dynamically
❖ If a new Consumer joins a Consumer Group, it gets a share of partitions
❖ If Consumer dies, its partitions are split among remaining live Consumers
in Consumer Group
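The "fair share" idea can be illustrated by dealing partitions out to the live consumers and dealing again whenever membership changes. This round-robin dealing is an assumption for illustration only; `assign_partitions` is a hypothetical helper, and Kafka's own assignment strategies may distribute partitions differently.

```python
from typing import Dict, List


def assign_partitions(partitions: List[int], consumers: List[str]) -> Dict[str, List[int]]:
    """Deal partitions out to consumers one at a time, like dealing cards."""
    assignment: Dict[str, List[int]] = {c: [] for c in consumers}
    if not consumers:
        return assignment
    for i, partition in enumerate(sorted(partitions)):
        owner = consumers[i % len(consumers)]
        assignment[owner].append(partition)
    return assignment


partitions = list(range(6))
before = assign_partitions(partitions, ["c0", "c1", "c2"])
after = assign_partitions(partitions, ["c0", "c2"])       # c1 died: rebalance
crowded = assign_partitions([0, 1], ["c0", "c1", "c2"])   # more consumers than partitions
```

Each partition has exactly one owner within the group at any time, so a consumer that receives no partition simply stays idle until the next rebalance.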
Kafka Consumer Groups
[Figure: Partition 0 holding offsets 0 to 11, with Producers appending at the end and Consumer Group A and Consumer Group B each reading at their own offset.]
Consumers remember offset where they left off.
Consumer groups each have their own offset per partition.
Kafka Consumer Groups
Processing
❖ How does Kafka divide up topic so multiple Consumers in a Consumer
Group can process a topic?
❖ You group consumers into a consumer group with a group id
❖ Consumers with same id belong in same Consumer Group
❖ One Kafka broker becomes group coordinator for Consumer Group
❖ assigns partitions when new members arrive (older clients would talk
direct to ZooKeeper now broker does coordination)
❖ or reassigns partitions when group members leave or topic changes
(config / meta-data change)
❖ When Consumer group is created, offset set according to the consumer's
offset reset policy
Kafka Consumer Failover
❖ Consumer notifies broker when it has successfully processed a
record
❖ advances offset
❖ If Consumer fails before sending commit offset to Kafka
broker,
❖ different Consumer can continue from the last committed
offset
❖ some Kafka records could be reprocessed
❖ at least once behavior
❖ messages should be idempotent
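The at-least-once behavior above can be reproduced with a toy consumer loop that crashes after processing a record but before committing its offset. All names here are hypothetical, and the "crash" is simulated by returning early.

```python
from typing import Callable, List, Set


def consume(records: List[str], committed: int, handler: Callable[[str], None],
            crash_before_commit_at: int = -1) -> int:
    """Process records from the committed offset; return the new committed offset."""
    for offset in range(committed, len(records)):
        handler(records[offset])
        if offset == crash_before_commit_at:
            return committed          # crashed: work was done, commit never sent
        committed = offset + 1        # commit after successful processing
    return committed


records = ["pay-1", "pay-2", "pay-3"]
seen: Set[str] = set()
calls: List[str] = []


def idempotent_handler(record: str) -> None:
    calls.append(record)              # every delivery, including repeats
    seen.add(record)                  # applying the same record twice changes nothing


committed = consume(records, 0, idempotent_handler, crash_before_commit_at=1)
committed = consume(records, committed, idempotent_handler)  # another consumer takes over
```

The second consumer resumes from the last committed offset, so "pay-2" is delivered twice; because the handler is idempotent, the final state is the same as if it had been delivered once.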
Kafka Consumer Offsets and
Recovery
❖ Kafka stores offsets in topic called “__consumer_offsets”
❖ Uses Topic Log Compaction
❖ When a consumer has processed data, it should
commit offsets
❖ If consumer process dies, it will be able to start up and
start reading where it left off based on offset stored in
“__consumer_offsets”
Kafka Consumer: What can be
consumed?
❖ "Log end offset" is offset of last record written
to log partition and where Producers write to
next
❖ "High watermark" is offset of last record
successfully replicated to all of the partition's in-sync followers
❖ Consumer only reads up to “high watermark”.
Consumer can’t read un-replicated data
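Using the definitions on this slide (both values name the offset of a last record), the high watermark is the smallest log end offset among the in-sync replicas. A minimal sketch with hypothetical names; Kafka's internal bookkeeping may number these positions differently.

```python
from typing import Dict, List


def high_watermark(log_end_offsets: Dict[int, int], isr: List[int]) -> int:
    """Highest offset that every in-sync replica has already written."""
    return min(log_end_offsets[broker] for broker in isr)


def readable(records: List[str], hw: int) -> List[str]:
    """Consumers see records only up to and including the high watermark."""
    return records[:hw + 1]


records = ["r0", "r1", "r2", "r3", "r4"]
# Offset of the last record each replica has written (broker id -> offset).
log_end_offsets = {0: 4, 1: 3, 2: 2}   # broker 0 is the leader
hw = high_watermark(log_end_offsets, isr=[0, 1, 2])
visible = readable(records, hw)
```

Here the leader already holds r3 and r4, but consumers cannot read them until the slowest in-sync follower catches up.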
Consumer to Partition
Cardinality
❖ Only a single Consumer from the same Consumer
Group can access a single Partition
❖ If Consumer count in a Consumer Group exceeds Partition count:
❖ Extra Consumers remain idle; can be used for failover
❖ If more Partitions than Consumers in the Consumer Group,
❖ Some Consumers will read from more than one
partition
[Figure: 2-server Kafka cluster hosting 6 partitions (P0-P5). Server 1 hosts P2, P3 and P4; Server 2 hosts P0, P1 and P5. Consumer Group A and Consumer Group B each contain three consumers (labeled C0, C1, C3 on the slide).]
Multi-threaded Consumers
❖ You can run more than one Consumer in a JVM process
❖ If processing records takes a while, a single Consumer can run multiple threads to process
records
❖ Harder to manage offset for each Thread/Task
❖ One Consumer runs multiple threads
❖ 2 messages on same partitions being processed by two different threads
❖ Hard to guarantee order without threads coordination
❖ PREFER: Run multiple Consumers, each processing record batches in its own thread
❖ Easier to manage offset
❖ Each Consumer runs in its thread
❖ Easier to manage failover (each process runs X num of Consumer threads)
Consumer Review
❖ What is a consumer group?
❖ Does each consumer have its own offset?
❖ When can a consumer see a record?
❖ What happens if there are more consumers than
partitions?
❖ What happens if you run multiple consumers in many
threads in the same JVM?
Using Kafka Single Node
Using Kafka Single
Node
Run ZooKeeper, Kafka
Create a topic
Send messages from command
line
Read messages from command
line
Tutorial Using Kafka Single Node
Run Kafka
❖ Run ZooKeeper start up script
❖ Run Kafka Server/Broker start up script
❖ Create Kafka Topic from command line
❖ Run producer from command line
❖ Run consumer from command line
Run ZooKeeper
Run Kafka Server
Create Kafka Topic
List Topics
Run Kafka Producer
Run Kafka Consumer
Running Kafka Producer and
Consumer
Kafka Single Node Review
❖ What server do you run first?
❖ What tool do you use to create a topic?
❖ What tool do you use to see topics?
❖ What tool did we use to send messages on the command line?
❖ What tool did we use to view messages in a topic?
❖ Why were the messages coming out of order?
❖ How could we get the messages to come in order from the
consumer?
Use Kafka to send and receive messages
Lab Use Kafka
Use single server version of
Kafka.
Setup single node.
Single ZooKeeper.
Create a topic.
Produce and consume messages
from the command line.
Using Kafka Cluster
and Failover
Demonstrate Kafka Cluster
Create topic with replication
Show consumer failover
Show broker failover
Kafka Tutorial Cluster and
Failover
Objectives
❖ Run many Kafka Brokers
❖ Create a replicated topic
❖ Demonstrate Pub / Sub
❖ Demonstrate load balancing
consumers
❖ Demonstrate consumer failover
❖ Demonstrate broker failover
Running many nodes
❖ If not already running, start up ZooKeeper
❖ Shutdown Kafka from first lab
❖ Copy server properties for three brokers
❖ Modify properties files, Change port, Change Kafka log
location
❖ Start up many Kafka server instances
❖ Create Replicated Topic
❖ Use the replicated topic
Create three new server-n.properties files
❖ Copy existing server.properties to server-0.properties,
server-1.properties, server-2.properties
❖ Change server-0.properties to use log.dirs
“./logs/kafka-logs-0”
❖ Change server-1.properties to use port 9093, broker
id 1, and log.dirs “./logs/kafka-logs-1”
❖ Change server-2.properties to use port 9094, broker
id 2, and log.dirs “./logs/kafka-logs-2”
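The per-broker differences can be generated rather than edited by hand. This is a sketch under stated assumptions: `broker_overrides` and `render_properties` are hypothetical helpers, the base port of 9092 for broker 0 is assumed (the slides give ports only for brokers 1 and 2), and the key name `port` follows the slide wording, so check which listener setting your Kafka version expects.

```python
from typing import Dict


def broker_overrides(broker_id: int, base_port: int = 9092) -> Dict[str, str]:
    """Settings that must differ for each broker running on the same machine."""
    return {
        "broker.id": str(broker_id),
        "port": str(base_port + broker_id),          # key name as used on the slide
        "log.dirs": f"./logs/kafka-logs-{broker_id}",
    }


def render_properties(overrides: Dict[str, str]) -> str:
    """Format settings as key=value lines, as in a .properties file."""
    return "\n".join(f"{key}={value}" for key, value in overrides.items())


for n in range(3):
    print(f"# server-{n}.properties")
    print(render_properties(broker_overrides(n)))
```

The printed lines are meant to replace the matching entries in each copied properties file; every other setting stays as in the original server.properties.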
Modify server-x.properties
❖ Each has a different
broker.id
❖ Each has a different
log.dirs
❖ Each has a different
port
Create Startup scripts for three Kafka
servers
❖ Passing properties files
from last step
Run Servers
Create Kafka replicated topic my-failsafe-topic
❖ Replication Factor is set to 3
❖ Topic name is my-failsafe-topic
❖ Partitions is 13
Start Kafka Consumer
❖ Pass list of Kafka servers to bootstrap-server
❖ We pass two of the three
❖ Only one needed, it learns about the rest
Start Kafka Producer
❖ Start producer
❖ Pass list of Kafka Brokers
Kafka 1 consumer and 1 producer
running
Start a second and third
consumer
❖ Acts like pub/sub
❖ Each consumer
in its own group
❖ Message goes to
each
❖ How do we load
share?
Running consumers in same
group
❖ Modify start consumer script
❖ Add the consumers to a group called
mygroup
❖ Now they will share load
Start up three consumers
again
❖ Start up producer and three consumers
❖ Send 7 messages
❖ Notice how messages are spread among 3 consumers
Consumer Failover
❖ Kill one consumer
❖ Send seven more
messages
❖ Load is spread to
remaining consumers
❖ Failover WORKS!
Create Kafka Describe Topic
❖ --describe will list partitions, ISRs, and
partition leadership
Use Describe Topics
❖ Lists which broker owns (is the leader of) each partition
❖ Lists Replicas and ISRs (replicas that are up to date)
❖ Notice there are 13 partitions
Test Broker Failover: Kill 1st server
Use Kafka topic describe to see that a new leader was elected!
Kill the first server
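The leader election that describe reveals can be sketched as follows: when a broker dies it drops out of every ISR, and each partition it led gets a new leader from the remaining in-sync replicas. The replica layout below is invented for illustration (three partitions rather than 13), and choosing the first surviving ISR member in replica order is a simplification of what the Kafka controller does.

```python
# Toy model of partition leadership before and after a broker failure.
# Not Kafka code; the layout and election rule are simplified assumptions.
from dataclasses import dataclass, field

@dataclass
class Partition:
    leader: int
    replicas: list
    isr: list = field(default_factory=list)

def fail_broker(partitions: dict, dead: int) -> None:
    """Remove a broker from every ISR and re-elect the leaders it held."""
    for p in partitions.values():
        p.isr = [b for b in p.isr if b != dead]
        if p.leader == dead:
            # Follow replica order; only in-sync replicas are eligible.
            candidates = [b for b in p.replicas if b in p.isr]
            p.leader = candidates[0] if candidates else -1  # -1: no leader

topic = {
    0: Partition(leader=0, replicas=[0, 1, 2], isr=[0, 1, 2]),
    1: Partition(leader=1, replicas=[1, 2, 0], isr=[1, 2, 0]),
    2: Partition(leader=2, replicas=[2, 0, 1], isr=[2, 0, 1]),
}

fail_broker(topic, dead=0)
for number, p in topic.items():
    print(f"Partition: {number} Leader: {p.leader} Replicas: {p.replicas} Isr: {p.isr}")
```

The replica list is unchanged by the failure; only the ISR shrinks and the leader moves, which is the difference to look for when comparing describe output before and after killing a server.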
Show Broker Failover Worked
❖ Send two more
messages from the
producer
❖ Notice that the
consumer gets the
messages
❖ Broker Failover WORKS!
Kafka Cluster Review
❖ Why did the three consumers not load share the
messages at first?
❖ How did we demonstrate failover for consumers?
❖ How did we demonstrate failover for brokers?
❖ What tool and option did we use to show ownership of
partitions and the ISRs?
Use Kafka to send and receive messages
Lab 2: Use Kafka with multiple nodes
Use a Kafka cluster to replicate a Kafka topic log
Kafka Ecosystem
Kafka Connect
Kafka Streaming
Kafka Schema Registry
Kafka REST
Kafka Ecosystem
❖ Kafka Streams
❖ Streams API to transform, aggregate, and process records from a stream and produce derivative streams
❖ Kafka Connect
❖ Connector API for reusable producers and consumers
❖ (e.g., stream of changes from DynamoDB)
❖ Kafka REST Proxy
❖ Producers and Consumers over REST (HTTP)
❖ Schema Registry - Manages schemas using Avro for Kafka Records
❖ Kafka MirrorMaker - Replicate cluster data to another cluster
Kafka REST Proxy and
Kafka Schema Registry
Kafka Ecosystem
Kafka Stream Processing
❖ Kafka Streams for Stream Processing
❖ Kafka enables real-time processing of streams
❖ Kafka Streams supports stream processors
❖ A stream processor performs processing, transformation, and aggregation, and produces one or more output streams
❖ Example: a video player app sends events such as videos watched and videos paused
❖ Output a new stream of user preferences
❖ Can gear new video recommendations to recent user activity
❖ Can aggregate the activity of many users to see which new videos are hot
❖ Solves hard problems: out-of-order records, aggregating/joining across streams, stateful computations, and more
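The video-player example can be illustrated with a plain-Python stand-in for a stream processor. This is not the Kafka Streams API (which is a Java library); the event fields and values are hypothetical. The sketch consumes an input stream of player events and derives an output stream of running watch counts per video.

```python
# Stand-in for a stream processor: filter events, keep state, emit a new stream.
# Plain Python for illustration only; event fields are hypothetical.
from collections import Counter
from typing import Iterable, Iterator, Tuple

def hot_videos(events: Iterable[dict]) -> Iterator[Tuple[str, int]]:
    """Stateful aggregation: emit (video, running watch count) per 'watched' event."""
    counts = Counter()
    for event in events:
        if event["type"] != "watched":
            continue  # filter: paused events do not count as views
        counts[event["video"]] += 1
        yield event["video"], counts[event["video"]]

events = [
    {"user": "u1", "type": "watched", "video": "cats"},
    {"user": "u2", "type": "paused", "video": "cats"},
    {"user": "u2", "type": "watched", "video": "dogs"},
    {"user": "u3", "type": "watched", "video": "cats"},
]

output = list(hot_videos(events))
print(output)
```

The `counts` table is the kind of state a stream processor has to keep between records, which is what makes aggregation a stateful computation.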
Kafka Connectors and Streams
[Diagram: a Kafka Cluster surrounded by Apps and DBs, with the links labeled Connectors, Producers, Consumers, and Streams]
Kafka Ecosystem review
❖ What is Kafka Streams?
❖ What is Kafka Connect?
❖ What is the Schema Registry?
❖ What is Kafka MirrorMaker?
❖ When might you use Kafka REST Proxy?

Kafka Tutorial, Kafka ecosystem with clustering examples

  • 1. ™ Cassandra / Kafka Support in EC2/AWS. Kafka Training, Kafka Consulting Cassandra and Kafka Support on AWS/EC2 Cloudurable Introduction to Kafka Support around Cassandra and Kafka running in EC2
  • 2. ™ Cassandra / Kafka Support in EC2/AWS. Kafka Training, Kafka Consulting
  • 3. ™ Cassandra / Kafka Support in EC2/AWS. Kafka Training, Kafka Consulting Kafka growing Why Kafka? Kafka adoption is on the rise but why What is Kafka?
  • 4. Cassandra / Kafka Support in EC2/AWS. Kafka Training, Kafka Consulting ™ Kafka growth exploding ❖ Kafka growth exploding ❖ 1/3 of all Fortune 500 companies ❖ Top ten travel companies, 7 of top ten banks, 8 of top ten insurance companies, 9 of top ten telecom companies ❖ LinkedIn, Microsoft and Netflix process 4 comma message a day with Kafka (1,000,000,000,000) ❖ Real-time streams of data, used to collect big data or to do real time analysis (or both)
  • 5. Cassandra / Kafka Support in EC2/AWS. Kafka Training, Kafka Consulting ™ Why Kafka is Needed? ❖ Real time streaming data processed for real time analytics ❖ Service calls, track every call, IOT sensors ❖ Apache Kafka is a fast, scalable, durable, and fault- tolerant publish-subscribe messaging system ❖ Kafka is often used instead of JMS, RabbitMQ and AMQP ❖ higher throughput, reliability and replication
  • 6. Cassandra / Kafka Support in EC2/AWS. Kafka Training, Kafka Consulting ™ Why is Kafka needed? 2 ❖ Kafka can works in combination with ❖ Flume/Flafka, Spark Streaming, Storm, HBase and Spark for real-time analysis and processing of streaming data ❖ Feed your data lakes with data streams ❖ Kafka brokers support massive message streams for follow- up analysis in Hadoop or Spark ❖ Kafka Streaming (subproject) can be used for real-time analytics
  • 7. Cassandra / Kafka Support in EC2/AWS. Kafka Training, Kafka Consulting ™ Kafka Use Cases ❖ Stream Processing ❖ Website Activity Tracking ❖ Metrics Collection and Monitoring ❖ Log Aggregation ❖ Real time analytics ❖ Capture and ingest data into Spark / Hadoop ❖ CRQS, replay, error recovery ❖ Guaranteed distributed commit log for in-memory computing
  • 8. Cassandra / Kafka Support in EC2/AWS. Kafka Training, Kafka Consulting ™ Who uses Kafka? ❖ LinkedIn: Activity data and operational metrics ❖ Twitter: Uses it as part of Storm – stream processing infrastructure ❖ Square: Kafka as bus to move all system events to various Square data centers (logs, custom events, metrics, an so on). Outputs to Splunk, Graphite, Esper-like alerting systems ❖ Spotify, Uber, Tumbler, Goldman Sachs, PayPal, Box, Cisco, CloudFlare, DataDog, LucidWorks, MailChimp, NetFlix, etc.
  • 9. Cassandra / Kafka Support in EC2/AWS. Kafka Training, Kafka Consulting ™ Why is Kafka Popular? ❖ Great performance ❖ Operational Simplicity, easy to setup and use, easy to reason ❖ Stable, Reliable Durability, ❖ Flexible Publish-subscribe/queue (scales with N-number of consumer groups), ❖ Robust Replication, ❖ Producer Tunable Consistency Guarantees, ❖ Ordering Preserved at shard level (Topic Partition) ❖ Works well with systems that have data streams to process, aggregate, transform & load into other stores Most important reason: Kafka’s great performance: throughput, latency, obtained through great engineering
  • 10. Cassandra / Kafka Support in EC2/AWS. Kafka Training, Kafka Consulting ™ Why is Kafka so fast? ❖ Zero Copy - calls the OS kernel direct rather to move data fast ❖ Batch Data in Chunks - Batches data into chunks ❖ end to end from Producer to file system to Consumer ❖ Provides More efficient data compression. Reduces I/O latency ❖ Sequential Disk Writes - Avoids Random Disk Access ❖ writes to immutable commit log. No slow disk seeking. No random I/O operations. Disk accessed in sequential manner ❖ Horizontal Scale - uses 100s to thousands of partitions for a single topic ❖ spread out to thousands of servers ❖ handle massive load
  • 11. Cassandra / Kafka Support in EC2/AWS. Kafka Training, Kafka Consulting ™ Streaming Architecture
  • 12. Cassandra / Kafka Support in EC2/AWS. Kafka Training, Kafka Consulting ™ Why Kafka Review ❖ Why is Kafka so fast? ❖ How fast is Kafka usage growing? ❖ How is Kafka getting used? ❖ Where does Kafka fit in the Big Data Architecture? ❖ How does Kafka relate to real-time analytics? ❖ Who uses Kafka?
  • 13. ™ Cassandra / Kafka Support in EC2/AWS. Kafka Training, Kafka Consulting Cassandra / Kafka Support in EC2/AWS What is Kafka? Kafka messaging Kafka Overview
  • 14. Cassandra / Kafka Support in EC2/AWS. Kafka Training, Kafka Consulting ™ What is Kafka? ❖ Distributed Streaming Platform ❖ Publish and Subscribe to streams of records ❖ Fault tolerant storage ❖ Replicates Topic Log Partitions to multiple servers ❖ Process records as they occur ❖ Fast, efficient IO, batching, compression, and more ❖ Used to decouple data streams
  • 15. Cassandra / Kafka Support in EC2/AWS. Kafka Training, Kafka Consulting ™ Kafka Decoupling Data Streams
  • 16. Cassandra / Kafka Support in EC2/AWS. Kafka Training, Kafka Consulting ™ Kafka Polyglot clients / Wire protocol ❖ Kafka communication from clients and servers wire protocol over TCP protocol ❖ Protocol versioned ❖ Maintains backwards compatibility ❖ Many languages supported ❖ Kafka REST proxy allows easy integration ❖ Also provides Avro/Schema registry support via Kafka ecosystem
  • 17. Cassandra / Kafka Support in EC2/AWS. Kafka Training, Kafka Consulting ™ Kafka Usage ❖ Build real-time streaming applications that react to streams ❖ Real-time data analytics ❖ Transform, react, aggregate, join real-time data flows ❖ Feed events to CEP for complex event processing ❖ Feed data lakes ❖ Build real-time streaming data pipe-lines ❖ Enable in-memory microservices (actors, Akka, Vert.x, Qbit, RxJava)
  • 18. Cassandra / Kafka Support in EC2/AWS. Kafka Training, Kafka Consulting ™ Kafka Use Cases ❖ Metrics / KPIs gathering ❖ Aggregate statistics from many sources ❖ Event Sourcing ❖ Used with microservices (in-memory) and actor systems ❖ Commit Log ❖ External commit log for distributed systems. Replicated data between nodes, re-sync for nodes to restore state ❖ Real-time data analytics, Stream Processing, Log Aggregation, Messaging, Click-stream tracking, Audit trail, etc.
  • 19. Cassandra / Kafka Support in EC2/AWS. Kafka Training, Kafka Consulting ™ Kafka Record Retention ❖ Kafka cluster retains all published records ❖ Time based – configurable retention period ❖ Size based - configurable based on size ❖ Compaction - keeps latest record ❖ Retention policy of three days or two weeks or a month ❖ It is available for consumption until discarded by time, size or compaction ❖ Consumption speed not impacted by size
  • 20. Cassandra / Kafka Support in EC2/AWS. Kafka Training, Kafka Consulting ™ Kafka scalable message storage ❖ Kafka acts as a good storage system for records/messages ❖ Records written to Kafka topics are persisted to disk and replicated to other servers for fault-tolerance ❖ Kafka Producers can wait on acknowledgment ❖ Write not complete until fully replicated ❖ Kafka disk structures scales well ❖ Writing in large streaming batches is fast ❖ Clients/Consumers can control read position (offset) ❖ Kafka acts like high-speed file system for commit log storage, replication
  • 21. Cassandra / Kafka Support in EC2/AWS. Kafka Training, Kafka Consulting ™ Kafka Review ❖ How does Kafka decouple streams of data? ❖ What are some use cases for Kafka where you work? ❖ What are some common use cases for Kafka? ❖ How is Kafka like a distributed message storage system? ❖ How does Kafka know when to delete old messages? ❖ Which programming languages does Kafka support?
  • 22. ™ Cassandra / Kafka Support in EC2/AWS. Kafka Training, Kafka Consulting Kafka Architecture
  • 23. Cassandra / Kafka Support in EC2/AWS. Kafka Training, Kafka Consulting ™ Kafka Fundamentals ❖ Records have a key (optional), value and timestamp; Immutable ❖ Topic a stream of records (“/orders”, “/user-signups”), feed name ❖ Log topic storage on disk ❖ Partition / Segments (parts of Topic Log) ❖ Producer API to produce a streams or records ❖ Consumer API to consume a stream of records ❖ Broker: Kafka server that runs in a Kafka Cluster. Brokers form a cluster. Cluster consists on many Kafka Brokers on many servers. ❖ ZooKeeper: Does coordination of brokers/cluster topology. Consistent file system for configuration information and leadership election for Broker Topic Partition Leaders
  • 24. Cassandra / Kafka Support in EC2/AWS. Kafka Training, Kafka Consulting ™ Kafka: Topics, Producers, and Consumers Kafka Cluster Topic Producer Producer Producer Consumer Consumer Consumer record record
  • 25. Cassandra / Kafka Support in EC2/AWS. Kafka Training, Kafka Consulting ™ Core Kafka
  • 26. Cassandra / Kafka Support in EC2/AWS. Kafka Training, Kafka Consulting ™ Kafka needs Zookeeper ❖ Zookeeper helps with leadership election of Kafka Broker and Topic Partition pairs ❖ Zookeeper manages service discovery for Kafka Brokers that form the cluster ❖ Zookeeper sends changes to Kafka ❖ New Broker join, Broker died, etc. ❖ Topic removed, Topic added, etc. ❖ Zookeeper provides in-sync view of Kafka Cluster configuration
  • 27. Cassandra / Kafka Support in EC2/AWS. Kafka Training, Kafka Consulting ™ Kafka Producer/Consumer Details ❖ Producers write to and Consumers read from Topic(s) ❖ Topic associated with a log which is data structure on disk ❖ Producer(s) append Records at end of Topic log ❖ Topic Log consist of Partitions - ❖ Spread to multiple files on multiple nodes ❖ Consumers read from Kafka at their own cadence ❖ Each Consumer (Consumer Group) tracks offset from where they left off reading ❖ Partitions can be distributed on different machines in a cluster ❖ high performance with horizontal scalability and failover with replication
  • 28. Cassandra / Kafka Support in EC2/AWS. Kafka Training, Kafka Consulting ™ Kafka Topic Partition, Consumers, Producers 0 1 42 3 5 6 7 8 9 10 11 Partition 0 Consumer Group A Producer Consumer Group B Consumer groups remember offset where they left off. Consumers groups each have their own offset. Producer writing to offset 12 of Partition 0 while… Consumer Group A is reading from offset 6. Consumer Group B is reading from offset 9.
  • 29. Cassandra / Kafka Support in EC2/AWS. Kafka Training, Kafka Consulting ™ Kafka Scale and Speed ❖ How can Kafka scale if multiple producers and consumers read/write to same Kafka Topic log? ❖ Writes fast: Sequential writes to filesystem are fast (700 MB or more a second) ❖ Scales writes and reads by sharding: ❖ Topic logs into Partitions (parts of a Topic log) ❖ Topics logs can be split into multiple Partitions different machines/different disks ❖ Multiple Producers can write to different Partitions of the same Topic ❖ Multiple Consumers Groups can read from different partitions efficiently
  • 30. Cassandra / Kafka Support in EC2/AWS. Kafka Training, Kafka Consulting ™ Kafka Brokers ❖ Kafka Cluster is made up of multiple Kafka Brokers ❖ Each Broker has an ID (number) ❖ Brokers contain topic log partitions ❖ Connecting to one broker bootstraps client to entire cluster ❖ Start with at least three brokers, cluster can have, 10, 100, 1000 brokers if needed
  • 31. Cassandra / Kafka Support in EC2/AWS. Kafka Training, Kafka Consulting ™ Kafka Cluster, Failover, ISRs ❖ Topic Partitions can be replicated ❖ across multiple nodes for failover ❖ Topic should have a replication factor greater than 1 ❖ (2, or 3) ❖ Failover ❖ if one Kafka Broker goes down then Kafka Broker with ISR (in-sync replica) can serve data
  • 32. Cassandra / Kafka Support in EC2/AWS. Kafka Training, Kafka Consulting ™ ZooKeeper does coordination for Kafka Cluster Kafka BrokerProducer Producer Producer Consumer Consumer Consumer Kafka Broker Kafka Broker Topic ZooKeeper
  • 33. Cassandra / Kafka Support in EC2/AWS. Kafka Training, Kafka Consulting ™ Failover vs. Disaster Recovery ❖ Replication of Kafka Topic Log partitions allows for failure of a rack or AWS availability zone ❖ You need a replication factor of at least 3 ❖ Kafka Replication is for Failover ❖ Mirror Maker is used for Disaster Recovery ❖ Mirror Maker replicates a Kafka cluster to another data-center or AWS region ❖ Called mirroring since replication happens within a cluster
  • 34. Cassandra / Kafka Support in EC2/AWS. Kafka Training, Kafka Consulting ™ Kafka Review ❖ How does Kafka decouple streams of data? ❖ What are some use cases for Kafka where you work? ❖ What are some common use cases for Kafka? ❖ What is a Topic? ❖ What is a Broker? ❖ What is a Partition? Offset? ❖ Can Kafka run without Zookeeper? ❖ How do implement failover in Kafka? ❖ How do you implement disaster recovery in Kafka?
  • 35. ™ Cassandra / Kafka Support in EC2/AWS. Kafka Training, Kafka Consulting Kafka versus
  • 36. Cassandra / Kafka Support in EC2/AWS. Kafka Training, Kafka Consulting ™ Kafka vs JMS, SQS, RabbitMQ Messaging ❖ Is Kafka a Queue or a Pub/Sub/Topic? ❖ Yes ❖ Kafka is like a Queue per consumer group ❖ Kafka is a queue system per consumer in consumer group so load balancing like JMS, RabbitMQ queue ❖ Kafka is like Topics in JMS, RabbitMQ, MOM ❖ Topic/pub/sub by offering Consumer Groups which act like subscriptions ❖ Broadcast to multiple consumer groups ❖ MOM = JMS, ActiveMQ, RabbitMQ, IBM MQ Series, Tibco, etc.
  • 37. Cassandra / Kafka Support in EC2/AWS. Kafka Training, Kafka Consulting ™ Kafka vs MOM ❖ By design, Kafka is better suited for scale than traditional MOM systems due to partition topic log ❖ Load divided among Consumers for read by partition ❖ Handles parallel consumers better than traditional MOM ❖ Also by moving location (partition offset) in log to client/consumer side of equation instead of the broker, less tracking required by Broker and more flexible consumers ❖ Kafka written with mechanical sympathy, modern hardware, cloud in mind ❖ Disks are faster ❖ Servers have tons of system memory ❖ Easier to spin up servers for scale out
  • 38. Cassandra / Kafka Support in EC2/AWS. Kafka Training, Kafka Consulting ™ Kinesis and Kafka are similar ❖ Kinesis Streams is like Kafka Core ❖ Kinesis Analytics is like Kafka Streams ❖ Kinesis Shard is like Kafka Partition ❖ Similar and get used in similar use cases ❖ In Kinesis, data is stored in shards. In Kafka, data is stored in partitions ❖ Kinesis Analytics allows you to perform SQL like queries on data streams ❖ Kafka Streaming allows you to perform functional aggregations and mutations ❖ Kafka integrates well with Spark and Flink which allows SQL like queries on streams
  • 39. Cassandra / Kafka Support in EC2/AWS. Kafka Training, Kafka Consulting ™ Kafka vs. Kinesis ❖ Data is stored in Kinesis for default 24 hours, and you can increase that up to 7 days. ❖ Kafka records default stored for 7 days ❖ can increase until you run out of disk space. ❖ Decide by the size of data or by date. ❖ Can use compaction with Kafka so it only stores the latest timestamp per key per record in the log ❖ With Kinesis data can be analyzed by lambda before it gets sent to S3 or RedShift ❖ With Kinesis you pay for use, by buying read and write units. ❖ Kafka is more flexible than Kinesis but you have to manage your own clusters, and requires some dedicated DevOps resources to keep it going ❖ Kinesis is sold as a service and does not require a DevOps team to keep it going (can be more expensive and less flexible, but much easier to setup and run)
  • 40. ™ Cassandra / Kafka Support in EC2/AWS. Kafka Training, Kafka Consulting Kafka Topics Kafka Topics Architecture Replication Failover Parallel processing Kafka Topic Architecture
  • 41. Cassandra / Kafka Support in EC2/AWS. Kafka Training, Kafka Consulting ™ Topics, Logs, Partitions ❖ Kafka Topic is a stream of records ❖ Topics stored in log ❖ Log broken up into partitions and segments ❖ Topic is a category or stream name or feed ❖ Topics are pub/sub ❖ Can have zero or many subscribers - consumer groups ❖ Topics are broken up and spread by partitions for speed and size
  • 42. Cassandra / Kafka Support in EC2/AWS. Kafka Training, Kafka Consulting ™ Topic Partitions ❖ Topics are broken up into partitions ❖ Partitions decided usually by key of record ❖ Key of record determines which partition ❖ Partitions are used to scale Kafka across many servers ❖ Record sent to correct partition by key ❖ Partitions are used to facilitate parallel consumers ❖ Records are consumed in parallel up to the number of partitions ❖ Order guaranteed per partition ❖ Partitions can be replicated to multiple brokers
  • 43. Cassandra / Kafka Support in EC2/AWS. Kafka Training, Kafka Consulting ™ Topic Partition Log ❖ Order is maintained only in a single partition ❖ Partition is ordered, immutable sequence of records that is continually appended to—a structured commit log ❖ Records in partitions are assigned sequential id number called the offset ❖ Offset identifies each record within the partition ❖ Topic Partitions allow Kafka log to scale beyond a size that will fit on a single server ❖ Topic partition must fit on servers that host it ❖ topic can span many partitions hosted on many servers ❖ Topic Partitions are unit of parallelism - a partition can only be used by one consumer in group at a time ❖ Consumers can run in their own process or their own thread ❖ If a consumer stops, Kafka spreads partitions across remaining consumer in group
  • 44. Cassandra / Kafka Support in EC2/AWS. Kafka Training, Kafka Consulting ™ Kafka Topic Partitions Layout 0 1 42 3 5 6 7 8 9 10 11 0 1 42 3 5 6 7 8 0 1 42 3 5 6 7 8 9 10 Older Newer 0 1 42 3 5 6 7 Partition 0 Partition 1 Partition 2 Partition 3 Writes
  • 45. Cassandra / Kafka Support in EC2/AWS. Kafka Training, Kafka Consulting ™ Replication: Kafka Partition Distribution ❖ Each partition has leader server and zero or more follower servers ❖ Leader handles all read and write requests for partition ❖ Followers replicate leader, and take over if leader dies ❖ Used for parallel consumer handling within a group ❖ Partitions of log are distributed over the servers in the Kafka cluster with each server handling data and requests for a share of partitions ❖ Each partition can be replicated across a configurable number of Kafka servers - Used for fault tolerance
  • 46. Cassandra / Kafka Support in EC2/AWS. Kafka Training, Kafka Consulting ™ Replication: Kafka Partition Leader ❖ One node/partition’s replicas is chosen as leader ❖ Leader handles all reads and writes of Records for partition ❖ Writes to partition are replicated to followers (node/partition pair) ❖ An follower that is in-sync is called an ISR (in-sync replica) ❖ If a partition leader fails, one ISR is chosen as new leader
  • 47. Cassandra / Kafka Support in EC2/AWS. Kafka Training, Kafka Consulting ™ Kafka Replication to Partition 0 Kafka Broker 0 Partition 0 Partition 1 Partition 2 Partition 3 Partition 4 Kafka Broker 1 Partition 0 Partition 1 Partition 2 Partition 3 Partition 4 Kafka Broker 2 Partition 1 Partition 2 Partition 3 Partition 4 Client Producer 1) Write record Partition 0 2) Replicate record 2) Replicate record Leader Red Follower Blue Record is considered "committed" when all ISRs for partition wrote to their log. Only committed records are readable from consumer
  • 48. Cassandra / Kafka Support in EC2/AWS. Kafka Training, Kafka Consulting ™ Kafka Replication to Partitions 1 Kafka Broker 0 Partition 0 Partition 1 Partition 2 Partition 3 Partition 4 Kafka Broker 1 Partition 0 Partition 1 Partition 2 Partition 3 Partition 4 Kafka Broker 2 Partition 1 Partition 2 Partition 3 Partition 4 Client Producer 1) Write record Partition 0 2) Replicate record 2) Replicate record Another partition can be owned by another leader on another Kafka broker Leader Red Follower Blue
  • 49. Cassandra / Kafka Support in EC2/AWS. Kafka Training, Kafka Consulting ™ Topic Review ❖ What is an ISR? ❖ How does Kafka scale Consumers? ❖ What are leaders? followers? ❖ How does Kafka perform failover for Consumers? ❖ How does Kafka perform failover for Brokers?
  • 50. ™ Cassandra / Kafka Support in EC2/AWS. Kafka Training, Kafka Consulting Kafka Producers Kafka Producers Partition selection Durability Kafka Producer Architecture
  • 51. Cassandra / Kafka Support in EC2/AWS. Kafka Training, Kafka Consulting ™ Kafka Producers ❖ Producers send records to topics ❖ Producer picks which partition to send record to per topic ❖ Can be done in a round-robin ❖ Can be based on priority ❖ Typically based on key of record ❖ Kafka default partitioner for Java uses hash of keys to choose partitions, or a round-robin strategy if no key ❖ Important: Producer picks partition
  • 52. Cassandra / Kafka Support in EC2/AWS. Kafka Training, Kafka Consulting ™ Kafka Producers and Consumers 0 1 42 3 5 6 7 8 9 10 11 Partition 0 Producers Consumer Group A Producers are writing at Offset 12 Consumer Group A is Reading from Offset 9.
  • 53. Cassandra / Kafka Support in EC2/AWS. Kafka Training, Kafka Consulting ™ Kafka Producers ❖ Producers write at their own cadence so order of Records cannot be guaranteed across partitions ❖ Producer configures consistency level (ack=0, ack=all, ack=1) ❖ Producers pick the partition such that Record/messages goes to a given same partition based on the data ❖ Example have all the events of a certain 'employeeId' go to same partition ❖ If order within a partition is not needed, a 'Round Robin' partition strategy can be used so Records are evenly distributed across partitions.
  • 54. Producer Review ❖ Can Producers occasionally write faster than consumers? ❖ What is the default partition strategy for Producers without using a key? ❖ What is the default partition strategy for Producers using a key? ❖ What picks which partition a record is sent to?
  • 55. Kafka Consumers ❖ Load balancing consumers ❖ Failover for consumers ❖ Offset management per consumer group ❖ Kafka Consumer Architecture
  • 56. Kafka Consumer Groups ❖ Consumers are grouped into a Consumer Group ❖ A Consumer Group has a unique id ❖ Each Consumer Group is a subscriber ❖ Each Consumer Group maintains its own offset ❖ Multiple subscribers = multiple Consumer Groups ❖ Each has a different function: one might be delivering records to microservices while another is streaming records to Hadoop ❖ A Record is delivered to one Consumer in a Consumer Group ❖ Each Consumer in a Consumer Group takes records, and only one Consumer in the group gets a given record ❖ Consumers in a Consumer Group load balance record consumption
  • 57. Kafka Consumer Load Share ❖ Kafka divides a topic's partitions over the Consumers in a Consumer Group ❖ Each Consumer is the exclusive consumer of a "fair share" of partitions ❖ This is Load Balancing ❖ Consumer membership in a Consumer Group is handled dynamically by the Kafka protocol ❖ If a new Consumer joins a Consumer Group, it gets a share of the partitions ❖ If a Consumer dies, its partitions are split among the remaining live Consumers in the Consumer Group
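As a rough sketch of the load sharing on this slide, the function below spreads partitions over the members of one group so that each partition has exactly one owner. The `assign` helper and the consumer names are hypothetical, and a simple round-robin split is assumed; the strategy a real cluster uses is configurable and may distribute partitions differently.

```python
def assign(partitions, consumers):
    """Give every partition exactly one owner, spreading them round-robin."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(sorted(partitions)):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


partitions = list(range(6))

# Three live consumers each own a fair share of the six partitions.
before = assign(partitions, ["c0", "c1", "c2"])

# c1 dies: its partitions are split among the remaining consumers.
after = assign(partitions, ["c0", "c2"])

# More consumers than partitions: the extra consumer gets nothing and sits idle.
idle = assign([0, 1], ["c0", "c1", "c2"])

print(before)
print(after)
print(idle)
```

The last case shows why adding consumers beyond the partition count does not increase throughput: the surplus members only serve as standbys for failover.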
  • 58. Kafka Consumer Groups (diagram: Partition 0 with records at offsets 0 through 11; producers append at the end while Consumer Group A and Consumer Group B each read at their own position) ❖ Consumers remember the offset where they left off ❖ Each consumer group has its own offset per partition
  • 59. Kafka Consumer Groups Processing ❖ How does Kafka divide up a topic so multiple Consumers in a Consumer Group can process it? ❖ You group consumers into a consumer group with a group id ❖ Consumers with the same id belong to the same Consumer Group ❖ One Kafka broker becomes group coordinator for the Consumer Group ❖ assigns partitions when new members arrive (older clients would talk directly to ZooKeeper; now the broker does coordination) ❖ or reassigns partitions when group members leave or the topic changes (config / meta-data change) ❖ When a Consumer Group is created, its offset is set according to the consumer's reset policy (auto.offset.reset)
  • 60. Kafka Consumer Failover ❖ A Consumer notifies the broker when it has successfully processed a record ❖ this advances the offset ❖ If a Consumer fails before sending the commit offset to the Kafka broker, ❖ a different Consumer can continue from the last committed offset ❖ some Kafka records could be reprocessed ❖ at-least-once behavior ❖ message processing should be idempotent
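The at-least-once behavior on this slide can be simulated without a broker. In this sketch a list plays the role of one partition, and `GroupState`, `run_consumer`, `commit_every`, and `crash_after` are hypothetical names chosen for illustration; it only models the commit-then-resume logic, not the Kafka client API.

```python
class GroupState:
    """Per-group state for one partition (hypothetical model)."""

    def __init__(self):
        self.committed = 0      # next offset to read after a restart
        self.deliveries = []    # every delivery, including repeats
        self.applied = set()    # idempotent sink: re-applying changes nothing


def run_consumer(log, state, commit_every, crash_after=None):
    """Consume from the last committed offset; return False on a simulated crash."""
    for offset in range(state.committed, len(log)):
        record = log[offset]
        state.deliveries.append(record)
        state.applied.add(record)
        if crash_after is not None and offset == crash_after:
            return False  # dies after processing but before committing
        if (offset + 1) % commit_every == 0:
            state.committed = offset + 1
    state.committed = len(log)
    return True


log = ["r0", "r1", "r2", "r3", "r4"]
state = GroupState()

# First consumer commits after r1, processes r2, then crashes before committing it.
run_consumer(log, state, commit_every=2, crash_after=2)

# A replacement consumer resumes from the last committed offset and sees r2 again.
run_consumer(log, state, commit_every=2)

print(state.deliveries)
print(sorted(state.applied))
```

The record `r2` is delivered twice, but because the sink is a set the end result is the same as if it had been delivered once, which is what idempotent processing buys you.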
  • 61. Kafka Consumer Offsets and Recovery ❖ Kafka stores offsets in a topic called "__consumer_offsets" ❖ Uses Topic Log Compaction ❖ When a consumer has processed data, it should commit offsets ❖ If the consumer process dies, it will be able to start up and resume reading where it left off, based on the offset stored in "__consumer_offsets"
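Log compaction is what keeps the offsets topic small: conceptually, only the most recent value for each key needs to survive. The sketch below models that idea with a plain dictionary. The `compact` helper, the group and topic names, and the (group, topic, partition) key layout are illustrative assumptions rather than the broker's actual storage format.

```python
def compact(commits):
    """Keep only the latest committed offset per key, in arrival order."""
    latest = {}
    for key, offset in commits:
        latest[key] = offset  # a later commit replaces an earlier one
    return latest


# Each entry is ((group, topic, partition), committed offset).
commits = [
    (("group-a", "my-topic", 0), 3),
    (("group-b", "my-topic", 0), 1),
    (("group-a", "my-topic", 0), 9),
    (("group-a", "my-topic", 1), 4),
]

compacted = compact(commits)
print(compacted[("group-a", "my-topic", 0)])
print(len(compacted))
```

After compaction, a restarting consumer only needs the single surviving entry for its group and partition to know where to resume.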
  • 62. Kafka Consumer: What can be consumed? ❖ "Log end offset" is the offset of the last record written to a log partition and where Producers write to next ❖ "High watermark" is the offset of the last record successfully replicated to all of the partition's followers ❖ A Consumer only reads up to the "high watermark"; it can't read un-replicated data
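A small model makes the relationship between the two offsets concrete. In this sketch each replica is represented only by how many records it holds (an assumed convention for illustration), and the high watermark is the smallest of those counts, since that is how far every replica has caught up. The broker names and helper functions are hypothetical.

```python
def high_watermark(replica_record_counts):
    """Records are fully replicated only up to the slowest replica."""
    return min(replica_record_counts.values())


def readable(log, replica_record_counts):
    """Consumers may read only the fully replicated prefix of the log."""
    return log[:high_watermark(replica_record_counts)]


leader_log = ["r0", "r1", "r2", "r3", "r4"]

# Assumed state: the leader holds 5 records, the followers have copied 4 and 3.
replicas = {"broker-0 (leader)": 5, "broker-1": 4, "broker-2": 3}

print(high_watermark(replicas))
print(readable(leader_log, replicas))
```

Records `r3` and `r4` exist on the leader but stay invisible to consumers until the slowest follower has copied them.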
  • 63. Consumer to Partition Cardinality ❖ Only a single Consumer from the same Consumer Group can access a single Partition ❖ If the number of Consumers in a Consumer Group exceeds the Partition count: ❖ Extra Consumers remain idle; can be used for failover ❖ If there are more Partitions than Consumers in the Consumer Group: ❖ Some Consumers will read from more than one partition
  • 64. 2 server Kafka cluster hosting 6 partitions (P0-P5) (diagram) ❖ Server 1: P2, P3, P4 ❖ Server 2: P0, P1, P5 ❖ Consumer Group A: C0, C1, C3 ❖ Consumer Group B: C0, C1, C3
  • 65. Multi-threaded Consumers ❖ You can run more than one Consumer in a JVM process ❖ If processing records takes a while, a single Consumer can run multiple threads to process records ❖ Harder to manage offset for each Thread/Task ❖ One Consumer runs multiple threads ❖ 2 messages on the same partition can be processed by two different threads ❖ Hard to guarantee order without thread coordination ❖ PREFER: Run multiple Consumers, each processing record batches in its own thread ❖ Easier to manage offset ❖ Each Consumer runs in its own thread ❖ Easier to manage failover (each process runs X num of Consumer threads)
  • 66. Consumer Review ❖ What is a consumer group? ❖ Does each consumer have its own offset? ❖ When can a consumer see a record? ❖ What happens if there are more consumers than partitions? ❖ What happens if you run multiple consumers in many threads in the same JVM?
  • 67. Using Kafka Single Node ❖ Run ZooKeeper, Kafka ❖ Create a topic ❖ Send messages from command line ❖ Read messages from command line ❖ Tutorial: Using Kafka Single Node
  • 68. Run Kafka ❖ Run ZooKeeper start up script ❖ Run Kafka Server/Broker start up script ❖ Create Kafka Topic from command line ❖ Run producer from command line ❖ Run consumer from command line
  • 69. Run ZooKeeper
  • 70. Run Kafka Server
  • 71. Create Kafka Topic
  • 72. List Topics
  • 73. Run Kafka Producer
  • 74. Run Kafka Consumer
  • 75. Running Kafka Producer and Consumer
  • 76. Kafka Single Node Review ❖ What server do you run first? ❖ What tool do you use to create a topic? ❖ What tool do you use to see topics? ❖ What tool did we use to send messages on the command line? ❖ What tool did we use to view messages in a topic? ❖ Why were the messages coming out of order? ❖ How could we get the messages to come in order from the consumer?
  • 77. Lab: Use Kafka to send and receive messages ❖ Use single server version of Kafka ❖ Set up a single node with a single ZooKeeper ❖ Create a topic ❖ Produce and consume messages from the command line
  • 78. Using Kafka Cluster and Failover ❖ Demonstrate Kafka Cluster ❖ Create topic with replication ❖ Show consumer failover ❖ Show broker failover ❖ Kafka Tutorial: Cluster and Failover
  • 79. Objectives ❖ Run many Kafka Brokers ❖ Create a replicated topic ❖ Demonstrate Pub / Sub ❖ Demonstrate load balancing consumers ❖ Demonstrate consumer failover ❖ Demonstrate broker failover
  • 80. Running many nodes ❖ If not already running, start up ZooKeeper ❖ Shut down Kafka from the first lab ❖ Copy server properties for three brokers ❖ Modify the properties files: change port, change Kafka log location ❖ Start up many Kafka server instances ❖ Create a replicated topic ❖ Use the replicated topic
  • 81. Create three new server-n.properties files ❖ Copy existing server.properties to server-0.properties, server-1.properties, server-2.properties ❖ Change server-0.properties to use log.dirs "./logs/kafka-logs-0" ❖ Change server-1.properties to use port 9093, broker id 1, and log.dirs "./logs/kafka-logs-1" ❖ Change server-2.properties to use port 9094, broker id 2, and log.dirs "./logs/kafka-logs-2"
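The three sets of overrides follow one pattern, which the Python sketch below generates. It assumes broker 0 keeps broker id 0 and port 9092 (the slide only states the values for brokers 1 and 2), and it uses the `port` key because that is what the slide names; newer Kafka releases configure the port through `listeners` instead. The helper names are hypothetical, and the snippet only builds the text rather than writing files.

```python
def broker_overrides(broker_id, base_port=9092):
    """Settings that must differ per broker (base_port is an assumption)."""
    return {
        "broker.id": str(broker_id),
        "port": str(base_port + broker_id),
        "log.dirs": f"./logs/kafka-logs-{broker_id}",
    }


def render(overrides):
    """Format the overrides as key=value lines, properties-file style."""
    return "\n".join(f"{key}={value}" for key, value in overrides.items())


files = {
    f"server-{n}.properties": render(broker_overrides(n))
    for n in range(3)
}

for name, content in files.items():
    print(f"# {name}")
    print(content)
```

These lines would replace the matching entries in each copied file; every other setting stays as it was in the original `server.properties`.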
  • 82. Modify server-x.properties ❖ Each has a different broker.id ❖ Each has a different log.dirs ❖ Each has a different port
  • 83. Create Startup scripts for three Kafka servers ❖ Passing properties files from last step
  • 84. Run Servers
  • 85. Create Kafka replicated topic my-failsafe-topic ❖ Replication Factor is set to 3 ❖ Topic name is my-failsafe-topic ❖ Partitions is 13
  • 86. Start Kafka Consumer ❖ Pass list of Kafka servers to bootstrap-server ❖ We pass two of the three ❖ Only one needed, it learns about the rest
  • 87. Start Kafka Producer ❖ Start producer ❖ Pass list of Kafka Brokers
  • 88. Kafka 1 consumer and 1 producer running
  • 89. Start a second and third consumer ❖ Acts like pub/sub ❖ Each consumer in its own group ❖ Message goes to each ❖ How do we load share?
  • 90. Running consumers in same group ❖ Modify start consumer script ❖ Add the consumers to a group called mygroup ❖ Now they will share load
  • 91. Start up three consumers again ❖ Start up producer and three consumers ❖ Send 7 messages ❖ Notice how messages are spread among 3 consumers
  • 92. Consumer Failover ❖ Kill one consumer ❖ Send seven more messages ❖ Load is spread to remaining consumers ❖ Failover WORKS!
  • 93. Create Kafka Describe Topic ❖ --describe will list partitions, ISRs, and partition leadership
  • 94. Use Describe Topics ❖ Lists which broker owns (is leader of) which partition ❖ Lists Replicas and ISR (replicas that are up to date) ❖ Notice there are 13 partitions
  • 95. Test Broker Failover: Kill 1st server ❖ Kill the first server ❖ Use Kafka topic describe to see that a new leader was elected!
  • 96. Show Broker Failover Worked ❖ Send two more messages from the producer ❖ Notice that the consumer gets the messages ❖ Broker Failover WORKS!
  • 97. Kafka Cluster Review ❖ Why did the three consumers not load share the messages at first? ❖ How did we demonstrate failover for consumers? ❖ How did we demonstrate failover for brokers? ❖ What tool and option did we use to show ownership of partitions and the ISRs?
  • 98. Lab 2: Use Kafka to send and receive messages ❖ Use Kafka multiple nodes ❖ Use a Kafka Cluster to replicate a Kafka topic log
  • 99. Kafka Ecosystem ❖ Kafka Connect ❖ Kafka Streaming ❖ Kafka Schema Registry ❖ Kafka REST
  • 100. Kafka Ecosystem ❖ Kafka Streams ❖ Streams API to transform, aggregate, and process records from a stream and produce derivative streams ❖ Kafka Connect ❖ Connector API for reusable producers and consumers ❖ (e.g., stream of changes from DynamoDB) ❖ Kafka REST Proxy ❖ Producers and Consumers over REST (HTTP) ❖ Schema Registry - Manages schemas using Avro for Kafka Records ❖ Kafka MirrorMaker - Replicates cluster data to another cluster
  • 101. Kafka REST Proxy and Kafka Schema Registry
  • 102. Kafka Ecosystem
  • 103. Kafka Stream Processing ❖ Kafka Streams for Stream Processing ❖ Kafka enables real-time processing of streams ❖ Kafka Streams supports Stream Processors ❖ processing, transformation, aggregation, producing 1 to * output streams ❖ Example: a video player app sends events for videos watched and videos paused ❖ output a new stream of user preferences ❖ can gear new video recommendations based on recent user activity ❖ can aggregate activity of many users to see what new videos are hot ❖ Solves hard problems: out of order records, aggregating/joining across streams, stateful computations, and more
  • 104. Kafka Connectors and Streams (diagram: apps acting as producers, consumers, and stream processors, plus databases attached through connectors, all exchanging records through the Kafka cluster)
  • 105. Kafka Ecosystem review ❖ What is Kafka Streams? ❖ What is Kafka Connect? ❖ What is the Schema Registry? ❖ What is Kafka Mirror Maker? ❖ When might you use Kafka REST Proxy?