101* ways to configure Kafka - badly
Audun Fauchald Strand
Lead Developer Infrastructure
@audunstrand
bio: gof, mq, ejb, mda, wli, bpel, eda, soa, ws*, esb, ddd
Henning Spjelkavik
Architect
@spjelkavik
bio: Skiinfo (Vail Resorts), FINN.no
enjoys reading jstacks
agenda
introduction to kafka
kafka @ finn.no
101* mistakes
questions
“From a certain point onward there is no longer any turning back. That is the point that must be reached.”
― Franz Kafka, The Trial
Top 5
1. no consideration of data on the inside vs outside
2. schema not externally defined
3. same config for every client/topic
4. 128 partitions as default config
5. running on 8 overloaded nodes
FINN.no
2nd largest website in Norway
classified ads (eBay and Zillow in one)
60 million page views a day
80 microservices
130 developers
1000 deploys to production a week
6 minutes from commit to deploy (median)
#kafkasummit @spjelkavik @audunstrand
FINN.no is a part of Schibsted Media Group
6800 people in 30 countries
kafka @ finn.no
architecture
use cases
tools
in the beginning ...
Architecture governance board decided to use RabbitMQ as message queue.
Kafka was installed for a proof of concept, after developers spotted it in January 2013.
2013 - POC
“High” volume
Stream of classified ads
Ad matching
Ad indexed
[Diagram: 8 nodes (mod01-mod08) across dc 1 and dc 2, each running both zk and kafka]
Version 0.8.1
4 partitions
common client
java library
thrift
2014 - Adoption and complaining
low volume / high reliability
Ad Insert
Product Orchestration
Payment
Build Pipeline
click streams
[Diagram: 8 nodes (mod01-mod08) across dc 1 and dc 2, each running both zk and kafka]
Version 0.8.1
4 partitions
experimenting with configuration
common java library
tooling
alerting
2015 - Migration and consolidation
“reliable messaging”
asynchronous communication between services
store and forward
zipkin
slack notifications
Version 0.8.2
5-20 partitions
multiple configurations
[Diagram: 5 nodes (broker01-broker05) across dc 1 and dc 2, each running both zk and kafka]
tooling
Grafana dashboard visualizing jmx stats
kafka-manager
kafka-cat
2016 - Confluent platform
[Diagram: 5 kafka brokers (broker01-broker05) and 5 separate zk nodes (zk01-zk05)]
schema registry
data replication
kafka connect
kafka streams
101* mistakes
“God gives the nuts, but he does not crack them.”
― Franz Kafka
Pattern Language
why is it a mistake
what is the consequence
what is the correct solution
what has finn.no done
Top 5
1. no consideration of data on the inside vs outside
2. schema not externally defined
3. same config for every client/topic
4. 128 partitions as default config
5. running on 8 overloaded nodes
mistake:
no consideration of data on the inside vs outside
https://flic.kr/p/6MjhUR
why is it a mistake
everything published on Kafka (0.8.2) is visible to any client that can access the cluster
what is the consequence
direct reads across services/domains are quite normal in legacy and/or enterprise systems
coupling makes it hard to make changes
unknown and unwanted coupling has a cost
Kafka had no security per topic - you must add that yourself
what is the correct solution
Consider what is data on the inside, versus data on the outside
Convention for what is private data and what is public data
If you want to change your internal representation often, map it before publishing it publicly (anti-corruption layer)
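The mapping step can be sketched as below. This is a minimal illustration only: the `InternalAd` record, its fields, and the public field names are all hypothetical and are not FINN.no's actual model.

```python
from dataclasses import dataclass


@dataclass
class InternalAd:
    # Hypothetical internal representation; free to change at any time.
    id: int
    heading: str
    price_in_ore: int        # internal unit: 1/100 NOK
    moderation_score: float  # internal-only, must never be published


def to_public_event(ad: InternalAd) -> dict:
    """Map the internal record to a stable public contract (assumed shape)."""
    return {
        "version": 1,                         # version travels with the data
        "ad_id": ad.id,
        "title": ad.heading,
        "price_nok": ad.price_in_ore / 100,   # public unit differs from internal
    }
```

Only the mapper has to change when the internal record changes, so consumers of the public topic keep seeing the same fields.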
what has finn.no done
Decided on a naming convention (e.g. Public.xyzzy) for public topics
Communicates the intention (contract)
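A naming convention like this is easy to check in tooling. The sketch below assumes the `Public.` prefix from the example above; the helper names and the ownership rule are hypothetical, and nothing here is enforced by Kafka 0.8.2 itself.

```python
PUBLIC_PREFIX = "Public."  # prefix taken from the example name Public.xyzzy


def is_public_topic(topic: str) -> bool:
    """A topic is public when it carries the prefix and a non-empty name."""
    return topic.startswith(PUBLIC_PREFIX) and len(topic) > len(PUBLIC_PREFIX)


def may_consume(topic: str, consumer_service: str, owner_service: str) -> bool:
    """Assumed rule: private topics are read only by the service that owns them."""
    return is_public_topic(topic) or consumer_service == owner_service
```

A check like `may_consume("xyzzy", "search", "ads")` returning False could be used in a code review bot or a client wrapper to flag unwanted coupling early.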
mistake:
schema not externally defined
why is it a mistake
data and code need separate versioning strategies
version should be part of the data
defining schema in a java library makes it more difficult to access data from non-JVM languages
very little discoverability of data, so people chose other means to get their data
difficult to create tools
what is the consequence
development speed outside jvm has been slow
change of data needs coordinated deployment
no process for data versioning, like backwards compatibility checks
difficult to create tooling that needs to know data format, like data lake and database sinks
what is the correct solution
confluent.io platform has a separate schema registry
apache avro
multiple compatibility settings and evolution strategies
kafka connect
Take complexity out of the applications
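To make "backwards compatibility checks" concrete, here is a deliberately simplified sketch. It is not Avro's full schema resolution and not the schema registry's API: schemas are plain dicts (an assumption for illustration), and any type change is rejected even though Avro permits some promotions.

```python
def backward_compatible(old: dict, new: dict) -> list:
    """Return a list of problems; an empty list means a reader using
    `new` can still read data written with `old` (simplified rules)."""
    problems = []
    for name, spec in new.items():
        if name not in old:
            # A field the writer never wrote needs a default on the reader side.
            if "default" not in spec:
                problems.append(f"new field '{name}' has no default")
        elif spec["type"] != old[name]["type"]:
            problems.append(f"field '{name}' changed type")
    # Fields removed in `new` are fine: the reader simply ignores them.
    return problems
```

Running a check like this before a schema is registered is what removes the need for coordinated deployments of producer and consumer.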
what has finn.no done
still using java library, with schemas in builders
confluent platform 2.0 is planned for the next step, not (just) kafka 0.9
mistake:
running mixed load with a single, default configuration
https://flic.kr/p/qbarDR
why is it a mistake
Historically - One Big Database with Expensive License
Database world - OLTP and OLAP
Changed with Open Source software and Cloud
Tried to simplify the developer's day with a single config
Kafka supports both very high throughput and high reliability
what is the consequence
Trade off between throughput and degree of reliability
With a single configuration - the last commit wins
Either high throughput, and risk of loss - or potentially too slow
what is the correct solution
Understand your use cases and their needs!
Use proper per-topic configuration
Consider splitting / isolation
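One way to avoid "the last commit wins" is to keep named profiles and pick one per topic. The sketch below uses real producer property names from the Java producer, but the values, the profile names, and the topic names are assumptions for illustration, not recommendations.

```python
# Two illustrative producer profiles; the values are assumptions.
PROFILES = {
    "reliable": {
        "acks": "all",
        "retries": 2147483647,
        "max.in.flight.requests.per.connection": 1,
    },
    "throughput": {
        "acks": "1",
        "linger.ms": 50,
        "compression.type": "snappy",
    },
}

# Hypothetical topic names mapped to a profile.
TOPIC_PROFILE = {"Public.payment": "reliable", "Public.clickstream": "throughput"}


def producer_config(topic: str, bootstrap: str = "localhost:9092") -> dict:
    """Build the producer settings for one topic; unknown topics get the safe profile."""
    profile = TOPIC_PROFILE.get(topic, "reliable")
    return {"bootstrap.servers": bootstrap, **PROFILES[profile]}
```

Defaulting to the reliable profile means a team has to opt in to possible message loss instead of discovering it later.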
what has finn.no done
Defaults that are quite reliable
Exposing configuration variables in the client
Ask the questions:
● at-least-once delivery
● ordering - if you partition, what must have strict ordering
● 99% delivery - is that good enough?
● what level of throughput is needed
Configuration
Configuration for production
● Partitions
● Replicas (default.replication.factor)
● Minimum ISR (min.insync.replicas)
● Wait for acknowledge when producing messages (request.required.acks, block.on.buffer.full)
● Retries
● Leader election
Configuration for consumer
● Number of threads
● When to commit (auto.commit.enable vs consumer.commitOffsets)
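Why the commit point matters can be shown with a small simulation. This is a toy model with no Kafka client involved; `run_consumer` and its parameters are invented for illustration. A timer-based auto commit can behave like the commit-first case when offsets are committed before processing has finished.

```python
def run_consumer(messages, commit_after_processing, crash_at):
    """Simulate one consumer run that crashes while handling index `crash_at`.

    Returns (processed, committed), where `committed` is the offset the
    next consumer instance would resume from.
    """
    processed, committed = [], 0
    for offset, msg in enumerate(messages):
        if not commit_after_processing:
            committed = offset + 1       # commit first: at-most-once
        if offset == crash_at:
            return processed, committed  # crash before the work is done
        processed.append(msg)
        if commit_after_processing:
            committed = offset + 1       # commit last: at-least-once
    return processed, committed
```

With a crash at index 1, the commit-last run resumes at offset 1 and reprocesses the message, while the commit-first run resumes at offset 2 and the message is lost.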
Gwen Shapira recommends...
● acks = all
● block.on.buffer.full = true
● retries = MAX_INT
● max.in.flight.requests.per.connection = 1
● Producer.close()
● replication-factor >= 3
● min.insync.replicas = 2
● unclean.leader.election.enable = false
● enable.auto.commit = false (auto.commit.enable in the old consumer)
● commit after processing
● monitor!
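A list like this can be turned into a check that runs against each client's settings. The sketch below covers only the producer items and spells the property names as in the Kafka producer documentation; the `deviations` helper is hypothetical. Note that `block.on.buffer.full` was deprecated in later releases in favor of `max.block.ms`.

```python
# Producer settings from the list above, as strings for easy comparison.
RECOMMENDED_PRODUCER = {
    "acks": "all",
    "block.on.buffer.full": "true",
    "retries": str(2**31 - 1),  # MAX_INT
    "max.in.flight.requests.per.connection": "1",
}


def deviations(config: dict, recommended: dict = RECOMMENDED_PRODUCER) -> dict:
    """Return {key: (actual, expected)} for every setting that differs or is missing."""
    return {
        key: (config.get(key), expected)
        for key, expected in recommended.items()
        if str(config.get(key)).lower() != expected
    }
```

An empty result means the configuration matches; anything else is a deliberate trade-off that should be documented for that topic.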
mistake:
default configuration of 128 partitions for each topic
https://flic.kr/p/6KxPgZ
why is it a mistake
partitions are Kafka's way of scaling consumers; 128 partitions can serve up to 128 consumer processes in one consumer group
in 0.8, the number of partitions could not be reduced without deleting data
highest number of consumers today is 20
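The relation between partitions and consumers can be sketched with a simple round-robin spread. Kafka's real assignment strategies differ in detail; the functions below are a toy model that only shows that each partition has one owner within a group and that extra consumers sit idle.

```python
def assign(partitions: int, consumers: int) -> list:
    """Spread partition ids over consumers round-robin; each partition gets exactly one owner."""
    owned = [[] for _ in range(consumers)]
    for p in range(partitions):
        owned[p % consumers].append(p)
    return owned


def idle_consumers(partitions: int, consumers: int) -> int:
    """Count consumers that end up with no partition at all."""
    return sum(1 for parts in assign(partitions, consumers) if not parts)
```

With 20 consumers, 128 partitions give each consumer 6 or 7 partitions, so far fewer partitions would still keep every consumer busy.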
what is the consequence
our 0.8 cluster was configured with 128 partitions as default, for all topics.
many partitions and many topics create many datapoints that must be coordinated
zookeeper must coordinate all this
rebalance must balance all clients on all partitions
zookeeper and kafka went down (may 2015)
Users could not create ads for two days
what is the correct solution
small number of partitions as default
increase number of partitions for selected topics
understand your use case (throughput target)
reduce length of transactions on consumer side
Max partitions on a broker => 1500 advised in our case - we had 38k
http://www.confluent.io/blog/how-to-choose-the-number-of-topicspartitions-in-a-kafka-cluster/
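The linked blog post sizes a topic from its throughput target t and the measured per-partition producer and consumer throughput p and c, as at least max(t/p, t/c) partitions. A small sketch of that rule, plus the per-broker load that a cluster-wide default produces, follows; the function names and all numbers used with them are made up for illustration.

```python
import math


def partitions_needed(target_mb_s: float, producer_mb_s: float, consumer_mb_s: float) -> int:
    """max(t/p, t/c), rounded up, with at least one partition."""
    return max(1, math.ceil(max(target_mb_s / producer_mb_s, target_mb_s / consumer_mb_s)))


def replicas_per_broker(topics: int, partitions: int, replication_factor: int, brokers: int) -> float:
    """Average number of partition replicas per broker, assuming an even spread."""
    return topics * partitions * replication_factor / brokers
```

For example, a target of 100 MB/s with 10 MB/s per partition on the producer side and 20 MB/s on the consumer side needs 10 partitions, and a hypothetical 100 topics at 128 partitions with replication factor 3 on 8 brokers averages 4800 replicas per broker.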
what has finn.no done
5 partitions as default
2 heavy-traffic topics have more than 5 partitions
mistake:
deploy a proof of concept hack in production; i.e. why we had 8 zk nodes
https://flic.kr/p/6eoSgT
why is it a mistake
Kafka was set up by Ops for a proof of concept - not for hardened production use
By coincidence we had 8 nodes for kafka, the same 8 nodes for zookeeper
Zookeeper depends on a majority quorum and on low latency between nodes
The 8 nodes were NOT dedicated - in fact - they were overloaded already
what is the consequence
Zookeeper recommends 3 nodes for normal usage, 5 for high, and any more is questionable
More nodes lead to a longer time to reach consensus and more communication
If we get a split between data centers, there will be 4 in each, and neither side has the majority (5 of 8) needed for a quorum
You should not run Zk between data centers, due to latency and outage possibilities
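The quorum arithmetic behind these points is short enough to write down. The helper names below are invented for illustration; the only rule used is that a quorum is a strict majority of the ensemble.

```python
def quorum(ensemble: int) -> int:
    """Smallest strict majority of the ensemble."""
    return ensemble // 2 + 1


def tolerated_failures(ensemble: int) -> int:
    """How many nodes can be lost while a majority remains."""
    return ensemble - quorum(ensemble)


def survives_split(side_a: int, side_b: int) -> bool:
    """True if either side of a network split still holds a majority."""
    need = quorum(side_a + side_b)
    return side_a >= need or side_b >= need
```

An ensemble of 8 tolerates 3 failures, the same as an ensemble of 7, and a 4/4 split leaves no side with a quorum, which is why an even node count across two data centers buys nothing.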
what is the correct solution
Have an odd number of Zookeeper nodes - preferably 3, at most 5
Don’t cross data centers
Check the documentation before deploying serious production load
Don’t run a sensitive service (Zookeeper) on a server with 50 jvm-based services, 300% overcommitted on RAM
Watch GC times
what has finn.no done
[Diagram: 5 nodes (broker01-broker05) across dc 1 and dc 2, each running both zk and kafka]
Version 0.8.2
5-20 partitions
multiple configurations
“They say ignorance is bliss.... they're wrong”
― Franz Kafka
References / Further reading
Designing Data-Intensive Applications, Martin Kleppmann
Data on the Outside versus Data on the Inside, Pat Helland
I Heart Logs, Jay Kreps
The Confluent Blog, http://confluent.io/
Kafka: The Definitive Guide
https://cwiki.apache.org/confluence/display/KAFKA/Kafka+papers+and+presentations
http://www.finn.no/apply-here
http://www.schibsted.com/en/Career/
“It's only because of their stupidity that they're able to be so sure of themselves.”
― Franz Kafka, The Trial
Audun Fauchald Strand
@audunstrand
Henning Spjelkavik
@spjelkavik
http://www.finn.no/apply-here
http://www.schibsted.com/en/Career/
Q?
Runner up
Using pre-1.0 software
Have control of topic creation
Kafka is storage - treat it like one also ops-wise
Client side rebalancing, misunderstood
Committing on all consumer threads, believing that you only committed on one
