SlideShare a Scribd company logo
Apache Kafka at Rocana
Persistent Machine Data Collection at Scale
© 2015 Rocana, Inc. All Rights Reserved.
Who am I?
2
Platform Engineer
Based in Ottawa
alan@rocana.com
@alanctgardner
© 2015 Rocana, Inc. All Rights Reserved.
Working at Rocana
3
© 2015 Rocana, Inc. All Rights Reserved.
Rocana Ops
4
© 2015 Rocana, Inc. All Rights Reserved.
Kafka Principles
© 2015 Rocana, Inc. All Rights Reserved.
History
6
• Designed at LinkedIn
• Documented in a 2013 blog post by Jay Kreps
• LinkedIn moved from a monolith to multiple data stores and services
https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
© 2015 Rocana, Inc. All Rights Reserved.
Complexity
7
https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
© 2015 Rocana, Inc. All Rights Reserved.
Complexity
8
© 2015 Rocana, Inc. All Rights Reserved.
Centralized Data Bus
9
https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
© 2015 Rocana, Inc. All Rights Reserved.10
Centralized Data Bus
© 2015 Rocana, Inc. All Rights Reserved.
Design Goals
11
• A centralized data bus that:
• Scales horizontally
• Delivers (some) events in order
• Decouples producers and consumers
• Has low latency end-to-end
© 2015 Rocana, Inc. All Rights Reserved.
A Horizontally Scalable Log
12
https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
© 2015 Rocana, Inc. All Rights Reserved.
Asynchronous Consumers
13
© 2015 Rocana, Inc. All Rights Reserved.
Low-Latency, Durable Writes
14
• Kafka writes all events to disk
• Events are stored on disk in the wire protocol
• Zero-copy reads and writes avoid events ever entering user space
• Kafka relies on the page cache for low-latency serving of recent events
© 2015 Rocana, Inc. All Rights Reserved.
Putting it all together
15
© 2015 Rocana, Inc. All Rights Reserved.
Our Experience
© 2015 Rocana, Inc. All Rights Reserved.
17
© 2015 Rocana, Inc. All Rights Reserved.18
Resource Constraints
• Customer machines are doing real
work
• Agent footprint must be small
• Can’t depend on availability of back-
end services
• Batching is crucial
© 2015 Rocana, Inc. All Rights Reserved.19
Independent consumers
• Consumers aren’t coupled to each
other
• Maintenance and upgrades are
simplified
• Horizontal scale per consumer
© 2015 Rocana, Inc. All Rights Reserved.
Vendor Support
20
© 2015 Rocana, Inc. All Rights Reserved.21
© 2015 Rocana, Inc. All Rights Reserved.
22
Shamelessly stolen from https://aphyr.com/
© 2015 Rocana, Inc. All Rights Reserved.
23
{
“syslog_arrival_ts":"1444489076463",
"syslog_conn_dns":"localhost",
“syslog_conn_port":"57788",
“body”: …,
“id”:”KLE5GZF7WB2WSA5…”,
…
}
{
“tailed_file_inode”:”2371810",
“tailed_file_offset”:"384930",
“timestamp":"",
“body”: …,
“id”:”73XXMLRJNHKA76…”,
…
}
Ephemeral Source Durable Source
© 2015 Rocana, Inc. All Rights Reserved.
Durability
24
© 2015 Rocana, Inc. All Rights Reserved.
Unclean Elections
25
• Kafka maintains a set of up-to-date replicas in ZK
• “In-sync replicas” or the ISR
• ISR can dynamically grow or shrink
• by default Kafka will accept writes with a single ISR
• It is possible for the set to shrink to 0 nodes, which either leads to:
• partition unavailability until an in-sync replica returns to life
• OR data loss when an out-of-sync node begins accepting writes
• This is tunable with the “unclean leader election” property
• Defaults to true in 0.8.2
http://blog.empathybox.com/post/62279088548/a-few-notes-on-kafka-and-jepsen
© 2015 Rocana, Inc. All Rights Reserved.26
© 2015 Rocana, Inc. All Rights Reserved.
Schema Versioning
27
http://www.confluent.io/blog/schema-registry-kafka-stream-processing-yes-virginia-you-really-need-one
• Schemas are absolutely necessary
• Have a plan for how to evolve the
schema before v1
• A schema registry is a good
investment
© 2015 Rocana, Inc. All Rights Reserved.
Security
28
• No encryption or authentication in
0.8.x
• stunnel, encryption at the app layer
are possible
• Should be fixed in 0.9.0
© 2015 Rocana, Inc. All Rights Reserved.
Replication
29
• Cross-DC clusters are not
recommended
• Kafka includes MirrorMaker for
replication between two clusters
• Replication is asynchronous
• Offsets aren’t consistent
© 2015 Rocana, Inc. All Rights Reserved.
Operations
30
• Everything is manual:
• Rebalancing partitions
• Rebalancing leaders
• Decomissioning nodes
• Watch for lagging consumers
© 2015 Rocana, Inc. All Rights Reserved.
Sizing
31
• Consider both throughput and
retention time
• Overprovision number of partitions
• Rebalancing is easy, but re-sharding
breaks consistent hashing
http://www.confluent.io/blog/how-to-choose-the-number-of-topicspartitions-in-a-kafka-cluster/
© 2015 Rocana, Inc. All Rights Reserved.
Performance
32
• Jay Kreps ran an on-premises
benchmark
• 18 spindles, 18 cores in 3 boxes
could produce 2.5M events/sec
• Aggressive batching is necessary
• Synchronous ACKs halve throughput
https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines
© 2015 Rocana, Inc. All Rights Reserved.
Performance
33
• Reproduced on AWS with 3 and 5 node clusters
• d2.xlarge nodes have 3 spindles, 4 cores, 30.5GB RAM
• 5 producers on m3.xlarge instances
• 3 nodes accepted 2.6M events/s
• 24 partitions, one replica, one ACK
• dropped to 1.7M with 3x replication and 1 ack
• 5 nodes accepted 3.6M events/s
• 48 partitions, one replica, one ACK
• dropped to 2.16M with 3x replication and 1 ack
© 2015 Rocana, Inc. All Rights Reserved.
Thank You!
34

More Related Content

PPTX
High cardinality time series search: A new level of scale - Data Day Texas 2016
PDF
Rocana Deep Dive OC Big Data Meetup #19 Sept 21st 2016
PPTX
Building a system for machine and event-oriented data - Data Day Seattle 2015
PPTX
Building a system for machine and event-oriented data - Velocity, Santa Clara...
PPTX
from source to solution - building a system for event-oriented data
PDF
Kafka Summit SF 2017 - Riot's Journey to Global Kafka Aggregation
PPTX
Building a system for machine and event-oriented data with Rocana
PPTX
Running Kafka for Maximum Pain
High cardinality time series search: A new level of scale - Data Day Texas 2016
Rocana Deep Dive OC Big Data Meetup #19 Sept 21st 2016
Building a system for machine and event-oriented data - Data Day Seattle 2015
Building a system for machine and event-oriented data - Velocity, Santa Clara...
from source to solution - building a system for event-oriented data
Kafka Summit SF 2017 - Riot's Journey to Global Kafka Aggregation
Building a system for machine and event-oriented data with Rocana
Running Kafka for Maximum Pain

What's hot (20)

PDF
Kafka Summit SF 2017 - Query the Application, Not a Database: “Interactive Qu...
PDF
user Behavior Analysis with Session Windows and Apache Kafka's Streams API
PDF
How Credit Karma Makes Real-Time Decisions For 60 Million Users With Akka Str...
PDF
A Marriage of Lambda and Kappa: Supporting Iterative Development of an Event ...
PDF
Kafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ Uber
PPTX
Reactive Fast Data & the Data Lake with Akka, Kafka, Spark
PPTX
DataEngConf SF16 - High cardinality time series search
PDF
Real-time Data Streaming from Oracle to Apache Kafka
PDF
Kafka Summit SF 2017 - Fast Data in Supply Chain Planning
PDF
APAC Kafka Summit - Best Of
PDF
The State of Stream Processing
PDF
AWS Re-Invent 2017 Netflix Keystone SPaaS - Monal Daxini - Abd320 2017
PDF
How to Enable Industrial Decarbonization with Node-RED and InfluxDB
PDF
It's Time To Stop Using Lambda Architecture | Yaroslav Tkachenko, Shopify
PDF
Kafka Streams Windows: Behind the Curtain
PPTX
Volta: Logging, Metrics, and Monitoring as a Service
PDF
War Stories: DIY Kafka
PDF
Use ksqlDB to migrate core-banking processing from batch to streaming | Mark ...
PPTX
Event Driven Architecture
PDF
Hadoop made fast - Why Virtual Reality Needed Stream Processing to Survive
Kafka Summit SF 2017 - Query the Application, Not a Database: “Interactive Qu...
user Behavior Analysis with Session Windows and Apache Kafka's Streams API
How Credit Karma Makes Real-Time Decisions For 60 Million Users With Akka Str...
A Marriage of Lambda and Kappa: Supporting Iterative Development of an Event ...
Kafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ Uber
Reactive Fast Data & the Data Lake with Akka, Kafka, Spark
DataEngConf SF16 - High cardinality time series search
Real-time Data Streaming from Oracle to Apache Kafka
Kafka Summit SF 2017 - Fast Data in Supply Chain Planning
APAC Kafka Summit - Best Of
The State of Stream Processing
AWS Re-Invent 2017 Netflix Keystone SPaaS - Monal Daxini - Abd320 2017
How to Enable Industrial Decarbonization with Node-RED and InfluxDB
It's Time To Stop Using Lambda Architecture | Yaroslav Tkachenko, Shopify
Kafka Streams Windows: Behind the Curtain
Volta: Logging, Metrics, and Monitoring as a Service
War Stories: DIY Kafka
Use ksqlDB to migrate core-banking processing from batch to streaming | Mark ...
Event Driven Architecture
Hadoop made fast - Why Virtual Reality Needed Stream Processing to Survive
Ad

Viewers also liked (19)

PDF
Performance Metrics and Ontology for Describing Performance Data of Grid Work...
PDF
Ceph Day San Jose - From Zero to Ceph in One Minute
PPTX
Ceph Day San Jose - Ceph at Salesforce
PPTX
Ceph Day San Jose - Enable Fast Big Data Analytics on Ceph with Alluxio
PDF
Ceph Day San Jose - All-Flahs Ceph on NUMA-Balanced Server
PPTX
Ceph Day San Jose - Red Hat Storage Acceleration Utlizing Flash Technology
PDF
Ceph Day San Jose - HA NAS with CephFS
PPTX
Ceph Day San Jose - Ceph in a Post-Cloud World
PPTX
Ceph Day Tokyo - High Performance Layered Architecture
PDF
Ceph Day Tokyo - Bit-Isle's 3 years footprint with Ceph
PPTX
Ceph Day Tokyo - Bring Ceph to Enterprise
PPTX
Ceph Day Tokyo - Delivering cost effective, high performance Ceph cluster
PPTX
Ceph Day Tokyo - Ceph Community Update
PDF
Ceph Day Tokyo -- Ceph on All-Flash Storage
PDF
Ceph Day Tokyo - Ceph on ARM: Scaleable and Efficient
PPTX
Connected Vehicle Data Platform
PDF
Ceph Day San Jose - Object Storage for Big Data
PPTX
Double Your Hadoop Hardware Performance with SmartSense
Performance Metrics and Ontology for Describing Performance Data of Grid Work...
Ceph Day San Jose - From Zero to Ceph in One Minute
Ceph Day San Jose - Ceph at Salesforce
Ceph Day San Jose - Enable Fast Big Data Analytics on Ceph with Alluxio
Ceph Day San Jose - All-Flahs Ceph on NUMA-Balanced Server
Ceph Day San Jose - Red Hat Storage Acceleration Utlizing Flash Technology
Ceph Day San Jose - HA NAS with CephFS
Ceph Day San Jose - Ceph in a Post-Cloud World
Ceph Day Tokyo - High Performance Layered Architecture
Ceph Day Tokyo - Bit-Isle's 3 years footprint with Ceph
Ceph Day Tokyo - Bring Ceph to Enterprise
Ceph Day Tokyo - Delivering cost effective, high performance Ceph cluster
Ceph Day Tokyo - Ceph Community Update
Ceph Day Tokyo -- Ceph on All-Flash Storage
Ceph Day Tokyo - Ceph on ARM: Scaleable and Efficient
Connected Vehicle Data Platform
Ceph Day San Jose - Object Storage for Big Data
Double Your Hadoop Hardware Performance with SmartSense
Ad

Similar to DataEngConf: Apache Kafka at Rocana: a scalable, distributed log for machine data (20)

PDF
Kafka in action - Tech Talk - Paytm
PDF
Capital One Delivers Risk Insights in Real Time with Stream Processing
PDF
Streaming Processing with a Distributed Commit Log
PDF
A la rencontre de Kafka, le log distribué par Florian GARCIA
PDF
Introduction to apache kafka
PPTX
Microservices interaction at scale using Apache Kafka
PDF
Building a system for machine and event-oriented data - SF HUG Nov 2015
PDF
Insta clustr seattle kafka meetup presentation bb
PPTX
Apache Kafka
PDF
Kafka syed academy_v1_introduction
PDF
PyCon HK 2018 - Heterogeneous job processing with Apache Kafka
PPTX
Streaming in Practice - Putting Apache Kafka in Production
PPTX
Fundamentals and Architecture of Apache Kafka
PDF
Building High-Throughput, Low-Latency Pipelines in Kafka
PDF
Introduction to Apache Kafka
PDF
Kafka Summit SF 2017 - Running Kafka for Maximum Pain
PDF
EVCache: Lowering Costs for a Low Latency Cache with RocksDB
PPTX
Distributed messaging through Kafka
PDF
Apache Kafka - Martin Podval
PPTX
Building Data Streaming Platforms using OpenShift and Kafka
Kafka in action - Tech Talk - Paytm
Capital One Delivers Risk Insights in Real Time with Stream Processing
Streaming Processing with a Distributed Commit Log
A la rencontre de Kafka, le log distribué par Florian GARCIA
Introduction to apache kafka
Microservices interaction at scale using Apache Kafka
Building a system for machine and event-oriented data - SF HUG Nov 2015
Insta clustr seattle kafka meetup presentation bb
Apache Kafka
Kafka syed academy_v1_introduction
PyCon HK 2018 - Heterogeneous job processing with Apache Kafka
Streaming in Practice - Putting Apache Kafka in Production
Fundamentals and Architecture of Apache Kafka
Building High-Throughput, Low-Latency Pipelines in Kafka
Introduction to Apache Kafka
Kafka Summit SF 2017 - Running Kafka for Maximum Pain
EVCache: Lowering Costs for a Low Latency Cache with RocksDB
Distributed messaging through Kafka
Apache Kafka - Martin Podval
Building Data Streaming Platforms using OpenShift and Kafka

More from Hakka Labs (20)

PDF
Always Valid Inference (Ramesh Johari, Stanford)
PDF
DataEngConf SF16 - Data Asserts: Defensive Data Science
PDF
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
PDF
DataEngConf SF16 - Recommendations at Instacart
PDF
DataEngConf SF16 - Running simulations at scale
PDF
DataEngConf SF16 - Deriving Meaning from Wearable Sensor Data
PDF
DataEngConf SF16 - Collecting and Moving Data at Scale
PDF
DataEngConf SF16 - BYOMQ: Why We [re]Built IronMQ
PDF
DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambd...
PDF
DataEngConf SF16 - Three lessons learned from building a production machine l...
PDF
DataEngConf SF16 - Scalable and Reliable Logging at Pinterest
PDF
DataEngConf SF16 - Bridging the gap between data science and data engineering
PDF
DataEngConf SF16 - Multi-temporal Data Structures
PDF
DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark
PDF
DataEngConf SF16 - Beginning with Ourselves
PDF
DataEngConf SF16 - Routing Billions of Analytics Events with High Deliverability
PDF
DataEngConf SF16 - Tales from the other side - What a hiring manager wish you...
PDF
DataEngConf SF16 - Methods for Content Relevance at LinkedIn
PDF
DataEngConf SF16 - Spark SQL Workshop
PDF
DataEngConf: Building a Music Recommender System from Scratch with Spotify Da...
Always Valid Inference (Ramesh Johari, Stanford)
DataEngConf SF16 - Data Asserts: Defensive Data Science
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DataEngConf SF16 - Recommendations at Instacart
DataEngConf SF16 - Running simulations at scale
DataEngConf SF16 - Deriving Meaning from Wearable Sensor Data
DataEngConf SF16 - Collecting and Moving Data at Scale
DataEngConf SF16 - BYOMQ: Why We [re]Built IronMQ
DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambd...
DataEngConf SF16 - Three lessons learned from building a production machine l...
DataEngConf SF16 - Scalable and Reliable Logging at Pinterest
DataEngConf SF16 - Bridging the gap between data science and data engineering
DataEngConf SF16 - Multi-temporal Data Structures
DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark
DataEngConf SF16 - Beginning with Ourselves
DataEngConf SF16 - Routing Billions of Analytics Events with High Deliverability
DataEngConf SF16 - Tales from the other side - What a hiring manager wish you...
DataEngConf SF16 - Methods for Content Relevance at LinkedIn
DataEngConf SF16 - Spark SQL Workshop
DataEngConf: Building a Music Recommender System from Scratch with Spotify Da...

Recently uploaded (20)

PDF
Modernizing your data center with Dell and AMD
PPTX
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
PPTX
Big Data Technologies - Introduction.pptx
PDF
Unlocking AI with Model Context Protocol (MCP)
PDF
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
PDF
Mobile App Security Testing_ A Comprehensive Guide.pdf
PDF
NewMind AI Monthly Chronicles - July 2025
PDF
Agricultural_Statistics_at_a_Glance_2022_0.pdf
PDF
Spectral efficient network and resource selection model in 5G networks
PDF
Dropbox Q2 2025 Financial Results & Investor Presentation
PDF
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
PPTX
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
PDF
Encapsulation_ Review paper, used for researhc scholars
PDF
Empathic Computing: Creating Shared Understanding
PPTX
MYSQL Presentation for SQL database connectivity
PDF
Per capita expenditure prediction using model stacking based on satellite ima...
PDF
How UI/UX Design Impacts User Retention in Mobile Apps.pdf
PDF
Shreyas Phanse Resume: Experienced Backend Engineer | Java • Spring Boot • Ka...
PDF
Machine learning based COVID-19 study performance prediction
PDF
Network Security Unit 5.pdf for BCA BBA.
Modernizing your data center with Dell and AMD
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
Big Data Technologies - Introduction.pptx
Unlocking AI with Model Context Protocol (MCP)
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
Mobile App Security Testing_ A Comprehensive Guide.pdf
NewMind AI Monthly Chronicles - July 2025
Agricultural_Statistics_at_a_Glance_2022_0.pdf
Spectral efficient network and resource selection model in 5G networks
Dropbox Q2 2025 Financial Results & Investor Presentation
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
Encapsulation_ Review paper, used for researhc scholars
Empathic Computing: Creating Shared Understanding
MYSQL Presentation for SQL database connectivity
Per capita expenditure prediction using model stacking based on satellite ima...
How UI/UX Design Impacts User Retention in Mobile Apps.pdf
Shreyas Phanse Resume: Experienced Backend Engineer | Java • Spring Boot • Ka...
Machine learning based COVID-19 study performance prediction
Network Security Unit 5.pdf for BCA BBA.

DataEngConf: Apache Kafka at Rocana: a scalable, distributed log for machine data

  • 1. Apache Kafka at Rocana Persistent Machine Data Collection at Scale
  • 2. © 2015 Rocana, Inc. All Rights Reserved. Who am I? 2 Platform Engineer Based in Ottawa alan@rocana.com @alanctgardner
  • 3. © 2015 Rocana, Inc. All Rights Reserved. Working at Rocana 3
  • 4. © 2015 Rocana, Inc. All Rights Reserved. Rocana Ops 4
  • 5. © 2015 Rocana, Inc. All Rights Reserved. Kafka Principles
  • 6. © 2015 Rocana, Inc. All Rights Reserved. History 6 • Designed at LinkedIn • Documented in a 2013 blog post by Jay Kreps • LinkedIn moved from a monolith to multiple data stores and services https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
  • 7. © 2015 Rocana, Inc. All Rights Reserved. Complexity 7 https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
  • 8. © 2015 Rocana, Inc. All Rights Reserved. Complexity 8
  • 9. © 2015 Rocana, Inc. All Rights Reserved. Centralized Data Bus 9 https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
  • 10. © 2015 Rocana, Inc. All Rights Reserved.10 Centralized Data Bus
  • 11. © 2015 Rocana, Inc. All Rights Reserved. Design Goals 11 • A centralized data bus that: • Scales horizontally • Delivers (some) events in order • Decouples producers and consumers • Has low latency end-to-end
  • 12. © 2015 Rocana, Inc. All Rights Reserved. A Horizontally Scalable Log 12 https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
  • 13. © 2015 Rocana, Inc. All Rights Reserved. Asynchronous Consumers 13
  • 14. © 2015 Rocana, Inc. All Rights Reserved. Low-Latency, Durable Writes 14 • Kafka writes all events to disk • Events are stored on disk in the wire protocol • Zero-copy reads and writes avoid events ever entering user space • Kafka relies on the page cache for low-latency serving of recent events
  • 15. © 2015 Rocana, Inc. All Rights Reserved. Putting it all together 15
  • 16. © 2015 Rocana, Inc. All Rights Reserved. Our Experience
  • 17. © 2015 Rocana, Inc. All Rights Reserved. 17
  • 18. © 2015 Rocana, Inc. All Rights Reserved.18 Resource Constraints • Customer machines are doing real work • Agent footprint must be small • Can’t depend on availability of back- end services • Batching is crucial
  • 19. © 2015 Rocana, Inc. All Rights Reserved.19 Independent consumers • Consumers aren’t coupled to each other • Maintenance and upgrades are simplified • Horizontal scale per consumer
  • 20. © 2015 Rocana, Inc. All Rights Reserved. Vendor Support 20
  • 21. © 2015 Rocana, Inc. All Rights Reserved.21
  • 22. © 2015 Rocana, Inc. All Rights Reserved. 22 Shamelessly stolen from https://aphyr.com/
  • 23. © 2015 Rocana, Inc. All Rights Reserved. 23 { “syslog_arrival_ts":"1444489076463", "syslog_conn_dns":"localhost", “syslog_conn_port":"57788", “body”: …, “id”:”KLE5GZF7WB2WSA5…”, … } { “tailed_file_inode”:”2371810", “tailed_file_offset”:"384930", “timestamp":"", “body”: …, “id”:”73XXMLRJNHKA76…”, … } Ephemeral Source Durable Source
  • 24. © 2015 Rocana, Inc. All Rights Reserved. Durability 24
  • 25. © 2015 Rocana, Inc. All Rights Reserved. Unclean Elections 25 • Kafka maintains a set of up-to-date replicas in ZK • “In-sync replicas” or the ISR • ISR can dynamically grow or shrink • by default Kafka will accept writes with a single ISR • It is possible for the set to shrink to 0 nodes, which either leads to: • partition unavailability until an in-sync replica returns to life • OR data loss when an out-of-sync node begins accepting writes • This is tunable with the “unclean leader election” property • Defaults to true in 0.8.2 http://blog.empathybox.com/post/62279088548/a-few-notes-on-kafka-and-jepsen
  • 26. © 2015 Rocana, Inc. All Rights Reserved.26
  • 27. © 2015 Rocana, Inc. All Rights Reserved. Schema Versioning 27 http://www.confluent.io/blog/schema-registry-kafka-stream-processing-yes-virginia-you-really-need-one • Schemas are absolutely necessary • Have a plan for how to evolve the schema before v1 • A schema registry is a good investment
  • 28. © 2015 Rocana, Inc. All Rights Reserved. Security 28 • No encryption or authentication in 0.8.x • stunnel, encryption at the app layer are possible • Should be fixed in 0.9.0
  • 29. © 2015 Rocana, Inc. All Rights Reserved. Replication 29 • Cross-DC clusters are not recommended • Kafka includes MirrorMaker for replication between two clusters • Replication is asynchronous • Offsets aren’t consistent
  • 30. © 2015 Rocana, Inc. All Rights Reserved. Operations 30 • Everything is manual: • Rebalancing partitions • Rebalancing leaders • Decomissioning nodes • Watch for lagging consumers
  • 31. © 2015 Rocana, Inc. All Rights Reserved. Sizing 31 • Consider both throughput and retention time • Overprovision number of partitions • Rebalancing is easy, but re-sharding breaks consistent hashing http://www.confluent.io/blog/how-to-choose-the-number-of-topicspartitions-in-a-kafka-cluster/
  • 32. © 2015 Rocana, Inc. All Rights Reserved. Performance 32 • Jay Kreps ran an on-premises benchmark • 18 spindles, 18 cores in 3 boxes could produce 2.5M events/sec • Aggressive batching is necessary • Synchronous ACKs halve throughput https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines
  • 33. © 2015 Rocana, Inc. All Rights Reserved. Performance 33 • Reproduced on AWS with 3 and 5 node clusters • d2.xlarge nodes have 3 spindles, 4 cores, 30.5GB RAM • 5 producers on m3.xlarge instances • 3 nodes accepted 2.6M events/s • 24 partitions, one replica, one ACK • dropped to 1.7M with 3x replication and 1 ack • 5 nodes accepted 3.6M events/s • 48 partitions, one replica, one ACK • dropped to 2.16M with 3x replication and 1 ack
  • 34. © 2015 Rocana, Inc. All Rights Reserved. Thank You! 34

Editor's Notes

  • #2: I’m Alan Gardner Here to talk about our use of Apache Kafka at Rocana
  • #3: Platform engineer at Rocana Work on data ingest, storage and processing Distributed open-source systems: Hadoop, Kafka, Solr Systems programming work as well Work remotely from Ottawa, Canada This is my cat.
  • #4: Working at Rocana is great: everybody is remote very smart, very nice people quarterly onsites
  • #5: What is Rocana Ops? a platform for IT operations data designed for 10’s of thousands of servers in multiple data centers distill the entire organization’s IT infrastructure down to a single screen: “what’s wrong?” scalable collection framework - out of the box host data and app logs event data warehouse built on open source technologies and open schemas visualization, anomaly detection and machine learning to provide guided root cause analysis as opposed to a wall of graphs or pile of logs Apache Kafka is the “Enterprise Data Bus” Going to talk about why we chose Kafka in that role
  • #6: To explain why we chose Kafka, I’m going to start with how Kafka works and why it’s designed the way it is.
  • #7: Designed at LinkedIn to handle the explosion of different systems being created Jay Kreps blog post describes Kafka from first principles, including motivation Some of these images are cribbed from that post, where appropriate
  • #8: LinkedIn’s problem: lots of front-end services lots of back-end services hooking them together produces this complex spaghetti of dependencies Front-end has to be highly available and low-latency if you write synchronously, you can only be as fast as your slowest backend service
  • #10: Kafka acts as a central bus for data: every front-end service writes all events into Kafka backend services can take only the events they’re interested in data doesn’t live in Kafka forever Kafka is run as a utility within LinkedIn Solves one goal: centralized data bus, still need horizontal scale, durability
  • #11: This is much better
  • #13: Kafka is fundamentally a collection of logs events are only appended events are always consumed in the same order A single partition is a log: an ordered set of events Every event has an offset Partitions are the units of scale, like shards Log operations are constrained, so we can make them fast Example is sharding on users
  • #14: Consuming and producing are completely decoupled: consumers maintain their own logical offset, representing the last event they consumed different consumers can consume at different rates producers continue to append new events in order events are retained until an expiry time, or max log size Kafka is not a durable long-term store Consumers can go offline for extended time or start from scratch and consume all available events Events are durably written and replicated
  • #15: Kafka writes all data to disk, lots of good tricks: low-latency for recent data from the page cache data on the wire is the same as on disk no GC overhead for the page cache zero-copy ops
  • #16: This is an overview of a typical Kafka system: multiple producers, brokers and consumers each broker has ownership of a set of partitions (it’s the primary) broker lists and partition assignment are stored in ZK consumers are using ZK to store offsets here, but that’s not the only way
  • #18: Let’s revisit the Rocana architecture: thousands of agents writing into Kafka events are distributed across multiple partitions, written durably to disk multiple, separate consumers are decoupled from the producers and each other
  • #19: Resource limits on producer machines: these machines are doing real work that’s important to the business our agent needs to quickly encode events and produce them batching is important to ensure efficiency latency to write to Kafka is still very low
  • #20: Consumers don’t affect each other: each maintain their own offsets one consumer can be taken offline, can be slow, etc. with little impact upgrades are very easy a single consumer can even be rewound (theoretically) consumers can scale horizontally with the number of partitions
  • #21: Kafka has critical mass within the industry: Cloudera, Hortonworks, MapR all support it Confluent has all the designers of Kafka working on a commercial stream processing platform
  • #22: Those are all good things, but there are some sharp edges to watch out for.
  • #23: Kingsbury tire fire slide Exactly once delivery is very hard Not all of our consumers are doing something idempotent You can play back the whole partition to find the last message which was written
  • #24: Overview of a Rocana Event which would be published into Kafka: fixed fields and key-value pairs ID is a hash of an event fields, used for duplicate detection for durable sources we can use offset and inode, get 99% of the way for ephemeral sources we use arrival time + internal fields ID used for three things: assignment to a partition deduplication filter ID in Solr for idempotent inserts
  • #25: Kafka “writes every message to disk” Defaults to fsyncing every 10k messages, or every 3 seconds (at most) ACK happens when a message is written but not fsynced OK, so I’ll replicate data across multiple machines
  • #26: The default in Kafka is to continue making progress in the presence of node failures (AP): - unclean elections allow a replica which has not seen all writes to become the leader when the ISR shrinks to 0 - minimum ISR size is only 1 to accept writes by default - when a previously in-sync replica comes back, those records are lost - it can be disabled, see Jay’s blog for more discussion
  • #27: Some things aren’t hard, but you need to look out for:
  • #28: Data you put in Kafka really needs to have a schema Schemas really need to have an evolution strategy You probably want some notion of a schema registry Gwen’s post is great We use Avro, where the consumer has to know the writer schema tried to mitigate this with nullable fields, no luck
  • #29: There isn’t any. No encryption on disk or in flight, no authentication: you can use stunnel you could encrypt each byte buffer and decrypt on the client side No authentication These will probably both be fixed in 0.9.0 this month
  • #30: MirrorMaker is basically just a consumer/producer which pumps data between clusters: doesn’t preserve offsets, so consumers can’t fail over you can send events between two different sized clusters you can merge streams from two data centres
  • #31: Kafka operations are pretty basic, it comes with a giant `bin` dir full of tools: CLI for rebalancing partitions and leaders leaders and partitions rebalance on node failure adding nodes requires reasignment Decomissioning nodes is a giant pain right now Tool for lagging consumer
  • #32: Factors to consider when sizing a cluster: I/O throughput retention time frame (throughput over time) Partitions limit concurrency of consumers future growth (in terms of setting # of partitions) growing a cluster online is manual but possible in 0.8.2 growing number of partitions breaks consistent hashing! (http://www.confluent.io/blog/how-to-choose-the-number-of-topicspartitions-in-a-kafka-cluster/)
  • #33: Jay Kreps has a blog post about this, using 3 commodity broker boxes with 6 cores, 6 spindles each: it’s a little weird, he only uses 6 partitions so he never exercises all the spindles in the cluster he batches small messages really aggressively (8k batches of 100 byte messages) his is on-premises, he hits 2.5M records/sec producing and consuming requiring 3 acks for every message halved throughput
  • #34: I used a similar methodology on EC2 to get some sizing numbers: - Used 4k batch sizes, results were broadly similar (1k and 2k hurt perf) Over-provisioning partitions by 2x spindles doesn’t give benefit, but doesn’t slow down either Over-provisioning by 2x and adding 3x replication did cause slow down One partition actually hit 700k events/s, there may be coordination issues in the producer Synchronous acks were brutal, 10x performance hit, this is almost definitely due to AWS network latency Each node is ~$500/month At 250MB/sec, we’d only get ~18 hours of retention We’ve seen instances of only 12 hours of retention