instaclustr.com | Twitter: @instaclustr | info@instaclustr.com
Lessons Learned from Building an Apache Kafka Managed Service
Introduction
● Over 20 million node-hours of experience managing Cassandra, Spark and Elassandra
● Our platform provides automated provisioning, monitoring and management
● Available on AWS, GCP, Azure and IBM Cloud
● Managed Apache Kafka preview released 21 May 2018
Agenda
● Context - our offering and development process
● Hardware choice and benchmarking
● Topic and user management
● Broker security configuration
● Monitoring
● Backup and Restore
Instaclustr Managed Kafka - Key Features
● Preview Release available:
○ Open source Apache Kafka and Zookeeper provisioned in AWS, GCP and Azure
○ Broker monitoring
○ Instaclustr monitoring and provisioning API support
○ Private network clusters (AWS only)
○ Run in your cloud provider account or ours
○ Topic management via a custom CLI tool
Instaclustr Managed Kafka - Key Features
● For GA (end June):
○ SOC2 compliant
○ User & credential management
○ Providing more cluster config options
○ Topic level and synthetic transaction monitoring
○ Infrastructure config tuning
Instaclustr Managed Kafka - Development Process
● First customer requests 2016
● Internal infrastructure deployment and usage of Kafka mid 2017
● Managed service platform development commenced November 2017
● Early access program with 4 customers commenced December 2017
● Public preview release 21 May 2018
● GA expected 25 June 2018
Hardware Choice and Benchmarking - GP2 vs ST1
● Disk Type
○ AWS benchmark - r4.large with 500GB disks
■ 1 x 500GB ST1 volume
■ 10 x 50GB GP2 volumes in RAID0 configuration
○ Avg 10% improved throughput with ST1 vs GP2 EBS
○ ST1 is 45% of the cost of GP2
○ Non-RAIDed mount simplifies re-sizing EBS volumes
Type   Writes (msg/s)   Reads (msg/s)   Mixed (msg/s)
ST1    223,851          149,506         W: 171,305 / R: 49,898
GP2    203,409          127,127         W: 162,966 / R: 44,869
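As a quick check on the figures above, the relative gains can be recomputed from the table. This is a minimal sketch: it assumes "m/s" means messages per second, and the price-performance figure simply divides the average throughput ratio by the 45% cost figure quoted on the slide.

```python
# Benchmark figures copied from the table above (assumed to be messages per second).
ST1 = {"writes": 223_851, "reads": 149_506, "mixed_writes": 171_305, "mixed_reads": 49_898}
GP2 = {"writes": 203_409, "reads": 127_127, "mixed_writes": 162_966, "mixed_reads": 44_869}

ST1_RELATIVE_COST = 0.45  # from the slide: "ST1 is 45% of the cost of GP2"


def improvement_pct(new: float, old: float) -> float:
    """Percentage by which `new` exceeds `old`."""
    return (new - old) / old * 100.0


# Per-workload gain of ST1 over GP2, and the plain average across the four columns.
gains = {name: improvement_pct(ST1[name], GP2[name]) for name in ST1}
average_gain = sum(gains.values()) / len(gains)

# Throughput per unit of cost for ST1, relative to GP2 (GP2 = 1.0).
price_performance = (1 + average_gain / 100.0) / ST1_RELATIVE_COST

if __name__ == "__main__":
    for name, pct in gains.items():
        print(f"{name}: ST1 is {pct:.1f}% faster than GP2")
    print(f"average gain: {average_gain:.1f}%")
    print(f"relative throughput per unit cost: {price_performance:.2f}x")
```

The average works out to roughly 11%, in line with the "Avg 10%" bullet.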
[Charts: ST1 and GP2 benchmark results; Provider Comparison]
Hardware Choice and Benchmarking - SSL vs non-SSL
● Encryption enabled on broker-to-broker and client-to-broker
○ AWS benchmark - r4.large with 1500GB ST1 disk
○ 512 byte messages
○ ~30% decrease in throughput with Broker and Client SSL enabled
● Follow-up benchmarks on OpenJDK 8 vs. 9, based on KAFKA-2561
○ 50% increased throughput in writes
○ 80% increased throughput in reads
Hardware Choice and Benchmarking - Number of Topics
● Possible urban myth that increasing topics reduces performance
● However, more topics = more partitions
● More partitions significantly slow recovery time from node failure
[Charts: benchmark results for 10, 100, 1000 and 5000 topics]
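To make the "more topics = more partitions" point concrete, the sketch below estimates how many partition replicas each broker hosts as the topic count grows. Every replica on a failed broker has to fail over and later catch up, which is why recovery slows. The partitions-per-topic, replication factor and broker count are hypothetical values chosen for illustration; the slides do not state the benchmark's settings.

```python
import math


def replicas_per_broker(topics: int, partitions_per_topic: int,
                        replication_factor: int, brokers: int) -> int:
    """Approximate partition replicas hosted per broker, assuming even spread."""
    total_replicas = topics * partitions_per_topic * replication_factor
    return math.ceil(total_replicas / brokers)


# Hypothetical cluster shape, not taken from the benchmark.
PARTITIONS_PER_TOPIC = 3
REPLICATION_FACTOR = 3
BROKERS = 6

if __name__ == "__main__":
    for topics in (10, 100, 1000, 5000):
        n = replicas_per_broker(topics, PARTITIONS_PER_TOPIC, REPLICATION_FACTOR, BROKERS)
        print(f"{topics:>5} topics -> ~{n} replicas to recover per failed broker")
```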
Hardware Choice and Benchmarking - Colocated Zookeeper
● Often recommended to host Zookeeper separately from Kafka
● However, recent changes have significantly reduced load on Zookeeper from Kafka
○ Consumer offsets are no longer stored in Zookeeper
● Our benchmarking showed no measurable difference in performance, at least for smaller clusters
Hardware Choice and Benchmarking - Colocated Zookeeper
[Charts: Consumer Rate - Separate; Consumer Rate - Colocated]
● 6 node cluster with broker restart
○ Similar results with dedicated Zookeeper disk vs. shared
Topic and User Configuration Management
● Kafka utilities require direct access to Zookeeper
● Zookeeper does not have a robust external security model
● Felt that providing access to Zookeeper was a risk
● Solutions
○ Developed command line tool to use Kafka API for topic configuration: https://github.com/instaclustr/ic-kafka-tools
■ Future: Console UI support?
■ We value topic configuration versioning and management
○ Adding user management to Instaclustr Console
■ Additional authentication required
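One way to think about versioned topic configuration is as a diff between the configuration currently applied and the version you want. The sketch below is purely illustrative and is not how ic-kafka-tools works; `diff_topic_config` is a hypothetical helper, and the keys shown are standard Kafka topic-level settings used as sample data.

```python
from typing import Dict, Optional, Tuple

TopicConfig = Dict[str, str]
Change = Tuple[Optional[str], Optional[str]]  # (current value, desired value)


def diff_topic_config(current: TopicConfig, desired: TopicConfig) -> Dict[str, Change]:
    """Return the settings that differ between two versions of a topic config.

    A value of None means the key is absent on that side.
    """
    changes: Dict[str, Change] = {}
    for key in sorted(set(current) | set(desired)):
        old, new = current.get(key), desired.get(key)
        if old != new:
            changes[key] = (old, new)
    return changes


if __name__ == "__main__":
    applied = {"retention.ms": "604800000", "cleanup.policy": "delete"}
    wanted = {"retention.ms": "86400000", "cleanup.policy": "delete",
              "min.insync.replicas": "2"}
    for key, (old, new) in diff_topic_config(applied, wanted).items():
        print(f"{key}: {old} -> {new}")
```

The resulting change set could then be applied through the Kafka admin API rather than by touching Zookeeper directly.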
Broker Security Configuration
● Using SCRAM (Salted Challenge Response Authentication Mechanism) authentication
○ Used for client->broker
○ Broker->broker uses SASL plaintext
● Using SASL plaintext authentication
○ Used for broker->broker
○ We were planning on integrating SCRAM authentication for broker->broker, but its dynamic configuration still requires a broker restart
○ Instead, we are planning on short-lived signed broker keys, as their dynamic configuration does not require a restart
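For readers unfamiliar with SCRAM, the sketch below shows how the credentials a server stores are derived from a password, following the key derivation in RFC 5802 with SHA-256. It is a simplified illustration, not Kafka's or Instaclustr's code: it skips SASLprep normalisation of the password, and the iteration count of 4096 is just a commonly used value.

```python
import hashlib
import hmac
import os
from typing import Tuple


def scram_sha256_keys(password: str, salt: bytes, iterations: int = 4096) -> Tuple[bytes, bytes]:
    """Derive (StoredKey, ServerKey) as defined by RFC 5802, using SHA-256.

    The server keeps only these derived keys plus the salt and iteration
    count, so the plaintext password is never stored.
    """
    # SaltedPassword := Hi(password, salt, i), which is PBKDF2 with HMAC-SHA-256.
    salted = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    client_key = hmac.new(salted, b"Client Key", hashlib.sha256).digest()
    stored_key = hashlib.sha256(client_key).digest()
    server_key = hmac.new(salted, b"Server Key", hashlib.sha256).digest()
    return stored_key, server_key


if __name__ == "__main__":
    salt = os.urandom(16)  # a fresh random salt per user
    stored, server = scram_sha256_keys("example-password", salt)
    print("StoredKey:", stored.hex())
    print("ServerKey:", server.hex())
```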
Broker Security Configuration
● Access to managed clusters
○ Public IPs and whitelisting in firewall (security group or equivalent)
○ Private IPs with VPC Peering (or equivalent in other cloud providers)
○ Private Network Clusters where nodes are not allocated public IPs and a gateway box is used for admin access
○ Don’t expose Zookeeper through firewall due to weak security model
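The access rules above can be sketched as a simple policy check: client traffic is allowed only from whitelisted networks, and Zookeeper is never reachable from outside. This is an illustrative model, not an actual firewall configuration; the ports are the Kafka and Zookeeper defaults (9092 and 2181) and may differ in a real cluster, and the CIDR ranges are documentation addresses.

```python
import ipaddress
from typing import Iterable

KAFKA_PORT = 9092      # default Kafka listener port (assumed)
ZOOKEEPER_PORT = 2181  # default Zookeeper client port (assumed)


def is_allowed(client_ip: str, port: int, whitelist: Iterable[str]) -> bool:
    """Return True if an external connection should pass the firewall."""
    if port == ZOOKEEPER_PORT:
        return False  # Zookeeper is never exposed, regardless of source address
    if port != KAFKA_PORT:
        return False
    addr = ipaddress.ip_address(client_ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in whitelist)


if __name__ == "__main__":
    allowed_networks = ["203.0.113.0/24"]
    print(is_allowed("203.0.113.7", KAFKA_PORT, allowed_networks))      # whitelisted client
    print(is_allowed("198.51.100.9", KAFKA_PORT, allowed_networks))     # unknown client
    print(is_allowed("203.0.113.7", ZOOKEEPER_PORT, allowed_networks))  # Zookeeper blocked
```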
Monitoring
● Metrics exposed via JMX
○ Custom collection agent -> RabbitMQ (planned to migrate to Kafka) -> Riemann -> Cassandra+Spark -> Console, APIs, Grafana
● Exposing broker-level and per-topic metrics
● Alerting
○ Basics: service state, disk usage / free space, server still exists
○ Kafka metrics: offline partitions, active controllers != 1, under-replicated partitions
■ The active controller alert is very sensitive; we are re-assessing alert thresholds
○ Synthetic transactions: publish and consume a message on a controlled topic, measure success and latency
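The Kafka metric alerts listed above can be sketched as a function over per-broker metric samples. The metric names are the standard Kafka JMX names (ActiveControllerCount, OfflinePartitionsCount, UnderReplicatedPartitions); the input shape and the `sustained` helper are assumptions for illustration. Requiring several consecutive breaches is one possible way to tame a sensitive alert, not necessarily the approach Instaclustr adopted.

```python
from typing import Dict, List


def evaluate_alerts(per_broker: Dict[str, Dict[str, int]]) -> List[str]:
    """Evaluate cluster-wide alert conditions from per-broker metric samples."""
    def total(metric: str) -> int:
        return sum(sample.get(metric, 0) for sample in per_broker.values())

    alerts: List[str] = []
    active = total("ActiveControllerCount")
    if active != 1:  # exactly one broker should be the active controller
        alerts.append(f"active controllers = {active} (expected 1)")
    if total("OfflinePartitionsCount") > 0:
        alerts.append("offline partitions detected")
    if total("UnderReplicatedPartitions") > 0:
        alerts.append("under-replicated partitions detected")
    return alerts


def sustained(breaches: List[bool], min_consecutive: int) -> bool:
    """True only if the most recent `min_consecutive` samples all breached."""
    if min_consecutive <= 0 or len(breaches) < min_consecutive:
        return False
    return all(breaches[-min_consecutive:])


if __name__ == "__main__":
    sample = {
        "broker-1": {"ActiveControllerCount": 1, "UnderReplicatedPartitions": 0},
        "broker-2": {"ActiveControllerCount": 0, "UnderReplicatedPartitions": 4},
    }
    print(evaluate_alerts(sample))
```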
Monitoring
● Central Logging
○ Fleet logs transferred via Kafka to an Elassandra cluster
○ 1,700 nodes submit via Journalbeat -> Kafka -> Logstash -> Elassandra
○ Kafka experience in this project has been very positive
● Only issue
○ Auto offset commit failed for group logstash: Commit offsets failed with retriable exception. You should retry committing offsets.
○ We weren’t monitoring consumer lag closely enough
○ Increased consumer session and request timeouts
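Consumer lag, the metric that was not being watched closely enough, is the gap between each partition's log end offset and the consumer group's committed offset. The sketch below computes it from two offset maps; how those maps are obtained from the cluster is left out, and treating a partition with no committed offset as starting from 0 is a simplifying assumption.

```python
from typing import Dict, Tuple

TopicPartition = Tuple[str, int]  # (topic name, partition number)


def consumer_lag(end_offsets: Dict[TopicPartition, int],
                 committed_offsets: Dict[TopicPartition, int]) -> Dict[TopicPartition, int]:
    """Per-partition lag: log end offset minus the group's committed offset."""
    lag: Dict[TopicPartition, int] = {}
    for tp, end in end_offsets.items():
        committed = committed_offsets.get(tp, 0)  # assumption: no commit means 0
        lag[tp] = max(end - committed, 0)
    return lag


if __name__ == "__main__":
    ends = {("logs", 0): 1_500, ("logs", 1): 900}
    committed = {("logs", 0): 1_200, ("logs", 1): 900}
    per_partition = consumer_lag(ends, committed)
    print(per_partition, "total:", sum(per_partition.values()))
```

A steadily growing total is an early warning that the consumer group is falling behind, well before commit failures appear in the logs.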
Backup and Restore
● Internet wisdom = Kafka backups are not a thing
○ Rely on replication within the cluster or MirrorMaker replication to another cluster
● Cassandra experience says backups are valuable
○ Hardware failure is not an issue, but corruption due to app bugs or user error can occur and be spread by replication
● Future
○ Working on regular automated backup and restore of topic and security configuration
○ Consider using Kafka Connect to write important messages to offline backup
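A configuration backup of the kind described above could be as simple as a checksummed JSON snapshot. The sketch below is a hypothetical illustration, not Instaclustr's implementation: `snapshot_configs` and `restore_configs` are invented names, and fetching the live configuration from the cluster is out of scope.

```python
import hashlib
import json
from typing import Dict

Configs = Dict[str, Dict[str, str]]  # topic name -> topic-level settings


def snapshot_configs(configs: Configs) -> str:
    """Serialise topic configs to JSON with a SHA-256 checksum of the payload."""
    payload = json.dumps(configs, sort_keys=True)
    checksum = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return json.dumps({"sha256": checksum, "payload": payload})


def restore_configs(blob: str) -> Configs:
    """Parse a snapshot, refusing to restore if the checksum does not match."""
    doc = json.loads(blob)
    actual = hashlib.sha256(doc["payload"].encode("utf-8")).hexdigest()
    if actual != doc["sha256"]:
        raise ValueError("backup checksum mismatch")
    return json.loads(doc["payload"])


if __name__ == "__main__":
    backup = snapshot_configs({"logs": {"retention.ms": "604800000"}})
    print(restore_configs(backup))
```

The checksum guards against the corruption case raised above: a damaged or hand-edited snapshot is rejected rather than silently restored.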
Thanks for listening!
● Currently in Preview
● Would love any feedback or suggestions, or just tell us what we missed
● 14-day free trial option (no CC needed) - console.instaclustr.com


Instaclustr Kafka Meetup Sydney Presentation
