Hello Cleveland!
Work for [company logo]
Wrote this >>> [book cover]
You can heckle on Twitter….
@jasonbelldata
The Plan
1. A quick overview of the project and my original talk.
2. A look at how to make it better with Kafka's components.
Think of this talk as a conceptual how-to, with thoughts about awkward data.
THIS IS A TRUE STORY.
The events depicted took place in
London in 2016.
At the request of the client,
the names have been changed.
Out of respect for the data,
the rest has been told exactly
as it occurred.
The Data
CSV payloads are GZipped and sent to an endpoint via HTTP.
The data contains live flight searches, batched together in chunks of a certain duration.
LHR,IST,2020-08-01,169.99,Y
There may be 3 rows or 20,000 rows.
It’s variable and we can’t control that.
I call this “The Fun Bit”
So this is what we did as a proof of concept.
And boy did I learn.
Thoughts, Experiences and
Considerations with Throughput
and Latency on High Volume
Stream Processing Systems Using
Cloud Based Infrastructures.
How I Bled All Over
Onyx
One Day I Got a Question
“We want to stream in 12 TB of data a day… Can you do that?”
A Streaming Application… kind of.
What is Onyx?
• Distributed Computation Platform
• Written 100% in Clojure
• Great plugin architecture (inputs and outputs)
• Uses a Directed Acyclic Graph (DAG) workflow
• I spoke about it at ClojureX 2016
• It’s very good…
Phase 1 – Testing at 1% Volume
• It’s working!
• Stuff’s going to S3!
• Tell the client the good news!
Phase 2 – Testing at 2% Volume
• It’s working!
• Stuff’s going to S3!
• Tell the client the good news!
Phase 3 – Testing at 5% Volume
Know Thy Framework
• Aeron buffer size = 3 × (batch-size × segment size × connections)
• (1 × 512 KB × 8) = 4 MB, × 3 = 12 MB
• Aeron’s default term buffer is 16 MB, but then the twist:
• Onyx segment size = aeron.term.buffer.size / 8
• Max message size of 2 MB
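A quick back-of-envelope check of the sizing above, as a hedged Clojure sketch (the numbers are from the slide; the var names are mine):

(def batch-size 1)
(def segment-size (* 512 1024)) ; 512 KB
(def connections 8)
;; buffer needed = 3 x (batch-size x segment-size x connections)
(def buffer-needed (* 3 batch-size segment-size connections)) ; => 12582912 (12 MB)
;; the twist: Onyx's max segment (message) size is term buffer / 8
(def aeron-term-buffer (* 16 1024 1024)) ; 16 MB default
(def max-message-size (/ aeron-term-buffer 8)) ; => 2097152 (2 MB)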
Onyx in Docker containers.
• --shm-size ended up being 10 GB
• OOM kills were frequent
• Java logging is helpful but still hard to deal with.
# Use the container memory limit to set max heap size so that the GC
# knows to collect before it's hard-stopped by the container
# environment, causing an OOM exception.
CGROUPS_MEM=$(cat /sys/fs/cgroup/memory/memory.limit_in_bytes)
MEMINFO_MEM=$(($(awk '/MemTotal/ {print $2}' /proc/meminfo)*1024))
MEM=$(($MEMINFO_MEM>$CGROUPS_MEM?$CGROUPS_MEM:$MEMINFO_MEM))
JVM_PEER_HEAP_RATIO=${JVM_PEER_HEAP_RATIO:-0.6}
XMX=$(awk '{printf("%d",$1*$2/1024^2)}' <<< "${MEM} ${JVM_PEER_HEAP_RATIO}")
One Friday evening…
I did a rewrite in Kafka Streams.
• Moved the pure Clojure functions into the streams architecture.
• Used Amazonica to write to AWS S3.
Kafka Streams rewrite…
• Took 2 hours to write and deploy to DC/OS.
• And, by this stage, I was slightly tipsy.
Kafka Streams rewrite…
• In deployment, Kafka Streams didn’t touch more than 2% of the CPU/memory load on the instance.
• Max memory was 790 MB, compared to 8 GB for the Onyx jobs.
Monday Morning…
• Now might be a good time to tell the CTO. ;)
Now let’s make it better.
I need to break down the process
into separate components.
Doing everything in one streaming app is bad!
The Event Log Producer
I’m going to make this into a streaming
application.
Streaming Event Log:
Convert GZIP to a String CSV bundle
→ Transform, adding a uuid, batchuuid and timestamp to rows
→ Push payload to topics: compress.payload and cflight.incoming
LHR,IST,2020-08-01,169.99,Y,b6e87fb96f73c119caed08d6e9509c66,39b6aba33172e379c763fcbd1b23aad7,2020-08-01T09:00:00
Deserialise / Decompress Step

;; Assumes (:require [clojure.java.io :as io]) and
;; (:import (java.io ByteArrayInputStream)
;;          (java.util.zip GZIPInputStream)) in the ns form.
(defn deserialize-gzip-message [bytes]
  (try
    (-> bytes
        ByteArrayInputStream.
        GZIPInputStream.
        io/reader
        slurp)
    (catch Exception e {:error e})))
LHR,IST,2020-08-01,169.99,Y,b6e87fb96f73c119caed08d6e9509c66,39b6aba33172e379c763fcbd1b23aad7,2020-08-01T09:00:00
Transform Step
Define a batch UUID called batchuuid
Define a batch timestamp
Define a StringBuilder outputbundle
Split the bundle by newline; for each CSV line:
  Define a uuid
  Append the uuid
  Append the batchuuid
  Append the timestamp
  Append the complete line to outputbundle + newline
Push the outputbundle string to compress.payload
Push the outputbundle string to cflight.incoming
(A Clojure sketch of this step follows the sample row below.)
LHR,IST,2020-08-01,169.99,Y,b6e87fb96f73c119caed08d6e9509c66,39b6aba33172e379c763fcbd1b23aad7,2020-08-01T09:00:00
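A minimal Clojure sketch of the transform step above, under two assumptions: the bundle arrives as one newline-delimited CSV string, and the deck’s 32-character ids are dashless UUIDs (the real id scheme isn’t shown). Producing to the two topics is left to the surrounding topology.

(require '[clojure.string :as cstr])

(defn- undashed-uuid []
  ;; Assumption: the 32-char ids in the sample rows are dashless UUIDs.
  (cstr/replace (str (java.util.UUID/randomUUID)) "-" ""))

(defn transform-bundle [^String bundle]
  (let [batchuuid    (undashed-uuid)
        timestamp    (str (java.time.LocalDateTime/now))
        outputbundle (StringBuilder.)]
    (doseq [line (cstr/split-lines bundle)]
      (.append outputbundle
               (str line "," (undashed-uuid) "," batchuuid "," timestamp "\n")))
    ;; The topology then pushes this string to compress.payload and cflight.incoming.
    (.toString outputbundle)))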
The Cheapest Price Streaming App
Option 1 - Compare each line of the CSV bundle.
;; Extracts the price from a row. The price is the fourth CSV field,
;; so split the row rather than scanning for the first two commas.
(defn get-price
  [row]
  (Double/parseDouble (nth (cstr/split row #",") 3)))

;; Finds the cheapest flight on a row by row comparison.
;; Uses get-price function.
(defn find-cheapest-flight [data]
  (println "Finding cheapest flight")
  (let [rows (cstr/split data #"\n")
        [price cheapest-flight]
        (reduce (fn [[cheapest-price cheapest-row :as cheapest]
                     [candidate-price candidate-row :as candidate]]
                  (if (> cheapest-price candidate-price)
                    candidate
                    cheapest))
                (map (fn [row-o] [(get-price row-o) row-o])
                     rows))]
    (str cheapest-flight "\n"))) ;; put the EOL in before we serialise it again
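For illustration, a hedged example call with a two-row bundle (ids shortened for readability):

(find-cheapest-flight
  (str "LHR,IST,2020-08-01,169.99,Y,uuid1,batch1,2020-08-01T09:00:00\n"
       "LHR,IST,2020-08-01,149.99,Y,uuid2,batch1,2020-08-01T09:00:00"))
;; => "LHR,IST,2020-08-01,149.99,Y,uuid2,batch1,2020-08-01T09:00:00\n"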
OR????????
KSQL?
Option 2 - Step 1: Convert the CSV bundle to AVRO payload
{"depiata":"LHR",
 "arriata":"IST",
 "searchdate":"2020-08-01",
 "gbpprice":169.99,
 "paxclass":"Y",
 "uuid":"b6e87fb96f73c119caed08d6e9509c66",
 "batchuuid":"39b6aba33172e379c763fcbd1b23aad7",
 "timestamp":"2020-08-01T09:00:00"}
Convert GZIP to a String CSV bundle
→ Transform, adding a uuid, batchuuid and timestamp to rows
→ Push payload to topics: compress.payload and cflight.incoming
LHR,IST,2020-08-01,169.99,Y,b6e87fb96f73c119caed08d6e9509c66,39b6aba33172e379c763fcbd1b23aad7,2020-08-01T09:00:00
Event Log Transform Step Amended
Define a batch UUID called batchuuid
Define a batch timestamp
Define a StringBuilder outputbundle
Split the bundle by newline; for each CSV line:
  Define a uuid
  Append the uuid
  Append the batchuuid
  Append the timestamp
  Append the complete line to outputbundle + newline
  Convert the line to Avro
  Push the Avro payload to the cflight.incoming topic
Push the outputbundle string to compress.payload
(A sketch of the per-line Avro conversion follows the sample row below.)
LHR,IST,2020-08-01,169.99,Y,b6e87fb96f73c119caed08d6e9509c66,39b6aba33172e379c763fcbd1b23aad7,2020-08-01T09:00:00
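A hedged sketch of the per-line Avro conversion, assuming the field names from the payload shown earlier; the schema file name is hypothetical, and the real job would hand the GenericRecord to a KafkaAvroSerializer.

(import '(org.apache.avro Schema$Parser)
        '(org.apache.avro.generic GenericData$Record))

(def cflight-schema
  (.parse (Schema$Parser.) (slurp "cflight.avsc"))) ; hypothetical schema file

(defn line->avro [line]
  (let [[dep arr date price pax uuid batchuuid ts] (cstr/split line #",")]
    (doto (GenericData$Record. cflight-schema)
      (.put "depiata" dep)
      (.put "arriata" arr)
      (.put "searchdate" date)
      (.put "gbpprice" (Double/parseDouble price))
      (.put "paxclass" pax)
      (.put "uuid" uuid)
      (.put "batchuuid" batchuuid)
      (.put "timestamp" ts))))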
Option 2 - Step 2: Craft a KSQL Job
CREATE STREAM CFLIGHT (………………….)
  WITH (KAFKA_TOPIC='cflight.incoming',
        PARTITIONS=1,
        VALUE_FORMAT='avro');

-- An aggregation yields a TABLE, and only grouped
-- columns can appear ungrouped in the SELECT.
CREATE TABLE CFLIGHT_CHEAPEST AS
  SELECT BATCHUUID,
         MIN(GBPPRICE) AS CHEAPEST_QUOTE
  FROM CFLIGHT
  GROUP BY BATCHUUID
  EMIT CHANGES;
The Batch Compressor Streaming App
Convert GZIP to a String CSV bundle
→ Transform, adding a uuid, batchuuid and timestamp to rows
→ Push payload to topics: compress.payload and cflight.incoming

compress.payload (topic)
→ GZip the output stream, get the byte array
→ batch.compressed (topic)
Compress and Serialise Step
(def gzip-serializer-fn
  (fn [output-str]
    (let [out (java.io.ByteArrayOutputStream.)]
      (doto (java.io.BufferedOutputStream.
              (java.util.zip.GZIPOutputStream. out))
        (.write (.getBytes output-str))
        (.close))
      (.toByteArray out))))
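As a quick sanity check, the serializer round-trips with the deserialize-gzip-message function from earlier in the deck:

(deserialize-gzip-message
  (gzip-serializer-fn "LHR,IST,2020-08-01,169.99,Y"))
;; => "LHR,IST,2020-08-01,169.99,Y"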
Persist to S3
Kafka Connect
Use prebundled connectors where you can.
They also give you access to further filter and transform steps if required.
{"name": "cflight.sink",
"config":
{"connector.class":"io.confluent.connect.s3.S3SinkConnector",
"tasks.max":"1",
“topics”:"CHEAPEST_FLIGHT",
"s3.region":"us-east-1",
"s3.bucket.name": "cflight.bucket",
"s3.part.size":"5242880",
"flush.size":"1",
"storage.class":"io.confluent.connect.s3.storage.S3Storage",
"format.class": "io.confluent.connect.s3.format.avro.AvroFormat",
"partitioner.class":"io.confluent.connect.storage.partitioner.DefaultPartitioner",
"schema.compatibility":"NONE"}}
{"name": "batchcompress.sink",
"config":
{"connector.class":"io.confluent.connect.s3.S3SinkConnector",
"tasks.max":"1",
“topics":"batch.compressed",
"s3.region":"us-east-1",
"s3.bucket.name": "flightbatch.bucket",
"s3.part.size":"5242880",
"flush.size":"1",
"storage.class":"io.confluent.connect.s3.storage.S3Storage",
"format.class": "io.confluent.connect.s3.format.bytearray.ByteArrayFormat",
"partitioner.class":"io.confluent.connect.storage.partitioner.DefaultPartitioner",
"schema.compatibility":"NONE"}}
The final architecture:
• nginx
• EventLog Streaming App
• Cheapest Flight App
• Compress Bundle Streaming App
• Kafka Connect Batch compress Sink
• Kafka Connect CFlight sink
• Brokers
• S3 Storage
Thank you.
Many thanks to Shay and David for organising, everyone who attended and sent
kind wishes. Lastly, a huge thank you to MeetupCat.
Photo supplied by @jbfletch_