Heterogeneous Job processing
with Apache Kafka
Chu, Hua-Rong | PyCon HK 2018
TL;DL
A journey toward large-scale job processing
What you might find interesting:
● Why and how to make use of Kafka for job queuing and scheduling
● A sketch of a reliable and scalable job processing system built with Python
● Exploring another use case for Kafka beyond the data science area
Speaker
A Pythonista from Taiwan, plus...
● Research engineer @ Chunghwa Telecom Laboratories
○ Focus on infrastructure and platform architecture design
● Enthusiast of open source and software technologies
○ Involved in open source projects, meetups, conferences and so on
So what is job processing?
Producer-Consumer pattern
● Some business logic is too time-consuming to run in front of the user
● Create background jobs, place them on multiple queues, and process them later
● Well-known implementations in the Python world: Celery, RQ
● Disambiguation: workers in this system are heterogeneous, in contrast to those in Apache Spark
https://www.hashicorp.com/blog/replacing-queues-with-nomad-dispatch
Doing the labor underneath most fancy services
Example: background tasks in Shopify
● Sending massive newsletters / image resizing / HTTP downloads / updating smart collections /
updating Solr / batch imports / spam checks
Fancy services also built their own fancy job processing systems
● GitHub: Resque
● LiveJournal: Gearman
● Shopify: delayed_job
Almost yet another job queue is released every month...
http://queues.io/
Why did we reinvent the wheel?
We did adopt existing artifacts... until we broke up
● Most existing artifacts are made by cool guys who are building fancy SaaSes
○ Low latency, moderate reliability, relatively short jobs (seconds to a minute)
○ Example: sending massive newsletters
● We are building heavy, boring IaaS infra.
○ Cloud resource provisioning, a tiered storage system of several PB...
○ Requires serious durability and long-run jobs (several minutes to hours)
○ Moderate latency is acceptable
Presented @ Taipei.py, Sep. 2018: the Azure Blob Storage Lifecycle
Evolution of our job processing infrastructure
The Dark Ages - DB + cronjobs
● Todo lists in an RDBMS
● A variety of cronjobs check the lists periodically and do the corresponding jobs
● Verdict
○ For - a good choice in the MVP (minimum viable product) building stage
○ Against - every aspect other than simplicity of development
http://resque.github.io/
Renaissance - stood on the shoulders of GitHub
GitHub has been on a journey to seek a robust job processing infra.
● They've been through many different background job systems
○ SQS, Starling, ActiveMessaging, BackgroundJob, DelayedJob, and beanstalkd
● Resque is the answer learned from that journey.
○ GitHub's own in-house job system
○ Redis-backed
Renaissance - stood on the shoulders of GitHub
● We adopted Resque as our job processing system
○ Harder, better, faster, stronger
● Problem
○ We enjoyed it happily ever after... until the Redis daemon was killed by the OOM killer during an outage
○ Redis is not that reliable as a storage backend
○ It takes a lot of time to redo a long-run job after a failure
● Resque is good for SaaS, but...
○ Consider that there are "only" 7.7 billion people in the world clicking buttons in your app
○ There are trillions of files in AWS S3
Revolution - our in-house system
The solution satisfies:
● Both the durability and the scalability we desire
○ Jobs are stored in and passed via Apache Kafka
○ Let Kafka handle the hard queuing/scheduling problems
● Job execution is recoverable and diagnosable
○ We decouple a long-run job into shorter tasks
○ Inspired by stream processing and event sourcing
[Diagram] AS-IS: Input → one huge job → Result. TO-BE: Input → a chain of small tasks → Result.
Revolution - our in-house system
A job handling procedure is defined by a series of tasks in a spec.
● Tasks are chained together via adjacent inputs/outputs
● Inputs and outputs are Kafka topics (see the sketch below)
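A minimal sketch of what such a spec could look like. The slides do not show the actual format, so the structure, task names, and topic names below are all hypothetical:

```python
# Hypothetical job spec: tasks are chained by matching each task's
# output topic to the next task's input topic. All names are made up.
JOB_SPEC = [
    {"task": "fetch",     "input": "job.submitted",   "output": "job.fetched"},
    {"task": "transform", "input": "job.fetched",     "output": "job.transformed"},
    {"task": "store",     "input": "job.transformed", "output": "job.done"},
]
```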
Revolution - our in-house system
Initialize the job handling procedure according to the spec.
● Spawn workers that handle each task respectively
● Manage the number of each kind of worker
Details on how we achieve this
1. Master the power of Kafka in Python
2. Let it be durable
3. Let it be scalable
4. Decouple long-run jobs into small pieces
Master the power of Kafka in Python
A brief on Apache Kafka
● Brought to you by LinkedIn
● Focus on performance, scalability, and durability
○ Better throughput, built-in partitioning, replication, and fault tolerance for large-scale message processing applications
● Widely adopted in the data science / big data areas
○ "K" of the SMACK stack
○ Website Activity Tracking, metrics, log aggregation
Ch.ko123, CC BY 4.0,
https://commons.wikimedia.org/w/index.php?curid=59871096
Master the power of Kafka in Python
Quick facts for Pythonistas
● We can only use two of the four major Kafka APIs from Python
○ (O) Producer API, Consumer API
○ (X) Connector API, Streams API
● Client binding: confluent-kafka-python
○ Supported by the creators of Kafka
○ A wrapper around librdkafka => better performance, reliability, and future-proofing
[Code screenshots: Producer / Consumer]
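The original slide shows producer and consumer code screenshots that did not survive extraction. A minimal sketch of both sides with confluent-kafka-python; the broker address, topic, and group names are placeholders:

```python
from confluent_kafka import Consumer, Producer

# --- Producer side ---
producer = Producer({"bootstrap.servers": "localhost:9092"})  # placeholder broker
producer.produce("demo.topic", key=b"job-1", value=b"payload")
producer.flush()  # block until delivery is confirmed

# --- Consumer side ---
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "demo-workers",          # placeholder group
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["demo.topic"])

msg = consumer.poll(timeout=10.0)
if msg is not None and msg.error() is None:
    print(msg.key(), msg.value())
consumer.close()
```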
Let it be durable
Kafka is one of the most durable stores for messages
● Messages are retained as replicas spread among brokers
● Replicas can be written to storage synchronously, just like in a real distributed file system
Other MQs are generally not as durable, due to their intended use cases. For example, the RabbitMQ documentation states:
● "Marking messages as persistent doesn't fully guarantee that a message won't be lost"
https://www.confluent.io/blog/hands-free-kafka-replication-a-lesson-in-operational-simplicity/
Let it be durable
Durability is not out of the box with the default config...
● Adjust the following parameters in the Kafka cluster config
○ default.replication.factor=3
○ unclean.leader.election.enable=false (default in v0.11.0.0+)
○ min.insync.replicas=2
● Add the following parameter to the producer config (see the sketch below)
○ request.required.acks=all
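A sketch of how the producer-side setting maps onto code, assuming confluent-kafka-python; broker addresses are placeholders, and "acks" is the current librdkafka alias for the slide's request.required.acks:

```python
from confluent_kafka import Producer

# Broker-side settings (server.properties), as listed on the slide:
#   default.replication.factor=3
#   unclean.leader.election.enable=false
#   min.insync.replicas=2

# Producer-side: a write succeeds only after all in-sync replicas ack it.
producer = Producer({
    "bootstrap.servers": "broker1:9092,broker2:9092,broker3:9092",  # placeholders
    "acks": "all",  # alias of request.required.acks=all
})
```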
Let it be durable
To achieve at-least-once semantics, be careful about commit timing on the consumer side (see the sketch below)
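A sketch of at-least-once consumption with confluent-kafka-python: auto-commit is disabled and the offset is committed only after the work is done, so a crash replays the message rather than losing it. The broker, group, topic, and the handle() stub are placeholders:

```python
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # placeholder
    "group.id": "task-workers",             # placeholder
    "enable.auto.commit": False,            # we control commit timing ourselves
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["task.input"])          # placeholder topic

def handle(payload: bytes) -> None:
    """Placeholder for the actual (possibly long-running) task logic."""

while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None or msg.error():
        continue
    handle(msg.value())                               # do the work first...
    consumer.commit(message=msg, asynchronous=False)  # ...then commit: duplicates possible, losses not
```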
Let it be scalable
Scalability is achieved with the system design
● Shared-nothing
○ Each worker is a single process
○ GIL-free
● Job decoupling
○ Each task in a job can be scaled individually, just like fancy microservices
○ Admins can adjust the number of workers at runtime
Let it be scalable
Scalability is also achieved with Kafka properties
● Naturally distributed
● Offsets are managed in the consumer instead of the broker
Configuration (see the sketch below)
● Producers should write messages to several partitions in one topic
● Consumers/workers responsible for the same task should belong to the same group
https://www.confluent.io/blog/apache-kafka-for-service-architectures/
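A sketch of both rules, again assuming confluent-kafka-python; all names are placeholders. Keyed produces spread messages across partitions, and workers for the same task share one group.id so Kafka assigns each partition to exactly one of them:

```python
from confluent_kafka import Consumer, Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})  # placeholder

# Keyed messages are hashed across the topic's partitions,
# which is what lets several workers consume in parallel.
for job_id in ("job-1", "job-2", "job-3"):
    producer.produce("task.input", key=job_id.encode(), value=b"payload")
producer.flush()

# Every worker of the same task uses the same group.id; Kafka's group
# rebalancing then gives each partition to exactly one live worker.
worker = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "task-download-workers",  # identical for all workers of this task
})
worker.subscribe(["task.input"])
```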
Decouple long-run jobs into small pieces
When facing business logic problems...
● The procedure can be recovered from the failed task stage
○ Auto retry
○ An "Abort" topic (queue) for human intervention
● Debug info is available
○ Exception info goes into the message headers on retry (see the sketch below)
[Diagram: a task chain from Input to Result]
➔ Recoverable
➔ Diagnosable
Decouple long-run jobs into small pieces
Tasks can
● Be stateful
● Be grouped into subsystems by domain
● Flush results to an external system
Decouple long-run jobs into small pieces
[Diagram: tasks grouped into subsystems by domain, flushing results to an external system]
A task chain can perform operations in a stream-processing / functional-programming style, e.g. consume, flatmap, partition (see the sketch below)
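A sketch of a flatmap-style task under the same assumptions as above: one input message fans out into several output messages before the input offset is committed. Topic names and the explode() logic are placeholders:

```python
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # placeholder
    "group.id": "flatmap-workers",          # placeholder
    "enable.auto.commit": False,
})
producer = Producer({"bootstrap.servers": "localhost:9092"})
consumer.subscribe(["job.input"])           # placeholder topics

def explode(payload: bytes) -> list[bytes]:
    """Placeholder: split one job payload into many task payloads."""
    return payload.split(b",")

while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None or msg.error():
        continue
    for item in explode(msg.value()):       # flatmap: 1 input -> N outputs
        producer.produce("job.tasks", value=item)
    producer.flush()                        # make the outputs durable first...
    consumer.commit(message=msg, asynchronous=False)  # ...then commit the input
```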
Summary
What we have covered so far
● Why and how to make use of Kafka for job queuing and scheduling
● A sketch of a reliable and scalable job processing system built with Python
● Exploring another use case for Kafka beyond the data science area
Final word
I highly recommend existing wonderful artifacts such as RQ and Celery to anyone, unless you:
● Want to share an existing Kafka cluster to make your infrastructure more cost-effective
● Are building critical applications that require scalability and durability
多謝! (Thank you!)
Example code and contact: https://hrchu.github.io/
Job processing at large scale
An IaaS-grade job processing system powered by Kafka and Python
● Stood on the shoulders of GitHub
● Focused on durability, scalability, and long-run jobs
● Designed out of necessity
Against
● Availability - a CP system
● Monitoring
● Complexity