Heterogeneous Job processing
with Apache Kafka
Chu, Hua-Rong | PyCon HK 2018
TL;DL
A journey toward large-scale job processing
What you might find interesting:
● Why and how to make use of Kafka for job queuing and scheduling
● A sketch of a reliable and scalable job processing system built with Python
● Exploring another use case for Kafka beyond the data science area
Speaker
A Pythonista from Taiwan, plus...
● Research engineer @ Chunghwa Telecom Laboratories
○ Focus on infrastructure and platform architecture design
● Enthusiast of open source and software technologies
○ Involved in open source projects, meetups, conferences and so on
So what is job processing?
Producer-Consumer pattern
● Some business logic is too time-consuming to run in front of the user
● Create background jobs, place them on multiple queues, and process them later
● Well-known implementations in the Python world: Celery, RQ
● Disambiguation: workers in this system are heterogeneous, in contrast to those in Apache Spark
https://www.hashicorp.com/blog/replacing-queues-with-nomad-dispatch
Doing the labor underneath most fancy services
Example: background tasks in Shopify
● Sending massive newsletters / image resizing / HTTP downloads / updating smart collections /
updating Solr / batch imports / spam checks
Fancy services also built their own fancy job processing systems
● GitHub: Resque
● LiveJournal: Gearman
● Shopify: delayed_job
Almost yet another job queue is released every month...
http://queues.io/
Why did we reinvent the wheel?
We did adopt existing artifacts... until we broke up
● Most existing artifacts are made by cool guys who are building fancy SaaSes
○ Low latency, moderate reliability, relatively short jobs (seconds to a minute)
○ Example: sending massive newsletters
● We are building heavy, boring IaaS infra.
○ Cloud resource provisioning, a tiered storage system of several PB...
○ Requires serious durability and long-run jobs (several minutes to hours)
○ Moderate latency is acceptable
Presented @ Taipei.py, Sep. 2018: the Azure Blob Storage Lifecycle
Evolution of our job processing infrastructure
The Dark Ages - DB + cronjobs
● Todo lists in an RDBMS
● A variety of cronjobs check the lists periodically and do the corresponding jobs
● Verdict
○ For - a good choice in the MVP (minimum viable product) building stage
○ Against - every aspect other than simplicity of development
http://resque.github.io/
Renaissance - stood on the shoulders of GitHub
GitHub has been on a journey to seek a robust job processing infra.
● They've been through many different background job systems
○ SQS, Starling, ActiveMessaging, BackgroundJob, DelayedJob, and beanstalkd
● Resque is the answer learned from that journey.
○ GitHub's own in-house job system
○ Redis-backed
Renaissance - stood on the shoulders of GitHub
● We adopted Resque as our job processing system
○ Harder, better, faster, stronger
● Problem
○ We enjoyed it happily ever after... until the Redis daemon was killed by the OOM killer during an outage
○ Redis is not that reliable as a storage backend
○ It takes a lot of time to redo a long-run job after a failure
● Resque is good for SaaS, but...
○ Consider that there are "only" 7.7 billion people in the world clicking buttons in your app
○ There are trillions of files in AWS S3
Revolution - our in-house system
The solution satisfies:
● Both the durability and the scalability we desire
○ Jobs are stored in and passed via Apache Kafka
○ Let Kafka handle the hard queuing/scheduling problems
● Job execution is recoverable and diagnosable
○ We decouple a long-run job into shorter tasks
○ Inspired by stream processing and event sourcing
[Diagram] AS-IS: Input → one huge job → Result. TO-BE: Input → a chain of small tasks → Result.
Revolution - our in-house system
A job handling procedure is defined by a series of tasks in a spec.
● Tasks are chained together via adjacent inputs/outputs
● Inputs and outputs are Kafka topics (see the sketch below)
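A minimal sketch of what such a spec could look like. The slides do not show the actual format, so the structure, task names, and topic names below are all hypothetical:

```python
# Hypothetical job spec: tasks are chained by matching each task's
# output topic to the next task's input topic. All names are made up.
JOB_SPEC = [
    {"task": "fetch",     "input": "job.submitted",   "output": "job.fetched"},
    {"task": "transform", "input": "job.fetched",     "output": "job.transformed"},
    {"task": "store",     "input": "job.transformed", "output": "job.done"},
]
```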
Revolution - our in-house system
Initialize the job handling procedure according to the spec.
● Spawn workers that handle each task respectively
● Manage the number of each kind of worker
Details on how we achieve this
1. Master the power of Kafka in Python
2. Let it be durable
3. Let it be scalable
4. Decouple long-run jobs into small pieces
Master the power of Kafka in Python
A brief on Apache Kafka
● Brought to you by LinkedIn
● Focus on performance, scalability, and durability
○ Better throughput, built-in partitioning, replication, and fault tolerance for large-scale message processing applications
● Widely adopted in the data science / big data areas
○ "K" of the SMACK stack
○ Website Activity Tracking, metrics, log aggregation
Ch.ko123, CC BY 4.0,
https://commons.wikimedia.org/w/index.php?curid=59871096
Master the power of Kafka in Python
Quick facts for Pythonistas
● We can only use two of the four major Kafka APIs from Python
○ (O) Producer API, Consumer API
○ (X) Connector API, Streams API
● Client binding: confluent-kafka-python
○ Supported by the creators of Kafka
○ A wrapper around librdkafka => better performance, reliability, and future-proofing
[Code screenshots: Producer / Consumer]
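The original slide shows producer and consumer code screenshots that did not survive extraction. A minimal sketch of both sides with confluent-kafka-python; the broker address, topic, and group names are placeholders:

```python
from confluent_kafka import Consumer, Producer

# --- Producer side ---
producer = Producer({"bootstrap.servers": "localhost:9092"})  # placeholder broker
producer.produce("demo.topic", key=b"job-1", value=b"payload")
producer.flush()  # block until delivery is confirmed

# --- Consumer side ---
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "demo-workers",          # placeholder group
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["demo.topic"])

msg = consumer.poll(timeout=10.0)
if msg is not None and msg.error() is None:
    print(msg.key(), msg.value())
consumer.close()
```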
Let it be durable
Kafka is one of the most durable stores for messages
● Messages are retained as replicas spread among brokers
● Replicas can be written to storage synchronously, just like in a real distributed file system
Other MQs are generally not as durable, due to their intended use cases. For example, the RabbitMQ documentation states:
● "Marking messages as persistent doesn't fully guarantee that a message won't be lost"
https://www.confluent.io/blog/hands-free-kafka-replication-a-lesson-in-operational-simplicity/
Let it be durable
Durability is not out of the box with the default config...
● Adjust the following parameters in the Kafka cluster config
○ default.replication.factor=3
○ unclean.leader.election.enable=false (default in v0.11.0.0+)
○ min.insync.replicas=2
● Add the following parameter to the producer config (see the sketch below)
○ request.required.acks=all
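A sketch of how the producer-side setting maps onto code, assuming confluent-kafka-python; broker addresses are placeholders, and "acks" is the current librdkafka alias for the slide's request.required.acks:

```python
from confluent_kafka import Producer

# Broker-side settings (server.properties), as listed on the slide:
#   default.replication.factor=3
#   unclean.leader.election.enable=false
#   min.insync.replicas=2

# Producer-side: a write succeeds only after all in-sync replicas ack it.
producer = Producer({
    "bootstrap.servers": "broker1:9092,broker2:9092,broker3:9092",  # placeholders
    "acks": "all",  # alias of request.required.acks=all
})
```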
Let it be durable
To achieve at-least-once semantics, be careful about commit timing on the consumer side (see the sketch below)
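A sketch of at-least-once consumption with confluent-kafka-python: auto-commit is disabled and the offset is committed only after the work is done, so a crash replays the message rather than losing it. The broker, group, topic, and the handle() stub are placeholders:

```python
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # placeholder
    "group.id": "task-workers",             # placeholder
    "enable.auto.commit": False,            # we control commit timing ourselves
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["task.input"])          # placeholder topic

def handle(payload: bytes) -> None:
    """Placeholder for the actual (possibly long-running) task logic."""

while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None or msg.error():
        continue
    handle(msg.value())                               # do the work first...
    consumer.commit(message=msg, asynchronous=False)  # ...then commit: duplicates possible, losses not
```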
Let it be scalable
Scalability is achieved with the system design
● Shared-nothing
○ Each worker is a single process
○ GIL-free
● Job decoupling
○ Each task in a job can be scaled individually, just like fancy microservices
○ Admins can adjust the number of workers at runtime
Let it be scalable
Scalability is also achieved with Kafka properties
● Naturally distributed
● Offsets are managed in the consumer instead of the broker
Configuration (see the sketch below)
● Producers should write messages to several partitions in one topic
● Consumers/workers responsible for the same task should belong to the same group
https://www.confluent.io/blog/apache-kafka-for-service-architectures/
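A sketch of both rules, again assuming confluent-kafka-python; all names are placeholders. Keyed produces spread messages across partitions, and workers for the same task share one group.id so Kafka assigns each partition to exactly one of them:

```python
from confluent_kafka import Consumer, Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})  # placeholder

# Keyed messages are hashed across the topic's partitions,
# which is what lets several workers consume in parallel.
for job_id in ("job-1", "job-2", "job-3"):
    producer.produce("task.input", key=job_id.encode(), value=b"payload")
producer.flush()

# Every worker of the same task uses the same group.id; Kafka's group
# rebalancing then gives each partition to exactly one live worker.
worker = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "task-download-workers",  # identical for all workers of this task
})
worker.subscribe(["task.input"])
```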
Decouple long-run jobs into small pieces
When facing business logic problems...
● The procedure can be recovered from the failed task stage
○ Auto retry
○ An "Abort" topic (queue) for human intervention
● Debug info is available
○ Exception info goes into the message headers on retry (see the sketch below)
[Diagram: a task chain from Input to Result]
➔ Recoverable
➔ Diagnosable
Decouple long-run jobs into small pieces
Tasks can
● Be stateful
● Be grouped into subsystems by domain
● Flush results to an external system
Decouple long-run jobs into small pieces
[Diagram: tasks grouped into subsystems by domain, flushing results to an external system]
A task chain can perform operations in a stream-processing / functional-programming style, e.g. consume, flatmap, partition (see the sketch below)
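A sketch of a flatmap-style task under the same assumptions as above: one input message fans out into several output messages before the input offset is committed. Topic names and the explode() logic are placeholders:

```python
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # placeholder
    "group.id": "flatmap-workers",          # placeholder
    "enable.auto.commit": False,
})
producer = Producer({"bootstrap.servers": "localhost:9092"})
consumer.subscribe(["job.input"])           # placeholder topics

def explode(payload: bytes) -> list[bytes]:
    """Placeholder: split one job payload into many task payloads."""
    return payload.split(b",")

while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None or msg.error():
        continue
    for item in explode(msg.value()):       # flatmap: 1 input -> N outputs
        producer.produce("job.tasks", value=item)
    producer.flush()                        # make the outputs durable first...
    consumer.commit(message=msg, asynchronous=False)  # ...then commit the input
```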
Summary
What we have covered so far
● Why and how to make use of Kafka for job queuing and scheduling
● A sketch of a reliable and scalable job processing system built with Python
● Exploring another use case for Kafka beyond the data science area
Final word
I highly recommend existing wonderful artifacts such as RQ and Celery to anyone, unless you:
● Want to share an existing Kafka cluster to make your infrastructure more cost-effective
● Are building critical applications that require scalability and durability
多謝! (Thank you!)
Example code and contact: https://hrchu.github.io/
Job processing at large scale
An IaaS-grade job processing system powered by Kafka and Python
● Stood on the shoulders of GitHub
● Focused on durability, scalability, and long-run jobs
● Designed out of necessity
Against
● Availability - a CP system
● Monitoring
● Complexity