Queue with asyncio and Kafka
Showcase
Ondřej Veselý
What kind of data we have
Problem: store JSON to database
Just a few records per second.
But
● Slow database
● Unreliable database
● Increasing traffic (20x)
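Old code
import ujson
import psycopg2
from psycopg2.extras import DictCursor
from datetime import datetime

# app, request and config come from the web framework and the settings
# module, which the slide does not show.

def save_data(conn, cur, ts, data):
    cur.execute(
        """INSERT INTO data (timestamp, data)
        VALUES (%s,%s) """, (ts, ujson.dumps(data)))
    conn.commit()  # one commit per record

@app.route('/store', method=['PUT', 'POST'])
def logstash_route():
    data = ujson.load(request.body)
    conn = psycopg2.connect(**config.pg_logs)  # new connection on every request
    t = datetime.now()
    with conn.cursor(cursor_factory=DictCursor) as cur:
        for d in data:
            save_data(conn, cur, t, d)
    conn.close()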
Architecture
internet -> /store (Kafka producer) -> Kafka queue -> Kafka consumer -> Postgres
… time to kill consumer ...
Asyncio, example
import asyncio

async def factorial(name, number):
    f = 1
    for i in range(2, number+1):
        print("Task %s: Compute factorial(%s)..." % (name, i))
        await asyncio.sleep(1)
        f *= i
    print("Task %s: factorial(%s) = %s" % (name, number, f))

loop = asyncio.get_event_loop()
tasks = [
    asyncio.ensure_future(factorial("A", 2)),
    asyncio.ensure_future(factorial("B", 3)),
    asyncio.ensure_future(factorial("C", 4))]
loop.run_until_complete(asyncio.gather(*tasks))
loop.close()

Output:
Task A: Compute factorial(2)...
Task B: Compute factorial(2)...
Task C: Compute factorial(2)...
Task A: factorial(2) = 2
Task B: Compute factorial(3)...
Task C: Compute factorial(3)...
Task B: factorial(3) = 6
Task C: Compute factorial(4)...
Task C: factorial(4) = 24
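
On Python 3.7 and later the same example is usually written with asyncio.run(), which creates and closes the event loop itself. The sketch below is not from the slides; the shorter sleep and the return value are changes made here so the result can be checked.

```python
import asyncio


async def factorial(name, number):
    f = 1
    for i in range(2, number + 1):
        print("Task %s: Compute factorial(%s)..." % (name, i))
        # Each await hands control back to the loop so other tasks can run.
        await asyncio.sleep(0.01)
        f *= i
    print("Task %s: factorial(%s) = %s" % (name, number, f))
    return f


async def main():
    # gather() runs the three coroutines concurrently and returns
    # their results in argument order.
    return await asyncio.gather(
        factorial("A", 2),
        factorial("B", 3),
        factorial("C", 4),
    )


results = asyncio.run(main())
```

The interleaving of the printed lines is the same as in the output above, because every task yields at its sleep.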
What we used
Apache Kafka
Not ujson
Concurrency - doing lots of slow things at once.
No processes, no threads.

Producer:
from aiohttp import web
import json

Consumer:
import asyncio
import json
from aiokafka import AIOKafkaConsumer
import aiopg
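Producer #1
import json

import arrow
from aiohttp import web

# settings and slog (the logger) are defined elsewhere in the project;
# get_kafka_producer() and destroy_all() are on the next slide.

async def kafka_send(kafka_producer, data, topic):
    # Wrap the payload together with the time it was received.
    message = {
        'data': data,
        'received': str(arrow.utcnow())
    }
    message_json_bytes = bytes(json.dumps(message), 'utf-8')
    # Publish the message and wait for the result of the send.
    await kafka_producer.send_and_wait(topic, message_json_bytes)

async def handle(request):
    post_data = await request.json()
    try:
        await kafka_send(request.app['kafka_p'], post_data, topic=settings.topic)
    except Exception:
        # Kafka is unreachable: log it and shut the whole service down.
        slog.exception("Kafka Error")
        await destroy_all()
    return web.Response(status=200)

app = web.Application()
app.router.add_route('POST', '/store', handle)
app['kafka_p'] = get_kafka_producer()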
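
The message format can be tried without Kafka. The sketch below mirrors the envelope built in kafka_send() and the decoding done later in the consumer; encode_message() and decode_message() are hypothetical helper names, and the standard library datetime stands in for arrow.

```python
import json
from datetime import datetime, timezone


def encode_message(data):
    """Wrap the payload with a receive timestamp and serialize it to UTF-8 JSON bytes."""
    message = {
        "data": data,
        "received": datetime.now(timezone.utc).isoformat(),
    }
    return json.dumps(message).encode("utf-8")


def decode_message(raw):
    """Inverse of encode_message(): return the (received, data) pair."""
    message = json.loads(raw.decode("utf-8"))
    return message.get("received"), message.get("data")


payload = {"event": "search", "results": 3}
raw = encode_message(payload)
received, data = decode_message(raw)
```

Kafka stores message values as bytes, which is why the JSON text is encoded before sending and decoded again on the consumer side.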
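Producer #2
Destroying the loop
import asyncio
import sys

from aiokafka import AIOKafkaProducer

async def destroy_all():
    loop = asyncio.get_event_loop()
    # asyncio.all_tasks() replaces asyncio.Task.all_tasks(),
    # which was removed in Python 3.9.
    for task in asyncio.all_tasks():
        task.cancel()
    # loop.stop() is a plain method, not a coroutine, so it is not awaited.
    # loop.close() is left out: a running loop cannot be closed from
    # inside one of its own coroutines.
    loop.stop()
    slog.debug("Exiting.")
    sys.exit()

Getting producer
def get_kafka_producer():
    loop = asyncio.get_event_loop()
    producer = AIOKafkaProducer(
        loop=loop,
        bootstrap_servers=settings.queues_urls,
        request_timeout_ms=settings.kafka_timeout,
        retry_backoff_ms=1000)
    # Connect to the brokers before the web application starts serving.
    loop.run_until_complete(producer.start())
    return producer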
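
Cancelling every task from inside a coroutine also cancels the task doing the cancelling. One possible variant, sketched here with only the standard library and not taken from the slides, skips the current task and waits for the others to finish unwinding; worker() and cancel_other_tasks() are hypothetical names.

```python
import asyncio


async def worker():
    # Stands in for a long-running job such as a request handler.
    await asyncio.sleep(3600)


async def cancel_other_tasks():
    """Cancel every task except the caller and wait until they have stopped."""
    current = asyncio.current_task()
    others = [t for t in asyncio.all_tasks() if t is not current]
    for task in others:
        task.cancel()
    # return_exceptions=True collects the CancelledError of each task
    # instead of re-raising it here.
    await asyncio.gather(*others, return_exceptions=True)
    return others


async def main():
    task = asyncio.ensure_future(worker())
    await asyncio.sleep(0)  # let the worker start
    cancelled = await cancel_other_tasks()
    return task.cancelled(), len(cancelled)


was_cancelled, count = asyncio.run(main())
```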
Consumer #1
… time to resurrect consumer ...
[Flowchart: two flows connected by an asyncio.Queue()]
Consume: start -> DB connected? (yes/no) -> 1. Receive data record from Kafka -> 2. Put it to the queue
Flush: start -> Connect to DB -> queue full enough or data old enough? (yes/no) -> Store data from queue to DB
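
The flush decision in the flowchart can be written as a small pure function. This is a sketch under assumed names (should_flush, drain, flush_size, max_age are hypothetical); only asyncio.Queue comes from the slides.

```python
import asyncio


def should_flush(qsize, seconds_since_flush, flush_size, max_age):
    """True when the queue is non-empty and is either full enough or old enough."""
    if qsize == 0:
        return False
    return qsize > flush_size or seconds_since_flush > max_age


def drain(queue):
    """Remove and return everything currently in the queue without awaiting."""
    items = []
    while not queue.empty():
        items.append(queue.get_nowait())
    return items


async def demo():
    queue = asyncio.Queue(maxsize=10)
    for i in range(4):
        await queue.put(i)
    decisions = (
        should_flush(queue.qsize(), 0.5, flush_size=3, max_age=5.0),  # full enough
        should_flush(1, 6.0, flush_size=3, max_age=5.0),              # old enough
        should_flush(1, 0.5, flush_size=3, max_age=5.0),              # neither
        should_flush(0, 60.0, flush_size=3, max_age=5.0),             # empty queue
    )
    return decisions, drain(queue), queue.empty()


decisions, batch, is_empty = asyncio.run(demo())
```

Keeping the decision separate from the I/O makes the two thresholds easy to test without a database or a broker.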
Consumer #2
loop = asyncio.get_event_loop()  # shared by main() and consume()

def main():
    # Resolved by the flusher once the database connection is up.
    dbs_connected = asyncio.Future()
    # Bounded queue: put() waits when the batch is full, which pauses consuming.
    batch = asyncio.Queue(maxsize=settings.batch_max_size)
    asyncio.ensure_future(consume(batch, dbs_connected))
    asyncio.ensure_future(start_flushing(batch, dbs_connected))
    loop.run_forever()

async def consume(queue, dbs_connected):
    # Do not read from Kafka until the database is reachable.
    await asyncio.wait_for(dbs_connected, timeout=settings.wait_for_databases)
    consumer = AIOKafkaConsumer(
        settings.topic, loop=loop, bootstrap_servers=settings.queues_urls,
        group_id='consumers'
    )
    await consumer.start()
    try:
        async for msg in consumer:
            message = json.loads(msg.value.decode("utf-8"))
            await queue.put((message.get('received'), message.get('data')))
    finally:
        # Stop the consumer even if the loop above raises.
        await consumer.stop()
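
The dbs_connected Future works as a readiness gate: one coroutine resolves it, another waits on it with a timeout. A minimal standard-library sketch of that pattern, with the hypothetical helper wait_until_ready() and made-up timeouts:

```python
import asyncio


async def wait_until_ready(ready, timeout):
    """Return True if the future is resolved in time, False on timeout."""
    try:
        await asyncio.wait_for(ready, timeout=timeout)
        return True
    except asyncio.TimeoutError:
        return False


async def demo():
    loop = asyncio.get_running_loop()

    # Case 1: something (here a timer) resolves the gate before the timeout.
    ready = loop.create_future()
    loop.call_later(0.01, ready.set_result, True)
    opened = await wait_until_ready(ready, timeout=1.0)

    # Case 2: nobody resolves the gate, so the wait times out.
    never = loop.create_future()
    timed_out = not await wait_until_ready(never, timeout=0.05)
    return opened, timed_out


opened, timed_out = asyncio.run(demo())
```

Note that wait_for() cancels the future when it times out, so code that later resolves it should check done() first.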
Consumer #3
async def start_flushing(queue, dbs_connected):
    db_logg = await aiopg.create_pool(settings.logs_db_url)
    while True:
        async with db_logg.acquire() as logg_conn, logg_conn.cursor() as logg_cur:
            await keep_flushing(dbs_connected, logg_cur, queue)
        # keep_flushing() returned, so a store failed: wait, then reconnect.
        await asyncio.sleep(2)

async def keep_flushing(dbs_connected, logg_cur, queue):
    # Signal consume() only once; calling set_result() again after a
    # reconnect would raise InvalidStateError.
    if not dbs_connected.done():
        dbs_connected.set_result(True)
    last_stored_time = time.time()
    while True:
        if not queue.empty() and (queue.qsize() > settings.batch_flush_size or
                time.time() - last_stored_time > settings.batch_max_time):
            to_store = []
            while not queue.empty():
                to_store.append(await queue.get())
            try:
                # store_bulk() is not shown in the slides.
                await store_bulk(logg_cur, to_store)
            except Exception:
                break  # DB down, breaking to reconnect; to_store is dropped here
            last_stored_time = time.time()
        await asyncio.sleep(settings.batch_sleep)
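
The slides call store_bulk() but do not show it. One way it could look, as a sketch only: a single multi-row INSERT into the data table from the old code. build_bulk_insert() and FakeCursor are hypothetical, and the fake cursor replaces a real aiopg cursor so the example runs without Postgres.

```python
import asyncio
import json


def build_bulk_insert(rows):
    """Return (sql, params) for one multi-row INSERT over (received, data) pairs."""
    placeholders = ", ".join(["(%s, %s)"] * len(rows))
    sql = "INSERT INTO data (timestamp, data) VALUES " + placeholders
    params = []
    for received, data in rows:
        params.extend([received, json.dumps(data)])
    return sql, params


async def store_bulk(cursor, rows):
    """Hypothetical stand-in for the store_bulk() used in the slides."""
    if not rows:
        return
    sql, params = build_bulk_insert(rows)
    await cursor.execute(sql, params)


class FakeCursor:
    """Test double that records calls instead of talking to Postgres."""

    def __init__(self):
        self.calls = []

    async def execute(self, sql, params):
        self.calls.append((sql, params))


cursor = FakeCursor()
rows = [
    ("2016-01-01T00:00:00+00:00", {"a": 1}),
    ("2016-01-01T00:00:01+00:00", {"b": 2}),
]
asyncio.run(store_bulk(cursor, rows))
```

Sending the whole batch in one statement means one round trip to the database per flush instead of one per record, which is the point of batching in the consumer.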
Code is public on gitlab
https://gitlab.skypicker.com/ondrej/faqstorer
www.orwen.org
code.kiwi.com
www.kiwi.com/jobs/
Check graphs...


Python queue solution with asyncio and kafka


Editor's Notes

  • #3: Talk more about Kiwi.com Skyscanner, Momondo
  • #4: 5 TB Postgres Database
  • #7: PEP 492 -- Coroutines with async and await syntax, 09-Apr-2015 Python 3.5