Queue with asyncio and Kafka
Showcase
Ondřej Veselý
What kind of data we have
Problem: store JSON to database
Just a few records per second.
But
● Slow database
● Unreliable database
● Increasing traffic (20x)
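Old code

import psycopg2
import ujson
from datetime import datetime
from psycopg2.extras import DictCursor

def save_data(conn, cur, ts, data):
    # One INSERT and one commit per record.
    cur.execute(
        """INSERT INTO data (timestamp, data)
           VALUES (%s,%s) """, (ts, ujson.dumps(data)))
    conn.commit()

# `app`, `request` and `config` come from the surrounding web application (not shown on the slide).
@app.route('/store', method=['PUT', 'POST'])
def logstash_route():
    data = ujson.load(request.body)
    # A new database connection is opened for every request.
    conn = psycopg2.connect(**config.pg_logs)
    t = datetime.now()
    with conn.cursor(cursor_factory=DictCursor) as cur:
        for d in data:
            save_data(conn, cur, t, d)
    conn.close()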
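
The old handler commits once per record, so every record pays for its own round trip to a slow database. A minimal sketch of the batching idea, assuming an in-memory SQLite database as a stand-in for Postgres so that it runs anywhere; the `save_batch` helper and the table layout are illustrative and not taken from the talk:

```python
import json
import sqlite3
from datetime import datetime, timezone


def save_batch(conn, ts, records):
    """Insert all records with one executemany() call and commit once."""
    rows = [(ts, json.dumps(r)) for r in records]
    conn.executemany("INSERT INTO data (timestamp, data) VALUES (?, ?)", rows)
    conn.commit()  # a single commit for the whole batch
    return len(rows)


# SQLite stands in for Postgres here (assumption made for the sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (timestamp TEXT, data TEXT)")
stored = save_batch(
    conn,
    datetime.now(timezone.utc).isoformat(),
    [{"event": "a"}, {"event": "b"}, {"event": "c"}],
)
row_count = conn.execute("SELECT COUNT(*) FROM data").fetchone()[0]
print(stored, row_count)
```

The consumer slides later in the deck apply the same idea asynchronously: records are collected in a queue and written in bulk.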
Architecture
internet → /store (Kafka producer) → Kafka queue → Kafka consumer → Postgres
… time to kill consumer ...
Asyncio, example

import asyncio

async def factorial(name, number):
    f = 1
    for i in range(2, number+1):
        print("Task %s: Compute factorial(%s)..." % (name, i))
        await asyncio.sleep(1)
        f *= i
    print("Task %s: factorial(%s) = %s" % (name, number, f))

loop = asyncio.get_event_loop()
tasks = [
    asyncio.ensure_future(factorial("A", 2)),
    asyncio.ensure_future(factorial("B", 3)),
    asyncio.ensure_future(factorial("C", 4))]
loop.run_until_complete(asyncio.gather(*tasks))
loop.close()

Output:
Task A: Compute factorial(2)...
Task B: Compute factorial(2)...
Task C: Compute factorial(2)...
Task A: factorial(2) = 2
Task B: Compute factorial(3)...
Task C: Compute factorial(3)...
Task B: factorial(3) = 6
Task C: Compute factorial(4)...
Task C: factorial(4) = 24
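
On current Python versions the same demo is usually written with `asyncio.run()`, which creates and closes the event loop itself. A sketch under that assumption; the shortened sleep and the returned values are changes made for illustration:

```python
import asyncio


async def factorial(name, number, delay=0.01):
    """Compute number! while yielding to the event loop at every step."""
    f = 1
    for i in range(2, number + 1):
        print(f"Task {name}: Compute factorial({i})...")
        await asyncio.sleep(delay)  # other tasks run while this one sleeps
        f *= i
    print(f"Task {name}: factorial({number}) = {f}")
    return f


async def main():
    # gather() runs the three coroutines concurrently and returns
    # their results in the order they were passed in.
    return await asyncio.gather(
        factorial("A", 2),
        factorial("B", 3),
        factorial("C", 4),
    )


results = asyncio.run(main())
```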
What we used
Apache Kafka
Not ujson
Concurrency - doing lots of slow things at once.
No processes, no threads.

Producer

from aiohttp import web
import json

Consumer

import asyncio
import json
from aiokafka import AIOKafkaConsumer
import aiopg
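
"Doing lots of slow things at once" on a single thread can be illustrated with a small timing sketch. The `slow_io` coroutine is a hypothetical stand-in for a network or database call, and the delays are arbitrary:

```python
import asyncio
import threading
import time


async def slow_io(label, delay):
    """Pretend to wait on the network; report which thread ran the call."""
    await asyncio.sleep(delay)
    return label, threading.get_ident()


async def run_all(delay=0.1, count=10):
    started = time.monotonic()
    results = await asyncio.gather(*(slow_io(i, delay) for i in range(count)))
    elapsed = time.monotonic() - started
    thread_ids = {thread_id for _, thread_id in results}
    return elapsed, thread_ids


elapsed, thread_ids = asyncio.run(run_all())
# The ten waits overlap, so the total stays well below the sequential 1 s.
print(f"{elapsed:.2f}s on {len(thread_ids)} thread(s)")
```

All ten calls run on the same thread; the event loop simply switches between them whenever one of them is waiting.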
Producer #1

import arrow

async def kafka_send(kafka_producer, data, topic):
    # Wrap the payload together with the time it was received (UTC).
    message = {
        'data': data,
        'received': str(arrow.utcnow())
    }
    message_json_bytes = bytes(json.dumps(message), 'utf-8')
    # Returns only after the send has completed.
    await kafka_producer.send_and_wait(topic, message_json_bytes)

async def handle(request):
    post_data = await request.json()
    try:
        await kafka_send(request.app['kafka_p'], post_data, topic=settings.topic)
    except Exception:
        # Sending to Kafka failed: log it and shut the whole producer down.
        slog.exception("Kafka Error")
        await destroy_all()
    return web.Response(status=200)

app = web.Application()
app.router.add_route('POST', '/store', handle)
app['kafka_p'] = get_kafka_producer()
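
The message envelope can be tried out without Kafka. A minimal sketch, assuming the standard library `datetime` in place of `arrow`; `encode_message` and `decode_message` are hypothetical helper names, not part of the talk's code:

```python
import json
from datetime import datetime, timezone


def encode_message(data, received=None):
    """Wrap a payload with its receive time and serialise to UTF-8 JSON bytes."""
    if received is None:
        received = datetime.now(timezone.utc)
    envelope = {"data": data, "received": received.isoformat()}
    return json.dumps(envelope).encode("utf-8")


def decode_message(raw):
    """Inverse of encode_message: returns the (received, data) pair."""
    envelope = json.loads(raw.decode("utf-8"))
    return envelope.get("received"), envelope.get("data")


raw = encode_message({"level": "info"}, datetime(2016, 1, 1, tzinfo=timezone.utc))
received, data = decode_message(raw)
```

The `(received, data)` order mirrors the tuple that the consumer puts on its queue.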
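Producer #2

Destroying the loop

async def destroy_all():
    loop = asyncio.get_event_loop()
    current = asyncio.current_task()
    # asyncio.Task.all_tasks() was removed in Python 3.9; use asyncio.all_tasks().
    for task in asyncio.all_tasks():
        if task is not current:
            task.cancel()
    # stop() is a plain method, not a coroutine, so it must not be awaited.
    # close() is left out because a running loop cannot be closed.
    loop.stop()
    slog.debug("Exiting.")
    sys.exit()

Getting producer

def get_kafka_producer():
    loop = asyncio.get_event_loop()
    producer = AIOKafkaProducer(
        loop=loop,
        bootstrap_servers=settings.queues_urls,
        request_timeout_ms=settings.kafka_timeout,
        retry_backoff_ms=1000)
    # Connect to Kafka before the web application starts serving requests.
    loop.run_until_complete(producer.start())
    return producer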
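
Cancelling tasks is easier to reason about when the shutdown code also waits for the cancelled tasks to finish. A sketch of that pattern, assuming Python 3.7 or newer; the `worker` coroutine is a hypothetical long-running task and the whole block is an illustration, not the talk's implementation:

```python
import asyncio


async def worker(stopped):
    """Hypothetical long-running task; records when it is cancelled."""
    try:
        await asyncio.sleep(3600)
    except asyncio.CancelledError:
        stopped.append(True)
        raise  # re-raise so the task really ends up cancelled


async def shutdown():
    """Cancel every task except the caller and wait for them to finish."""
    current = asyncio.current_task()
    others = [t for t in asyncio.all_tasks() if t is not current]
    for task in others:
        task.cancel()
    # return_exceptions=True collects the CancelledErrors instead of raising.
    await asyncio.gather(*others, return_exceptions=True)
    return len(others)


async def main():
    stopped = []
    for _ in range(3):
        asyncio.ensure_future(worker(stopped))
    await asyncio.sleep(0)  # let the workers start
    cancelled = await shutdown()
    return cancelled, len(stopped)


cancelled, stopped_count = asyncio.run(main())
```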
Consumer #1
… time to resurrect consumer ...

[Flowchart: two loops connected by an asyncio.Queue()]

Consume
start → DB connected? (yes/no) → 1. Receive data record from Kafka → 2. Put it to the queue

Flush
start → Connect to DB → queue full enough or data old enough? (yes/no) → Store data from queue to DB
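
The flush decision in the flowchart ("queue full enough or data old enough") can be written as a small pure function. A sketch; the function name and the threshold values in the examples are assumptions made for illustration:

```python
def should_flush(queue_size, seconds_since_flush, flush_size, max_age):
    """Return True when the batch is full enough or old enough to be stored."""
    if queue_size == 0:
        return False  # nothing to store, however long ago the last flush was
    return queue_size > flush_size or seconds_since_flush > max_age


# Illustrative thresholds: flush above 100 records or after 5 seconds.
full_batch = should_flush(150, 0.5, flush_size=100, max_age=5.0)
old_batch = should_flush(3, 9.0, flush_size=100, max_age=5.0)
young_small_batch = should_flush(3, 0.5, flush_size=100, max_age=5.0)
empty_queue = should_flush(0, 60.0, flush_size=100, max_age=5.0)
```

Keeping the rule separate from the loop makes it easy to test without a queue or a database.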
Consumer #2

loop = asyncio.get_event_loop()

def main():
    # Resolved by the flusher once the database connection is up.
    dbs_connected = loop.create_future()
    batch = asyncio.Queue(maxsize=settings.batch_max_size)
    asyncio.ensure_future(consume(batch, dbs_connected))
    asyncio.ensure_future(start_flushing(batch, dbs_connected))
    loop.run_forever()

async def consume(queue, dbs_connected):
    # Do not read from Kafka until the database is reachable.
    await asyncio.wait_for(dbs_connected, timeout=settings.wait_for_databases)
    consumer = AIOKafkaConsumer(
        settings.topic, loop=loop, bootstrap_servers=settings.queues_urls,
        group_id='consumers'
    )
    await consumer.start()
    try:
        async for msg in consumer:
            message = json.loads(msg.value.decode("utf-8"))
            # put() waits while the queue is full, which pauses consumption.
            await queue.put((message.get('received'), message.get('data')))
    finally:
        await consumer.stop()
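
The "wait until the database is connected" gate can be tried in isolation. A minimal sketch using only the standard library; `connect_db`, `consume_when_ready` and the delays are hypothetical stand-ins for the real database and Kafka code:

```python
import asyncio


async def consume_when_ready(db_ready, timeout):
    """Wait for the 'database connected' signal before starting to consume."""
    try:
        await asyncio.wait_for(db_ready, timeout=timeout)
    except asyncio.TimeoutError:
        return "gave up"
    return "consuming"


async def connect_db(db_ready, delay):
    """Hypothetical connector: signals readiness after `delay` seconds."""
    await asyncio.sleep(delay)
    if not db_ready.done():  # a Future accepts only one result
        db_ready.set_result(True)


async def demo(connect_delay, timeout):
    db_ready = asyncio.get_running_loop().create_future()
    connector = asyncio.ensure_future(connect_db(db_ready, connect_delay))
    outcome = await consume_when_ready(db_ready, timeout)
    connector.cancel()  # no effect if the connector has already finished
    return outcome


fast = asyncio.run(demo(connect_delay=0.01, timeout=1.0))
slow = asyncio.run(demo(connect_delay=1.0, timeout=0.01))
```

When the timeout expires, `wait_for` cancels the future it was waiting on, so a late connection attempt has to check `done()` before setting a result.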
Consumer #3

import time

async def start_flushing(queue, dbs_connected):
    db_logg = await aiopg.create_pool(settings.logs_db_url)
    while True:
        async with db_logg.acquire() as logg_conn, logg_conn.cursor() as logg_cur:
            await keep_flushing(dbs_connected, logg_cur, queue)
        # keep_flushing returned: the DB went down, retry after a pause.
        await asyncio.sleep(2)

async def keep_flushing(dbs_connected, logg_cur, queue):
    # This function runs again after every reconnect, but a Future
    # accepts a result only once.
    if not dbs_connected.done():
        dbs_connected.set_result(True)
    last_stored_time = time.time()
    while True:
        if not queue.empty() and (queue.qsize() > settings.batch_flush_size or
                time.time() - last_stored_time > settings.batch_max_time):
            to_store = []
            while not queue.empty():
                to_store.append(await queue.get())
            try:
                # store_bulk is not shown on the slides.
                await store_bulk(logg_cur, to_store)
            except Exception:
                # Note: the drained batch in to_store is not re-queued here.
                break  # DB down, breaking to reconnect
            last_stored_time = time.time()
        await asyncio.sleep(settings.batch_sleep)
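
In the loop as shown, a batch that has already been drained from the queue is not kept if `store_bulk` fails. One possible variation, sketched here with in-memory stand-ins for the database, holds the batch in a `pending` list until the write succeeds. The names `drain`, `flush_once`, `broken_store` and `working_store` are hypothetical and this is not the talk's implementation:

```python
import asyncio


def drain(queue):
    """Take everything currently in the queue without waiting."""
    items = []
    while not queue.empty():
        items.append(queue.get_nowait())
    return items


async def flush_once(queue, store, pending):
    """Store one batch; on failure the batch stays in `pending` for a retry."""
    pending.extend(drain(queue))
    if not pending:
        return 0
    await store(list(pending))  # raises if the database is down
    stored = len(pending)
    pending.clear()  # forget the records only once they are stored
    return stored


async def demo():
    queue = asyncio.Queue()
    for i in range(3):
        queue.put_nowait(i)
    saved, pending = [], []

    async def broken_store(batch):
        raise ConnectionError("DB down")

    async def working_store(batch):
        saved.extend(batch)

    try:
        await flush_once(queue, broken_store, pending)
    except ConnectionError:
        pass  # the caller would reconnect here
    queue.put_nowait(3)
    stored = await flush_once(queue, working_store, pending)
    return stored, saved


stored, saved = asyncio.run(demo())
```

The failed batch is written together with the newer record on the next attempt, at the cost of holding unsent records in memory while the database is down.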
Code is public on GitLab
https://gitlab.skypicker.com/ondrej/faqstorer
www.orwen.org
code.kiwi.com
www.kiwi.com/jobs/
Check graphs...


Editor's Notes

  • #3: Talk more about Kiwi.com Skyscanner, Momondo
  • #4: 5 TB Postgres Database
  • #7: PEP 492 -- Coroutines with async and await syntax, 09-Apr-2015 Python 3.5