iFood on Delivering 100 Million Events a Month to Restaurants with Scylla

Multiple Device Polling
Events Systems
Thales Biancalana, Senior Backend Developer
A case study from iFood

Presenter
Thales Biancalana, Senior Backend Developer at iFood
Control and Automation Engineer who decided that
programming is more exciting than building robots. Has worked on
multiple applications using .NET, Node, React, Swift and Java, and is
now a backend developer at iFood. Always looking for
new challenges and different ways to solve them.

iFood
■ Food-tech delivery business.
■ Main delivery app in Brazil and also present in Colombia and Mexico
■ > 100k merchants
■ > 20M/month

iFood Infrastructure
■ Migrating to an event-driven microservice architecture

Connection Team
Responsible for delivering order events to merchants
Connection
Merchant App
Integrations

Polling Services
Multiple polling systems running in parallel
[Diagram labels: App, Proxy Service, Service A, Service B; GET /events (Orders Events); POST /acks (Acknowledgements)]

Polling Services
■ Why?
■ HTTP requests every 30 seconds for each device
■ The database is queried on every call
■ Heavy queries on read nodes: “all non-acked events by the device”
■ Mid-term goal to support 500k connected merchants with 1 device each
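
A back-of-the-envelope calculation shows why this load matters. The sketch below assumes every device polls exactly once per interval and that requests are spread evenly over time, which is a simplification; the function name is illustrative and not part of any iFood service.

```python
def polls_per_second(devices: int, interval_seconds: float) -> float:
    """Average request rate if each device polls once per interval."""
    return devices / interval_seconds


# Mid-term goal from the slide: 500k merchants, 1 device each, polling every 30 s.
rate = polls_per_second(500_000, 30)
print(f"{rate:,.0f} requests/second")  # 16,667 requests/second
```

That is roughly 16.7k database-backed reads per second before any acknowledgement writes are counted.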

Polling Services
Multiple polling systems running in parallel:
■ Proxy Service: Kitchen-Polling
■ Service A: Gateway-Core (PostgreSQL) - Dead
■ Service B: Connection-Order-Events (Apache Ignite)
■ Service C: Connection-Polling (DynamoDB) - Dying
■ Service D: Connection-Polling (ScyllaDB)

PostgreSQL Legacy Service
■ Events indexed in one table and the acknowledgements in another
■ Reads (JOINs) were becoming a problem as the number
of events and merchants increased
■ Master node suffering under increasing load
■ Single point of failure

PostgreSQL Legacy Service - Data

Apache Ignite
Connection-Order-Events

Apache Ignite Service
■ Works really well (reads take ~3 ms)
Problems:
■ Hard to monitor, as service and database are one
■ We need to save events in a second database, used when adding
machines or recovering from disasters (more code to maintain)
■ It takes longer to get the service back up, as it needs to fill the cache
from the PostgreSQL database. That's why we have a fallback system
for when it is down

NoSQL Modeling
■ Our main query?
● All events that were not acked by a device
■ Orders (and events) belong to merchants, not devices
● We need the merchant devices when saving events
■ What to do with new devices?
● Return all merchant events and save them to the not-acked-by-device
table
■ We are only interested in events within 8 hours of their delivery
time
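
The modeling above can be sketched as an in-memory stand-in. This is a minimal illustration, not iFood's schema: the class, its method names, and the dictionary layout are all assumptions, and a real implementation would use database tables and TTLs instead of Python dictionaries.

```python
from collections import defaultdict
from datetime import datetime, timedelta

WINDOW = timedelta(hours=8)  # only events within 8 hours of delivery time matter


class PerDeviceModel:
    """Hypothetical in-memory version of the 'not acked by device' modeling."""

    def __init__(self):
        self.merchant_devices = defaultdict(set)  # merchant_id -> device ids
        self.merchant_events = defaultdict(dict)  # merchant_id -> {event_id: delivery_time}
        self.not_acked = defaultdict(dict)        # device_id -> {event_id: delivery_time}

    def register_device(self, merchant_id, device_id, now):
        """New device: copy the merchant's recent events into its not-acked set."""
        self.merchant_devices[merchant_id].add(device_id)
        for event_id, delivery_time in self.merchant_events[merchant_id].items():
            if now - delivery_time <= WINDOW:
                self.not_acked[device_id][event_id] = delivery_time

    def save_event(self, merchant_id, event_id, delivery_time):
        """Write path: fan the event out to every known device of the merchant."""
        self.merchant_events[merchant_id][event_id] = delivery_time
        for device_id in self.merchant_devices[merchant_id]:
            self.not_acked[device_id][event_id] = delivery_time

    def poll(self, device_id, now):
        """Main query: events not yet acked by this device, within the window."""
        return sorted(
            event_id
            for event_id, delivery_time in self.not_acked[device_id].items()
            if now - delivery_time <= WINDOW
        )

    def ack(self, device_id, event_ids):
        """Acknowledging an event removes it from the device's not-acked set."""
        for event_id in event_ids:
            self.not_acked[device_id].pop(event_id, None)


now = datetime(2020, 1, 1, 12, 0)
model = PerDeviceModel()
model.register_device("merchant-1", "tablet", now)
model.save_event("merchant-1", "order-1:PLACED", now)
model.register_device("merchant-1", "phone", now)  # added mid-day, gets backfilled
model.ack("tablet", ["order-1:PLACED"])
print(model.poll("tablet", now))  # []
print(model.poll("phone", now))   # ['order-1:PLACED']
```

The read is a cheap lookup by device, but every write needs the merchant's device list, and each new device triggers a backfill.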

DynamoDB Service
■ Why DynamoDB?
● Try a NoSQL approach
● Most of our infrastructure is in AWS
● Fully managed solution

DynamoDB Service - Issues
Issues with DynamoDB for our use:
■ DynamoDB autoscaling was not fast enough for our use case unless
we left a high minimum throughput or managed it ourselves
● Defeats the purpose of a fully managed solution
■ DynamoDB's new on-demand mode is great, but expensive

ScyllaDB
Connection-Polling-v2

ScyllaDB Service A
■ Quite easy to migrate from DynamoDB to Scylla with the same
modeling. Should be even easier with the new Project Alternator

ScyllaDB Service A - Results
■ How did it compare with DynamoDB?
● We started with a cluster of three c5.2xlarge machines that easily held
the throughput. This was a nearly 9x reduction in database cost, with
headroom for more throughput (from around $4.5k to $500/month)

ScyllaDB Service A - Learnings
■ Scylla applies TTL per column, whereas DynamoDB sets an expiration
time per document
■ Scylla Support: we identified a bug when reading pages from a
secondary index with prepared statements. After opening a GitHub
issue we had a new build with the fix in less than 4 days
(https://github.com/scylladb/scylla/issues/4569)

Modeling Issues
Issues with this modeling:
■ We need to manage restaurant devices
■ Need to manage old events for new devices
● It may be quite heavy to introduce a new device in the middle of the day

ScyllaDB Service B
Second modeling using collections.
Drawbacks:
■ Reads are expected to be slower (okay as a fallback system)
Advantages:
■ Less complex
■ Events table can be used to populate the Ignite cache
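
The schema for this second modeling is not in the slide text, so the following is only one plausible reading of "using collections": each event row carries a set of the devices that have acknowledged it. The class and method names are hypothetical, and Python dictionaries and sets stand in for the table and its collection column.

```python
from collections import defaultdict


class CollectionModel:
    """Hypothetical in-memory version: one row per event, with a set of acking devices."""

    def __init__(self):
        # merchant_id -> {event_id: set of device ids that acked the event}
        self.events = defaultdict(dict)

    def save_event(self, merchant_id, event_id):
        """Write path: a single insert, with no need to know the merchant's devices."""
        self.events[merchant_id].setdefault(event_id, set())

    def ack(self, merchant_id, event_id, device_id):
        """Collection update: add the device to the event's acked set."""
        if event_id in self.events[merchant_id]:
            self.events[merchant_id][event_id].add(device_id)

    def poll(self, merchant_id, device_id):
        """Read path: scan the merchant's events and drop those this device acked."""
        return sorted(
            event_id
            for event_id, acked_by in self.events[merchant_id].items()
            if device_id not in acked_by
        )


model = CollectionModel()
model.save_event("merchant-1", "order-1:PLACED")
model.ack("merchant-1", "order-1:PLACED", "tablet")
print(model.poll("merchant-1", "tablet"))  # []
print(model.poll("merchant-1", "phone"))   # ['order-1:PLACED']
```

In this sketch a device that was never registered simply sees every event it has not acked, so there is no device management or backfill. The price is that each read filters all of the merchant's events, which is consistent with the slower reads listed as a drawback.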

ScyllaDB Service B - Results
The good:
■ Nearly 9x database cost reduction compared with DynamoDB
on-demand
■ Time to index events dropped from ~80 ms to ~3 ms, which resulted in a
nearly 8x infrastructure reduction for writes
■ Solution complexity reduced from 4 tables and 2 indexes to 2 tables
and 1 index, with 40% less code
The bad:
■ Increase in read times, worth it for now as a fallback system
■ Collection updates are CPU-intensive and generate tombstones, so
use them carefully

Final Thoughts
■ Scylla was cheaper than DynamoDB, but we created
our own cluster on AWS machines
● Take into consideration the cost of maintaining a cluster. Learn from other talks how
easy it is to maintain a cluster when choosing between databases.
● But we have had no problems as of now
■ Check what you know about your domain and problem; it can be used
to simplify the solution
● Knowing it was a fallback system, and knowing the average number of devices per merchant
and orders per merchant, led me to believe that collection updates were a good trade-off,
though they should be used carefully

Final Thoughts
■ Get to know all features of your database before using them
● Collection updates are not cheap! Each update incurs a tombstone, which
slows down reads and gives more work to the garbage collector. We are still toying
with gc_grace to improve performance.
● ScyllaDB secondary indexes are global by default, which was a good thing for our
second solution, where the index has a cardinality as high as the number of
merchants (a bit more than 100k merchants online today). The same could be achieved in
Cassandra with Materialized Views.
● Global is the default, but it may not always be the best choice, so Scylla also
supports local indexes and you need to know when to use each.

Next Steps
■ No-acknowledgement polling solution using Scylla
■ Force Scylla to fail
■ Working on an MQTT pub/sub solution

Thank you
Stay in touch
Any questions?
Thales Biancalana
thales.biancalana@ifood.com.br