Multiple Device Polling Events Systems
Thales Biancalana, Senior Backend Developer
A case study from iFood
Presenter
Thales Biancalana, Senior Backend Developer at iFood
Control and Automation Engineer who decided that
programming is more exciting than building robots. Has worked on
multiple applications using .NET, Node, React, Swift and Java,
and now works as a backend developer at iFood. Always looking for
new challenges and different ways to solve them.
Context
iFood
■ Food-tech delivery business.
■ Main delivery app in Brazil and also present in Colombia and Mexico
■ > 100k merchants, > 20M orders/month
iFood apps
iFood Infrastructure
■ Migrating to a microservice, event-driven architecture
Connection Team
Responsible for delivering orders' events to merchants
[Diagram: Connection delivers order events to the Merchant App and to Integrations]
Polling Services
Multiple polling systems running in parallel
[Diagram: order events are indexed by Service A and Service B in parallel; the app goes through a proxy service, fetching events with GET /events and sending acknowledgements with POST /acks]
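The poll/acknowledge cycle in this diagram can be sketched as below. This is a minimal in-memory illustration under stated assumptions, not iFood's implementation: the `PollingService` class and its method names are hypothetical stand-ins for the `GET /events` and `POST /acks` endpoints.

```python
from collections import defaultdict


class PollingService:
    """Hypothetical in-memory stand-in for GET /events and POST /acks."""

    def __init__(self):
        self.events = {}                # event_id -> payload
        self.acked = defaultdict(set)   # device_id -> acknowledged event ids

    def index_event(self, event_id, payload):
        # Called when an order event arrives from the platform.
        self.events[event_id] = payload

    def get_events(self, device_id):
        # GET /events: every event this device has not acknowledged yet.
        return {eid: payload for eid, payload in self.events.items()
                if eid not in self.acked[device_id]}

    def post_acks(self, device_id, event_ids):
        # POST /acks: acknowledged events are not returned on later polls.
        self.acked[device_id].update(event_ids)


service = PollingService()
service.index_event("e1", "order placed")
service.index_event("e2", "order confirmed")

first_poll = service.get_events("device-1")       # both events
service.post_acks("device-1", first_poll.keys())  # acknowledge them
second_poll = service.get_events("device-1")      # nothing left for this device
```

Acknowledgements are tracked per device, so a second device of the same merchant would still receive both events.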
Polling Services
■ HTTP requests every 30 seconds for each device
■ The database is queried on every call
■ Heavy queries on read nodes: “all events not acked by the device”
■ Mid-term goal: support 500k connected merchants with 1 device
each
■ Why polling instead of pub/sub?
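To put these numbers in perspective, the request rate implied by the mid-term goal can be estimated with a short calculation. It assumes polls are spread evenly over time and that every poll reaches the database, as the slide states; real traffic peaks around lunch and dinner.

```python
POLL_INTERVAL_SECONDS = 30   # from the slide: one request every 30 s per device
TARGET_DEVICES = 500_000     # mid-term goal: 500k merchants with 1 device each

# Assumption: polls are evenly distributed across the interval.
requests_per_second = TARGET_DEVICES / POLL_INTERVAL_SECONDS
requests_per_day = requests_per_second * 60 * 60 * 24

print(f"{requests_per_second:,.0f} requests/s")
print(f"{requests_per_day:,.0f} requests/day")
```

That is roughly 16.7k database reads per second at the target size, before any peak-hour effect.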
Polling Services
Multiple polling systems running in parallel:
■ Proxy Service: Kitchen-Polling
■ Service A: Gateway-Core (PostgreSQL) - Dead
■ Service B: Connection-Order-Events (Apache Ignite)
■ Service C: Connection-Polling (DynamoDB) - Dying
■ Service D: Connection-Polling (ScyllaDB)
PostgreSQL
Gateway-Core
PostgreSQL Legacy Service
■ Events indexed in one table and the acknowledgements in another
■ Reads (JOINs) were starting to become a problem as the number
of events and merchants increased
■ Master node “suffering with increasing load”
■ Single point of failure
PostgreSQL Legacy Service - Data
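The relational layout described for the legacy service (one events table, one acknowledgements table, joined at poll time) can be sketched as follows. SQLite stands in for PostgreSQL so the example runs anywhere, and all table and column names are assumptions; the slides do not show the real schema.

```python
import sqlite3

# SQLite stands in for PostgreSQL; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (
        id          TEXT PRIMARY KEY,
        merchant_id TEXT NOT NULL,
        payload     TEXT NOT NULL
    );
    CREATE TABLE acknowledgements (
        event_id  TEXT NOT NULL REFERENCES events(id),
        device_id TEXT NOT NULL,
        PRIMARY KEY (event_id, device_id)
    );
""")
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", [
    ("e1", "m1", "order placed"),
    ("e2", "m1", "order confirmed"),
])
conn.execute("INSERT INTO acknowledgements VALUES ('e1', 'd1')")


def unacked_events(merchant_id, device_id):
    # Anti-join: keep events that have no acknowledgement row for this device.
    rows = conn.execute("""
        SELECT e.id
        FROM events e
        LEFT JOIN acknowledgements a
               ON a.event_id = e.id AND a.device_id = ?
        WHERE e.merchant_id = ? AND a.event_id IS NULL
        ORDER BY e.id
    """, (device_id, merchant_id))
    return [row[0] for row in rows]
```

Every poll runs this join, which is why read cost grows with both the number of events and the number of polling devices.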
Apache Ignite
Connection-Order-Events
Apache Ignite Service
■ Works really well (reads take ~3 ms)
Problems:
■ Hard to monitor, as service and database are one
■ We need to save events in another database, used when adding
machines or recovering from disasters (more code to maintain)
■ It takes longer to get the service back up, as it needs to fill the cache
from the PostgreSQL database. That's why we have a fallback system
for when it is down
NoSQL
NoSQL Modeling
■ Our main query?
● All events that were not acked by a device
■ Orders (and events) belong to merchants, not devices
● We need the merchant devices when saving events
■ What to do with new devices?
● Return all merchant events and save them to the "not acked by
device" table
■ We are only interested in events from the last 8 hours, counted from the delivery
time
NoSQL Modeling
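A minimal in-memory sketch of this first model follows, assuming the three tables mentioned in the speaker notes (unacked device events, merchant events, merchant devices). Class, method and field names are hypothetical, and the 8-hour expiry is left out for brevity.

```python
from collections import defaultdict


class FirstNoSqlModel:
    """Hypothetical in-memory sketch of the three-table model."""

    def __init__(self):
        self.merchant_devices = defaultdict(set)     # merchant_id -> device ids
        self.merchant_events = defaultdict(dict)     # merchant_id -> {event_id: payload}
        self.unacked_by_device = defaultdict(dict)   # device_id -> {event_id: payload}

    def save_event(self, merchant_id, event_id, payload):
        # Events belong to merchants, so the write fans out to every known device.
        self.merchant_events[merchant_id][event_id] = payload
        for device_id in self.merchant_devices[merchant_id]:
            self.unacked_by_device[device_id][event_id] = payload

    def poll(self, merchant_id, device_id):
        if device_id not in self.merchant_devices[merchant_id]:
            # New device: register it and backfill all merchant events.
            self.merchant_devices[merchant_id].add(device_id)
            self.unacked_by_device[device_id].update(
                self.merchant_events[merchant_id])
        # Main query: a single-key lookup of what this device has not acked.
        return dict(self.unacked_by_device[device_id])

    def ack(self, device_id, event_ids):
        for event_id in event_ids:
            self.unacked_by_device[device_id].pop(event_id, None)


model = FirstNoSqlModel()
model.save_event("m1", "e1", "order placed")   # no devices registered yet
first = model.poll("m1", "d1")                 # new device gets the backfill
model.save_event("m1", "e2", "order confirmed")
model.ack("d1", ["e1"])
second = model.poll("m1", "d1")
```

Reads stay cheap because the main query is a lookup by device, at the price of extra writes per event and a backfill whenever a device first appears.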
DynamoDB
Connection-Polling
DynamoDB Service
■ Why DynamoDB?
● Try a NoSQL approach
● Most of our infrastructure is on AWS
● Fully managed solution
DynamoDB Service - Issues
Issues with DynamoDB for our use:
■ DynamoDB autoscaling was not fast enough for our use case unless
we kept a high minimum throughput or managed it ourselves
● Defeats the purpose of a fully managed solution
■ DynamoDB's new on-demand mode is great, but expensive
ScyllaDB
Connection-Polling-v2
ScyllaDB Service A
■ Quite easy to migrate from DynamoDB to Scylla with the same
data model. Should be even easier with the new Project Alternator
ScyllaDB Service A - Results
■ How did it compare with DynamoDB?
● We started with a cluster of three c5.2xlarge machines that easily held
the throughput. This was a nearly 9x database cost reduction, with room
for even more throughput (from around $4.5k to $500/month)
ScyllaDB Service A - Learnings
■ Scylla applies TTL per column, while DynamoDB sets the expiration time per
document
■ Scylla Support: we identified a bug when reading pages from a
secondary index with prepared statements. After opening a GitHub
issue, we had a new build with the fix in less than 4 days
(https://github.com/scylladb/scylla/issues/4569)
Modeling Issues
Issues with this modeling:
■ We need to manage restaurant devices
■ Need to manage old events for new devices
● It may be quite heavy to introduce a new device in the middle of the day
ScyllaDB Service B
Second model, using collections.
Drawbacks:
■ Reads are expected to be slower (okay as a fallback system)
Advantages:
■ Less complex
■ Events table can be used to populate the Ignite cache
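The collections approach can be sketched in memory as below. The slides do not show the actual schema, so the `EventRow` fields and the `CollectionsModel` methods are hypothetical; the sketch assumes each event row carries a set of the devices that acknowledged it and that the "not acked by this device" filter runs in the application.

```python
from dataclasses import dataclass, field


@dataclass
class EventRow:
    event_id: str
    merchant_id: str
    payload: str
    acked_devices: set = field(default_factory=set)  # the collection column


class CollectionsModel:
    """Hypothetical in-memory sketch of the second, collection-based model."""

    def __init__(self):
        self.events = {}  # event_id -> EventRow

    def save_event(self, merchant_id, event_id, payload):
        # One write per event, no matter how many devices the merchant has.
        self.events[event_id] = EventRow(event_id, merchant_id, payload)

    def ack(self, device_id, event_ids):
        # Collection update: add the device to each event's acked set.
        for event_id in event_ids:
            self.events[event_id].acked_devices.add(device_id)

    def poll(self, merchant_id, device_id):
        # Fetch the merchant's events (an index lookup in a real database),
        # then drop the ones this device already acknowledged.
        return [row.event_id for row in self.events.values()
                if row.merchant_id == merchant_id
                and device_id not in row.acked_devices]


model = CollectionsModel()
model.save_event("m1", "e1", "order placed")
model.save_event("m1", "e2", "order confirmed")
model.ack("d1", ["e1"])
```

There is no device registry and no backfill for new devices, which is where the reduced complexity comes from; the cost moves to the read path, which scans all of the merchant's recent events.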
ScyllaDB Service B - Catch
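According to the speaker notes, the catch is that TTL is column based: an acknowledgement written later to the acked devices column did not carry the TTL of the original insert, so that data was never deleted. The toy model below illustrates the effect; the `Row` class is purely illustrative, and writing the acknowledgement with its own TTL (for example with CQL's `USING TTL` clause) is shown only as one possible remedy, not as what the team actually did.

```python
class Row:
    """Toy row whose cells each carry their own expiry (per-column TTL)."""

    def __init__(self):
        self.cells = {}  # column -> (value, expires_at or None)

    def write(self, column, value, now, ttl=None):
        expires_at = None if ttl is None else now + ttl
        self.cells[column] = (value, expires_at)

    def live_columns(self, now):
        return sorted(col for col, (_, expires_at) in self.cells.items()
                      if expires_at is None or expires_at > now)


EIGHT_HOURS = 8 * 60 * 60

row = Row()
row.write("payload", "order placed", now=0, ttl=EIGHT_HOURS)
# The ack arrives later and is written without a TTL: this is the catch.
row.write("acked_devices", {"d1"}, now=60)
leftover = row.live_columns(now=EIGHT_HOURS + 1)

# Assumed remedy: give the ack write a TTL too, so the whole row expires.
fixed = Row()
fixed.write("payload", "order placed", now=0, ttl=EIGHT_HOURS)
fixed.write("acked_devices", {"d1"}, now=60, ttl=EIGHT_HOURS)
gone = fixed.live_columns(now=2 * EIGHT_HOURS)
```

In the first case the payload expires but the acked devices cell lives on; in the second nothing is left once both TTLs have passed.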
ScyllaDB Service B - Results
The good:
■ Nearly 9x database cost reduction compared with DynamoDB
on-demand
■ Event indexing time reduced from ~80 ms to ~3 ms, which resulted in a
nearly 8x infrastructure reduction for writes
■ Solution complexity reduced from 4 tables and 2 indexes to 2 tables
and 1 index, with 40% less code
The bad:
■ Increase in read times, worth it for now as a fallback system
■ Collection updates are CPU intensive and generate tombstones ->
use carefully
Final Thoughts
■ Scylla was cheaper than DynamoDB, but we created
a cluster on AWS machines
● Take into consideration the cost of maintaining a cluster. Learn from other talks how
easy it is to maintain a cluster when choosing between databases.
● But we have had no problems so far
■ Check what you know about your domain and problem; it can be used
to simplify the solution
● Knowing it was a fallback system, along with the average number of devices per merchant
and orders per merchant, led me to believe that collection updates, which should be
used carefully, were a good trade-off
Final Thoughts
■ Get to know all the features of your database before using them
● Collection updates are not cheap! Each update incurs a tombstone, which
slows down reads and gives more work to the garbage collector. We are still toying
with gc_grace to improve performance
● ScyllaDB secondary indexes are global by default, which was a good thing for our
second solution, where the index has a cardinality as high as the number of
merchants (a bit more than 100k merchants online today). The same could be achieved in
Cassandra with Materialized Views.
● Global is the default, but it may not always be the best choice, so Scylla also
supports local indexes and you need to know when to use each.
Next Steps
■ No acknowledgment polling solution using Scylla
■ Force Scylla to fail
■ Working on an MQTT pub/sub solution
Thank you Stay in touch
Any questions?
Thales Biancalana
thales.biancalana@ifood.com.br

iFood on Delivering 100 Million Events a Month to Restaurants with Scylla


Editor's Notes

  • #2: Hi everyone, my name is Thales. I’m here today because I’ve seen a lot of Scylla presentations talking about how awesome it is from a tech perspective, like how many ops it handles, how it compares with Cassandra as a drop-in replacement and things like that, so I'm here to give a different perspective: what it was like to develop an application using Scylla, from a developer's point of view. I will not go into how to maintain the infrastructure, only monitoring and costs.
  • #3: (I'll probably skip this slide, but I'll leave it here) So as I said I’m Thales
  • #4: Let me start by giving a little bit of context of what we do at iFood
  • #5: iFood is a food tech delivery business. It is the main delivery app in Brazil, but we are present in other countries: Colombia and Mexico. We connect over 12 million users, 100 thousand merchants - mostly restaurants today, and deliverymen to deliver a bit over 20 million orders a month as of now, which amounts to a bit over 100 million events going through the platform every month.
  • #6: So here is the user app on the left and the merchant web app on the right
  • #7: Something relevant is how fast iFood grew. It went from 1 million to 20 million orders a month in a bit over two years. Because of that we still have some legacy services being broken into microservices using Java, Node, Docker and Kubernetes. This was only possible using a cloud service, and most of iFood's infrastructure runs on AWS, which is why we are still using SNS and SQS to move events around our platform. We use other technologies, but I'll mostly focus on these for our problem. Despite its size, iFood is not an established tech company as of now, and with the growing issues we are facing we are always looking for new ways to scale the infrastructure, which is what I'll try to share with you guys today. Most of our database infrastructure today runs on PostgreSQL and DynamoDB, which is not scaling well as it is becoming expensive.
  • #8: Now to talk about the project I'll first have to introduce the team I work in: Connection.
  • #9: We are responsible, among other things, for delivering order events to merchants, either directly to the merchant app I showed before or to integrations for huge food companies. One of the ways this is done today is with a polling API.
  • #10: So now I'll present the polling services we've worked on until we got to the Scylla solution.
  • #11: Events arrive from the platform via SNS-SQS and are indexed in multiple services running in parallel so we can compare them. These events are polled from the app via a GET /events endpoint and are acknowledged via a POST /acks endpoint. The app then sends an acknowledgement for each event it receives, so as not to receive it again on the next event poll.
  • #12: The polling is done every 30 seconds for each device. The database will be invoked on each /events call. We have heavy queries on the read nodes: all events not acked by the device. Something that I want to address now: why are we using polling instead of something with a pub/sub approach? We do have an MQTT service, which we are still developing, but unfortunately we also need to support external integrations, and a lot of them are not tech savvy, so having a REST API is a strategic advantage for having more merchants without going after them.
  • #13: This is just to give names to all the services we developed
  • #15: We started with a monolith polling system over a PostgreSQL database that was the core of iFood for a long time. Reads were starting to become a problem as we got close to 10 million orders a month. We could solve it for some time by replicating to more databases and scaling the master vertically, but since we were separating the polling system from other functionalities, the team took this opportunity to work on something better.
  • #16: Just to give you a better understanding, this was the relational data format. We had the events table at the top and an acknowledgements table with the event id and the device id for each acknowledgement. We would then join both tables for the polling result.
  • #18: Our second approach was to deliver events using the Apache Ignite in-memory database, indexing events and acknowledgements. We decided to use Apache Ignite because we already used it in another service. It was put in place around October last year. It works really well and is currently the primary polling system at iFood. When we first deployed it, it had the Postgres solution as a fallback. It works wonderfully, but after working with it for some time we had some bones to pick with it. First, service and database are one, so we need to be really careful about deployments and scaling (one machine at a time), and, although not a problem directly with Ignite, we had multiple issues with AWS ELB discovery for the machines to talk with each other. We also need to save the events/acks in another database for when we add machines or recover from disasters. With that in mind, and thinking about removing the Postgres solution as a fallback, we started working on our first NoSQL solution.
  • #20: So what do we know about our domain? First, that we want all events not acked by a device. Second, that orders (and events) belong to a merchant, not to the device, so we need to know the merchant devices when saving the device events. We also need to index the events by merchant to query them when introducing a new device. Also, we are only interested in events from the last 8 hours from the delivery time. I say delivery time because we may have scheduled orders.
  • #21: So this was our first NoSQL model. We have a table for unacked device events, one table for the restaurant or merchant events and another for the restaurant devices. I’m just going to point out that we introduced restaurants as merchants not so long ago, so we sometimes still use the term restaurant.
  • #23: Now we get to the good part, NoSQL, where I'll go into more detail on the implementation than I did for the other solutions. But first, why did we choose DynamoDB as our first NoSQL solution? First, we wanted to try a NoSQL solution; second, we were already in the AWS ecosystem; and third, it is a fully managed solution.
  • #24: As you can see, the solution is quite complex. We need to manage the restaurant devices and the events for new devices. Another problem with this solution is that DynamoDB autoscaling was not fast enough unless we left the read and write capacities high enough, which would defeat the purpose of cutting costs. DynamoDB autoscaling only happens every 5 minutes, which is not fast enough for us: lunch and especially dinner go from 0 to max throughput quite fast. We are currently using on-demand, but it is expensive. We could do the autoscaling ourselves, but it would no longer be a fully managed solution. It was around the time Scylla got in contact with our DBAs and started working on a new Scylla. The main problem we saw was the cost. From the AWS documentation: "The scaling policy also contains a target utilization—the percentage of consumed provisioned throughput at a point in time. Application Auto Scaling uses a target tracking algorithm to adjust the provisioned throughput of the table (or index) upward or downward in response to actual workloads, so that the actual capacity utilization remains at or near your target utilization. You can set the auto scaling target utilization values between 20 and 90 percent for your read and write capacity." https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/AutoScaling.html
  • #26: The first implementation using Scylla was a direct comparison between the Scylla and DynamoDB solutions, so we implemented the same data model. Because of this we could use the same code base and only change the DAO.
  • #27: This CPU load chart was taken from the Scylla Grafana overview dashboard provided by the Scylla team.
  • #28: I'll talk about Scylla collections
  • #29: As you can see, the solution is quite complex. We need to manage the restaurant devices and the events for new devices.
  • #30: Reads are slower, which is ok for a fallback system. They could be faster if NOT CONTAINS was supported on SETs (which is not supported in Cassandra, as it is not usually a good approach).
  • #31: Remember what I said about Scylla TTL? It is column based, not document based, so the new acked devices column would not have the TTL, and thus it would never be deleted.
  • #34: This is probably the most important slide
  • #35: This is probably the most important slide. References: https://www.scylladb.com/2019/07/23/global-or-localsecondary-indexes-in-scylla-the-choice-is-now-yours/ and https://thelastpickle.com/blog/2018/03/21/hinted-handoff-gc-grace-demystified.html