Processing Terabytes of data every day
… and sleeping at night
@katavic_d - @loige
User Group
Dublin
05/02/2019
loige.link/tera-dub
Domagoj Katavic, Senior Software Engineer
🐦 @katavic_d
😸 github.com/dkatavic
Luciano Mammino, Cloud Architect
🐦 @loige
😸 github.com/lmammino
loige.co
4.7 out of 5 stars
on Amazon.com
With @mariocasciaro
Agenda
● The problem space
● Our first MVP & Beta period
● INCIDENTS! And lessons learned
● AWS Nuances
● Process to deal with incidents
● Development best practices
● Release process
AI to detect and hunt for
cyber attackers
Cognito Platform
● Detect
● Recall
Cognito Detect
On-premises solution
● Analyzes network traffic and logs
● Uses AI to deliver real-time attack visibility
● Behaviour-driven and host-centric
● Provides threat context and the most relevant
attack details
Cognito Recall
● Collects network metadata
and stores it in “the cloud”
● Data is processed, enriched and standardised
● Data is made searchable
A new Vectra product for Incident Response
Recall requirements
● Data isolation
● Ingestion speed: ~2 GB/min per customer
(up to ~3 TB per day per customer)
● Investigation tool:
Flexible data exploration
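As a quick sanity check on the ingestion figures above, the per-minute rate can be converted to a daily volume. This is only an illustrative calculation: it assumes decimal units (1 TB = 1000 GB) and a constant rate for the whole day, neither of which is stated in the slides.

```python
# Back-of-the-envelope check: ~2 GB/min sustained for a full day.
GB_PER_MINUTE = 2          # ingestion rate quoted in the requirements
MINUTES_PER_DAY = 60 * 24  # 1440


def daily_volume_tb(gb_per_minute: float) -> float:
    """Daily volume in TB, assuming a constant rate and 1 TB = 1000 GB."""
    return gb_per_minute * MINUTES_PER_DAY / 1000


print(f"{daily_volume_tb(GB_PER_MINUTE):.2f} TB/day")  # prints "2.88 TB/day"
```

The result is consistent with the "up to ~3 TB per day" figure.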
Our first iteration
[Architecture diagram: control plane, centralised logging & metrics]
Security
● Separate VPCs
● Strict Security Groups (whitelisting)
● Red, amber, green subnets
● Encryption at rest through AWS services
● Client Certificates + TLS
● Pentest
Let’s start the beta!
Warning: different timezones!
[Image: a customer and our engineers*]
*Yeah, we actually look that cute when we sleep!
Lambda timeouts incident
● AWS Lambda timeout: 15 minutes (raised from 5 minutes)
● We are receiving files every minute
(containing 1 minute of network traffic)
● During peak hours for the biggest customer, files
can be too big to be processed within timeout
limits
Splitter lambda
Message-aware splitting
Lessons learned
● Predictable data input for
predictable performance
● Data ingestion parallelization
(exploiting serverless
capabilities)
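The idea of message-aware splitting can be sketched as follows. This is a minimal illustration, not the actual splitter Lambda: it assumes the incoming file is newline-delimited (one message per line) and that `max_chunk_bytes` is a tunable size budget. Neither detail is given in the slides, and the sample messages are made up.

```python
from typing import Iterable, Iterator, List


def split_messages(lines: Iterable[bytes], max_chunk_bytes: int) -> Iterator[List[bytes]]:
    """Group whole messages into chunks of at most max_chunk_bytes.

    A message is never cut in half; a single message larger than the
    budget is emitted as its own (oversized) chunk.
    """
    chunk: List[bytes] = []
    size = 0
    for line in lines:
        # Close the current chunk before it would exceed the budget.
        if chunk and size + len(line) > max_chunk_bytes:
            yield chunk
            chunk, size = [], 0
        chunk.append(line)
        size += len(line)
    if chunk:
        yield chunk


# Hypothetical input: five 25-byte messages and a 60-byte budget.
messages = [b'{"id": %d, "bytes": 1234}\n' % i for i in range(5)]
chunks = list(split_messages(messages, max_chunk_bytes=60))
print([len(c) for c in chunks])  # prints "[2, 2, 1]"
```

Each chunk could then be stored as its own object, so that every downstream invocation receives an input of bounded size and the chunks can be processed in parallel.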
Lambdas IP starvation incident
● Spinning up many lambdas consumed
all the available IPs in a subnet
● Failure to get an IP for the new ES
machines
● ElasticSearch cannot scale up
● Solution: separate ElasticSearch and
Lambda subnets
Lessons learned
● Every running lambda inside a VPC uses an ENI
(Elastic Network Interface)
● Every ENI takes a private IP address
● Edge conditions or bugs might generate spikes in the
number of running lambdas and you might run out of
IPs in the subnet!
● Consider putting lambdas in their dedicated subnet
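A rough capacity check along these lines can help when sizing a dedicated Lambda subnet. The sketch below follows the one-IP-per-running-Lambda model described in this talk (AWS has since changed how Lambda attaches to VPCs, so treat it as a worst case). The CIDR block and the concurrency figure are hypothetical; the five reserved addresses per subnet are standard AWS behaviour.

```python
import ipaddress

# AWS reserves the first four addresses and the last address of every subnet.
AWS_RESERVED_IPS_PER_SUBNET = 5


def usable_ips(cidr: str) -> int:
    """Number of private IPs that can actually be assigned in a subnet."""
    network = ipaddress.ip_network(cidr)
    return max(network.num_addresses - AWS_RESERVED_IPS_PER_SUBNET, 0)


def can_absorb_spike(cidr: str, peak_concurrent_lambdas: int, other_hosts: int = 0) -> bool:
    """True if the subnet has an IP for every running Lambda plus any other hosts."""
    return peak_concurrent_lambdas + other_hosts <= usable_ips(cidr)


print(usable_ips("10.0.1.0/24"))                                     # prints "251"
print(can_absorb_spike("10.0.1.0/24", peak_concurrent_lambdas=300))  # prints "False"
```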
Missing data incident
● New lambda version: triggered insertion failures
● ElasticSearch rejecting inserts and logging errors
● Our log reporting agents got stuck (we DDoS’d ourselves!)
● Monitoring/Alerting failed
Resolution:
● Fix mismatching schema
● Scaled out centralised logging system
Why didn’t we receive the page?
Alerting on lambda failures
Using logs:
● Best case: no logs
● Worst case: no logs (available)!
A better approach:
● Attach a DLQ to your lambdas
● Alert on queue size with
CloudWatch!
● Visibility on Lambda retries
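Alerting on the DLQ size could look roughly like the sketch below, which only builds the parameters for a CloudWatch alarm on the standard SQS metric `ApproximateNumberOfMessagesVisible`. The queue name, SNS topic ARN, period and threshold are hypothetical values chosen for illustration, not the setup used in the talk.

```python
from typing import Any, Dict


def dlq_alarm_params(queue_name: str, sns_topic_arn: str) -> Dict[str, Any]:
    """Parameters for an alarm that fires as soon as the DLQ holds any message."""
    return {
        "AlarmName": f"{queue_name}-not-empty",
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateNumberOfMessagesVisible",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Maximum",
        "Period": 300,               # seconds per evaluation window (assumed)
        "EvaluationPeriods": 1,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],  # e.g. a topic wired to the paging tool
    }


# Hypothetical names, for illustration only.
params = dlq_alarm_params("ingestion-dlq", "arn:aws:sns:eu-west-1:123456789012:pages")
# With boto3 installed and credentials configured, the alarm could be created with:
#   boto3.client("cloudwatch").put_metric_alarm(**params)
print(params["AlarmName"])  # prints "ingestion-dlq-not-empty"
```

Alarming on the queue rather than on logs means the alert no longer depends on the logging pipeline being healthy.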
Fast retry at peak times
● Lambda retry logic is not configurable
loige.link/lambda-retry
● Most events will be retried 2 times
● Time between retry attempts is not clearly defined
(observed to be in the order of a few seconds)
● What if all retry attempts happen at peak times?
● Processing in the off-peak ranges of time is likely to succeed
● Processing in the peak range of time is likely to fail
● If the 1st and 2nd retries fall in the same (peak) zone, the message will fail and go to the DLQ
Can we extend the retry period
in case of failure?
Extended retry period
1. We normally trigger our ingestion Lambda when a new file is stored in S3
2. If the Lambda fails, the event is automatically retried, up to 2 times
3. If the Lambda still fails, the event is copied to the Dead Letter Queue (DLQ)
4. At this point our Lambda can receive an SQS event from the DLQ (custom retry logic)
5. If the processing still fails, we can extend the VisibilityTimeout (event delay), up to 3 times
6. If the processing still fails, we eventually drop the message and alert for manual intervention
Lessons learned
● Cannot always rely on the default retry logic
● SQS events + DLQ =
custom SERVERLESS retry logic
● Now we only alert on custom metrics when
we are sure the event will fail (logic error)
● https://loige.link/async-lambda-retry
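The custom retry logic built on SQS and the DLQ might be sketched like this. It is an illustration under stated assumptions, not the authors' code: the doubling delay, `BASE_DELAY_SECONDS`, the helper names and the stub client are all hypothetical. The record fields (`receiptHandle`, `attributes.ApproximateReceiveCount`) follow the SQS event format delivered to Lambda, and `change_message_visibility` is the SQS call that extends the VisibilityTimeout. How the message is finally deleted or left on the queue depends on how the queue is wired to the function and is outside this sketch.

```python
from typing import Any, Callable, Dict

MAX_ATTEMPTS = 3                 # retry budget (assumed to match the "x3" in the slides)
BASE_DELAY_SECONDS = 300         # hypothetical starting delay
MAX_VISIBILITY_TIMEOUT = 43_200  # SQS allows at most 12 hours


def retry_delay(attempt: int) -> int:
    """Delay before the next attempt: doubles each time, capped at the SQS limit."""
    return min(BASE_DELAY_SECONDS * 2 ** (attempt - 1), MAX_VISIBILITY_TIMEOUT)


def handle_dlq_record(
    record: Dict[str, Any],
    process: Callable[[str], None],
    sqs: Any,
    queue_url: str,
    alert: Callable[[str], None],
) -> str:
    """Process one SQS record; returns "done", "delayed" or "dropped"."""
    attempt = int(record["attributes"]["ApproximateReceiveCount"])
    try:
        process(record["body"])
        return "done"
    except Exception as err:
        if attempt >= MAX_ATTEMPTS:
            # Out of retries: give up and ask for manual intervention.
            alert(f"giving up after {attempt} attempts: {err}")
            return "dropped"
        # Hide the message for longer so the next attempt lands after the peak.
        sqs.change_message_visibility(
            QueueUrl=queue_url,
            ReceiptHandle=record["receiptHandle"],
            VisibilityTimeout=retry_delay(attempt),
        )
        return "delayed"


class StubSqs:
    """Stand-in for boto3.client("sqs") so the sketch runs without AWS access."""

    def change_message_visibility(self, **kwargs: Any) -> None:
        print("delaying message by", kwargs["VisibilityTimeout"], "seconds")


def always_fails(body: str) -> None:
    raise RuntimeError("ElasticSearch is busy")


record = {
    "body": "hypothetical-file-reference",
    "receiptHandle": "abc",
    "attributes": {"ApproximateReceiveCount": "2"},
}
print(handle_dlq_record(record, always_fails, StubSqs(), "https://example.invalid/queue", print))
```

Spreading the attempts over minutes or hours, rather than seconds, gives a message that failed during a traffic peak a chance to be retried once the peak has passed.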
AWS nuances
● Serverless is generally cheap, but be careful!
○ You are paying for wait time
○ Bugs may be expensive
○ 100ms charging blocks
● https://loige.link/lambda-pricing
● https://loige.link/serverless-costs-all-wrong
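The effect of charging blocks and wait time on the bill can be illustrated with a small calculation. This is a sketch under assumptions: the per-GB-second price is an assumed figure that should be checked against current AWS pricing, the per-request fee is ignored, and the 100 ms granularity is the one quoted on the slide (billing granularity may differ today).

```python
import math

PRICE_PER_GB_SECOND = 0.0000166667  # assumed price; check the AWS pricing page
BILLING_BLOCK_MS = 100              # granularity quoted on the slide


def billed_ms(duration_ms: float) -> int:
    """Round the duration up to the next full billing block."""
    return math.ceil(duration_ms / BILLING_BLOCK_MS) * BILLING_BLOCK_MS


def invocation_cost(duration_ms: float, memory_mb: int) -> float:
    """Compute cost of one invocation (duration charge only)."""
    gb_seconds = (memory_mb / 1024) * (billed_ms(duration_ms) / 1000)
    return gb_seconds * PRICE_PER_GB_SECOND


print(billed_ms(100))  # prints "100"
print(billed_ms(101))  # prints "200": one extra millisecond costs a whole block
# Time spent waiting on a slow dependency is billed like any other time:
print(invocation_cost(duration_ms=30_000, memory_mb=1024) / invocation_cost(300, 1024))
```

The last line shows why waiting is expensive: a function that idles for 30 seconds on a slow downstream call costs roughly a hundred times more than one that returns in 300 ms.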
AWS nuances
● Not every service/feature is available in every region or AZ
○ SQS FIFO :(
○ Not all AWS regions have 3 AZs
○ Not all instance types are available in every availability zone
● https://loige.link/aws-regional-services
AWS nuances
● Limits everywhere!
○ Soft vs hard limits
○ Take them into account in your design
● https://loige.link/aws-service-limits
Process
How to deal with incidents
● Page
● Engineers on call
● Incident Retrospective
● Actions
Pages
● A page is an alarm for people on call (PagerDuty)
● Rotate ops & devs (share the pain)
● Generate pages from different sources (logs, CloudWatch, SNS,
Grafana, etc.)
● When a page is received, it needs to be acknowledged or it is
automatically escalated
● If customer facing (e.g. service not available), customer is notified
Engineers on call
1. Use operational handbook
2. Might escalate to other engineers
3. Find mitigation / remediation
4. Update handbook
5. Prepare for retrospective
Incident Retrospective
"Regardless of what we discover, we understand and truly
believe that everyone did the best job they could, given
what they knew at the time, their skills and abilities, the
resources available, and the situation at hand."
– Norm Kerth, Project Retrospectives: A Handbook for Team Reviews
TL;DR: NOT A BLAMING GAME!
Incident Retrospective
● Summary
● Events timeline
● Contributing Factors
● Remediation / Solution
● Actions for the future
● Transparency
Development best practices
● Regular Retrospectives (not just for incidents)
○ What’s good
○ What’s bad
○ Actions to improve
● Kanban Board
○ All work visible
○ One card at a time
○ Work In Progress limit
○ “Stop starting, start finishing”
Development best practices
● Clear acceptance criteria
○ Collectively defined (3 amigos)
○ Make sure you know when a card is done
● Split the work into small cards
○ High throughput
○ More predictability
● Bugs take priority over features!
Development best practices
● Pair programming
○ Share the knowledge/responsibility
○ Improve team dynamics
○ Enforced by low WIP limit
● Quality over deadlines
● Don’t estimate without data
Release process
● Infrastructure as code
○ Deterministic deployments
○ Infrastructure versioning using git
● No “snowflakes”, one code base for all customers
● Feature flags:
○ Special features
○ Soft releases
● Automated tests before release
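Per-customer feature flags on a single code base could be sketched as below. This is a generic illustration, not the mechanism used by the team: the flag names, customer identifiers and the in-memory registry are all hypothetical.

```python
from typing import Dict, Set

# Hypothetical flag registry: flag name -> customers for whom it is enabled.
# "*" enables the flag for everyone (for example, a completed soft release).
FLAGS: Dict[str, Set[str]] = {
    "new-enrichment-pipeline": {"customer-a"},  # soft release to a single customer
    "custom-retention": {"customer-b"},         # special feature for one customer
    "search-v2": {"*"},                         # rolled out to all customers
}


def is_enabled(flag: str, customer_id: str) -> bool:
    """True if the flag is on for this customer; unknown flags are off."""
    enabled_for = FLAGS.get(flag, set())
    return "*" in enabled_for or customer_id in enabled_for


print(is_enabled("new-enrichment-pipeline", "customer-a"))  # prints "True"
print(is_enabled("new-enrichment-pipeline", "customer-b"))  # prints "False"
```

Every customer runs the same code; only the registry differs, which keeps deployments free of "snowflakes" while still allowing gradual rollouts.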
Conclusion
We are still waking up at night sometimes,
but we are definitely sleeping a lot more and better!
Takeaways:
● Have healthy and clear processes
● Always review and strive for improvement
● Monitor/instrument as much as you can (even your monitoring)
● Use managed services to reduce the operational overhead
(but learn their nuances)
We are hiring …
Talk to us!
Thank you!
loige.link/tera-dub
Credits
Pictures from Unsplash
Huge thanks for support and reviews to:
● All the Vectra team
● Yan Cui (@theburningmonk)
● Paul Dolan
● @gbinside
● @augeva
● @Podgeypoos79
● @PawrickMannion
● @micktwomey
● Vedran Jukic
