Processing TeraBytes of data every day and sleeping at night

Processing Terabytes of data every day
… and sleeping at night
@katavic_d - @loige
User Group
Dublin
05/02/2019
loige.link/tera-dub

Domagoj Katavic, Senior Software Engineer
🐦 @katavic_d
😸 github.com/dkatavic

Luciano Mammino, Cloud Architect
🐦 @loige
😸 github.com/lmammino
loige.co
4.7 out of 5 stars
on Amazon.com
With @mariocasciaro

Agenda
● The problem space
● Our first MVP & Beta period
● INCIDENTS! And lessons learned
● AWS Nuances
● Process to deal with incidents
● Development best practices
● Release process

AI to detect and hunt for
cyber attackers
Cognito Platform
● Detect
● Recall

Cognito Detect
on premise solution
● Analyzing network traffic and logs
● Uses AI to deliver real-time attack visibility
● Behaviour-driven and host-centric
● Provides threat context and most relevant
attack details

Cognito Recall
● Collects network metadata
and stores it in “the cloud”
● Data is processed, enriched and standardised
● Data is made searchable
A new Vectra product for Incident Response

Recall requirements
● Data isolation
● Ingestion speed: ~2 GB/min per customer
(up to ~3 TB per day per customer)
● Investigation tool:
Flexible data exploration
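As a quick sanity check on the ingestion figures above, the per-minute rate can be converted to a daily volume. This is only a back-of-the-envelope sketch: it assumes decimal units (1 TB = 1000 GB) and a rate sustained for the whole day, neither of which is stated on the slide.

```python
MINUTES_PER_DAY = 24 * 60  # 1440


def daily_volume_tb(gb_per_minute: float) -> float:
    """Convert a sustained ingestion rate in GB/min to TB/day (decimal units)."""
    return gb_per_minute * MINUTES_PER_DAY / 1000


print(daily_volume_tb(2))  # 2.88
```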

Our first iteration

Control plane
Centralised Logging & Metrics

Security
● Separate VPCs
● Strict Security Groups (whitelisting)
● Red, amber, green subnets
● Encryption at rest through AWS services
● Client Certificates + TLS
● Pentest

Let’s start the beta!

Warning: different timezones!
A customer
Our engineers*
*yeah, we actually look that cute when we sleep!

Lambda timeouts incident
● AWS Lambda timeout: 15 minutes (previously 5 minutes)
● We are receiving files every minute
(containing 1 minute of network traffic)
● During peak hours for the biggest customer, files
can be too big to be processed within timeout
limits

Splitter lambda

Lessons learned
● Predictable data input for
predictable performance
● Data ingestion parallelization
(exploiting serverless
capabilities)
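The idea behind a splitter can be sketched as follows: break an oversized input into chunks of bounded size, so that each downstream invocation receives a predictable amount of work. This is a minimal illustration, not the implementation described in the talk; the function name, the record-based input and the `max_bytes` budget are all assumptions.

```python
from typing import Iterable, Iterator, List


def split_records(records: Iterable[bytes], max_bytes: int) -> Iterator[List[bytes]]:
    """Group records into chunks whose total size stays within max_bytes.

    A single record larger than max_bytes is emitted as its own chunk.
    """
    chunk: List[bytes] = []
    size = 0
    for record in records:
        # Close the current chunk if adding this record would exceed the budget.
        if chunk and size + len(record) > max_bytes:
            yield chunk
            chunk, size = [], 0
        chunk.append(record)
        size += len(record)
    if chunk:
        yield chunk


# Each chunk could then be stored separately and processed in parallel.
chunks = list(split_records([b"a" * 40, b"b" * 40, b"c" * 40], max_bytes=100))
print([len(c) for c in chunks])  # [2, 1]
```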

Lambdas IP starvation incident
● Spinning up many lambdas consumed
all the available IPs in a subnet
● Failure to get an IP for the new ES
machines
● ElasticSearch cannot scale up
● Solution: separate ElasticSearch and
Lambda subnets

Lessons learned
● Every running lambda inside a VPC uses an ENI
(Elastic Network Interface)
● Every ENI takes a private IP address
● Edge conditions or bugs might generate spikes in the
number of running lambdas and you might run out of
IPs in the subnet!
● Consider putting lambdas in their dedicated subnet
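A rough capacity check can make this lesson concrete. The sketch below assumes, as the slide states, one private IP per running lambda, and relies on the fact that AWS reserves five addresses in every subnet. The function names, the CIDR and the concurrency figure are illustrative only.

```python
import ipaddress

# AWS reserves the first four addresses and the last address of every subnet.
AWS_RESERVED_PER_SUBNET = 5


def usable_ips(cidr: str) -> int:
    """Number of addresses that can actually be assigned in an AWS subnet."""
    network = ipaddress.ip_network(cidr)
    return max(network.num_addresses - AWS_RESERVED_PER_SUBNET, 0)


def fits_in_subnet(cidr: str, peak_concurrency: int, other_hosts: int = 0) -> bool:
    """True if a spike of concurrent lambdas (plus other hosts) fits in the subnet."""
    return peak_concurrency + other_hosts <= usable_ips(cidr)


print(usable_ips("10.0.1.0/24"))                            # 251
print(fits_in_subnet("10.0.1.0/24", peak_concurrency=300))  # False
```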

Missing data incident

● New lambda version: triggered insertion failures
● ElasticSearch rejecting inserts and logging errors
● Our log reporting agents got stuck (we DDoS’d ourselves!)
● Monitoring/Alerting failed
Resolution:
● Fix mismatching schema
● Scaled out centralised logging system
Why didn’t we receive the page?

Alerting on lambda failures
Using logs:
● Best case: no logs
● Worst case: no logs (available)!
A better approach:
● Attach a DLQ to your lambdas
● Alert on queue size with
CloudWatch!
● Visibility on Lambda retries
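Alerting on the DLQ size could look roughly like this. The sketch builds the arguments for the CloudWatch `put_metric_alarm` call against the standard SQS metric `ApproximateNumberOfMessagesVisible`. The alarm name, the 5-minute period, the threshold of zero and the SNS topic used for notification are assumptions, not values from the talk.

```python
def dlq_alarm_params(queue_name: str, sns_topic_arn: str) -> dict:
    """Build put_metric_alarm arguments that fire when the DLQ is not empty."""
    return {
        "AlarmName": f"{queue_name}-not-empty",  # hypothetical naming scheme
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateNumberOfMessagesVisible",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Maximum",
        "Period": 300,              # seconds (assumed evaluation window)
        "EvaluationPeriods": 1,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }


def create_dlq_alarm(queue_name: str, sns_topic_arn: str) -> None:
    """Create the alarm; requires boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the helper above works without the SDK

    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_alarm(**dlq_alarm_params(queue_name, sns_topic_arn))
```

Any message landing in the queue means an event exhausted its retries, so a threshold of zero is a reasonable starting point.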

Fast retry at peak times
● Lambda retry logic is not configurable
loige.link/lambda-retry
● Most events will be retried 2 times
● Time between retry attempts is not clearly defined
(observed to be in the order of a few seconds)
● What if all retry attempts happen at peak times?

Processing in this range of time is likely to succeed

Processing in this range of time is likely to fail

If retries are in the same zone, the message will fail and go to the DLQ
1st retry 2nd retry

Can we extend the retry period
in case of failure?

Extended retry period
We normally trigger our ingestion Lambda when a new file is stored in S3

If the Lambda fails, the event is automatically retried, up to 2 times

If the Lambda still fails, the event is copied to the Dead Letter Queue (DLQ)

At this point our Lambda can receive an SQS event from the DLQ (custom retry logic)

If the processing still fails, we can extend the VisibilityTimeout (event delay), up to 3 times

If the processing still fails after 3 attempts, we eventually drop the message and alert for manual intervention.
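The extended retry flow above could be sketched like this. It is an illustration under several assumptions rather than the code used in production: the attempt limit of 3 mirrors the "x3" on the slides, while the base delay, the exponential backoff, the helper names and the injected `process` and `alert` callbacks are hypothetical. The `sqs` argument is expected to behave like a boto3 SQS client, and the record layout follows the SQS event format that Lambda receives.

```python
from typing import Callable, Optional

MAX_ATTEMPTS = 3         # mirrors the "x3" on the slides
BASE_DELAY_SECONDS = 60  # hypothetical base delay


def next_delay(receive_count: int) -> Optional[int]:
    """Delay before the next attempt, or None when retries are exhausted."""
    if receive_count >= MAX_ATTEMPTS:
        return None
    return BASE_DELAY_SECONDS * 2 ** (receive_count - 1)


def handle_record(record: dict, process: Callable[[dict], None], sqs,
                  queue_url: str, alert: Callable[[dict], None]) -> str:
    """Process one SQS record coming from the DLQ with custom retry logic."""
    try:
        process(record)
        return "done"
    except Exception:
        receive_count = int(record["attributes"]["ApproximateReceiveCount"])
        delay = next_delay(receive_count)
        if delay is None:
            # Give up: alert for manual intervention and let the message go.
            alert(record)
            return "dropped"
        # Push the message further into the future before it becomes visible again.
        sqs.change_message_visibility(
            QueueUrl=queue_url,
            ReceiptHandle=record["receiptHandle"],
            VisibilityTimeout=delay,
        )
        # The caller must make the invocation fail so the message is not deleted.
        return "retry_later"
```

Spacing the attempts out gives a failing event a chance to be retried outside the peak window, which the built-in retries cannot guarantee.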

Lessons learned
● Cannot always rely on the default retry logic
● SQS events + DLQ =
custom SERVERLESS retry logic
● Now we only alert on custom metrics when
we are sure the event will fail (logic error)
● https://loige.link/async-lambda-retry

AWS nuances
● Serverless is generally cheap, but be careful!
○ You are paying for wait time
○ Bugs may be expensive
○ 100ms charging blocks
● https://loige.link/lambda-pricing
● https://loige.link/serverless-costs-all-wrong
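To see why wait time and charging blocks matter, here is a small sketch of the duration-based part of the bill, using the 100 ms blocks mentioned on the slide. The price per GB-second is deliberately left as a parameter to be taken from the pricing page, and per-request charges are ignored; the function names are illustrative.

```python
import math


def billed_ms(duration_ms: float, block_ms: int = 100) -> int:
    """Round an execution time up to the next charging block."""
    return math.ceil(duration_ms / block_ms) * block_ms


def duration_cost(duration_ms: float, memory_mb: int, price_per_gb_second: float) -> float:
    """Duration-based cost of one invocation (request charges not included)."""
    gb_seconds = billed_ms(duration_ms) / 1000 * (memory_mb / 1024)
    return gb_seconds * price_per_gb_second


print(billed_ms(100))  # 100
print(billed_ms(101))  # 200
```

A function that spends most of its time waiting on I/O is still billed for every block it occupies, and one millisecond over a block boundary is charged as a full extra block.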

AWS nuances
● Not every service/feature is available in every region or AZ
○ SQS FIFO :(
○ Not all AWS regions have 3 AZs
○ Not all instance types are available in every availability zone
● https://loige.link/aws-regional-services

AWS nuances
● Limits everywhere!
○ Soft vs hard limits
○ Take them into account in your design
● https://loige.link/aws-service-limits

Process
How to deal with incidents
● Page
● Engineers on call
● Incident Retrospective
● Actions

Pages
● A page is an alarm for people on call (PagerDuty)
● Rotate ops & devs (share the pain)
● Generate pages from different sources (logs, CloudWatch, SNS,
Grafana, etc.)
● When a page is received, it needs to be acknowledged or it is
automatically escalated
● If customer facing (e.g. service not available), customer is notified

Engineers on call
1. Use operational handbook
2. Might escalate to other engineers
3. Find mitigation / remediation
4. Update handbook
5. Prepare for retrospective

Incidents Retrospective
"Regardless of what we discover, we understand and truly
believe that everyone did the best job they could, given
what they knew at the time, their skills and abilities, the
resources available, and the situation at hand."
– Norm Kerth, Project Retrospectives: A Handbook for Team Reviews
TL;DR: NOT A BLAME GAME!

Incidents Retrospective
● Summary
● Events timeline
● Contributing Factors
● Remediation / Solution
● Actions for the future
● Transparency

Development best practices
● Regular Retrospectives (not just for incidents)
○ What’s good
○ What’s bad
○ Actions to improve
● Kanban Board
○ All work visible
○ One card at a time
○ Work In Progress limit
○ “Stop Starting, Start Finishing”

● Clear acceptance criteria
○ Collectively defined (3 amigos)
○ Make sure you know when a card is done
● Split the work in small cards
○ High throughput
○ More predictability
● Bugs take priority over features!

● Pair programming
○ Share the knowledge/responsibility
○ Improve team dynamics
○ Enforced by low WIP limit
● Quality over deadlines
● Don’t estimate without data

Release process
● Infrastructure as code
○ Deterministic deployments
○ Infrastructure versioning using git
● No “snowflakes”, one code base for all customers
● Feature flags:
○ Special features
○ Soft releases
● Automated tests before release
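One way to keep a single code base while still enabling special features per customer is to resolve feature flags from shared defaults plus per-customer overrides. The sketch below is only an illustration of that idea; the flag names and the function are hypothetical and not taken from the talk.

```python
# Shared defaults: every customer runs the same code with the same baseline.
DEFAULT_FLAGS = {
    "special_feature": False,  # hypothetical flag names
    "new_search_ui": False,
}


def resolve_flags(customer_overrides: dict) -> dict:
    """Merge per-customer overrides on top of the shared defaults."""
    unknown = set(customer_overrides) - set(DEFAULT_FLAGS)
    if unknown:
        # Rejecting unknown flags avoids silently creating "snowflake" settings.
        raise KeyError(f"unknown feature flags: {sorted(unknown)}")
    return {**DEFAULT_FLAGS, **customer_overrides}


print(resolve_flags({"new_search_ui": True}))
# {'special_feature': False, 'new_search_ui': True}
```

A soft release then amounts to turning a flag on for a few customers before changing its default.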

Conclusion
We are still waking up at night sometimes,
but we are definitely sleeping a lot more and better!
Takeaways:
● Have healthy and clear processes
● Always review and strive for improvement
● Monitor/Instrument as much as you can (even monitoring)
● Use managed services to reduce the operational overhead
(but learn their nuances)

We are hiring …
Talk to us!
Thank you!
loige.link/tera-dub

Credits
Pictures from Unsplash
Huge thanks for support and reviews to:
● All the Vectra team
● Yan Cui (@theburningmonk)
● Paul Dolan
● @gbinside
● @augeva
● @Podgeypoos79
● @PawrickMannion
● @micktwomey
● Vedran Jukic
