Cloud Native Predictive
Data Pipelines
A microtalk
Sid Anand (@r39132)
PayPal Risk Infra All-Hands (Jan 2018)
About Me
Worked @
Committer & PPMC
on
Father of 2
Co-Chair for
Apache Airflow
Work @ (started July 2017)
NRT Fraud Prevention
in the cloud
What does Agari do?
• You may have been phished in
your personal email inbox
• This is a bigger problem for
enterprises
• ~80% of enterprise-targeted
attacks are via email
How does Agari work?
[Diagram: Enterprise Customer, SEG]
• A (targeted) spear-phish email is sent to an individual
at your company
• The email content is unique and personalized to a
specific employee
• Your company’s SEG (secure email gateway, a.k.a. spam detector) lets it
through since it doesn’t match known signatures of
spam campaigns
[Diagram: Enterprise Customer, SEG, AWS Cloud; labels: Email Metadata, Build Trust Models & Score]
• Agari’s on-premises interceptor holds the email & sends
email headers to its cloud-based SaaS prediction
service
• Agari assigns a trust score to the email, building a
model on the fly if more information is needed, and
records the result in a DB
[Diagram: Enterprise Customer, SEG, AWS Cloud; labels: Email Metadata, Build Trust Models & Score, Quarantine / Label / PassThrough]
• The prediction service sends a control signal back to
the on-premise interceptor to quarantine, label, or
release the held email message
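
The control signal described above can be sketched as a mapping from trust score to action. This is a minimal illustration only: the 0.0 to 1.0 score range, the two thresholds, and the names `Action` and `control_signal` are assumptions, since the talk does not say how scores are turned into actions.

```python
from enum import Enum


class Action(Enum):
    QUARANTINE = "quarantine"
    LABEL = "label"
    PASS_THROUGH = "pass_through"


# Hypothetical cut-offs on a 0.0-1.0 trust score; the talk gives no thresholds.
QUARANTINE_BELOW = 0.2
LABEL_BELOW = 0.6


def control_signal(trust_score: float) -> Action:
    """Map a trust score to the action sent back to the interceptor."""
    if not 0.0 <= trust_score <= 1.0:
        raise ValueError(f"trust score out of range: {trust_score}")
    if trust_score < QUARANTINE_BELOW:
        return Action.QUARANTINE  # least trusted: hold the message
    if trust_score < LABEL_BELOW:
        return Action.LABEL  # uncertain: deliver with a warning label
    return Action.PASS_THROUGH  # trusted: release the held message
```

Keeping the decision in one small pure function makes it easy to test separately from the scoring model.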
• 95th-percentile SLA: under 3 seconds (end-to-end)
• Agari also provides near-real-time analytics on the mail
flow & actions
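
As a worked illustration of the latency target above, the following sketch checks a batch of end-to-end latencies against a 95th-percentile limit of 3 seconds. The nearest-rank percentile method and the function names are assumptions; the talk does not say how the SLA is measured.

```python
def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least
    pct percent of all samples at or below it (pct is an int, 1-100)."""
    if not samples:
        raise ValueError("no samples")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, avoiding float rounding issues.
    rank = -(-pct * len(ordered) // 100)
    return ordered[rank - 1]


def meets_sla(latencies_s, pct=95, limit_s=3.0):
    """True when the pct-th percentile latency is under the limit."""
    return percentile(latencies_s, pct) < limit_s
```

With 20 samples, the 95th percentile is the 19th-smallest value, so a single slow outlier does not break the SLA but two do.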
Predictive Analytics @
Agari
Use Cases
Apply trust models
(message scoring)
near real time
Build trust models
batch + near real
time
Focus of this talk
Use-Case : Message
Scoring (near-real time)
NRT Pipeline Architecture
Message Scoring Architecture
[Architecture diagram. Components shown: enterprises A, B, and C; Importer (ASG); Scorer (ASG); Alerter (ASG); Counter; ES Upd8r; S3 Upd8r; S3; SQS; several nodes labeled K and SR]
… With Nightly Model Building
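[Architecture diagram. Components shown: enterprises A, B, and C; Importer (ASG); Scorer (ASG); Alerter (ASG); Counter; ES Upd8r; S3 Upd8r; S3; SQS; several nodes labeled K and SR. The nightly model-building additions are not recoverable from the text]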
Architectural Concepts
• Microservice Architecture - each service does a simple job but does it well
• Decoupled Services - Via Message Buses (Kinesis) & Avro (great support for schema
evolution)
• Immutable Services - Nightly models are packaged with code (co-versioned) in a new
image that is rolled out via the autoscaler
• Polyglot Persistence - Use the right datastore for the right job
    • Postgres for message details
    • ES for aggregates & search
    • Redis for low-latency, high-frequency windowed counter-style R/W workloads
• Single source(s) of truth - S3 and Postgres are the sources of truth for semi-structured and
structured data, respectively. Everything else can be built from them
• Use Lambda/FaaS when possible for light-weight event processing - Note: it’s stateless!
• Leverage CDC - Create a stream of committed data! Avoid Write-then-Read patterns
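
To make the "windowed counter-style" workload in the list above concrete, here is a minimal in-memory sketch of a sliding-window counter. It is a stand-in, not Agari's implementation and not Redis code: the class name `WindowedCounter`, the key names, and the assumption that timestamps arrive in non-decreasing order per key are all hypothetical.

```python
from collections import defaultdict, deque


class WindowedCounter:
    """Counts events per key over the trailing `window_s` seconds."""

    def __init__(self, window_s: float):
        self.window_s = window_s
        self._events = defaultdict(deque)  # key -> timestamps, oldest first

    def _evict(self, key, now):
        # Drop timestamps that have aged out of the window.
        events = self._events[key]
        while events and events[0] <= now - self.window_s:
            events.popleft()

    def record(self, key, now):
        """Register one event for `key` at time `now` (seconds)."""
        self._evict(key, now)
        self._events[key].append(now)

    def count(self, key, now):
        """Number of events for `key` in the window ending at `now`."""
        self._evict(key, now)
        return len(self._events[key])
```

Every read and write touches only the head or tail of a small per-key structure, which is the access pattern that favors a low-latency store over a relational database.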
Architectural Components
Component | Role | Details | Pros | Operability Model
S3 | Data Lake | All data stored in S3 via Kinesis Firehose | Scalable, Available, Performant, Serverless | Serverless
Kinesis | Messaging | Streaming transport modeled on Kafka | Scalable, Available, Serverless | Serverless
(logo only) | General Processing | ASG replacement except for Rails apps | Scalable, Available, Serverless | Serverless
ASG | General Processing | Used for importing, data cleansing, business logic | Scalable, Available, Managed | Managed
(logo only) | Data Science Processing | Model building | - | Agari Operates
(logo only) | Workflow Engine | Nightly model builds + some classic Ops cron workloads | Lightweight, DAGs as Code | Agari Operates
DB | Persistence for WebApp | Holds smaller subset of data needed for Web App | Rails + Postgres, ‘nuff said | Agari Operates
(logo only) | Persistence for WebApp | Aggregation + Search moved from DB to ES; Model Building queries moved to Elasticache Redis | Faster, more accurate for aggregates, frees up headroom for DB (polyglot persistence) | Managed
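
The "DAGs as Code" idea from the Workflow Engine row can be illustrated with a small sketch: the workflow is declared as a plain data structure and the execution order is derived from it. This uses Python's standard-library `graphlib` (Python 3.9+) as a stand-in and is not the API of any workflow engine; the task names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical nightly tasks: each key runs after the tasks in its set.
NIGHTLY_BUILD = {
    "extract_messages": set(),
    "build_trust_model": {"extract_messages"},
    "validate_model": {"build_trust_model"},
    "package_image": {"validate_model"},
}


def run_order(dag):
    """Return one valid execution order for the DAG.

    Raises graphlib.CycleError if the dependencies contain a cycle.
    """
    return list(TopologicalSorter(dag).static_order())
```

Because the workflow is ordinary code, it can be versioned, reviewed, and tested like the services it orchestrates.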
What’s Next?
For more info, reach out to DL-PP-CDP-PM
Questions?
