engineering.deltax.com
Building a Real-time Stream
Processing Pipeline
Akshay Surve, CTO DeltaX
akshay@deltax.com / @ak47surve
Hashtag: #awsblr #meetup
About Me
● 12 years
○ Shipping Ideas, Making Mistakes, GTD
○ Marathons / Hackathons / *-athon :)
● Co-founded DeltaX in 2013
○ Ad-tech / Product Startup
○ 300+ advertisers across India, APAC and US.
Agenda
● Use-case
● Processing Models
● Old Batch Processing Architecture
○ Challenges
● Goals
● Moving Blocks for a Stream Processing Model
○ Kinesis Data Firehose
○ Amazon Elasticsearch
○ Amazon Athena
● Review New Stream Processing Architecture
Use-case
● Ad Tracking & Ad Serving
● Cloud Architecture
Use-case
- Ad Tracking & Ad-serving
(Image slides: an ad tracking example annotated in turn with Advertiser, Event and Timestamp)
Use-case
- Cloud Architecture
Processing Models
● Batch Processing
● Stream Processing
Processing Models
● Batch Processing
Input > Batch Job(s) > Output
Processing Models
● Stream Processing
Queue > Stream Processor > Output
Processing Models
● Batch vs Stream
Batch         | Stream
High Latency  | Low Latency
Static Files  | Event Streams
Snapshot      | Continuous Window
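The contrast in the comparison above can be sketched in a few lines of Python. This is only an illustration: the event shape (a timestamp plus an event type) and the 60-second tumbling window are assumptions, not details from the talk.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical tracking events: (unix_timestamp_seconds, event_type).
Event = Tuple[int, str]


def batch_count(events: List[Event]) -> Counter:
    """Batch model: the full, static input is available before processing starts."""
    return Counter(event_type for _, event_type in events)


def stream_count(events: Iterable[Event], window_s: int = 60) -> Iterator[Tuple[int, Dict[str, int]]]:
    """Stream model: consume events one at a time and emit a summary each
    time a tumbling window closes (assumes events arrive in time order)."""
    current_window = None
    counts: Counter = Counter()
    for ts, event_type in events:
        window = ts - ts % window_s  # start of the window this event falls in
        if current_window is not None and window != current_window:
            yield current_window, dict(counts)
            counts = Counter()
        current_window = window
        counts[event_type] += 1
    if current_window is not None:
        yield current_window, dict(counts)  # flush the last open window


events = [(0, "impression"), (10, "click"), (65, "impression"), (70, "impression")]
batch_result = batch_count(events)
stream_result = list(stream_count(iter(events)))
```

The batch function produces one snapshot after seeing everything, while the stream function yields a result per window as soon as that window closes.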
Batch Processing
Batch Processing (Close-up)
Batch Processing (Challenges)
● Modelled around batch processing and not stream processing
● Ingesting JSON files in bulk isn’t natural for SQL (JSON parsing > SQL tables)
● Varied levels of aggregation - campaign, ad, device, geo + unique metrics
● Future roadmap - user ID cookie pool across advertisers, exchange-based cookie matching, etc. become challenges in themselves
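To make the aggregation challenge concrete, here is a minimal Python sketch that rolls raw JSON events up at several levels and tracks a unique metric alongside plain counts. The field names (`campaign`, `ad`, `device`, `geo`, `user_id`) are hypothetical; the talk does not show its event schema.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# Hypothetical aggregation levels; each is a tuple of event fields to group by.
LEVELS = [("campaign",), ("campaign", "ad"), ("campaign", "device"), ("campaign", "geo")]


def aggregate(json_lines: Iterable[str]) -> Dict[Tuple, Dict[str, int]]:
    """Count events and unique users for every key at every level."""
    counts: Dict[Tuple, int] = defaultdict(int)
    users: Dict[Tuple, Set[str]] = defaultdict(set)
    for line in json_lines:
        event = json.loads(line)
        for level in LEVELS:
            key = (level, tuple(event[field] for field in level))
            counts[key] += 1
            users[key].add(event["user_id"])
    return {key: {"events": counts[key], "unique_users": len(users[key])} for key in counts}


raw = [
    '{"campaign": "c1", "ad": "a1", "device": "mobile", "geo": "IN", "user_id": "u1"}',
    '{"campaign": "c1", "ad": "a2", "device": "mobile", "geo": "IN", "user_id": "u1"}',
    '{"campaign": "c1", "ad": "a1", "device": "desktop", "geo": "US", "user_id": "u2"}',
]
summary = aggregate(raw)
```

Unique counts cannot be obtained by adding up lower-level rows (the same user can appear under several ads), so each level has to be computed separately.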
Goals
● Stream processing as a paradigm suits our use case best
● An easy-to-maintain or managed service in the cloud would be ideal
● Developer friendliness and peace of mind were of utmost importance
● Being able to ingest streaming data and query summaries was important
● Good to have a way to run a batch processing framework for machine learning, data crunching, and analysis
Moving Blocks
● Amazon Athena
● Amazon Elasticsearch
● Kinesis Data Firehose
Amazon Athena
Amazon Athena
● Persistent Store
● DDL
● Query
Amazon Athena
● Persistent Store (AWS S3)
○ Text files, e.g., CSV, raw logs
○ Apache Web Logs, TSV files
○ JSON (simple, nested)
○ Compressed files
○ Columnar formats such as Apache Parquet & Apache ORC
Amazon Athena
● Persistent Store (AWS S3)
○ JSON events
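As an illustration of how such JSON events might be laid out in S3, the sketch below serialises events as newline-delimited JSON, one compact object per line, which is the layout line-oriented JSON readers generally expect. The field names echo the Advertiser, Event and Timestamp labels from the use-case slides, but the exact schema and values are assumptions.

```python
import json
from typing import Dict, List


def to_ndjson(events: List[Dict]) -> str:
    """Serialise events as newline-delimited JSON: one compact object per line."""
    return "".join(json.dumps(e, separators=(",", ":")) + "\n" for e in events)


# Hypothetical events; field names and values are illustrative only.
events = [
    {"advertiser": "acme", "event": "impression", "timestamp": "2018-06-01T10:00:00Z"},
    {"advertiser": "acme", "event": "click", "timestamp": "2018-06-01T10:00:05Z"},
]
payload = to_ndjson(events)
```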
Amazon Athena
● DDL (Apache Hive)
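A Hive-style DDL statement for JSON events might look like the output of the sketch below. The table name, columns and S3 location are hypothetical, and the OpenX JSON SerDe is one of the JSON SerDes Athena supports, not necessarily the one used in the talk.

```python
from typing import Dict


def build_create_table(table: str, columns: Dict[str, str], location: str) -> str:
    """Render a CREATE EXTERNAL TABLE statement for JSON files stored in S3."""
    # Backticks escape column names that clash with reserved words (e.g. timestamp).
    cols = ",\n  ".join(f"`{name}` {hive_type}" for name, hive_type in columns.items())
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        "ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'\n"
        f"LOCATION '{location}'"
    )


# Hypothetical schema and bucket; adjust to the real event layout.
ddl = build_create_table(
    "tracking_events",
    {"advertiser": "string", "event": "string", "timestamp": "string"},
    "s3://example-bucket/tracking-events/",
)
```

The table is only metadata over the files: dropping it leaves the data in S3 untouched.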
Amazon Athena
● Query Engine (Presto query engine)
○ In Memory
○ ANSI SQL Compliant
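Once a table is defined, summaries are plain SQL. The sketch below builds a hypothetical daily summary query; the table and column names are assumptions. Submitting it through boto3's Athena client is shown only as a commented outline, since that needs AWS credentials and a results bucket.

```python
def daily_summary_sql(table: str, day: str) -> str:
    """Build a query counting events per advertiser and event type for one day.

    `day` is an ISO date such as '2018-06-01'; timestamps are assumed to be
    ISO-8601 strings, so a prefix match selects the day.
    """
    return (
        "SELECT advertiser, event, COUNT(*) AS events\n"
        f"FROM {table}\n"
        f"WHERE \"timestamp\" LIKE '{day}%'\n"
        "GROUP BY advertiser, event\n"
        "ORDER BY events DESC"
    )


sql = daily_summary_sql("tracking_events", "2018-06-01")

# Outline of submitting the query (requires boto3 and AWS credentials):
#   import boto3
#   athena = boto3.client("athena")
#   athena.start_query_execution(
#       QueryString=sql,
#       QueryExecutionContext={"Database": "example_db"},
#       ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},
#   )
```

String interpolation is used for brevity; values from untrusted input should not be spliced into SQL this way.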
Amazon Athena (Advantages)
● Serverless
● No spin-up time
● Query data directly from S3
● ANSI SQL
● Queries run fast
Amazon Elasticsearch
Amazon Elasticsearch
● ELK Stack (Searching, Log monitoring)
● Seamless Ingestion (Document-based model)
● Real-time queries (even during ingestion; 30s refresh window; immutability)
● Meant for search; also efficient for time-series data (see Elasticsearch Internals)
Amazon Elasticsearch
- Document that gets ingested
Elasticsearch (Internals)
● Elasticsearch Index
○ Inverted Index
○ Doc Values
Elasticsearch (Internals)
Deeper into an Elasticsearch Index
Elasticsearch (Internals)
● Deeper into an Elasticsearch Index - Inverted Index
○ The quick brown fox jumped over the lazy dog
○ Quick brown foxes leap over lazy dogs in summer
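The two sentences above can be turned into a toy inverted index as follows. This is a minimal sketch with a simple lowercase tokenizer; real analyzers may also stem or normalise terms, which this example deliberately does not do.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


def build_inverted_index(docs: Dict[int, str]) -> Dict[str, Set[int]]:
    """Map each lowercased term to the set of document ids containing it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in re.findall(r"[a-z]+", text.lower()):
            index[term].add(doc_id)
    return index


def search(index: Dict[str, Set[int]], query: str) -> List[int]:
    """Return ids of documents containing every term in the query."""
    terms = re.findall(r"[a-z]+", query.lower())
    if not terms:
        return []
    matches = set.intersection(*(index.get(t, set()) for t in terms))
    return sorted(matches)


docs = {
    1: "The quick brown fox jumped over the lazy dog",
    2: "Quick brown foxes leap over lazy dogs in summer",
}
index = build_inverted_index(docs)
```

Looking up a term is a dictionary access, not a scan of every document. Without stemming, `fox` and `foxes` remain separate terms, so a search for `fox` finds only the first sentence.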
Elasticsearch (Internals)
Deeper into an Elasticsearch Index - Doc Values
● Stored in a column-oriented fashion, which is far more efficient for sorting and aggregations
● Filesystem optimized
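The column-oriented idea can be sketched as the reverse of the inverted index: for each field, a column holding one value per document, so that sorting and aggregating become sequential scans over a single column. This is a conceptual illustration only, not Elasticsearch's actual on-disk format; the documents and field names are hypothetical.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical documents, as they might arrive row by row.
docs = [
    {"event": "impression", "device": "mobile", "cost": 2},
    {"event": "click", "device": "mobile", "cost": 5},
    {"event": "impression", "device": "desktop", "cost": 3},
]


def to_columns(rows: List[dict]) -> Dict[str, list]:
    """Pivot rows into one list per field; position i holds document i's value."""
    fields = rows[0].keys() if rows else []
    return {field: [row[field] for row in rows] for field in fields}


columns = to_columns(docs)

# Aggregations and sorts touch only the columns they need.
events_by_type = Counter(columns["event"])
total_cost = sum(columns["cost"])
docs_by_cost = sorted(range(len(docs)), key=lambda i: columns["cost"][i])
```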
Amazon Elasticsearch (Advantages)
● Integration with AWS ecosystem
● Cluster Management (scale out/up)
● Monitoring & Alerts
● Snapshot Recovery / Backup to S3
● Elasticsearch Upgrades (could be made smoother)
Kinesis Data Firehose
Kinesis
Kinesis Data Firehose
● Streaming Data Processing
● Multiple destinations - S3, Redshift, ES
● Intermediate record transformations (using AWS Lambda) before delivery to the destination
○ IP2Location
○ Enrich flow
○ ua-parser
● Combine with Kinesis Analytics
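A record transformation of this kind can be sketched as a Lambda handler. Firehose passes records with a `recordId` and base64-encoded `data`, and expects each one back with a `result` of `Ok`, `Dropped` or `ProcessingFailed`. The enrichment itself is an assumption here: `classify_device` is a naive, hypothetical stand-in for a real user-agent parser, and the `user_agent` field name is illustrative.

```python
import base64
import json


def classify_device(user_agent: str) -> str:
    """Naive, hypothetical stand-in for a real user-agent parser."""
    return "mobile" if "Mobile" in user_agent else "desktop"


def handler(event, context):
    """Firehose data-transformation handler: decode, enrich, re-encode each record."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            payload["device"] = classify_device(payload.get("user_agent", ""))
            data = base64.b64encode((json.dumps(payload) + "\n").encode("utf-8")).decode("utf-8")
            output.append({"recordId": record["recordId"], "result": "Ok", "data": data})
        except (ValueError, TypeError):
            # Hand the record back untouched and flag it as failed.
            output.append({"recordId": record["recordId"], "result": "ProcessingFailed", "data": record["data"]})
    return {"records": output}
```

Every incoming `recordId` must appear in the response, which is why failed records are returned instead of being skipped.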
Kinesis Data Firehose (source)
Kinesis Data Firehose (transformation)
Kinesis Data Firehose (destination)
Kinesis Data Firehose (ES config options)
Kinesis Data Firehose (ES destination)
Node.js (tracker) >
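On the producer side, a tracker hands each event to a Firehose delivery stream. The talk's tracker is written in Node.js; the sketch below uses Python only to stay consistent with the other examples here. The stream name and event fields are hypothetical, and the boto3 call is left as a commented outline because it needs AWS credentials.

```python
import json
from typing import Dict


def to_firehose_record(event: Dict) -> Dict[str, bytes]:
    """Wrap one event in the Record shape expected by Firehose's PutRecord.

    A trailing newline is added so that records Firehose concatenates into
    S3 objects stay one JSON object per line.
    """
    return {"Data": (json.dumps(event) + "\n").encode("utf-8")}


record = to_firehose_record({"advertiser": "acme", "event": "click"})

# Outline of sending it (requires boto3 and AWS credentials):
#   import boto3
#   firehose = boto3.client("firehose")
#   firehose.put_record(DeliveryStreamName="example-tracking-stream", Record=record)
```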
Kinesis Data Firehose (Advantages)
● Cloud Offering
Source: https://blog.ippon.tech/spark-storm-s
xd-comparison/
Kinesis Data Firehose (Advantages)
● Pluggability
Source: https://www.slideshare.net/AmazonWebServices/aws-reinvent-2016-analyzing-streaming-data-in-realtime-with-amazon-kinesis-analytics-bdm304
Kinesis Data Firehose (Architecture)
Architecture (Old vs New)
Stats
● Data: ~12 GB/day (peaks of 32 GB/day)
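For a sense of scale, the daily volumes can be converted to average throughput. The sketch below assumes 1 GB = 10**9 bytes and an even spread over the day; real traffic is bursty, so instantaneous rates will be higher than these averages.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def avg_bytes_per_second(gb_per_day: float) -> float:
    """Average throughput for a daily volume, taking 1 GB as 10**9 bytes."""
    return gb_per_day * 10**9 / SECONDS_PER_DAY


typical = avg_bytes_per_second(12)   # roughly 139 KB/s
peak_day = avg_bytes_per_second(32)  # roughly 370 KB/s
```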
“The cloud is not a silver bullet”
silver bullet ~ noun
‘a simple and seemingly magical solution to a complicated problem’
Twitter - @ak47surve #awsblr #meetup
Email - akshay@deltax.com
Blog - engineering.deltax.com
Building a Real-time Stream Processing Pipeline - Kinesis Data Firehose, Amazon Elasticsearch, Amazon Athena