Transforming Mobile 
Push Notifications with 
Big Data 
Dennis Waldron, Data Engineering 
Pablo Varela, Systems Engineering
Who is Plumbee? 
● 12.8M Installs 
● 209K Daily Active Users 
● 818K Monthly Active Users 
● Social Games Studio 
● Mirrorball Slots & Bingo 
● Facebook Canvas, iOS
Data Providers 
In-house data = 99.9% of all data 
In Total: 
● 98TB (907 days of data) 
● All stored in Amazon S3 
Daily: 
● 78GB compressed 
● ~450M events/day 
● 4,800 events/second (peak)
Architecture - Overview 
End Users (Desktop & Mobile) -> Application/Game Servers -> Events (JSON) -> SQS Analytics Queue -> Log Aggregators -> Events (JSON) -> Amazon S3 (Simple Storage Service) -> Daily Batch Processing on Amazon EMR (Elastic MapReduce), orchestrated by DataPipeline -> Aggregates -> Amazon Redshift -> Analytics (SQL Queries) -> Plumbee Employees
Amazon Web Service 
Application/Game Servers 
End Users (Desktop & Mobile) 
● Collect everything! 
● RPC events intercepted by 
annotated endpoints. (Requests) 
● All mutating state changes 
recorded: 
○ DynamoDB, MySQL, Memcache 
(Blobs Updates) 
● Custom Telemetry (Other): 
○ Client: click tracking, loading time 
statistics, GPU data... 
○ Server: promotions, transactions, 
Facebook user data... 
Game Data (chart): generated by RPC, MySQL, MemCache, and DynamoDB; slice labels 77%, 9%, OTHER 15%
Game Data - Example RPC Endpoint Annotation 
/**
 * Example annotation
 */
@SQSRequestLog(requestMessage = SpinRequest.class)
@RequestMapping("/spin")
public SpinResponse spin(SpinRequest spinRequest) {
    …
}
Example Event - userStats 
● All events are recorded in JSON. 
● Structure: 
○ Headers 
○ Categorization Data (metadata) 
○ Payload (message) 
● Important Headers: 
○ timestamp 
○ testVariant 
○ plumbeeUid
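A minimal sketch of how such an event might be assembled, written in Python for illustration. Only the `timestamp`, `testVariant`, and `plumbeeUid` header names come from the slide; the `metadata` and `message` key names, the payload fields, and the `build_event` helper are assumptions.

```python
import json
import time


def build_event(plumbee_uid, test_variant, event_type, sub_type, payload):
    """Assemble an event with headers, categorization metadata, and a payload."""
    return {
        # Headers named on the slide.
        "timestamp": int(time.time() * 1000),
        "testVariant": test_variant,
        "plumbeeUid": plumbee_uid,
        # Categorization data (key names are hypothetical).
        "metadata": {"type": event_type, "subType": sub_type},
        # Payload (contents are hypothetical).
        "message": payload,
    }


event = build_event(466264, "A", "rpc", "rpc-spin", {"bet": 100})
line = json.dumps(event)  # one JSON document per line
```

Keeping the headers at the top level means a downstream job can read the user, the time, and the test variant without knowing anything about the payload.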
Architecture - Collection 
End Users (Desktop & Mobile) -> Application/Game Servers -> Events (JSON) -> SQS Analytics Queue -> Log Aggregators -> Events (JSON) -> Amazon S3 (Simple Storage Service) -> Daily Batch Processing on Amazon EMR (Elastic MapReduce), orchestrated by DataPipeline -> Aggregates -> Amazon Redshift -> Analytics (SQL Queries) -> Plumbee Employees
Data Collection (I) - PUT 
Application/Game Servers 
Events (JSON) 
SQS Queue 
Log Aggregators 
Producers Consumers 
What is SQS (Simple Queue Service)? 
A cloud-based message queue for transmitting 
messages between producers and consumers 
SQS Provides: 
● ACK/FAIL semantics 
● Unlimited number of messages 
● Scales transparently 
● Buffer zone
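The ACK/FAIL behaviour can be pictured with a toy in-memory queue. This is a sketch of the semantics only, not the SQS API: the `ToyQueue` class and its method names are invented for illustration.

```python
from collections import deque


class ToyQueue:
    """In-memory stand-in for a message queue with ACK/FAIL semantics."""

    def __init__(self):
        self._ready = deque()
        self._in_flight = {}
        self._next_receipt = 0

    def put(self, body):
        self._ready.append(body)

    def get(self):
        """Hand out the next message; it stays in flight until acked or failed."""
        if not self._ready:
            return None
        body = self._ready.popleft()
        self._next_receipt += 1
        self._in_flight[self._next_receipt] = body
        return self._next_receipt, body

    def ack(self, receipt):
        """Consumer succeeded: drop the message for good."""
        del self._in_flight[receipt]

    def fail(self, receipt):
        """Consumer failed: make the message available again."""
        self._ready.append(self._in_flight.pop(receipt))


queue = ToyQueue()
queue.put('{"plumbeeUid": 466264}')
receipt, body = queue.get()
queue.fail(receipt)          # simulated consumer crash, message is redelivered
receipt, body = queue.get()
queue.ack(receipt)           # processed successfully, message is gone
```

Because a message is only removed on an explicit acknowledgement, a consumer that crashes mid-way does not lose data; the queue acts as the buffer zone between producers and consumers.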
Data Collection (II) - GET 
SQS Queue 
What is Apache Flume? 
A distributed, reliable, and available service 
for efficiently collecting, aggregating, and 
moving large amounts of log data 
Apache Flume 
Consumers 
Amazon S3 
(Simple Storage Service) 
S3 Data: 
● Partitioned by: date / type / sub_type 
● Compressed with: Snappy 
● Aggregated in 512MB chunks
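As a sketch of the date / type / sub_type partitioning, the helper below builds a key prefix for one partition. The exact directory layout and the `partition_prefix` name are assumptions; the slide only names the three partition levels.

```python
from datetime import date


def partition_prefix(event_date, event_type, sub_type):
    """Build a date / type / sub_type prefix (the exact layout is an assumption)."""
    return "{}/{}/{}/".format(event_date.isoformat(), event_type, sub_type)


prefix = partition_prefix(date(2014, 11, 18), "rpc", "rpc-spin")
```

A layout like this lets a batch job read one day of one event type by listing a single prefix instead of scanning the whole bucket.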
Data Collection (III) - Flume 
Flume Agent 
Source 
(Custom) 
Sink 
(HDFS) 
SQS Queue 
Channel 
(File Based) 
● Pluggable component architecture 
● Durability via transactions 
● File channel uses Elastic Block Store (EBS) volumes (network attached storage) 
○ Protects against hardware failure 
● SQS Flume Plugin: https://github.com/plumbee/flume-sqs-source 
S3 Bucket 
Transactions 
A + B + C = Flow 
A B C
Architecture - Processing 
End Users (Desktop & Mobile) -> Application/Game Servers -> Events (JSON) -> SQS Analytics Queue -> Log Aggregators -> Events (JSON) -> Amazon S3 (Simple Storage Service) -> Daily Batch Processing on Amazon EMR (Elastic MapReduce), orchestrated by DataPipeline -> Aggregates -> Amazon Redshift -> Analytics (SQL Queries) -> Plumbee Employees
Extract, Transform, Load 
● Daily activity 
● Orchestrated by Amazon DataPipeline 
● Includes generation of reports 
● Configured with JSON 
What is DataPipeline? 
A cloud-based data workflow service that 
helps you process and move data between 
different AWS services 
RESOURCE COMMAND SCHEDULE
Extract & Transform (I) 
What is Elastic Map Reduce? 
Cloud-based MapReduce implementation to 
process vast amounts of data built on top of 
the open-sourced Hadoop framework. 
Two phases: 
● Map() Procedure -> Filtering & Sorting 
● Reduce() -> Summary operation 
RAW DATA: Penguin, Horse, Cake, Cake, Penguin, Penguin, Penguin, Horse, Horse 
MAP() -> SORTED QUEUES: Cake, Cake | Horse, Horse, Horse | Penguin, Penguin, Penguin, Penguin 
REDUCE() -> RESULT: Cake: 2, Horse: 3, Penguin: 4
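The two phases can be sketched in plain Python using the same Penguin/Horse/Cake data. This is a single-process illustration of the idea, not how EMR or Hadoop is invoked; the function names are chosen here for clarity.

```python
from collections import defaultdict

raw_data = ["Penguin", "Horse", "Cake", "Cake", "Penguin",
            "Penguin", "Penguin", "Horse", "Horse"]


def map_phase(records):
    """Emit a (key, 1) pair for every record."""
    return [(record, 1) for record in records]


def shuffle(pairs):
    """Group values by key, visiting keys in sorted order."""
    queues = defaultdict(list)
    for key, value in sorted(pairs):
        queues[key].append(value)
    return queues


def reduce_phase(queues):
    """Summarize each key's queue by summing its values."""
    return {key: sum(values) for key, values in queues.items()}


result = reduce_phase(shuffle(map_phase(raw_data)))
# result == {"Cake": 2, "Horse": 3, "Penguin": 4}
```

In a real cluster the map and reduce steps run on many machines at once, and the shuffle moves each key's queue to the reducer responsible for it.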
Extract & Transform (II) 
What is Hive? 
An open-source Apache project which provides a 
SQL-like interface to summarize, query, and 
analyze large datasets by leveraging Hadoop’s 
MapReduce infrastructure. 
● Not really SQL, HQL -> HiveQL 
● No transactions, materialized views, 
limited subquery support, ... 
SELECT plumbeeuid, 
COUNT(*) AS spins 
FROM eventlog 
-- Partitioned data access 
WHERE event_date = '2014-11-18' 
AND event_type = 'rpc' 
AND event_sub_type = 'rpc-spin' 
-- Aggregation 
GROUP BY plumbeeuid; 
Table: Eventlog 
● Mounted on top of raw data 
● SerDe provides JSON parsing 
● Target data via partition filters
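The effect of a partition filter can be sketched as a selection over partition names: only matching partitions are read at all. The partition list below is made up for illustration (only `rpc` / `rpc-spin` appears in the query above), and `prune` is a hypothetical helper, not part of Hive.

```python
partitions = [
    "2014-11-17/rpc/rpc-spin",
    "2014-11-18/rpc/rpc-spin",
    "2014-11-18/rpc/rpc-login",
    "2014-11-18/telemetry/click",
]


def prune(partitions, event_date, event_type, event_sub_type):
    """Keep only the partitions whose three components match the filters."""
    wanted = (event_date, event_type, event_sub_type)
    return [p for p in partitions if tuple(p.split("/")) == wanted]


selected = prune(partitions, "2014-11-18", "rpc", "rpc-spin")
# selected == ["2014-11-18/rpc/rpc-spin"]
```

With 907 days of data in storage, restricting a query to one day and one sub-type this way avoids touching the vast majority of the files.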
Extract & Transform (III) 
● Hive has limitations! 
○ Speed, JSON 
● Most of our transformations use: 
Streaming MapReduce Jobs 
What is Streaming? 
“A Hadoop utility that allows you to create 
and run MapReduce jobs using any 
executable script as a mapper or reducer” 
import json
import sys

for line in sys.stdin:
    data = json.loads(line)
    # Emit "<plumbeeUid><TAB>1" for every event.
    print('%s\t%d' % (data['plumbeeUid'], 1))
Emits key-value pairs 
466264 => 1, 376166 => 1 
983131 => 1, 466264 => 1 
Hadoop sorts and shuffles the data making sure 
matching keys are processed by a single reducer! 
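import sys
from collections import defaultdict

results = defaultdict(int)
for line in sys.stdin:
    # Each input line is "<plumbeeUid><TAB><count>".
    plumbee_uid, count = line.rstrip('\n').split('\t')
    results[plumbee_uid] += int(count)
print(dict(results))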
JSON rpc-spin 
Data 
Result: 
{ 466264: 2, 376166: 1, 983131: 1 } 
map() 
reduce()
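Because Hadoop delivers matching keys to a reducer as adjacent, sorted lines, a reducer does not have to hold every key in memory. The sketch below simulates the map, sort, and reduce steps locally with the sample UIDs; the local `sorted` call stands in for Hadoop's sort and shuffle, and the function names are chosen here for illustration.

```python
from itertools import groupby


def mapper(lines):
    """Emit 'uid<TAB>1' for each input UID (stands in for the JSON mapper)."""
    for uid in lines:
        yield "{}\t1".format(uid)


def reducer(sorted_lines):
    """Sum counts per key, relying on equal keys being adjacent."""
    pairs = (line.split("\t") for line in sorted_lines)
    for uid, group in groupby(pairs, key=lambda pair: pair[0]):
        yield uid, sum(int(count) for _, count in group)


mapped = mapper(["466264", "376166", "983131", "466264"])
shuffled = sorted(mapped)          # local stand-in for Hadoop's sort and shuffle
totals = dict(reducer(shuffled))
# totals == {"376166": 1, "466264": 2, "983131": 1}
```

The same pipeline can be tried on a workstation by piping a sample file through the mapper script, `sort`, and the reducer script before submitting the job to a cluster.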
Results 
Load (I) - Problem 
Raw S3 JSON Data Aggregated Data 
EMR Transformed data: 
● Referred to as aggregates 
● Stored in S3 
● Accessible via EMR cluster 
EMR Transformation 
(Hive & Streaming Jobs) 
5.4TB 
Problem 
● We don’t run long-lived EMR clusters. 
EMR: 
● Requires specialist knowledge 
● Is slow: booting and processing happen “offline”. 
Use Amazon Redshift for fast “online” data access
What is Redshift? 
A column-oriented database which uses 
Massively Parallel Processing (MPP) techniques 
to support analytics style SQL based 
workloads across large datasets. 
Power comes from: 
● Query parallelization 
● Column-oriented design 
Redshift Provides: 
● Low latency JDBC and ODBC access 
● Fault Tolerance 
● Automated Backups 
Load (II) - Redshift 
Redshift (x3 nodes): 0.33s 
EMR (x20 nodes): 135.46s
Load (II) - Column-Oriented Databases 
Row-oriented Database - MySQL 
ID  First Name  Last Name  Country 
1   Penguin     Situation  GB 
2   Cheese      Labs       US 
3   Horse       Barracks   GB 
● Easy to add/modify records 
● Could read irrelevant data 
● Great for fast lookups (OLTP) 
Column-oriented Database - Redshift 
ID  First Name  Last Name  Country 
1   Penguin     Situation  GB 
2   Cheese      Labs       US 
3   Horse       Barracks   GB 
● Only reads in relevant data 
● Adding rows requires multiple updates to column data 
● Great for aggregation queries (OLAP)
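The trade-off can be sketched with the same three records held in two layouts. This is an in-memory illustration only, not how MySQL or Redshift store data on disk; the fourth record added at the end is invented for the example.

```python
rows = [
    {"id": 1, "first_name": "Penguin", "last_name": "Situation", "country": "GB"},
    {"id": 2, "first_name": "Cheese", "last_name": "Labs", "country": "US"},
    {"id": 3, "first_name": "Horse", "last_name": "Barracks", "country": "GB"},
]

# Column-oriented layout: one list per column, aligned by position.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# Row store: every whole record is touched, though only `country` matters.
gb_from_rows = sum(1 for row in rows if row["country"] == "GB")

# Column store: only the `country` column is read.
gb_from_columns = columns["country"].count("GB")

# Adding a record is one append for rows, but one append per column here.
new_row = {"id": 4, "first_name": "Cake", "last_name": "Shop", "country": "US"}
rows.append(new_row)
for name, value in new_row.items():
    columns[name].append(value)
```

Both counts come out as 2, but the column layout answered the question from a single list, which is why aggregation queries favour it while single-record writes favour rows.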
Architecture - Revisit 
End Users (Desktop & Mobile) -> Application/Game Servers -> Events (JSON) -> SQS Analytics Queue -> Log Aggregators -> Events (JSON) -> Amazon S3 (Simple Storage Service) -> Daily Batch Processing on Amazon EMR (Elastic MapReduce), orchestrated by DataPipeline -> Aggregates -> Amazon Redshift -> Analytics (SQL Queries) -> Plumbee Employees
Q&A
Targeted Push 
Notifications
Mirrorball Slots: Kingdom of Riches
Mirrorball Slots: Challenges 
● recurring timed event 
● collect symbols from non-winning 
spins 
● get free coins if enough symbols are 
collected
Some players ask for notifications
Use Cases
Building blocks
Data Collection
Data Collection 
Players 
Amazon Redshift
Architecture - Overview 
Amazon Redshift 
Amazon S3 
Trigger Publisher Segmentation Workers 
Batch Processors Amazon SNS 
Players 
Targeting 
Mobile Push
User Targeting
User targeting 
Run SQL queries directly against Redshift 
SQL Query 
Amazon Redshift User Segment
User targeting: Query example 
-- Target all mobile users 
SELECT plumbee_uid, arn 
FROM mobile_user
User targeting: Query example (II) 
-- Target lapsed users (1 week lapse) 
SELECT plumbee_uid, arn 
FROM mobile_user 
WHERE last_play_time < DATEADD(day, -7, GETDATE());
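The lapsed-user segment can be tried locally with an in-memory SQLite database standing in for Redshift. The table and column names follow the slide; the sample rows, the `arn:example:*` values, and the fixed `now` are assumptions made so the example is repeatable.

```python
import sqlite3
from datetime import datetime, timedelta

now = datetime(2014, 11, 18, 12, 0, 0)  # fixed "now" so the example is repeatable

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mobile_user (plumbee_uid INTEGER, arn TEXT, last_play_time TEXT)"
)
conn.executemany(
    "INSERT INTO mobile_user VALUES (?, ?, ?)",
    [
        (466264, "arn:example:1", (now - timedelta(days=10)).isoformat(" ")),
        (376166, "arn:example:2", (now - timedelta(days=1)).isoformat(" ")),
    ],
)

# Users whose last play was more than seven days before "now".
cutoff = (now - timedelta(days=7)).isoformat(" ")
segment = conn.execute(
    "SELECT plumbee_uid, arn FROM mobile_user WHERE last_play_time < ?",
    (cutoff,),
).fetchall()
# segment == [(466264, "arn:example:1")]
```

Only the user who last played ten days ago lands in the segment; the user who played yesterday is left alone.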
Demo (I) 
Mobile MBS Notifications
Architecture - Mobile Push 
Amazon Redshift 
Amazon S3 
Trigger Publisher Segmentation Workers 
Batch Processors Amazon SNS 
Players 
Targeting 
Mobile Push
Amazon Simple 
Notification Service
What is SNS? 
“Amazon Simple Notification Service (Amazon 
SNS) is a fast, flexible, fully managed push 
messaging service”
Amazon SNS
Amazon SNS
Amazon SNS: Device Registration 
Players Game Servers SQS Analytics Queue Amazon Redshift 
Amazon SNS 
register device 
event 
register
Amazon SNS: ARN Retrieval 
private String getArnForDeviceEndpoint(String platformApplicationArn, String deviceToken) {
    CreatePlatformEndpointRequest request =
            new CreatePlatformEndpointRequest()
                    .withPlatformApplicationArn(platformApplicationArn)
                    .withToken(deviceToken);
    CreatePlatformEndpointResult result = snsClient.createPlatformEndpoint(request);
    return result.getEndpointArn();
}
Amazon SNS: Analytics Event 
private String registerEndpointForApplicationAndPlatform(final long plumbeeUid,
        String platformARN, String platformToken) {
    final String deviceEndpointARN = getArnForDeviceEndpoint(platformARN, platformToken);
    sqsLogger.queueMessage(new HashMap<String, Object>() {{
        put("notification", "register");
        put("plumbeeUid", plumbeeUid);
        put("provider", platformName);
        put("endpoint", deviceEndpointARN);
    }}, null);
    return deviceEndpointARN;
}
Amazon SNS: Mobile Push 
private void publishMessage(UserData userData, String jsonPayload) {
    amazonSNS.publish(new PublishRequest()
            .withTargetArn(userData.getEndpoint())
            .withMessageStructure("json")
            .withMessage(jsonPayload));
}
Payload example 
{"default": "The 5 day Halloween Challenge has started today! Touch to play NOW!"}
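With the `json` message structure, the payload is a JSON object that must carry a `default` entry and may add per-platform entries whose values are themselves JSON-encoded strings. The sketch below builds such a payload in Python; the `build_push_payload` helper and the choice of `APNS` and `GCM` bodies are assumptions for illustration, since the slide shows only the `default` key.

```python
import json


def build_push_payload(text):
    """Build an SNS-style JSON message with a default and per-platform bodies."""
    return json.dumps({
        "default": text,
        # Platform values are themselves JSON-encoded strings.
        "APNS": json.dumps({"aps": {"alert": text}}),
        "GCM": json.dumps({"data": {"message": text}}),
    })


json_payload = build_push_payload("The 5 day Halloween Challenge has started today!")
```

The resulting string is what would be passed as the message body; any platform without its own entry falls back to the `default` text.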
Architecture - Orchestration 
Amazon Redshift 
Amazon S3 
Trigger Publisher Segmentation Workers 
Batch Processors Amazon SNS 
Players 
Targeting 
Mobile Push
Amazon Simple Workflow
What is Amazon SWF? 
“Amazon Simple Workflow (Amazon SWF) is a 
task coordination and state management 
service for cloud applications.”
What Amazon SWF provides 
● consistent execution state management 
● workflow executions and tasks tracking 
● non-duplicated dispatch of tasks 
● task routing and queuing 
● the AWS Flow Framework
Architecture - Orchestration 
Amazon Redshift 
Amazon S3 
Trigger Publisher Segmentation Workers 
Batch Processors Amazon SNS 
Players 
Targeting 
Mobile Push
Mobile Push: Scheduling 
Trigger Publish Service Amazon 
Simple Workflow
Mobile Push: Targeting 
query query 
target 
users 
Amazon SWF 
Amazon EC2 
Worker 
(Segmentation) 
Amazon 
Redshift 
Amazon 
S3
Mobile Push: Processing 
batch 1-N publish push 
Workers 
(Processing) 
Amazon SWF Read data + push End User
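Splitting a user segment into batches 1-N for the processing workers can be sketched as a simple chunking step. The batch size, the `batches` helper, and the sample `arn:example:*` values are assumptions; the slides do not state how batches are sized.

```python
def batches(users, batch_size):
    """Yield consecutive slices of at most batch_size users."""
    for start in range(0, len(users), batch_size):
        yield users[start:start + batch_size]


segment = [(uid, "arn:example:{}".format(uid)) for uid in range(1, 8)]
work_items = list(batches(segment, 3))
# Seven users with a batch size of three give batches of 3, 3, and 1.
```

Each batch can then be handed to a separate worker, so a failure affects one batch rather than the whole campaign.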
Mobile Push: Reporting 
send send 
Amazon SWF 
Amazon EC2 
Worker 
(Reporting) 
Amazon 
SES
Demo (II)
Q&A

