Building a Sustainable
Data Platform on AWS
Takumi Sakamoto
2016.01.27
Takumi Sakamoto
@takus
😍 = ⚽ ✈ 📷
http://bit.ly/1MCOyBX
JAWSDAYS 2015
Mentioned by @jeffbarr
https://twitter.com/jeffbarr/status/649575575787454464
http://www.slideshare.net/smartnews/smart-newss-journey-into-microservices
AWS Case Study
http://aws.amazon.com/solutions/case-studies/smartnews/
Data Platform at
SmartNews
What is SmartNews?
• News Discovery App
• Launched in 2012
• 15M+ Downloads Worldwide
https://www.smartnews.com/en/
Our Mission
Deliver the world's quality information
to the people who need it
How?
Machine Learning
URLs Found
Structure Analysis
Semantics Analysis
Importance Estimation
Diversification
Internet
100,000+ /day
1000+ /day
Feedback
Deliver
Trending Stories
Data Platform Use Cases
• Product development
• track KPIs such as DAU and MAU
• A/B tests for new features, on-boarding, etc.
• ad-hoc analysis
• Provide data to applications
• realtime re-ranking of news articles
• CTR prediction for the ads system
• dashboard service for media partners
Data & Its Numbers
• User activities
• ~100 GBs per day (compressed)
• 60+ record types
• User demographics, configurations, etc.
• 15M+ records
• Article metadata
• 100K+ records per day
Sustainable
Data Platform?
Sustainable Data Platform
• Provide a reliable and scalable "Lambda Architecture"
• Minimize both operation & running costs
• Be open to an uncertain future
Lambda Architecture
http://lambda-architecture.net/
Why Sustainable?
• Do a lot with a few engineers
• no one is a full-time maintainer
• avoid wasting too much time
• Empower brilliant engineers in SmartNews
• everything should be as self-serve as possible
• don't ask for permission, beg for forgiveness
System Design
λ Architecture at SmartNews
Input Batch Serving
Speed
Output
Design Principles
• Decoupled "Computation" and "Storage" layers
• multiple consumers can use the same data
• run consumers on Spot Instances
• prevent serious data loss with minimum effort
• Use the right tool for the job
• leverage AWS managed services where possible
• fill in the missing pieces with Presto & PipelineDB
An Example
Amazon EMR
AMI 3.x
Amazon S3
Amazon EMR
Hive
General
Users
Application
Engineer
I wanna
upgrade hive
Ad
Engineer
I wanna combine
news data with
ad data
Amazon EMR
AMI 4.x
Amazon EMR
Spark
We’re satisfied
with current
version
Data
Scientist
I wanna test my
algorithm with the
latest spark
Batch Layer
Run multiple EMR clusters, one per use case
Kinesis
Stream
Spark
on EMR
AWS
Lambda
Data
Scientist
I wanna consume
streaming data by
Spark
Application
Engineer
I wanna add a
streaming monitor
by Lambda
Speed Layer
Consume the same data across use cases
• AWS managed services
• Data replicated across multiple AZs
• High availability
Input Data
Collect Events by Fluentd
• Forwarder (running on each instance)
• store JSON events to S3
• forward events to aggregators
• collect metrics and post them to Datadog
• Aggregator
• input events into Kinesis & PipelineDB
• other reporting tasks (not mentioned today)
Forwards to S3
<source>
@type tail
format json
path /data/log/user_activity.log
pos_file /data/log/pos/user_activity.pos
tag smartnews.user_activity
time_key timestamp
</source>
<match smartnews.user_activity>
@type copy
<store>
@type relabel
@label @s3
</store>
<store>
@type forward
@label @forward
</store>
</match>
@include conf.d/s3.conf
@include conf.d/forward.conf
<label @s3>
<% node[:td_agent][:s3].each do |c| -%>
<match <%= c[:tag] %>>
@id s3.<%= c[:tag] %>
@type s3
...
path fluentd/<%= node[:env] %>/<%= node[:role] %>/<%= c[:tag] %>
time_slice_format dt=%Y-%m-%d/hh=%H
time_key timestamp
include_time_key
time_as_epoch
reduced_redundancy true
format json
utc
buffer_chunk_limit 2048m
</match>
<% end -%>
</label>
td-agent.conf conf.d/s3.conf
Capture DynamoDB Streams
<source>
type dynamodb_streams
stream_arn YOUR_DDB_STREAMS_ARN
pos_file /path/to/table.pos
fetch_interval 1
fetch_size 100
</source>
https://github.com/takus/fluent-plugin-dynamodb-streams
DynamoDB DynamoDB
Streams
Amazon S3
AWS
Lambda
Fluentd
Recommended Practices
• Keep configuration as simple as possible
• fluentd can cover everything, but shouldn't
• keep stateless
• Use v0.12 or later
• "Filter" : better performance
• "Label": eliminate 'output_tag' configuration
Monitor Fluentd Status
• Monitor traffic volume & retry count by Datadog
• Datadog's fluentd integration
• fluent-plugin-flowcounter
• fluent-plugin-dogstatsd
Archive to Amazon S3
• I have 2 recommended settings (boto3 sketch below)
• versioning
• enables recovery from human error
• lifecycle policy
• minimizes storage cost
Archive to IA or Glacier
xx days after the creation date
Keep previous versions for xx days
This will save you in the future!!
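Both settings can also be applied from code. Below is a minimal boto3 sketch; the bucket name, prefix, and retention periods are hypothetical placeholders, so adjust them to your own layout.

import boto3

s3 = boto3.client("s3")
bucket = "smartnews-logs"  # hypothetical bucket name

# Versioning: enables recovery from accidental overwrites or deletes
s3.put_bucket_versioning(
    Bucket=bucket,
    VersioningConfiguration={"Status": "Enabled"},
)

# Lifecycle policy: archive current objects to IA / Glacier,
# and expire previous versions after xx days
s3.put_bucket_lifecycle_configuration(
    Bucket=bucket,
    LifecycleConfiguration={
        "Rules": [{
            "ID": "archive-and-expire",
            "Filter": {"Prefix": "fluentd/"},
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
            "NoncurrentVersionExpiration": {"NoncurrentDays": 30},
        }]
    },
)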
Batch Layer
Various ETL Tasks
• Extract
• dump MySQL records by Embulk
• make files on S3 readable to Hive
• Transform
• transform text files into columnar files (RCFile, ORC)
• generate features for machine learning
• aggregate records (by country, by channel)
• Load
• load aggregated metrics into Amazon Aurora
Hive
• Most popular project in the Hadoop ecosystem
• famous for its lovely logo :)
• HiveQL and MapReduce
• converts SQL-like queries into MR jobs
• Haven't adopted the Tez engine yet
• Amazon EMR doesn't support it yet
• limited improvement to our queries
How to process JSON?
A. Transform into a columnar table periodically
• requires a conversion job
• better performance
B. Use JSON-SerDe for temporary analysis
• easy way to query raw JSON text files
• requires "drop table" to change the schema
• performance is not good
Transform Tables
-- Make S3 files readable by Hive
ALTER TABLE raw_activities ADD IF NOT EXISTS PARTITION
(dt='${DATE}', hh='${HOUR}');
-- Transform text files into columnar files (Flatten JSON)
INSERT OVERWRITE TABLE activities
PARTITION (dt='${DATE}', action)
SELECT
user_id, timestamp, os, country,
data,
action
FROM raw_activities
LATERAL VIEW json_tuple(
raw_activities.json,
'userId','timestamp','platform','country','action','data'
) a as user_id, timestamp, os, country, action, data
WHERE dt = '${DATE}'
CLUSTER BY os, country, action, user_id
;
JSON-SerDe
-- Define table with SERDE
CREATE TABLE json_table (
country string,
languages array<string>,
religions map<string,array<int>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE;
-- Result: 10
SELECT religions['catholic'][0] FROM json_table;
cf. hive-ruby-scripting
-- Define your ruby (JRuby) script
SET rb.script=
require 'json'
def parse (json)
j = JSON.load(json)
j['profile']['attribute1']
end
;
-- Use the script in HQL
SELECT rb_exec('&parse', json) FROM user;
https://github.com/gree/hive-ruby-scripting
Spark
http://www.slideshare.net/smartnews/aws-meetupapache-spark-on-emr
Self-Serve via AWS CLI
# Create an EMR cluster that runs Hive & Spark & Ganglia
aws emr create-cluster \
  --name "My Cluster" \
  --release-label emr-4.2.0 \
  --applications Name=Hive Name=Spark Name=GANGLIA \
  --ec2-attributes KeyName=myKey \
  --instance-type c3.4xlarge \
  --instance-count 4 \
  --use-default-roles
Minimize expenses
• Use Spot Instances where possible
• typically a 50-90% discount
• select instance types with stable prices
• C3 families spike often :(
• Dynamic cluster resizing (see the sketch below)
• 2x capacity during the daily batch job
• 1/2 capacity during midnight
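Resizing can be scripted as well. Here is a minimal boto3 sketch (the cluster id and instance counts are placeholders, not our production values) that resizes the TASK instance group, which is where the Spot Instances typically live.

import boto3

emr = boto3.client("emr")

def resize_task_group(cluster_id, instance_count):
    # Find the TASK instance group of the cluster and set its size
    groups = emr.list_instance_groups(ClusterId=cluster_id)["InstanceGroups"]
    task_group = next(g for g in groups if g["InstanceGroupType"] == "TASK")
    emr.modify_instance_groups(
        ClusterId=cluster_id,
        InstanceGroups=[{
            "InstanceGroupId": task_group["Id"],
            "InstanceCount": instance_count,
        }],
    )

resize_task_group("j-XXXXXXXXXXXXX", 20)   # 2x capacity before the daily batch job
# resize_task_group("j-XXXXXXXXXXXXX", 5)  # 1/2 capacity during midnight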
Handle Data Dependencies
Typical Anti-Pattern
5 * * * * app hive -f query_1.hql
15 * * * * app hive -f query_2.hql
30 * * * * app hive -f query_3.hql
0 * * * * app hive -f query_4.hql
1 * * * * app hive -f query_5.hql
Workflow Management
• Define dependencies
• task E is executed after task C and task D finish
• Scheduling
• task A is kicked off after 09:00 AM
• throttle concurrent runs of the same task
• Monitoring
• notification on failure
• task C must finish before 01:00 PM (SLA)
cf. http://www.slideshare.net/taroleo/workflow-hacks-1-dots-tokyo
Airflow
• A workflow management system
• define workflows in Python
• built-in shiny UI & CLI
• pluggable architecture
http://nerds.airbnb.com/airflow/
Define Tasks
dag = DAG('tutorial', default_args=default_args)
t1 = BashOperator(
task_id='print_date',
bash_command='date',
dag=dag)
t2 = BashOperator(
task_id='sleep',
bash_command='sleep 5',
retries=3,
dag=dag)
t3 = BashOperator(
task_id='templated',
bash_command="""
{% for i in range(5) %}
echo "{{ ds }}"
echo "{{ macros.ds_add(ds, 7)}}"
echo "{{ params.my_param }}"
{% endfor %}
""",
params={'my_param': 'Parameter I passed in'},
dag=dag)
t2.set_upstream(t1)
t3.set_upstream(t1)
Task
Dependencies
Python code
DAG
Workflow as Code
Deploy code automatically after merging into master
Visualize Dependencies
What is done or not?
Alerting to Slack
• SLA Violation (see the sketch below)
• task A should be done by 00:00 PM
• another team's task K depends on task A
• Output validation failure
• stop downstream tasks if the output is doubtful
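As a rough illustration, the Airflow wiring can look like the sketch below. The webhook URL, DAG, task, and SLA values are hypothetical; note that SLA misses are reported by Airflow's own SLA machinery, while on_failure_callback fires on task failures.

from datetime import datetime, timedelta
import json
import requests
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

SLACK_WEBHOOK = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # hypothetical webhook

def notify_slack(context):
    # Called by on_failure_callback when a task instance fails
    ti = context["task_instance"]
    text = "Task failed: %s.%s" % (ti.dag_id, ti.task_id)
    requests.post(SLACK_WEBHOOK, data=json.dumps({"text": text}))

default_args = {
    "start_date": datetime(2016, 1, 1),
    "on_failure_callback": notify_slack,
    "sla": timedelta(hours=13),  # e.g. a daily midnight run must finish before 01:00 PM
}

dag = DAG("example_etl", default_args=default_args, schedule_interval="@daily")
task_a = BashOperator(task_id="task_a", bash_command="hive -f query_a.hql", dag=dag)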
Retry from Web UI
Once histories are cleared, the Airflow scheduler backfills them
Retry from CLI
# Clear some histories from 2016-01-01
airflow clear etl_smartnews \
  --task_regex user_ \
  --downstream \
  --start_date 2016-01-01

# Backfill uncompleted tasks
airflow backfill etl_smartnews \
  --start_date 2016-01-01
Check Rendered Query
How Long Does Each Task Take?
Pluggable Architecture
• Built-in plugins
• operator: bash, hive, presto, mysql
• transfer: hive_to_mysql
• sensor: wait_hive_partition, wait_s3_file
• Wrote our own plugin (sketch below)
• mysql_partition
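For reference, a custom sensor is just a small Python class. The sketch below shows what a mysql_partition-style sensor might look like; it is an illustration under assumed names, not our production plugin, and import paths vary across Airflow versions.

from airflow.hooks.mysql_hook import MySqlHook
from airflow.operators.sensors import BaseSensorOperator
from airflow.utils import apply_defaults

class MySqlPartitionSensor(BaseSensorOperator):
    # Wait until a named partition appears on a MySQL table
    @apply_defaults
    def __init__(self, table, partition, mysql_conn_id="mysql_default", *args, **kwargs):
        super(MySqlPartitionSensor, self).__init__(*args, **kwargs)
        self.table = table
        self.partition = partition
        self.mysql_conn_id = mysql_conn_id

    def poke(self, context):
        hook = MySqlHook(mysql_conn_id=self.mysql_conn_id)
        sql = ("SELECT 1 FROM information_schema.PARTITIONS "
               "WHERE table_name = %s AND partition_name = %s LIMIT 1")
        return hook.get_first(sql, parameters=(self.table, self.partition)) is not None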
Examples
user_sensor = S3KeySensor(
    task_id='wait_user',
    bucket_name='smartnews',
    bucket_key='user/dt={{ ds }}/dump.csv',
)

etl = HiveOperator(
    task_id="task1",
    hql="INSERT OVERWRITE INTO...."
)
etl.set_upstream(user_sensor)

# "import" is a reserved word in Python, so give the transfer task another name
load = HiveToMySqlTransfer(
    task_id=name,
    mysql_preoperator="DELETE FROM %s WHERE date = '{{ ds }}'" % table,
    sql="SELECT country, count(*) FROM %s GROUP BY country" % table,
    mysql_table=table
)
load.set_upstream(etl)
Wait for an S3 file to be created.
After the file is created, run the ETL query.
After that, import into MySQL.
Serving Layer
Provides batch views
in a low-latency, ad-hoc way
Presto
• A distributed SQL query engine
• joins multiple data sources (Hive + MySQL)
• supports standard ANSI SQL
• designed to handle TB- to PB-scale data
cf. http://www.slideshare.net/frsyuki/presto-hadoop-conference-japan-2014
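Presto is also easy to query from scripts. Below is a minimal sketch using PyHive's presto client; the coordinator host, catalog, and table name are assumptions for illustration.

from pyhive import presto

conn = presto.connect(
    host="presto-coordinator.example.com",  # hypothetical coordinator host
    port=8080,
    catalog="hive",
    schema="default",
)
cursor = conn.cursor()
cursor.execute("""
    SELECT action, count(*) AS pv
    FROM user_activities
    WHERE dt = '2016-01-26'
    GROUP BY action
    ORDER BY pv DESC
    LIMIT 10
""")
for action, pv in cursor.fetchall():
    print(action, pv)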
Presto Architecture
Data sources: Amazon S3 / Kinesis Stream / Amazon RDS / Amazon Aurora
Presto Coordinator + multiple Presto Workers
Client
1. Query with Standard SQL
2. Generate execution plan
3. Dispatch tasks into multiple workers
4. Scan data concurrently
5. Aggregate data without disk I/O
6. Return result to client
Amazon EMR
(Hive Metastore)
Provides Hive table metadata
(S3 access only)
※ https://github.com/qubole/presto-kinesis
※
Why Presto?
• Join multiple data sources
• skip large parts of the ETL process
• can merge Hive/MySQL/Kinesis/PipelineDB
• Low latency
• ~30s to scan billions of records in S3
• Low maintenance cost
• stateless, and easy to integrate with Auto Scaling
Use case: A/B Test
-- Suppose that this table exists
DESC hive.default.user_activities;
user_id bigint
action varchar
abtest array<map<varchar, bigint>>
url varchar
-- Summarize page view per A/B Test identifier
-- for comparing two algorithms v1 & v2
SELECT
  dt,
  t['behaviorId'],
  count(*) as pv
FROM hive.default.user_activities CROSS JOIN UNNEST(abtest) AS t (t)
WHERE dt like '2016-01-%' AND action = 'viewArticle'
AND t['definitionId'] = 163
GROUP BY dt, t['behaviorId'] ORDER BY dt
;
2015-12-01 | algorithm_v1 | 40000
2015-12-01 | algorithm_v2 | 62000
Use case: Troubleshoot
-- Store access logs in S3, and query them
-- Summarize access count & 95th-percentile response time by SQL
SELECT
from_unixtime(timestamp),
count(*) as access,
approx_percentile(reqtime, 0.95) as pct95_reqtime
FROM hive.default.access_log
WHERE dt = '2015-11-04' AND hh = '13' AND role = 'xxx'
GROUP BY timestamp ORDER BY timestamp
;
2015-11-04 22:00:00.000 | 6377 | 0.522
2015-11-04 22:00:01.000 | 3580 | 0.422
Scheduled Auto Scaling
$ aws autoscaling describe-scheduled-actions
{
"ScheduledUpdateGroupActions": [
{
"DesiredCapacity": 2,
"AutoScalingGroupName": "presto-worker-prd",
"Recurrence": "59 14 * * *",
"ScheduledActionName": "scalein-2359-jst"
},
{
"DesiredCapacity": 20,
"AutoScalingGroupName": "presto-worker-prd",
"Recurrence": "45 0 * * 1-5",
"ScheduledActionName": "scaleout-0945-jst"
}
]
}
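The same scheduled actions can be registered from code; here is a minimal boto3 sketch using the group name and schedules shown above.

import boto3

autoscaling = boto3.client("autoscaling")

# Scale out to 20 workers at 09:45 JST on weekdays (Recurrence is cron in UTC)
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="presto-worker-prd",
    ScheduledActionName="scaleout-0945-jst",
    Recurrence="45 0 * * 1-5",
    DesiredCapacity=20,
)

# Scale in to 2 workers at 23:59 JST every day
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="presto-worker-prd",
    ScheduledActionName="scalein-2359-jst",
    Recurrence="59 14 * * *",
    DesiredCapacity=2,
)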
Presto Covers Everything? No!
• Fixed system on Amazon Aurora (or another RDB)
• provides KPIs for products & business
• requires high availability & low latency
• has no flexibility
• Ad-hoc system on Presto
• provides access to all datasets on the data platform
• requires high scalability
• has flexibility (joins various data sources)
Why Fixed vs Ad-hoc?
• Difficulties with an ad-hoc-only solution
• difficult to prevent heavy queries
• a large distinct count exhausts computing resources
• decreases Presto maintainability
Output Data
Chartio
• Dashboard as a Service
• helps businesses analyze and track their critical data
• an AWS partner (※)
• Combine multiple data sources in one dashboard
• Presto, MySQL, Redshift, BigQuery, Elasticsearch ...
• can join BigQuery + MySQL internally
• Easy to use for everyone
• everyone can make their own dashboard
• write SQL directly / generate queries by drag & drop
※ http://www.aws-partner-directory.com/PartnerDirectory/PartnerDetail?id=8959
Creating dashboard
1. Build query
(Drag & Drop / SQL)
2. Add steps
(filter, sort, modify)
3. Select visualization
(table, graph)
Examples
Why Chartio?
• Chartio saves a lot of engineering resources
• before
• maintained an in-house dashboard written in Rails
• everyone got tired of maintaining it
• after
• everyone can build their own dashboard easily
• Chartio's UI is cool
• a very important factor for a dashboard tool
Missing Pieces of Chartio
• No programmable API provided
• need to edit dashboards / charts manually
• No rollback feature
• all changes are recorded, but can't be rolled back to a
previous state
• workaround: clone => edit => rename
Speed Layer
Why Does Speed Matter?
Today’s News is Wrapping
Tomorrow’s Fish and Chips
↑
Yesterday's News
http://www.personalchefapproach.com/tomorrows-fish-n-chips-wrapper/
How Does News Behave?
https://gdsdata.blog.gov.uk/2013/10/22/the-half-life-of-news/
Use cases
• Re-rank news articles by user feedback
• track user's positive/negative signal
• consider gender, age, location, interests
• Realtime article monitoring
• detect a high bounce rate (the article may be broken?)
• make a realtime reporting dashboard for A/B tests
Realtime Re-Ranking
ref. Integration of stream processing (Spark Streaming + Kinesis) and offline processing (Hive)
www.slideshare.net/smartnews/stremspark-streaming-kinesisofflinehive
Amazon
CloudSearch
Search
API
API
Gateway
Kinesis
Stream
Amazon S3
Amazon EMR
Amazon S3 Amazon EMR
DynamoDB
Realtime
Feedback
Re-rank
Articles
Article
Metadata
User
Interests
User
Behaviors
Offline Process
by Hive / Spark
Realtime Monitoring
API
Gateway
Stream
Continuous
View
Continuous
View
Continuous
View
Discard raw records soon after they are
consumed by the Continuous View
Incrementally
updated in realtime
PipelineDB Chartio
AWS
Lambda
Slack
Access Continuous View
by PostgreSQL Client
Record
※1
※1
Using cron as of 26 Feb. 2016
Will migrate to Lambda soon after it supports VPC
PipelineDB
• OSS & enterprise streaming SQL database
• PostgreSQL compatible
• connect to Chartio 😍
• joins streams with normal PostgreSQL tables
• Support probabilistic data structures
• e.g. HyperLogLog
https://www.pipelinedb.com/
http://developer.smartnews.com/blog/2015/09/09/20150907pipelinedb/
Continuous View
-- Calculate unique users seen per media each day
-- Using only a constant amount of space (HyperLogLog)
CREATE CONTINUOUS VIEW uniques AS
SELECT
day(arrival_timestamp),
substring(url from '.*://([^/]*)') as hostname,
COUNT(DISTINCT user_id::integer)
FROM activity_stream GROUP BY day,hostname;
-- How many impressions have we served in the last five minutes?
CREATE CONTINUOUS VIEW imps WITH (max_age = '5 minutes') AS
SELECT COUNT(*) FROM imps_stream;
-- What are the 90th, 95th, 99th percentiles of request latency?
CREATE CONTINUOUS VIEW latency AS
SELECT
percentile_cont(array[90, 95, 99])
WITHIN GROUP (ORDER BY latency::integer)
FROM latency_stream;
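Since PipelineDB is PostgreSQL compatible, the continuous views above can be read with any PostgreSQL client; a minimal psycopg2 sketch (connection parameters are placeholders):

import psycopg2

conn = psycopg2.connect(
    host="pipelinedb.example.com",  # hypothetical host
    dbname="pipeline",
    user="analytics",
    password="secret",
)
cur = conn.cursor()
cur.execute("SELECT day, hostname, count FROM uniques ORDER BY day DESC LIMIT 10")
for day, hostname, uniques in cur.fetchall():
    print(day, hostname, uniques)
cur.close()
conn.close()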
Summary
Sustainable Data Platform
• build a reliable and scalable lambda architecture
• minimize operation & running costs
• be open to an uncertain future
My Wishlist to AWS
• Support Reduced Redundancy Storage (RRS) on EMR
• Faster EMR Launch
• Set TTL on DynamoDB records
• Auto-scale Kinesis Stream
• Launch Kinesis Analytics in Tokyo region
Thank you!!