Big data on_aws in korea by abhishek sinha (lunch and learn)

Big Data Analytics
Abhishek Sinha
Business Development Manager,
AWS
@abysinha
sinhaar@amazon.com

An engineer’s definition
When your data sets become so large that you have to start
innovating how to collect, store, organize, analyze and
share it

What does big data look like ?

Where is this data coming from ?

Human generated
Machine generated
Tweet
Surf the internet
Buy and sell products
Upload images and videos
Play games
Check in at restaurants
Search for cafes
Find deals
Watch content online
Look for directions
Use social media

Human generated
Machine generated
Networks and security devices
Mobile phones
Cell phone towers
Smart grids
Smart meters
Telematics from cars
Sensors on machines
Videos from traffic and security
cameras

What are people using this for ?

Big Data Verticals and Use cases
Media/Advertising
Targeted
Advertising
Image and
Video
Processing
Oil & Gas
Seismic
Analysis
Retail
Recommendati
ons
Transactions
Analysis
Life Sciences
Genome
Analysis
Financial
Services
Monte Carlo
Simulations
Risk
Analysis
Security
Anti-virus
Fraud
Detection
Image
Recognition
Social
Network/Gaming
User
Demographi
cs
Usage
analysis
In-game
metrics

Generation
Collection & storage
Analytics & computation
Collaboration & sharing

Generation
Lower cost,
higher throughput

Generation
Highly
constrained
Lower cost,
higher throughput

Big Gap in turning data into actionable
information

Amazon Web Services helps remove constraints

Big Data + Cloud = Awesome Combination
Big data:
• Potentially massive datasets
• Iterative, experimental style
of data manipulation and
analysis
• Frequently not a steady-state
workload; peaks and valleys
• Data is a combination of
structured and unstructured
data in many formats
AWS Cloud:
• Massive, virtually unlimited
capacity
• Iterative, experimental style of
infrastructure deployment/usage
• At its most efficient with highly
variable workloads
• Tools for managing structured
and unstructured data

Data size
• Global reach
• Native app for almost every smartphone, SMS, web, mobile-web
• 10M+ users, 15M+ venues, ~1B check-ins
• Terabytes of log data

Stack
ApplicationStack
Scala/Liftweb API Machines WWW Machines Batch Jobs
Scala Application code
Mongo/Postgres/Flat
Files
Databases LogsDataStack
Amazon S3 Database Dumps Log Files
Hadoop Elastic Map Reduce
Hive/Ruby/Mahout Analytics Dashboard Map Reduce Jobs
mongoexport
postgres dump
Flume

Stack – Front end Application
ApplicationStack
Mongo/Postgres/Flat
Files
mongoexport
postgres dump
Flume

Stack – Collection and Storage
ApplicationStack
Mongo/Postgres/Flat
Files
mongoexport
postgres dump
Flume

Stack – analysis and sharing
ApplicationStack
Mongo/Postgres/Flat
Files
mongoexport
postgres dump
Flume

“Who is using our
service?”

Identified early mobile usage
Invested heavily in mobile development
Finding signal in the noise of logs

9,432,061 unique mobile devices
used the Yelp mobile app.
4 million+ calls. 5 million+ directions.
In January 2013

Autocomplete Search
Recommendations
Automatic spelling
corrections

“What kind of movies do people
like ?”

More than 25 Million Streaming Members
50 Billion Events Per Day
30 Million plays every day
2 billion hours of video in 3 months
4 million ratings per day
3 million searches
Device location , time , day,
week etc.
Social data

10 TB of streaming data per day

Data consumed in multiple ways
S3
EMR
Prod Cluster
(EMR)
Recommendati
on Engine
Ad-hoc
Analysis
Personalization

AWS
Import/Export
Corporate
data center
Amazon
Elastic
MapReduce
Amazon
Simple
Storage
Service (S3)
BI Users
Clickstream data from
500+ websites and
VoD platform

Who is Razorfish
• Full service Digital Agency
• Developed an Ad-Serving Platform compatible with most browsers
• Clickstream analysis of data , current historical trends and segmentation of
users
• Segmentation is used to serve ads and cross sell
• 45TB of Log data
• Problems at scale
– Giant Datasets
– Building Infrastructure requires large continuous investment
– Build for peak holiday season
– Traditional Data stores are not scaling

3.5 billion records
13 TB of click stream logs
71 million unique cookies
Per day:

This happens in 8 hours everyday

Why AWS + EMR
• Prefect Clarity of Cost
• No upfront infrastructure investment
• No client processing contention
• Without EMR/Hadoop it takes 3 days , with EMR 8 hours
– Scalability 1 node x 100 hours = 100 nodes x 1 hour
• Meet SLA

Playfish improves in-game experience for its users
through data mining
Challenge:
Must understand player usage trends across
50M month users, multiple platforms, 10s of
games, and in the face of rapid growth. This
drives both in-game improvements and
defines what games to target next.
Solution:
EMR provides Playfish the flexibility to
experiment and rapidly ask new questions.
All usage data is stored in S3 and analysts
run ad-hoc hive queries that can slice the
data by time, game, and user.

Data Driven Game Design
Data is being used to understand what gamers are doing
inside the game (behavioral analysis)
- What features people like (rely on data instead of
forum posts)
- What features are abandoned
- A/B testing
- Monetization – In Game Analytics

Building a big data architecture
Design Patterns

Getting your Data into AWS
Amazon S3
Corporate Data
Center
• Console Upload
• FTP
• AWS Import Export
• S3 API
• Direct Connect
• Storage Gateway
• 3rd Party Commercial Apps
• Tsunami UDP
1

Write directly to a data source
Your application Amazon S3
DynamoDB
Any other data
store
Amazon S3
Amazon EC2
2

Queue , pre-process and then write to data source
Amazon Simple
Queue Service
(SQS)
Amazon S3
DynamoDB
Any other data
store
3

Agency Customer: Video Analytics on AWS
Elastic Load
Balancer
Edge Servers
on EC2
Workers on
EC2
Logs Reports
HDFS Cluster
Amazon Simple Queue
Service (SQS)
Amazon Simple Storage Service
(S3)
Amazon Elastic MapReduce

Aggregate and write to data source
Flume running
on EC2
Amazon S3
Any other data
store
HDFS
4

What is Flume
• Collection, Aggregation of streaming Event Data
– Typically used for log data, sensor data , GPS data etc
• Significant advantages over ad-hoc solutions
– Reliable, Scalable, Manageable, Customizable and High Performance
– Declarative, Dynamic Configuration
– Contextual Routing
– Feature rich
– Fully extensible

Typical Aggregation Flow
[Client]+  Agent [ Agent]*  Destination
Flume uses a multi-tier approach where multiple agents can send data to
another agent which acts as a aggregator. For each agent , data can from
either an agent or a client or can be sent to another agent or a sink

Courtesy http://techblog.netflix.com/2013/01/hadoop-platform-as-service-in-cloud.html
S3 as a “single source of truth”
S3

Amazon SQS
Amazon S3
DynamoDB
Any SQL or NO SQL
Store
Log Aggregation
tools
Choose depending upon design

Choice of storage systems (Structure and Volume)
Structure
LowHigh
Large
Small
Size
S3
RDS
Dynamo DB
NoSQL
EBS
1

Hadoop based Analysis
Amazon SQS
Amazon S3
DynamoDB
Any SQL or NO SQL
Store
Log Aggregation
tools
Amazon
EMR

EMR is Hadoop in the Cloud
What is Amazon Elastic MapReduce (EMR)?

A framework
Splits data into pieces
Lets processing occur
Gathers the results

Difficulty
Number of Machines
1
1

Difficulty
Number of Machines
1
1
106
2

distributed computing
requires god-like engineers

Hadoop is…
The MapReduce computational paradigm

Hadoop is…
The MapReduce computational paradigm
… implemented as an Open-source, Scalable,
Fault-tolerant, Distributed System

Person Start End
Bob 00:44:48 00:45:11
Charlie 02:16:02 02:16:18
Charlie 11:16:59 11:17:17
Charlie 11:17:24 11:17:38
Bob 11:23:10 11:23:25
Alice 16:26:46 16:26:54
David 17:20:28 17:20:45
Alice 18:16:53 18:17:00
Charlie 19:33:44 19:33:59
Bob 21:13:32 21:13:43
David 22:36:22 22:36:34
Alice 23:42:01 23:42:11

Person Start End Duration
Bob 00:44:48 00:45:11
Charlie 02:16:02 02:16:18
Charlie 11:16:59 11:17:17
Charlie 11:17:24 11:17:38
Bob 11:23:10 11:23:25
Alice 16:26:46 16:26:54
David 17:20:28 17:20:45
Alice 18:16:53 18:17:00
Charlie 19:33:44 19:33:59
Bob 21:13:32 21:13:43
David 22:36:22 22:36:34
Alice 23:42:01 23:42:11

Bob 00:44:48 00:45:11 23
Charlie 02:16:02 02:16:18
Charlie 11:16:59 11:17:17
Charlie 11:17:24 11:17:38
Bob 11:23:10 11:23:25
Alice 16:26:46 16:26:54
David 17:20:28 17:20:45
Alice 18:16:53 18:17:00
Charlie 19:33:44 19:33:59
Bob 21:13:32 21:13:43
David 22:36:22 22:36:34
Alice 23:42:01 23:42:11

Bob 00:44:48 00:45:11 23
Charlie 02:16:02 02:16:18 16
Charlie 11:16:59 11:17:17
Charlie 11:17:24 11:17:38
Bob 11:23:10 11:23:25
Alice 16:26:46 16:26:54
David 17:20:28 17:20:45
Alice 18:16:53 18:17:00
Charlie 19:33:44 19:33:59
Bob 21:13:32 21:13:43
David 22:36:22 22:36:34
Alice 23:42:01 23:42:11

Bob 00:44:48 00:45:11 23
Charlie 02:16:02 02:16:18 16
Charlie 11:16:59 11:17:17 18
Charlie 11:17:24 11:17:38 14
Bob 11:23:10 11:23:25 15
Alice 16:26:46 16:26:54 8
David 17:20:28 17:20:45 17
Alice 18:16:53 18:17:00 7
Charlie 19:33:44 19:33:59 15
Bob 21:13:32 21:13:43 11
David 22:36:22 22:36:34 12
Alice 23:42:01 23:42:11 10

Person Duration
Bob 23
Charlie 16
Charlie 18
Charlie 14
Bob 15
Alice 8
David 17
Alice 7
Charlie 15
Bob 11
David 12
Alice 10

Person Duration
Bob 23
Charlie 16
Charlie 18
Charlie 14
Bob 15
Alice 8
David 17
Alice 7
Charlie 15
Bob 11
David 12
Alice 10
Person Start End
Bob 00:44:48 00:45:11
Charlie 02:16:02 02:16:18
Charlie 11:16:59 11:17:17
Charlie 11:17:24 11:17:38
Bob 11:23:10 11:23:25
Alice 16:26:46 16:26:54
David 17:20:28 17:20:45
Alice 18:16:53 18:17:00
Charlie 19:33:44 19:33:59
Bob 21:13:32 21:13:43
David 22:36:22 22:36:34
Alice 23:42:01 23:42:11
map

Person Duration
Alice 8
Alice 7
Alice 10
Bob 23
Bob 15
Bob 11
Charlie 16
Charlie 18
Charlie 14
Charlie 15
David 12
David 17

Person Total
Alice 25
Person Duration
Alice 8
Alice 7
Alice 10
Bob 23
Bob 15
Bob 11
Charlie 16
Charlie 18
Charlie 14
Charlie 15
David 12
David 17

Person Duration
Alice 8
Alice 7
Alice 10
Bob 23
Bob 15
Bob 11
Charlie 16
Charlie 18
Charlie 14
Charlie 15
David 12
David 17
Person Total
Bob 49
Alice 25

Person Total
Charlie 63
Bob 49
Alice 25
Person Duration
Alice 8
Alice 7
Alice 10
Bob 23
Bob 15
Bob 11
Charlie 16
Charlie 18
Charlie 14
Charlie 15
David 12
David 17

Person Total
David 29
Charlie 63
Bob 49
Alice 25
Person Duration
Alice 8
Alice 7
Alice 10
Bob 23
Bob 15
Bob 11
Charlie 16
Charlie 18
Charlie 14
Charlie 15
David 12
David 17

Person Total
David 29
Charlie 63
Bob 49
Alice 25

Person Total
Alice 25
Bob 49
Charlie 63
David 29
Person Duration
Alice 8
Alice 7
Alice 10
Bob 23
Bob 15
Bob 11
Charlie 16
Charlie 18
Charlie 14
Charlie 15
David 12
David 17
reduce

map
reduce
Works on one record. In this case it
does “end time minus start time”
In parallel over all the records
Group together common records
(e.g “Alice, Bob”) and add all the
results

distributed computing (with Hadoop)
requires god-like talented engineers

Launch a Hadoop cluster from the CLI (
elastic-mapreduce --create --alive
--instance-type m1.xlarge
--num-instances 5

EMR makes it easy to use Hive and Pig
Pig:
• High-level programming
language (Pig Latin)
• Supports UDFs
• Ideal for data flow/ETL
Hive:
• Data Warehouse for Hadoop
• SQL-like query language
(HiveQL)

R:
• Language and software
environment for statistical
computing and graphics
• Open source
EMR makes it easy to use other tools and applications
Mahout:
• Machine learning library
• Supports recommendation
mining, clustering,
classification, and frequent
itemset mining

Launch a Hive cluster from the CLI (step 1/1)
./elastic-mapreduce --create --alive
--name "Test Hive"
--hadoop-version 0.20
--num-instances 5
--instance-type m1.large
--hive-interactive
--hive-versions 0.7.1

SQL Interface for working with data
Simple way to use Hadoop
Create Table statement references data location on
S3
Language called HiveQL, similar to SQL
An example of a query could be:
SELECT COUNT(1) FROM sometable;
Requires to setup a mapping to the input data
Uses SerDe:s to make different input formats
queryable
Powerful data types (Array & Map..)

SQL HiveQL
Updates UPDATE, INSERT,
DELETE
INSERT, OVERWRITE
TABLE
Transactions Supported Not supported
Indexes Supported Not supported
Latency Sub-second Minutes
Functions Hundreds Dozens
Multi-table inserts Not supported Supported
Create table as select Not valid SQL-92 Supported

./elastic-mapreduce –create
--name "Hive job flow”
--hive-script
--args s3://myawsbucket/myquery.q
--args -d,INPUT=s3://myawsbucket/input,-
d,OUTPUT=s3://myawsbucket/output
HiveQL to execute

./elastic-mapreduce
--create
--alive
--name "Hive job flow”
--num-instances 5 --instance-type m1.large
--hive-interactive
Interactive hive session

114
{
requestBeginTime: "19191901901",
requestEndTime: "19089012890",
browserCookie: "xFHJK21AS6HLASLHAS",
userCookie: "ajhlasH6JASLHbas8",
searchPhrase: "digital cameras" adId:
"jalhdahu789asashja",
impresssionId: "hjakhlasuhiouasd897asdh",
referrer: "http://cooking.com/recipe?id=10231",
hostname: "ec2-12-12-12-12.ec2.amazonaws.com",
modelId: "asdjhklasd7812hjkasdhl",
processId: "12901", threadId: "112121",
timers:
{ requestTime: "1910121", modelLookup: "1129101" }
counters:
{ heapSpace: "1010120912012" }
}

115
{
requestBeginTime: "19191901901",
requestEndTime: "19089012890",
browserCookie: "xFHJK21AS6HLASLHAS",
userCookie: "ajhlasH6JASLHbas8",
adId: "jalhdahu789asashja",
impresssionId:
hjakhlasuhiouasd897asdh",
clickId: "ashda8ah8asdp1uahipsd",
referrer: "http://recipes.com/",
directedTo: "http://cooking.com/" }

CREATE EXTERNAL TABLE impressions (
requestBeginTime string,
adId string,
impressionId string,
referrer string,
userAgent string,
userCookie string,
ip string
)
PARTITIONED BY (dt string)
ROW FORMAT
serde 'com.amazon.elasticmapreduce.JsonSerde'
with serdeproperties ( 'paths'='requestBeginTime,
adId, impressionId, referrer, userAgent,
userCookie, ip' )
LOCATION ‘s3://mybucketsource/tables/impressions' ;

adId string,
referrer string,
userAgent string,
userCookie string,
ip string
)
ROW FORMAT
userCookie, ip' )
Table structure
to create
(happens fast as
just mapping to
source)

adId string,
referrer string,
userAgent string,
userCookie string,
ip string
)
ROW FORMAT
userCookie, ip' )
Source data in S3

Hadoop lowers the cost of developing
a distributed system.

hive> select * from impressions limit 5;
Selecting from source
data directly via Hadoop

What about the cost of operating
a distributed system?

November traffic at amazon.com

November traffic at amazon.com
76%
24%

1 instance x 100 hours = 100 instances x 1 hour

EMR Cluster
S3
Put the data
into S3
Choose: Hadoop distribution, # of
nodes, types of nodes, custom
configs, Hive/Pig/etc.
Get the output from
S3
Launch the cluster using the
EMR console, CLI, SDK, or
APIs
You can also store
everything in HDFS
How does EMR work ?

S3
What can you run on EMR…
EMR Cluster

Resize Nodes
EMR Cluster
You can easily add and
remove nodes

On and Off Fast Growth
Predictable peaksVariable peaks
WASTE

Fast GrowthOn and Off
Predictable peaksVariable peaks

Your choice of tools on Hadoop/EMR
Amazon SQS
Amazon S3
DynamoDB
Any SQL or NO SQL
Store
Log Aggregation
tools
Amazon
EMR

SQL based processing
Amazon SQS
Amazon S3
DynamoDB
Any SQL or NO SQL
Store
Log Aggregation
tools
Amazon
EMR
Amazon
Redshift
Pre-processing
framework
Petabyte scale
Columnar Data -
warehouse

Massively Parallel Columnar Datawarehouses
• Columnar Data stores
• MPP
– Parallel Ingest
– Parallel Query
– Scale Out
– Parallel Backup

Columnar data stores
• Data alignment
and block size in
row stores vs.
column stores
• Compression
based on each
column

MPP Data warehouse parallelizes and distributes
everything
• Query
• Load
• Backup
• Restore
• Resize
10 GigE
(HPC)
Ingestion
Backup
Restore
JDBC/ODBC

But Data-warehouses are
• Hard to manage
• Very expensive
• Difficult to scale
• Difficult to get performance

Amazon Redshift is a fast and powerful, fully managed,
petabyte-scale data warehouse service in the AWS cloud

Parallelize and Distribute Everything
Dramatically Reduce I/O
MPP
Load
Query
Resize
Backup
Restore

Parallelize and Distribute Everything
Dramatically Reduce I/O
MPP
Load
Query
Resize
Backup
Restore
Direct-attached storage
Large data block sizes
Column data store
Data compression
Zone maps

Protect Operations
Simplify Provisioning
Redshift data is encrypted
Continuously backed up to S3
Automatic node recovery
Transparent disk failure

Protect Operations
Simplify Provisioning
Redshift data is encrypted
Continuously backed up to S3
Automatic node recovery
Transparent disk failure
Create a cluster in minutes
Automatic OS and software patching
Scale up to 1.6PB with a few clicks and no downtime

Start Small and Grow Big
Extra Large Node (XL)
3 spindles, 2TB, 15GiB RAM
2 virtual cores, 10GigE
1 node (2TB)  2-32 node cluster (64TB)
8 Extra Large Node (8XL)
24 spindles, 16TB, 120GiB RAM
16 virtual cores, 10GigE
2-100 node cluster (1.6PB)

Easy to provision and scale
No upfront costs, pay as you go
High performance at a low price
Open and flexible with support for popular BI tools

Amazon Redshift is priced to let you analyze all your data
Price Per Hour for HS1.XL
Single Node
Effective Hourly Price
Per TB
Effective Annual Price
per TB
On-Demand $ 0.850 $ 0.425 $ 3,723
1 Year Reservation $ 0.500 $ 0.250 $ 2,190
3 Year Reservation $ 0.228 $ 0.114 $ 999
Simple Pricing
Number of Nodes x Cost per Hour
No charge for Leader Node
No upfront costs
Pay as you go

Your choice of BI Tools on the cloud
Amazon SQS
Amazon S3
DynamoDB
Any SQL or NO SQL
Store
Log Aggregation
tools
Amazon
EMR
Amazon
Redshift
Pre-processing
framework

Collaboration and Sharing insights
Amazon SQS
Amazon S3
DynamoDB
Any SQL or NO SQL
Store
Log Aggregation
tools
Amazon
EMR
Amazon
Redshift

Sharing results and visualizations
Amazon SQS
Amazon S3
DynamoDB
Any SQL or NO SQL
Store
Log Aggregation
tools
Amazon
EMR
Amazon
Redshift
Web App Server
Visualization tools

Sharing results and visualizations and scale
Amazon SQS
Amazon S3
DynamoDB
Any SQL or NO SQL
Store
Log Aggregation
tools
Amazon
EMR
Amazon
Redshift
Web App Server
Visualization tools

Sharing results and visualizations
Amazon SQS
Amazon S3
DynamoDB
Any SQL or NO SQL
Store
Log Aggregation
tools
Amazon
EMR
Amazon
Redshift Business
Intelligence Tools
Business
Intelligence Tools

Geospatial Visualizations
Amazon SQS
Amazon S3
DynamoDB
Any SQL or NO SQL
Store
Log Aggregation
tools
Amazon
EMR
Amazon
Redshift Business
Intelligence Tools
Business
Intelligence Tools
GIS tools on
hadoop
GIS tools
Visualization tools

Rinse Repeat every day or hour

Rinse and Repeat
Amazon SQS
Amazon S3
DynamoDB
Any SQL or NO SQL
Store
Log Aggregation
tools
Amazon
EMR
Amazon
Redshift
Visualization tools
Business
Intelligence Tools
Business
Intelligence Tools
GIS tools on
hadoop
GIS tools
Amazon data pipeline

The complete architecture
Amazon SQS
Amazon S3
DynamoDB
Any SQL or NO SQL
Store
Log Aggregation
tools
Amazon
EMR
Amazon
Redshift
Visualization tools
Business
Intelligence Tools
Business
Intelligence Tools
GIS tools on
hadoop
GIS tools
Amazon data pipeline

Where do you start ?
• Where is your data ? (S3, SQL, NoSQL ?)
– Are you collecting all your data ?
– What is the format (structured or unstructured)
– How much is this data going to grow ?
• How do you want to process it ?
– SQL (HIVE), Scripts (Python/Ruby/Node.JS) On Hadoop ?
• How do you want to use this data
– Visualization tools
• Do you yourself or engage an AWS partner
• Write to me sinhaar@amazon.com

Big data on_aws in korea by abhishek sinha (lunch and learn)

More Related Content

Similar to Big data on_aws in korea by abhishek sinha (lunch and learn) (20)

More from Amazon Web Services Korea (20)

Recently uploaded (20)

Big data on_aws in korea by abhishek sinha (lunch and learn)