© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
김일호, Solutions Architect
05-17-2016
개발자가 알아야 할 Amazon DynamoDB 활용법 (Amazon DynamoDB Tips Every Developer Should Know)
Agenda
Tip 1. DynamoDB Index(LSI, GSI)
Tip 2. DynamoDB Scaling
Tip 3. DynamoDB Data Modeling
Scenario based Best Practice
DynamoDB Streams
Tip 1. DynamoDB Index(LSI, GSI)
Tip 2. DynamoDB Scaling
Tip 3. DynamoDB Data Modeling
Scenario based Best Practice
DynamoDB Streams
Local secondary index (LSI)
Alternate sort (range) key attribute
Index is local to a partition (hash) key

Table: A1 (partition), A2 (sort), attributes A3, A4, A5

LSIs, by projection type:
• KEYS_ONLY: A1 (partition), A3 (sort), A2 (table key)
• INCLUDE A3: A1 (partition), A4 (sort), A2 (table key), A3 (projected)
• ALL: A1 (partition), A5 (sort), A2 (table key), A3, A4 (projected)

10 GB max per partition key, i.e. LSIs limit the # of sort keys!
Global secondary index (GSI)
Alternate partition key
Index spans all partition keys in the table

Table: A1 (partition), attributes A2, A3, A4, A5

GSIs, by projection type:
• KEYS_ONLY: A2 (partition), A1 (table key)
• INCLUDE A3: A5 (partition), A4 (sort), A1 (table key), A3 (projected)
• ALL: A4 (partition), A5 (sort), A1 (table key), A2, A3 (projected)

RCUs/WCUs provisioned separately for GSIs
Online indexing
How do GSI updates work?
1. Client writes to the primary table.
2. An asynchronous update propagates the change to the global secondary index.
If GSIs don’t have enough write capacity, table writes will be throttled!
LSI or GSI?
LSI can be modeled as a GSI
If data size in an item collection > 10 GB, use GSI
If eventual consistency is okay for your scenario, use GSI!
Tip 1. DynamoDB Index(LSI, GSI)
Tip 2. DynamoDB Scaling
Tip 3. DynamoDB Data Modeling
Scenario based Best Practice
DynamoDB Streams
Scaling
Throughput
• Provision any amount of throughput to a table
Size
• Add any number of items to a table
• Max item size is 400 KB
• LSIs limit the number of range keys due to 10 GB limit
Scaling is achieved through partitioning
Throughput
Provisioned at the table level
• Write capacity units (WCUs) are measured in 1 KB writes per second
• Read capacity units (RCUs) are measured in 4 KB reads per second
• RCUs measure strongly consistent reads
• Eventually consistent reads cost 1/2 of strongly consistent reads
Read and write throughput limits are independent
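The capacity-unit arithmetic above can be sketched as follows; this is a minimal illustration assuming item sizes are given in KB (the function names are my own, not an AWS API):

```python
import math

def wcus_per_write(item_size_kb):
    """WCUs consumed by one write: item size rounded up to the next 1 KB unit."""
    return math.ceil(item_size_kb / 1.0)

def rcus_per_read(item_size_kb, strongly_consistent=True):
    """RCUs consumed by one read: item size rounded up to the next 4 KB unit.
    Eventually consistent reads cost half as much."""
    units = math.ceil(item_size_kb / 4.0)
    return units if strongly_consistent else units / 2.0

print(wcus_per_write(1.5))                           # a 1.5 KB write costs 2 WCUs
print(rcus_per_read(6))                              # a 6 KB consistent read costs 2 RCUs
print(rcus_per_read(6, strongly_consistent=False))   # eventually consistent: 1.0
```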
Partitioning math
Number of Partitions
By Capacity (Total RCU / 3000) + (Total WCU / 1000)
By Size Total Size / 10 GB
Total Partitions CEILING(MAX (Capacity, Size))
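The partitioning formula above can be computed directly; a minimal sketch of the 2016-era guidance (the per-partition constants come from the table above):

```python
import math

def partition_count(rcu, wcu, size_gb):
    """Estimate DynamoDB's internal partition count:
    the max of the capacity-driven and size-driven counts, rounded up."""
    by_capacity = rcu / 3000 + wcu / 1000   # each partition serves ~3000 RCU / ~1000 WCU
    by_size = size_gb / 10                  # each partition stores ~10 GB
    return math.ceil(max(by_capacity, by_size))

print(partition_count(5000, 500, 8))  # 3 (the worked example on the next slide)
```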
Partitioning example
Table size = 8 GB, RCUs = 5000, WCUs = 500
RCUs per partition = 5000/3 = 1666.67
WCUs per partition = 500/3 = 166.67
Data per partition = 8/3 = 2.67 GB
RCUs and WCUs are uniformly
spread across partitions
Number of Partitions
By Capacity (5000 / 3000) + (500 / 1000) = 2.17
By Size 8 / 10 = 0.8
Total Partitions CEILING(MAX (2.17, 0.8)) = 3
Allocation of partitions
A partition split occurs when
• Increased provisioned throughput settings
• Increased storage requirements
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html
Example: hot keys (heat map: partition vs. time)
Example: periodic spike (heat map: partition vs. time)
Getting the most out of DynamoDB throughput
“To get the most out of DynamoDB
throughput, create tables where the
partition key element has a large
number of distinct values, and
values are requested fairly
uniformly, as randomly as possible.”
—DynamoDB Developer Guide
Space: access is evenly spread over
the key-space
Time: requests arrive evenly spaced
in time
What causes throttling?
If sustained throughput goes beyond provisioned throughput per partition
Non-uniform workloads
• Hot keys/hot partitions
• Very large bursts
Mixing hot data with cold data
• Use a table per time period
From the example before:
• Table created with 5000 RCUs, 500 WCUs
• RCUs per partition = 1666.67
• WCUs per partition = 166.67
• If sustained throughput > (1666 RCUs or 166 WCUs) per key or partition, DynamoDB may throttle requests
• Solution: Increase provisioned throughput
Tip 1. DynamoDB Index(LSI, GSI)
Tip 2. DynamoDB Scaling
Tip 3. DynamoDB Data Modeling
Scenario based Best Practice
DynamoDB Streams
1:1 relationships or key-values
Use a table or GSI with a partition key
Use GetItem or BatchGetItem API
Example: Given an SSN or license number, get attributes
Users	Table
Partition	key Attributes
SSN	=	123-45-6789 Email	=	johndoe@nowhere.com,	License =	TDL25478134
SSN	=	987-65-4321 Email	=	maryfowler@somewhere.com,	License =	TDL78309234
Users-Email-GSI
Partition	key Attributes
License =	TDL78309234 Email	=	maryfowler@somewhere.com,	SSN	=	987-65-4321
License =	TDL25478134 Email	=	johndoe@nowhere.com,	SSN	=	123-45-6789
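A sketch of the two lookups above as low-level API parameter dicts (the kind you would pass to a DynamoDB client); table and index names follow the example, and the helper names are my own:

```python
def get_by_ssn(ssn):
    """GetItem parameters: exact lookup by the base table's partition key."""
    return {"TableName": "Users", "Key": {"SSN": {"S": ssn}}}

def query_by_license(license_no):
    """A GSI does not support GetItem, so use Query with an equality condition
    on the index partition key. '#lic' aliases the attribute name to avoid
    clashing with DynamoDB reserved words."""
    return {
        "TableName": "Users",
        "IndexName": "Users-Email-GSI",
        "KeyConditionExpression": "#lic = :l",
        "ExpressionAttributeNames": {"#lic": "License"},
        "ExpressionAttributeValues": {":l": {"S": license_no}},
    }

print(get_by_ssn("123-45-6789")["Key"])
```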
1:N relationships or parent-children
Use a table or GSI with partition and sort key
Use Query API
Example:
• Given a device, find all readings between epoch X, Y
Device-measurements
Partition	Key Sort	key Attributes
DeviceId	=	1 epoch	=	5513A97C Temperature	=	30,	pressure	=	90
DeviceId	=	1 epoch	=	5513A9DB Temperature	=	30,	pressure	=	90
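The device-readings query above can be expressed as Query parameters; a minimal sketch assuming the epochs are stored as strings, as in the table:

```python
def readings_between(device_id, epoch_x, epoch_y):
    """Query parameters: one device's readings with sort key between X and Y."""
    return {
        "TableName": "Device-measurements",
        "KeyConditionExpression": "DeviceId = :d AND epoch BETWEEN :x AND :y",
        "ExpressionAttributeValues": {
            ":d": {"N": str(device_id)},
            ":x": {"S": epoch_x},
            ":y": {"S": epoch_y},
        },
    }

params = readings_between(1, "5513A97C", "5513A9DB")
```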
N:M relationships
Use a table and GSI with partition and sort key elements
switched
Use Query API
Example: Given a user, find all games. Or given a game,
find all users.
User-Games-Table
Hash	Key Range	key
UserId	=	bob GameId	=	Game1
UserId	=	fred GameId	=	Game2
UserId	=	bob GameId	=	Game3
Game-Users-GSI
Hash	Key Range	key
GameId	=	Game1 UserId	=	bob
GameId	=	Game2 UserId	=	fred
GameId	=	Game3 UserId	=	bob
Documents (JSON)
New data types (M, L, BOOL, NULL)
introduced to support JSON
Document SDKs
• Simple programming model
• Conversion to/from JSON
• Java, JavaScript, Ruby, .NET
Cannot index (S,N) elements of a
JSON object stored in M
• Only top-level table
attributes can be used in
LSIs and GSIs without
Streams/Lambda
JavaScript DynamoDB
string S
number N
boolean BOOL
null NULL
array L
object M
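The type table above maps JSON values onto DynamoDB attribute-value wrappers; a minimal sketch of that mapping (this mirrors what the document SDKs do internally, under the simplifying assumption that numbers are plain ints/floats):

```python
def to_dynamodb(value):
    """Wrap a Python/JSON value in its DynamoDB attribute-value type
    (S, N, BOOL, NULL, L, M), following the type table above."""
    if value is None:
        return {"NULL": True}
    if isinstance(value, bool):          # must check bool before int
        return {"BOOL": value}
    if isinstance(value, (int, float)):
        return {"N": str(value)}         # numbers travel as strings
    if isinstance(value, str):
        return {"S": value}
    if isinstance(value, list):
        return {"L": [to_dynamodb(v) for v in value]}
    if isinstance(value, dict):
        return {"M": {k: to_dynamodb(v) for k, v in value.items()}}
    raise TypeError(f"unsupported type: {type(value)}")

print(to_dynamodb({"views": 3, "tags": ["a", "b"], "ok": True}))
```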
Rich expressions
Projection expression to get just some of the attributes
• Query/Get/Scan: ProductReviews.FiveStar[0]
ProductReviews: {
  FiveStar: [
    "Excellent! Can't recommend it highly enough! Buy it!",
    "Do yourself a favor and buy this."
  ],
  OneStar: [
    "Terrible product! Do not buy this."
  ]
}
Rich expressions
Filter expression
• Query/Scan: #VIEWS > :num
Update expression
• UpdateItem: set Replies = Replies + :num
Rich expressions
Conditional expression
• Put/Update/DeleteItem
• attribute_not_exists (#pr.FiveStar)
• attribute_exists(Pictures.RearView)
1. DynamoDB first looks for an item whose primary key matches that of the
item to be written.
2. If no such item exists, attribute_not_exists(partition key) evaluates to
true and the write proceeds.
3. Otherwise, the attribute_not_exists function above fails and the write
is prevented.
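The attribute_not_exists idiom described above is how "insert only if absent" is expressed; a minimal sketch as PutItem parameters (table and key names reuse the earlier Users example):

```python
def put_if_absent(table, key_attr, item):
    """PutItem parameters that succeed only when no item with the same
    primary key already exists (the attribute_not_exists idiom above)."""
    return {
        "TableName": table,
        "Item": item,
        "ConditionExpression": f"attribute_not_exists({key_attr})",
    }

params = put_if_absent("Users", "SSN", {"SSN": {"S": "123-45-6789"}})
```

If the condition fails, DynamoDB rejects the write with a conditional-check-failed error instead of silently overwriting the item.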
Tip 1. DynamoDB Index(LSI, GSI)
Tip 2. DynamoDB Scaling
Tip 3. DynamoDB Data Modeling
Scenario based Best Practice
DynamoDB Streams
Game logging
Storing time series data
Time series tables
Events_table_2015_April (current table)
Events_table_2015_March / Events_table_2015_February / Events_table_2015_January (older tables)
Each table: Event_id (partition key), Timestamp (sort key), Attribute1 … Attribute N
(Figure: each table is provisioned independently, e.g. RCUs = 10000 / WCUs = 10000 down to RCUs = 10 / WCUs = 1)
Hot data lives in the current table; cold data lives in the older tables
Don’t mix hot and cold data; archive cold data to Amazon S3
Use a table per time period
• Pre-create daily, weekly, monthly tables
• Provision required throughput for current table
• Writes go to the current table
• Turn off (or reduce) throughput for older tables
• Pre-create heavy users, light users tables
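Routing writes to the right monthly table is a simple naming computation; a minimal sketch of the Events_table_YYYY_Month convention above (the function name is my own):

```python
from datetime import date

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

def events_table_for(day):
    """Name of the monthly time-series table that receives writes for `day`."""
    return f"Events_table_{day.year}_{MONTHS[day.month - 1]}"

print(events_table_for(date(2015, 4, 15)))  # Events_table_2015_April
```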
Item shop catalog
Popular items (read-heavy): even with the ItemShopCatalog table spread across
many partitions (Partition 1 … Partition 50, ~2000 RCUs each), gamers' requests
(SELECT Id, Description, ... FROM ItemShopCatalog) concentrate on a few popular
products (Product A, Product B), creating scaling bottlenecks on the partitions
that hold them.
(Chart: requests per second by item primary key; request distribution per partition key)

Cache popular items: with a cache in front of DynamoDB, reads of hot items
become cache hits and only the remaining requests reach the table.
(Chart: requests per second by item primary key; DynamoDB requests vs. cache hits)
Multiplayer online gaming
Query filters vs.
composite key indexes
Games Table
GameId  Date        Host   Opponent  Status
d9bl3   2014-10-02  David  Alice     DONE
72f49   2014-09-30  Alice  Bob       PENDING
o2pnb   2014-10-08  Bob    Carol     IN_PROGRESS
b932s   2014-10-03  Carol  Bob       PENDING
ef9ca   2014-10-03  David  Bob       IN_PROGRESS
Multiplayer online game data
Partition key: GameId
Query for incoming game requests:
DynamoDB indexes provide one partition key and one sort key.
What about queries with two equalities and a range?

SELECT * FROM Game
WHERE Opponent='Bob'      -- (partition)
AND Status='PENDING'      -- (?)
ORDER BY Date DESC        -- (sort)
Secondary Index
Opponent Date GameId Status Host
Alice 2014-10-02 d9bl3 DONE David
Carol 2014-10-08 o2pnb IN_PROGRESS Bob
Bob 2014-09-30 72f49 PENDING Alice
Bob 2014-10-03 b932s PENDING Carol
Bob 2014-10-03 ef9ca IN_PROGRESS David
Approach 1: Query filter
Partition key: Opponent, Sort key: Date

SELECT * FROM Game
WHERE Opponent='Bob'
ORDER BY Date DESC
FILTER ON Status='PENDING'

Non-matching rows (e.g. Bob's IN_PROGRESS game) are still read, then filtered out.
Needle in a haystack
Use query filter
• Send back less data “on the wire”
• Simplify application code
• Simple SQL-like expressions
• AND, OR, NOT, ()
Use when your index isn’t entirely selective
Approach 2: composite key
Concatenate Status and Date into a single attribute, StatusDate:

Status      + Date       = StatusDate
DONE          2014-10-02   DONE_2014-10-02
IN_PROGRESS   2014-10-08   IN_PROGRESS_2014-10-08
IN_PROGRESS   2014-10-03   IN_PROGRESS_2014-10-03
PENDING       2014-09-30   PENDING_2014-09-30
PENDING       2014-10-03   PENDING_2014-10-03
Secondary Index
Approach 2: composite key
Partition key: Opponent, Sort key: StatusDate

Opponent  StatusDate              GameId  Host
Alice     DONE_2014-10-02         d9bl3   David
Carol     IN_PROGRESS_2014-10-08  o2pnb   Bob
Bob       IN_PROGRESS_2014-10-03  ef9ca   David
Bob       PENDING_2014-09-30      72f49   Alice
Bob       PENDING_2014-10-03      b932s   Carol
Approach 2: composite key

SELECT * FROM Game
WHERE Opponent='Bob'
AND StatusDate BEGINS_WITH 'PENDING'

Needle in a sorted haystack
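The effect of the composite key can be simulated in memory; a minimal sketch of what the begins_with query above returns against the index rows (the data comes from the example table):

```python
# Each row: (Opponent, StatusDate), as in the secondary index above.
rows = [
    ("Alice", "DONE_2014-10-02"),
    ("Carol", "IN_PROGRESS_2014-10-08"),
    ("Bob",   "IN_PROGRESS_2014-10-03"),
    ("Bob",   "PENDING_2014-09-30"),
    ("Bob",   "PENDING_2014-10-03"),
]

def pending_games(opponent):
    """Equivalent of: Opponent = :o AND begins_with(StatusDate, 'PENDING'),
    returned in sort-key order."""
    return sorted(sd for o, sd in rows if o == opponent and sd.startswith("PENDING"))

print(pending_games("Bob"))  # ['PENDING_2014-09-30', 'PENDING_2014-10-03']
```

Unlike the query-filter approach, only the matching rows are read, because the status is part of the sort key itself.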
Sparse indexes
Game-scores-table
Id (hash)  User   Game  Score  Date        Award
1          Bob    G1    1300   2012-12-23
2          Bob    G1    1450   2012-12-23
3          Jay    G1    1600   2012-12-24
4          Mary   G1    2000   2012-10-24  Champ
5          Ryan   G2    123    2012-03-10
6          Jones  G2    345    2012-03-20

Award-GSI (only items that have an Award attribute appear in the index)
Award (hash)  Id  User  Score
Champ         4   Mary  2000

Scan sparse hash GSIs
Replace filter with indexes
Concatenate attributes to form useful
secondary index keys
Take advantage of sparse indexes
Use when you want to optimize a query as much as possible (e.g., the Status + Date composite)
Big data analytics
with DynamoDB
Transactional Data Processing
DynamoDB is well-suited for transactional processing:
• High concurrency
• Strong consistency
• Atomic updates of single items
• Conditional updates for de-dupe and optimistic concurrency
• Supports both key/value and JSON document schema
• Capable of handling large table sizes with low latency data access
Case 1: Store and Index Metadata for Objects Stored in Amazon S3
Case 1: Use Case
We have a large number of digital audio files stored in Amazon S3 and
we want to make them searchable:
→ Use DynamoDB as the primary data store for the metadata.
→ Index and query the metadata using Elasticsearch.
Case 1: Steps to Implement
1. Create a Lambda function that reads the metadata from the
ID3 tag and inserts it into a DynamoDB table.
2. Enable S3 notifications on the S3 bucket storing the audio
files.
3. Enable streams on the DynamoDB table.
4. Create a second Lambda function that takes the metadata in
DynamoDB and indexes it using Elasticsearch.
5. Enable the stream as the event source for the Lambda
function.
Case 1: Key Takeaways
DynamoDB + Elasticsearch = durable, scalable, highly available database with rich query capabilities.
Use Lambda functions to respond to events in both
DynamoDB streams and Amazon S3 without having to
manage any underlying compute infrastructure.
Case 2 – Execute Queries Against Multiple Data Sources Using DynamoDB and Hive
Case 2: Use Case
We want to enrich our audio file metadata stored in DynamoDB with
additional data from the Million Song dataset:
→ The Million Song dataset is stored in text files.
→ ID3 tag metadata is stored in DynamoDB.
→ Use Amazon EMR with Hive to join the two datasets together in a query.
Case 2: Steps to Implement
1. Spin up an Amazon EMR cluster with
Hive.
2. Create an external Hive table using the
DynamoDBStorageHandler.
3. Create an external Hive table using the
Amazon S3 location of the text files
containing the Million Song project
metadata.
4. Create and run a Hive query that joins
the two external tables together and
writes the joined results out to Amazon
S3.
5. Load the results from Amazon S3 into
DynamoDB.
Case 2: Key Takeaways
Use Amazon EMR to quickly provision a Hadoop cluster
with Hive and to tear it down when done.
Use of Hive with DynamoDB allows items in DynamoDB
tables to be queried/joined with data from a variety of
sources.
Case 3 – Store and Analyze Sensor Data with DynamoDB and Amazon Redshift
Case 3: Use Case
A large number of sensors are taking readings at regular intervals. You
need to aggregate the data from each reading into a data warehouse
for analysis:
• Use Amazon Kinesis to ingest the raw sensor data.
• Store the sensor readings in DynamoDB for fast access and real-time dashboards.
• Store raw sensor readings in Amazon S3 for durability and backup.
• Load the data from Amazon S3 into Amazon Redshift using AWS
Lambda.
Case 3: Steps to Implement
1. Create two Lambda functions to
read data from the Amazon
Kinesis stream.
2. Enable the Amazon Kinesis
stream as an event source for
each Lambda function.
3. Write data into DynamoDB in
one of the Lambda functions.
4. Write data into Amazon S3 in the
other Lambda function.
5. Use the aws-lambda-redshift-loader to load the data in Amazon S3 into Amazon Redshift in batches.
Case 3: Key Takeaways
Amazon Kinesis + Lambda + DynamoDB = Scalable, durable, highly
available solution for sensor data ingestion with very low operational
overhead.
DynamoDB is well-suited for near-realtime queries of recent sensor
data readings.
Amazon Redshift is well-suited for deeper analysis of sensor data
readings spanning longer time horizons and very large numbers of
records.
Using Lambda to load data into Amazon Redshift provides a way to
perform ETL in frequent intervals.
Tip 1. DynamoDB Index(LSI, GSI)
Tip 2. DynamoDB Scaling
Tip 3. DynamoDB Data Modeling
Scenario based Best Practice
DynamoDB Streams
Stream of updates to a table
Asynchronous
Exactly once
Strictly ordered
• Per item
Highly durable
• Scale with table
24-hour lifetime
Sub-second latency
DynamoDB Streams view types
Example: UpdateItem (Name = John, Destination = Pluto)

View type                  Stream record contents
Old image (before update)  Name = John, Destination = Mars
New image (after update)   Name = John, Destination = Pluto
Old and new images         Name = John, Destination = Mars / Name = John, Destination = Pluto
Keys only                  Name = John
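A stream consumer reads these images out of each record; a minimal sketch assuming the NEW_AND_OLD_IMAGES view type and the record structure of the DynamoDB Streams API (the sample record mirrors the UpdateItem example above):

```python
def images(record):
    """Old/new images from a DynamoDB Streams record. Either may be absent,
    e.g. no OldImage on INSERT or under other view types."""
    body = record["dynamodb"]
    return body.get("OldImage"), body.get("NewImage")

# A record shaped like the UpdateItem example above:
rec = {
    "eventName": "MODIFY",
    "dynamodb": {
        "Keys": {"Name": {"S": "John"}},
        "OldImage": {"Name": {"S": "John"}, "Destination": {"S": "Mars"}},
        "NewImage": {"Name": {"S": "John"}, "Destination": {"S": "Pluto"}},
    },
}
old, new = images(rec)
print(new["Destination"]["S"])  # Pluto
```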
DynamoDB Streams and Amazon Kinesis Client Library
(Figure: a DynamoDB client application sends updates to the table, which is
split into partitions; the stream exposes those changes as shards, each
consumed by a KCL worker in an Amazon Kinesis Client Library application)
Cross-region replication
DynamoDB Streams + open-source cross-region replication library
(Figure: replication between US East (N. Virginia), Asia Pacific (Sydney), and EU (Ireland) replicas)
DynamoDB Streams and AWS Lambda
Triggers: DynamoDB Streams notifies a Lambda function of each change, which can
maintain derivative tables or update Amazon CloudSearch, Amazon Elasticsearch
Service, and Amazon ElastiCache.
Analytics with DynamoDB Streams
Collect and de-dupe data in DynamoDB
Aggregate data in memory and flush periodically
Performing real-time aggregation and analytics (with EMR, Redshift, and cross-region replication)
개발자가 알아야 할 Amazon DynamoDB 활용법 (Amazon DynamoDB Tips Every Developer Should Know) :: 김일호 :: AWS Summit Seoul 2016
We look forward to your feedback!
https://www.awssummit.co.kr
Rate this session now on the mobile page to receive a gift after the event.
Share your impressions of the event on social media with the #AWSSummit hashtag.
Slides and recorded videos will be shared soon on AWS Korea's official social channels.

More Related Content

PDF
Amazon Redshift로 데이터웨어하우스(DW) 구축하기
PDF
Amazon OpenSearch Deep dive - 내부구조, 성능최적화 그리고 스케일링
PDF
효과적인 NoSQL (Elasticahe / DynamoDB) 디자인 및 활용 방안 (최유정 & 최홍식, AWS 솔루션즈 아키텍트) :: ...
PDF
데브시스터즈 데이터 레이크 구축 이야기 : Data Lake architecture case study (박주홍 데이터 분석 및 인프라 팀...
PDF
효율적인 빅데이터 분석 및 처리를 위한 Glue, EMR 활용 - 김태현 솔루션즈 아키텍트, AWS :: AWS Summit Seoul 2019
PDF
DynamoDB의 안과밖 - 정민영 (비트패킹 컴퍼니)
PDF
대용량 데이터베이스의 클라우드 네이티브 DB로 전환 시 확인해야 하는 체크 포인트-김지훈, AWS Database Specialist SA...
PPTX
AWS 기반 대규모 트래픽 견디기 - 장준엽 (구로디지털 모임) :: AWS Community Day 2017
Amazon Redshift로 데이터웨어하우스(DW) 구축하기
Amazon OpenSearch Deep dive - 내부구조, 성능최적화 그리고 스케일링
효과적인 NoSQL (Elasticahe / DynamoDB) 디자인 및 활용 방안 (최유정 & 최홍식, AWS 솔루션즈 아키텍트) :: ...
데브시스터즈 데이터 레이크 구축 이야기 : Data Lake architecture case study (박주홍 데이터 분석 및 인프라 팀...
효율적인 빅데이터 분석 및 처리를 위한 Glue, EMR 활용 - 김태현 솔루션즈 아키텍트, AWS :: AWS Summit Seoul 2019
DynamoDB의 안과밖 - 정민영 (비트패킹 컴퍼니)
대용량 데이터베이스의 클라우드 네이티브 DB로 전환 시 확인해야 하는 체크 포인트-김지훈, AWS Database Specialist SA...
AWS 기반 대규모 트래픽 견디기 - 장준엽 (구로디지털 모임) :: AWS Community Day 2017

What's hot (20)

PDF
쿠키런: 킹덤 대규모 인프라 및 서버 운영 사례 공유 [데브시스터즈 - 레벨 200] - 발표자: 용찬호, R&D 엔지니어, 데브시스터즈 ...
PDF
DynamoDB를 게임에서 사용하기 – 김성수, 박경표, AWS솔루션즈 아키텍트:: AWS Summit Online Korea 2020
PDF
AWS Connectivity, VPC Design and Security Pro Tips
PDF
AWS 기반 클라우드 아키텍처 모범사례 - 삼성전자 개발자 포털/개발자 워크스페이스 - 정영준 솔루션즈 아키텍트, AWS / 유현성 수석,...
PPTX
글로벌 기업들의 효과적인 데이터 분석을 위한 Data Lake 구축 및 분석 사례 - 김준형 (AWS 솔루션즈 아키텍트)
PDF
AWS Control Tower
PDF
Serverless로 이미지 크롤링 프로토타입 개발기::유호균::AWS Summit Seoul 2018
PDF
AWS 클라우드 비용 최적화를 위한 TIP - 임성은 AWS 매니저
PDF
AWS Summit Seoul 2023 | 실시간 CDC 데이터 처리! Modern Transactional Data Lake 구축하기
PPTX
AWS Black Belt Techシリーズ AWS Storage Gateway
PDF
Aws glue를 통한 손쉬운 데이터 전처리 작업하기
PDF
Amazon RDS Proxy 집중 탐구 - 윤석찬 :: AWS Unboxing 온라인 세미나
PDF
Amazon EMR과 SageMaker를 이용하여 데이터를 준비하고 머신러닝 모델 개발 하기
PDF
[MLOps KR 행사] MLOps 춘추 전국 시대 정리(210605)
PDF
[AWS Migration Workshop] 데이터베이스를 AWS로 손쉽게 마이그레이션 하기
PDF
롯데이커머스의 마이크로 서비스 아키텍처 진화와 비용 관점의 운영 노하우-나현길, 롯데이커머스 클라우드플랫폼 팀장::AWS 마이그레이션 A ...
PPTX
AWS Lambda
PDF
[2017 AWS Startup Day] AWS 비용 최대 90% 절감하기: 스팟 인스턴스 Deep-Dive
PDF
AWS와 부하테스트의 절묘한 만남 :: 김무현 솔루션즈 아키텍트 :: Gaming on AWS 2016
PPTX
AWS S3 | Tutorial For Beginners | AWS S3 Bucket Tutorial | AWS Tutorial For B...
쿠키런: 킹덤 대규모 인프라 및 서버 운영 사례 공유 [데브시스터즈 - 레벨 200] - 발표자: 용찬호, R&D 엔지니어, 데브시스터즈 ...
DynamoDB를 게임에서 사용하기 – 김성수, 박경표, AWS솔루션즈 아키텍트:: AWS Summit Online Korea 2020
AWS Connectivity, VPC Design and Security Pro Tips
AWS 기반 클라우드 아키텍처 모범사례 - 삼성전자 개발자 포털/개발자 워크스페이스 - 정영준 솔루션즈 아키텍트, AWS / 유현성 수석,...
글로벌 기업들의 효과적인 데이터 분석을 위한 Data Lake 구축 및 분석 사례 - 김준형 (AWS 솔루션즈 아키텍트)
AWS Control Tower
Serverless로 이미지 크롤링 프로토타입 개발기::유호균::AWS Summit Seoul 2018
AWS 클라우드 비용 최적화를 위한 TIP - 임성은 AWS 매니저
AWS Summit Seoul 2023 | 실시간 CDC 데이터 처리! Modern Transactional Data Lake 구축하기
AWS Black Belt Techシリーズ AWS Storage Gateway
Aws glue를 통한 손쉬운 데이터 전처리 작업하기
Amazon RDS Proxy 집중 탐구 - 윤석찬 :: AWS Unboxing 온라인 세미나
Amazon EMR과 SageMaker를 이용하여 데이터를 준비하고 머신러닝 모델 개발 하기
[MLOps KR 행사] MLOps 춘추 전국 시대 정리(210605)
[AWS Migration Workshop] 데이터베이스를 AWS로 손쉽게 마이그레이션 하기
롯데이커머스의 마이크로 서비스 아키텍처 진화와 비용 관점의 운영 노하우-나현길, 롯데이커머스 클라우드플랫폼 팀장::AWS 마이그레이션 A ...
AWS Lambda
[2017 AWS Startup Day] AWS 비용 최대 90% 절감하기: 스팟 인스턴스 Deep-Dive
AWS와 부하테스트의 절묘한 만남 :: 김무현 솔루션즈 아키텍트 :: Gaming on AWS 2016
AWS S3 | Tutorial For Beginners | AWS S3 Bucket Tutorial | AWS Tutorial For B...
Ad

Viewers also liked (20)

PDF
게임을 위한 DynamoDB 사례 및 팁 - 김일호 솔루션즈 아키텍트:: AWS Cloud Track 3 Gaming
PDF
Dynamodb 삽질기
PDF
DynamoDB를 이용한 PHP와 Django간 세션 공유 - 강대성 (피플펀드컴퍼니)
PDF
게임업계 IT 관리자를 위한 7가지 유용한 팁 - 박선용 솔루션즈 아키텍트:: AWS Cloud Track 3 Gaming
PDF
대용량 데이타 쉽고 빠르게 분석하기 :: 김일호 솔루션즈 아키텍트 :: Gaming on AWS 2016
PDF
Lambda를 활용한 서버없는 아키텍쳐 구현하기 :: 김기완 :: AWS Summit Seoul 2016
PDF
AWS Innovate 2016 : Closing Keynote - Glenn Gore
PDF
AWS Innovate: Smart Deployment on AWS - Andy Kim
PDF
소셜카지노 초기런칭 및 실험결과 공유
PDF
오토스케일링 제대로 활용하기 (김일호) - AWS 웨비나 시리즈 2015
PDF
성공적인 게임 런칭을 위한 비밀의 레시피 #3
PDF
관계형 데이터베이스의 새로운 패러다임 Amazon Aurora :: 김상필 :: AWS Summit Seoul 2016
PDF
AWS를 활용한 첫 빅데이터 프로젝트 시작하기(김일호)- AWS 웨비나 시리즈 2015
PDF
Gaming on AWS - 3. DynamoDB 모델링 및 Streams 활용법
PDF
Amazed by aws 1st session
PDF
Amazon Aurora Deep Dive (김기완) - AWS DB Day
PDF
Amazon Machine Learning 게임에서 활용해보기 :: 김일호 :: AWS Summit Seoul 2016
PPTX
CloudFront(클라우드 프론트)와 Route53(라우트53) AWS Summit Seoul 2015
PDF
Amazon Aurora 100% 활용하기
PDF
AWS Innovate: Best Practices for Migrating to Amazon DynamoDB - Sangpil Kim
게임을 위한 DynamoDB 사례 및 팁 - 김일호 솔루션즈 아키텍트:: AWS Cloud Track 3 Gaming
Dynamodb 삽질기
DynamoDB를 이용한 PHP와 Django간 세션 공유 - 강대성 (피플펀드컴퍼니)
게임업계 IT 관리자를 위한 7가지 유용한 팁 - 박선용 솔루션즈 아키텍트:: AWS Cloud Track 3 Gaming
대용량 데이타 쉽고 빠르게 분석하기 :: 김일호 솔루션즈 아키텍트 :: Gaming on AWS 2016
Lambda를 활용한 서버없는 아키텍쳐 구현하기 :: 김기완 :: AWS Summit Seoul 2016
AWS Innovate 2016 : Closing Keynote - Glenn Gore
AWS Innovate: Smart Deployment on AWS - Andy Kim
소셜카지노 초기런칭 및 실험결과 공유
오토스케일링 제대로 활용하기 (김일호) - AWS 웨비나 시리즈 2015
성공적인 게임 런칭을 위한 비밀의 레시피 #3
관계형 데이터베이스의 새로운 패러다임 Amazon Aurora :: 김상필 :: AWS Summit Seoul 2016
AWS를 활용한 첫 빅데이터 프로젝트 시작하기(김일호)- AWS 웨비나 시리즈 2015
Gaming on AWS - 3. DynamoDB 모델링 및 Streams 활용법
Amazed by aws 1st session
Amazon Aurora Deep Dive (김기완) - AWS DB Day
Amazon Machine Learning 게임에서 활용해보기 :: 김일호 :: AWS Summit Seoul 2016
CloudFront(클라우드 프론트)와 Route53(라우트53) AWS Summit Seoul 2015
Amazon Aurora 100% 활용하기
AWS Innovate: Best Practices for Migrating to Amazon DynamoDB - Sangpil Kim
Ad

More from Amazon Web Services Korea (20)

PDF
[D3T1S01] Gen AI를 위한 Amazon Aurora 활용 사례 방법
PDF
[D3T1S06] Neptune Analytics with Vector Similarity Search
PDF
[D3T1S03] Amazon DynamoDB design puzzlers
PDF
[D3T1S04] Aurora PostgreSQL performance monitoring and troubleshooting by use...
PDF
[D3T1S07] AWS S3 - 클라우드 환경에서 데이터베이스 보호하기
PDF
[D3T1S05] Aurora 혼합 구성 아키텍처를 사용하여 예상치 못한 트래픽 급증 대응하기
PDF
[D3T1S02] Aurora Limitless Database Introduction
PDF
[D3T2S01] Amazon Aurora MySQL 메이저 버전 업그레이드 및 Amazon B/G Deployments 실습
PDF
[D3T2S03] Data&AI Roadshow 2024 - Amazon DocumentDB 실습
PDF
AWS Modern Infra with Storage Roadshow 2023 - Day 2
PDF
AWS Modern Infra with Storage Roadshow 2023 - Day 1
PDF
사례로 알아보는 Database Migration Service : 데이터베이스 및 데이터 이관, 통합, 분리, 분석의 도구 - 발표자: ...
PDF
Amazon DocumentDB - Architecture 및 Best Practice (Level 200) - 발표자: 장동훈, Sr. ...
PDF
Amazon Elasticache - Fully managed, Redis & Memcached Compatible Service (Lev...
PDF
Internal Architecture of Amazon Aurora (Level 400) - 발표자: 정달영, APAC RDS Speci...
PDF
[Keynote] 슬기로운 AWS 데이터베이스 선택하기 - 발표자: 강민석, Korea Database SA Manager, WWSO, A...
PDF
Demystify Streaming on AWS - 발표자: 이종혁, Sr Analytics Specialist, WWSO, AWS :::...
PDF
Amazon EMR - Enhancements on Cost/Performance, Serverless - 발표자: 김기영, Sr Anal...
PDF
Amazon OpenSearch - Use Cases, Security/Observability, Serverless and Enhance...
PDF
Enabling Agility with Data Governance - 발표자: 김성연, Analytics Specialist, WWSO,...
[D3T1S01] Gen AI를 위한 Amazon Aurora 활용 사례 방법
[D3T1S06] Neptune Analytics with Vector Similarity Search
[D3T1S03] Amazon DynamoDB design puzzlers
[D3T1S04] Aurora PostgreSQL performance monitoring and troubleshooting by use...
[D3T1S07] AWS S3 - 클라우드 환경에서 데이터베이스 보호하기
[D3T1S05] Aurora 혼합 구성 아키텍처를 사용하여 예상치 못한 트래픽 급증 대응하기
[D3T1S02] Aurora Limitless Database Introduction
[D3T2S01] Amazon Aurora MySQL 메이저 버전 업그레이드 및 Amazon B/G Deployments 실습
[D3T2S03] Data&AI Roadshow 2024 - Amazon DocumentDB 실습
AWS Modern Infra with Storage Roadshow 2023 - Day 2
AWS Modern Infra with Storage Roadshow 2023 - Day 1
사례로 알아보는 Database Migration Service : 데이터베이스 및 데이터 이관, 통합, 분리, 분석의 도구 - 발표자: ...
Amazon DocumentDB - Architecture 및 Best Practice (Level 200) - 발표자: 장동훈, Sr. ...
Amazon Elasticache - Fully managed, Redis & Memcached Compatible Service (Lev...
Internal Architecture of Amazon Aurora (Level 400) - 발표자: 정달영, APAC RDS Speci...
[Keynote] 슬기로운 AWS 데이터베이스 선택하기 - 발표자: 강민석, Korea Database SA Manager, WWSO, A...
Demystify Streaming on AWS - 발표자: 이종혁, Sr Analytics Specialist, WWSO, AWS :::...
Amazon EMR - Enhancements on Cost/Performance, Serverless - 발표자: 김기영, Sr Anal...
Amazon OpenSearch - Use Cases, Security/Observability, Serverless and Enhance...
Enabling Agility with Data Governance - 발표자: 김성연, Analytics Specialist, WWSO,...

Recently uploaded (20)

PPTX
Programs and apps: productivity, graphics, security and other tools
PDF
Advanced methodologies resolving dimensionality complications for autism neur...
PPTX
Big Data Technologies - Introduction.pptx
PPTX
Cloud computing and distributed systems.
PDF
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
PDF
Diabetes mellitus diagnosis method based random forest with bat algorithm
PDF
Per capita expenditure prediction using model stacking based on satellite ima...
PDF
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
PDF
Review of recent advances in non-invasive hemoglobin estimation
PPTX
Spectroscopy.pptx food analysis technology
PPTX
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
PDF
How UI/UX Design Impacts User Retention in Mobile Apps.pdf
PDF
Mobile App Security Testing_ A Comprehensive Guide.pdf
PPTX
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
PDF
Encapsulation_ Review paper, used for researhc scholars
PPTX
Understanding_Digital_Forensics_Presentation.pptx
PDF
Building Integrated photovoltaic BIPV_UPV.pdf
PDF
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
PPTX
MYSQL Presentation for SQL database connectivity
PDF
Empathic Computing: Creating Shared Understanding
Programs and apps: productivity, graphics, security and other tools
Advanced methodologies resolving dimensionality complications for autism neur...
Big Data Technologies - Introduction.pptx
Cloud computing and distributed systems.
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
Diabetes mellitus diagnosis method based random forest with bat algorithm
Per capita expenditure prediction using model stacking based on satellite ima...
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
Review of recent advances in non-invasive hemoglobin estimation
Spectroscopy.pptx food analysis technology
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
How UI/UX Design Impacts User Retention in Mobile Apps.pdf
Mobile App Security Testing_ A Comprehensive Guide.pdf
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
Encapsulation_ Review paper, used for researhc scholars
Understanding_Digital_Forensics_Presentation.pptx
Building Integrated photovoltaic BIPV_UPV.pdf
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
MYSQL Presentation for SQL database connectivity
Empathic Computing: Creating Shared Understanding

개발자가 알아야 할 Amazon DynamoDB 활용법 :: 김일호 :: AWS Summit Seoul 2016

  • 1. © 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. 김일호, Solutions Architect 05-17-2016 개발자가 알아야 할 Amazon DynamoDB 활용법
  • 2. Agenda Tip 1. DynamoDB Index(LSI, GSI) Tip 2. DynamoDB Scaling Tip 3. DynamoDB Data Modeling Scenario based Best Practice DynamoDB Streams
  • 3. Tip 1. DynamoDB Index(LSI, GSI) Tip 2. DynamoDB Scaling Tip 3. DynamoDB Data Modeling Scenario based Best Practice DynamoDB Streams
  • 4. Local secondary index (LSI) Alternate sort(=range) key attribute Index is local to a partition(=hash) key A1 (partition) A3 (sort) A2 (table key) A1 (partition) A2 (sort) A3 A4 A5 LSIs A1 (partition) A4 (sort) A2 (table key) A3 (projected) Table KEYS_ONLY INCLUDE A3 A1 (partition) A5 (sort) A2 (table key) A3 (projected) A4 (projected) ALL 10 GB max per hash key, i.e. LSIs limit the # of range keys!
  • 5. Global secondary index (GSI) Alternate partition key Index is across all table partition key A1 (partition) A2 A3 A4 A5 GSIs A5 (partition) A4 (sort) A1 (table key) A3 (projected) Table INCLUDE A3 A4 (partition) A5 (sort) A1 (table key) A2 (projected) A3 (projected) ALL A2 (partition) A1 (table key) KEYS_ONLY RCUs/WCUs provisioned separately for GSIs Online indexing
  • 6. How do GSI updates work? Table Primary table Primary table Primary table Primary table Global Secondary Index Client 2. Asynchronous update (in progress) If GSIs don’t have enough write capacity, table writes will be throttled!
  • 7. LSI or GSI? LSI can be modeled as a GSI If data size in an item collection > 10 GB, use GSI If eventual consistency is okay for your scenario, use GSI!
  • 8. Tip 1. DynamoDB Index(LSI, GSI) Tip 2. DynamoDB Scaling Tip 3. DynamoDB Data Modeling Scenario based Best Practice DynamoDB Streams
  • 9. Scaling Throughput • Provision any amount of throughput to a table Size • Add any number of items to a table • Max item size is 400 KB • LSIs limit the number of range keys due to 10 GB limit Scaling is achieved through partitioning
  • 10. Throughput Provisioned at the table level • Write capacity units (WCUs) are measured in 1 KB per second • Read capacity units (RCUs) are measured in 4 KB per second • RCUs measure strictly consistent reads • Eventually consistent reads cost 1/2 of consistent reads Read and write throughput limits are independent 200 RCU
  • 11. Partitioning math Number of Partitions By Capacity (Total RCU / 3000) + (Total WCU / 1000) By Size Total Size / 10 GB Total Partitions CEILING(MAX (Capacity, Size))
  • 12. Partitioning example Table size = 8 GB, RCUs = 5000, WCUs = 500 RCUs per partition = 5000/3 = 1666.67 WCUs per partition = 500/3 = 166.67 Data/partition = 10/3 = 3.33 GB RCUs and WCUs are uniformly spread across partitions Number of Partitions By Capacity (5000 / 3000) + (500 / 1000) = 2.17 By Size 8 / 10 = 0.8 Total Partitions CEILING(MAX (2.17, 0.8)) = 3
  • 13. Allocation of partitions A partition split occurs when • Increased provisioned throughput settings • Increased storage requirements http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html
  • 16. Getting the most out of DynamoDB throughput “To get the most out of DynamoDB throughput, create tables where the partition key element has a large number of distinct values, and values are requested fairly uniformly, as randomly as possible.” —DynamoDB Developer Guide Space: access is evenly spread over the key-space Time: requests arrive evenly spaced in time
  • 17. What causes throttling? If sustained throughput goes beyond provisioned throughput per partition Non-uniform workloads • Hot keys/hot partitions • Very large bursts Mixing hot data with cold data • Use a table per time period From the example before: • Table created with 5000 RCUs, 500 WCUs • RCUs per partition = 1666.67 • WCUs per partition = 166.67 • If sustained throughput > (1666 RCUs or 166 WCUs) per key or partition, DynamoDB may throttle requests • Solution: Increase provisioned throughput
  • 18. Tip 1. DynamoDB Index(LSI, GSI) Tip 2. DynamoDB Scaling Tip 3. DynamoDB Data Modeling Scenario based Best Practice DynamoDB Streams
  • 19. 1:1 relationships or key-values Use a table or GSI with a partition key Use GetItem or BatchGetItem API Example: Given an SSN or license number, get attributes Users Table Partition key Attributes SSN = 123-45-6789 Email = johndoe@nowhere.com, License = TDL25478134 SSN = 987-65-4321 Email = maryfowler@somewhere.com, License = TDL78309234 Users-Email-GSI Partition key Attributes License = TDL78309234 Email = maryfowler@somewhere.com, SSN = 987-65-4321 License = TDL25478134 Email = johndoe@nowhere.com, SSN = 123-45-6789
  • 20. 1:N relationships or parent-children Use a table or GSI with partition and sort key Use Query API Example: • Given a device, find all readings between epoch X, Y Device-measurements Partition Key Sort key Attributes DeviceId = 1 epoch = 5513A97C Temperature = 30, pressure = 90 DeviceId = 1 epoch = 5513A9DB Temperature = 30, pressure = 90
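The "readings between epoch X and Y" query above maps to a key condition with BETWEEN on the sort key. A sketch of the Query parameters (table and attribute names taken from the slide):

```python
def build_readings_query(device_id, epoch_start, epoch_end):
    """Query parameters: all readings for one device whose sort
    key (epoch) falls between two values."""
    return {
        "TableName": "Device-measurements",
        "KeyConditionExpression":
            "DeviceId = :d AND epoch BETWEEN :start AND :end",
        "ExpressionAttributeValues": {
            ":d": device_id,
            ":start": epoch_start,
            ":end": epoch_end,
        },
    }

params = build_readings_query(1, "5513A97C", "5513A9DB")
print(params["KeyConditionExpression"])
```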
  • 21. N:M relationships Use a table and GSI with partition and sort key elements switched Use Query API Example: Given a user, find all games. Or given a game, find all users. User-Games-Table Hash Key Range key UserId = bob GameId = Game1 UserId = fred GameId = Game2 UserId = bob GameId = Game3 Game-Users-GSI Hash Key Range key GameId = Game1 UserId = bob GameId = Game2 UserId = fred GameId = Game3 UserId = bob
  • 22. Documents (JSON) New data types (M, L, BOOL, NULL) introduced to support JSON Document SDKs • Simple programming model • Conversion to/from JSON • Java, JavaScript, Ruby, .NET Cannot index (S,N) elements of a JSON object stored in M • Only top-level table attributes can be used in LSIs and GSIs without Streams/Lambda JavaScript DynamoDB string S number N boolean BOOL null NULL array L object M
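The JavaScript-to-DynamoDB type mapping in this slide's table is what the document SDKs automate. A simplified sketch of that conversion (real SDKs also handle sets, binary data, and arbitrary-precision numbers):

```python
def to_dynamodb(value):
    """Map a JSON-style Python value to DynamoDB's typed wire
    format: S, N, BOOL, NULL, L, M (simplified sketch)."""
    if value is None:
        return {"NULL": True}
    if isinstance(value, bool):          # must check bool before int
        return {"BOOL": value}
    if isinstance(value, (int, float)):
        return {"N": str(value)}         # numbers travel as strings
    if isinstance(value, str):
        return {"S": value}
    if isinstance(value, list):
        return {"L": [to_dynamodb(v) for v in value]}
    if isinstance(value, dict):
        return {"M": {k: to_dynamodb(v) for k, v in value.items()}}
    raise TypeError(f"unsupported type: {type(value)}")

print(to_dynamodb({"Title": "Song", "Plays": 42, "Tags": ["rock"]}))
```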
  • 23. Rich expressions Projection expression to get just some of the attributes • Query/Get/Scan: ProductReviews.FiveStar[0]
  • 24. Rich expressions
  Projection expression to get just some of the attributes
  • Query/Get/Scan: ProductReviews.FiveStar[0]
  ProductReviews: {
    FiveStar: [
      "Excellent! Can't recommend it highly enough! Buy it!",
      "Do yourself a favor and buy this."
    ],
    OneStar: [
      "Terrible product! Do not buy this."
    ]
  }
  • 25. Rich expressions Filter expression • Query/Scan: #VIEWS > :num Update expression • UpdateItem: set Replies = Replies + :num
  • 26. Rich expressions
  Conditional expression
  • Put/Update/DeleteItem
  • attribute_not_exists(#pr.FiveStar)
  • attribute_exists(Pictures.RearView)
  How a conditional put with attribute_not_exists on the key behaves:
  1. DynamoDB first looks for an item whose primary key matches that of the item to be written.
  2. If no such item exists, the partition key attribute is absent from the result, attribute_not_exists evaluates to true, and the write proceeds.
  3. Otherwise, the attribute_not_exists condition fails and the write is prevented.
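The de-dupe pattern just described ("write only if this key doesn't already exist") comes down to one ConditionExpression on PutItem. A sketch of the request parameters, reusing the Users/SSN example from earlier slides:

```python
def build_conditional_put(table, item, key_attr):
    """PutItem parameters that succeed only when no item with the
    same partition key already exists (de-dupe / no-overwrite)."""
    return {
        "TableName": table,
        "Item": item,
        "ConditionExpression": f"attribute_not_exists({key_attr})",
    }

params = build_conditional_put(
    "Users", {"SSN": {"S": "123-45-6789"}}, "SSN")
print(params["ConditionExpression"])  # attribute_not_exists(SSN)
```

If the condition fails, DynamoDB rejects the write with a ConditionalCheckFailedException rather than silently overwriting the item.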
  • 27. Tip 1. DynamoDB Index(LSI, GSI) Tip 2. DynamoDB Scaling Tip 3. DynamoDB Data Modeling Scenario based Best Practice DynamoDB Streams
  • 29. Time series tables
  Events_table_2015_April (current table) — Event_id (partition key), Timestamp (sort key), Attribute1 … Attribute N
  Events_table_2015_March, Events_table_2015_February, Events_table_2015_January (older tables) — same schema
  Throughput per table: RCUs = 1000, WCUs = 100; RCUs = 10000, WCUs = 10000; RCUs = 100, WCUs = 1; RCUs = 10, WCUs = 1
  Hot data vs. cold data: don't mix hot and cold data; archive cold data to Amazon S3
  • 30. Use a table per time period • Pre-create daily, weekly, monthly tables • Provision required throughput for current table • Writes go to the current table • Turn off (or reduce) throughput for older tables • Pre-create heavy users, light users tables
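The routing step above ("writes go to the current table") is just deriving a table name from the write's timestamp. A minimal sketch, assuming the monthly naming scheme shown on the previous slide (Events_table_2015_April):

```python
from datetime import date

# English month names, hardcoded to keep the output locale-independent.
MONTH_NAMES = ["January", "February", "March", "April", "May", "June",
               "July", "August", "September", "October", "November",
               "December"]

def table_for_month(base, day):
    """Name of the monthly time-series table a write should land in."""
    return f"{base}_{day.year}_{MONTH_NAMES[day.month - 1]}"

print(table_for_month("Events_table", date(2015, 4, 17)))
# Events_table_2015_April
```

A scheduled job would pre-create next month's table and dial down throughput on last month's.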
  • 32. Scaling bottlenecks
  Partition 1: 2000 RCUs … Partition K: 2000 RCUs … Partition M: 2000 RCUs … Partition 50: 2000 RCUs
  ItemShopCatalog table (Product A, Product B) queried by gamers:
  SELECT Id, Description, ... FROM ItemShopCatalog
  • 33. RequestsPerSecond Item Primary Key Request Distribution Per Partition Key DynamoDB Requests
  • 34. Partition 1 Partition 2 ItemShopCatalog Table User DynamoDB User Cache popular items SELECT Id, Description, ... FROM ProductCatalog
  • 35. RequestsPerSecond Item Primary Key Request Distribution Per Partition Key DynamoDB Requests Cache Hits
  • 36. Multiplayer online gaming Query filters vs. composite key indexes
  • 37. GameId Date Host Opponent Status d9bl3 2014-10-02 David Alice DONE 72f49 2014-09-30 Alice Bob PENDING o2pnb 2014-10-08 Bob Carol IN_PROGRESS b932s 2014-10-03 Carol Bob PENDING ef9ca 2014-10-03 David Bob IN_PROGRESS Games Table Multiplayer online game data Partition key
  • 38. Query for incoming game requests
  DynamoDB indexes provide partition and sort keys
  What about queries for two equalities and a range?
  SELECT * FROM Game
  WHERE Opponent='Bob' (partition) AND Status='PENDING' (?)
  ORDER BY Date DESC (sort)
  • 39. Approach 1: Query filter
  Secondary Index — partition key: Opponent, sort key: Date; querying for Opponent = Bob
  Opponent Date GameId Status Host
  Alice 2014-10-02 d9bl3 DONE David
  Carol 2014-10-08 o2pnb IN_PROGRESS Bob
  Bob 2014-09-30 72f49 PENDING Alice
  Bob 2014-10-03 b932s PENDING Carol
  Bob 2014-10-03 ef9ca IN_PROGRESS David
  • 40. Secondary Index Approach 1: Query filter Bob Opponent Date GameId Status Host Alice 2014-10-02 d9bl3 DONE David Carol 2014-10-08 o2pnb IN_PROGRESS Bob Bob 2014-09-30 72f49 PENDING Alice Bob 2014-10-03 b932s PENDING Carol Bob 2014-10-03 ef9ca IN_PROGRESS David SELECT * FROM Game WHERE Opponent='Bob' ORDER BY Date DESC FILTER ON Status='PENDING' (filtered out)
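The filtered query on this slide translates to a key condition on Opponent plus a FilterExpression on Status. A sketch of the parameters (the index name `Opponent-Date-index` is an assumption; `Status` must go through an expression attribute name because it is a DynamoDB reserved word):

```python
def build_pending_games_query(opponent):
    """Query with a filter expression: the filter is applied after
    the key condition, so filtered-out items still consume RCUs."""
    return {
        "TableName": "Games",
        "IndexName": "Opponent-Date-index",    # assumed index name
        "KeyConditionExpression": "Opponent = :opp",
        "FilterExpression": "#st = :status",
        "ExpressionAttributeNames": {"#st": "Status"},
        "ExpressionAttributeValues": {
            ":opp": opponent, ":status": "PENDING"},
        "ScanIndexForward": False,             # ORDER BY Date DESC
    }

print(build_pending_games_query("Bob")["FilterExpression"])
```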
  • 41. Needle in a haystack Bob
  • 42. Use query filter • Send back less data “on the wire” • Simplify application code • Simple SQL-like expressions • AND, OR, NOT, () Use when your index isn’t entirely selective
  • 43. Approach 2: composite key — Status + Date = StatusDate
  DONE + 2014-10-02 = DONE_2014-10-02
  IN_PROGRESS + 2014-10-08 = IN_PROGRESS_2014-10-08
  IN_PROGRESS + 2014-10-03 = IN_PROGRESS_2014-10-03
  PENDING + 2014-09-30 = PENDING_2014-09-30
  PENDING + 2014-10-03 = PENDING_2014-10-03
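Building the composite attribute is plain string concatenation done at write time; a sketch (splitting from the right so statuses containing underscores, like IN_PROGRESS, survive the round trip):

```python
def make_status_date(status, date_str):
    """Concatenate Status and Date into the composite sort key."""
    return f"{status}_{date_str}"

def split_status_date(status_date):
    """Recover the two attributes; split on the LAST underscore
    because statuses such as IN_PROGRESS contain one themselves."""
    status, date_str = status_date.rsplit("_", 1)
    return status, date_str

key = make_status_date("PENDING", "2014-10-03")
print(key)                      # PENDING_2014-10-03
print(split_status_date("IN_PROGRESS_2014-10-08"))
# ('IN_PROGRESS', '2014-10-08')
```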
  • 44. Secondary Index Approach 2: composite key Opponent StatusDate GameId Host Alice DONE_2014-10-02 d9bl3 David Carol IN_PROGRESS_2014-10-08 o2pnb Bob Bob IN_PROGRESS_2014-10-03 ef9ca David Bob PENDING_2014-09-30 72f49 Alice Bob PENDING_2014-10-03 b932s Carol Partition key Sort key
  • 45. Opponent StatusDate GameId Host Alice DONE_2014-10-02 d9bl3 David Carol IN_PROGRESS_2014-10-08 o2pnb Bob Bob IN_PROGRESS_2014-10-03 ef9ca David Bob PENDING_2014-09-30 72f49 Alice Bob PENDING_2014-10-03 b932s Carol Secondary Index Approach 2: composite key Bob SELECT * FROM Game WHERE Opponent='Bob' AND StatusDate BEGINS_WITH 'PENDING'
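The gain of approach 2 can be shown with an in-memory simulation of the index on this slide: the begins_with condition on the composite sort key selects only the matching items, so nothing is read and then thrown away.

```python
# Rows from the slide's secondary index:
# (Opponent, StatusDate, GameId, Host)
INDEX = [
    ("Alice", "DONE_2014-10-02",        "d9bl3", "David"),
    ("Carol", "IN_PROGRESS_2014-10-08", "o2pnb", "Bob"),
    ("Bob",   "IN_PROGRESS_2014-10-03", "ef9ca", "David"),
    ("Bob",   "PENDING_2014-09-30",     "72f49", "Alice"),
    ("Bob",   "PENDING_2014-10-03",     "b932s", "Carol"),
]

def pending_games(opponent):
    """Simulate: Opponent = :opp AND begins_with(StatusDate, 'PENDING')."""
    return [row for row in INDEX
            if row[0] == opponent and row[1].startswith("PENDING")]

print(pending_games("Bob"))  # only Bob's two PENDING games
```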
  • 46. Needle in a sorted haystack Bob
  • 47. Sparse indexes Id (Hash) User Game Score Date Award 1 Bob G1 1300 2012-12-23 2 Bob G1 1450 2012-12-23 3 Jay G1 1600 2012-12-24 4 Mary G1 2000 2012-10-24 Champ 5 Ryan G2 123 2012-03-10 6 Jones G2 345 2012-03-20 Game-scores-table Award (Hash) Id User Score Champ 4 Mary 2000 Award-GSI Scan sparse hash GSIs
  • 48. Replace filter with indexes Concatenate attributes to form useful secondary index keys Take advantage of sparse indexes Use when You want to optimize a query as much as possible Status + Date
  • 50. Transactional Data Processing DynamoDB is well-suited for transactional processing: • High concurrency • Strong consistency • Atomic updates of single items • Conditional updates for de-dupe and optimistic concurrency • Supports both key/value and JSON document schema • Capable of handling large table sizes with low latency data access
  • 51. Case 1: Store and Index Metadata for Objects Stored in Amazon S3
  • 52. Case 1: Use Case
  We have a large number of digital audio files stored in Amazon S3 and we want to make them searchable
  → Use DynamoDB as the primary data store for the metadata.
  → Index and query the metadata using Elasticsearch.
  • 53. Case 1: Steps to Implement 1. Create a Lambda function that reads the metadata from the ID3 tag and inserts it into a DynamoDB table. 2. Enable S3 notifications on the S3 bucket storing the audio files. 3. Enable streams on the DynamoDB table. 4. Create a second Lambda function that takes the metadata in DynamoDB and indexes it using Elasticsearch. 5. Enable the stream as the event source for the Lambda function.
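The first Lambda function in step 1 is driven by the S3 notification event from step 2. A minimal sketch of its handler that extracts the bucket and key of each new audio file (reading the ID3 tag and writing to DynamoDB would follow, via the AWS SDK, and is omitted here; the bucket and object names are made up for illustration):

```python
def handler(event, context=None):
    """Pull (bucket, key) pairs out of an S3 notification event."""
    objects = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        objects.append((s3["bucket"]["name"], s3["object"]["key"]))
    return objects

# Abbreviated shape of an S3 put-notification event:
sample_event = {"Records": [{"s3": {
    "bucket": {"name": "audio-files"},
    "object": {"key": "songs/track01.mp3"}}}]}
print(handler(sample_event))  # [('audio-files', 'songs/track01.mp3')]
```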
  • 54. Case 1: Key Takeaways DynamoDB + Elasticsearch = Durable, scalable, highly- available database with rich query capabilities. Use Lambda functions to respond to events in both DynamoDB streams and Amazon S3 without having to manage any underlying compute infrastructure.
  • 55. Case 2 – Execute Queries Against Multiple Data Sources Using DynamoDB and Hive
  • 56. Case 2: Use Case
  We want to enrich our audio file metadata stored in DynamoDB with additional data from the Million Song dataset:
  → The Million Song dataset is stored in text files.
  → ID3 tag metadata is stored in DynamoDB.
  → Use Amazon EMR with Hive to join the two datasets together in a query.
  • 57. Case 2: Steps to Implement 1. Spin up an Amazon EMR cluster with Hive. 2. Create an external Hive table using the DynamoDBStorageHandler. 3. Create an external Hive table using the Amazon S3 location of the text files containing the Million Song project metadata. 4. Create and run a Hive query that joins the two external tables together and writes the joined results out to Amazon S3. 5. Load the results from Amazon S3 into DynamoDB.
  • 58. Case 2: Key Takeaways Use Amazon EMR to quickly provision a Hadoop cluster with Hive and to tear it down when done. Use of Hive with DynamoDB allows items in DynamoDB tables to be queried/joined with data from a variety of sources.
  • 59. Case 3 – Store and Analyze Sensor Data with DynamoDB and Amazon Redshift Dashboard
  • 60. Case 3: Use Case A large number of sensors are taking readings at regular intervals. You need to aggregate the data from each reading into a data warehouse for analysis: • Use Amazon Kinesis to ingest the raw sensor data. • Store the sensor readings in DynamoDB for fast access and real- time dashboards. • Store raw sensor readings in Amazon S3 for durability and backup. • Load the data from Amazon S3 into Amazon Redshift using AWS Lambda.
  • 61. Case 3: Steps to Implement 1. Create two Lambda functions to read data from the Amazon Kinesis stream. 2. Enable the Amazon Kinesis stream as an event source for each Lambda function. 3. Write data into DynamoDB in one of the Lambda functions. 4. Write data into Amazon S3 in the other Lambda function. 5. Use the aws-lambda-redshift- loader to load the data in Amazon S3 into Amazon Redshift in batches.
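Both Lambda functions in step 1 start the same way: each Kinesis record carries a base64-encoded payload that must be decoded back into a sensor reading. A sketch of that common front half (the DynamoDB and S3 writes are omitted; field names in the sample reading are made up):

```python
import base64
import json

def handler(event, context=None):
    """Decode each Kinesis record's base64 payload into a reading."""
    readings = []
    for record in event.get("Records", []):
        payload = base64.b64decode(record["kinesis"]["data"])
        readings.append(json.loads(payload))
    return readings

# Build a fake Kinesis event the way the service would deliver it:
reading = {"sensor": "s-17", "temp": 21.5}
event = {"Records": [{"kinesis": {
    "data": base64.b64encode(json.dumps(reading).encode()).decode()}}]}
print(handler(event))  # [{'sensor': 's-17', 'temp': 21.5}]
```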
  • 62. Case 3: Key Takeaways Amazon Kinesis + Lambda + DynamoDB = Scalable, durable, highly available solution for sensor data ingestion with very low operational overhead. DynamoDB is well-suited for near-realtime queries of recent sensor data readings. Amazon Redshift is well-suited for deeper analysis of sensor data readings spanning longer time horizons and very large numbers of records. Using Lambda to load data into Amazon Redshift provides a way to perform ETL in frequent intervals.
  • 63. Tip 1. DynamoDB Index(LSI, GSI) Tip 2. DynamoDB Scaling Tip 3. DynamoDB Data Modeling Scenario based Best Practice DynamoDB Streams
  • 64. Stream of updates to a table Asynchronous Exactly once Strictly ordered • Per item Highly durable • Scale with table 24-hour lifetime Sub-second latency DynamoDB Streams
  • 65. View types — example: UpdateItem (Name = John, Destination = Pluto)
  Keys only: Name = John
  Old image (before update): Name = John, Destination = Mars
  New image (after update): Name = John, Destination = Pluto
  Old and new images: Name = John, Destination = Mars → Name = John, Destination = Pluto
  • 66. Stream Table Partition 1 Partition 2 Partition 3 Partition 4 Partition 5 Table Shard 1 Shard 2 Shard 3 Shard 4 KCL Worker KCL Worker KCL Worker KCL Worker Amazon Kinesis Client Library application DynamoDB client application Updates DynamoDB Streams and Amazon Kinesis Client Library
  • 67. DynamoDB Streams Open Source Cross-Region Replication Library Asia Pacific (Sydney) EU (Ireland) Replica US East (N. Virginia) Cross-region replication
  • 68. DynamoDB Streams and AWS Lambda
  • 69. Triggers DynamoDB Streams → Lambda function → Notify change, Derivative tables, Amazon CloudSearch, Amazon Elasticsearch Service, Amazon ElastiCache
  • 70. Analytics with DynamoDB Streams Collect and de-dupe data in DynamoDB Aggregate data in-memory and flush periodically Performing real-time aggregation and analytics EMR Redshift DynamoDB
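The "collect, de-dupe, aggregate in memory, flush periodically" pattern on this slide can be sketched as a small class; the flush destination (EMR/Redshift in the slide) is stubbed out as a list, and the flush interval is a made-up parameter:

```python
class StreamAggregator:
    """De-dupe stream records by id, aggregate counts per key in
    memory, and flush a batch every `flush_every` unique records."""

    def __init__(self, flush_every=3):
        self.seen = set()        # record ids already processed
        self.counts = {}         # in-memory aggregate
        self.flush_every = flush_every
        self.flushed = []        # stand-in for the EMR/Redshift sink

    def ingest(self, record_id, key):
        if record_id in self.seen:           # de-dupe duplicates
            return
        self.seen.add(record_id)
        self.counts[key] = self.counts.get(key, 0) + 1
        if len(self.seen) % self.flush_every == 0:
            self.flush()

    def flush(self):
        # In the real pipeline this batch would be shipped downstream.
        self.flushed.append(dict(self.counts))
        self.counts = {}

agg = StreamAggregator()
for rid, key in [(1, "a"), (1, "a"), (2, "b"), (3, "a")]:
    agg.ingest(rid, key)
print(agg.flushed)  # [{'a': 2, 'b': 1}]
```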
  • 73. We look forward to your feedback!
  https://www.awssummit.co.kr
  Visit the mobile page and complete the session survey to receive a gift after the event.
  Share your thoughts on social media with the #AWSSummit hashtag.
  Slides and session recordings will be shared soon on the official AWS Korea social channels.