Tues 115pm cassandra + s3 + hadoop = quick auditing and analytics_yazovskiy

Polyglot Persistence in the
Real World

Anton Yazovskiy
Thumbtack Technology

›  Software Engineer at Thumbtack Technology
›  an active user of various NoSQL solutions
›  consulting with focus on scalability
›  a significant part of my work is advising people on which solutions to use and why
›  big fan of BigData and clouds

›  NoSQL – not a silver bullet
›  Choices that we make
›  Cassandra: operational workload
›  Cassandra: analytical workload
›  The best of both worlds
›  Some benchmarks
›  Conclusions

•  well known ways to scale
•  scale in/out, scale by function, data denormalization
•  really works
•  each has disadvantages
•  mostly manual process (newSQL)

http://qsec.deviantart.com

›  solve exactly this kind of problem
›  rapid application development
›  aggregate
›  schema flexibility
›  auto-scale-out
›  auto-failover

›  amount of data able to handle
›  shared nothing architecture, no SPOF
›  performance

›  splendors and miseries of aggregate
›  CAP theorem dilemma

Consistency, Availability, Partition Tolerance

[Diagram: Analytical vs. Operational workloads against Consistency, Availability, Performance and Reliability]

I want it all

(released by Facebook in 2008)
›  elastic scalability & linear performance *
›  dynamic schema
›  very high write throughput
›  tunable per request consistency
›  fault-tolerant design
›  multiple datacenter and cloud readiness
›  CaS transaction support *
* http://www.datastax.com/what-we-offer/products-services/datastax-enterprise/apache-cassandra

›  Large data set on commodity hardware
›  Tradeoff between speed and reliability
›  Heavy-write workload
›  Time-series data

http://www.datastax.com/what-we-offer/products-services/datastax-enterprise/apache-cassandra

[Diagram: Cassandra positioned among Analytical, Operational, Performance and Reliability]

Small demo after this slide

[Diagram: table rows with TIMESTAMP, FIELD 1, … columns (DATA) spread across SERVER 1 and SERVER 2]

select * from table
where timestamp > 12344567
and timestamp < 13237457

›  expensive range queries across cluster
›  unless shard by timestamp
›  become a bottleneck for heavy-write workload

›  all columns are sorted by name
›  row – aggregate item (never sharded)

Column Family:
row key 1: column 1 = value 1.1, column 2 = value 1.2, column 3 = value 1.3, .., column N = value 1.N
row key 2: column 1 = value 2.1, column 2 = value 2.2, ..., column M = value 2.M

get key
get slice
get range

+ combinations of these queries
+ composite columns

Super columns are discouraged and omitted here
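
The data model above can be sketched in plain Java. This is only an in-memory illustration, not a Cassandra client: the class name ColumnFamilySketch and its methods are hypothetical, and a TreeMap stands in for "all columns are sorted by name".

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// In-memory stand-in for one column family: row key -> columns sorted by name.
// Hypothetical names; this illustrates the model and is not a Cassandra API.
class ColumnFamilySketch {
    private final Map<String, NavigableMap<String, String>> rows = new TreeMap<>();

    void put(String rowKey, String column, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>()).put(column, value);
    }

    // "get key": every column of a single row.
    NavigableMap<String, String> getKey(String rowKey) {
        return rows.getOrDefault(rowKey, new TreeMap<>());
    }

    // "get slice": columns of a single row whose names lie in [from, to].
    NavigableMap<String, String> getSlice(String rowKey, String from, String to) {
        return getKey(rowKey).subMap(from, true, to, true);
    }

    public static void main(String[] args) {
        ColumnFamilySketch cf = new ColumnFamilySketch();
        cf.put("row key 1", "column 3", "value 1.3");
        cf.put("row key 1", "column 1", "value 1.1");
        cf.put("row key 1", "column 2", "value 1.2");
        // Columns come back sorted by name regardless of insertion order.
        System.out.println(cf.getSlice("row key 1", "column 1", "column 2"));
    }
}
```

Because a slice never leaves its row, it touches a single aggregate, which is why the row is the unit that is never sharded.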

get_slice(row_key, from, to, count)

[Diagram: rows "row key 1" to "row key 5" distributed across SERVER 1 and SERVER 2, each row holding columns named by timestamp]

get_slice(“row key 1”, from:“timestamp 1”, null, 11)

get_slice(row_key, from, to, count)

[Diagram: rows "row key 1" to "row key 5" distributed across SERVER 1 and SERVER 2, each row holding columns named by timestamp]


Next page

get_slice(“row key 1”, null, to:“timestamp 11”, 11)

Prev.page
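
A minimal in-memory sketch of paging with slices follows. It assumes that the count of 11 means "10 columns to show plus one extra whose name becomes the start of the neighbouring page"; the slides do not state this, and the class and method names are hypothetical rather than part of any Cassandra client.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// Paging over one row's timestamp-named columns, in memory only.
class SlicePagingSketch {
    // Like get_slice(row, from, null, count): up to count columns >= from.
    static List<Long> sliceFrom(NavigableSet<Long> columns, long from, int count) {
        List<Long> page = new ArrayList<>();
        for (Long column : columns.tailSet(from, true)) {
            if (page.size() == count) break;
            page.add(column);
        }
        return page;
    }

    // Like get_slice(row, null, to, count): up to count columns <= to,
    // collected backwards but returned in ascending order.
    static List<Long> sliceTo(NavigableSet<Long> columns, long to, int count) {
        List<Long> page = new ArrayList<>();
        for (Long column : columns.headSet(to, true).descendingSet()) {
            if (page.size() == count) break;
            page.add(0, column);
        }
        return page;
    }

    public static void main(String[] args) {
        NavigableSet<Long> row = new TreeSet<>();
        for (long t = 1; t <= 25; t++) row.add(t);
        List<Long> first = sliceFrom(row, 1, 11);        // 10 to show + 1 extra
        long nextStart = first.get(first.size() - 1);    // start of the next page
        System.out.println(first.subList(0, 10) + ", next page starts at " + nextStart);
        System.out.println(sliceTo(row, nextStart, 11)); // walk back from the same column
    }
}
```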

›  Time-range with filter:
›  “get all events for User J from N to M”
›  “get all success events for User J from N to M”
›  “get all events for all users from N to M”

Row keys (each row holds timestamp → value columns):
events::success::User_123: timestamp 1 = value 1
events::success: timestamp 1 = value 1
events::User_123: timestamp 1 = value 1
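
One way to read the row keys above is that every event is written under several keys, so that each filtered query becomes a single-row time slice. The sketch below assumes exactly that; the helper name rowKeysFor and the rule for choosing keys are illustrative assumptions, not taken from the slides.

```java
import java.util.ArrayList;
import java.util.List;

// Builds the row keys one event would be written under (assumed denormalization).
class EventRowKeys {
    static List<String> rowKeysFor(String userId, boolean success) {
        List<String> keys = new ArrayList<>();
        keys.add("events::" + userId);                  // all events of one user
        if (success) {
            keys.add("events::success");                // success events of all users
            keys.add("events::success::" + userId);     // success events of one user
        }
        return keys;
    }

    public static void main(String[] args) {
        System.out.println(rowKeysFor("User_123", true));
    }
}
```

The cost of this layout is write amplification: a single successful event turns into three column writes, which fits a store optimized for heavy-write workloads.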

›  Counters:
›  “get # of events for User J grouped by hour”
›  “get # of events for User J grouped by day”

Row keys (each row holds timestamp → counter columns):
events::success::User_123: 1380400000 = 14, 1380403600 = 42
events::User_123: 1380400000 = 842, 1380403600 = 1024

(group by day – same but in different column family for TTL support)
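
The counter layout can be sketched in memory as follows. The sketch assumes each column name is the start of a fixed-size time bucket in epoch seconds, aligned by simple truncation; the class EventCounters is hypothetical, and it keeps hour and day buckets in one map for brevity, whereas the slide keeps them in separate column families.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// In-memory stand-in for counter rows: row key -> (bucket start -> count).
class EventCounters {
    static final long HOUR = 3600L;
    static final long DAY = 86400L;

    private final Map<String, TreeMap<Long, Long>> rows = new HashMap<>();

    // Start of the bucket containing the timestamp (non-negative epoch seconds).
    static long bucket(long epochSeconds, long bucketSize) {
        return epochSeconds - epochSeconds % bucketSize;
    }

    void increment(String rowKey, long epochSeconds, long bucketSize) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>())
            .merge(bucket(epochSeconds, bucketSize), 1L, Long::sum);
    }

    long count(String rowKey, long epochSeconds, long bucketSize) {
        TreeMap<Long, Long> row = rows.get(rowKey);
        return row == null ? 0L : row.getOrDefault(bucket(epochSeconds, bucketSize), 0L);
    }

    public static void main(String[] args) {
        EventCounters counters = new EventCounters();
        counters.increment("events::User_123", 1380401234L, HOUR);
        counters.increment("events::User_123", 1380401999L, HOUR);
        System.out.println(counters.count("events::User_123", 1380401500L, HOUR));
    }
}
```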

›  row key should consist of combination of fields with high cardinality of values:
›  name, id, etc..
›  boolean values are bad option
›  composite columns – good option for it
›  timestamp may help to spread historical data
›  otherwise, scalability will not be linear
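
As an illustration of using the timestamp to spread historical data, the sketch below appends a month bucket to a high-cardinality row key, so one user's history is split across many rows instead of growing a single row forever. The monthly granularity, UTC zone, key format and the name BucketedRowKey are all assumptions made for this example.

```java
import java.time.Instant;
import java.time.YearMonth;
import java.time.ZoneOffset;

// Row key = high-cardinality field + time bucket (hypothetical format).
class BucketedRowKey {
    static String rowKey(String userId, long epochSeconds) {
        // Month of the event in UTC, rendered as yyyy-MM.
        YearMonth month = YearMonth.from(
                Instant.ofEpochSecond(epochSeconds).atZone(ZoneOffset.UTC));
        return "events::" + userId + "::" + month;
    }

    public static void main(String[] args) {
        System.out.println(rowKey("User_123", 1380400000L));
    }
}
```

A time-range query then reads one slice per month bucket that overlaps the requested range.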

In theory – possible in real-time
›  average, 3 dimensional filters, group by, etc..
But:
›  hard to tune data model
›  lack of aggregation options
›  aggregation by historical data

“I want interactive reports”

Auto update
somehow

Cassandra

“Reports could be a little bit out of date, but I want to control this delay value”

›  Impact on production system
or
›  Higher total cost of ownership
›  Difficulties with scalability
›  hard to support with multiple clusters
http://www.datastax.com/docs/0.7/map_reduce/hadoop_mr


http://aws.amazon.com

›  Hadoop tech.stack
›  Automatic deployment
›  Management API
›  Temporal cluster
›  Amazon S3 as data storage *

* copy from S3 to EMR HDFS and back

JobFlowInstancesConfig instances = ..
instances.setHadoopVersion(..)
instances.setInstanceCount(dataNodeCount + 1)
instances.setMasterInstanceType(..)
instances.setSlaveInstanceType(..)
RunJobFlowRequest req = ..(name, instances)
req.addSteps(new StepConfig(name, jar))
AmazonElasticMapReduce emr = ..
emr.runJobFlow(req)

Execute job on running cluster:
StepConfig stepConfig = new StepConfig(name, jar)
AddJobFlowStepsRequest addReq = …
addReq.setJobFlowId(jobFlowId)
addReq.setSteps(Arrays.asList(stepConfig))
AmazonElasticMapReduce emr = ..
emr.addJobFlowSteps(addReq)

›  cluster lifecycle: Long-Running or Transient
›  cold start = ~20 min
›  tradeoff: cluster cost VS availability
›  Compression and Combiner tuning may speed up jobs considerably
›  common problems for all big data processing tools: monitoring, testability and debugging (MRUnit, local Hadoop, smaller data set)

// txId is declared before the try block so that the catch block can use it
long txId = cassandra.persist(entity)
try {
    sql.insert(some)
    sql.update(someElse)
    cassandra.commit(txId)
    sql.commit()
} catch (Exception e) {
    sql.rollback()
    cassandra.rollback(txId)
}

insert into CHANGES (key, commited, data)
values ('tx_id-58e0a7d7-eebc', 'false', ..)
update CHANGES set commited = 'true'
where key = 'tx_id-58e0a7d7-eebc'
delete from CHANGES
where key = 'tx_id-58e0a7d7-eebc'
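
The three statements above suggest a simple protocol: insert the change with the commited flag off, flip the flag on commit, delete the row on rollback. The in-memory sketch below follows that reading, which is an interpretation of the slide rather than the author's implementation; the class ChangesLogSketch, its methods and the isVisible check are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// In-memory stand-in for the CHANGES table keyed by transaction id.
class ChangesLogSketch {
    private static final class Change {
        final String data;
        boolean committed;          // corresponds to the commited column
        Change(String data) { this.data = data; }
    }

    private final Map<String, Change> changes = new HashMap<>();

    // insert into CHANGES ... with the flag set to false
    String persist(String data) {
        String txId = "tx_id-" + UUID.randomUUID();
        changes.put(txId, new Change(data));
        return txId;
    }

    // update CHANGES set the flag to true where key = txId
    void commit(String txId) {
        Change change = changes.get(txId);
        if (change != null) change.committed = true;
    }

    // delete from CHANGES where key = txId
    void rollback(String txId) {
        changes.remove(txId);
    }

    // Readers should only trust changes whose flag has been set.
    boolean isVisible(String txId) {
        Change change = changes.get(txId);
        return change != null && change.committed;
    }

    public static void main(String[] args) {
        ChangesLogSketch log = new ChangesLogSketch();
        String txId = log.persist("entity payload");
        System.out.println(log.isVisible(txId)); // not yet committed
        log.commit(txId);
        System.out.println(log.isVisible(txId)); // committed
    }
}
```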

I

numbers
non-production setup:
•  3 nodes (cassandra)
•  m1.medium EC2 instance
•  1 data center
•  1 app instance

real-time metrics update (sync):
›  average latency - 60 msec
›  process > 2,000 events per second
›  generate > 1000 reports per second
real-time metrics update (async):
›  process > 15,000 events per second
uploading to AWS S3: slow, but multi-threading helps *
it is more than enough, but what if …
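
The remark that multi-threading helps with slow S3 uploads can be sketched with a fixed thread pool. The uploader callback below is a hypothetical stand-in for whatever client call actually sends one file; no AWS API is used or implied, and the thread count is an arbitrary parameter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

// Runs one upload per file on a bounded pool and waits for all of them.
class ParallelUploadSketch {
    static int uploadAll(List<String> files, int threads, Consumer<String> uploader) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (String file : files) {
                pending.add(pool.submit(() -> uploader.accept(file)));
            }
            for (Future<?> upload : pending) {
                try {
                    upload.get(); // blocks until done, surfaces upload failures
                } catch (InterruptedException | ExecutionException e) {
                    throw new IllegalStateException("upload failed", e);
                }
            }
            return pending.size();
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> files = List.of("part-0000", "part-0001", "part-0002");
        int done = uploadAll(files, 2, file -> System.out.println("uploaded " + file));
        System.out.println(done + " files uploaded");
    }
}
```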

›  distributed systems force you to make decisions
›  systems like Cassandra trade consistency for speed
›  CAP theorem is oversimplified
›  you have many more options

›  polyglot persistence can make this world a better place
›  do not try to hammer every nail with the same hammer

›  Cassandra – great for time series data and heavy-write workload…
›  ... but use cases should be clearly defined

›  Amazon S3 – is great
›  simple, slow, but predictable storage

›  Amazon EMR
›  integration with S3 – great
›  very good API, but …
›  … isn’t a magic trick and requires knowledge about Hadoop and skills for effective usage


ayazovskiy@thumbtack.net
@yazovsky
www.linkedin.com/in/yazovsky

http://www.thumbtack.net
http://thumbtack.net/whitepapers
