Polyglot Persistence in the Real World

Anton Yazovskiy
Thumbtack Technology

›  Software Engineer at Thumbtack Technology
›  an active user of various NoSQL solutions
›  consulting with focus on scalability
›  a significant part of my work is advising people on which solutions to use and why
›  big fan of BigData and clouds

›  NoSQL – not a silver bullet
›  Choices that we make
›  Cassandra: operational workload
›  Cassandra: analytical workload
›  The best of both worlds
›  Some benchmarks
›  Conclusions

•  well known ways to scale
•  scale in/out, scale by function, data denormalization
•  really works
•  each has disadvantages
•  mostly manual process (newSQL)

http://qsec.deviantart.com

›  solve exactly these kinds of problems
›  rapid application development
›  aggregate
›  schema flexibility
›  auto-scale-out
›  auto-failover
›  amount of data able to handle
›  shared nothing architecture, no SPOF
›  performance

›  splendors and miseries of aggregate
›  CAP theorem dilemma: Consistency, Availability, Partition Tolerance

[Diagram: Analytical and Operational workloads; labels: Consistency, Availability, Performance, Reliability]

I want it all

Cassandra (released by Facebook in 2008)
›  elastic scalability & linear performance *
›  dynamic schema
›  very high write throughput
›  tunable per request consistency
›  fault-tolerant design
›  multiple datacenter and cloud readiness
›  CaS transaction support *

* http://www.datastax.com/what-we-offer/products-services/datastax-enterprise/apache-cassandra

›  Large data set on commodity hardware
›  Tradeoff between speed and reliability
›  Heavy-write workload
›  Time-series data

http://www.datastax.com/what-we-offer/products-services/datastax-enterprise/apache-cassandra

[Diagram: Cassandra; labels: Analytical, Operational, Performance, Reliability]

Small demo after this slide

[Diagram: a table with TIMESTAMP and FIELD 1 columns; rows with timestamps 12344567, 12326346, 13124124, 13237457, 13627236 are spread across SERVER 1 and SERVER 2]

select * from table
where timestamp > 12344567
and timestamp < 13237457

›  expensive range queries across cluster
›  unless shard by timestamp
›  become a bottleneck for heavy-write workload
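
The tradeoff on this slide can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions: two servers, a modulo function standing in for hash placement, and a single hypothetical split point standing in for sharding by timestamp. None of the names below come from the talk or from any database API.

```java
import java.util.Set;
import java.util.TreeSet;

// Illustrative only: compares hash-style placement with sharding by timestamp.
class ShardingSketch {
    static final int SERVERS = 2;

    // Hash-style placement (assumes non-negative timestamps):
    // neighbouring timestamps land on different servers.
    static int hashShard(long timestamp) {
        return (int) (timestamp % SERVERS);
    }

    // Sharding by timestamp: everything below the split point lives on server 0.
    static int rangeShard(long timestamp, long splitPoint) {
        return timestamp < splitPoint ? 0 : 1;
    }

    // Which servers a range query over [from, to] has to contact.
    static Set<Integer> serversTouched(long from, long to, boolean shardByTimestamp, long splitPoint) {
        Set<Integer> servers = new TreeSet<>();
        for (long t = from; t <= to; t++) {
            servers.add(shardByTimestamp ? rangeShard(t, splitPoint) : hashShard(t));
        }
        return servers;
    }

    public static void main(String[] args) {
        long split = 1_000;
        System.out.println("hash placement, range query touches servers "
                + serversTouched(100, 200, false, split));
        System.out.println("timestamp sharding, range query touches servers "
                + serversTouched(100, 200, true, split));
        // With timestamp sharding, every new (larger) timestamp goes to the same server.
        System.out.println("new writes go to server " + rangeShard(5_000, split));
    }
}
```

With hash placement the range query fans out to every server. With timestamp sharding it stays on one server, but so do all fresh writes, which is the bottleneck the slide mentions.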

›  all columns are sorted by name
›  row – aggregate item (never sharded)

[Diagram: a Column Family with two rows. row key 1 holds column 1, column 2, column 3, .. column N with values 1.1, 1.2, 1.3, .. 1.N; row key 2 holds column 1, column 2, ... column M with values 2.1, 2.2, … 2.M]

get key
get slice
get range
+ combinations of these queries
+ composite columns

Super columns are discouraged and omitted here

get_slice(row_key, from, to, count)

[Diagram: row key 1 .. row key 5 spread across SERVER 1 and SERVER 2; each row holds a sorted run of timestamp columns]

get_slice(“row key 1”, from:“timestamp 1”, null, 11)

Next page:
get_slice(“row key 1”, from:“timestamp 11”, null, 11)

Prev. page:
get_slice(“row key 1”, null, to:“timestamp 11”, 11)
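
The paging pattern above can be tried out without a cluster. The sketch below is an assumption-laden stand-in: a `TreeMap` plays the role of one row whose columns are sorted by name, and `getSlice` is a hypothetical analogue of `get_slice`, not the Cassandra client API. It only covers forward paging. The page size of 10 is inferred from the count of 11 on the slide: the extra column marks where the next page starts.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// In-memory stand-in for one row: columns sorted by name (here, a timestamp).
class SlicePager {
    private final NavigableMap<Long, String> columns = new TreeMap<>();

    void put(long timestamp, String value) {
        columns.put(timestamp, value);
    }

    // Hypothetical analogue of get_slice(row_key, from, to, count):
    // up to `count` column names in [from, to], ascending; null means unbounded.
    List<Long> getSlice(Long from, Long to, int count) {
        NavigableMap<Long, String> view = columns;
        if (from != null) view = view.tailMap(from, true);
        if (to != null) view = view.headMap(to, true);
        List<Long> names = new ArrayList<>();
        for (Long name : view.keySet()) {
            if (names.size() == count) break;
            names.add(name);
        }
        return names;
    }

    public static void main(String[] args) {
        SlicePager row = new SlicePager();
        for (long t = 1; t <= 25; t++) row.put(t, "event-" + t);

        int pageSize = 10;
        // Ask for pageSize + 1 columns: the extra one is the start of the next page.
        List<Long> slice = row.getSlice(1L, null, pageSize + 1);
        List<Long> page = slice.subList(0, Math.min(pageSize, slice.size()));
        Long nextStart = slice.size() > pageSize ? slice.get(pageSize) : null;
        System.out.println("page: " + page + ", next page starts at " + nextStart);
    }
}
```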

›  Time-range with filter:
›  “get all events for User J from N to M”
›  “get all success events for User J from N to M”
›  “get all events for all users from N to M”

[Diagram: rows events::success::User_123, events::success and events::User_123, each holding a column "timestamp 1" with "value 1"]
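
One way to read the diagram is that every event is written several times, once under each row key that a query will later slice by time. The sketch below only builds those row keys, following the `events::…` naming shown on the slide. The method name and the choice of exactly three keys are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Builds the row keys a single event would be written under (one per supported query).
class EventRowKeys {
    static List<String> rowKeysFor(String userId, String status) {
        List<String> keys = new ArrayList<>();
        keys.add("events::" + userId);                  // all events for one user
        keys.add("events::" + status);                  // events with this status, all users
        keys.add("events::" + status + "::" + userId);  // events with this status for one user
        return keys;
    }

    public static void main(String[] args) {
        // Each returned row would get a column named by the event timestamp.
        System.out.println(rowKeysFor("User_123", "success"));
    }
}
```

Each filter then becomes a plain time-range slice on a single row, at the price of writing the event more than once.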
  

›  Counters:
›  “get # of events for User J grouped by hour”
›  “get # of events for User J grouped by day”

row key                      1380400000   1380403600
events::success::User_123    14           42
events::User_123             842          1024

(group by day – same but in different column family for TTL support)
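
The counter layout can be sketched in memory as follows. It assumes column names are the start of each hour in epoch seconds. The slide's sample values are one hour apart but not aligned to hour boundaries, so they are illustrative only. The class and method names are hypothetical, and a real deployment would use Cassandra counter columns rather than a Java map.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// In-memory sketch of per-hour counters: row key -> (hour bucket -> count).
class HourlyCounters {
    private static final long HOUR_SECONDS = 3600;
    // TreeMap keeps the hour buckets sorted, like column names within a row.
    private final Map<String, TreeMap<Long, Long>> rows = new HashMap<>();

    // Start of the hour (epoch seconds) that the timestamp falls into.
    static long hourBucket(long epochSeconds) {
        return epochSeconds - (epochSeconds % HOUR_SECONDS);
    }

    void increment(String rowKey, long epochSeconds) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>())
            .merge(hourBucket(epochSeconds), 1L, Long::sum);
    }

    long count(String rowKey, long bucket) {
        TreeMap<Long, Long> row = rows.get(rowKey);
        return row == null ? 0L : row.getOrDefault(bucket, 0L);
    }

    public static void main(String[] args) {
        HourlyCounters counters = new HourlyCounters();
        counters.increment("events::User_123", 7_201);
        counters.increment("events::User_123", 7_300);
        counters.increment("events::User_123", 10_800);
        System.out.println(counters.count("events::User_123", 7_200));
    }
}
```

Grouping by day works the same way with a bucket of 86400 seconds, kept in a separate column family as the slide notes.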

›  row key should consist of a combination of fields with high cardinality of values:
›  name, id, etc.
›  boolean values are a bad option
›  composite columns – good option for it
›  timestamp may help to spread historical data
›  otherwise, scalability will not be linear
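
One possible reading of "timestamp may help to spread historical data" is to put a coarse time bucket into the row key, so that one user's history is split across many rows instead of growing a single row forever. The sketch below assumes a daily bucket and the `events::` key prefix used elsewhere in the deck. Both are illustrative choices, not something the talk prescribes.

```java
import java.util.ArrayList;
import java.util.List;

// Row keys that combine a high-cardinality field (user id) with a day bucket.
class BucketedRowKey {
    private static final long DAY_SECONDS = 86_400;

    static String rowKey(String userId, long epochSeconds) {
        return "events::" + userId + "::" + (epochSeconds / DAY_SECONDS);
    }

    // Row keys that must be read to cover [from, to] (epoch seconds, inclusive).
    static List<String> rowKeysForRange(String userId, long from, long to) {
        List<String> keys = new ArrayList<>();
        for (long day = from / DAY_SECONDS; day <= to / DAY_SECONDS; day++) {
            keys.add("events::" + userId + "::" + day);
        }
        return keys;
    }

    public static void main(String[] args) {
        System.out.println(rowKey("User_123", 1_380_400_000L));
        System.out.println(rowKeysForRange("User_123", 0, 2 * DAY_SECONDS));
    }
}
```

The cost is that a time-range read now has to visit one row per bucket in the range.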
In theory – possible in real-time
›  average, 3-dimensional filters, group by, etc.
But:
›  hard to tune data model
›  lack of aggregation options
›  aggregation by historical data

“I want interactive reports”

“Reports could be a little bit out of date, but I want to control this delay value”

[Diagram: Cassandra, “Auto update somehow”]

›  Impact on production system
or
›  Higher total cost of ownership
›  Difficulties with scalability
›  hard to support with multiple clusters

http://www.datastax.com/docs/0.7/map_reduce/hadoop_mr

Polyglot Persistence in the Real World: Cassandra + S3 + MapReduce
http://aws.amazon.com
›  Hadoop tech. stack
›  Automatic deployment
›  Management API
›  Temporal cluster
›  Amazon S3 as data storage *

* copy from S3 to EMR HDFS and back

JobFlowInstancesConfig instances = ..
instances.setHadoopVersion(..)
instances.setInstanceCount(dataNodeCount + 1)
instances.setMasterInstanceType(..)
instances.setSlaveInstanceType(..)
RunJobFlowRequest req = ..(name, instances)
req.addSteps(new StepConfig(name, jar))
AmazonElasticMapReduce emr = ..
emr.runJobFlow(req)
Execute job on running cluster:
StepConfig stepConfig = new StepConfig(name, jar)
AddJobFlowStepsRequest addReq = …
addReq.setJobFlowId(jobFlowId)
addReq.setSteps(Arrays.asList(stepConfig))
AmazonElasticMapReduce emr = ..
emr.addJobFlowSteps(addReq)

cluster lifecycle: Long-Running or Transient
›  cold start = ~20 min
›  tradeoff: cluster cost VS availability
›  Compression and Combiner tuning may speed up jobs considerably
›  common problems for all big data processing tools: monitoring, testability and debugging (MRUnit, local hadoop, smaller data set)
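
Why a Combiner can help is easiest to see on a counting job. The sketch below is plain Java, not Hadoop API code: it simulates two mappers that emit one record per event, pre-aggregates each mapper's output locally, and then reduces the partial counts. All names and the sample data are made up for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simulates map-side pre-aggregation (the role a Combiner plays) for a counting job.
class CombinerSketch {
    // Collapse one mapper's (key, 1) records into (key, partialCount).
    static Map<String, Long> combine(List<String> mapperOutputKeys) {
        Map<String, Long> partial = new HashMap<>();
        for (String key : mapperOutputKeys) partial.merge(key, 1L, Long::sum);
        return partial;
    }

    // Reduce: sum the partial counts received from all mappers.
    static Map<String, Long> reduce(List<Map<String, Long>> partials) {
        Map<String, Long> totals = new HashMap<>();
        for (Map<String, Long> partial : partials) {
            partial.forEach((key, count) -> totals.merge(key, count, Long::sum));
        }
        return totals;
    }

    public static void main(String[] args) {
        List<String> mapper1 = Arrays.asList("success", "success", "failure", "success");
        List<String> mapper2 = Arrays.asList("failure", "success");

        Map<String, Long> partial1 = combine(mapper1);
        Map<String, Long> partial2 = combine(mapper2);
        int shuffledWithout = mapper1.size() + mapper2.size();
        int shuffledWith = partial1.size() + partial2.size();

        System.out.println("records shuffled without combiner: " + shuffledWithout);
        System.out.println("records shuffled with combiner: " + shuffledWith);
        System.out.println("totals: " + reduce(Arrays.asList(partial1, partial2)));
    }
}
```

The totals are the same either way. Only the number of records crossing the network changes, which is where the speed-up would come from on real data.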

try {
    long txId = cassandra.persist(entity)
    sql.insert(some)
    sql.update(someElse)
    cassandra.commit(txId)
    sql.commit()
} catch (Exception e) {
    sql.rollback()
    cassandra.rollback(txId)
}

insert into CHANGES (key, commited, data)
values ('tx_id-58e0a7d7-eebc', 'false', ..)

update CHANGES set commited = 'true'
where key = 'tx_id-58e0a7d7-eebc'

delete from CHANGES
where key = 'tx_id-58e0a7d7-eebc'
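
Read together with the try/catch block, the three statements look like the persist, commit and rollback steps of a change log. The sketch below models that reading in memory. The mapping of statements to methods is an assumption, the class is hypothetical, and it uses a String transaction id to match the `tx_id-…` keys (the pseudocode above uses a long).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// In-memory model of the CHANGES table: a change is invisible until it is committed.
class ChangeLog {
    private static final class Change {
        final String data;
        boolean committed;
        Change(String data) { this.data = data; }
    }

    private final Map<String, Change> changes = new HashMap<>();

    // insert into CHANGES (key, commited, data) values (txId, 'false', ..)
    String persist(String data) {
        String txId = "tx_id-" + UUID.randomUUID();
        changes.put(txId, new Change(data));
        return txId;
    }

    // update CHANGES set commited = 'true' where key = txId
    void commit(String txId) {
        Change change = changes.get(txId);
        if (change == null) throw new IllegalStateException("unknown transaction " + txId);
        change.committed = true;
    }

    // delete from CHANGES where key = txId
    void rollback(String txId) {
        changes.remove(txId);
    }

    // Readers only see committed changes.
    String read(String txId) {
        Change change = changes.get(txId);
        return (change != null && change.committed) ? change.data : null;
    }

    public static void main(String[] args) {
        ChangeLog cassandra = new ChangeLog();
        String txId = cassandra.persist("entity");
        try {
            // ... the SQL inserts and updates would run here ...
            cassandra.commit(txId);
        } catch (RuntimeException e) {
            cassandra.rollback(txId);
        }
        System.out.println("visible after commit: " + cassandra.read(txId));
    }
}
```

This gives no atomicity across the two stores. It only ensures that readers skip rows whose flag was never set to true.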

I [image] numbers
non-production setup:
•  3 nodes (cassandra)
•  m1.medium EC2 instance
•  1 data center
•  1 app instance
real-time metrics update (sync):
›  average latency - 60 msec
›  process > 2,000 events per second
›  generate > 1000 reports per second
real-time metrics update (async):
›  process > 15,000 events per second
uploading to AWS S3: slow, but multi-threading helps *
it is more than enough, but what if …

›  distributed systems force you to make decisions
›  systems like Cassandra trade speed for Consistency
›  CAP theorem is oversimplified
›  you have many more options
›  polyglot persistence can make this world a better place
›  do not try to hammer every nail with the same hammer

›  Cassandra – great for time series data and heavy-write workload…
›  ... but use cases should be clearly defined

›  Amazon S3 – is great
›  simple, slow, but predictable storage
›  Amazon EMR
›  integration with S3 – great
›  very good API, but …
›  … isn’t a magic trick and requires knowledge about Hadoop and skills for effective usage

ayazovskiy@thumbtack.net
@yazovsky
www.linkedin.com/in/yazovsky

http://www.thumbtack.net
http://thumbtack.net/whitepapers
