Polyglot Persistence in the Real World

Anton Yazovskiy
Thumbtack Technology
›  Software Engineer at Thumbtack Technology
›  an active user of various NoSQL solutions
›  consulting with focus on scalability
›  a significant part of my work is advising people on which solutions to use and why
›  big fan of Big Data and clouds
›  NoSQL – not a silver bullet
›  Choices that we make
›  Cassandra: operational workload
›  Cassandra: analytical workload
›  The best of both worlds
›  Some benchmarks
›  Conclusions
•  well-known ways to scale:
•  scale in/out, scale by function, data denormalization
•  really works
•  each has disadvantages
•  mostly manual process (NewSQL)

http://qsec.deviantart.com
›  solve exactly this kind of problem
›  rapid application development
›  aggregate
›  schema flexibility
›  auto-scale-out
›  auto-failover
›  amount of data it can handle
›  shared-nothing architecture, no SPOF
›  performance
›  splendors and miseries of aggregate
›  CAP theorem dilemma
Consistency, Availability, Partition Tolerance
[Diagram: Analytical and Operational workloads placed against Consistency, Availability, Performance and Reliability]

I want it all
Cassandra (released by Facebook in 2008)
›  elastic scalability & linear performance *
›  dynamic schema
›  very high write throughput
›  tunable per-request consistency
›  fault-tolerant design
›  multiple datacenter and cloud readiness
›  CAS (compare-and-set) transaction support *
* http://www.datastax.com/what-we-offer/products-services/datastax-enterprise/apache-cassandra
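
The "tunable per-request consistency" bullet can be illustrated with the usual replica-overlap rule: with replication factor N, a read that contacts R replicas must overlap a write acknowledged by W replicas whenever R + W > N. The sketch below only works through that arithmetic. The class and method names are hypothetical and it does not call any Cassandra API.

```java
// Hypothetical helper for illustration; not part of any Cassandra driver.
final class ConsistencyMath {
    private ConsistencyMath() {}

    // Quorum size for a given replication factor: floor(n / 2) + 1.
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    // True when every set of read replicas must share at least one node
    // with every set of write replicas (R + W > N).
    static boolean readsSeeLatestWrite(int replicationFactor, int writeReplicas, int readReplicas) {
        return writeReplicas + readReplicas > replicationFactor;
    }

    public static void main(String[] args) {
        int n = 3;
        int q = quorum(n);
        System.out.println("quorum for N=3: " + q);
        System.out.println("QUORUM write + QUORUM read overlap: " + readsSeeLatestWrite(n, q, q));
        System.out.println("ONE write + ONE read overlap: " + readsSeeLatestWrite(n, 1, 1));
    }
}
```

Choosing smaller R and W per request buys latency and availability at the cost of possibly stale reads, which is the tradeoff the following slides return to.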
›  Large data set on commodity hardware
›  Tradeoff between speed and reliability
›  Heavy-write workload
›  Time-series data

http://www.datastax.com/what-we-offer/products-services/datastax-enterprise/apache-cassandra
[Diagram: Cassandra placed among Analytical, Operational, Performance and Reliability]

Small demo after this slide
TIMESTAMP   FIELD 1   …
12344567    DATA
12326346    DATA
13124124    DATA
13237457    DATA
13627236    DATA
(rows are distributed across SERVER 1 and SERVER 2)

select * from table
where timestamp > 12344567
and timestamp < 13237457

range queries across cluster:
›  expensive
›  unless sharded by timestamp
›  which becomes a bottleneck for heavy-write workload
›  all columns are sorted by name
›  row – aggregate item (never sharded)

Column Family
row key 1 | column 1  | column 2  | column 3  | .. | column N
          | value 1.1 | value 1.2 | value 1.3 | .. | value 1.N
row key 2 | column 1  | column 2  | ...       | column M
          | value 2.1 | value 2.2 | …         | value 2.M

Queries: get key, get slice, get range
+ combinations of these queries
+ composite columns

Super columns are discouraged and omitted here
›  all columns are sorted by name
›  row – aggregate item (never sharded)
get_slice(row_key, from, to, count)

[Diagram: row keys 1 to 5 distributed across SERVER 1 and SERVER 2; each row holds a run of columns named by timestamp]

get_slice(“row key 1”, from:“timestamp 1”, null, 11)

Next page:
get_slice(“row key 1”, from:“timestamp 11”, null, 11)
Prev. page:
get_slice(“row key 1”, null, to:“timestamp 11”, 11)
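
The forward paging calls above can be sketched in memory. This is a minimal illustration, assuming the count of 11 means "a page of 10 plus one look-ahead column that marks where the next page starts", which is one plausible reading of the slide. `SliceRow` and `nextPageStart` are hypothetical names, the row is a plain sorted map rather than a Cassandra client, and only ascending slices are modelled.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// In-memory stand-in for a single row: column names (timestamps) stay sorted.
final class SliceRow {
    private final NavigableMap<Long, String> columns = new TreeMap<>();

    void put(long timestamp, String value) {
        columns.put(timestamp, value);
    }

    // Mimics get_slice(row_key, from, to, count) for this one row: returns up to
    // `count` column names within [from, to] in ascending order.
    // A null bound is treated as open, as in the calls on the slide.
    List<Long> getSlice(Long from, Long to, int count) {
        if (from != null && to != null && from > to) {
            return Collections.emptyList();
        }
        NavigableMap<Long, String> view = columns;
        if (from != null) {
            view = view.tailMap(from, true);
        }
        if (to != null) {
            view = view.headMap(to, true);
        }
        List<Long> names = new ArrayList<>();
        for (Long name : view.keySet()) {
            if (names.size() >= count) {
                break;
            }
            names.add(name);
        }
        return names;
    }

    // With pageSize + 1 columns requested, the extra column (if any) is where
    // the next page starts; null means there is no further page.
    static Long nextPageStart(List<Long> fetched, int pageSize) {
        return fetched.size() > pageSize ? fetched.get(pageSize) : null;
    }
}
```

Because the columns of one row are sorted and live on one node, each page is a single contiguous read and no offset scan is needed.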
›  Time-range with filter:
›  “get all events for User J from N to M”
›  “get all success events for User J from N to M”
›  “get all events for all users from N to M”

Rows that serve these queries (row key: columns sorted by timestamp):
events::success::User_123 | timestamp 1 = value 1
events::success           | timestamp 1 = value 1
events::User_123          | timestamp 1 = value 1
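
A minimal sketch of this layout, assuming each event is written once per row key so that every time-range query reads one row. The store is an in-memory map, not a Cassandra client. `EventStore`, `rowKeysFor`, `record` and `range` are hypothetical names, and generalizing "success" to an arbitrary status value is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

final class EventStore {
    // row key -> columns sorted by timestamp
    private final Map<String, NavigableMap<Long, String>> rows = new HashMap<>();

    // Row keys one event is written to, following the key layout on the slide.
    static List<String> rowKeysFor(String status, String userId) {
        return Arrays.asList(
                "events::" + status + "::" + userId, // one status, one user
                "events::" + status,                 // one status, all users
                "events::" + userId);                // all statuses, one user
    }

    // Denormalized write: the same column goes into every row that serves a query.
    // Two events with the same timestamp in one row would overwrite each other here.
    void record(long timestamp, String status, String userId, String value) {
        for (String rowKey : rowKeysFor(status, userId)) {
            rows.computeIfAbsent(rowKey, k -> new TreeMap<>()).put(timestamp, value);
        }
    }

    // Time-range read from a single row, bounds inclusive.
    List<String> range(String rowKey, long from, long to) {
        NavigableMap<Long, String> row = rows.get(rowKey);
        if (row == null || from > to) {
            return Collections.emptyList();
        }
        return new ArrayList<>(row.subMap(from, true, to, true).values());
    }
}
```

The cost of this design is write amplification: every event becomes several writes, which suits a store built for heavy-write workloads.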
  
›  Counters:
›  “get # of events for User J grouped by hour”
›  “get # of events for User J grouped by day”

row key                   | 1380400000 | 1380403600
events::success::User_123 | 14         | 42
events::User_123          | 842        | 1024

(group by day – same but in different column family for TTL support)
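
The hourly counters can be sketched as follows. This is an illustration under assumptions: column names are taken to be the start of the hour in epoch seconds (the slide only shows two values 3600 seconds apart), and the in-memory maps stand in for a counter column family. `HourlyCounters` and its methods are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

final class HourlyCounters {
    private static final long HOUR_SECONDS = 3600L;

    // row key -> (start of hour in epoch seconds -> number of events)
    private final Map<String, TreeMap<Long, Long>> rows = new HashMap<>();

    // Rounds a timestamp down to the start of its hour.
    static long hourBucket(long epochSeconds) {
        return Math.floorDiv(epochSeconds, HOUR_SECONDS) * HOUR_SECONDS;
    }

    // Adds one to the counter column for the hour the event falls into.
    void increment(String rowKey, long epochSeconds) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>())
            .merge(hourBucket(epochSeconds), 1L, Long::sum);
    }

    // Reads the counter for the hour containing the given timestamp.
    long count(String rowKey, long epochSeconds) {
        TreeMap<Long, Long> row = rows.get(rowKey);
        if (row == null) {
            return 0L;
        }
        return row.getOrDefault(hourBucket(epochSeconds), 0L);
    }
}
```

A per-day variant would use the same code with a bucket of 86400 seconds.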
›  row key should consist of a combination of fields with high cardinality of values:
›  name, id, etc.
›  boolean values are a bad option
›  composite columns – a good option for them
›  timestamp may help to spread historical data
›  otherwise, scalability will not be linear
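
One way to read "timestamp may help to spread historical data" is to append a time bucket to the row key, so that one logical series is split into many rows that can land on different nodes. The sketch below assumes a day-number suffix, which is a hypothetical convention and not taken from the slides. `BucketedRowKey` and its methods are likewise hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class BucketedRowKey {
    private static final long DAY_SECONDS = 86_400L;

    private BucketedRowKey() {}

    // Row key for the day that contains the given timestamp.
    static String rowKey(String baseKey, long epochSeconds) {
        return baseKey + "::" + Math.floorDiv(epochSeconds, DAY_SECONDS);
    }

    // All row keys that have to be read to cover [fromSeconds, toSeconds].
    static List<String> rowKeysCovering(String baseKey, long fromSeconds, long toSeconds) {
        List<String> keys = new ArrayList<>();
        long firstDay = Math.floorDiv(fromSeconds, DAY_SECONDS);
        long lastDay = Math.floorDiv(toSeconds, DAY_SECONDS);
        for (long day = firstDay; day <= lastDay; day++) {
            keys.add(baseKey + "::" + day);
        }
        return keys;
    }
}
```

The bucket size is a tuning choice: smaller buckets spread data more evenly but make a long time-range query touch more rows.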
In theory – possible in real-time
›  average, 3-dimensional filters, group by, etc.
But:
›  hard to tune data model
›  lack of aggregation options
›  aggregation over historical data
“I want interactive reports”

[Diagram labels: Cassandra; “Auto update somehow”]

“Reports could be a little bit out of date, but I want to control this delay value”
›  Impact on production system
or
›  Higher total cost of ownership
›  Difficulties with scalability
›  hard to support with multiple clusters
http://www.datastax.com/docs/0.7/map_reduce/hadoop_mr
Tues 115pm cassandra + s3 + hadoop = quick auditing and analytics_yazovskiy
http://aws.amazon.com
›  Hadoop tech stack
›  Automatic deployment
›  Management API
›  Temporary (transient) cluster
›  Amazon S3 as data storage *

* copy from S3 to EMR HDFS and back
JobFlowInstancesConfig instances = ..
instances.setHadoopVersion(..)
instances.setInstanceCount(dataNodeCount + 1)
instances.setMasterInstanceType(..)
instances.setSlaveInstanceType(..)
RunJobFlowRequest req = ..(name, instances)
req.withSteps(new StepConfig(name, jar))
AmazonElasticMapReduce emr = ..
emr.runJobFlow(req)
Execute job on running cluster:
StepConfig stepConfig = new StepConfig(name, jar)
AddJobFlowStepsRequest addReq = …
addReq.setJobFlowId(jobFlowId)
addReq.setSteps(Arrays.asList(stepConfig))
AmazonElasticMapReduce emr = ..
emr.addJobFlowSteps(addReq)
›  cluster lifecycle: long-running or transient
›  cold start = ~20 min
›  tradeoff: cluster cost vs. availability
›  compression and Combiner tuning may speed up jobs considerably
›  common problems for all big data processing tools: monitoring, testability and debugging (MRUnit, local Hadoop, smaller data set)
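// txId is declared outside try so that the catch block can see it
Long txId = null;
try {
    txId = cassandra.persist(entity);
    sql.insert(some);
    sql.update(someElse);
    cassandra.commit(txId);
    sql.commit();
} catch (Exception e) {
    sql.rollback();
    if (txId != null) {
        cassandra.rollback(txId);
    }
}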
insert into CHANGES (key, commited, data)
values ('tx_id-58e0a7d7-eebc', 'false', ..)
update CHANGES set commited = 'true'
where key = 'tx_id-58e0a7d7-eebc'
delete from CHANGES
where key = 'tx_id-58e0a7d7-eebc'
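
The three statements above can be read as the lifecycle of one staged change: written as uncommitted, then either flagged as committed or deleted. The sketch below models that reading in memory. The mapping of insert, update and delete to `persist`, `commit` and `rollback` is an assumption, as are the `ChangeLog` class, the String transaction id and the `pending` helper. It is not the implementation behind the slides.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

// In-memory stand-in for the CHANGES table.
final class ChangeLog {
    private static final class Change {
        final String data;
        boolean committed;

        Change(String data) {
            this.data = data;
        }
    }

    private final Map<String, Change> changes = new LinkedHashMap<>();

    // Like: insert into CHANGES (key, commited, data) values (<txId>, 'false', <data>)
    String persist(String data) {
        String txId = "tx_id-" + UUID.randomUUID();
        changes.put(txId, new Change(data));
        return txId;
    }

    // Like: update CHANGES set commited = 'true' where key = <txId>
    void commit(String txId) {
        Change change = changes.get(txId);
        if (change == null) {
            throw new IllegalArgumentException("unknown transaction: " + txId);
        }
        change.committed = true;
    }

    // Like: delete from CHANGES where key = <txId>
    void rollback(String txId) {
        changes.remove(txId);
    }

    boolean isCommitted(String txId) {
        Change change = changes.get(txId);
        return change != null && change.committed;
    }

    // Entries still flagged as uncommitted, e.g. left behind by a crashed writer.
    List<String> pending() {
        List<String> ids = new ArrayList<>();
        for (Map.Entry<String, Change> entry : changes.entrySet()) {
            if (!entry.getValue().committed) {
                ids.add(entry.getKey());
            }
        }
        return ids;
    }
}
```

Readers that ignore uncommitted entries never see a change whose SQL side was rolled back, and a periodic sweep over `pending()` could clean up entries that were never resolved.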
I [image] numbers
non-production setup:
•  3 nodes (cassandra)
•  m1.medium EC2 instance
•  1 data center
•  1 app instance
real-time metrics update (sync):
›  average latency - 60 msec
›  process > 2,000 events per second
›  generate > 1000 reports per second
real-time metrics update (async):
›  process > 15,000 events per second
uploading to AWS S3: slow, but multi-threading helps *
it is more than enough, but what if …
›  distributed systems force you to make decisions
›  systems like Cassandra trade consistency for speed
›  CAP theorem is oversimplified
›  you have many more options
›  polyglot persistence can make this world a better place
›  do not try to hammer every nail with the same hammer
›  Cassandra – great for time series data and heavy-write workload…
›  ... but use cases should be clearly defined
›  Amazon S3 – is great
›  simple, slow, but predictable storage
›  Amazon EMR
›  integration with S3 – great
›  very good API, but …
›  … isn’t a magic trick and requires knowledge of Hadoop and skills for effective usage

ayazovskiy@thumbtack.net
@yazovsky
www.linkedin.com/in/yazovsky

http://www.thumbtack.net
http://thumbtack.net/whitepapers
