YCSB
Yahoo! Cloud Serving Benchmark
Scalable Distributed Systems
Antonio L. Severien
antonio.severien@gmail.com
João Rosa
Joao.rui.rosa@gmail.com
Overview
• Distributed Databases
• Cassandra
• HBase
• YCSB General View
• YCSB Details
• Amazon EC2
• YCSB Results
• YCSB Future
• Conclusions
• References
Distributed Databases
Traditional RDBMS
• ACID transactions
• Query language (SQL)
• Data tied to the modeling (hard to analyze)
• Scalable to a limit
Distributed Databases
• Not ACID
• Not Relational
• Column oriented (key-value)
• CAP (Consistency, Availability, Partition tolerance)
• Big Data (Massively scalable)
Distributed Databases
• Sherpa/PNUTS
• BigTable
• HBase, Hypertable, HTable
• Megastore
• Azure
• Cassandra
• Amazon Web Services
• S3, SimpleDB, EBS
• CouchDB
• Voldemort
• Dynomite
• Tokyo
• Redis
• MongoDB
Distributed Databases
• NoSQL databases differ in design and architecture
• Cassandra: Thrift, Gossip, Token ring, …
• HBase: HDFS, ZooKeeper, Hadoop (MapReduce)
• BigTable: GFS, Chubby (lock service), MapReduce
Cassandra
• Highlights
• High availability
• Incremental scalability
• Eventually consistent
• Tradeoffs between consistency and latency
• Minimal administration
• No SPOF (Single Point of Failure)
Cassandra
• CAP-aware
• Cassandra values Availability and Partition tolerance (AP), hence eventually consistent
• Providing strong Consistency in Cassandra increases latency
• Partitioning
• Token oriented
• Explicit Replication
• Replication factor ≤ Total nodes
• High level clients
• Python, Java, C#, .NET, Scala, Ruby, PHP, Erlang, Haskell, etc.
• Thrift: driver-level interface
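The token-oriented partitioning and explicit replication listed above can be sketched as a small hash ring. This is an illustrative model only, not Cassandra's implementation: the `TokenRing` class, the MD5-based `token_for` function, and the node names are all assumptions made for the example.

```python
import bisect
import hashlib


def token_for(key: str) -> int:
    """Map a key to a ring position (MD5 is an assumption of this sketch)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class TokenRing:
    """Hypothetical model of a token ring with explicit replication."""

    def __init__(self, nodes, replication_factor):
        # Mirrors the slide's constraint: replication factor <= total nodes.
        if not 1 <= replication_factor <= len(nodes):
            raise ValueError("replication factor must be between 1 and the node count")
        self.replication_factor = replication_factor
        # Each node owns one token; the ring is the sorted list of (token, node).
        self.ring = sorted((token_for(node), node) for node in nodes)
        self.tokens = [token for token, _ in self.ring]

    def replicas(self, key):
        """Return the first node at or after the key's token, then its successors."""
        start = bisect.bisect_left(self.tokens, token_for(key)) % len(self.ring)
        return [
            self.ring[(start + i) % len(self.ring)][1]
            for i in range(self.replication_factor)
        ]


ring = TokenRing(["node1", "node2", "node3"], replication_factor=2)
owners = ring.replicas("user1000")
print(owners)
```

Adding a node only changes ownership for the token range next to it, which is one way to read the "incremental scalability" highlight.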
Cassandra
• Data Model
• Cluster:
• Machines (nodes) in a logical Cassandra instance
• Can contain multiple keyspaces
• Keyspace:
• Namespace for ColumnFamilies
• ColumnFamilies:
• Contain multiple columns, each with a name, value and timestamp, referenced by row keys
• Analogous to a table in an RDBMS
• SuperColumns:
• Columns with subcolumns
• Rows
• Columns
keyA: Column1, Column2, Column3
keyB: Column5, Column6, Column10
Column: Byte[] Name, Byte[] Value, I64 Timestamp
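A minimal sketch of this data model as nested Python dictionaries may help. The `Column` fields follow the slide (name, value, timestamp); the `write_column` helper, the "Users" column family name, and the rule that the higher timestamp wins on a name clash are assumptions of the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: bytes       # Byte[] Name
    value: bytes      # Byte[] Value
    timestamp: int    # I64 Timestamp


def write_column(keyspace, column_family, row_key, column):
    """Store a column under keyspace[column_family][row_key][column.name].

    Assumption for this sketch: on a name clash the column with the
    higher timestamp is kept.
    """
    row = keyspace.setdefault(column_family, {}).setdefault(row_key, {})
    existing = row.get(column.name)
    if existing is None or column.timestamp >= existing.timestamp:
        row[column.name] = column


keyspace = {}
write_column(keyspace, "Users", b"keyA", Column(b"Column1", b"old", 1))
write_column(keyspace, "Users", b"keyA", Column(b"Column1", b"new", 2))
write_column(keyspace, "Users", b"keyA", Column(b"Column2", b"x", 1))
write_column(keyspace, "Users", b"keyB", Column(b"Column5", b"y", 1))
```

Rows in the same column family do not need to share column names, which matches the keyA and keyB rows shown on the slide.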
Cassandra
Partitioning Replication
HBase
“HBase is more a datastore than a database”
• It lacks many of the features of RDBMS
• Distributed and scalable big data store.
• Regions model
• Strong consistency
HBase
Built on top of Hadoop Distributed Filesystem (HDFS)
HBase
• The NameNode is responsible for maintaining the filesystem metadata.
• The DataNodes are responsible for storing HDFS blocks.
Note: In our case study, we were only interested in the HDFS layer.
HBase
[Figure: diagram with labels "Namenode" and "Datanodes"]
HBase
• Data is stored in HBase tables.
• Tables are made of rows and columns.
• All columns belong to a particular column family.
Important note: All column family members are stored together.
• A query confined to one column family performs better
YCSB General View
• Which is the best NoSQL DB?
• How to compare?
• Yahoo! Cloud Serving Benchmark (YCSB)
• Benchmarking tool
• Evaluates the performance of key-value and cloud DBs on a common set of workloads
• Client: an extensible workload generator
• Yahoo! Research
• Brian F. Cooper - cooperb@yahoo-inc.com
• Joint work with Adam Silberstein, Erwin Tam, Raghu Ramakrishnan and Russell Sears
YCSB Details
• How does it work?
YCSB Client
• Workload Executor
• Client Threads
• DB Interface Layer
• Statistics
Cloud Serving Store
Workload file
• Read/write mix
• Record size
• Popularity distribution
• …
Command line
• DB to use
• Workload to use
• Target throughput
• Number of threads
• …
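The client architecture above can be approximated in a few lines. The following is a rough sketch, not YCSB's code: `InMemoryDB` stands in for the DB interface layer, `run_workload` plays the workload executor, and keys are picked uniformly for simplicity. All names and parameter values here are assumptions.

```python
import random
import threading
import time


class InMemoryDB:
    """Stand-in for the DB interface layer (a real binding would call the store)."""

    def __init__(self):
        self.data = {}
        self.lock = threading.Lock()

    def read(self, key):
        with self.lock:
            return self.data.get(key)

    def update(self, key, value):
        with self.lock:
            self.data[key] = value


def run_workload(db, operation_count, threads, read_proportion,
                 record_count, target_ops_per_sec=None):
    """Split operations across client threads and collect latencies in ms."""
    latencies = {"READ": [], "UPDATE": []}
    stats_lock = threading.Lock()
    per_thread = operation_count // threads
    # Each thread gets an equal share of the target throughput.
    interval = threads / target_ops_per_sec if target_ops_per_sec else 0.0

    def client(seed):
        rng = random.Random(seed)
        for _ in range(per_thread):
            key = f"user{rng.randrange(record_count)}"
            op = "READ" if rng.random() < read_proportion else "UPDATE"
            start = time.perf_counter()
            if op == "READ":
                db.read(key)
            else:
                db.update(key, "x" * 100)
            elapsed = time.perf_counter() - start
            with stats_lock:
                latencies[op].append(elapsed * 1000.0)
            if interval > elapsed:
                time.sleep(interval - elapsed)  # throttle to the target rate

    started = time.perf_counter()
    workers = [threading.Thread(target=client, args=(i,)) for i in range(threads)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    runtime = time.perf_counter() - started
    total = sum(len(values) for values in latencies.values())
    return {"runtime_s": runtime, "throughput": total / runtime,
            "latencies_ms": latencies}


result = run_workload(InMemoryDB(), operation_count=1000, threads=10,
                      read_proportion=0.5, record_count=1000)
```

Keeping the store behind a two-method interface is what makes the generator reusable across databases: only the `read` and `update` bodies would change per store.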
YCSB Details
Benchmark Tiers
• Performance
• Measure latency/throughput curve
• Increase throughput until saturation
• Scalability
• Scale up: increase hardware, data size and throughput proportionally
• Elastic speedup: add servers while running a workload
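One way to trace the latency/throughput curve of the performance tier is to repeat the run phase with a rising target throughput. The sketch below only builds the command lines and reuses the flags that appear elsewhere in this deck (`-P`, `-s`, `-threads`, `-target`). The target values, the `sweep_commands` and `saturation_point` helpers, and the 90% tolerance are assumptions for illustration.

```python
def sweep_commands(binding, workload, threads, targets):
    """Build one `ycsb run` command per target throughput (ops/sec)."""
    return [
        ["./bin/ycsb", "run", binding, "-P", workload, "-s",
         "-threads", str(threads), "-target", str(target)]
        for target in targets
    ]


def saturation_point(measurements, tolerance=0.9):
    """Return the first target whose achieved throughput falls short.

    `measurements` is a list of (target, achieved) pairs in ops/sec.
    Returns None if every run reached at least tolerance * target.
    """
    for target, achieved in sorted(measurements):
        if achieved < tolerance * target:
            return target
    return None


commands = sweep_commands("cassandra-10", "workloads/workloada", 10,
                          [1000, 2000, 3000, 4000, 5000, 6000])
for command in commands:
    print(" ".join(command))
```

Each run's reported throughput and average latency give one point on the curve; `saturation_point` marks where raising the target stops raising the achieved throughput.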
YCSB Details
Load phase
- Loads the database
$ ycsb load cassandra-10 -p hosts=127.0.0.1 -P workloadX
Transactions phase
- Executes the workload
$ ycsb run cassandra-10 -p hosts=127.0.0.1 -P workloadX
Random Load Distribution
YCSB Details
# Yahoo! Cloud System Benchmark
# Workload A: Update heavy workload
# Application example: Session store recording recent actions
#
# Read/update ratio: 50/50
# Default data size: 1 KB records (10 fields, 100 bytes each, plus key)
# Request distribution: zipfian
recordcount=1000
operationcount=1000
workload=com.yahoo.ycsb.workloads.CoreWorkload
readallfields=true
readproportion=0.5
updateproportion=0.5
scanproportion=0
insertproportion=0
requestdistribution=zipfian
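To make the meaning of these properties concrete, the sketch below parses a workload file and draws operations from it: the operation type follows the configured proportions and the key follows a Zipf-like popularity distribution. It is a simplified illustration, not the CoreWorkload implementation. The helper names, the `user<N>` key format, and the skew value `theta=0.99` are assumptions.

```python
import itertools
import random

WORKLOAD_A = """\
# Workload A: Update heavy workload
recordcount=1000
operationcount=1000
readproportion=0.5
updateproportion=0.5
scanproportion=0
insertproportion=0
requestdistribution=zipfian
"""


def parse_properties(text):
    """Parse key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def zipfian_cum_weights(n, theta=0.99):
    """Cumulative weights where rank i has weight 1 / i**theta (theta assumed)."""
    return list(itertools.accumulate(1.0 / rank ** theta for rank in range(1, n + 1)))


def generate_operations(props, seed=0):
    """Yield (operation, key) pairs according to the workload properties."""
    rng = random.Random(seed)
    ops = ["READ", "UPDATE", "SCAN", "INSERT"]
    mix = [float(props.get(name, 0)) for name in
           ("readproportion", "updateproportion", "scanproportion", "insertproportion")]
    record_count = int(props["recordcount"])
    cum_weights = zipfian_cum_weights(record_count)
    for _ in range(int(props["operationcount"])):
        op = rng.choices(ops, weights=mix)[0]
        key = rng.choices(range(record_count), cum_weights=cum_weights)[0]
        yield op, f"user{key}"


operations = list(generate_operations(parse_properties(WORKLOAD_A)))
```

With these settings roughly half of the generated operations are reads and half are updates, and a small set of keys receives most of the requests.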
YCSB Details
• Execution parameters
$ ./bin/ycsb run cassandra-10 -P workloads/workloada -s -threads 10 -target 100 > transactions.dat
[OVERALL],RunTime(ms), 10110
[OVERALL],Throughput(ops/sec), 98.91196834817013
[UPDATE], Operations, 491
[UPDATE], AverageLatency(ms), 0.054989816700611
[UPDATE], MinLatency(ms), 0
[UPDATE], MaxLatency(ms), 1
[UPDATE], 95thPercentileLatency(ms), 1
[UPDATE], 99thPercentileLatency(ms), 1
[UPDATE], Return=0, 491
[UPDATE], 0, 464
[UPDATE], 1, 27
[UPDATE], 2, 0
[UPDATE], 3, 0
[UPDATE], 4, 0
...
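Output in this shape is easy to post-process. As a small sketch, the parser below groups the `[SECTION], metric, value` lines into a nested dictionary. The function name `parse_ycsb_output` is an assumption, and the sample is a subset of the lines shown above.

```python
def parse_ycsb_output(text):
    """Group `[SECTION], metric, value` lines into {section: {metric: value}}."""
    results = {}
    for line in text.splitlines():
        parts = [part.strip() for part in line.split(",")]
        if len(parts) != 3 or not parts[0].startswith("["):
            continue  # skip status lines, blanks and elisions such as "..."
        section, metric, value = parts
        results.setdefault(section.strip("[]"), {})[metric] = float(value)
    return results


sample = """\
[OVERALL],RunTime(ms), 10110
[OVERALL],Throughput(ops/sec), 98.91196834817013
[UPDATE], Operations, 491
[UPDATE], AverageLatency(ms), 0.054989816700611
[UPDATE], 0, 464
[UPDATE], 1, 27
"""
parsed = parse_ycsb_output(sample)
```

In this sample the numeric metric names are histogram buckets in milliseconds, and the two buckets (464 and 27) add up to the 491 update operations reported.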
YCSB Details
$ ./bin/ycsb run basic -P workloads/workloada -P large.dat -s -threads 10 -target 100 -p measurementtype=timeseries -p timeseries.granularity=2000 > transactions.dat
[OVERALL],RunTime(ms), 10077
[OVERALL],Throughput(ops/sec), 9923.58836955443
[UPDATE], Operations, 50396
[UPDATE], AverageLatency(ms), 0.04339630129375347
[UPDATE], MinLatency(ms), 0
[UPDATE], MaxLatency(ms), 338
[UPDATE], Return=0, 50396
[UPDATE], 0, 0.10264765784114054
[UPDATE], 2000, 0.026989343690867442
[UPDATE], 4000, 0.0352882703777336
[UPDATE], 6000, 0.004238958990536277
[UPDATE], 8000, 0.052813085033008175
[UPDATE], 10000, 0.0
[READ], Operations, 49604
[READ], AverageLatency(ms), 0.038242883638416256
[READ], MinLatency(ms), 0
[READ], MaxLatency(ms), 230
[READ], Return=0, 49604
[READ], 0, 0.08997245741099663
[READ], 2000, 0.02207505518763797
[READ], 4000, 0.03188493260913297
[READ], 6000, 0.004869141813755326
[READ], 8000, 0.04355329949238579
[READ], 10000, 0.005405405405405406
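As a quick consistency check on this output, the overall throughput should equal the total number of operations divided by the run time. The short calculation below uses only the numbers printed above; the variable names are illustrative.

```python
# Values copied from the output above.
runtime_ms = 10077
update_operations = 50396
read_operations = 49604
reported_throughput = 9923.58836955443  # ops/sec

total_operations = update_operations + read_operations
computed_throughput = total_operations / (runtime_ms / 1000.0)

print(total_operations)
print(computed_throughput)
```

The two operation counts sum to 100,000, and dividing by 10.077 seconds reproduces the reported figure of about 9,923.59 ops/sec.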
YCSB Details
Status Output
Amazon EC2 Configuration
Large Instance
7.5 GB memory
4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units each)
850 GB instance storage
64-bit platform
I/O Performance: High
API name: m1.large
Experiment Set-up
Cassandra Cluster
3 nodes + 1 node (Elasticity)
HBase Cluster
3 nodes
Amazon EC2 Usage
Cassandra
Load phase: 60,000,000 records of 1 KB
Amazon EC2 Usage
HBase
Load phase: 60,000,000 records of 1 KB
Amazon EC2 Usage
Load phase: 60,000,000 records of 1 KB
Cassandra and HBase
Amazon EC2 Usage
Transaction phase:
- 10,000 records
- 1,000,000 operations
- 250 threads
Cassandra
YCSB Cassandra Results
Update Heavy Workload (50/50)
[Figure: Update: average latency (ms, axis 0 to 60) vs. throughput (ops/sec, axis 0 to 6,000)]
[Figure: Read: average latency (ms, axis 0 to 60) vs. throughput (ops/sec, axis 0 to 6,000)]
YCSB HBase Results
[Figure: Update, HBase 0.90.5: average latency (ms, axis 0.00 to 0.30) vs. throughput (ops/sec); throughput points: 471.15, 485, 492.38, 507.17, 562.33, 620.04, 634.82, 734.32, 845.15]
[Figure: Read, HBase 0.90.5: average latency (ms, axis 0.00 to 1200.00) vs. throughput (ops/sec); same throughput points]
YCSB Cassandra Results
[Figure: Elasticity, Cassandra 1.0: latency (ms, axis 0 to 80,000) vs. time (ms, axis 0 to 400,000)]
YCSB Future
Provide statistics for:
- Availability
- Replication
Additional Distributed Databases
Currently supported:
- Cassandra
- MongoDB
- Voldemort
- HBase
- MapKeeper
- Redis
- VMware vFabric GemFire
Conclusions
• YCSB provides a common ground for benchmarking cloud DB services
• Good for learning and experimenting with different distributed databases
• Open source, extensible for new databases
• Laboratory with Amazon EC2 provided good insight into setting up cloud services
• Challenges
• Installation problems
• Hard-to-follow documentation
• Working in a distributed environment requires a lot of configuration
References
• YCSB (Yahoo! Cloud Serving Benchmark)
• https://github.com/brianfrankcooper/YCSB/wiki
• Yahoo! Research
• http://research.yahoo.com/Web_Information_Management/YCSB
• BigTable
• http://en.wikipedia.org/wiki/BigTable
• Cassandra
• http://wiki.apache.org/cassandra/
• HBase
• http://hbase.apache.org/
Questions

NoSQL: Cassandra vs. HBase
