NoSQL: Cassandra vs. HBase

YCSB
Yahoo! Cloud Serving Benchmark
Scalable Distributed Systems
Antonio L. Severien
antonio.severien@gmail.com
João Rosa
Joao.rui.rosa@gmail.com

Overview
• Distributed Databases
• Cassandra
• HBase
• YCSB General View
• YCSB Details
• Amazon EC2
• YCSB Results
• YCSB Future
• Conclusions
• References

Distributed Databases
Traditional RDBMS
• ACID transactions
• Query language (SQL)
• Data tied to the modeling (hard to analyze)
• Scalable to a limit
NoSQL
• Not ACID
• Not Relational
• Column oriented (key-value)
• CAP (Consistency, Availability, Partition tolerance)
• Big Data (Massively scalable)

• Sherpa/PNUTS
• BigTable
• HBase, Hypertable, HTable
• Megastore
• Azure
• Cassandra
• Amazon Web Services
• S3, SimpleDB, EBS
• CouchDB
• Voldemort
• Dynomite
• Tokyo
• Redis
• MongoDB

• NoSQL Databases have different designs and architecture
Cassandra: Thrift, Gossip, Token ring, …
HBase: HDFS, ZooKeeper, Hadoop (MapReduce)
BigTable: GFS, Chubby (Lock Service), MapReduce

Cassandra
• Highlights
• High availability
• Incremental scalability
• Eventually consistent
• Tradeoffs between consistency and latency
• Minimal administration
• No SPOF (Single Point of Failure)
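
The tradeoff between consistency and latency can be illustrated with quorum arithmetic. The following is a minimal sketch, assuming N replicas per key, reads that wait for R replicas and writes that wait for W replicas (the function names are hypothetical, not part of any Cassandra client). A read is guaranteed to see the latest acknowledged write when R + W > N, and waiting for more replicas is what raises latency.

```python
def quorum(n_replicas: int) -> int:
    """Smallest majority of n_replicas."""
    return n_replicas // 2 + 1


def reads_see_latest_write(n_replicas: int, read_replicas: int, write_replicas: int) -> bool:
    """True when every read set must overlap every write set (R + W > N)."""
    for count in (read_replicas, write_replicas):
        if not 1 <= count <= n_replicas:
            raise ValueError("replica counts must be between 1 and n_replicas")
    return read_replicas + write_replicas > n_replicas


if __name__ == "__main__":
    n = 3
    # (R, W) pairs: fastest, quorum on both sides, write-to-all
    for r, w in [(1, 1), (quorum(n), quorum(n)), (1, n)]:
        print(f"N={n} R={r} W={w} overlap={reads_see_latest_write(n, r, w)}")
```

With R = W = 1 the store answers quickly but is only eventually consistent; quorum reads and writes trade some latency for overlap.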

Cassandra
• CAP-aware
• Cassandra favors Availability and Partition tolerance (AP), so it is eventually consistent
• Providing strong Consistency in Cassandra increases latency
• Partitioning
• Token oriented
• Explicit Replication
• Replication factor ≤ Total nodes
• High level clients
• Python, Java, C#, .NET, Scala, Ruby, PHP, Erlang, Haskell, etc.
• Thrift: driver-level interface
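
Token-oriented partitioning with explicit replication can be sketched as a ring lookup. This is a simplified illustration, not Cassandra's actual partitioner or replication strategy: the class name, the MD5-based token function and the ring size are assumptions made for the example. Each node owns one token, a key is placed on the first node whose token is greater than or equal to the key's token, and the remaining replicas are the next nodes clockwise.

```python
import bisect
import hashlib
from typing import Dict, List

RING_SIZE = 2 ** 127  # assumed token space for this sketch


def token_for(key: str) -> int:
    """Map a row key to a position on the ring (illustrative hash choice)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % RING_SIZE


class TokenRing:
    """Hypothetical ring where each node owns exactly one distinct token."""

    def __init__(self, node_tokens: Dict[str, int], replication_factor: int) -> None:
        # The replication factor cannot exceed the number of nodes.
        if not 1 <= replication_factor <= len(node_tokens):
            raise ValueError("replication factor must be between 1 and the node count")
        self.replication_factor = replication_factor
        self._ring = sorted((token, node) for node, token in node_tokens.items())
        self._tokens = [token for token, _ in self._ring]

    def replicas_for_token(self, token: int) -> List[str]:
        # First node with token >= the key token, wrapping around the ring.
        start = bisect.bisect_left(self._tokens, token) % len(self._ring)
        return [
            self._ring[(start + step) % len(self._ring)][1]
            for step in range(self.replication_factor)
        ]

    def replicas_for(self, key: str) -> List[str]:
        return self.replicas_for_token(token_for(key))


if __name__ == "__main__":
    ring = TokenRing({"node-a": 0, "node-b": RING_SIZE // 3, "node-c": 2 * RING_SIZE // 3}, 2)
    print(ring.replicas_for("user1"))
```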

Cassandra
• Data Model
• Cluster:
• Machines (nodes) in a logical Cassandra instance
• Can contain multiple keyspaces
• Keyspace:
• Namespace for ColumnFamilies
• ColumnFamilies:
• Contain multiple columns, each with a name, value and timestamp, referenced by row keys
• Analogous to a table in an RDBMS
• SuperColumns:
• Columns with subcolumns
• Rows and columns:
keyA: Column1, Column2, Column3
keyB: Column5, Column6, Column10
• Column structure:
Byte[] Name
Byte[] Value
I64 Timestamp
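
The data model above can be sketched in a few lines. This is an in-memory illustration only, with hypothetical class names: a column is a (name, value, timestamp) triple, a column family maps row keys to columns, and when two versions of a column collide the one with the newer timestamp is kept.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Column:
    name: bytes
    value: bytes
    timestamp: int  # 64-bit integer (I64) in the slide


class ColumnFamily:
    """Hypothetical in-memory column family: row key -> column name -> Column."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.rows: Dict[bytes, Dict[bytes, Column]] = {}

    def insert(self, row_key: bytes, column: Column) -> None:
        row = self.rows.setdefault(row_key, {})
        current = row.get(column.name)
        # Keep the version with the newest timestamp.
        if current is None or column.timestamp > current.timestamp:
            row[column.name] = column

    def get(self, row_key: bytes, column_name: bytes) -> Optional[Column]:
        return self.rows.get(row_key, {}).get(column_name)


if __name__ == "__main__":
    users = ColumnFamily("Users")
    users.insert(b"keyA", Column(b"Column1", b"old", timestamp=1))
    users.insert(b"keyA", Column(b"Column1", b"new", timestamp=2))
    print(users.get(b"keyA", b"Column1"))
```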

Cassandra
Partitioning and Replication

HBase
“HBase is more a datastore than a database”
• It lacks many of the features of RDBMS
• Distributed and scalable big data store.
• Regions model
• Strong consistency

HBase
Built on top of the Hadoop Distributed File System (HDFS)

HBase
• The NameNode is responsible for maintaining the filesystem metadata.
• The DataNodes are responsible for storing HDFS blocks.

Note: In our case study, we were only interested in the HDFS layer.

HBase
• Data is stored into HBase tables.
• Tables are made of rows and columns.
• All columns belong to a particular column family.
Important note: All column family members are stored together.
• A query confined to one column family performs better, because its members are stored together

YCSB General View
• Which is the best NoSQL DB?
• How to compare?
• Yahoo! Cloud Serving Benchmark (YCSB)
• Benchmarking tool
• Evaluates the performance of key-value and cloud DBs on a common set of workloads
• Client – an extensible workload generator
• Yahoo! Research
• Brian F. Cooper - cooperb@yahoo-inc.com
• Joint work with Adam Silberstein, Erwin Tam, Raghu Ramakrishnan and Russell Sears

YCSB Details
• How does it work?
YCSB Client: Workload Executor, Client Threads, Statistics, DB Interface Layer
Cloud Serving Store
Workload file
• Read/write mix
• Record size
• Popularity distribution
• …
Command line
• DB to use
• Workload to use
• Target throughput
• Number of threads
• …
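
The popularity distribution decides which records receive most of the requests. The sketch below is a simplified stand-in for a zipfian generator and is not YCSB's own implementation: the function names and the exponent value are assumptions. Record i (ranked from 1) is requested with a weight proportional to 1 / i^exponent, so a few records are very popular and the rest form a long tail.

```python
import random
from collections import Counter
from typing import List


def zipfian_weights(item_count: int, exponent: float = 0.99) -> List[float]:
    """Unnormalized popularity weight for each rank, most popular first."""
    return [1.0 / (rank ** exponent) for rank in range(1, item_count + 1)]


def sample_keys(item_count: int, sample_count: int, seed: int = 0) -> List[int]:
    """Draw record indexes so that low indexes are requested most often."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    weights = zipfian_weights(item_count)
    return rng.choices(range(item_count), weights=weights, k=sample_count)


if __name__ == "__main__":
    counts = Counter(sample_keys(item_count=10, sample_count=10_000))
    for key in range(10):
        print(key, counts[key])
```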

YCSB Details
Benchmark Tiers
• Performance
• Measure latency/throughput curve
• Increase throughput until saturation
• Scalability
• Scale up: increase hardware, data size and throughput proportionally
• Elastic speedup: add servers while running a workload
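
The performance tier can be sketched as a sweep over target throughputs. The helper names below are hypothetical, and the 90% threshold for calling a run saturated is an assumption of this example; only the command-line flags are taken from the YCSB invocations shown in these slides. The first function builds one `ycsb run` command per target, the second finds the first target that the store could not keep up with.

```python
from typing import List, Optional, Sequence, Tuple


def sweep_commands(db: str, workload: str, targets: Sequence[int], threads: int = 10) -> List[List[str]]:
    """One YCSB run command (as an argument list) per target throughput."""
    return [
        ["./bin/ycsb", "run", db, "-P", workload, "-s",
         "-threads", str(threads), "-target", str(target)]
        for target in targets
    ]


def saturation_point(measurements: Sequence[Tuple[int, float]], tolerance: float = 0.9) -> Optional[int]:
    """First target whose achieved throughput falls below tolerance * target."""
    for target, achieved in measurements:
        if achieved < tolerance * target:
            return target
    return None


if __name__ == "__main__":
    for command in sweep_commands("cassandra-10", "workloads/workloada", [100, 1000, 5000]):
        print(" ".join(command))
```

Each command would be run in turn and its reported throughput and latency recorded to draw the latency/throughput curve.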

YCSB Details
Load phase
- Load the database
$ ycsb load cassandra-10 -p hosts=127.0.0.1 -P workloadX
Transactions phase
- Executes the workload
$ ycsb run cassandra-10 -p hosts=127.0.0.1 -P workloadX
Random Load Distribution

YCSB Details
# Yahoo! Cloud System Benchmark
# Workload A: Update heavy workload
# Application example: Session store recording recent actions
#
# Read/update ratio: 50/50
# Default data size: 1 KB records (10 fields, 100 bytes each, plus key)
# Request distribution: zipfian
recordcount=1000
operationcount=1000
workload=com.yahoo.ycsb.workloads.CoreWorkload
readallfields=true
readproportion=0.5
updateproportion=0.5
scanproportion=0
insertproportion=0
requestdistribution=zipfian
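
A workload file like this one is a plain list of key=value properties with `#` comments. The sketch below reads such a file and extracts the operation mix. The function names are hypothetical, and the check that the proportions add up to 1 is an assumption of this example rather than a documented YCSB rule.

```python
from typing import Dict

WORKLOAD_A = """\
# Workload A: Update heavy workload
recordcount=1000
operationcount=1000
workload=com.yahoo.ycsb.workloads.CoreWorkload
readallfields=true
readproportion=0.5
updateproportion=0.5
scanproportion=0
insertproportion=0
requestdistribution=zipfian
"""


def parse_workload(text: str) -> Dict[str, str]:
    """Parse key=value lines, skipping blank lines and # comments."""
    props: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, separator, value = line.partition("=")
        if not separator:
            raise ValueError(f"not a key=value line: {line!r}")
        props[key.strip()] = value.strip()
    return props


def operation_mix(props: Dict[str, str]) -> Dict[str, float]:
    """Share of each operation type; assumed here to add up to 1."""
    names = ("read", "update", "scan", "insert")
    mix = {name: float(props.get(name + "proportion", "0")) for name in names}
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("operation proportions do not add up to 1")
    return mix


if __name__ == "__main__":
    print(operation_mix(parse_workload(WORKLOAD_A)))
```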

YCSB Details
• Execution parameters
$ ./bin/ycsb run cassandra-10 -P workloads/workloada -s -threads 10 -target 100 > transactions.dat
[OVERALL],RunTime(ms), 10110
[OVERALL],Throughput(ops/sec), 98.91196834817013
[UPDATE], Operations, 491
[UPDATE], AverageLatency(ms), 0.054989816700611
[UPDATE], MinLatency(ms), 0
[UPDATE], MaxLatency(ms), 1
[UPDATE], 95thPercentileLatency(ms), 1
[UPDATE], 99thPercentileLatency(ms), 1
[UPDATE], Return=0, 491
[UPDATE], 0, 464
[UPDATE], 1, 27
[UPDATE], 2, 0
[UPDATE], 3, 0
[UPDATE], 4, 0
...
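
The numbered `[UPDATE]` lines read as a latency histogram: 464 updates took 0 ms and 27 took 1 ms. The sketch below shows how the summary figures relate to those buckets. It is an illustration with hypothetical function names, not YCSB's own measurement code, but on this sample it gives an average of 27/491 ms and places both the 95th and 99th percentiles in the 1 ms bucket, in agreement with the reported values.

```python
from typing import Dict

# Latency in ms -> number of operations, copied from the [UPDATE] lines.
UPDATE_HISTOGRAM: Dict[int, int] = {0: 464, 1: 27, 2: 0, 3: 0, 4: 0}


def average_latency(histogram: Dict[int, int]) -> float:
    """Mean latency in ms, weighting each bucket by its operation count."""
    total = sum(histogram.values())
    if total == 0:
        raise ValueError("histogram has no operations")
    return sum(ms * count for ms, count in histogram.items()) / total


def percentile_latency(histogram: Dict[int, int], fraction: float) -> int:
    """Smallest bucket at which the cumulative share of operations reaches fraction."""
    total = sum(histogram.values())
    if total == 0:
        raise ValueError("histogram has no operations")
    seen = 0
    for ms in sorted(histogram):
        seen += histogram[ms]
        if seen / total >= fraction:
            return ms
    return max(histogram)


if __name__ == "__main__":
    print("average (ms):", average_latency(UPDATE_HISTOGRAM))
    print("95th percentile (ms):", percentile_latency(UPDATE_HISTOGRAM, 0.95))
    print("99th percentile (ms):", percentile_latency(UPDATE_HISTOGRAM, 0.99))
```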

YCSB Details
$ ./bin/ycsb run basic -P workloads/workloada -P large.dat -s -threads 10 -target 100 -p measurementtype=timeseries -p timeseries.granularity=2000 > transactions.dat
[OVERALL],RunTime(ms), 10077
[OVERALL],Throughput(ops/sec), 9923.58836955443
[UPDATE], Operations, 50396
[UPDATE], AverageLatency(ms), 0.04339630129375347
[UPDATE], MinLatency(ms), 0
[UPDATE], MaxLatency(ms), 338
[UPDATE], Return=0, 50396
[UPDATE], 0, 0.10264765784114054
[UPDATE], 2000, 0.026989343690867442
[UPDATE], 4000, 0.0352882703777336
[UPDATE], 6000, 0.004238958990536277
[UPDATE], 8000, 0.052813085033008175
[UPDATE], 10000, 0.0
[READ], Operations, 49604
[READ], AverageLatency(ms), 0.038242883638416256
[READ], MinLatency(ms), 0
[READ], MaxLatency(ms), 230
[READ], Return=0, 49604
[READ], 0, 0.08997245741099663
[READ], 2000, 0.02207505518763797
[READ], 4000, 0.03188493260913297
[READ], 6000, 0.004869141813755326
[READ], 8000, 0.04355329949238579
[READ], 10000, 0.005405405405405406
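
Output in this shape is easy to post-process, since every line is `[SECTION], metric, value`. The sketch below uses a hypothetical parser, with the sample lines copied from the run above. It collects the metrics per section and checks that the overall throughput equals the total number of operations divided by the run time.

```python
from typing import Dict

SAMPLE_OUTPUT = """\
[OVERALL],RunTime(ms), 10077
[OVERALL],Throughput(ops/sec), 9923.58836955443
[UPDATE], Operations, 50396
[UPDATE], AverageLatency(ms), 0.04339630129375347
[READ], Operations, 49604
[READ], AverageLatency(ms), 0.038242883638416256
"""


def parse_ycsb_output(text: str) -> Dict[str, Dict[str, float]]:
    """Group '[SECTION], metric, value' lines into section -> metric -> value."""
    results: Dict[str, Dict[str, float]] = {}
    for raw in text.splitlines():
        parts = [part.strip() for part in raw.split(",")]
        if len(parts) != 3 or not parts[0].startswith("["):
            continue  # skip anything that is not a measurement line
        try:
            value = float(parts[2])
        except ValueError:
            continue
        results.setdefault(parts[0].strip("[]"), {})[parts[1]] = value
    return results


def derived_throughput(results: Dict[str, Dict[str, float]]) -> float:
    """Total operations across sections divided by run time in seconds."""
    operations = sum(
        metrics["Operations"] for metrics in results.values() if "Operations" in metrics
    )
    return operations * 1000.0 / results["OVERALL"]["RunTime(ms)"]


if __name__ == "__main__":
    parsed = parse_ycsb_output(SAMPLE_OUTPUT)
    print("reported:", parsed["OVERALL"]["Throughput(ops/sec)"])
    print("derived: ", derived_throughput(parsed))
```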

Amazon EC2 Configuration
Large Instance
7.5 GB memory
4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units each)
850 GB instance storage
64-bit platform
I/O Performance: High
API name: m1.large
Experiment Set-up
Cassandra Cluster
3 nodes + 1 node (Elasticity)
HBase Cluster
3 nodes

Amazon EC2 Usage
Cassandra
Load phase: 60,000,000 records of 1 KB
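
The size of the load phase follows from simple arithmetic. The sketch below assumes the default record layout of 10 fields of 100 bytes each (key size ignored) and treats the replication factor as a free parameter, since it is not stated in these slides. The function names are hypothetical.

```python
def raw_data_bytes(record_count: int, fields: int = 10, field_bytes: int = 100) -> int:
    """Payload size of the loaded records, ignoring keys and storage overhead."""
    return record_count * fields * field_bytes


def per_node_bytes(total_bytes: int, replication_factor: int, node_count: int) -> float:
    """Average payload per node when every record is stored replication_factor times."""
    if not 1 <= replication_factor <= node_count:
        raise ValueError("replication factor must be between 1 and the node count")
    return total_bytes * replication_factor / node_count


if __name__ == "__main__":
    total = raw_data_bytes(60_000_000)
    print(f"raw data: {total / 1e9:.0f} GB")
    for rf in (1, 2, 3):  # assumed values, for illustration only
        print(f"replication factor {rf}: {per_node_bytes(total, rf, 3) / 1e9:.0f} GB per node")
```

On three nodes the payload stays well below the 850 GB of instance storage per node, even if every record were stored on all of them.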

Amazon EC2 Usage
HBase

Amazon EC2 Usage
Cassandra
HBase

Amazon EC2 Usage
Load phase: 60,000,000 records of 1 KB
Cassandra, HBase

Amazon EC2 Usage
Transaction phase:
- 10,000 records
- 1,000,000 operations
- 250 threads
Cassandra

YCSB Cassandra Results
Update Heavy Workload (50/50)
[Figure: two panels, Update and Read. Average latency (ms, axis 0 to 60) versus throughput (ops/sec, axis 0 to 6,000)]

YCSB HBase Results
[Figure: Update, HBase 0.90.5. Average latency (ms), axis 0.00 to 0.30; x-axis points 471.15, 485, 492.38, 507.17, 562.33, 620.04, 634.82, 734.32, 845.15]
[Figure: Read, HBase 0.90.5. Average latency (ms), axis 0.00 to 1200.00; x-axis points 471.15, 485, 492.38, 507.17, 562.33, 620.04, 634.82, 734.32, 845.15]

YCSB Cassandra Results
[Figure: Elasticity, Cassandra 1.0. Latency (ms, axis 0 to 80,000) versus time (milliseconds, axis 0 to 400,000)]

YCSB Future
Provide statistics for:
- Availability
- Replication
Additional Distributed Databases
Currently supported:
- Cassandra
- MongoDB
- Voldemort
- HBase
- MapKeeper
- Redis
- VMware vFabric GemFire

Conclusions
• YCSB provides a common ground for benchmarking cloud DB services
• Good for learning and experimenting with different distributed databases
• Open source, extensible to new databases
• The laboratory on Amazon EC2 provided good insight into setting up cloud services
• Challenges
• Installation problems
• Hard-to-follow documentation
• Working in a distributed environment requires a lot of configuration

References
• YCSB (Yahoo! Cloud Serving Benchmark)
• https://github.com/brianfrankcooper/YCSB/wiki
• Yahoo! Research
• http://research.yahoo.com/Web_Information_Management/YCSB
• BigTable
• http://en.wikipedia.org/wiki/BigTable
• Cassandra
• http://wiki.apache.org/cassandra/
• HBase
• http://hbase.apache.org/
