Beginning Apache Cassandra Development, 1st Edition, Vivek Mishra (Auth.)
For your convenience Apress has placed some of the front
matter material after the index. Please use the Bookmarks
and Contents at a Glance links to access them.
Contents at a Glance

About the Author
About the Technical Reviewer
Acknowledgments
Introduction
Chapter 1: NoSQL: Cassandra Basics
Chapter 2: Cassandra Data Modeling
Chapter 3: Indexes and Composite Columns
Chapter 4: Cassandra Data Security
Chapter 5: MapReduce with Cassandra
Chapter 6: Data Migration and Analytics
Chapter 7: Titan Graph Databases with Cassandra
Chapter 8: Cassandra Performance Tuning
Chapter 9: Cassandra: Administration and Monitoring
Chapter 10: Cassandra Utilities
Chapter 11: Upgrading Cassandra and Troubleshooting
Index
Introduction
Big data has been the talk of the town in recent years. With its potential for solving unstructured and semi-structured data problems, more and more organizations are gradually moving toward big data powered solutions. This essentially gives organizations a way to think “beyond RDBMS.” This book will walk you through many such use cases along the way.
Many NoSQL databases have been developed over the last 4-5 years. Recent research shows there are now more
than 150 different NoSQL databases. This raises questions about why to adopt a specific database. For example,
is it scalable, under active development, and most importantly accepted by the community and organizations? It is
in light of these questions that Apache Cassandra comes out as a winner and indicates why it is one of the most
popular NoSQL databases currently in use.
Apache Cassandra is a columnar distributed database that takes database application development beyond the point at which traditional RDBMSs hit their limits of performance and scalability. Traditional RDBMSs are restricted by their need for predefined schemas, their inability to scale out to hundreds of data nodes, and the amount of work involved in data administration and monitoring. We will discuss these restrictions and how to address them with Apache Cassandra.
Beginning Apache Cassandra Development introduces you to Apache Cassandra, including the answers to the
questions mentioned above, and provides a detailed overview and explanation of its feature set.
Beginning with Cassandra basics, this book will walk you through the following topics and more:
• Data modeling
• Cluster deployment, logging, and monitoring
• Performance tuning
• Batch processing via MapReduce
• Hive and Pig integration
• Working on graph-based solutions
• Open source tools for Cassandra and related utilities
The book is intended for database administrators, big data developers, students, big data solution architects,
and technology decision makers who are planning to use or are already using Apache Cassandra.
Many of the features and concepts covered in this book are approached through hands-on recipes that show how things are done. In addition to those step-by-step guides, the source code for the examples is available as a download from the book’s Apress product page (www.apress.com/9781484201435).
Chapter 1
NoSQL: Cassandra Basics
The purpose of this chapter is to discuss NoSQL, let users dive into NoSQL elements, and then introduce big data
problems, distributed database concepts, and finally Cassandra concepts. Topics covered in this chapter are:
• NoSQL introduction
• CAP theorem
• Data distribution concepts
• Big data problems
• Cassandra configurations
• Cassandra storage architecture
• Setup and installation
• Logging with Cassandra
The intent of the detailed introductory chapter is to dive deep into the NoSQL ecosystem by discussing problems
and solutions, such as distributed programming concepts, which can help in solving scalability, availability, and other
data-related problems.
This chapter will introduce the reader to Cassandra and discuss Cassandra’s storage architecture, various other
configurations, and the Cassandra cluster setup over local and AWS boxes.
Introducing NoSQL
Big data’s existence can be traced back to the mid-1990s, but the real shift began in the early 2000s. The evolution of the Internet and mobile technology opened many doors for more people to participate and share data globally, resulting in massive data production, in various formats, flowing across the globe. A wider distributed network resulted in incremental data growth. This massive data generation has caused a major shift in application development, and many new business domain possibilities have emerged, such as:
• Social trending
• OLAP and data mining
• Sentiment analysis
• Behavior targeting
• Real-time data analysis
With data growing into the peta/zettabyte range, challenges like scalability and managing data structure are very difficult to handle with traditional relational databases. This is where big data and NoSQL technologies are considered as alternatives for building solutions. In today’s scenario, existing business domains are also exploring new functional possibilities while handling massive data growth simultaneously.
NoSQL Ecosystem
NoSQL, often expanded as “Not Only SQL,” implies thinking beyond traditional SQL in a distributed way. There are more than 150 NoSQL databases available today. The following are a few popular ones:
• Columnar databases, such as Cassandra and HBase
• Document-based stores, like MongoDB and Couchbase
• Graph-based databases, like Neo4j and Titan Graph DB
• Simple key-value stores, like Redis and CouchDB
With so many options and categories, the most important questions are what to choose, how, and why. Each NoSQL database category is meant to deal with a specific set of problems, and “a specific technology for a specific requirement” is the paradigm leading the current era of technology. It is clear that a single database cannot serve all business needs, and that’s where NoSQL databases come in. The best way to adopt a database is to understand the requirements first. If the application is polyglot in nature, then you may need to choose more than one database from the available options. In the next section, we will discuss a few points that describe why Cassandra could be an answer to your big data problem.
CAP Theorem
The CAP theorem, introduced by Eric Brewer in 2000, states that no distributed data store can offer Consistency, Availability, and Partition tolerance all at once (see Figure 1-1); depending on the use case, it can provide at most two of them.
Figure 1-1. CAP theorem excludes the possibility of a database with all three characteristics (the “NA” area)
Traditional relational database management systems (RDBMS) provide atomicity, consistency, isolation, and
durability (ACID) semantics and advocate for strong consistency. That’s where most of NoSQL databases differ and
strongly advocate for partition tolerance and high availability with eventual consistency.
High availability of data means data must be available with minimal latency. For distributed databases, where data is spread across multiple nodes, one way to achieve high availability is to replicate it across multiple nodes. Like most NoSQL databases, Cassandra provides high availability.

Partition tolerance implies that if one or a couple of nodes go down, the system is still able to serve read/write requests. In scalable systems built to deal with massive volumes of data (petabytes), such failure situations are likely to occur often, so these systems have to be partition tolerant. Cassandra’s storage architecture enables this as well.

Consistency means data is consistent across distributed nodes. Strong consistency guarantees that the most recently updated data is returned from every node in a cluster; it is achieved by synchronizing all replicas on each read/write request, which introduces latency (a downside of this approach). Cassandra offers eventual consistency, with configurable consistency levels for each read/write request. We will discuss the various consistency level options in detail in the coming chapters.
Budding Schema
Structured or fixed schema defines the number of columns and data types before implementation. Any alteration to
schema like adding column(s) would require a migration plan across the schema. For semistructured or unstructured
data formats where number of columns and data types may vary across multiple rows, static schema doesn’t fit very
well. That’s where budding or dynamic schema is best fit for semistructured or unstructured data.
Figure 1-2 presents four records containing Twitter-like data for particular user ids. Here, the row for user id imvivek consists of three columns: “tweet body,” “followers,” and “retweeted by.” But the row for user “apress_team” has only the “followers” column. For unstructured data such as server logs, the number of fields may vary from row to row, which requires adding columns “on the fly,” a strong requirement for NoSQL databases. A traditional RDBMS can handle such a data set in a static way, but unlike Cassandra, an RDBMS cannot scale to millions of columns per row in each partition. With the predefined models of the RDBMS world, handling frequent schema changes is certainly not a workable option; if we attempted to support dynamic columns, we might end up with many null columns, and storing default null values for multiple columns per row is certainly not desirable. With Cassandra we can have as many columns per row as we want (up to 2 billion). Another option Cassandra offers is defining a data type for column names (a comparator), which is not possible with an RDBMS (for example, a column name of type integer).
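The budding-schema idea can be sketched outside Cassandra as well. The following Python snippet is purely illustrative (the user ids come from Figure 1-2; the column values are assumed): it models rows as dictionaries whose columns vary per row, with no NULL padding for absent columns.

```python
# Rows keyed by user id; each row holds only the columns it actually uses.
rows = {
    "imvivek": {
        "tweet body": "hello cassandra",       # value is a made-up example
        "followers": ["apress_team"],
        "retweeted by": ["apress_team"],
    },
    # apress_team's row has a single column; the other two simply
    # do not exist, rather than being stored as NULL placeholders.
    "apress_team": {
        "followers": ["imvivek"],
    },
}

for user, columns in rows.items():
    print(user, "->", sorted(columns))
```

A fixed relational schema would force every row to carry all three columns; here each row pays only for what it stores.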
Scalability
Traditional RDBMSs offer vertical scalability, that is, scaling by adding more processors or RAM to a single unit, whereas NoSQL databases offer horizontal scalability, scaling out by adding more nodes. Most NoSQL databases are schemaless and can perform well over commodity servers. Adding nodes to an existing RDBMS cluster is a cumbersome process and relatively expensive, whereas it is relatively easy to add data nodes with a NoSQL database, such as Cassandra.
We will discuss adding nodes to Cassandra in coming chapters.
No Single Point of Failure
With centralized databases or master/slave architectures, where database resources or a master are available on a single machine, database services come to a complete halt if the master node goes down. Such database architectures are discouraged where high availability of data is a priority. NoSQL distributed databases generally prefer multiple-master configurations or peer-to-peer architecture to avoid a single point of failure. Cassandra delivers a peer-to-peer architecture in which each Cassandra node has an identical configuration. We will discuss this at length in the coming chapters.

Figure 1-3a depicts a system with a single master acting as the single point of contact to retrieve data from slave nodes. If the master goes down, the whole system halts until the master node is reinstated. But with a multiple-master configuration, like the one in Figure 1-3b, a single point of failure does not interrupt service.
Figure 1-2. A dynamic column, a.k.a. budding schema, is one way to relax static schema constraint of RDBMS world
High Availability
High availability clusters keep the database available 24x7 with minimal (or no) downtime. In such clusters, data is replicated across multiple nodes, so that if one node is down, another node is still available to serve read/write requests until the failed node is up and running again. Cassandra’s peer-to-peer architecture and replication ensure high availability of data.
Identifying the Big Data Problem
Recently, it has been observed that developers are opting for NoSQL databases as an alternative to RDBMS. However,
I recommend that you perform an in-depth analysis before deciding on NoSQL technologies. Traditional RDBMS
does offer lots of features which are absent in most of NoSQL databases. A couple of questions that must be analyzed
and answered before jumping to a NoSQL-based approach include:
• Is it really a big data problem?
• Why/where does the RDBMS fail to deliver?
Identifying a “big data problem” is an interesting exercise. Scalability, the nature of the data (structured, unstructured, or semistructured), and the cost of maintaining the data volume are a few important factors. In most cases, managing secured and structured data within an RDBMS may still be the preferred approach; however, if the nature of the data is semistructured and less vulnerable, and scalability is preferred over traditional RDBMS features (e.g., joins, materialized views, and so forth), it qualifies as a big data use case. Here data security means the authentication and authorization mechanism. Although Cassandra offers decent support for authentication and authorization, the RDBMS fares better in comparison with most NoSQL databases.
Figure 1-4 shows a scenario in which a cable/satellite operator system is collecting audio/video transmission logs
(on daily basis) of around 3 GB/day per connection. A “viewer transmission analytic system” can be developed using
a big data tech stack to perform “near real time” and “large data” analytics over the streaming logs. Also the nature
of data logs is uncertain and may vary from user to user. Generating monthly/yearly analytic reports would require
dealing with petabytes of data, and NoSQL’s scalability is definitely a preference over that of RDBMS.
Figure 1-3. Centralized vs. distributed architectural setup
Consider an example in which a viewer transmission analytic system is capturing random logs for each transmitted program and its viewers. The first question we need to ask is: is it really a big data problem? Yes; here we are talking about logs, and in a country like India the user base is huge, as are the logs captured 24x7. Also, the structure of the transmitted logs is not fixed; they can be semi-structured or totally unstructured. That’s where an RDBMS will fail to deliver, because of the budding schema and scalability problems (see the previous sections).
To summarize, build a NoSQL based solution if:
• Data format is semi-structured or unstructured
• RDBMS reaches its storage limits and cannot scale further
• RDBMS-specific features like relations and indexes can be sacrificed for denormalized but distributed data
• Data redundancy is not an issue and a read-before-write approach can be applied
In the next section, we will discuss how Cassandra can be a best fit to address such technical and functional
challenges.
Introducing Cassandra
Cassandra is an open-source, column-family-oriented database. Originally developed at Facebook, it entered the Apache Incubator in 2009 and has since become an Apache top-level project (TLP). Cassandra comes with many important features; some are listed below:
• Distributed database
• Peer-to-peer architecture
• Configurable consistency
• CQL (Cassandra Query Language)
Figure 1-4. Family watching satellite transmitted programs
Distributed Databases
Cassandra is a globally distributed database that supports replication and partitioning. Replication is a process in which the system maintains a number of replicas on various data sites; such data sites are called nodes in Cassandra. Data partitioning is a scheme in which data is distributed across multiple nodes, usually for managing availability of and performance on data.

■ Note  A node is a physical location where data resides.
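To make partitioning and replication concrete, here is a minimal Python sketch of a hash ring: each partition key hashes to a token, and its replicas are the next N distinct nodes clockwise on the ring. The node names, the even token spacing, the MD5 hash, and the "next N nodes" strategy are all assumptions for illustration; this mimics the general technique, not Cassandra's actual partitioner or replication strategy.

```python
import hashlib

def token_for(partition_key: str) -> int:
    # Hash the partition key to a 128-bit position on the ring.
    return int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)

def replicas_for(partition_key: str, ring_tokens: dict, replication_factor: int) -> list:
    # Walk the ring clockwise from the key's token and pick the
    # next `replication_factor` distinct nodes (simple strategy).
    ring = sorted(ring_tokens.items(), key=lambda kv: kv[1])
    key_token = token_for(partition_key)
    start = next((i for i, (_, t) in enumerate(ring) if t >= key_token), 0)
    return [ring[(start + i) % len(ring)][0] for i in range(replication_factor)]

# Four hypothetical nodes with evenly spaced tokens; replication factor 3.
tokens = {f"node{i}": i * (2**128 // 4) for i in range(4)}
print(replicas_for("imvivek", tokens, 3))
```

Because the hash is deterministic, every coordinator computes the same replica set for a given key, which is what lets any peer serve any request.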
Peer-to-Peer Design
Cassandra storage architecture is peer-to-peer. Each node in a cluster is assigned the same role, making it a
decentralized database. Each node is independent of the other but interconnected. Nodes in a network are capable
of serving read/write database requests, so at a given point even if a node goes down, subsequent read/write requests
will be served from other nodes in the network, hence there is no SPOF (Single Point Of Failure).
Figure 1-5 is a graphical representation of peer-to-peer (P2P) architecture.
Figure 1-5. Peer to Peer decentralized Cassandra nodes. Every node is identical and can communicate with other nodes
Configurable Data Consistency
Data consistency means synchronization of data across multiple replica nodes. An eventually consistent data model returns the last updated record once all replicas have converged; such a model is widely supported by many distributed databases. Cassandra offers configurable eventual consistency.
Write Consistency
If the data is successfully written and synchronized on replica nodes before acknowledging the write request, data is
considered write consistent. However, various consistency level values are possible while submitting a write request.
Available consistency levels are
• ANY: A write must be written to at least one node. If all replica nodes for the row are down and hinted_handoff_enabled is true (the default), the coordinator node still stores the write data along with a hint, and once the replica nodes come back up, the data is handed off to at least one of them. Such data is not available for reads until it has been handed off to a replica node. ANY is the lowest consistency level but offers the highest availability, as it only requires the data to be stored on any one node before acknowledging the write.
• ONE: With consistency level ONE; write request must be successfully written on at least one
replica node before acknowledgment.
• QUORUM: With the consistency level QUORUM, write requests must be successfully written on a quorum (a majority) of replica nodes.
• LOCAL_QUORUM: With the consistency level LOCAL_QUORUM write requests must be
successfully written on a selected group of replica nodes, known as quorum, which are locally
available on the same data center as the coordinator node.
• EACH_QUORUM: With the consistency level EACH_QUORUM write requests must be successfully
written on select groups of replica nodes (quorum).
• ALL: With the consistency level ALL write requests must be written to the commit log and
memory table on all replica nodes in the cluster for that row key to ensure the highest
consistency level.
• SERIAL: Linearizable consistency was introduced in Cassandra 2.0 as lightweight transaction support. With the consistency level SERIAL, write requests must be written to the commit log and memory table on a quorum of replica nodes conditionally, meaning the write is guaranteed on all of them or on none.
• TWO: Similar to ONE except with the consistency level TWO write requests must be written to
the commit log and memory table on minimum two replica nodes.
• THREE: Similar to TWO, except that with the consistency level THREE write requests must be written to the commit log and memory table on a minimum of three replica nodes.
Read Consistency
No data is of much use if it is not consistent. Large or small data applications would prefer not to have dirty reads
or inconsistent data. A dirty read is a scenario where a transaction may end up in reading uncommitted data from
another thread. Although dirty reads are more RDBMS specific, with Cassandra there is a possibility for inconsistent
data if the responsible node is down and the latest data is not replicated on each replica node. In such cases, the
application may prefer to have strong consistency at the read level. With Cassandra’s tunable consistency, it is possible
to have configurable consistency per read request. Possible options are
• ONE: With the read consistency level ONE, data is returned from the nearest replica node to
coordinator node. Cassandra relies on snitch configuration to determine the nearest possible
replica node. Since a response is required to be returned from the closest replica node, ONE is
the lowest consistency level.
• QUORUM: With the read consistency level QUORUM, the last updated data (based on
timestamp) is returned among data responses received by a quorum of replica nodes.
• LOCAL_QUORUM: With the read consistency level LOCAL_QUORUM, the last updated data
(based on timestamp) is returned among the data response received by a local quorum of
replica nodes.
• EACH_QUORUM: With the read consistency level EACH_QUORUM, the last updated data (based
on timestamp) is returned among the data response received by each quorum of replica
nodes.
• ALL: With the read consistency level ALL, the last updated data (based on timestamp) is returned among the data responses received from all replica nodes. Since responses must be gathered from every replica node, ALL is the highest consistency level.
• SERIAL: With the read consistency level SERIAL, the latest set of columns committed or in progress is returned. Uncommitted transactions discovered during the read result in an implicit commit of the in-progress transactions, and the latest column values are returned.
• TWO: With the read consistency level TWO, the latest column values will be returned from the
two closest replica nodes.
• THREE: With the read consistency level THREE, the latest column values will be returned from
three of the closest replica nodes.
Based on the above-mentioned consistency level configurations, the user can always configure each read/write
request with a desired consistency level. For example, to ensure the lowest write consistency but the highest read
consistency, we can opt for ANY as write consistency and ALL for read consistency level.
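The trade-off among these levels boils down to simple arithmetic: a read is guaranteed to see the latest write only when the write and read replica sets must overlap, i.e. W + R > replication factor. The following Python sketch shows this standard quorum math (it is not code from the book, just the usual rule of thumb):

```python
def quorum(replication_factor: int) -> int:
    # A quorum is a majority of the replicas: floor(RF / 2) + 1.
    return replication_factor // 2 + 1

def overlaps(write_acks: int, read_responses: int, replication_factor: int) -> bool:
    # Reads are guaranteed to see the latest write only when the
    # write set and the read set must share at least one replica.
    return write_acks + read_responses > replication_factor

rf = 3
w = r = quorum(rf)                  # QUORUM writes and QUORUM reads
print(quorum(rf))                   # 2
print(overlaps(w, r, rf))           # True: strong consistency
print(overlaps(1, 1, rf))           # False: ONE/ONE may read stale data
```

This is why QUORUM/QUORUM (or ONE writes with ALL reads, and vice versa) gives strong consistency, while ONE/ONE trades consistency for latency.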
Cassandra Query Language (CQL)
One of the key features of Cassandra from an end-user perspective is its ease of use and familiarity. Cassandra Query Language (CQL) was introduced with the Cassandra 0.8 release with the intention of providing an RDBMS-style structured query language (SQL). Since its inception, CQL has gone through many changes; many new features have been introduced in later releases, along with lots of performance-related enhancement work. CQL adds a flavor of the familiar data definition language (DDL) and data manipulation language (DML) statements.
During the course of this book, we will be covering most of the CQL features.
Installing Cassandra
Installing Cassandra is fairly easy. In this section we will cover how to set up a Cassandra tarball (.tar.gz file) installation on a Windows or Linux box.

1. Create a folder in which to download the Cassandra tarball, for example:
• Run mkdir /home/apress/cassandra (here apress is the user.name environment variable)
• Run cd /home/apress/cassandra
2. Download the Cassandra tarball:
• Linux: wget http://archive.apache.org/dist/cassandra/2.0.6/apache-cassandra-2.0.6-bin.tar.gz
• Windows: download http://archive.apache.org/dist/cassandra/2.0.6/apache-cassandra-2.0.6-bin.tar.gz in a browser
3. Extract the downloaded tar file using the appropriate method for your platform:
• Linux: tar -xvf apache-cassandra-2.0.6-bin.tar.gz
• Windows: use a tool like WinZip or 7-Zip to extract the tarball.
•
■ Note  If you get an “Out of memory” error or a segmentation fault, check the JAVA_HOME and JVM_OPTS parameters in the cassandra-env.sh file.
Logging in Cassandra
While running an application in development or production mode, we might need to look into server logs in certain
circumstances, such as:
• Performance issues
• Operation support
• Debugging application vulnerabilities
Default server logging settings are defined within the log4j-server.properties file, as shown in the following listing.
# output messages into a rolling log file as well as stdout
log4j.rootLogger=INFO,stdout,R
# stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%5p %d{HH:mm:ss,SSS} %m%n
# rolling log file
log4j.appender.R=org.apache.log4j.RollingFileAppender
log4j.appender.R.maxFileSize=20MB
log4j.appender.R.maxBackupIndex=50
log4j.appender.R.layout=org.apache.log4j.PatternLayout
log4j.appender.R.layout.ConversionPattern=%5p [%t] %d{ISO8601} %F (line %L) %m%n
# Edit the next line to point to your logs directory
log4j.appender.R.File=/var/log/cassandra/system.log
# Application logging options
#log4j.logger.org.apache.cassandra=DEBUG
#log4j.logger.org.apache.cassandra.db=DEBUG
#log4j.logger.org.apache.cassandra.service.StorageProxy=DEBUG
# Adding this to avoid thrift logging disconnect errors.
log4j.logger.org.apache.thrift.server.TNonblockingServer=ERROR
Chapter 1 ■ NoSQL: Cassandra Basics
11
Let’s discuss these properties in sequence:
• Properties with the prefix log4j.appender.stdout configure console logging.
• Server logs are generated and appended at the location defined by the log4j.appender.R.File property. The default value is /var/log/cassandra/system.log; the user can overwrite the property file to change the default location.
• log4j.appender.R.maxFileSize defines the maximum log file size (default 20MB).
• log4j.appender.R.maxBackupIndex defines the maximum number of rolling log files (default 50).
• log4j.appender.R.layout.ConversionPattern defines the logging pattern for log files.
• The last line in the log4j-server.properties file controls logging for the Thrift connection with Cassandra. By default it is set to ERROR to avoid unnecessary logging on frequent socket disconnections.
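As a quick illustration of what the rolling-file ConversionPattern %5p [%t] %d{ISO8601} %F (line %L) %m%n produces, the following Python sketch parses a sample line of that shape. The sample log line itself is fabricated for illustration, not taken from a real Cassandra log.

```python
import re

# Matches lines shaped like:
#  INFO [main] 2014-03-10 12:00:01,123 CassandraDaemon.java (line 135) message
LOG_LINE = re.compile(
    r"^\s*(?P<level>\w+)\s+\[(?P<thread>[^\]]+)\]\s+"
    r"(?P<timestamp>\S+\s\S+)\s+(?P<file>\S+)\s+"
    r"\(line (?P<line>\d+)\)\s+(?P<message>.*)$"
)

sample = " INFO [main] 2014-03-10 12:00:01,123 CassandraDaemon.java (line 135) Logging initialized"
match = LOG_LINE.match(sample)
print(match.group("level"), match.group("file"), match.group("line"))
```

Being able to parse the pattern mechanically is handy when grepping system.log during the troubleshooting scenarios listed above.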
Application Logging Options
By default, Cassandra API level logging is disabled. But we can enable and change log level to log more application
level information. Many times applications may need to enable Cassandra-specific server-side logging to troubleshoot
the problems. The following code depicts the section that can be used for application-specific logging.
# Application logging options
#log4j.logger.org.apache.cassandra=DEBUG
#log4j.logger.org.apache.cassandra.db=DEBUG
#log4j.logger.org.apache.cassandra.service.StorageProxy=DEBUG
Changing Log Properties
There are two possible ways to configure the log properties: first, by modifying log4j-server.properties, and second, via JMX (Java Management Extensions), using jconsole. The difference between them is that the latter changes the logging level dynamically at run time, while the former is static and requires a restart.
Managing Logs via JConsole
JConsole is a GUI monitoring tool for resource usage and performance monitoring of running Java applications
using JMX.
The jconsole executable can be found in JDK_HOME/bin, where JDK_HOME is the directory in which the Java
Development Kit (JDK) is installed. If this directory is in your system path, you can start JConsole by simply typing
jconsole at command (shell) prompt. Otherwise, you have to use the full path of the executable file.
Figure 1-6. JConsole connection layout
On running jconsole, you need to connect to the CassandraDaemon process, as shown in Figure 1-6.
After successfully connecting to CassandraDaemon process, click on the MBeans tab to look into registered
message beans. Figure 1-7 depicts changing the log level for classes within the org.apache.cassandra.db package to
INFO level.
Figure 1-7. Changing the log level via jconsole Mbeans setting
■ Note  Please refer to http://logging.apache.org/log4j/1.2/apidocs/org/apache/log4j/PatternLayout.html for more information on logging patterns.
Understanding Cassandra Configuration
The primary Cassandra configuration file is cassandra.yaml, which is available within the $CASSANDRA_HOME/conf folder. There are roughly 100 properties. Table 1-1 contains a subset of these properties that are helpful for Cassandra beginners and worth mentioning.
Table 1-1. Important Cassandra server properties

cluster_name (default: "Test cluster")
  Restricts a node to joining only one logical cluster.

num_tokens (default: disabled; a node gets 1 token if unspecified)
  Set this parameter to enable virtual node support while bootstrapping a node.
  The recommended value is 256.

initial_token (default: none)
  Assigns a data range to the node. It is recommended to assign a value while
  bootstrapping a node; if left unspecified, Cassandra assigns a random token.
  For the random partitioning scheme, initial_token can be calculated as
  i * (2**127 / N) for i = 0 .. N-1, where N is the number of nodes.

hinted_handoff_enabled (default: true)
  With consistency level ANY, if a replica node is down, the corresponding write
  request is stored on the coordinator node as a hint in the system.hints column
  family. The hint is used to replay the mutation once the replica node starts
  accepting write requests again.

max_hint_window_in_ms (default: 3 hours)
  Maximum time for which new hints are written for a dead node. After the hint
  window expires, no more new hints are stored. If the gossip end point down time
  for a specific replica node is greater than this value, no new hints are
  written for it by the StorageProxy service on the coordinator node.

hinted_handoff_throttle_in_kb (default: 1024)
  Hint data flow, in KB per second, per delivery thread.

max_hints_delivery_threads (default: 2)
  Maximum number of threads allowed to deliver hints. Useful when delivering
  hints across multiple data centers.

populate_io_cache_on_flush (default: false)
  Set it to true if the complete data on a node fits into memory. Since
  Cassandra 1.2.2 this can also be set per column family
  (https://issues.apache.org/jira/browse/CASSANDRA-4694).

authenticator (default: AllowAllAuthenticator)
  Implementation of the IAuthenticator interface. Cassandra ships with
  AllowAllAuthenticator and PasswordAuthenticator as internal authentication
  implementations. PasswordAuthenticator validates the username and password
  against data stored in the credentials and users column families in the
  system_auth keyspace. (Security in Cassandra will be discussed at length in
  Chapter 10.)

authorizer (default: AllowAllAuthorizer)
  Implementation of the IAuthorizer interface, which manages user permissions
  over keyspaces, column families, indexes, etc. Enabling CassandraAuthorizer on
  server startup creates a permissions table in the system_auth keyspace to
  store user permissions. (Security in Cassandra will be discussed at length in
  Chapter 10.)

permissions_validity_in_ms (default: 2000; disabled if authorizer is AllowAllAuthorizer)
  Permissions cache validity.

partitioner (default: Murmur3Partitioner)
  Decides how rows are distributed across the nodes in the cluster. Available
  values are RandomPartitioner, ByteOrderedPartitioner, Murmur3Partitioner, and
  OrderPreservingPartitioner (deprecated).

data_file_directories (default: /var/lib/cassandra/data)
  Physical data location of the node.

commitlog_directory (default: /var/lib/cassandra/commitlog)
  Physical location of the node's commit log files.

disk_failure_policy (default: stop)
  Available values are stop, best_effort, and ignore. stop shuts down all
  communication with the node (except JMX); best_effort still acknowledges read
  requests from available sstables.

key_cache_size_in_mb (default: empty, meaning 100MB or 5% of the available heap size, whichever is smaller)
  To disable the key cache, set it to zero.

saved_caches_directory (default: /var/lib/cassandra/saved_caches)
  Physical location for saved caches on the node.

key_cache_save_period (default: 14400)
  Interval, in seconds, at which the key cache is saved under
  saved_caches_directory.

key_cache_keys_to_save (default: disabled)
  If disabled, all keys are cached.

row_cache_size_in_mb (default: 0, disabled)
  In-memory row cache size.

row_cache_save_period (default: 0, disabled)
  Interval, in seconds, at which the row cache is saved under
  saved_caches_directory.

row_cache_keys_to_save (default: disabled)
  If disabled, all row keys are cached.

row_cache_provider (default: SerializingCacheProvider)
  Available values are SerializingCacheProvider and
  ConcurrentLinkedHashCacheProvider. SerializingCacheProvider is recommended
  when the workload is not update intensive, as it uses native (off-JVM) memory
  for caching.

commitlog_sync (default: periodic)
  Available values are periodic and batch. With batch sync, writes are not
  acknowledged until they are synced to disk. See the
  commitlog_sync_batch_window_in_ms property.

commitlog_sync_batch_window_in_ms (default: 50)
  If commitlog_sync is in batch mode, Cassandra acknowledges writes only after
  the commit log sync window expires and data has been fsynced to disk.

commitlog_sync_period_in_ms (default: 10000)
  If commitlog_sync is periodic, the commit log is fsynced to disk at this
  interval.

commitlog_segment_size_in_mb (default: 32)
  Commit log segment size. Upon reaching this limit, Cassandra flushes memtables
  to disk in the form of sstables. Keep it to a minimum on a 32-bit JVM to avoid
  running out of address space and to reduce commit log flushing.

seed_provider (default: SimpleSeedProvider)
  Implementation of the SeedProvider interface. SimpleSeedProvider is the
  default implementation and takes a comma-separated list of addresses. The
  default value for the "-seeds" parameter is 127.0.0.1; change it to the node
  addresses in a multi-node deployment.

concurrent_reads (default: 32)
  Number of concurrent reads to perform. Matters when workload data cannot fit
  in memory and must be fetched from disk.

concurrent_writes (default: 32)
  Writes are generally faster than reads, so this can be set higher than
  concurrent_reads.

memtable_total_space_in_mb (default: disabled, meaning one third of the JVM heap)
  Total space allocated for memtables. Once the specified size is exceeded,
  Cassandra flushes the largest memtable to disk first.

commitlog_total_space_in_mb (default: 32 on a 32-bit JVM, 1024 on a 64-bit JVM)
  Total space allocated for commit log segments. Upon reaching the specified
  limit, Cassandra flushes memtables to reclaim space by removing the oldest
  commit log segments first.

storage_port (default: 7000)
  TCP port for internal communication between nodes.

ssl_storage_port (default: 7001)
  Used if encrypted internode communication (server_encryption_options) is
  enabled.

listen_address (default: localhost)
  Address to bind to and connect with other Cassandra nodes.

broadcast_address (default: disabled, same as listen_address)
  Address broadcast to other Cassandra nodes.

internode_authenticator (default: AllowAllInternodeAuthenticator)
  IInternodeAuthenticator interface implementation for internode communication.

start_native_transport (default: false)
  Enables the CQL native transport for clients.

native_transport_port (default: 9042)
  CQL native transport port for clients to connect to.

rpc_address (default: localhost)
  Thrift RPC address for clients to connect to.

rpc_port (default: 9160)
  Thrift RPC port for clients to communicate on.

rpc_min_threads (default: 16)
  Minimum number of threads for Thrift RPC.

rpc_max_threads (default: 2147483647, the maximum 32-bit signed integer)
  Maximum number of threads for Thrift RPC.

rpc_recv_buff_size_in_bytes (default: disabled)
  Enable to set a limit on the receiving socket buffer size for Thrift RPC.

rpc_send_buff_size_in_bytes (default: disabled)
  Enable to set a limit on the sending socket buffer size for Thrift RPC.

incremental_backups (default: false)
  If enabled, Cassandra hard-links flushed sstables to a backups directory under
  data_file_directories/keyspace.

snapshot_before_compaction (default: false)
  If enabled, creates a snapshot before each compaction under the
  data_file_directories/keyspace/snapshots directory.

auto_snapshot (default: true)
  If disabled, no snapshot is taken before destructive operations (truncate,
  drop) on a keyspace.

concurrent_compactors (default: number of processors)
  Equal to cassandra.available_processors (if defined), otherwise the number of
  available processors.

multithreaded_compaction (default: false)
  If enabled, one thread per processor is used for compaction.

compaction_throughput_mb_per_sec (default: 16)
  Compaction throughput in megabytes per second. Higher compaction throughput
  ensures fewer sstables and more free space on disk.

endpoint_snitch (default: SimpleSnitch)
  A very important setting. A snitch can be thought of as an informer, used to
  route requests to replica nodes in the cluster. Available values are
  SimpleSnitch, PropertyFileSnitch, RackInferringSnitch, Ec2Snitch, and
  Ec2MultiRegionSnitch. (Snitch configuration is covered in later chapters.)
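The initial_token formula from Table 1-1 (i * (2**127 / N) for RandomPartitioner) can be computed with a few lines of Python; a sketch, where the node count is something you supply:

```python
def initial_tokens(node_count):
    """Evenly spaced initial_token values for RandomPartitioner.

    RandomPartitioner's token space is 0 .. 2**127 - 1, so node i
    gets token i * (2**127 // node_count), i = 0 .. N-1.
    """
    return [i * (2**127 // node_count) for i in range(node_count)]

# For a 4-node cluster: paste one value per node into cassandra.yaml.
for node, token in enumerate(initial_tokens(4)):
    print("node %d: initial_token: %d" % (node, token))
```

Each node in the cluster gets exactly one of these values, which keeps the ranges evenly balanced.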
Commit Log Archival
To enable automatic commit log archiving and restore for recovery (supported since version 1.1.1),
the commitlog_archiving.properties file is used. It configures the archive_command and restore_command properties.
Commit log archival is also referred to as write-ahead log (WAL) archiving and is used for point-in-time recovery.
Cassandra's implementation is similar to PostgreSQL's. PostgreSQL is an object-relational database
management system (ORDBMS) that offers wal_level settings, with minimal as the lowest, followed by the archive
and hot_standby levels, the last of which allows executing queries during recovery. For more details on PostgreSQL
refer to http://www.postgresql.org/.
archive_command
Enable archive_command for implicit commit log archival using a command such as:
archive_command=/bin/ln %path /home/backup/%name
Here %path is the fully qualified path of the last active commit log segment and %name is the file name of that
segment. The shell command above creates a hard link for the commit log segment under /home/backup/. When a
segment fills up (row mutations exceed commitlog_segment_size_in_mb), Cassandra archives it using this command.
restore_command
If restore_command and restore_directories are set in commitlog_archiving.properties, then during bootstrap
Cassandra will replay the archived log files using the restore_command:
restore_command=cp -f %from %to
Here %from is a value specified as restore_directories and %to is the next commit log segment file under
commitlog_directory.
One advantage of this continuous commit log archival is high availability of data, also termed warm standby.

Table 1-1. (continued)

request_scheduler (default: NoScheduler)
  Client request scheduler. By default no scheduling is done, but this can be
  set to RoundRobinScheduler or a custom implementation, which queues up client
  DML requests and releases each one after it has been successfully processed.

server_encryption_options (default: none)
  Enables encryption for internode communication. Available values for
  internode_encryption are all, none, dc, and rack.

client_encryption_options (default: not enabled)
  Enables encryption for client/server communication.

internode_compression (default: all)
  Compresses traffic in internode communication. Available values are all, dc,
  and none.

inter_dc_tcp_nodelay (default: true)
  Setting it to false causes less congestion over TCP but increased latency.
Configuring Replication and Data Center
Recently, the need for big, heterogeneous data systems has grown. Components in such systems are diverse in
nature and can be made up of different data sets. Considering the nature, locality, and quantity of the data volume,
it is quite possible that such systems need to interconnect data centers at different physical locations.
A data center is a logical grouping of hardware (say, commodity servers) that consists of multiple racks. A rack may
contain one or more nodes (see Figure 1-8).
Figure 1-8. Image depicting a Cassandra data center
Reasons for maintaining multiple data centers include high availability, standby nodes, and data recovery.
With high availability, any incoming request must be served with minimum latency. Data replication is a
mechanism to keep redundant copies of the same data on multiple nodes.
As explained above, a data center consists of multiple racks, with each rack containing multiple nodes. A data
replication strategy is vital in order to support high availability and survive node failure, covering situations like:
• Local reads (high availability)
• Fail-over (node failure)
Considering these factors, we should replicate data on multiple nodes in the same data center but on different
racks. This avoids read/write failure (in case of network connection issues, power failure, etc.) of nodes in the
same rack.
Replication means keeping redundant copies of data over multiple data nodes for high availability and
consistency. With Cassandra we can configure the replication factor and replication strategy class while creating
a keyspace.
While creating the schema (that is, the keyspace) we can configure replication as:
CREATE KEYSPACE apress WITH replication = {'class': 'SimpleStrategy', 'replication_factor' : 3};
// cql3 script
create keyspace apress with placement_strategy='org.apache.cassandra.locator.SimpleStrategy' and
strategy_options={replication_factor:1};
// using cassandra-cli thrift
■ Note  Schema creation and management via CQL3 and Cassandra-cli will be discussed in Chapter 2.
Here, SimpleStrategy is the replication strategy and the replication_factor is 3. With SimpleStrategy configured
like this, each data row is replicated on 3 replica nodes, placed in clockwise direction around the ring, with writes
acknowledged synchronously or asynchronously depending on the write consistency level.
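The clockwise placement just described can be sketched in a few lines of Python: the first replica is the node owning the row's token, and the remaining replicas are the next distinct nodes walking the ring clockwise. The node names and tokens below are made up for illustration:

```python
from bisect import bisect_left

def simple_strategy_replicas(ring, row_token, replication_factor):
    """ring: sorted list of (token, node) tuples; returns the replica nodes.

    The first replica is the node whose token is the first one >= the row's
    token (wrapping past the end of the ring); the rest are the next
    distinct nodes clockwise.
    """
    tokens = [t for t, _ in ring]
    start = bisect_left(tokens, row_token) % len(ring)  # wrap at ring end
    replicas = []
    i = start
    while len(replicas) < min(replication_factor, len(ring)):
        node = ring[i][1]
        if node not in replicas:
            replicas.append(node)
        i = (i + 1) % len(ring)
    return replicas

ring = [(0, 'A'), (25, 'B'), (50, 'C'), (75, 'D')]
print(simple_strategy_replicas(ring, 30, 3))  # ['C', 'D', 'A']
```

With replication_factor 3 on this toy ring, a row whose token falls between nodes B and C lands on C plus the next two nodes clockwise.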
Different strategy class options supported by Cassandra are
• SimpleStrategy
• LocalStrategy
• NetworkTopologyStrategy
LocalStrategy
LocalStrategy is reserved for internal purposes and is used for the system and system_auth keyspaces. system and
system_auth are internal keyspaces, implicitly handled by Cassandra's storage architecture, for managing
authorization and authentication. These keyspaces also keep metadata about user-defined keyspaces and
column families. In the next chapter we will discuss them in detail. Trying to create a keyspace with the strategy
class LocalStrategy is not permitted in Cassandra and results in an error like "LocalStrategy is for Cassandra's
internal purpose only".
NetworkTopologyStrategy
NetworkTopologyStrategy is preferred if multiple replica nodes need to be placed on different data centers. We can
create a keyspace with this strategy as
CREATE KEYSPACE apress WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1' : 2, 'dc2' : 3};
Here dc1 and dc2 are data center names with replication factors of 2 and 3, respectively. Data center names are
derived from the configured snitch property.
SimpleStrategy
SimpleStrategy is recommended for multiple nodes over multiple racks in a single data center.
CREATE KEYSPACE apress WITH replication = {'class': 'SimpleStrategy', 'replication_factor' : 3};
Here, a replication factor of 3 means data is replicated on 3 nodes, and the SimpleStrategy class places those
replica nodes within the same data center.
Cassandra Multiple Node Configuration
In this section, we will discuss configuring multiple Cassandra nodes on a single machine and over Amazon EC2
instances. The single-machine setup shows how to configure a Cassandra cluster over physical boxes, while the
AWS EC2 setup shows how to run the cluster in the cloud and also educates users about AWS itself.
Configuring Multiple Nodes over a Single Machine
Configuring multiple nodes on a single machine is more of an experiment; for a production application you
would configure the Cassandra cluster over multiple machines. Setting up a multinode cluster over a single
machine or multiple machines is similar, which is what we will cover in this sample exercise. In this example, we
will configure 3 nodes (127.0.0.2-4) on a single machine.
1. We need to map hostnames to IP addresses.
a. On Windows and Linux, these configurations live in the hosts file
(%SystemRoot%\System32\drivers\etc\hosts on Windows, /etc/hosts on Linux).
Modify the configuration file to add the above-mentioned 3 node addresses as:
127.0.0.1 127.0.0.2
127.0.0.1 127.0.0.3
127.0.0.1 127.0.0.4
b. For Mac OS, we need to create loopback aliases as:
sudo ifconfig lo0 alias 127.0.0.2 up
sudo ifconfig lo0 alias 127.0.0.3 up
sudo ifconfig lo0 alias 127.0.0.4 up
2. Unzip the downloaded Cassandra tarball into 3 different folders (one for each
node). Assign each node an identical cluster_name as:
# The name of the cluster. This is mainly used to prevent machines in
# one logical cluster from joining another.
cluster_name: 'Test Cluster'
3. We should keep identical seeds on each node in the cluster. These are used just to initiate
the gossip protocol among the nodes. Configure seeds in cassandra.yaml as:
seed_provider:
# Addresses of hosts that are deemed contact points.
# Cassandra nodes use this list of hosts to find each other and learn
# the topology of the ring. You must change this if you are running
# multiple nodes!
- class_name: org.apache.cassandra.locator.SimpleSeedProvider
parameters:
# seeds is actually a comma-delimited list of addresses.
# Ex: ip1,ip2,ip3
- seeds: 127.0.0.2
4. Change the listen_address and rpc_address configurations to 127.0.0.2, 127.0.0.3, and
127.0.0.4 in each node's cassandra.yaml file. Since all 3 nodes are running on the
same machine, also change the rpc_port to 9160, 9161, and 9162, respectively.
5. Here we have the option to choose between 1 token per node or multiple tokens per node.
Cassandra 1.2 introduced the "virtual nodes" feature, which allows assigning a range of
tokens to a node. We will discuss virtual nodes in a coming chapter. Leave
initial_token empty and set num_tokens to 2 (the recommended value is 256).
6. Next, assign a different JMX_PORT (say, 8081, 8082, and 8083) to each node.
a. On Linux, modify $CASSANDRA_HOME/conf/cassandra-env.sh, setting the port for each node:
# Specifies the default port over which Cassandra will be available for
# JMX connections.
JMX_PORT=7199
b. On Windows, modify $CASSANDRA_HOME/bin/cassandra.bat as:
REM ***** JAVA options *****
set JAVA_OPTS=-ea^
-javaagent:%CASSANDRA_HOME%libjamm-0.2.5.jar^
-Xms1G^
-Xmx1G^
-XX:+HeapDumpOnOutOfMemoryError^
-XX:+UseParNewGC^
-XX:+UseConcMarkSweepGC^
-XX:+CMSParallelRemarkEnabled^
-XX:SurvivorRatio=8^
-XX:MaxTenuringThreshold=1^
-XX:CMSInitiatingOccupancyFraction=75^
-XX:+UseCMSInitiatingOccupancyOnly^
-Dcom.sun.management.jmxremote.port=7199^
-Dcom.sun.management.jmxremote.ssl=false^
-Dcom.sun.management.jmxremote.authenticate=false^
-Dlog4j.configuration=log4j-server.properties^
-Dlog4j.defaultInitOverride=true
7. Let's start each node one by one and check ring status as:
$CASSANDRA_HOME/bin/nodetool -h 127.0.0.2 -p 8081 ring
Figure 1-9 shows the ring status while connecting to one Cassandra node using JMX. Since Cassandra's architecture is
peer-to-peer, checking the ring status on any node will yield the same result.
Figure 1-9. The ring status
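Step 5's virtual-node option can be sketched briefly: with num_tokens set, each node claims that many tokens instead of a single initial_token, so token ranges interleave across nodes. A toy sketch using Murmur3's token range (the node names and seeds are illustrative, not Cassandra's actual assignment code):

```python
import random

# Murmur3Partitioner token range.
MURMUR3_MIN, MURMUR3_MAX = -2**63, 2**63 - 1

def assign_vnode_tokens(num_tokens, seed=None):
    """Pick num_tokens random tokens for one node, mimicking what
    enabling num_tokens (virtual nodes) does instead of one initial_token."""
    rng = random.Random(seed)
    return sorted(rng.randint(MURMUR3_MIN, MURMUR3_MAX) for _ in range(num_tokens))

# Three hypothetical nodes with num_tokens: 4 each; their ranges interleave.
ring = {node: assign_vnode_tokens(4, seed=i) for i, node in enumerate('ABC')}
for node, tokens in ring.items():
    print(node, tokens)
```

Because every node owns many small ranges scattered around the ring, adding or removing a node rebalances load across all the others instead of just its neighbors.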
Configuring Multiple Nodes over Amazon EC2
Amazon Elastic Compute Cloud (Amazon EC2) is one of the central parts of the Amazon Web Services (AWS) cloud
computing platform. AWS lets you choose an OS platform and provides the required hardware support over the
cloud, which allows you to quickly set up and deploy applications. To learn more about Amazon EC2 setup please
refer to http://aws.amazon.com/ec2/.
In this section, we will learn how to configure multiple Cassandra nodes over Amazon EC2. To do so,
follow these steps.
1. First, let's launch 2 instances of AMI (ami-00730969), as shown in Figure 1-10.
Figure 1-10. EC2 console display with 2 instances in running state
2. Modify the security group to open ports 9160, 7000, and 7199, as in Figure 1-11.
Figure 1-11. Configuring security group settings
3. Connect to each instance and download the Cassandra tarball as:
wget http://archive.apache.org/dist/cassandra/1.2.4/apache-cassandra-1.2.4-bin.tar.gz
4. Download and set up Java on each EC2 instance using the rpm installer as:
sudo rpm -i jdk-7-linux-x64.rpm
sudo rm -rf /usr/bin/java
sudo ln -s /usr/java/jdk1.7.0/bin/java /usr/bin/java
sudo rm -rf /usr/bin/javac
sudo ln -s /usr/java/jdk1.7.0/bin/javac /usr/bin/javac
5. Multiple-node configuration is the same as discussed in the previous
section. In this section we will demonstrate using a single token per node (initial_token).
Let's assign initial token values 0 and 1. We can assign initial_token values by modifying
the cassandra.yaml file on each node.
Figure 1-12. initial_token configuration for both nodes
6. Pick either one of the two as the seed node, and keep storage_port, jmx_port, and rpc_port
at 7000, 7199, and 9160.
7. Let's keep listen_address and rpc_address empty (the default is the node's inet
address, underlined in Figure 1-13).
Figure 1-13. How to get inet address for node
8. Let's start each node one by one and check ring status. Verify that both EC2 instances are
up, running, and connected in a ring topology. Figure 1-14 shows the ring status of
both running EC2 instances.
Figure 1-14. The two EC2 instances and their ring statuses
9. Figure 1-15 shows instance 10.145.213.3 is up and joining the cluster ring.
Summary
This introductory chapter covered general NoSQL concepts and Cassandra-specific configuration. For
application developers it is really important to understand the essence of replication and data distribution, and
most importantly how to set them up with Cassandra. Now we are ready for the next challenge: handling big data
with Cassandra! In the next chapter we will discuss Cassandra's storage mechanism and data modeling.
Understanding data modeling and Cassandra's storage architecture will help us model a data set and analyze the
best possible approaches available with Cassandra.
Figure 1-15. Node 10.145.213.3 is up and joining the cluster
Chapter 2
Cassandra Data Modeling
In the previous chapter we discussed Cassandra configuration, installation, and cluster setup. This chapter will walk
you through:
• Data modeling concepts
• Cassandra collection support
• CQL vs. Thrift based schema
• Managing data types
• Counter columns
Get ready to learn with an equal balance of theoretical and practical approaches.
Introducing Data Modeling
Data modeling is a mechanism to define read/write requirements and build a logical structure and object model.
Cassandra is a NoSQL database and promotes a ready-for-read (query-first) design instead of the relational model.
A ready-for-read design means analyzing your data read requirements first and storing the data in the same form.
Consider managing data volumes of petabytes or zettabytes, where we cannot afford in-memory computations
(e.g., joins) because of the data volume. Hence it is preferable to have the data set ready for retrieval and large data
analytics. Users need not know all columns up front, but they should avoid flat, normalized layouts that require
computations (e.g., aggregation, joins, etc.) at read time.
Cassandra is a column-family-oriented database. A column family, as the name suggests, is a "family of columns."
Each row in Cassandra may contain one or more columns. A column is the smallest unit of data, containing a name,
value, and timestamp (see Figure 2-1).
Figure 2-1. Cassandra column definition
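The name/value/timestamp triple in Figure 2-1 is what lets Cassandra reconcile concurrent writes: when two writes hit the same column, the one with the higher timestamp wins (last-write-wins). A toy in-memory sketch of that rule (the class and column names are illustrative):

```python
class Row:
    """Toy row: maps column name -> (value, timestamp), last-write-wins."""

    def __init__(self):
        self.columns = {}

    def write(self, name, value, timestamp):
        current = self.columns.get(name)
        # Keep the incoming column only if its timestamp is newer.
        if current is None or timestamp > current[1]:
            self.columns[name] = (value, timestamp)

    def read(self, name):
        entry = self.columns.get(name)
        return entry[0] if entry else None

row = Row()
row.write('fullname', 'vivek', timestamp=100)
row.write('fullname', 'vivek mishra', timestamp=200)
row.write('fullname', 'stale value', timestamp=150)  # older, so ignored
print(row.read('fullname'))  # vivek mishra
```

The third write is discarded even though it arrived last, because its timestamp is older than the stored column's.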
By default the Cassandra distribution comes with the cqlsh and Cassandra-cli command line clients for
manipulating data. Cassandra-cli and cqlsh (.sh and .bat) are available under the bin folder. Running these
command line clients on Linux, Windows, or Mac is fairly easy: on Linux and Mac, simply run cqlsh. Running
cqlsh on Windows, however, requires Python to be installed.
To install cqlsh on Windows, follow these steps:
1. First, download Python from https://www.python.org/ftp/python/2.7.6/
python-2.7.6.msi.
2. Add python.exe to PATH under environment variables.
3. Run setup.py, available under $CASSANDRA_HOME/pylib directory:
python setup.py install
4. Run cqlsh, available under bin directory (see Figure 2-2):
python cqlsh
Figure 2-2. successfully connected to cql shell
Figure 2-3. Cassandra’s supported data types
Data Types
Before CQL's evolution, data types in Cassandra were defined in the form of a comparator and a validator. The type
of a column value or row key value is referred to as a validator, whereas the type of a column name is called a
comparator. Available data types are shown in Figure 2-3.
Dynamic Columns
Since its inception, Cassandra has been projected as a schema-less, column-family-oriented distributed database.
The number of columns may vary for each row in a column family, and a column definition can be added
dynamically at run time. Cassandra-cli (Thrift) and cqlsh (CQL3) are the two command line clients we will be
using for various exercises in this chapter.
Dynamic Columns via Thrift
Let’s discuss a simple Twitter use case. In this example we would explore ways to model and store dynamic columns
via Thrift.
1. First, let’s create a keyspace twitter and column family users:
create keyspace twitter with strategy_options={replication_factor:1} and
placement_strategy='org.apache.cassandra.locator.SimpleStrategy';
use twitter;
create column family users with key_validation_class='UTF8Type' and
comparator='UTF8Type' and default_validation_class='UTF8Type';
Here, while defining a column family, we did not define any columns with the column
family. Columns will be added on the fly against each row key value.
2. Store a few columns in the users column family for row key value 'imvivek':
set users['imvivek']['apress']='apress author';
set users['imvivek']['team_marketing']='apress marketing';
set users['imvivek']['guest']='guest user';
set users['imvivek']['ritaf']='rita fernando';
Here we are adding followers as dynamic columns for user imvivek.
3. Let’s add 'imvivek' and 'team_marketing' as followers for 'ritaf':
set users['ritaf']['imvivek']='vivek mishra';
set users['team_marketing']['imvivek']='vivek mishra';
4. To view a list of rows in users column family (see Figure 2-4), use the following command:
list users;
In Figure 2-4, we can see column name and their values against each row key stored in step 3.
5. We can delete columns for an individual key as well. For example, to delete a column
'apress' for row key 'imvivek':
del users['imvivek']['apress'];
Figure 2-5 shows the number of columns for imvivek after step 5.
Figure 2-4. Output of selecting users
Figure 2-5. The number of columns for imvivek after deletion
Here column name is the follower’s twitter_id and their full name is column value. That’s how we can manage
schema and play with dynamic columns in Thrift way. We will discuss dynamic column support with CQL3 in Chapter 3.
Dynamic Columns via cqlsh Using Map Support
In this section, we will discuss how to implement the same Twitter use case using map support. Collection support in
Cassandra would work only with CQL3 binary protocol.
1. First, let’s create a keyspace twitter and column family users:
create keyspace twitter with replication = {'class':'SimpleStrategy',
'replication_factor':3};
use twitter;
create table users(twitter_id text primary key, followers map<text,text>);
2. Store a few columns in users column family for row key value 'imvivek':
insert into users(twitter_id,followers) values('imvivek',{'guestuser':'guest',
'ritaf':'rita fernando','team_marketing':'apress marketing'});
Here we are adding followers as dynamic columns as map attributes for user imvivek.
3. Let’s add 'imvivek' and 'team_marketing' as followers for 'ritaf':
insert into users(twitter_id,followers) values('ritaf',{'imvivek':'vivek mishra'});
insert into users(twitter_id,followers) values('team_marketing',
{'imvivek':'vivek mishra'});
4. To view list of rows in the users column family (see Figure 2-6), use the following command:
select * from users;
Figure 2-6. Map containing followers for user
5. To add 'team_marketing' as a follower for 'ritaf' and vice versa (see Figure 2-7), we can
simply add it as an element in users column family:
update users set followers = followers + {'team_marketing':'apress marketing'} where
twitter_id='ritaf';
update users set followers = followers + {'ritaf':'rita fernando'} where
twitter_id='team_marketing';
6. An update works as an insert if the row key doesn't exist in the database. For example,
update users set followers = followers + {'ritaf':'rita fernando'} where
twitter_id='jhassell'; // update as insert
Figure 2-8 shows that ritaf has been added as a follower of jhassell.
Figure 2-7. After update map of followers for each user
Figure 2-8. Update works as an insert for map of followers for nonexisting row key (e.g., twitter_id)
Figure 2-9. After deleting guestuser as a follower for imvivek
7. To delete an element from the map we need to execute, use this command:
delete followers['guestuser'] from users where twitter_id='imvivek';
You can see that the list of followers for imvivek is reduced to four followers after deletion (see Figure 2-9).
With that said, we can add a dynamic column as a key-value pair using collection support.
Dynamic Columns via cqlsh Using Set Support
Consider a scenario where the user wants to store only a collection of follower’s id (not full name). Cassandra offers
collection support for keeping a list or set of such elements. Let’s discuss how to implement it using set support.
1. First, let’s create a keyspace twitter and column family users.
create keyspace twitter with replication = {'class':'SimpleStrategy',
'replication_factor':3};
use twitter;
create table users(twitter_id text primary key, followers set<text>);
2. Store few columns in users column family for row key value 'imvivek'.
insert into users(twitter_id,followers) values('imvivek',
{'guestuser','ritaf','team_marketing'});
Here we are adding followers as dynamic columns as set attributes for user imvivek.
3. Let’s add the following:
'imvivek' and 'team_marketing' as followers for 'ritaf'
'ritaf' as a follower for 'jhassell'
insert into users(twitter_id,followers) values('ritaf', {'imvivek','jhassell',
'team_marketing'});
insert into users(twitter_id,followers) values('jhassell', {'ritaf'});
4. To view the list of rows in users column family (see Figure 2-10), use the following command:
select * from users;
Figure 2-10. Followers for ritaf, jhassell, and imvivek have been added
5. We can update the collection to delete an element as follows. Figure 2-11 shows the result:
update users set followers = followers - {'guestuser'} where twitter_id = 'imvivek';
Collection support can be a good alternative for adding dynamic columns in Cassandra.
A composite key is a combination of multiple table fields, where the first part is referred to as the partition key
and the remaining part of the composite key is known as the clustering key. Chapter 3 will discuss achieving
dynamic columns using composite columns.
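The partition key / clustering key split can be sketched with a toy storage model: all rows sharing a partition key are stored together (they hash to the same token, hence the same replicas), and within a partition the rows are kept sorted by clustering key. This is only an illustration; real Cassandra hashes the partition key with the configured partitioner:

```python
from collections import defaultdict

def store(rows):
    """rows: (partition_key, clustering_key, value) triples.

    Groups rows by partition key and keeps each partition sorted by
    clustering key, mimicking the on-disk layout of a composite key.
    """
    partitions = defaultdict(dict)
    for pk, ck, value in rows:
        partitions[pk][ck] = value
    return {pk: sorted(cols.items()) for pk, cols in partitions.items()}

# Hypothetical tweets keyed by (user, date): user is the partition key,
# date the clustering key, so one user's tweets sit together in date order.
tweets = [
    ('imvivek', '2013-06-02', 'second tweet'),
    ('imvivek', '2013-06-01', 'first tweet'),
    ('ritaf',   '2013-06-01', 'hello'),
]
layout = store(tweets)
print(layout['imvivek'])  # clustering keys come back in sorted order
```

This is why range queries over the clustering key within one partition are cheap, while queries spanning partition keys are not.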
Secondary Indexes
In a distributed cluster, data for a column family is distributed across multiple nodes, based on replication factor and
partitioning schema. However data for a given row key value will always be on the same node. Using the primary
index (e.g., Row key) we can always retrieve a row. But what about retrieving it using non-row key values?
Cassandra provides support for adding indexes over column values, called secondary indexes. Chapter 3 will cover
more about indexes, so for now let's just take a look at a simple secondary index example.
Let’s discuss the same Twitter example and see how we can utilize and enable secondary index support.
1. First, let’s create twitter keyspace and column family users.
create keyspace twitter with replication = {'class' : 'SimpleStrategy' ,
'replication_factor' : 3};
use twitter;
create table users(user_id text PRIMARY KEY,fullname text,email text,password text,
followers map<text, text>);
2. Insert a user with e-mail and password:
insert into users(user_id,email,password,fullname,followers) values ('imvivek',
'imvivek@xxx.com','password','vivekm',{'mkundera':'milan kundera','guest': 'guestuser'});
Before we move ahead with this exercise, it's worth discussing which columns should be indexed.
Any read request using a secondary index will actually be broadcast to all nodes in the cluster. Cassandra
maintains a hidden column family for the secondary index locally on each node, which is scanned to retrieve rows via
secondary indexes.
While performing data modeling, we should create secondary indexes over column values that will return
a big chunk of data from a very large data set. Indexes over unique values or small data sets would simply become an
overhead, which is not a good data modeling practice. The fullname column is a possible candidate for indexing.
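The hidden column family can be pictured as a node-local map from a column value to the row keys holding it. The sketch below is plain illustrative Python with invented rows: a repeated value like a full name returns a useful set of rows, whereas indexing a unique value would yield one row per index entry, which is pure overhead.

```python
# Sketch of a node-local secondary index: column value -> row keys.
rows = {
    "imvivek":  {"fullname": "vivekm"},
    "mkundera": {"fullname": "milan kundera"},
    "vivek2":   {"fullname": "vivekm"},
}

# Build the "hidden column family" for fullname on this node.
fullname_idx = {}
for row_key, cols in rows.items():
    fullname_idx.setdefault(cols["fullname"], set()).add(row_key)

# SELECT * FROM users WHERE fullname = 'vivekm';
print(sorted(fullname_idx.get("vivekm", set())))  # ['imvivek', 'vivek2']
```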
3. Let's create a secondary index over fullname:
create index fullname_idx on users(fullname);
Figure 2-11. Updated set of followers after removing guestuser for imvivek
After successful index creation, we can fetch records using fullname. Figure 2-12 shows the result.
Figure 2-12. Search user for records having fullname value vivekm
Figure 2-13. Selecting all users of age 51
4. Let’s add a column of age and create the index:
alter table users add age text;
create index age_idx on users(age);
update users set age='32' where user_id='imvivek';
insert into users(user_id,email,password,fullname,followers,age) values
('mkundera','mkundera@outlook.com','password','milan kundera',{'imvivek':'vivekm',
'guest': 'guestuser'},'51');
Figure 2-13 shows the outcome.
5. Let’s alter data type of age to int:
alter table users alter age type int;
It will result in the following error:
TSocket read 0 bytes (via cqlsh)
6. To alter data type of indexed columns we need to rebuild them:
drop index age_idx;
alter table users alter age type int;
But please note that in such cases, it may result in the data set being in an incompatible state (see Figure 2-14).
Here is the error:
Failed to decode value '51' (for column 'age') as int: unpack requires a string argument of length 4
Failed to decode value '32' (for column 'age') as int: unpack requires a string argument of length 4
Hence it is recommended to change data types on indexed columns only when there is no data stored for that column.
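One way to see why these decode errors occur: a CQL int is serialized as 4 big-endian bytes, while the old text values '51' and '32' are only 2 bytes each. A short Python sketch with the standard struct module (illustrative, not Cassandra's actual codec) reproduces the same class of failure:

```python
import struct

old_text_value = b"51"  # 2 bytes, written while age was still of type text
try:
    struct.unpack(">i", old_text_value)  # a CQL int needs exactly 4 bytes
except struct.error as e:
    print("decode failed:", e)

# A value written after the type change decodes fine:
print(struct.unpack(">i", struct.pack(">i", 51))[0])  # 51
```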
Indexes over collections are not supported in Cassandra 2.0. Figure 2-15 shows what happens if we try to create
an index over the followers collection. However, before this book went to press, version 2.1 was released and added this
capability. See “Indexing on Collection Attributes” in Chapter 11.
Figure 2-14. Error while changing data type to int from string
Figure 2-15. Indexes over collections are not supported in Cassandra 2.0
■ Note Updates to the data type of clustering keys and indexes are not allowed.
CQL3 and Thrift Interoperability
Prior to CQL's existence, Thrift was the only way to develop an application over Cassandra. CQL3 and Thrift
interoperability issues are often discussed within the Cassandra community.
Let’s discuss some issues with a simple example:
1. First, let’s create a keyspace and column family using CQL3.
create keyspace cql3usage with replication = {'class' : 'SimpleStrategy' ,
'replication_factor' : 3};
use cql3usage;
create table user(user_id text PRIMARY KEY, first_name text, last_name text,
emailid text);
2. Let’s insert one record:
insert into user(user_id,first_name,last_name,emailid)
values('@mevivs','vivek','mishra','vivek.mishra@xxx.com');
Figure 2-16. Describes table user
3. Now, connect with Cassandra-cli (the Thrift way) and update the user column family to
create indexes over last_name and first_name:
update column family user with key_validation_class='UTF8Type' and
column_metadata=[{column_name:last_name, validation_class:'UTF8Type', index_type:KEYS},
{column_name:first_name, validation_class:'UTF8Type', index_type:KEYS}];
■ Note Chapter 3 will cover indexing in detail.
4. Now explore the user column family with CQL3, and see the result in Figure 2-16.
describe table user;
Metadata has been changed, and the columns (first_name and last_name) modified via Thrift are no longer
available with CQL3! Don't worry: data is not lost, as CQL3 and Thrift rely on the same storage engine, and we can
always get the metadata back by rebuilding it.
5. Let’s rebuild first_name and last_name:
alter table user add first_name text;
alter table user add last_name text;
The problem is with CQL3's sparse tables. CQL3 has different metadata (CQL3Metadata) that has NOT been
added to Thrift's CFMetaData. Do not mix and match CQL3 and Thrift to perform DDL/DML operations; it will always
leave one of these metadata in an inconsistent state.
A developer who can't afford losing Thrift's dynamic column support may still prefer to perform inserts via Thrift,
but to read them back via CQL3. It is recommended to use CQL3 for new application development over Cassandra.
However, it has been noticed that Thrift-based mutation still works faster than CQL3 (such as batch operations) in the
Cassandra 1.x.x releases. This was scheduled to be addressed in the Cassandra 2.0.0 release (https://issues.apache.org/
jira/browse/CASSANDRA-4693).
Changing Data Types
Changing data types in Cassandra is possible in two ways: the Thrift way and the CQL3 way.
Thrift Way
Let’s discuss more about data types with legacy Thrift API:
1. Let’s create a column family with minimal definition, such as:
create keyspace twitter with strategy_options={replication_factor:1} and
placement_strategy='org.apache.cassandra.locator.SimpleStrategy';
use twitter;
create column family default;
Default data type for comparator and validator is BytesType.
2. Let’s describe the keyspace and have a look at the default column family (see Figure 2-17):
describe twitter;
Figure 2-18. Error while storing string value but column value is of bytes type
Figure 2-17. Structure of twitter keyspace
3. Let's try to store some data in the column family:
set default[1]['type']='bytes';
Figure 2-18 shows that this produces an error.
Since the comparator and validator are set to the default data type (i.e., BytesType), Cassandra-cli is not able to
parse and store such requests.
4. To get step 3 working, we need to use the assume function to provide some hint:
assume default keys as UTF8Type;
assume default comparator as UTF8Type;
assume default validator as UTF8Type;
5. Now let's try to change the comparator from BytesType to UTF8Type:
update column family default with comparator='UTF8Type';
This generates an error because changing the comparator type is not allowed (see Figure 2-19).
Figure 2-19. Changing comparator type is not allowed
Figure 2-20. Retrieving values using cql shell
6. Although changing comparator type is not allowed, we can always change the data type of
the column and key validation class as follows:
update column family default with key_validation_class=UTF8Type and
default_validation_class = UTF8Type;
Columns in a row are sorted by column name, and that's where the comparator plays a vital role. Based on the
comparator type (i.e., UTF8Type, Int32Type, etc.), columns are stored in sorted order.
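The comparator can be thought of as the sort key for column names. The following illustrative Python snippet (not Cassandra code) shows how the same column names order differently under a string comparator versus an integer comparator, which is what choosing UTF8Type versus Int32Type controls:

```python
# Illustrative: the comparator decides the on-disk order of column names in a row.
column_names = ["10", "2", "1"]

utf8_order = sorted(column_names)            # UTF8Type: lexicographic order
int_order  = sorted(column_names, key=int)   # Int32Type: numeric order

print(utf8_order)  # ['1', '10', '2']
print(int_order)   # ['1', '2', '10']
```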
CQL3 Way
Cassandra CQL3 is the driving factor at present. Most of the high-level APIs support it and are extending further
development around it.
Let's discuss a few tricks for dealing with data types the CQL3 way. We will explore the default column
family created the Thrift way (see the preceding section).
1. Let’s try to fetch rows from the default column family (see Figure 2-20).
select * from default;
2. Let’s issue the assume command and try to fetch rows from the default column family in
readable format:
assume default(column1) values are text;
assume default(value) values are text;
assume default(key) values are text;
select * from default;
Figure 2-21 shows the result.
Figure 2-21. Retrieving after assume function is applied
3. typeAsBlob or blobAsType functions can also be used to marshal data while running
CQL3 queries:
select blobAsText(key),blobAsText(type),blobAsText(value) from default;
4. We can alter the data type of validator as follows:
alter table default alter value type text;
alter table default alter key type text;
■ Note The assume command will not be available after the Cassandra 1.2.X release. As an alternative we can use the
typeAsBlob (e.g., textAsBlob) CQL3 functions.
Counter Column
Distributed counters are incremental values of a column partitioned across multiple nodes. Counter columns can be
useful for providing counts and aggregation analytics for Cassandra-powered applications (e.g., number of page hits,
number of active users, etc.).
In Cassandra, a counter is a 64-bit signed integer. A write to a counter requires a read from replica nodes
(this depends on the consistency level; the default is ONE). While reading a counter column value, the read has to be consistent.
Counter Column with and without replicate_on_write
The default value of replicate_on_write is true. If it is set to false, the write goes to only one replica node (irrespective of
the replication factor). That might be helpful to avoid a read-before-write while serving the write request, but any subsequent
read may not be consistent and may also result in data loss (if that single replica node is gone!).
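This effect can be simulated without a cluster. The following Python sketch is purely illustrative (invented structures, not Cassandra internals): an increment with replicate_on_write disabled lands on a single replica, so a later read at consistency ONE from the other replica sees stale data.

```python
# Sketch: replicate_on_write=false applies an increment to a single replica,
# so a later read at consistency ONE may hit a replica that never saw it.
replicas = [{"pagecount": 0}, {"pagecount": 0}]  # two replicas of one row

def increment(amount, target_replica):
    # With replicate_on_write=false, only one replica is updated.
    replicas[target_replica]["pagecount"] += amount

def read(from_replica):
    return replicas[from_replica]["pagecount"]

increment(2, target_replica=0)
print(read(0))  # 2 -> the replica that took the write
print(read(1))  # 0 -> a stale replica: the inconsistent "zero rows" effect
```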
Play with Counter Columns
In Chapter 1 we discussed setting multiple clusters on a single machine. First let’s start with a cluster of three nodes on
a single machine. (Please refer to the “Configuring Multiple Nodes on a Single Machine” section in Chapter 1.) In this
recipe we will discuss the do’s and don’ts of using counter columns.
1. Let's create a keyspace counterkeyspace:
create keyspace counterkeyspace with replication = {'class' : 'SimpleStrategy',
'replication_factor' : 2 };
2. Create a column family counternoreptable with replicate_on_write as false:
create table counternoreptable(id text PRIMARY KEY, pagecount counter)
with replicate_on_write='false';
3. Update pagecount to increment by 2 as follows:
update counternoreptable set pagecount=pagecount+2 where id = '1';
4. Select from the column family as follows:
select * from counternoreptable;
As shown in Figure 2-22, this results in zero rows. Whether it returns zero rows may depend on which node
the increment was written to.
Figure 2-22. Inconsistent result on fetching from counter table
Figure 2-23. Retrieving from the counter table after incrementing the counter column value
5. Let’s update pagecount for some more values and verify the results:
update counternoreptable set pagecount=pagecount+12 where id = '1';
select * from counternoreptable;
Figure 2-23 shows the result of this command.
update counternoreptable set pagecount=pagecount-2 where id = '1';
select * from counternoreptable;
The result is different for this command (see Figure 2-24).
Figure 2-24. Inconsistent result of counter column without replication
You can see the inconsistent results on read with replicate_on_write set to false. From this we can conclude that by
disabling such parameters we may avoid a read-before-write on each write request, but subsequent read requests may
return inconsistent data. Also, without replication we may suffer data loss if the single replica containing an updated
counter value goes down or is damaged. Try the above recipe with replicate_on_write set to true and monitor whether
the results are consistent and accurate!
■ Note You may refer to https://issues.apache.org/jira/browse/CASSANDRA-1072 for more on counter columns.
Data Modeling Tips
Cassandra is a column-oriented database that is quite different from a traditional RDBMS. We don't need to define
schema up front, but it is always better to get a good understanding of the requirements and the database before moving
ahead with data modeling, including:
• Writes in Cassandra are relatively fast, but reads are not. Pre-analysis of how we want to perform read operations is always very important to keep in mind before data modeling.
• Data should be denormalized as much as possible.
• Choose the correct partitioning strategy, to avoid rebuilding/repopulating data over an updated partitioning strategy.
• Prefer using surrogate keys and composite keys (over super columns) while modeling a table/column family.
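The denormalization tip can be made concrete with a small, hypothetical sketch (plain Python; the names tweets_by_user and tweets_by_tag are invented): the same fact is written once per query pattern, so each read becomes a single lookup instead of a join.

```python
# Sketch: denormalize one tweet into two query-specific "tables"
# so that each read is a single-partition lookup. Names are invented.
tweets_by_user = {}  # user_id -> list of tweets
tweets_by_tag  = {}  # hashtag -> list of tweets

def post_tweet(user_id, tag, text):
    # One logical write fans out to every table that must serve a query.
    tweets_by_user.setdefault(user_id, []).append(text)
    tweets_by_tag.setdefault(tag, []).append(text)

post_tweet("imvivek", "#cassandra", "modeling around queries")
print(tweets_by_user["imvivek"])    # ['modeling around queries']
print(tweets_by_tag["#cassandra"])  # ['modeling around queries']
```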
Summary
To summarize a few things discussed in this chapter so far:
• Do not mix Thrift and CQL3 for DDL and DML operations, although reads should be fine.
• Avoid changing data types.
• Use Cassandra collection support for adding columns on the fly.
In Chapter 3, we will continue our discussion by exploring indexes, composite columns, and the latest features
introduced in Cassandra 2.0, such as Compare and Set.
Chapter 3
Indexes and Composite Columns
In previous chapters we have discussed big data problems, Cassandra data modeling concepts, and various schema
management techniques. Although you should avoid normalizing your data too much, you still need to
model read requirements around columns rather than primary keys in your database applications.
The following topics will be covered in this chapter:
• Indexing concept
• Data partitioning
• Cassandra read/write mechanism
• Secondary indexes
• Composite columns
• What's new in Cassandra 2.0
Indexes
An index in a database is a data structure for faster retrieval of a data set in a table. Indexes can be made over single or
multiple columns.
Indexing is the process of creating and managing a data structure, called an index, for fast data retrieval. Each index
consists of indexed field values and references to physical records. In some cases a reference can be the actual row
itself; we will discuss these cases in the clustered indexes section.
Physically, data is stored in blocks in data structure form (like the sstable in Cassandra). These data blocks are
unordered and distributed across multiple nodes. Accessing data records without a primary key or index would
require a linear search across multiple nodes. Let's discuss the format of the index data structure.
Indexes are stored in sorted order in a B-tree (balanced tree) structure, where index entries are leaf nodes under
branch nodes. Figure 3-1 depicts data storage where multi-level leaf nodes (0,1) are indexed in sorted order and
the data is in unsorted order. Here each leaf node is a b-tree node containing multiple keys. Based on inserts/updates/
deletes, the number of keys per b-tree node keeps changing, but always in sorted order.
Let's simplify further. In Figure 3-2, the table containing ages and row keys represents the leaf nodes, and the other one is the
physical table.
Figure 3-1. b-tree Index and data structure with multi-level leaf nodes
Figure 3-2. A physical table and an index table as leaf node
This allows faster retrieval of records using binary search. Since a b-tree keeps data sorted for faster searching, it
introduces some overhead on insert, update, and delete operations, which require rearranging indexes. The b-tree
is the preferred data structure for large sets of reads and writes, which is why it is widely used with distributed databases.
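The payoff of keeping index entries sorted is binary search over the leaf entries. Here is a minimal Python illustration using the standard bisect module (the age-to-row-key pairs are invented, echoing Figure 3-2):

```python
import bisect

# Sorted index entries: (age, row_key) pairs, like the leaf nodes in Figure 3-2.
index = [(25, "row3"), (32, "row1"), (51, "row2")]
ages = [age for age, _ in index]

def lookup(age):
    # Binary search over the sorted leaf entries: O(log n) instead of a scan.
    i = bisect.bisect_left(ages, age)
    if i < len(ages) and ages[i] == age:
        return index[i][1]
    return None

print(lookup(32))  # row1
print(lookup(40))  # None
```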
Clustered Indexes vs. Non-Clustered Indexes
Indexes that are maintained independently from physical rows and don’t manage ordering of rows are called
non-clustered indexes (see Figure 3-1). On the other hand, clustered indexes will store actual rows in sorted order for
the index field. Since a clustered index will store and manage ordering of physical rows, only one clustered index is
possible per table.
The important question is in which scenarios we should use clustered indexes versus non-clustered indexes. For
example, a department can have multiple employees (a many-to-one relation), and it is often required to read employee
details by department. Here department is a suitable candidate for a clustered index: all rows containing employee
details would be stored and ordered by department for faster retrieval. Employee name, on the other hand, is a perfect candidate for
a non-clustered index; we can hold multiple non-clustered indexes in a table, but there will always be a single
clustered index per table.
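The distinction can be sketched in a few lines of Python (illustrative only, with invented data): the clustered case stores the rows themselves sorted by department, while the non-clustered case keeps a separate map from employee name back to a row position.

```python
# Clustered index: the physical rows are stored in index order (one per table),
# so all rows for one department are contiguous.
employees = [
    ("accounts", "rita"),
    ("engineering", "jonathan"),
    ("engineering", "vivek"),
]  # kept sorted by department

dept_rows = [name for dept, name in employees if dept == "engineering"]

# Non-clustered index: a separate structure pointing back at row positions.
name_idx = {name: pos for pos, (_, name) in enumerate(employees)}

print(dept_rows)                    # ['jonathan', 'vivek']
print(employees[name_idx["rita"]])  # ('accounts', 'rita')
```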
Index Distribution
With distributed databases, data gets distributed and replicated across multiple nodes. Retrieval of a data collection
would require fetching rows from multiple nodes. Opting for indexes over a non-row key column would also require
being distributed across multiple nodes, such as shards. Long-running queries can benefit from such shard-based
indexing for fast retrieval of data sets.
Due to peer-to-peer architecture each node in a Cassandra cluster will hold an identical configuration. Data
replication, eventual consistency, and partitioning schema are two important aspects of data distribution.
Please refer to Chapter 1 for more details about replication factor, strategy class, and read/write consistency.
Indexing in Cassandra
Data on a Cassandra node is stored locally for each row. Rows are distributed across multiple nodes, but all columns
for a particular row key will be stored locally on a node. Cassandra by default provides the primary index over row key
for faster retrieval by row key.
Secondary Indexes
Indexes over column values are known as secondary indexes. These indexes are stored locally on a node where
physical data resides. That allows Cassandra to perform faster index-based retrieval of data. Secondary indexes are
stored in a hidden column family and internally managed by the node itself.
Let’s explore more on secondary indexes with a simple exercise.
1. First, let's create a keyspace twitter and column family users.
create keyspace twitter with replication = { 'class':'SimpleStrategy' ,
'replication_factor':3};
use twitter;
create table users (user_id uuid primary key, first_name text, twitter_handle text);
2. Let’s create index over first_name using create index syntax (see Figure 3-3).
create index fname_idx on users(first_name);
3. Describe table users:
describe table users;
Figure 3-3 shows users schema with index created on first_name.
Figure 3-4. Fetching users by first_name
4. Let’s insert a few rows in the users column family.
insert into users(user_id,first_name,twitter_handle) values(now(),'apress','#apress_team');
insert into users(user_id,first_name,twitter_handle) values(now(),'jonathan','#jhassell');
insert into users(user_id,first_name,twitter_handle) values(now(),'vivek','#mevivs');
insert into users(user_id,first_name,twitter_handle) values(now(),'vivek','#vivekab');
5. Let’s try to find records using the indexed column first_name (see Figure 3-4).
select * from users where first_name='vivek';
Figure 3-4 shows output of fetching users having first name vivek.
Figure 3-3. Users table with index on first_name
The query over the indexed column first_name with value 'vivek' (Figure 3-4) returns two rows. The two rows may be
available on the same node or on different nodes.
One point worth mentioning here is that the index entries are stored locally along with the data rows, which
ensures data locality.
On the other hand, if we try to fetch rows using column twitter_handle, which is non-indexed:
select * from users where twitter_handle='#imvivek';
Random documents with unrelated
content Scribd suggests to you:
U nearly touches line below, and O of POST line above. Buff,
Orange, Amber and Dark Manila.
POST 8 mm. U large and far from left oval. S and P near. The
latter is in a nearly vertical position and stands well to the left of the
point. POST equally spaced. T far from right oval. T of TWO
near left oval; WO close. OC near. C vertical, and at top near
point of inner frame line. EN well spaced. S near right oval. Nose
near left oval. Top of left figure 2 near point of oval. U line passes
close to head of E and touches the latter at base.
VARIETY 13. (24-1/2 × 25-3/4 mm.)
Hair projecting. CE on level and nearly touch at top. Buff, Orange
and Dark Manila.
POST 8 mm. U large and nearer to left oval than in Var. 12. U.
S. near. SP. near. P a little inclined to left and to left of the point.
POST spaced near. T far from right oval. T of TWO close to
left oval. WO and OC close. ENT close. S far from right oval.
Nose near left oval. E line passes near right stroke of U.
CLASS V.
Point of Bust over middle of O.
VARIETY 14. (23-1/2 × 26 mm.)
OS far apart. S of CENTS near oval line. Buff, Orange, Dark
Manila.
POST 8 mm. U large, far from left oval, and near inner frame
line. SP wide at top. PO near, but top of letters some distance
from outer frame line. T far from right oval. T of TWO close to
left oval. OC near and top of C under the point. CE wide at
base. EN widely spaced. NT wide at base. TS near. Nose near
left oval. Figures of value well centered in ovals. W line touches top
of P. A deterioration of this variety in which the nose almost
touches left oval and TW touch upper and lower frame lines is
called 14a.
VARIETY 15. (25 × 16 mm.)
Bust touches line over center of O. Buff, Orange, Amber, Dark
Manila.
POST 8 mm. U large, near left oval and at top far from outer
frame line. P to left of point. O well to right of point and slanting
to right. OST near. T far from right oval. T of TWO close to
left oval. WO close. OC wide. C low and touching outer frame
line. ENTS spaced near, but S far from right oval. Nose near left
oval. Left figure 2 well centered, but right figure 2 much nearer
to inner frame line. W line falls between base of S and the
period. A deterioration of this die is Var. 15a.
VARIETY 16. (24-3/4 × 26-1/4 mm.)
Bust nearly touches line to right of O. Buff, Orange.
POST 8 mm. U wide and far from left oval. P to left of point and
close to outer frame line. PO wide. O far to right of point. OST
near. T far from right oval. T of TWO far from left oval. Inner
frame line is some distance from top of letters WO of TWO and
N of CENTS. OC wide. CE near but EN wide. S far from
right oval. Nose far from left oval. Left figure 2 well centered, but
right figure 2 much nearer to inner oval line.
D. DIES.
25-1/2 to 26-1/4 mm.
NOTE:—In Var. 17, 18, 23, 24, 31, and 34 the word POST is short
and spaced closely. Var. 22 has the narrow U, and Var. 21, 27, 38,
39 and 40 show the widest spacing of POST.
CLASS III.
Point of Bust over last bar of W.
VARIETY 17. (26-1/4 × 25-1/2 mm.)
O of POST considerably above level of P. Wide space, after S
of CENTS. Buff, Orange and Amber.
POST 7-1/2 mm. U near left oval and near inner frame line.
U.S. close. P far to left of point; O near point. OST close. T
very far from right oval. T of TWO, far from left oval. WO near.
OC near. CE close at top. N above level of E. NT close to
inner frame line. Nose far from left oval. Figures well centered. U
line touches O at right.
VARIETY 18. (26 × 25-3/4 mm.)
OC very near and O nearly touching line below. Buff, Orange and
Amber.
POST 8 mm. U wide, slanting sharply to left and near left oval.
P is to left of point and slants to the left. POS near, but ST
spaced wider. T very far from right oval. T of TWO close to left
oval. WO close. CE close at top. EN well spaced at top. NTS
near and S close to right oval. Nose near left oval. U line touches
base of N. Envelopes only.
VARIETY 19. (26 × 25-3/4 mm.)
Letters evenly spaced, those in upper label almost in vertical
position. Amber and Light Manila.
POST 8 mm. U wide, nearly vertical and far from left oval. U.S.
wide. P vertical and to left of point. POS widely spaced. ST
near. T very far from right oval. T of TWO far from left oval and
top stroke of T nearly touches W. WO near. OC near. C
vertical but a little below E. Top stroke of T of CENTS close to
inner frame line. S near right oval. Nose near left oval. Figures well
in center of ovals. T line touches top of E.
VARIETY 20. (25-1/2 × 25-1/2 mm.)
Sharp point at base of right 2. Amber and Light Manila.
POST 8 mm. U wide and near left oval. P nearly vertical and to
left of point. Top of O almost touches outer frame line. Base of S
and T close to inner frame line. T of TWO far from left oval.
WO very close. OC close. CENTS close and S far from right
oval. Nose far from left oval. T line touches O to right.
VARIETY 21. (26 × 25-1/2 mm.)
ST and OC extremely wide. Point of bust far from line. Sharply
pointed nose. Amber and Light Manila.
POST 9 mm. U wide, near left oval, and sharply slanting to left.
U.S. and SP very wide. P to left of point and slanting a little to
the right. PO very wide. O far to right of point and turned to
right. OS wide. T near right oval, T of TWO close to left oval.
TW very wide at base. WO close. C low and nearly under the
point. ENTS near and S close to right oval. Nose pointed and far
from left oval. Figures well centered. U line passes from tip of E
to base of N.
CLASS IV.
Bust points to left line of O.
VARIETY 22. (25-1/2 × 26 mm.)
Narrow U, the only one in DIE D. Buff.
Extremely rare. POST 7-1/2 mm. U nearly vertical and far from
left oval. P small near the point and at top far from outer frame
line. O far to right of point. POST equally spaced. T far from
right oval. T of TWO near left oval. WO close. OC wide, C
slants sharply to right and at base is within the angle, formed by the
outer curves. CENTS are on the same level. S near right oval. The
inner curves are far from top of letters WO and CENTS. Nose
near left oval. In both side ovals the downstroke of figure 2 ends in
a sharp point. U line touches O to left. Buff envelope only. Knife
2.
VARIETY 23. (26 × 25 mm.)
Extremely wide space before U and after T in upper label. Bust
pointed. Amber and Light Manila.
POST 7-1/2 mm. U wide. The inner curves of the label are close
to the inscription. P nearly vertical. POS close. ST near. T of
TWO close to left oval. WO near. OC near but C slants from
left to right and its base touches the outer frame line. Top of vertical
stroke of E close to inner point. EN well spaced at top. S slants
to right and is close to right oval. Nose very far from right oval.
Figure 2 in left oval is lower than figure 2 in right oval. W line
passes through middle of U.
VARIETY 24. (26 × 26 mm.)
O above level of P, C sharply turned to left. Buff Orange and
Light Manila.
POST 7-1/2 mm. U wide, inclined to left and near left oval. U.S.
near. SP near. P slanting to left and near the point. POST about
equally spaced but OST high nearly touching outer frame line at
top. T far from right oval. T of TWO far from left oval. WO
near. OC near. EC close at top. ENT well spaced. S near right
oval. Nose close to left oval. Figures in oval well centered. C line
passes between O of TWO and C of CENTS.
VARIETY 25. (25-1/2 × 26 mm.)
P tipped sharply to left and O to right. Buff and Orange.
POST 8 mm. U wide and far from left oval. Base of U, close to
inner frame line, but top of S close to outer frame line. U S P
near. P far to left and O in line with point. POS near. T far
from S and far from right oval. T of TWO near left oval. WO
close. OC close. CE on level but E slanting to right. TS close.
S near right oval. Nose some distance from left oval. Figures in
ovals well centered. Envelopes only.
VARIETY 26. (26 × 26 mm.)
P nearly on a level with O. POST close. OC near. Amber and
Light Manila.
POST 8 mm. U wide slanting to left, and far from left oval. US.
wide. SP wide. P to left of point and nearly vertical. T very far
from right oval. T of TWO near left oval. WO close. OC near.
C vertical. CE close. EN near. NTS close. S far from right
oval. Nose near oval. Figures well centered in ovals. T line passes
close to junction point of inner frame lines, and touches C to left.
VARIETY 27. (26-1/2 × 25-1/2 mm.)
Sharp point of bust high above left of O. Amber and Light Manila.
POST 9-3/4 mm. U wide slanting considerably to left and near
left oval. The entire inscription in upper label is widely spaced, but
OS widest. T slants sharply to right, nearly touches outer frame
line and is far from right oval. T of TWO close to left oval. WO
near. OC wide. The junction point of the inner frame lines is over
the center of C, which is low. EN well spaced and close to inner
frame line. S nearly horizontal and close to right oval. Nose near
left oval. Downstroke of right figure 2 near inner oval line. T line
passes through first stroke of W of TWO.
VARIETY 27a. (26-1/4 × 25-1/2 mm.)
POST 9-3/4 mm. Same as last variety, but appearing to be
different. This is due to great deterioration of the die. It is found on
a wrapper only and is rather scarce.
CLASS V.
Bust points to middle of O.
VARIETY 28. (26 × 26 mm.)
ST close. Wide space after S of CENTS. Buff and Orange.
Post 7-1/2 mm. U wide, nearly vertical and near the left oval.
U.S. near. PO near, but O slightly above P. There is a wide
space between OS. T near right oval. T of TWO far from left
oval. WO very close. OC near. CE close and top of E under
the point. EN wide, especially at top: N slightly above E. NTS
close. Nose near left oval. Figures well centered in ovals. U line
cuts top of O of TWO at right. Envelopes only.
VARIETY 29. (25 × 25-3/4 mm.)
Space before U and after T extremely wide. Light Manila.
POST 7-1/2 mm. U wide. U.S. near and both letters close to
inner frame line. P well to left of point and on a level with O. O
close to point. POS near, but T further from S. T of TWO
close to left oval. WO near. OC near and C under the point. E
quite a distance to right of point. EN wide. NTS near right oval.
Nose far from left oval. Figures well centered in ovals. U line passes
through middle of C of CENTS. Point of bust very broad.
Wrappers only.
VARIETY 30. (26 × 25-1/2 mm.)
Nose far from oval line. Amber and Light Manila.
POST 7-1/2 mm. U wide, nearly vertical and near left oval U.S.
wide. SP widely spaced. PO close and nearly on a level, OST
near. T far from right oval. T of TWO far from left oval. WO
near, but OC wide. CE on level and close at top. EN well
spaced. TS wide at base. S far from right oval. Nose far from left
oval. Figures well centered in ovals. E line touches S of U.S. at
the right.
VARIETY 31. (25-3/4 × 25-3/4 mm.)
P considerably above O. Point of bust square and nearly touches
line. Buff and Orange.
Post 7-1/2 mm. U wide, inclined to left, and near left oval. S
close to inner frame line. Top of P close to outer frame line. POST
near. T far from right oval. T of TWO near left oval and base of
T some distance from outer frame line. WO near. OC very wide.
C low. Back stroke of E almost touches the point. EN wide and
N high. NT wide at top. TS close. S near right oval. Nose near
left oval. Figures well centered in ovals. T line passes through
center of U of U. S.
VARIETY 32. (26 × 26-1/4 mm.)
Bust ends in a sharp point, which nearly touches line over centre of
O of TWO. Orange and light manila.
POST 7-1/4 mm. U rather short, inclined to left and near left oval.
SP wide at top. P near point and above level of O. PO near
but O slanting to right. OS well spaced, but S low. ST wide.
T far from left oval. WO close. C of CENTS almost touches
outer frame line and CE close at base. ENTS close and S near
right oval. Nose near left oval. Figures well centered in ovals. U line
passes slantingly from top of E to base.
VARIETY 33. (25-3/4 × 25-3/4 mm.)
Projecting hair. Wide space after S of CENTS. Buff, Orange and
Light Manila.
POST 8 mm. U wide, close to inner frame line and near left oval.
Base of S some distance from inner frame line. P leans to the left.
PO close but O slants to the right and is near the point. OS well
spaced but ST spaced wider. T far from right oval. T of TWO
far from left oval. WO near. OC wide. C some distance to right
of point but on level with E. The backstroke of the latter nearly
touches the point. EN wide, and ENTS close to inner frame line.
Nose far from left oval. Figures well centered in ovals. P line passes
through back stroke of E.
VARIETY 34. (25-3/4 × 27 mm.)
S of U.S. touches line above. OC near. Buff envelope and
wrapper.
POST 8 mm. U wide, inclined to left and near left oval. SP near,
P far to left of point. PO well spaced at top and O a little raised.
OS widely spaced. ST low, so that top stroke of T is somewhat
above top of S. T far from right oval. T of TWO near left oval.
WO near. C slants to left, and E to right, so that there is a
considerable space between the letters at base. ENT wide. TS
close. S far from right oval. Figure in right oval near inner frame
line, but in left oval well centered. U line passes between CE.
VARIETY 35. (25 × 25-3/4 mm.)
O of POST slants sharply to left. Hair far from frame line. Buff,
Orange and Light Manila.
POST 8 mm. U almost vertical and quite near to left oval. U.S.
near. P inclined to left. O near point. OST close. T near right
oval. T of TWO far from left oval. WO near. OC near. CE
wide at base. N higher than E or T. S slants sharply to right
and is far from right oval. Nose far from left oval. Figures well
centered in oval. T line slants through C from right to left. Bust
ends in a rather short point.
VARIETY 36. (26 × 26 mm.)
P tipped to left. O nearly touches outer frame line. Point of bust
short and over centre of O. Amber and Light Manila.
POST 8 mm. U large, inclined to left and near left oval. U. S.
near and base of S some distance from inner frame line. P near
point and slanting to left. PO wide, O nearly vertical. OST wide.
T far from right oval. T of TWO far from left oval. WO close.
OC near. C is low and slants sharply to left. CE close at top.
ENTS close. T almost touches line above. S near right oval.
Nose near left oval. Figures in ovals well centered. U line touches
ends of upper and lower stroke of E.
VARIETY 37. (26-1/2 × 26 mm.)
P nearly touches line at top. POST near. Orange and Amber.
POST 8 mm. U wide, inclined to left and near left oval. US wide.
P nearly vertical and some distance to left of point. PO on a level.
T of POST very far from right oval. T of TWO near left oval.
WO close. OC near. C nearly under the point and vertical. EN
well spaced at top. NTS close, especially the last two letters, S
near right oval. Nose far from left oval. Figures in ovals well
centered. T line slants across top of E. Envelopes only. A common
die.
VARIETY 38. (26 × 26 mm.)
Bust point behind O. NT wide. Orange, Amber, Light Manila.
POST 8 mm. U wide, greatly inclined to left, and quite near left
oval. US very wide. P near point and slanting to left. O some
distance to right of point and inclined to right. POS wide but ST
widest. Top stroke of T close to outer frame line. T of TWO
near left oval. WO near. OC very wide. C almost vertical and
close to point. Top of E slightly above C. EN near. TS wide at
base and S close to right oval. Nose far from left oval. Figures in
ovals well centered. U line touches base of T of CENTS.
VARIETY 39. (26-1/4 × 25-1/2 mm.)
P considerably above level of O. POST wide. Amber, and Light
Manila.
POST 9 mm. U wide, inclined to left, and near left oval. US
wide. SP wide. P slants to left and is close to the point. PO very
wide. O far to right of point and but little slanting. OST wide. T
near right oval. T of TWO close to left oval. WO close. The
entire word is well above the outer frame line. OC very wide. C
under the point and upright. Top of E slightly above C. NT
close. TS wide. S close to right oval. Nose near left oval. Figures
in ovals well centered. W line touches base of U at right. Broad
point to bust. Envelope and wrapper.
VARIETY 40. (26 × 26 mm.)
NT very near. POST wide. Buff, Orange, Amber, Light Manila.
POST 9-1/2 mm. Inscription in upper label much resembles that of
the preceding variety, but S of U.S. is low and PO nearer. T of
TWO near left oval. WO close. OC wide. TS close at top. Nose
far from left oval. Figures in ovals well centered. U line passes
along middle stroke of N. One of the most common varieties.
Reference List of the Two Cent Envelopes and
Wrappers of the Series of 1863 and 1864.
ENVELOPES.
TWO CENTS, BLACK.
1863.
Inscribed: U. S. POSTAGE.
DIE A.
Var. 3.
No. Class. Paper. Knife. Size. Dimensions. Remarks.
370 4 Buff 2 3 139 × 83 Gummed.
371   2 3  Ungummed.
Var. 5.
372 4 Buff 2 3 139 × 83 Ungummed.
373   2 3  Gummed.
Var. 6.
374 4 Amber 2 3 139 × 83 Gummed.
375  Buff 2 3  Ungummed.
DIE B.
Var. 8.
No. Class. Paper. Knife. Size. Dimensions. Remarks.
376 4 Buff 11 3 139 × 83 Ungummed.
377  Orange 11 3  
1864.
Inscribed: U. S. POST.
DIE C.
Var. 1.
No. Class. Paper. Knife. Size. Dimensions. Remarks.
378 2 Buff 11 3 139 × 83 Ungummed.
379   11 3  Gummed.
380  Or. 11 3  Ungummed.
Var. 3.
381 3 Buff 11 3 139 × 83 Gummed
382  Or. 11 3  Ungummed
Var. 5.
383 3 Buff 11 3 139 × 83 Gummed
384  Or. 11 3  Ungummed
Var. 6.
385 3 Buff 11 3 139 × 83 Gummed
Var. 6a.
386 3 Buff 11 3 139 × 83 Gummed
387  Or. 11 3  Ungummed
Var. 7.
388 3 Buff 11 3 139 × 83 Gummed
389  Or. 11 3  Ungummed
Var. 8.
390 3 Buff 11 3 139 × 83 Gummed
391  Or. 11 3  Ungummed
Var. 9.
392 3 Buff 11 3 139 × 83 Gummed
393  Or. 11 3  Ungummed
Var. 10.
394 4 Buff 11 3 139 × 83 Gummed.
Var. 11.
395 4 Buff 11 3 139 × 83 Gummed.
395a   12 5 160 × 90 
396  Or. 11 3 139 × 83 Ungummed.
Var. 12.
397 4 Buff 11 3 139 × 83 Gummed. Generally Specimen.
398  Or. 11 3  Ungummed. Generally Specimen.
399  Buff 12 5 160 × 90 Generally Specimen.
Var. 13.
400 4 Buff 11 3 139 × 83 Gummed.
401  Or. 11 3  Ungummed.
Var. 14.
402 5 Buff 11 3 139 × 83 Gummed.
403  Or. 11 3  Ungummed.
Var. 15.
404 5 Buff 11 3 139 × 83 Gummed.
405  Or. 11 3  Ungummed.
406  Buff 12 5 160 × 90 
Var. 16.
407 5 Buff 11 3 139 × 83 Gummed.
408  Or. 11 3  Ungummed.
409  Buff 12 5 160 × 90 
DIE D.
Var. 17.
No. Class. Paper. Knife. Size. Dimensions. Remarks.
410 3 Buff 11 3 139 × 83 Gummed
411  Or. 11 3  Ungummed
412  Buff 12 5 160 × 90 
Var. 18.
413 3 Buff 11 3 139 × 83 Gummed.
414  Or. 11 3  Ungummed.
415  Buff 12 5 160 × 90 
415a   12 5  Gummed.
Var. 19.
416 3 Amber 12 5 160 × 90 Gummed
417   12 5  [HW: Gummed]
Var. 20.
418 3 Amber 12 5 160 × 90 Ungummed
Var. 21.
419 3 Amber 11 3 139 × 83 Gummed
420   12 5 160 × 90 Ungummed
Var. 22.
421 4 Buff 2 3 139 × 83 Ungummed. Very rare.
Var. 23.
422 4 Amber 12 5 160 × 90 Ungummed.
Var. 24.
423 4 Buff 11 3 139 × 83 Gummed.
424  Or. 11 3  Ungummed.
Var. 25.
425 4 Buff 11 3 139 × 83 Gummed.
426  Or. 11 3  Ungummed.
Var. 26.
427 4 Amber 11 3 139 × 83 Gummed.
Var. 27.
428 4 Amber 12 5 160 × 90 Ungummed.
Var. 27a.
429 4 Amber 12 5 160 × 90 Ungummed.
Var. 28.
430 5 Buff 11 3 139 × 83 Gummed.
431  Or. 11 3  Ungummed.
Var. 30.
432 5 Amber 12 5 160 × 90 Ungummed.
Var. 31.
433 5 Buff 11 3 139 × 83 Gummed.
434  Or. 11 3  Ungummed.
434a  Buff 12 5 160 × 90 
Var. 32.
435 5 Or. 11 3 139 × 83 Gummed.
Var. 33.
436 5 Buff 11 3 139 × 83 Gummed.
437  Or. 11 3  Ungummed.
438  Buff 12 5 160 × 90 Gummed.
Var. 34.
439 5 Buff 11 3 139 × 83 Gummed.
Var. 35.
440 5 Buff 11 3 139 × 83 Gummed.
441  Or. 11 3  Ungummed.
Var. 36.
442 5 Amber 12 5 160 × 90 Ungummed.
Var. 37.
443 5 Amber 11 3 139 × 83 Gummed.
444  Or. 11 3  Ungummed.
Var. 38.
445 5 Or. 11 3 139 × 83 Ungummed.
Var. 39.
446 5 Amber 11 3 139 × 83 Gummed.
447   12 5 160 × 90 Ungummed.
Var. 40.
448 5 Buff 11 3 139 × 83 Gummed.
449  Amber 11 3  
450  Or. 11 3  Ungummed.
Wrappers.
1863.
Inscribed: U. S. POSTAGE.
DIE A.
Var. 1.
No. Class. Paper. Dimensions. Laid. Remarks.
451 1 D. M. 227 × 148
Var. 2.
452 2 D. M. 227 × 148
Var. 4.
453 4 D. M. 227 × 148
Var. 6.
454 4 D. M. 227 × 148
Var. 7.
455 4 D. M. 227 × 148
1864.
Inscribed: U. S. POST.
DIE C.
Var. 2.
No. Class. Paper. Dimensions. Laid. Remarks.
456 3 D. M. 100 × 200 V
Var. 4.
457 3 D. M. 100 × 200 V
Var. 6.
458 3 Buff 100 × 200 V
459  D. M.  V
Var. 6a.
460 3 Buff 100 × 200 H
461  D. M.  V
Var. 7.
462 3 D. M. 100 × 200 V
Var. 8.
463 3 D. M. 100 × 200 V
Var. 10.
464 4 D. M. 100 × 200 V
Var. 12.
465 4 D. M. 100 × 200 V
Var. 13.
466 4 D. M. 100 × 200 V
Var. 14.
467 5 D. M. 100 × 200 V
Var. 15.
468 5 Buff 100 × 200 V
469  D. M.  V
Var. 16.
470 5 Buff 100 × 200 V
DIE D.
Var. 17.
No. Class. Paper. Dimensions. Laid. Remarks.
471 3 Buff 100 × 200 V
Var. 19.
472 3 L. M. 133 × 200 V
Var. 20.
473 3 L. M. 100 × 200 V
474   133 × 200 —
Var. 21.
475 3 L. M. 133 × 200 H
476   115 × 375 H Stamp 137 mm. from top.
Var. 23.
477 4 L. M. 133 × 200 H
478    V
479    Wove
Var. 24.
480 4 L. M. 100 × 200 V
480a  Buff  V
Var. 25.
481 4 Buff 100 × 200 V
Var. 26.
482 4 L. M. 133 × 200 H
Var. 27.
483 4 L. M. 133 × 200 H
Var. 27a.
484 4 L. M. 133 × 200 V
Var. 29.
485 5 L. M. 133 × 200 H
Var. 30.
486 5 L. M. 133 × 200 H
Var. 31.
487 5 Buff 100 × 200 V
Var. 32.
488 5 L. M. 100 × 200 V
Var. 33.
489 5 L. M. 100 × 200 V
490  Buff  V
Var. 34.
491 5 L. M. 100 × 200 V
492  Buff  V
493   150 × 212 V
494    H
Var. 35.
495 5 L. M. 100 × 200 V
496  Buff  V
Var. 36.
497 5 L. M. 133 × 200 V
Var. 38.
498 5 L. M. 133 × 200 H
Var. 39.
499 5 L. M. 133 × 200 H
Var. 40.
499a 5 L. M. 133 × 200 H
FIFTH ISSUE: 1864-1865.
THREE CENTS, ROSE; THREE CENTS, BROWN; SIX CENTS,
ROSE AND SIX CENTS, PURPLE.
In the Postmaster-General's report for 1864 it is stated that during
the last session of Congress a bill was passed for the relief of the
contractor for furnishing the department with stamped envelopes
and newspaper wrappers, under the provisions of which the existing
contract expired on Sept. 11, 1864.
With the renewal of the former contract Nesbitt changed the dies of
the two, three and six cents. The first we have already exhaustively
treated. It is, of course, the two cents, black, U. S. POST. All these
dies remained in use until June 30th, 1870.
As a matter of history it may be noted here that the three cents
printed in brown, likewise the six cents rose, both on official size,
were issued in July, 1865. The dies have a portrait of Washington
facing to the left in a plain oval. It is enclosed in a frame of colorless
lines. Inscription above UNITED STATES; below, THREE CENTS or
SIX CENTS, in block capitals. Large numerals of value at each side.
None of the Nesbitt die varieties have given the writer so many
anxious hours and have required such prolonged study as the three
cents of 1864. Indeed, the final solution of the problem of
classification of the various dies was only arrived at after more than
two years' continuous research. Like the famous balancing of the egg
of Columbus, the problem, when solved, is extremely simple. Looking
backward on the long series of failures, it seems strange that the
chief characteristics have so long escaped the attention of
cataloguers. The fact, however, is patent. Even as thorough and
painstaking a student as the late Gilbert Harrison, who, in 1895,
chronicled, as he thought, all of the existing die varieties of the three
cents, failed to observe the most important differences. Indeed,
in the entire philatelic literature dealing with the Nesbitt dies of 1864
there is but one allusion to the feature which constitutes the surest
means for the identification of the die varieties, and this is only a
single sentence contained in the Historical Notes of Messrs. Tiffany,
Bogert and Rechert. It reads:—
It is worth mentioning, however, that while dies 9, 15 and 26 (the
latter the die under consideration) all have the small bust of
Washington, there are small differences in each which show them to
be different engravings. * * In die 26 the front hair shows only five
locks and the back hair only four lines.
We shall presently see that, like the three cents, red of 1853, (Die A)
the diemakers have produced different groups of heads which, once
known, are not only an absolute means of differentiating the
varieties, but also protect the collector from acquiring a multitude of
the same die.
Although, as stated above, the die of the three cents rose equals
that of the three cents red in the use of various heads, it is,
otherwise, quite dissimilar to the first issue, as will be seen presently.
As in the varieties of the two cent dies the horizontal and vertical
dimensions of the three cents vary greatly. After careful research and
taking the advice of experienced philatelists, it was decided to adopt
only two sizes for classification: i.e.
Size A:—to include all stamps measuring horizontally 24 mm.
but not exceeding 25 mm.
Size B:—to include all stamps measuring horizontally 25-1/2
mm. or more.
In our study of the three cents red of 1853 we noted, in addition to
the various heads, some minor differences in the spacing of the
letters forming the inscription. Referring now to the three cents of
1864, even the unskilled eye of the layman will be struck with the
surprising changes, not only in the spacing of the letters forming a
word, but, also, in the relative position of the words to each other
and their distance from a definite point, such, for instance, as the
figure 3. The subsequent cuts well illustrate this point.
In the first the S of CENTS is several mm. distant from the right
figure 3: in the second it is close to 3. The same remarks apply to
the U of UNITED in its relative position to the left figure 3. In
the second cut there is also a square period after the final E of
THREE.
Looking at cuts 3 and 4 the great variety of spacing between the
letters of a word is strikingly apparent in the word THREE. These
differences are easily detected by the 10 mm. unit distance
measurement, which has been explained in the introductory chapter
of this series of articles. The subjoined diagram proves that there are
at least three forms of each word, and, with a little study, the
collector will soon recognize the leading types.
It seems strange that such great and palpable differences remained
unknown until 1892. Quoting from the work of Messrs. Tiffany,
Bogert & Rechert, we are, however, informed: "Heretofore it has not
been noticed that there are a large number of minor varieties of this
die depending on the relative position of the parts."
Commenting on Die 26 (three cents rose) the writers make some
valuable suggestions, but they discourage the would-be student from
going deeper into the subject by the closing paragraph: "So few
collectors would be interested in looking for these varieties that it
has been thought unnecessary to devote space to them in a general
work." In the writer's opinion the most valuable hint thrown out by
Messrs. Tiffany, Bogert & Rechert is contained in the following
sentence: "If a thread be laid along the lower stroke of the U it will
pass at different distances from the tip of the nose and fall on
different parts of the right numeral, of the space below it, or even as
low as the S of CENTS."
Why these experts stopped at the gate and did not enter is one of
those freaks of the human mind that defies explanation. Certainly
the person who made this observation was on the very threshold of
discovering a scientific classification of this elusive die. The writer
confesses that, after having independently evolved this system of
classification, nothing has given him greater satisfaction than to find
that the basic idea had been chronicled as far back as 1892. To-day
it is well known that a line prolongation along the U of UNITED
establishes five distinct classes. As this system has been fully
described in a lecture given by the writer before the Boston Philatelic
Society (April 19, 1904), which lecture has also been published in
pamphlet form, and as this classification has been accepted by the
writer of the latest Scott Catalogue, it seems unnecessary to go into
the details, especially as the subjoined diagram is self-explanatory.
It is evident that we now possess various means for the classification
of the three cents die varieties, but a system based solely on a line
measurement, as has been stated heretofore, would not guard the
collector sufficiently from acquiring a number of the same dies, due
to unavoidable mistakes of measurement. To prevent duplication of
dies it is imperative to know the various heads.
Luckily the distinctive features are quite plain and it is easy to divide
the heads into five classes for, as in the first issue, the die cutters
have adorned the head of Washington with a variety of coiffures.
In Heads 1 and 2 there is a triangular open space between the
middle bunch of hair and the lowest strand which meets the queue.
HEAD 1.—The queue consists of three vertical strands extending
from the top of the head to the neck. Next to the queue are 3 rear
locks, of which the middle one is a large, pear-shaped bunch,
consisting of five fine strands, while the second highest is by far the
longest, and cuts into the queue, resembling the stem of a pear.
HEAD 2.—Same as Head 1, but the second lowest strand of hair in
the pear-shaped bunch is the longest and does not extend into the
queue. The triangular space below is slightly larger than in Head 1.
HEAD 3.—The queue consists of either three or four strands which
extend from the top of the head to the neck. Next to the queue
there are five locks in the rear row, the arrangement of which differs
in the various specimens. The main feature of Head 3 consists in the
absence of an open space between the middle bunch and the lowest
lock.
HEAD 4.—The queue consists of three strands which extend from the
top of the head to the neck. The back row of hair consists of five
locks of which the lowest is very small and runs almost
perpendicularly into the queue. There is a small space between the
perpendicular lock and the next lowest.
HEAD 5.—Generally found on the second quality of buff paper. The
queue consists of three strands, which extend from the top of the
head to the neck. The main feature is the middle bunch of hair,
which is oblong shaped and consists of three heavy strands, all of
For your convenience Apress has placed some of the front matter material after the index. Please use the Bookmarks and Contents at a Glance links to access them.
Contents at a Glance

About the Author
About the Technical Reviewer
Acknowledgments
Introduction
Chapter 1: NoSQL: Cassandra Basics
Chapter 2: Cassandra Data Modeling
Chapter 3: Indexes and Composite Columns
Chapter 4: Cassandra Data Security
Chapter 5: MapReduce with Cassandra
Chapter 6: Data Migration and Analytics
Chapter 7: Titan Graph Databases with Cassandra
Chapter 8: Cassandra Performance Tuning
Chapter 9: Cassandra: Administration and Monitoring
Chapter 10: Cassandra Utilities
Chapter 11: Upgrading Cassandra and Troubleshooting
Index
Introduction

Big or large data has been the talk of the town in recent years. With possibilities for solving unstructured and semi-structured data issues, more and more organizations are gradually moving toward big data powered solutions. This essentially gives organizations a way to think "beyond RDBMS." This book will walk you through many such use cases during the journey.

Many NoSQL databases have been developed over the last 4-5 years. Recent research shows there are now more than 150 different NoSQL databases. This raises questions about why to adopt a specific database. For example, is it scalable, under active development, and, most importantly, accepted by the community and organizations? It is in light of these questions that Apache Cassandra comes out as a winner, which indicates why it is one of the most popular NoSQL databases currently in use.

Apache Cassandra is a columnar distributed database that takes database application development forward from the point at which we encounter the limitations of traditional RDBMSs in terms of performance and scalability. A few things that restrict traditional RDBMSs are the requirement for predefined schemas, the difficulty of scaling to hundreds of data nodes, and the amount of work involved in data administration and monitoring. We will discuss these restrictions and how to address them with Apache Cassandra.

Beginning Apache Cassandra Development introduces you to Apache Cassandra, including the answers to the questions mentioned above, and provides a detailed overview and explanation of its feature set.
Beginning with Cassandra basics, this book will walk you through the following topics and more:

• Data modeling
• Cluster deployment, logging, and monitoring
• Performance tuning
• Batch processing via MapReduce
• Hive and Pig integration
• Working on graph-based solutions
• Open source tools for Cassandra and related utilities

The book is intended for database administrators, big data developers, students, big data solution architects, and technology decision makers who are planning to use or are already using Apache Cassandra. Many of the features and concepts covered in this book are approached through hands-on recipes that show how things are done. In addition to those step-by-step guides, the source code for the examples is available as a download from the book's Apress product page (www.apress.com/9781484201435).
Chapter 1: NoSQL: Cassandra Basics

The purpose of this chapter is to discuss NoSQL, let users dive into NoSQL elements, and then introduce big data problems, distributed database concepts, and finally Cassandra concepts. Topics covered in this chapter are:

• NoSQL introduction
• CAP theorem
• Data distribution concepts
• Big data problems
• Cassandra configurations
• Cassandra storage architecture
• Setup and installation
• Logging with Cassandra

The intent of this detailed introductory chapter is to dive deep into the NoSQL ecosystem by discussing problems and solutions, such as distributed programming concepts, which can help in solving scalability, availability, and other data-related problems. This chapter will introduce the reader to Cassandra and discuss Cassandra's storage architecture, various other configurations, and the Cassandra cluster setup over local and AWS boxes.

Introducing NoSQL

Big data's existence can be traced back to the mid-1990s. However, the actual shift began in the early 2000s. The evolution of the Internet and mobile technology opened many doors for more people to participate and share data globally. This resulted in massive data production, in various formats, flowing across the globe. A wider distributed network resulted in incremental data growth. Due to this massive data generation, there is a major shift in application development, and many new domain business possibilities have emerged, like:

• Social trending
• OLAP and data mining
• Sentiment analysis
• Behavior targeting
• Real-time data analysis
With high data growth into peta/zettabytes, challenges like scalability and managing data structure would be very difficult with traditional relational databases. Here, big data and NoSQL technologies are considered an alternative for building solutions. In today's scenario, existing business domains are also exploring new functional possibilities while handling massive data growth simultaneously.

NoSQL Ecosystem

NoSQL, often called "Not Only SQL," implies thinking beyond traditional SQL in a distributed way. There are more than 150 NoSQL databases available today. The following are a few popular ones:

• Columnar databases, such as Cassandra and HBase
• Document-based storage, such as MongoDB and Couchbase
• Graph-based access, such as Neo4J and Titan Graph DB
• Simple key-value stores, such as Redis and CouchDB

With so many options and categories, the most important question is what, how, and why to choose! Each NoSQL database category is meant to deal with a specific set of problems. A specific-technology-for-specific-requirement paradigm is leading the current era of technology. It is certain that a single database for all business needs is clearly not a solution, and that's where the need for NoSQL databases arises. The best way to adopt a database is to understand the requirements first. If the application is polyglot in nature, then you may need to choose more than one database from the available options. In the next section, we will discuss a few points that describe why Cassandra could be an answer to your big data problem.

CAP Theorem

The CAP theorem, introduced in early 2000 by Eric Brewer, states that no database can offer Consistency, Availability, and Partition tolerance together (see Figure 1-1), but, depending on the use case, may allow for any two of them.

Figure 1-1. CAP theorem excludes the possibility of a database with all three characteristics (the "NA" area)
Traditional relational database management systems (RDBMS) provide atomicity, consistency, isolation, and durability (ACID) semantics and advocate for strong consistency. That's where most NoSQL databases differ; they strongly advocate for partition tolerance and high availability with eventual consistency.

High availability of data means data must be available with minimal latency. For distributed databases, where data is distributed across multiple nodes, one way to achieve high availability is to replicate it across multiple nodes. Like most NoSQL databases, Cassandra also provides high availability.

Partition tolerance implies that if a node or a couple of nodes goes down, the system will still be able to serve read/write requests. In scalable systems built to deal with a massive volume of data (in petabytes), such situations are likely to occur often. Hence, such systems have to be partition tolerant. Cassandra's storage architecture enables this as well.

Consistency means being consistent across distributed nodes. Strong consistency refers to the most updated or consistent data on each node in a cluster. On each read/write request, the most stable rows can be read or written by introducing latency (a downside of NoSQL) on each read and write request, ensuring synchronized data on all the replicas. Cassandra offers eventual consistency, as well as configurable consistency levels for each read/write request. We will discuss the various consistency level options in detail in the coming chapters.

Budding Schema

A structured or fixed schema defines the number of columns and their data types before implementation. Any alteration to the schema, like adding column(s), requires a migration plan across the schema. For semistructured or unstructured data formats, where the number of columns and data types may vary across multiple rows, a static schema doesn't fit very well.
That's where a budding, or dynamic, schema is the best fit for semistructured or unstructured data. Figure 1-2 presents four records containing Twitter-like data for particular user ids. Here, the user id imvivek consists of three columns, "tweet body," "followers," and "retweeted by." But the row for user "apress_team" has only the column "followers." For an unstructured schema such as server logs, the number of fields may vary from row to row. This requires the addition of columns "on the fly," a strong requirement for NoSQL databases. A traditional RDBMS can handle such a data set in a static way, but unlike Cassandra, an RDBMS cannot scale to have up to a million columns per row in each partition. With predefined models in the RDBMS world, handling frequent schema changes is certainly not a workable option. Imagine that if we attempted to support dynamic columns, we might end up having many null columns! Having default null values for multiple columns per row is certainly not desirable. With Cassandra we can have as many columns as we want (up to 2 billion)! Another possible option is to define a data type for column names (a comparator), which is not possible with an RDBMS (e.g., a column name of type integer).
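To make the budding-schema idea concrete, here is a small, hypothetical Python sketch (not Cassandra's API) that models the Figure 1-2 data as a map of row keys to columns. Each row stores only the columns it actually has, and a new column can be added on the fly with no migration:

```python
# Hypothetical sketch of a budding (dynamic) schema: each row key maps to
# only the columns that exist for it, so rows need not share a fixed layout.
rows = {
    "imvivek": {
        "tweet body": "reading about Cassandra",
        "followers": ["apress_team"],
        "retweeted by": ["apress_team"],
    },
    "apress_team": {
        "followers": ["imvivek"],  # this row has a single column
    },
}

def add_column(row_key, column, value):
    # Adding a column touches only this row; no schema migration is needed,
    # and other rows never gain a default null value for it.
    rows.setdefault(row_key, {})[column] = value

add_column("apress_team", "tweet body", "new title announced")
```

In an RDBMS the same change would require an ALTER TABLE affecting every row; here the column exists only where it is set.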
Scalability

Traditional RDBMSs offer vertical scalability, that is, scaling by adding more processors or RAM to a single unit, whereas NoSQL databases offer horizontal scalability, scaling by adding more nodes. Most NoSQL databases are schemaless and can perform well over commodity servers. Adding nodes to an existing RDBMS cluster is a cumbersome process and relatively expensive, whereas it is relatively easy to add data nodes with a NoSQL database such as Cassandra. We will discuss adding nodes to Cassandra in coming chapters.

No Single Point of Failure

With centralized databases or master/slave architectures, where database resources or a master are available on a single machine, database services come to a complete halt if the master node goes down. Such database architectures are discouraged where high availability of data is a priority. NoSQL distributed databases generally prefer a multiple-master/slave configuration or a peer-to-peer architecture to avoid a single point of failure. Cassandra delivers a peer-to-peer architecture where each Cassandra node has an identical configuration. We will discuss this at length in the coming chapters. Figure 1-3a depicts a system with a single master acting as the single point of contact to retrieve data from slave nodes. If the master goes down, it brings the whole system to a halt until the master node is reinstated. But with a multiple-master configuration, like the one in Figure 1-3b, a single point of failure does not interrupt service.

Figure 1-2. A dynamic column, a.k.a. budding schema, is one way to relax the static schema constraint of the RDBMS world
High Availability

High availability clusters suggest the database is available 24x7 with minimal (or no) downtime. In such clusters, data is replicated across multiple nodes, so that if one node is down, another node is still available to serve read/write requests until the failed node is up and running again. Cassandra's peer-to-peer architecture ensures high availability of data with co-location.

Identifying the Big Data Problem

Recently, it has been observed that developers are opting for NoSQL databases as an alternative to RDBMS. However, I recommend that you perform an in-depth analysis before deciding on NoSQL technologies. A traditional RDBMS does offer lots of features that are absent in most NoSQL databases. A couple of questions that must be analyzed and answered before jumping to a NoSQL-based approach include:

• Is it really a big data problem?
• Why/where does RDBMS fail to deliver?

Identifying a "big data problem" is an interesting task. Scalability, the nature of the data (structured, unstructured, or semistructured), and the cost of maintaining data volume are a few important factors. In most cases, managing secured and structured data within an RDBMS may still be the preferred approach; however, if the nature of the data is semistructured, less vulnerable, and scalability is preferred over traditional RDBMS features (e.g., joins, materialized views, and so forth), it qualifies as a big data use case. Here, data security means the authentication and authorization mechanism. Although Cassandra offers decent support for authentication and authorization, an RDBMS fares well in comparison with most NoSQL databases.

Figure 1-4 shows a scenario in which a cable/satellite operator system is collecting audio/video transmission logs (on a daily basis) of around 3 GB/day per connection.
A "viewer transmission analytic system" can be developed using a big data tech stack to perform "near real time" and "large data" analytics over the streaming logs. Also, the nature of the data logs is uncertain and may vary from user to user. Generating monthly/yearly analytic reports would require dealing with petabytes of data, and NoSQL's scalability is definitely a preference over that of an RDBMS.

Figure 1-3. Centralized vs. distributed architectural setup
Consider an example in which a viewer transmission analytic system is capturing random logs for each transmitted program and its watching users. The first question we need to ask is: is it really a big data problem? Yes; here we are talking about logs. Imagine that in a country like India the user base is huge, as are the logs captured 24x7! Also, the nature of the transmitted logs may be random, meaning the structure is not fixed! The data can be semistructured or totally unstructured. That's where an RDBMS will fail to deliver, because of the budding schema and scalability problems (see the previous section). To summarize, build a NoSQL-based solution if:

• The data format is semi/unstructured
• RDBMS reaches its storage limit and cannot scale further
• RDBMS-specific features like relations and indexes can be sacrificed in favor of denormalized but distributed data
• Data redundancy is not an issue and a read-before-write approach can be applied

In the next section, we will discuss how Cassandra can be a best fit to address such technical and functional challenges.

Introducing Cassandra

Cassandra is an open source, column-family-oriented database. Originally developed at Facebook, it has been an Apache TLP since 2009. Cassandra comes with many important features; some are listed below:

• Distributed database
• Peer-to-peer architecture
• Configurable consistency
• CQL (Cassandra Query Language)

Figure 1-4. Family watching satellite transmitted programs
Distributed Databases

Cassandra is a globally distributed database. Cassandra supports features like replication and partitioning. Replication is a process where the system maintains n replicas on various data sites. Such data sites are called nodes in Cassandra. Data partitioning is a scheme where data may be distributed across multiple nodes. Partitioning is usually used for managing high availability and performance of data.

■ Note  A node is a physical location where data resides.

Peer-to-Peer Design

Cassandra's storage architecture is peer-to-peer. Each node in a cluster is assigned the same role, making it a decentralized database. Each node is independent of the others but interconnected. Nodes in a network are capable of serving read/write database requests, so at any given point, even if a node goes down, subsequent read/write requests will be served from other nodes in the network; hence there is no SPOF (single point of failure). Figure 1-5 is a graphical representation of peer-to-peer (P2P) architecture.

Figure 1-5. Peer-to-peer decentralized Cassandra nodes. Every node is identical and can communicate with other nodes
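As an intuition for how partitioning works in a peer-to-peer cluster, the sketch below hashes a row key onto a ring of evenly spaced node positions. This is a simplification for illustration only: Cassandra's own partitioners (e.g., Murmur3Partitioner or RandomPartitioner) assign token ranges internally, and the node names here are made up.

```python
import hashlib
from bisect import bisect_right

NODES = ["node1", "node2", "node3"]  # hypothetical peer nodes
TOKEN_SPACE = 2**127                 # RandomPartitioner-style token space

def token(key: str) -> int:
    # Hash a row key into the token space (MD5, as RandomPartitioner does).
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % TOKEN_SPACE

# Evenly spaced ring positions, like evenly assigned initial_token values.
ring = [(i * (TOKEN_SPACE // len(NODES)), n) for i, n in enumerate(NODES)]

def node_for(key: str) -> str:
    # The owner is the node whose ring position starts the token range
    # containing the key's token.
    positions = [pos for pos, _ in ring]
    idx = bisect_right(positions, token(key)) - 1
    return ring[idx][1]
```

Because every peer can compute this mapping, any node can coordinate a request and route it to the right replica; there is no special master.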
Configurable Data Consistency

Data consistency is the synchronization of data across multiple replica nodes. An eventual-consistency-based data model returns the last updated record. Such a data model is widely supported by many distributed databases. Cassandra also offers configurable eventual consistency.

Write Consistency

If the data is successfully written and synchronized on replica nodes before acknowledging the write request, the data is considered write consistent. However, various consistency level values are possible when submitting a write request. The available consistency levels are:

• ANY: A write must be written to at least any one node. If all replica nodes are down and "hinted_handoff_enabled: true" (the default is true), then the corresponding write data and a hint will still be stored by the coordinator node, and later, once the replica nodes are up, the data will be replayed to at least one of them. That written data will not be available for reads until it has been replayed to at least one replica node. ANY is the lowest consistency level but offers the highest availability, as it requires the data to be written on only one node before sending the write acknowledgment.
• ONE: With consistency level ONE, a write request must be successfully written on at least one replica node before acknowledgment.
• QUORUM: With consistency level QUORUM, write requests must be successfully written on a selected group of replica nodes, known as a quorum (a majority of the replicas).
• LOCAL_QUORUM: With consistency level LOCAL_QUORUM, write requests must be successfully written on a quorum of replica nodes that are locally available in the same data center as the coordinator node.
• EACH_QUORUM: With consistency level EACH_QUORUM, write requests must be successfully written on a quorum of replica nodes in each data center.
• ALL: With consistency level ALL, write requests must be written to the commit log and memtable on all replica nodes in the cluster for that row key, ensuring the highest consistency level.
• SERIAL: Linearizable consistency was introduced in Cassandra 2.0 as lightweight transaction support. With consistency level SERIAL, write requests must be written to the commit log and memtable on a quorum of replica nodes conditionally. Here, conditionally means either a guaranteed write on all of those nodes or none.
• TWO: Similar to ONE, except that with consistency level TWO, write requests must be written to the commit log and memtable on a minimum of two replica nodes.
• THREE: Similar to TWO, except that with consistency level THREE, write requests must be written to the commit log and memtable on a minimum of three replica nodes.

Read Consistency

No data is of much use if it is not consistent. Large or small data applications would prefer not to have dirty reads or inconsistent data. A dirty read is a scenario where a transaction may end up reading uncommitted data from another thread. Although dirty reads are more RDBMS specific, with Cassandra there is a possibility of inconsistent data if the responsible node is down and the latest data has not been replicated to each replica node. In such cases, the application may prefer to have strong consistency at the read level. With Cassandra's tunable consistency, it is possible to configure consistency per read request. The possible options are:
• ONE: With read consistency level ONE, data is returned from the replica node nearest to the coordinator node. Cassandra relies on the snitch configuration to determine the nearest possible replica node. Since a response is required from only the closest replica node, ONE is the lowest consistency level.
• QUORUM: With read consistency level QUORUM, the last updated data (based on timestamp) is returned from among the data responses received from a quorum of replica nodes.
• LOCAL_QUORUM: With read consistency level LOCAL_QUORUM, the last updated data (based on timestamp) is returned from among the data responses received from a local quorum of replica nodes.
• EACH_QUORUM: With read consistency level EACH_QUORUM, the last updated data (based on timestamp) is returned from among the data responses received from each quorum of replica nodes.
• ALL: With read consistency level ALL, the last updated data (based on timestamp) is returned from among the data responses received from all replica nodes. Since the response with the latest timestamp is selected from among all replica nodes, ALL is the highest consistency level.
• SERIAL: With read consistency level SERIAL, the latest set of columns committed or in progress is returned. Uncommitted transactions discovered during the read result in an implicit commit of the running transactions and a return of the latest column values.
• TWO: With read consistency level TWO, the latest column values will be returned from the two closest replica nodes.
• THREE: With read consistency level THREE, the latest column values will be returned from the three closest replica nodes.

Based on the above-mentioned consistency level configurations, the user can always configure each read/write request with a desired consistency level. For example, to ensure the lowest write consistency but the highest read consistency, we can opt for ANY as the write consistency level and ALL as the read consistency level.
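The arithmetic behind these levels is simple and worth internalizing: a quorum is a majority of the replicas, and a read is guaranteed to see the latest write whenever the number of replicas written plus the number of replicas read exceeds the replication factor (W + R > RF). A small sketch (the function names are mine, not Cassandra's):

```python
def quorum(replication_factor: int) -> int:
    # A quorum is a majority of the replicas: (RF / 2) + 1, integer math.
    return replication_factor // 2 + 1

def is_strongly_consistent(write_replicas: int, read_replicas: int,
                           replication_factor: int) -> bool:
    # If W + R > RF, every read set overlaps every write set, so at least
    # one replica consulted on read always holds the latest write.
    return write_replicas + read_replicas > replication_factor

RF = 3
print(quorum(RF))                                          # 2
print(is_strongly_consistent(quorum(RF), quorum(RF), RF))  # True (QUORUM/QUORUM)
print(is_strongly_consistent(1, 1, RF))                    # False (ONE/ONE)
```

This is why QUORUM writes combined with QUORUM reads give strong consistency, while ONE/ONE trades that guarantee for lower latency.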
Cassandra Query Language (CQL)

One of the key features of Cassandra from an end-user perspective is ease of use and familiarity. Cassandra Query Language (CQL) was introduced with the Cassandra 0.8 release with the intention of having an RDBMS-style structured query language (SQL). Since its inception, CQL has gone through many changes. Many new features have been introduced in later releases, along with lots of performance-related enhancement work. CQL adds a flavor of the familiar data definition language (DDL) and data manipulation language (DML) statements. During the course of this book, we will be covering most of the CQL features.

Installing Cassandra

Installing Cassandra is fairly easy. In this section we will cover how to set up a Cassandra tarball (.tar file) installation on a Windows or Linux box.

1. Create a folder in which to download the Cassandra tarball. For example, run:

   mkdir /home/apress/cassandra  {Here apress is the user.name environment variable}
   cd /home/apress/cassandra
2. Download the Cassandra tarball:

   Linux:
   wget http://archive.apache.org/dist/cassandra/2.0.6/apache-cassandra-2.0.6-bin.tar.gz

   Windows:
   http://archive.apache.org/dist/cassandra/2.0.6/apache-cassandra-2.0.6-bin.tar.gz

3. Extract the downloaded tar file using the appropriate method for your platform. For Linux, use the following command:

   tar -xvf apache-cassandra-2.0.6-bin.tar.gz

   For Windows, you may use tools like WinZip or 7zip to extract the tarball.

■ Note  If you get an "Out of memory" error or a segmentation fault, check the JAVA_HOME and JVM_OPTS parameters in the cassandra-env.sh file.

Logging in Cassandra

While running an application in development or production mode, we might need to look into the server logs in certain circumstances, such as:

• Performance issues
• Operation support
• Debugging application vulnerability

The default server logging settings are defined within the log4j-server.properties file, as shown in the following:

# output messages into a rolling log file as well as stdout
log4j.rootLogger=INFO,stdout,R

# stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%5p %d{HH:mm:ss,SSS} %m%n

# rolling log file
log4j.appender.R=org.apache.log4j.RollingFileAppender
log4j.appender.R.maxFileSize=20MB
log4j.appender.R.maxBackupIndex=50
log4j.appender.R.layout=org.apache.log4j.PatternLayout
log4j.appender.R.layout.ConversionPattern=%5p [%t] %d{ISO8601} %F (line %L) %m%n
# Edit the next line to point to your logs directory
log4j.appender.R.File=/var/log/cassandra/system.log

# Application logging options
#log4j.logger.org.apache.cassandra=DEBUG
#log4j.logger.org.apache.cassandra.db=DEBUG
#log4j.logger.org.apache.cassandra.service.StorageProxy=DEBUG

# Adding this to avoid thrift logging disconnect errors.
log4j.logger.org.apache.thrift.server.TNonblockingServer=ERROR
Let's discuss these properties in sequence:

• Properties with the prefix log4j.appender.stdout are for console logging.
• Server logs are generated and appended at the location defined by the log4j.appender.R.File property. The default value is /var/log/cassandra/system.log. The user can overwrite the property file to change the default location.
• log4j.appender.R.maxFileSize defines the maximum log file size.
• The log4j.appender.R.maxBackupIndex property defines the maximum number of rolling log files (default 50).
• The log4j.appender.R.layout.ConversionPattern property defines the logging pattern for log files.
• The last line in the log4j-server.properties file is for application logging in the case of a Thrift connection with Cassandra. By default it is set to ERROR to avoid unnecessary logging of frequent socket disconnections.

Application Logging Options

By default, Cassandra API-level logging is disabled, but we can enable it and change the log level to log more application-level information. Many times applications may need to enable Cassandra-specific server-side logging to troubleshoot problems. The following lines depict the section that can be used for application-specific logging:

# Application logging options
#log4j.logger.org.apache.cassandra=DEBUG
#log4j.logger.org.apache.cassandra.db=DEBUG
#log4j.logger.org.apache.cassandra.service.StorageProxy=DEBUG

Changing Log Properties

There are two possible ways to configure log properties. First, we can modify log4j-server.properties; second, via JMX (Java Management Extensions), using jconsole. The difference between the two is that the latter changes the logging level dynamically at run time, while the first is static.

Managing Logs via JConsole

JConsole is a GUI monitoring tool for resource usage and performance monitoring of running Java applications using JMX.
The jconsole executable can be found in JDK_HOME/bin, where JDK_HOME is the directory in which the Java Development Kit (JDK) is installed. If this directory is in your system path, you can start JConsole by simply typing jconsole at a command (shell) prompt. Otherwise, you have to use the full path to the executable file.
On running jconsole, you need to connect to the CassandraDaemon thread, as shown in Figure 1-6.

Figure 1-6. JConsole connection layout
After successfully connecting to the CassandraDaemon process, click on the MBeans tab to look into the registered message beans. Figure 1-7 depicts changing the log level for classes within the org.apache.cassandra.db package to the INFO level.

Figure 1-7. Changing the log level via the jconsole MBeans setting

■ Note  Please refer to http://logging.apache.org/log4j/1.2/apidocs/org/apache/log4j/PatternLayout.html for more information on logging patterns.

Understanding Cassandra Configuration

The primary Cassandra configuration file is cassandra.yaml, which is available within the $CASSANDRA_HOME/conf folder. There are roughly 100 properties. Table 1-1 lists a subset of these properties, which are helpful for Cassandra beginners and worth mentioning.
Table 1-1. Important Cassandra server properties

cluster_name (default "Test cluster"): Restricts a node to joining one logical cluster only.

num_tokens (default disabled, not specified): If not specified, the default value is 1. For example, if you want to enable virtual node support while bootstrapping a node, you need to set the num_tokens parameter. The recommended value is 256.

initial_token (default N/A): Assigns a data range to the node. While bootstrapping a node, it is recommended to assign a value. If left unspecified, a random token will be assigned by Cassandra. For the random partitioning scheme, the way to calculate initial_token values is: i * (2**127 / N) for i = 0 .. N-1, where N is the number of nodes.

hinted_handoff_enabled (default true): With consistency level ANY, if a replica node is down, the corresponding write request will be stored on the coordinator node as a hint in the system.hints column family. The hint is used to replay the mutation object once the replica node starts accepting write requests again.

max_hint_window_in_ms (default 3 hours): Maximum wait time for a dead node during which new hints may be written on the coordinator node. After the hint window expires, no more new hints will be stored. This property is used when writing hints on the coordinator node: if the gossip protocol endpoint downtime for a specific replica node is greater than the specified maximum wait time, then no new hints can be written by the StorageProxy service on the coordinator node.

hinted_handoff_throttle_in_kb (default 1024): Hint data flow in kb/sec, per thread.

max_hints_delivery_threads (default 2): Maximum number of threads allowed to send data hints. Useful when writing hints across multiple data centers.

populate_io_cache_on_flush (default false): Set it to true if the complete data on a node can fit into memory. Since Cassandra 1.2.2, we can also set this parameter per column family (https://issues.apache.org/jira/browse/CASSANDRA-4694).

authenticator (default AllowAllAuthenticator): Implementation of the IAuthenticator interface. By default, Cassandra offers AllowAllAuthenticator and PasswordAuthenticator as internal authentication implementations. PasswordAuthenticator validates the username and password against data stored in the credentials and users column families in the system_auth keyspace. (Security in Cassandra will be discussed at length in Chapter 10.)
authorizer (default AllowAllAuthorizer): Implementation of the IAuthorizer interface. The implementation manages the user's permissions over keyspaces, column families, indexes, etc. Enabling CassandraAuthorizer on server startup will create a permissions table in the system_auth keyspace to store user permissions. (Security in Cassandra will be discussed at length in Chapter 10.)

permissions_validity_in_ms (default 2000; disabled if the authorizer property is AllowAllAuthorizer): Default permissions cache validity.

partitioner (default Murmur3Partitioner): Row distribution across nodes in the cluster is decided based on the selected partitioner. Available values are RandomPartitioner, ByteOrderedPartitioner, Murmur3Partitioner, and OrderPreservingPartitioner (deprecated).

data_file_directories (default /var/lib/cassandra/data): Physical data location of the node.

commitlog_directory (default /var/lib/cassandra/commitlog): Physical location of the node's commit log files.

disk_failure_policy (default stop): Available values are stop, best_effort, and ignore. stop will shut down all communication with the node (except JMX). best_effort will still acknowledge read requests from available sstables.

key_cache_size_in_mb (default empty, meaning 100MB or 5% of available heap size, whichever is smaller): To disable, set it to zero.

saved_caches_directory (default /var/lib/cassandra/saved_caches): Physical location for saved caches on the node.

key_cache_save_period (default 14400): Key cache save duration (in seconds); saved under saved_caches_directory.

key_cache_keys_to_save (default disabled): By default disabled; all keys will be cached.

row_cache_size_in_mb (default 0, disabled): In-memory row cache size.

row_cache_save_period (default 0, disabled): Row cache save duration (in seconds); saved under saved_caches_directory.

row_cache_keys_to_save (default disabled): By default disabled; all row keys will be cached.
row_cache_provider (default SerializingCacheProvider): Available values are SerializingCacheProvider and ConcurrentLinkedHashCacheProvider. SerializingCacheProvider is recommended in case the workload is not update intensive, as it uses native memory (not the JVM heap) for caching.

commitlog_sync (default periodic): Available values are periodic and batch. In the case of batch sync, writes will not be acknowledged until the writes are synced with disk. See the commitlog_sync_batch_window_in_ms property.

commitlog_sync_batch_window_in_ms (default 50): If commitlog_sync is in batch mode, Cassandra will acknowledge writes only after the commit log sync window expires and data has been fsynced to disk.

commitlog_sync_period_in_ms (default 10000): If commitlog_sync is periodic, the commit log will be fsynced to disk after this interval.

commitlog_segment_size_in_mb (default 32): Commit log segment size. Upon reaching this limit, Cassandra flushes memtables to disk in the form of sstables. Keep it to a minimum with a 32-bit JVM to avoid running out of address space and reduced commit log flushing.

seed_provider (default SimpleSeedProvider): Implementation of the SeedProvider interface. SimpleSeedProvider is the default implementation and takes a comma-separated list of addresses. The default value for the "-seeds" parameter is 127.0.0.1. Please change it to multiple node addresses in the case of a multi-node deployment.

concurrent_reads (default 32): If the workload data cannot fit in memory, data will need to be fetched from disk. Set this parameter to the number of concurrent reads to perform.

concurrent_writes (default 32): Generally writes are faster than reads, so we can set this parameter on the higher side in comparison to concurrent_reads.

memtable_total_space_in_mb (default one third of the JVM heap; disabled): Total space allocated for memtables. Once the specified size is exceeded, Cassandra will flush the largest memtable to disk first.
commitlog_total_space_in_mb (default 32 for a 32-bit JVM, 1024 for a 64-bit JVM): Total space allocated for commit log segments. Upon reaching the specified limit, Cassandra flushes memtables to reclaim space by removing the oldest commit log first.

storage_port (default 7000): TCP port for internal communication between nodes.

ssl_storage_port (default 7001): Used if client_encryption_options is enabled.

listen_address (default localhost): Address to bind and use to connect with other Cassandra nodes.
broadcast_address (default disabled, same as listen_address): Broadcast address for other Cassandra nodes.

internode_authenticator (default AllowAllInternodeAuthenticator): IInternodeAuthenticator interface implementation for internode communication.

start_native_transport (default false): Enables the CQL native transport for clients.

native_transport_port (default 9042): CQL native transport port for clients to connect to.

rpc_address (default localhost): Thrift RPC address for clients to connect to.

rpc_port (default 9160): Thrift RPC port for clients to communicate on.

rpc_min_threads (default 16): Minimum number of threads for Thrift RPC.

rpc_max_threads (default 2147483647, the maximum 32-bit signed integer): Maximum number of threads for Thrift RPC.

rpc_recv_buff_size_in_bytes (default disabled): Enable if you want to set a limit on the receiving socket buffer size for Thrift RPC.

rpc_send_buff_size_in_bytes (default disabled): Enable if you want to set a limit on the sending socket buffer size for Thrift RPC.

incremental_backups (default false): If enabled, Cassandra will hard-link flushed sstables to a backup directory under data_file_directories/keyspace/backup.

snapshot_before_compaction (default false): If enabled, snapshots will be created before each compaction under the data_file_directories/keyspace/snapshots directory.

auto_snapshot (default true): If disabled, a snapshot will not be taken in the case of DML operations (truncate, drop) over a keyspace.

concurrent_compactors (default equals the number of processors): Equal to cassandra.available_processors (if defined), else the number of available processors.

multithreaded_compaction (default false): If enabled, a single thread per processor will be used for compaction.

compaction_throughput_mb_per_sec (default 16): Data compaction flow in megabytes per second. More compaction throughput will ensure fewer sstables and more free space on disk.

endpoint_snitch (default SimpleSnitch): A very important configuration. A snitch can also be termed an informer, useful for routing requests to replica nodes in the cluster.
Available values are SimpleSnitch, PropertyFileSnitch, RackInferringSnitch, Ec2Snitch, and Ec2MultiRegionSnitch. (I will cover snitch configuration in later chapters.)
  • 26. Chapter 1 ■ NoSQL: Cassandra Basics 18 Commit Log Archival To enable Cassandra for auto commit log archiving and restore for recovery (supported since 1.1.1.), the commitlog_archiving.properties file is used. It configures archive_command and restore_command properties. Commit log archival is also referred to as write ahead log (WAL) archive and used for point-in-time recovery. Cassandra’s implementation is similar to Postgresql. Postgresql is an object-oriented relational database management system (OORDBMS) that offers wal_level settings with minimum as the lowest, followed by archive and hot_standby levels to allow executing queries during recovery. For more details on Postgresql refer to http://www.postgresql.org/. archive_command Enable archive_command for implicit commit log archival using a command such as: archive_command= /bin/ln %path /home/backup/%name Here %path is a fully qualified path of the last active commit log segment and %name is the name of commit log. The above-mentioned shell command will create a hard link for the commit log segment (%path). If row mutation size exceeds commitlog_segment_size_in_mb, Cassandra archives this segment using the archive command under /home/backup/. Here %path is the name of latest old segment and %name is commit log file name. restore_command Leaving restore_command and restore_directories blank in commitlog_archiving.properties during bootstrap Cassandra will replay these log files using the restore_command: restore_command=cp -f %from %to Table 1-1. (continued) Property Default Description request_scheduler NoScheduler Client request scheduler. By default no scheduling is done, but we can configure this to RoundRobinScheduler or a custom implementation. It will queue up client dml request and finally release it after successfully processing the request. server_encryption_ options None To enable encryption for internode communication. Available values are all, none, dc, and rack. 
client_encryption_ options false(not enabled) To enable client/server communication. If enabled must specify ssl_storage_port. As it will be used for client/ server communication. internode_ compression All To compress traffic in internode communication. Available values are: all, dc, and none. inter_dc_tcp_nodelay True Setting it to false will cause less congestion over TCP protocol but increased latency.
Here %from is the value specified as restore_directories and %to is the next commit log segment file under commitlog_directory. One advantage of this continuous commit log is high availability of data, also termed warm standby.

Configuring Replication and Data Center

Recently, the need for big data heterogeneous systems has evolved. Components in such systems are diverse in nature and can be made up of different data sets. Considering the nature, locality, and quantity of the data volume, it is quite possible that such systems may need to interconnect with data centers at different physical locations. A data center is a hardware system (say, a commodity server) that consists of multiple racks. A rack may contain one or more nodes (see Figure 1-8).

Figure 1-8. Image depicting a Cassandra data center

Reasons for maintaining multiple data centers include high availability, standby nodes, and data recovery. With high availability, any incoming request must be served with minimum latency. Data replication is a mechanism to keep a redundant copy of the same data on multiple nodes. As explained above, a data center consists of multiple racks, with each rack containing multiple nodes. A data replication strategy is vital in order to enable high availability and survive node failure, covering situations like:

• Local reads (high availability)
• Fail-over (node failure)

Considering these factors, we should replicate data on multiple nodes in the same data center but on different racks. This avoids read/write failure (in case of network connection issues, power failure, etc.) of nodes in the same rack.
Replication means keeping redundant copies of data over multiple data nodes for high availability and consistency. With Cassandra we can configure the replication factor and replication strategy class while creating a keyspace:

CREATE KEYSPACE apress WITH replication = {'class': 'SimpleStrategy', 'replication_factor' : 3}; // cql3 script

create keyspace apress with placement_strategy='org.apache.cassandra.locator.SimpleStrategy' and strategy_options={replication_factor:1}; // using cassandra-cli thrift

Note ■ Schema creation and management via CQL3 and Cassandra-cli will be discussed in Chapter 2.

Here, SimpleStrategy is the replication strategy and the replication_factor is 3. With SimpleStrategy configured like this, each data row will be replicated on 3 replica nodes, synchronously or asynchronously (depending on the write consistency level), in clockwise direction around the ring. The strategy class options supported by Cassandra are:

• SimpleStrategy
• LocalStrategy
• NetworkTopologyStrategy

LocalStrategy

LocalStrategy is available for internal purposes and is used for the system and system_auth keyspaces. system and system_auth are internal keyspaces, implicitly handled by Cassandra's storage architecture for managing authorization and authentication. These keyspaces also keep metadata about user-defined keyspaces and column families. In the next chapter we will discuss them in detail. Trying to create a keyspace with LocalStrategy as the strategy class is not permitted in Cassandra and gives an error like "LocalStrategy is for Cassandra's internal purpose only".

NetworkTopologyStrategy

NetworkTopologyStrategy is preferred if multiple replica nodes need to be placed on different data centers.
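The clockwise replica placement described above for SimpleStrategy can be sketched in a few lines. This is a simplified model, not Cassandra's actual code; the ring layout and node names are invented for illustration.

```python
from bisect import bisect_left

def simple_strategy_replicas(ring, row_token, rf):
    """Walk the ring clockwise from the row's token and collect the
    next `rf` distinct nodes, wrapping around at the end of the ring."""
    tokens = sorted(ring)                       # ring: {token: node}
    distinct = len(set(ring.values()))
    i = bisect_left(tokens, row_token)          # first node at or past the row
    replicas = []
    while len(replicas) < min(rf, distinct):
        node = ring[tokens[i % len(tokens)]]
        if node not in replicas:                # skip duplicate owners
            replicas.append(node)
        i += 1
    return replicas

# Four nodes, one token each; a row hashing to token 30 lands on node-c.
ring = {0: 'node-a', 25: 'node-b', 50: 'node-c', 75: 'node-d'}
print(simple_strategy_replicas(ring, 30, 3))    # ['node-c', 'node-d', 'node-a']
```

With replication_factor 3, the row's owner plus the next two distinct nodes clockwise hold copies, which is why losing a single rack need not lose the row.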
We can create a keyspace with this strategy as CREATE KEYSPACE apress WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1' : 2, 'dc2' : 3}; Here dc1 and dc2 are data center names with replication factor of 2 and 3 respectively. Data center names are derived from a configured snitch property.
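The per-data-center factors in that CREATE KEYSPACE can be illustrated with a similarly hedged sketch of NetworkTopologyStrategy placement, ignoring its rack-awareness refinements: walk the ring clockwise and accept a node only while its data center still needs replicas. Ring, node, and data center names here are made up.

```python
from bisect import bisect_left

def network_topology_replicas(ring, dc_of, row_token, rf_per_dc):
    """ring: {token: node}; dc_of: {node: data center};
    rf_per_dc: e.g. {'dc1': 2, 'dc2': 3} as in the CREATE KEYSPACE above."""
    tokens = sorted(ring)
    needed = dict(rf_per_dc)                    # replicas still owed per DC
    replicas = []
    i = bisect_left(tokens, row_token)
    for _ in range(len(tokens)):                # at most one full ring walk
        node = ring[tokens[i % len(tokens)]]
        if node not in replicas and needed.get(dc_of[node], 0) > 0:
            replicas.append(node)
            needed[dc_of[node]] -= 1
        i += 1
    return replicas

ring = {0: 'a1', 20: 'b1', 40: 'a2', 60: 'b2', 80: 'a3'}
dc_of = {'a1': 'dc1', 'a2': 'dc1', 'a3': 'dc1', 'b1': 'dc2', 'b2': 'dc2'}
print(network_topology_replicas(ring, dc_of, 10, {'dc1': 2, 'dc2': 1}))
# ['b1', 'a2', 'a3']
```

Note how b2 is skipped once dc2 already has its single replica; each data center's quota is filled independently.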
SimpleStrategy

SimpleStrategy is recommended for multiple nodes over multiple racks in a single data center:

CREATE KEYSPACE apress WITH replication = {'class': 'SimpleStrategy', 'replication_factor' : 3};

Here, a replication factor of 3 means data is replicated on 3 nodes, and the strategy class SimpleStrategy means those Cassandra nodes are within the same data center.

Cassandra Multiple Node Configuration

In this section, we will discuss multiple Cassandra node configurations, both on a single machine and over Amazon EC2 instances. The single-machine setup shows how to configure a Cassandra cluster on physical boxes, while the AWS-based setup introduces users to running a Cassandra cluster in the cloud.

Configuring Multiple Nodes over a Single Machine

Configuring multiple nodes over a single machine is more of an experiment; for a production application you would configure a Cassandra cluster over multiple machines. Setting up a multinode cluster over a single machine or over multiple machines is similar, which is what we will cover in this sample exercise. In this example, we will configure 3 nodes (127.0.0.2-4) on a single machine.

1. We need to map hostnames to IP addresses.

a. On Windows and Linux, this configuration lives in the hosts file (C:\Windows\System32\drivers\etc\hosts on Windows, /etc/hosts on Linux). Modify the file to add the 3-node configuration as:

127.0.0.1 127.0.0.2
127.0.0.1 127.0.0.3
127.0.0.1 127.0.0.4

b. For Mac OS, we need to create those aliases as:

sudo ifconfig lo0 alias 127.0.0.2 up
sudo ifconfig lo0 alias 127.0.0.3 up
sudo ifconfig lo0 alias 127.0.0.4 up

2. Unzip the downloaded Cassandra tarball in 3 different folders (one for each node). Assign each node an identical cluster_name as:

# The name of the cluster. This is mainly used to prevent machines in
# one logical cluster from joining another.
cluster_name: 'Test Cluster'
3. We should hold identical seeds on each node in the cluster. These are used just to initiate the gossip protocol among nodes in the cluster. Configure seeds in cassandra.yaml as:

seed_provider:
    # Addresses of hosts that are deemed contact points.
    # Cassandra nodes use this list of hosts to find each other and learn
    # the topology of the ring. You must change this if you are running
    # multiple nodes!
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          # seeds is actually a comma-delimited list of addresses.
          # Ex: ip1,ip2,ip3
          - seeds: 127.0.0.2

4. Change the listen_address and rpc_address configurations to 127.0.0.2, 127.0.0.3, and 127.0.0.4 in each cassandra.yaml file. Since all 3 nodes are running on the same machine, also change the rpc_port to 9160, 9161, and 9162 respectively.

5. Here we have the option to choose between 1 token per node or multiple tokens per node. Cassandra 1.2 introduced the "virtual nodes" feature, which allows assigning a range of tokens to a node. We will discuss virtual nodes in a coming chapter. Leave initial_token empty and keep num_tokens as 2 (the recommended value is 256).

6. Next, assign a different JMX_PORT (say 8081, 8082, and 8083) to each node.

a. With Linux, modify $CASSANDRA_HOME/conf/cassandra-env.sh as:

# Specifies the default port over which Cassandra will be available for
# JMX connections.
JMX_PORT=7199

b. With Windows, modify $CASSANDRA_HOME/bin/cassandra.bat as:

REM ***** JAVA options *****
set JAVA_OPTS=-ea^
 -javaagent:%CASSANDRA_HOME%\lib\jamm-0.2.5.jar^
 -Xms1G^
 -Xmx1G^
 -XX:+HeapDumpOnOutOfMemoryError^
 -XX:+UseParNewGC^
 -XX:+UseConcMarkSweepGC^
 -XX:+CMSParallelRemarkEnabled^
 -XX:SurvivorRatio=8^
 -XX:MaxTenuringThreshold=1^
 -XX:CMSInitiatingOccupancyFraction=75^
 -XX:+UseCMSInitiatingOccupancyOnly^
 -Dcom.sun.management.jmxremote.port=7199^
 -Dcom.sun.management.jmxremote.ssl=false^
 -Dcom.sun.management.jmxremote.authenticate=false^
 -Dlog4j.configuration=log4j-server.properties^
 -Dlog4j.defaultInitOverride=true
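The per-node differences in steps 2 through 6 are small enough to generate. The sketch below is a hypothetical helper (not part of Cassandra) that produces the overrides for the three loopback nodes; every value mirrors the exercise above.

```python
# Shared settings from the exercise; each node then gets its own address/ports.
BASE = {'cluster_name': 'Test Cluster', 'seeds': '127.0.0.2',
        'num_tokens': 2, 'initial_token': ''}

def node_overrides(index):
    """index 0..2 -> settings for 127.0.0.2-4: rpc_port 9160-9162, JMX 8081-8083."""
    cfg = dict(BASE)
    cfg['listen_address'] = '127.0.0.%d' % (index + 2)
    cfg['rpc_address'] = cfg['listen_address']
    cfg['rpc_port'] = 9160 + index
    cfg['jmx_port'] = 8081 + index
    return cfg

for i in range(3):
    print(node_overrides(i))
```

Writing the three values back into each node's cassandra.yaml (and JMX_PORT into its env script) is all that distinguishes the nodes.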
7. Let's start each node one by one and check the ring status:

$CASSANDRA_HOME/apache-cassandra-1.2.4/bin/nodetool -h 127.0.0.2 -p 8081 ring

Figure 1-9 shows the ring status while connecting to one Cassandra node using JMX. Since Cassandra's architecture is peer-to-peer, checking the ring status on any node will yield the same result.

Figure 1-9. The ring status

Configuring Multiple Nodes over Amazon EC2

Amazon Elastic Compute Cloud (Amazon EC2) is one of the important parts of the Amazon Web Services (AWS) cloud computing platform. AWS lets you choose an OS platform and provides the required hardware support over the cloud, which allows you to quickly set up and deploy applications on the cloud computing platform. To learn more about Amazon EC2 setup please refer to http://aws.amazon.com/ec2/. In this section, we will learn how to configure multiple Cassandra nodes over Amazon EC2. To do so, follow these steps.

1. First let's launch 2 instances of AMI (ami-00730969), as shown in Figure 1-10.

Figure 1-10. EC2 console display with 2 instances in running state
2. Modify the security group to open ports 9160, 7000, and 7199, as in Figure 1-11.

Figure 1-11. Configuring security group settings

3. Connect to each instance and download the Cassandra tarball as:

wget http://archive.apache.org/dist/cassandra/1.2.4/apache-cassandra-1.2.4-bin.tar.gz

4. Download and set up Java on each EC2 instance using the rpm installer as:

sudo rpm -i jdk-7-linux-x64.rpm
sudo rm -rf /usr/bin/java
sudo ln -s /usr/java/jdk1.7.0/bin/java /usr/bin/java
sudo rm -rf /usr/bin/javac
sudo ln -s /usr/java/jdk1.7.0/bin/javac /usr/bin/javac

5. Multiple Cassandra node configuration is the same as discussed in the previous section. Here we will demonstrate using a single token per node (initial_token). Let's assign initial token values of 0 and 1. We can assign initial_token values by modifying the cassandra.yaml file on each node.

Figure 1-12. initial_token configuration for both nodes
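For an evenly balanced single-token cluster, initial_token values are normally computed by dividing the partitioner's token range by the node count (the 0 and 1 used above work, but distribute data very unevenly). A sketch of the standard calculation, assuming RandomPartitioner's [0, 2**127) range and Murmur3Partitioner's signed 64-bit range:

```python
def initial_tokens(node_count, partitioner='random'):
    """Evenly spaced initial_token values for a balanced ring."""
    if partitioner == 'random':         # tokens in [0, 2**127)
        step = 2**127 // node_count
        return [i * step for i in range(node_count)]
    step = 2**64 // node_count          # murmur3: tokens in [-2**63, 2**63)
    return [-(2**63) + i * step for i in range(node_count)]

print(initial_tokens(2))   # [0, 85070591730234615865843651857942052864]
```

For the two EC2 instances this would give tokens 0 and 2**126, so each node owns half the ring.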
6. Make either of the two instances the seed node, and keep the storage port, jmx_port, and rpc_port at 7000, 7199, and 9160.

7. Let's keep listen_address and rpc_address empty (the default is the node's inet address, underlined in Figure 1-13).

Figure 1-13. How to get the inet address for a node

8. Let's start each node one by one and check the ring status. Verify that both EC2 instances are up, running, and connected using the ring topology. Figure 1-14 shows the ring status of both running EC2 instances.

Figure 1-14. The two EC2 instances and their ring statuses
9. Figure 1-15 shows instance 10.145.213.3 up and joining the cluster ring.

Figure 1-15. Node 10.145.213.3 is up and joining the cluster

Summary

This chapter is an introduction covering generic NoSQL concepts and Cassandra-specific configuration. For application developers it is really important to understand the essence of replication and data distribution, and most importantly how to set them up with Cassandra. Now we are ready for the next challenge: handling big data with Cassandra! In the next chapter we will discuss Cassandra's storage mechanism and data modeling. Understanding data modeling and Cassandra's storage architecture will help us model the data set and analyze the best possible approaches available with Cassandra.
Chapter 2

Cassandra Data Modeling

In the previous chapter we discussed Cassandra configuration, installation, and cluster setup. This chapter will walk you through:

• Data modeling concepts
• Cassandra collection support
• CQL vs. Thrift based schema
• Managing data types
• Counter columns

Get ready to learn with an equal balance of theoretical and practical approaches.

Introducing Data Modeling

Data modeling is a mechanism to define read/write requirements and build a logical structure and object model. Cassandra is a NoSQL database and promotes a read-before-write, or ready-for-read, design rather than the relational model: you analyze your data read requirements first and store the data in the same form. Consider managing data volumes of petabytes or zettabytes, where we cannot afford in-memory computations (e.g., joins) because of the data volume. Hence it is preferable to have the data set ready for retrieval for large data analytics. Users need not know about columns up front, but should avoid storing flat columns that require computations (e.g., aggregation, joins, etc.) at read time. Cassandra is a column-family-oriented database. A column family, as the name suggests, is a "family of columns." Each row in Cassandra may contain one or more columns. A column is the smallest unit of data, containing a name, value, and timestamp (see Figure 2-1).

Figure 2-1. Cassandra column definition
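The column triple in Figure 2-1 can be sketched in a few lines; the reconciliation rule (newer timestamp wins) is what lets Cassandra resolve conflicting writes to the same column. This is an illustrative model, not Cassandra's internal representation.

```python
from collections import namedtuple

# The smallest unit of data: name, value, and write timestamp.
Column = namedtuple('Column', 'name value timestamp')

def reconcile(a, b):
    """Last-write-wins: of two versions of the same column, keep the newer."""
    return a if a.timestamp >= b.timestamp else b

old = Column('email', 'vivek@old.example', timestamp=100)
new = Column('email', 'vivek@new.example', timestamp=200)
print(reconcile(old, new).value)   # vivek@new.example
```

Because the timestamp travels with every column, two replicas that received the writes in different orders still converge on the same value.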
By default the Cassandra distribution comes with the cqlsh and Cassandra-cli command line clients for manipulating data. Cassandra-cli and cqlsh (.sh and .bat) are available under the bin folder. Running these command line clients on Linux, Windows, or Mac is fairly easy. Running the shell files on a Linux or Mac box simply requires running cqlsh. However, running cqlsh on Windows requires Python to be installed. To install cqlsh on Windows, follow these steps:

1. First, download Python from https://www.python.org/ftp/python/2.7.6/python-2.7.6.msi.

2. Add python.exe to PATH under environment variables.

3. Run setup.py, available under the $CASSANDRA_HOME/pylib directory:

python setup.py install

4. Run cqlsh, available under the bin directory (see Figure 2-2):

python cqlsh

Figure 2-2. Successfully connected to the CQL shell

Data Types

Before CQL's evolution, data types in Cassandra were defined in the form of a comparator and validator. The type of a column or row key value is referred to as a validator, whereas the type of a column name is called a comparator. Available data types are shown in Figure 2-3.

Figure 2-3. Cassandra's supported data types
  • 37. Chapter 2 ■ Cassandra Data Modeling 29 Dynamic Columns Since its inception, Cassandra is projected as a schema-less, column-family–oriented distributed database. The number of columns may vary for each row in a column family. A column definition can be added dynamically at run time. Cassandra-cli (Thrift) and cqlsh (CQL3) are two command clients we will be using for various exercises in this chapter. Dynamic Columns via Thrift Let’s discuss a simple Twitter use case. In this example we would explore ways to model and store dynamic columns via Thrift. 1. First, let’s create a keyspace twitter and column family users: create keyspace twitter with strategy_options={replication_factor:1} and placement_strategy='org.apache.cassandra.locator.SimpleStrategy'; use twitter; create column family users with key_validation_class='UTF8Type' and comparator='UTF8Type' and default_validation_class='UTF8Type'; Here, while defining a column family, we did not define any columns with the column family. Columns will be added on the fly against each row key value. 2. Store a few columns in the users column family for row key value 'imvivek': set users['imvivek']['apress']='apress author'; set users['imvivek']['team_marketing']='apress marketing'; set users['imvivek']['guest']='guest user'; set users['imvivek']['ritaf']='rita fernando'; Here we are adding followers as dynamic columns for user imvivek. 3. Let’s add 'imvivek' and 'team_marketing' as followers for 'ritaf': set users['ritaf']['imvivek']='vivek mishra'; set users['team_marketing']['imvivek']='vivek mishra'; 4. To view a list of rows in users column family (see Figure 2-4), use the following command: list users;
  • 38. Chapter 2 ■ Cassandra Data Modeling 30 In Figure 2-4, we can see column name and their values against each row key stored in step 3. 5. We can delete columns for an individual key as well. For example, to delete a column 'apress' for row key 'imvivek': del users['imvivek']['apress']; Figure 2-5 shows the number of columns for imvivek after step 5. Figure 2-4. Output of selecting users Figure 2-5. The number of columns for imvivek after deletion Here column name is the follower’s twitter_id and their full name is column value. That’s how we can manage schema and play with dynamic columns in Thrift way. We will discuss dynamic column support with CQL3 in Chapter 3.
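The wide-row behavior of the Thrift session above can be mimicked with a toy model: each row key carries its own, independently named columns, which is all "dynamic columns" really means. The names mirror the example; the class itself is hypothetical.

```python
class ColumnFamily:
    """Toy wide-row store: row key -> {column name -> value}."""
    def __init__(self):
        self.rows = {}

    def set(self, row_key, column, value):    # like: set users['rk']['col']='v'
        self.rows.setdefault(row_key, {})[column] = value

    def delete(self, row_key, column):        # like: del users['rk']['col']
        self.rows.get(row_key, {}).pop(column, None)

users = ColumnFamily()
users.set('imvivek', 'apress', 'apress author')
users.set('imvivek', 'ritaf', 'rita fernando')
users.set('ritaf', 'imvivek', 'vivek mishra')   # a different column set per row
users.delete('imvivek', 'apress')
print(sorted(users.rows['imvivek']))            # ['ritaf']
```

Note that imvivek and ritaf end up with completely different column names, exactly as in Figures 2-4 and 2-5.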
Dynamic Columns via cqlsh Using Map Support

In this section, we will discuss how to implement the same Twitter use case using map support. Collection support in Cassandra works only with the CQL3 binary protocol.

1. First, let's create a keyspace twitter and column family users:

create keyspace twitter with replication = {'class':'SimpleStrategy', 'replication_factor':3};
use twitter;
create table users(twitter_id text primary key, followers map<text,text>);

2. Store a few columns in the users column family for row key value 'imvivek':

insert into users(twitter_id,followers) values('imvivek',{'guestuser':'guest', 'ritaf':'rita fernando','team_marketing':'apress marketing'});

Here we are adding followers as dynamic columns in the form of map attributes for user imvivek.

3. Let's add 'imvivek' and 'team_marketing' as followers for 'ritaf':

insert into users(twitter_id,followers) values('ritaf',{'imvivek':'vivek mishra'});
insert into users(twitter_id,followers) values('team_marketing', {'imvivek':'vivek mishra'});

4. To view the list of rows in the users column family (see Figure 2-6), use the following command:

select * from users;

Figure 2-6. Map containing followers for user

5. To add 'team_marketing' as a follower for 'ritaf' and vice versa (see Figure 2-7), we can simply add map elements in the users column family:

update users set followers = followers + {'team_marketing':'apress marketing'} where twitter_id='ritaf';
update users set followers = followers + {'ritaf':'rita fernando'} where twitter_id='team_marketing';
Figure 2-7. After update, map of followers for each user

6. An update works as an insert if the row key doesn't exist in the database. For example:

update users set followers = followers + {'ritaf':'rita fernando'} where twitter_id='jhassell'; // update as insert

Figure 2-8 shows that ritaf has been added as a follower of jhassell.

Figure 2-8. Update works as an insert for map of followers for a nonexisting row key (e.g., twitter_id)

7. To delete an element from the map, use this command:

delete followers['guestuser'] from users where twitter_id='imvivek';

You can see that the list of followers for imvivek is reduced to four followers after deletion (see Figure 2-9).

Figure 2-9. After deleting guestuser as a follower for imvivek
With that said, we can add dynamic columns as key-value pairs using collection support.

Dynamic Columns via cqlsh Using Set Support

Consider a scenario where the user wants to store only a collection of follower ids (not full names). Cassandra offers collection support for keeping a list or set of such elements. Let's discuss how to implement it using set support.

1. First, let's create a keyspace twitter and column family users:

create keyspace twitter with replication = {'class':'SimpleStrategy', 'replication_factor':3};
use twitter;
create table users(twitter_id text primary key, followers set<text>);

2. Store a few columns in the users column family for row key value 'imvivek':

insert into users(twitter_id,followers) values('imvivek', {'guestuser','ritaf','team_marketing'});

Here we are adding followers as dynamic columns in the form of set attributes for user imvivek.

3. Let's add 'imvivek' and 'team_marketing' as followers for 'ritaf', and 'ritaf' as a follower for 'jhassell':

insert into users(twitter_id,followers) values('ritaf', {'imvivek','jhassell', 'team_marketing'});
insert into users(twitter_id,followers) values('jhassell', {'ritaf'});

4. To view the list of rows in the users column family (see Figure 2-10), use the following command:

select * from users;

Figure 2-10. Followers for ritaf, jhassell, and imvivek have been added

5. We can update the collection to delete an element as follows; Figure 2-11 shows the result:

update users set followers = followers - {'guestuser'} where twitter_id = 'imvivek';
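The collection-update semantics seen here and in the map examples boil down to merge, remove, and upsert. A small sketch, using a Python set per row to stand in for the CQL set column (the helper name is invented):

```python
def update_followers(table, twitter_id, add=(), remove=()):
    """Mirror CQL collection semantics: '+' merges elements, '-' removes
    them, and an UPDATE on a missing row key behaves as an insert (upsert)."""
    row = table.setdefault(twitter_id, set())   # upsert: missing row is created
    row |= set(add)
    row -= set(remove)

users = {'imvivek': {'guestuser', 'ritaf', 'team_marketing'}}
update_followers(users, 'imvivek', remove={'guestuser'})
update_followers(users, 'jhassell', add={'ritaf'})   # update works as insert
print(users['imvivek'], users['jhassell'])
```

The upsert behavior is why step 6 of the map example succeeded for a twitter_id that had never been inserted.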
Collection support can be a good alternative for adding dynamic columns over Cassandra. A composite key is a combination of multiple table fields, where the first part is referred to as the partition key and the remaining part of the composite key is known as the cluster key. Chapter 3 will discuss achieving dynamic columns using composite columns.

Secondary Indexes

In a distributed cluster, data for a column family is distributed across multiple nodes, based on the replication factor and partitioning schema. However, data for a given row key value will always be on the same node. Using the primary index (e.g., row key) we can always retrieve a row. But what about retrieving it using non-row-key values? Cassandra provides support for indexes over column values, called secondary indexes. Chapter 3 will cover more about indexes, so for now let's just take a look at a simple secondary index example. Let's revisit the same Twitter example and see how we can utilize and enable secondary index support.

1. First, let's create the twitter keyspace and column family users:

create keyspace twitter with replication = {'class' : 'SimpleStrategy' , 'replication_factor' : 3};
use twitter;
create table users(user_id text PRIMARY KEY, fullname text, email text, password text, followers map<text,text>);

2. Insert a user with e-mail and password:

insert into users(user_id,email,password,fullname,followers) values ('imvivek', 'imvivek@xxx.com','password','vivekm',{'mkundera':'milan kundera','guest': 'guestuser'});

Before we move ahead with this exercise, it's worth discussing which columns should be indexed. Any read request using a secondary index will actually be broadcast to all nodes in the cluster. Cassandra maintains a hidden column family for each secondary index locally on the node, which is scanned when retrieving rows using secondary indexes.
While performing data modeling, we should create secondary indexes over column values which should return a big chunk of data over a very large data set. Indexes over unique values of small data sets would simply become an overhead, which is not a good data modeling practice. Index over fullname is a possible candidate for indexing. 3. Let’s create secondary index over fullname create index fullname_idx on users(fullname); Figure 2-11. Updated set of followers after removing guestuser for imvivek
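The "hidden column family" behavior can be sketched as a local inverted index per node plus a scatter-gather query; this also shows why a secondary-index read touches every node in the cluster. The node layout and data here are illustrative.

```python
class IndexedNode:
    """Each node keeps its rows plus a local fullname -> row keys index."""
    def __init__(self):
        self.rows, self.index = {}, {}

    def insert(self, user_id, fullname):
        self.rows[user_id] = fullname
        self.index.setdefault(fullname, set()).add(user_id)

def query_by_fullname(cluster, fullname):
    hits = set()
    for node in cluster:          # a secondary-index read fans out to all nodes
        hits |= node.index.get(fullname, set())
    return hits

n1, n2 = IndexedNode(), IndexedNode()
n1.insert('imvivek', 'vivekm')
n2.insert('mkundera', 'milan kundera')
print(query_by_fullname([n1, n2], 'vivekm'))   # {'imvivek'}
```

The fan-out is also why indexing near-unique values of a small data set is poor practice: every node is queried for at most a handful of hits.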
After successful index creation, we can fetch records using fullname. Figure 2-12 shows the result.

Figure 2-12. Search user for records having fullname value vivekm

4. Let's add a column age and create an index over it:

alter table users add age text;
create index age_idx on users(age);
update users set age='32' where user_id='imvivek';
insert into users(user_id,email,password,fullname,followers,age) values ('mkundera','mkundera@outlook.com','password','milan kundera',{'imvivek':'vivekm','guest': 'guestuser'},'51');

Figure 2-13 shows the outcome.

Figure 2-13. Selecting all users of age 51

5. Let's alter the data type of age to int:

alter table users alter age type int;

It will result in the following error:

TSocket read 0 bytes (via cqlsh)

6. To alter the data type of indexed columns we need to rebuild them:

drop index age_idx;
alter table users alter age type int;

But please note that in such cases, it may leave the data set in an incompatible state (see Figure 2-14).
  • 44. Chapter 2 ■ Cassandra Data Modeling 36 Here is the error: Failed to decode value '51' (for column 'age') as int: unpack requires a string argument of length 4 Failed to decode value '32' (for column 'age') as int: unpack requires a string argument of length 4 Hence it is recommended to change data types on indexed columns, when there is no data available for that column. Indexes over collections are not supported in Cassandra 2.0. Figure 2-15 shows what happens if we try to create an index follower. However, before this book went to press, version 2.1 was released and added this capability. See “Indexing on Collection Attributes” in Chapter 11. Figure 2-14. Error while changing data type to int from string Figure 2-15. Indexes over collections are not supported in Cassandra 2.0 Note ■ ■ Updates to the data type of clustering keys and indexes are not allowed. CQL3 and Thrift Interoperability Prior to CQL existence, Thrift was the only way to develop an application over Cassandra. CQL3 and Thrift interoperability issues are often discussed within the Cassandra community. Let’s discuss some issues with a simple example: 1. First, let’s create a keyspace and column family using CQL3. create keyspace cql3usage with replication = {'class' : 'SimpleStrategy' , 'replication_factor' : 3}; use cql3usage; create table user(user_id text PRIMARY KEY, first_name text, last_name text, emailid text); 2. Let’s insert one record: insert into user(user_id,first_name,last_name,emailid) values('@mevivs','vivek','mishra','vivek.mishra@xxx.com');
Figure 2-16. Describes table user

3. Now, connect with Cassandra-cli (the Thrift way) and update the user column family to create indexes over last_name and first_name:

update column family user with key_validation_class='UTF8Type' and column_metadata=[{column_name:last_name, validation_class:'UTF8Type', index_type:KEYS}, {column_name:first_name, validation_class:'UTF8Type', index_type:KEYS}];

Note ■ Chapter 3 will cover indexing in detail.

4. Now explore the user column family with CQL3, and see the result in Figure 2-16:

describe table user;

Metadata has been changed, and the columns (first_name and last_name) modified via Thrift are no longer available with CQL3! Don't worry! Data is not lost, as CQL3 and Thrift rely on the same storage engine, and we can always get that metadata back by rebuilding it.

5. Let's rebuild first_name and last_name:

alter table user add first_name text;
alter table user add last_name text;

The problem is with CQL3's sparse tables. CQL3 has different metadata (CQL3Metadata) that has NOT been added to Thrift's CFMetaData. Do not mix and match CQL3 and Thrift to perform DDL/DML operations; it will always leave one of these metadata in an inconsistent state. A developer who can't afford losing Thrift's dynamic column support may still prefer to perform inserts via Thrift but read them back via CQL3. It is recommended to use CQL3 for new application development over Cassandra. However, it has been noticed that Thrift-based mutation still works faster than CQL3 (such as batch operations) up to the Cassandra 1.x.x releases. This is scheduled to be addressed in the Cassandra 2.0.0 release (https://issues.apache.org/jira/browse/CASSANDRA-4693).
  • 46. Chapter 2 ■ Cassandra Data Modeling 38 Changing Data Types Changing data types with Cassandra is possible in two ways, Thrift and CQL3. Thrift Way Let’s discuss more about data types with legacy Thrift API: 1. Let’s create a column family with minimal definition, such as: create keyspace twitter with strategy_options={replication_factor:1} and placement_strategy='org.apache.cassandra.locator.SimpleStrategy'; use twitter; create column family default; Default data type for comparator and validator is BytesType. 2. Let’s describe the keyspace and have a look at the default column family (see Figure 2-17): describe twitter; Figure 2-18. Error while storing string value but column value is of bytes type Figure 2-17. Structure of twitter keyspace 3. Let’s try to store some data in the column family: set default[1]['type']='bytes'; gives an error Figure 2-18 shows that this produces an error.
  • 47. Chapter 2 ■ Cassandra Data Modeling 39 Since the comparator and validator are set to default data type (e.g., BytesType), Cassandra-cli is not able to parse and store such requests. 4. To get step 3 working, we need to use the assume function to provide some hint: assume default keys as UTF8Type; assume default comparator as UTF8Type; assume default validator as UTF8Type; 5. Now let’s try to change the comparator from BytesType to UTF8Type: update column family default with comparator='UTF8Type'; gives error This generates an error because changing the comparator type is not allowed (see Figure 2-19). Figure 2-19. Changing comparator type is not allowed Figure 2-20. Retrieving values using cql shell 6. Although changing comparator type is not allowed, we can always change the data type of the column and key validation class as follows: update column family default with key_validation_class=UTF8Type and default_validation_class = UTF8Type; Columns in a row are sorted by column names and that’s where comparator plays a vital role. Based on comparator type (i.e., UTF8Type, Int32Type, etc.) columns can be stored in a sorted manner. CQL3 Way Cassandra CQL3 is the driving factor at present. Most of the high-level APIs are supporting and extending further development around it. Let’s discuss a few tricks while dealing with data types in CQL3 way. We will explore with the default column family created in the Thrift way (see the preceding section). 1. Let’s try to fetch rows from the default column family (see Figure 2-20). Select * from default;
  • 48. Chapter 2 ■ Cassandra Data Modeling 40 2. Let’s issue the assume command and try to fetch rows from the default column family in readable format: assume default(column1) values are text; assume default(value) values are text; assume default(key) values are text; select * from default; Figure 2-21 shows the result. Figure 2-21. Retrieving after assume function is applied 3. typeAsBlob or blobAsType functions can also be used to marshal data while running CQL3 queries: select blobAsText(key),blobAsText(type),blobAsText(value) from default; 4. We can alter the data type of validator as follows: alter table default alter value type text; alter table default alter key type text; Note ■ ■  The assume command will not be available after Cassandra 1.2.X release. As an alternative we can use typeAsBlob (e.g., textAsBlob) CQL3 functions. Counter Column Distributed counters are incremental values of a column partitioned across multiple nodes. Counter columns can be useful to provide counts and aggregation analytics for Cassandra-powered applications (e.g., Number of page hits, number of active users, etc.). In Cassandra, a counter is a 64-bit signed integer. A write on counter will require a read from replica nodes (this depends on consistency level, default is ONE). While reading a counter column value, read has to be consistent. Counter Column with and without replicate_on_write Default value of replicate_on_write is true. If set to false it will replicate on one replica node (irrespective of replication factor). That might be helpful to avoid read-before-write on serving write request. But any subsequent read may not be consistent and may also result in data loss (single replica node is gone!).
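That trade-off can be shown with a toy model, assuming full-value copies per replica (real counter internals use per-replica shards, so this is only illustrative): with replicate_on_write true, every replica sees the increment and any read at consistency level ONE is accurate; with it false, only one replica does.

```python
import random

class CounterReplicas:
    """Toy counter: one value per replica node."""
    def __init__(self, replicas, replicate_on_write):
        self.values = [0] * replicas
        self.replicate_on_write = replicate_on_write

    def increment(self, delta):
        if self.replicate_on_write:
            for i in range(len(self.values)):   # all replicas get the increment
                self.values[i] += delta
        else:
            self.values[0] += delta             # only a single replica updated

    def read(self):
        return random.choice(self.values)       # CL=ONE: any one replica answers

safe = CounterReplicas(2, replicate_on_write=True)
unsafe = CounterReplicas(2, replicate_on_write=False)
for c in (safe, unsafe):
    c.increment(2); c.increment(12)
print(safe.read())                              # always 14
print(sorted(unsafe.values))                    # [0, 14]: a read may see either
```

This mirrors the inconsistent SELECT results shown in the next recipe, and also the data-loss risk: if the single updated replica in the unsafe cluster dies, the count is gone.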
  • 49. Chapter 2 ■ Cassandra Data Modeling 41

Play with Counter Columns

In Chapter 1 we discussed setting up multiple clusters on a single machine. First, start a cluster of three nodes on a single machine (refer to the "Configuring Multiple Nodes on a Single Machine" section in Chapter 1). In this recipe we will discuss the do's and don'ts of using counter columns.

1. Let's create a keyspace counterkeyspace:

create keyspace counterkeyspace with replication = {'class' : 'SimpleStrategy', 'replication_factor' : 2 };

2. Create a column family counternoreptable with replicate_on_write set to false:

create table counternoreptable(id text PRIMARY KEY, pagecount counter) with replicate_on_write='false';

3. Update pagecount to increment by 2 as follows:

update counternoreptable set pagecount=pagecount+2 where id = '1';

4. Select from the column family as follows:

select * from counternoreptable;

As shown in Figure 2-22, this results in zero rows. Whether it returns zero rows depends on which node the counter was written to.

Figure 2-22. Inconsistent result on fetching from counter table

5. Let's update pagecount with some more values and verify the results:

update counternoreptable set pagecount=pagecount+12 where id = '1';
select * from counternoreptable;

Figure 2-23 shows the result of this command.

Figure 2-23. Retrieving from the counter table after incrementing the counter column value

update counternoreptable set pagecount=pagecount-2 where id = '1';
select * from counternoreptable;
  • 50. Chapter 2 ■ Cassandra Data Modeling 42

The result is different for this command (see Figure 2-24).

Figure 2-24. Inconsistent result of counter column without replication

You can see the inconsistent read results with replicate_on_write set to false. From this we can conclude that by disabling this parameter we may avoid a read-before-write on each write request, but subsequent read requests may return inconsistent data. Also, without replication we may suffer data loss if the single replica containing an updated counter value goes down or is damaged. Try the above recipe with replicate_on_write set to true and verify whether the results are consistent and accurate!

Note ■ You may refer to https://issues.apache.org/jira/browse/CASSANDRA-1072 for more on counter columns.

Data Modeling Tips

Cassandra is a column-oriented database that is quite different from a traditional RDBMS. We don't need to define a schema up front, but it is always better to have a good understanding of the requirements and the database before moving ahead with data modeling, including:

• Writes in Cassandra are relatively fast, but reads are not. Analyzing up front how we want to perform read operations is very important before data modeling.
• Data should be denormalized as much as possible.
• Choose the correct partitioning strategy, to avoid having to rebuild and repopulate data after a change of partitioning strategy.
• Prefer surrogate keys and composite keys (over super columns) while modeling a table/column family.

Summary

To summarize a few things discussed in this chapter so far:

• Do not mix Thrift and CQL3 for DDL and DML operations, although reads should be fine.
• Avoid changing data types.
• Use Cassandra collection support for adding columns on the fly.

In Chapter 3, we will continue our discussion by exploring indexes, composite columns, and the latest features introduced in Cassandra 2.0, such as compare-and-set.
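The "model around reads, denormalize" tips above can be made concrete with a classic fan-out-on-write sketch: instead of joining at read time, the same tweet is written into each follower's timeline partition up front. The structures and names below are purely illustrative, not a Cassandra API:

```python
# Fan-out-on-write denormalization sketch (illustrative only):
# each follower's timeline is its own partition, so the hot read
# ("show my timeline") becomes a single-partition lookup with no joins.

timelines = {}                                   # follower -> ordered tweets
followers = {"vivek": ["jonathan", "apress"]}    # author -> follower list

def post_tweet(author, text):
    # The write is duplicated once per follower. Duplicated writes are
    # cheap in Cassandra; the payoff is a trivial read path.
    for f in followers.get(author, []):
        timelines.setdefault(f, []).append(f"{author}: {text}")

post_tweet("vivek", "Cassandra data modeling is read-driven")
print(timelines["jonathan"])  # one partition read, already in timeline order
```

The same data exists in several places, which is exactly what "denormalize as much as possible" asks for: storage is traded for predictable, single-partition reads.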
  • 51. 43 Chapter 3 Indexes and Composite Columns

In previous chapters we have discussed big data problems, Cassandra data modeling concepts, and various schema management techniques. Although you should avoid over-normalizing your data, you still need to model read requirements around columns rather than primary keys in your database applications. The following topics will be covered in this chapter:

• Indexing concepts
• Data partitioning
• Cassandra read/write mechanism
• Secondary indexes
• Composite columns
• What's new in Cassandra 2.0

Indexes

An index in a database is a data structure that enables faster retrieval of a data set in a table. Indexes can be built over single or multiple columns. Indexing is the process of creating and managing a data structure, called an index, for fast data retrieval. Each index consists of indexed field values and references to physical records. In some cases a reference can be the actual row itself; we will discuss these cases in the clustered indexes section.

Physically, data is stored in blocks in data structure form (like the sstable in Cassandra). These data blocks are unordered and distributed across multiple nodes. Accessing data records without a primary key or index would require a linear search across multiple nodes.

Let's discuss the format of the index data structure. Indexes are stored in sorted order in a B-tree (balanced tree) structure, where the indexes are leaf nodes under branch nodes. Figure 3-1 depicts data storage where multi-level leaf nodes (0,1) are indexed in sorted order and the data is unsorted. Here each leaf node is a B-tree node containing multiple keys. With inserts/updates/deletes, the number of keys per B-tree node keeps changing, but the keys remain in sorted order.
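The payoff of keeping index keys sorted, as the B-tree leaves above do, can be approximated in a few lines of Python with the standard `bisect` module: lookup over the sorted keys is a binary search, versus a linear scan over the unordered physical blocks. This is a flat sketch of the idea, not a real multi-level B-tree:

```python
import bisect

# Minimal sorted-index sketch: keys are kept sorted (as B-tree leaves are),
# each pointing at the slot of its record in the unordered data blocks.

records = [("r3", 45), ("r1", 23), ("r2", 31)]          # unordered physical data
index_keys = sorted(k for k, _ in records)               # ['r1', 'r2', 'r3']
positions = {k: i for i, (k, _) in enumerate(records)}   # key -> physical slot

def lookup(key):
    i = bisect.bisect_left(index_keys, key)              # O(log n) binary search
    if i < len(index_keys) and index_keys[i] == key:
        return records[positions[key]]
    return None  # "not found" without scanning every block

print(lookup("r2"))  # ('r2', 31)
```

Every insert or delete has to keep `index_keys` sorted, which is precisely the write-time overhead the chapter attributes to B-tree maintenance.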
  • 52. Chapter 3 ■ Indexes and Composite Columns 44

Let's simplify further. In Figure 3-2, the table containing age and row keys represents the leaf nodes, and the other one is the physical table.

Figure 3-1. B-tree index and data structure with multi-level leaf nodes

Figure 3-2. A physical table and an index table as leaf node

This allows faster retrieval of records using binary search. Since a B-tree keeps data sorted for faster searching, it introduces some overhead on insert, update, and delete operations, which require rearranging the indexes. The B-tree is the preferred data structure for large sets of reads and writes, which is why it is widely used in distributed databases.

Clustered Indexes vs. Non-Clustered Indexes

Indexes that are maintained independently of the physical rows and don't manage the ordering of rows are called non-clustered indexes (see Figure 3-1). Clustered indexes, on the other hand, store the actual rows in sorted order by the index field. Since a clustered index stores and manages the ordering of the physical rows, only one clustered index is possible per table.
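The distinction can be sketched in a few lines: a clustered index is the physical row order itself (so there can be only one), while a non-clustered index is a separate map from field value to row keys (so a table can have many). The toy table below is illustrative only:

```python
# Toy employees table: clustered on department (rows physically reordered),
# plus a non-clustered index on name (a separate structure of row keys).

rows = [
    {"id": 3, "department": "HR",    "name": "carol"},
    {"id": 1, "department": "Sales", "name": "alice"},
    {"id": 2, "department": "HR",    "name": "bob"},
]

# Clustered: the table itself is reordered; only one such ordering can exist.
clustered = sorted(rows, key=lambda r: r["department"])

# Non-clustered: an auxiliary map; we could build many of these per table.
by_name = {}
for r in rows:
    by_name.setdefault(r["name"], []).append(r["id"])

def employees_in(dept):
    # a contiguous range scan over the clustered order
    return [r["id"] for r in clustered if r["department"] == dept]

print(employees_in("HR"))   # [3, 2] -- both HR rows sit next to each other
print(by_name["alice"])     # [1]
```

Note that building `by_name` did not move any rows, while building `clustered` fixed the one and only physical order, which is why a second clustered index is impossible.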
  • 53. Chapter 3 ■ Indexes and Composite Columns 45

The important question is in which scenarios we should use clustered indexes and in which non-clustered indexes. For example, a department can have multiple employees (a many-to-one relation), and it is often required to read employee details by department. Here department is a suitable candidate for a clustered index: all rows containing employee details would be stored and ordered by department for faster retrieval. Employee name, in turn, is a perfect candidate for a non-clustered index. We can hold multiple non-clustered indexes in a table, but there will always be a single clustered index per table.

Index Distribution

With distributed databases, data gets distributed and replicated across multiple nodes. Retrieval of a data collection requires fetching rows from multiple nodes. Indexes over a non-row-key column likewise need to be distributed across multiple nodes, such as shards. Long-running queries can benefit from such shard-based indexing for fast retrieval of data sets. Due to its peer-to-peer architecture, each node in a Cassandra cluster holds an identical configuration. Data replication, eventual consistency, and the partitioning schema are the important aspects of data distribution. Please refer to Chapter 1 for more details about replication factor, strategy class, and read/write consistency.

Indexing in Cassandra

Data on a Cassandra node is stored locally for each row. Rows are distributed across multiple nodes, but all columns for a particular row key are stored locally on one node. Cassandra by default provides a primary index over the row key for faster retrieval by row key.

Secondary Indexes

Indexes over column values are known as secondary indexes. These indexes are stored locally on the node where the physical data resides. That allows Cassandra to perform faster index-based retrieval of data. Secondary indexes are stored in a hidden column family and are managed internally by the node itself.
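Because each secondary index covers only the rows its own node holds, a query on an indexed column has to be sent to every node, each of which answers from its local index, and the coordinator merges the results. The following toy Python sketch illustrates that fan-out; the `Node` class and its hidden-column-family analogue are hypothetical, not Cassandra code:

```python
# Toy sketch of Cassandra-style node-local secondary indexes:
# each node indexes only the rows it owns, so an indexed query
# fans out to every node and unions the local answers.

class Node:
    def __init__(self):
        self.rows = {}          # row_key -> {column: value}
        self.local_index = {}   # first_name value -> [row_key] (hidden CF analogue)

    def insert(self, row_key, row):
        self.rows[row_key] = row
        self.local_index.setdefault(row["first_name"], []).append(row_key)

    def query_index(self, value):
        # answered entirely from this node's own data: data locality
        return [self.rows[k] for k in self.local_index.get(value, [])]

nodes = [Node(), Node()]
nodes[0].insert("u1", {"first_name": "vivek", "twitter_handle": "#mevivs"})
nodes[1].insert("u2", {"first_name": "vivek", "twitter_handle": "#vivekab"})
nodes[1].insert("u3", {"first_name": "apress", "twitter_handle": "#apress_team"})

# Coordinator: ask every node, merge the locally indexed hits.
hits = [r for n in nodes for r in n.query_index("vivek")]
print(len(hits))  # 2 matching rows, possibly from different nodes
```

This is why the exercise below can return matching rows that live on the same node or on different nodes: each node contributes its own locally indexed matches.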
Let's explore secondary indexes further with a simple exercise.

1. First, let's create a keyspace twitter and a column family users:

create keyspace twitter with replication = { 'class':'SimpleStrategy' , 'replication_factor':3};
use twitter;
create column family users with key_validation_class='UTF8Type' and comparator='UTF8Type' and default_validation_class='UTF8Type';
create table users (user_id uuid primary key, first_name text, twitter_handle text);

2. Let's create an index over first_name using the create index syntax (see Figure 3-3):

create index fname_idx on users(first_name);

3. Describe the table users:

describe table users;
  • 54. Chapter 3 ■ Indexes and Composite Columns 46

Figure 3-3 shows the users schema with the index created on first_name.

Figure 3-3. Users table with index on first_name

4. Let's insert a few rows in the users column family:

insert into users(user_id,first_name,twitter_handle) values(now(),'apress','#apress_team');
insert into users(user_id,first_name,twitter_handle) values(now(),'jonathan','#jhassell');
insert into users(user_id,first_name,twitter_handle) values(now(),'vivek','#mevivs');
insert into users(user_id,first_name,twitter_handle) values(now(),'vivek','#vivekab');

5. Let's try to find records using the indexed column first_name (see Figure 3-4):

select * from users where first_name='vivek';

Figure 3-4. Fetching users by first_name

The query over the indexed column first_name with the value 'vivek' (Figure 3-4) returns two rows. The two rows may live on the same node or on different nodes. One point worth mentioning here is that the indexes are stored locally along with the data rows, which ensures data locality. On the other hand, we may try to fetch rows using the non-indexed column twitter_handle:

select * from users where twitter_handle='#imvivek';
  • 55. Random documents with unrelated content Scribd suggests to you:
  • 56. U nearly touches line below, and O of POST line above. Buff, Orange, Amber and Dark Manila. POST 8 mm. U large and far from left oval. S and P near. The latter is in a nearly vertical position and stands well to the left of the point. POST equally spaced. T far from right oval. T of TWO near left oval; WO close. OC near. C vertical, and at top near point of inner frame line. EN well spaced. S near right oval. Nose near left oval. Top of left figure 2 near point of oval. U line passes close to head of E and touches the latter at base. VARIETY 13. (24-1/2 × 25-3/4 mm.) Hair projecting. CE on level and nearly touch at top. Buff, Orange and Dark Manila. POST 8 mm. U large and nearer to left oval than in Var. 12. U. S. near. SP. near. P a little inclined to left and to left of the point. POST spaced near. T far from right oval. T of TWO close to left oval. WO and OC close. ENT close. S far from right oval. Nose near left oval. E line passes near right stroke of U. CLASS V. Point of Bust over middle of O. VARIETY 14. (23-1/2 × 26 mm.)
  • 57. OS far apart. S of CENTS near oval line. Buff, Orange, Dark Manila. POST 8 mm. U large, far from left oval, and near inner frame line. SP wide at top. PO near, but top of letters some distance from outer frame line. T far from right oval. T of TWO close to left oval. OC near and top of C under the point. CE wide at base. EN widely spaced. NT wide at base. TS near. Nose near left oval. Figures of value well centered in ovals. W line touches top of P. A deterioration of this variety in which the nose almost touches left oval and TW touch upper and lower frame lines is called 14a. VARIETY 15. (25 × 16 mm.) Bust touches line over center of O. Buff, Orange, Amber, Dark Manila. POST 8 mm. U large, near left oval and at top far from outer frame line. P to left of point. O well to right of point and slanting to right. OST near. T far from right oval. T of TWO close to left oval. WO close. OC wide. C low and touching outer frame line. ENTS spaced near, but S far from right oval. Nose near left oval. Left figure 2 well centered, but right figure 2 much nearer to inner frame line. W line falls between base of S and the period. A deterioration of this die is Var. 15a. VARIETY 16. (24-3/4 × 26-1/4 mm.) Bust nearly touches line to right of O. Buff, Orange. POST 8 mm. U wide and far from left oval. P to left of point and close to outer frame line. PO wide. O far to right of point. OST near. T far from right oval. T of TWO far from left oval. Inner frame line is some distance from top of letters WO of TWO and N of CENTS. OC wide. CE near but EN wide. S far from right oval. Nose far from left oval. Left figure 2 well centered, but right figure 2 much nearer to inner oval line.
  • 58. D. DIES. 25-1/2 to 26-1/4 mm. NOTE:—In Var. 17, 18, 23, 24, 31, and 34 the word POST is short and spaced closely. Var. 22 has the narrow U, and Var. 21, 27, 38, 39 and 40 show the widest spacing of POST. CLASS III. Point of Bust over last bar of W. VARIETY 17. (26-1/4 × 25-1/2 mm.) O of POST considerably above level of P. Wide space, after S of CENTS. Buff, Orange and Amber. POST 7-1/2 mm. U near left oval and near inner frame line. U.S. close. P far to left of point; O near point. OST close. T very far from right oval. T of TWO, far from left oval. WO near. OC near. CE close at top. N above level of E. NT close to inner frame line. Nose far from left oval. Figures well centered. U line touches O at right. VARIETY 18. (26 × 25-3/4 mm.) OC very near and O nearly touching line below. Buff, Orange and Amber.
  • 59. POST 8 mm. U wide, slanting sharply to left and near left oval. P is to left of point and slants to the left. POS near, but ST spaced wider. T very far from right oval. T of TWO close to left oval. WO close. CE close at top. EN well spaced at top. NTS near and S close to right oval. Nose near left oval. U line touches base of N. Envelopes only. VARIETY 19. (26 × 25-3/4 mm.) Letters evenly spaced, those in upper label almost in vertical position. Amber and Light Manila. POST 8 mm. U wide, nearly vertical and far from left oval. U.S. wide. P vertical and to left of point. POS widely spaced. ST near. T very far from right oval. T of TWO far from left oval and top stroke of T nearly touches W. WO near. OC near. C vertical but a little below E. Top stroke of T of CENTS close to inner frame line. S near right oval. Nose near left oval. Figures well in center of ovals. T line touches top of E. VARIETY 20. (25-1/2 × 25-1/2 mm.) Sharp point at base of right 2. Amber and Light Manila. POST 8 mm. U wide and near left oval. P nearly vertical and to left of point. Top of O almost touches outer frame line. Base of S and T close to inner frame line. T of TWO far from left oval.
  • 60. WO very close. OC close. CENTS close and S far from right oval. Nose far from left oval. T line touches O to right. VARIETY 21. (26 × 25-1/2 mm.) ST and OC extremely wide. Point of bust far from line. Sharply pointed nose. Amber and Light Manila. POST 9 mm. U wide, near left oval, and sharply slanting to left. U.S. and SP very wide. P to left of point and slanting a little to the right. PO very wide. O far to right of point and turned to right. OS wide. T near right oval, T of TWO close to left oval. TW very wide at base. WO close. C low and nearly under the point. ENTS near and S close to right oval. Nose pointed and far from left oval. Figures well centered. U line passes from tip of E to base of N. CLASS IV. Bust points to left line of O. VARIETY 22. (25-1/2 × 26 mm.) Narrow U, the only one in DIE D. Buff. Extremely rare. POST 7-1/2 mm. U nearly vertical and far from left oval. P small near the point and at top far from outer frame line. O far to right of point. POST equally spaced. T far from right oval. T of TWO near left oval. WO close. OC wide, C
  • 61. slants sharply to right and at base is within the angle, formed by the outer curves. CENTS are on the same level. S near right oval. The inner curves are far from top of letters WO and CENTS. Nose near left oval. In both side ovals the downstroke of figure 2 ends in a sharp point. U line touches O to left. Buff envelope only. Knife 2. VARIETY 23. (26 × 25 mm.) Extremely wide space before U and after T in upper label. Bust pointed. Amber and Light Manila. POST 7-1/2 mm. U wide. The inner curves of the label are close to the inscription. P nearly vertical. POS close. ST near. T of TWO close to left oval. WO near. OC near but C slants from left to right and its base touches the outer frame line. Top of vertical stroke of E close to inner point. EN well spaced at top. S slants to right and is close to right oval. Nose very far from right oval. Figure 2 in left oval is lower than figure 2 in right oval. W line passes through middle of U. VARIETY 24. (26 × 26 mm.) O above level of P, C sharply turned to left. Buff Orange and Light Manila. POST 7-1/2 mm. U wide, inclined to left and near left oval. U.S. near. SP near. P slanting to left and near the point. POST about equally spaced but OST high nearly touching outer frame line at top. T far from right oval. T of TWO far from left oval. WO near. OC near. EC close at top. ENT well spaced. S near right oval. Nose close to left oval. Figures in oval well centered. C line passes between O of TWO and C of CENTS.
  • 62. VARIETY 25. (25-1/2 × 26 mm.) P tipped sharply to left and O to right. Buff and Orange. POST 8 mm. U wide and far from left oval. Base of U, close to inner frame line, but top of S close to outer frame line. U S P near. P far to left and O in line with point. POS near. T far from S and far from right oval. T of TWO near left oval. WO close. OC close. CE on level but E slanting to right. TS close. S near right oval. Nose some distance from left oval. Figures in ovals well centered. Envelopes only. VARIETY 26. (26 × 26 mm.) P nearly on a level with O. POST close. OC near. Amber and Light Manila.
  • 63. POST 8 mm. U wide slanting to left, and far from left oval. US. wide. SP wide. P to left of point and nearly vertical. T very far from right oval. T of TWO near left oval. WO close. OC near. C vertical. CE close. EN near. NTS close. S far from right oval. Nose near oval. Figures well centered in ovals. T line passes close to junction point of inner frame lines, and touches C to left. VARIETY 27. (26-1/2 × 25-1/2 mm.) Sharp point of bust high above left of O. Amber and Light Manila. POST 9-3/4 mm. U wide slanting considerably to left and near left oval. The entire inscription in upper label is widely spaced, but OS widest. T slants sharply to right, nearly touches outer frame line and is far from right oval. T of TWO close to left oval. WO near. OC wide. The junction point of the inner frame lines is over the center of C, which is low. EN well spaced and close to inner frame line. S nearly horizontal and close to right oval. Nose near left oval. Downstroke of right figure 2 near inner oval line. T line passes through first stroke of W of TWO. VARIETY 27a. (26-1/4 × 25-1/2 mm.) POST 9-3/4 mm. Same as last variety, but appearing to be different. This is due to great deterioration of the die. It is found on a wrapper only and is rather scarce. CLASS V.
  • 64. Bust points to middle of O. VARIETY 28. (26 × 26 mm.) ST close. Wide space after S of CENTS. Buff and Orange. Post 7-1/2 mm. U wide, nearly vertical and near the left oval. U.S. near. PO near, but O slightly above P. There is a wide space between OS. T near right oval. T of TWO far from left oval. WO very close. OC near. CE close and top of E under the point. EN wide, especially at top: N slightly above E. NTS close. Nose near left oval. Figures well centered in ovals. U line cuts top of O of TWO at right. Envelopes only. VARIETY 29. (25 × 25-3/4 mm.) Space before U and after T extremely wide. Light Manila. POST 7-1/2 mm. U wide. U.S. near and both letters close to inner frame line. P well to left of point and on a level with O. O close to point. POS near, but T further from S. T of TWO close to left oval. WO near. OC near and C under the point. E quite a distance to right of point. EN wide. NTS near right oval. Nose far from left oval. Figures well centered in ovals. U line passes through middle of C of CENTS. Point of bust very broad. Wrappers only. VARIETY 30. (26 × 25-1/2 mm.) Nose far from oval line. Amber and Light Manila. POST 7-1/2 mm. U wide, nearly vertical and near left oval U.S. wide. SP widely spaced. PO close and nearly on a level, OST near. T far from right oval. T of TWO far from left oval. WO near, but OC wide. CE on level and close at top. EN well spaced. TS wide at base. S far from right oval. Nose far from left oval. Figures well centered in ovals. E line touches S of U.S. at the right.
  • 65. VARIETY 31. (25-3/4 × 25-3/4 mm.) P considerably above O. Point of bust square and nearly touches line. Buff and Orange. Post 7-1/2 mm. U wide, inclined to left, and near left oval. S close to inner frame line. Top of P close to outer frame line. POST near. T far from right oval. T of TWO near left oval and base of T some distance from outer frame line. WO near. OC very wide. C low. Back stroke of E almost touches the point. EN wide and N high. NT wide at top. TS close. S near right oval. Nose near left oval. Figures well centered in ovals. T line passes through center of U of U. S. VARIETY 32. (26 × 26-1/4 mm.)
  • 66. Bust ends in a sharp point, which nearly touches line over centre of O of TWO. Orange and light manila. POST 7-1/4 mm. U rather short, inclined to left and near left oval. SP wide at top. P near point and above level of O. PO near but O slanting to right. OS well spaced, but S low. ST wide. T far from left oval. WO close. C of CENTS almost touches outer frame line and CE close at base. ENTS close and S near right oval. Nose near left oval. Figures well centered in ovals. U line passes slantingly from top of E to base. VARIETY 33. (25-3/4 × 25-3/4 mm.) Projecting hair. Wide space after S of CENTS. Buff, Orange and Light Manila. POST 8 mm. U wide, close to inner frame line and near left oval. Base of S some distance from inner frame line. P leans to the left. PO close but O slants to the right and is near the point. OS well spaced but ST spaced wider. T far from right oval. T of TWO far from left oval. WO near. OC wide. C some distance to right of point but on level with E. The backstroke of the latter nearly touches the point. EN wide, and ENTS close to inner frame line. Nose far from left oval. Figures well centered in ovals. P line passes through back stroke of E. VARIETY 34. (25-3/4 × 27 mm.)
  • 67. S of U.S touches line above. OC near. Buff envelope and wrapper. POST 8 mm. U wide, inclined to left and near left oval. SP near, P far to left of point. PO well spaced at top and O a little raised. OS widely spaced. ST low, so that top stroke of T is somewhat above top of S. T far from right oval. T of TWO near left oval. WO near. C slants to left, and E to right, so that there is a considerable space between the letters at base. ENT wide. TS close. S far from right oval. Figure in right oval near inner frame line, but in left oval well centered. U line passes between CE. VARIETY 35. (25 × 25-3/4 mm.) O of POST slants sharply to left. Hair far from frame line. Buff, Orange and Light Manila. POST 8 mm. U almost vertical and quite near to left oval. U.S. near. P inclined to left. O near point. OST close. T near right oval. T of TWO far from left oval. WO near. OC near. CE wide at base. N higher than E or T. S slants sharply to right and is far from right oval. Nose far from left oval. Figures well centered in oval. T line slants through C from right to left. Bust ends in a rather short point. VARIETY 36. (26 × 26 mm.) P tipped to left. O nearly touches outer frame line. Point of bust short and over centre of O. Amber and Light Manila.
  • 68. POST 8 mm. U large, inclined to left and near left oval. U. S. near and base of S some distance from inner frame line. P near point and slanting to left. PO wide, O nearly vertical. OST wide. T far from right oval. T of TWO far from left oval. WO close. OC near. C is low and slants sharply to left. CE close at top. ENTS close. T almost touches line above. S near right oval. Nose near left oval. Figures in ovals well centered. U line touches ends of upper and lower stroke of E. VARIETY 37. (26-1/2 × 26 mm.) P nearly touches line at top. POST near. Orange and Amber. POST 8 mm. U wide, inclined to left and near left oval. US wide. P nearly vertical and some distance to left of point. PO on a level. T of POST very far from right oval. T of TWO near left oval. WO close. OC near. C nearly under the point and vertical. EN well spaced at top. NTS close, especially the last two letters, S near right oval. Nose far from left oval. Figures in ovals well centered. T line slants across top of E. Envelopes only. A common die. VARIETY 38. (26 × 26 mm.) Bust point behind O. NT wide. Orange, Amber, Light Manila. POST 8 mm. U wide, greatly inclined to left, and quite near left oval. US very wide. P near point and slanting to left. O some
  • 69. distance to right of point and inclined to right. POS wide but ST widest. Top stroke of T close to outer frame line. T of TWO near left oval. WO near. OC very wide. C almost vertical and close to point. Top of E slightly above C. EN near. TS wide at base and S close to right oval. Nose far from left oval. Figures in ovals well centered. U line touches base of T of CENTS. VARIETY 39. (26-1/4 × 25-1/2 mm.) P considerably above level of O. POST wide. Amber, and Light Manila. POST 9 mm. U wide, inclined to left, and near left oval. US wide. SP wide. P slants to left and is close to the point. PO very wide. O far to right of point and but little slanting. OST wide. T near right oval. T of TWO close to left oval, WO close. The entire word is well above the outer frame line. OC very wide. C under the point and upright. Top of E slightly above C. NT close. TS wide. S close to right oval. Nose near left oval. Figures in ovals well centered. W line touches base of U at right. Broad point to bust. Envelope and wrapper. VARIETY 40. (26 × 26 mm.) NT very near. POST wide. Buff, Orange, Amber, Light Manila. POST 9-1/2 mm. Inscription in upper label much resembles that of the preceding variety, but S of U.S. is low and PO nearer. T of TWO near left oval. WO close. OC wide. TS close at top. Nose far from left oval. Figures in ovals well centered. U line passes along middle stroke of N. One of the most common varieties. Reference List of the Two Cent Envelopes and Wrappers of the Series of 1863 and 1864.
  • 70. ENVELOPES. TWO CENTS, BLACK. 1863. Inscribed: U. S. POSTAGE. DIE A. Var. 3. No.Class.Paper.Knife.Size.Dimensions. Remarks. 370 4 Buff 2 3 139 × 83 Gummed. 371 2 3 Ungummed. Var. 5. 372 4 Buff 2 3 139 × 83 Ungummed. 373 2 3 Gummed. Var. 6. 374 4 Amber 2 3 139 × 83 Gummed. 375 Buff 2 3 Ungummed. DIE B. Var. 8. No.Class.Paper.Knife.Size.Dimensions. Remarks. 376 4 Buff 11 3 139 × 83 Ungummed. 377 Orange 11 3 1864. Inscribed: U. S. POST. DIE C.
  • 71. Var. 1. No. Class.Paper.Knife.Size.Dimensions. Remarks. 378 2 Buff 11 3 139 × 83 Ungummed. 379 11 3 Gummed. 380 Or. 11 3 Ungummed. Var. 3. 381 3 Buff 11 3 139 × 83 Gummed 382 Or. 11 3 Ungummed Var. 5. 383 3 Buff 11 3 139 × 83 Gummed 384 Or. 11 3 Ungummed Var. 6. 385 3 Buff 11 3 139 × 83 Gummed Var. 6a. 386 3 Buff 11 3 139 × 83 Gummed 387 Or. 11 3 Ungummed Var. 7. 388 3 Buff 11 3 139 × 83 Gummed 389 Or. 11 3 Ungummed Var. 8. 390 3 Buff 11 3 139 × 83 Gummed 391 Or. 11 3 Ungummed Var. 9. 392 3 Buff 11 3 139 × 83 Gummed 393 Or. 11 3 Ungummed Var. 10. 394 4 Buff 11 3 139 × 83 Gummed. Var. 11. 395 4 Buff 11 3 139 × 83 Gummed. 395a 12 5 160 × 90 396 Or. 11 3 139 × 83 Ungummed. Var. 12.
  • 72. 397 4 Buff 11 3 139 × 83 Gummed. Generally Specimen. 398 Or. 11 3 Ungummed. Generally Specimen. 399 Buff 12 5 160 × 90 Generally Specimen. Var. 13. 400 4 Buff 11 3 139 × 83 Gummed. 401 Or. 11 3 Ungummed. Var. 14. 402 5 Buff 11 3 139 × 83 Gummed. 403 Or. 11 3 Ungummed. Var. 15. 404 5 Buff 11 3 139 × 83 Gummed. 405 Or. 11 3 Ungummed. 406 Buff 12 5 160 × 90 Var. 16. 407 5 Buff 11 3 139 × 83 Gummed. 408 Or. 11 3 Ungummed. 409 Buff 12 5 160 × 90 DIE D. Var. 17. No. Class.Paper.Knife.Size.Dimensions. Remarks. 410 3 Buff 11 3 139 × 83 Gummed 411 Or. 11 3 Ungummed 412 Buff 12 5 160 × 90 Var. 18. 413 3 Buff 11 3 139 × 83 Gummed. 414 Or. 11 3 Ungummed. 415 Buff 12 5 160 × 90 415a 12 5 Gummed.
  • 73. Var. 19. 416 3 Amber 12 5 160 × 60 Gummed 417 12 5 [HW: Gummed] Var. 20. 418 3 Amber 12 5 160 × 90 Ungummed Var. 21. 419 3 Amber 11 3 139 × 83 Gummed 420 12 5 160 × 90 Ungummed Var. 22. 421 4 Buff 2 3 139 × 83 Ungummed. Very rare. Var. 23. 422 4 Amber 12 5 160 × 90 Ungummed. Var. 24. 423 4 Buff 11 3 139 × 83 Gummed. 424 Or. 11 3 Ungummed. Var. 25. 425 4 Buff 11 3 139 × 83 Gummed. 426 Or. 11 3 Ungummed. Var. 26. 427 4 Amber 11 3 139 × 83 Gummed. Var. 27. 428 4 Amber 12 5 160 × 90 Ungummed. Var. 27a. 429 4 Amber 12 5 160 × 90 Ungummed. Var. 28. 430 5 Buff 11 3 139 × 83 Gummed. 431 Or. 11 3 Ungummed. Var. 30. 432 5 Amber 12 5 160 × 90 Ungummed. Var. 31. 433 5 Buff 11 3 139 × 83 Gummed. 434 Or. 11 3 Ungummed.
  • 74. 434a Buff 12 5 160 × 90 Var. 32. 435 5 Or. 11 3 139 × 83 Gummed. Var. 33. 436 5 Buff 11 3 139 × 83 Gummed. 437 Or. 11 3 Ungummed. 438 Buff 12 5 160 × 90 Gummed. Var. 34. 439 5 Buff 11 3 139 × 83 Gummed. Var. 35. 440 5 Buff 11 3 139 × 83 Gummed. 441 Or. 11 3 Ungummed. Var. 36. 442 5 Amber 12 5 160 × 90 Ungummed. Var. 37. 443 5 Amber 11 3 139 × 83 Gummed. 444 Or. 11 3 Ungummed. Var. 38. 445 5 Or. 11 3 139 × 83 Ungummed. Var. 39. 446 5 Amber 11 3 139 × 83 Gummed. 447 12 5 160 × 90 Ungummed. Var. 40. 448 5 Buff 11 3 139 × 83 Gummed. 449 Amber 11 3 450 Or. 11 3 Ungummed. Wrappers. 1863. Inscribed: U. S. POSTAGE.
  • 75. DIE A. Var. 1. No.Class.Paper.Dimensions.Laid.Remarks. 451 1 D. M. 227 × 148 Var. 2. 452 2 D. M. 227 × 148 Var. 4. 453 4 D. M. 227 × 148 Var. 6. 454 4 D. M. 227 × 148 Var. 7. 455 4 D. M. 227 × 148 1864. Inscribed: U. S. POST. DIE C. Var. 2. No.Class.Paper.Dimensions.Laid.Remarks. 456 3 D. M. 100 × 200 V Var. 4. 457 3 D. M. 100 × 200 V Var. 6. 458 3 Buff 100 × 200 V 459 D. M. V Var. 6a. 460 3 Buff 100 × 200 H 461 D. M. V Var. 7. 462 3 D. M. 100 × 200 V
  • 76. Var. 8. 463 3 D. M. 100 × 200 V Var. 10. 464 4 D. M. 100 × 200 V Var. 12. 465 4 D. M. 100 × 200 V Var. 13. 466 4 D. M. 100 × 200 V Var. 14. 467 5 D. M. 100 × 200 V Var. 15. 468 5 Buff 100 × 200 V 469 D. M. V Var. 16. 470 5 Buff 100 × 200 V DIE D. Var. 17. No. Class.Paper.Dimensions.Laid. Remarks. 471 3 Buff 100 × 200 V Var. 19. 472 3 L. M. 133 × 200 V Var. 20. 473 3 L. M. 100 × 200 V 474 133 × 200 — Var. 21. 475 3 L. M. 133 × 200 H 476 115 × 375 H Stamp 137 mm. from top. Var. 23. 477 4 L. M. 133 × 200 H 478 V 479 Wove
  • 77. Var. 24. 480 4 L. M. 100 × 200 V 480a Buff V Var. 25. 481 4 Buff 100 × 200 V Var. 26. 482 4 L. M. 133 × 200 H Var. 27. 483 4 L. M. 133 × 200 H Var. 27a. 484 4 L. M. 133 × 200 V Var. 29. 485 5 L. M. 133 × 200 H Var. 30. 486 5 L. M. 133 × 200 H Var. 31. 487 5 Buff 100 × 200 V Var. 32. 488 5 L. M. 100 × 200 V Var. 33. 489 5 L. M. 100 × 200 V 490 Buff V Var. 34. 491 5 L. M. 100 × 200 V 492 Buff V 493 150 × 212 V 494 H Var. 35. 495 5 L. M. 100 × 200 V 496 Buff V Var. 36. 497 5 L. M. 133 × 200 V Var. 38.
  • 78. 498 5 L. M. 133 × 200 H Var. 39. 499 5 L. M. 133 × 200 H Var. 40. 499a 5 L. M. 133 × 200 H
  • 79. FIFTH ISSUE: 1864-1865. THREE CENTS, ROSE; THREE CENTS, BROWN; SIX CENTS, ROSE AND SIX CENTS, PURPLE. In the Postmaster-General's report for 1864 it is stated that during the last session of Congress a bill was passed for the relief of the contractor for furnishing the department with stamped envelopes and newspaper wrappers, under the provisions of which the existing contract expired on Sept. 11, 1864. With the renewal of the former contract Nesbitt changed the dies of the two, three and six cents. The first we have already exhaustively treated. It is, of course, the two cents, black, U. S. POST. All these dies remained in use until June 30th, 1870. As a matter of history it may be noted here that the three cents printed in brown, likewise the six cents rose, both on official size, were issued in July, 1865. The dies have a portrait of Washington facing to the left in a plain oval. It is enclosed in a frame of colorless lines. Inscription above UNITED STATES; below, THREE CENTS or SIX CENTS, in block capitals. Large numerals of value at each side. None of the Nesbitt die varieties have given the writer so many anxious hours and have required such prolonged study as the three cents of 1864. Indeed, the final solution of the problem of classification of the various dies was only arrived at after more than two years continuous research. Like the famous balancing of the egg of Columbus, the problem, when solved, is extremely simple. Looking backward on the long series of failures, it seems strange that the chief characteristics have so long escaped the attention of cataloguers. The fact, however, is patent. Even as thorough and painstaking a student as the late Gilbert Harrison who, in 1895,
  • 80. chronicled, as he thought, all of the existing die varieties of the three cents, failed to observe the most important differences. Indeed, in the entire philatelic literature dealing with the Nesbitt dies of 1864 there is but one allusion to the feature which constitutes the surest means of identifying the die varieties, and this is only a single sentence contained in the Historical Notes of Messrs. Tiffany, Bogert and Rechert. It reads:—

"It is worth mentioning, however, that while dies 9, 15 and 26 (the latter the die under consideration) all have the small bust of Washington, there are small differences in each which show them to be different engravings. * * In die 26 the front hair shows only five locks and the back hair only four lines."

We shall presently see that, like the three cents, red, of 1853 (Die A), the diemakers produced different groups of heads which, once known, are not only an absolute means of differentiating the varieties, but also protect the collector from acquiring a multitude of the same die. Although, as stated above, the die of the three cents rose resembles that of the three cents red in its use of various heads, it is otherwise quite dissimilar to the first issue, as will be seen presently.

As in the varieties of the two cent dies, the horizontal and vertical dimensions of the three cents vary greatly. After careful research, and taking the advice of experienced philatelists, it was decided to adopt only two sizes for classification, i.e.:

Size A:—to include all stamps measuring horizontally 24 mm. but not exceeding 25 mm.
Size B:—to include all stamps measuring horizontally 25-1/2 mm. or more.

In our study of the three cents red of 1853 we noted, in addition to the various heads, some minor differences in the spacing of the letters forming the inscription. Referring now to the three cents of 1864, even the unskilled eye of the layman will be struck with the
  • 81. surprising changes, not only in the spacing of the letters forming a word, but also in the relative position of the words to each other and their distance from a definite point such, for instance, as the figure 3. The subsequent cuts well illustrate this point. In the first, the S of CENTS is several mm. distant from the right figure 3; in the second it is close to the 3. The same remarks apply to the U of UNITED in its relative position to the left figure 3. In the second cut there is also a square period after the final E of THREE.

Looking at cuts 3 and 4, the great variety of spacing between the letters of a word is strikingly apparent in the word THREE. These differences are easily detected by the 10 mm. unit distance measurement, which was explained in the introductory chapter of this series of articles. The subjoined diagram proves that there are
  • 82. at least three forms of each word, and, with a little study, the collector will soon recognize the leading types. It seems strange that such great and palpable differences remained unknown until 1892. Quoting from the work of Messrs. Tiffany, Bogert and Rechert, we are, however, informed:

"Heretofore it has not been noticed that there are a large number of minor varieties of this die depending on the relative position of the parts."

Commenting on Die 26 (three cents rose) the writers make some valuable suggestions, but they discourage the would-be student from
  • 83. going deeper into the subject by the closing paragraph:

"So few collectors would be interested in looking for these varieties that it has been thought unnecessary to devote space to them in a general work."

In the writer's opinion the most valuable hint thrown out by Messrs. Tiffany, Bogert and Rechert is contained in the following sentence:

"If a thread be laid along the lower stroke of the U it will pass at different distances from the tip of the nose and fall on different parts of the right numeral, of the space below it, or even as low as the S of CENTS."

Why these experts stopped at the gate and did not enter is one of those freaks of the human mind that defies explanation. Certainly the person who made this observation was on the very threshold of discovering a scientific classification of this elusive die. The writer confesses that, after having independently evolved this system of classification, nothing has given him greater satisfaction than to find that the basic idea had been chronicled as far back as 1892.

To-day it is well known that a line prolongation along the U of UNITED establishes five distinct classes. As this system has been fully described in a lecture given by the writer before the Boston Philatelic Society (April 19, 1904), which lecture has also been published in pamphlet form, and as this classification has been accepted by the writer of the latest Scott Catalogue, it seems unnecessary to go into the details, especially as the subjoined diagram is self-explanatory.
  • 84. It is evident that we now possess various means for the classification of the three cents die varieties; but a system based solely on a line measurement, as has been stated heretofore, would not sufficiently guard the collector from acquiring a number of the same dies, owing to unavoidable mistakes of measurement. To prevent duplication of dies it is imperative to know the various heads.

Luckily the distinctive features are quite plain, and it is easy to divide the heads into five classes, for, as in the first issue, the die cutters have adorned the head of Washington with a variety of coiffures. In Heads 1 and 2 there is a triangular open space between the middle bunch of hair and the lowest strand which meets the queue.
  • 85. HEAD 1.—The queue consists of three vertical strands extending from the top of the head to the neck. Next to the queue are 3 rear locks, of which the middle one is a large, pear-shaped bunch, consisting of five fine strands, while the second highest is by far the longest, and cuts into the queue, resembling the stem of a pear.

HEAD 2.—Same as Head 1, but the second lowest strand of hair in the pear-shaped bunch is the longest and does not extend into the queue. The triangular space below is slightly larger than in Head 1.

HEAD 3.—The queue consists of either three or four strands which extend from the top of the head to the neck. Next to the queue there are five locks in the rear row, the arrangement of which differs in the various specimens. The main feature of Head 3 consists in the absence of an open space between the middle bunch and the lowest lock.

HEAD 4.—The queue consists of three strands which extend from the top of the head to the neck. The back row of hair consists of five locks, of which the lowest is very small and runs almost perpendicularly into the queue. There is a small space between the perpendicular lock and the next lowest.

HEAD 5.—Generally found on the second quality of buff paper. The queue consists of three strands, which extend from the top of the head to the neck. The main feature is the middle bunch of hair, which is oblong shaped and consists of three heavy strands, all of