NoSQL
Introduction to NoSQL
UNIT -1
Why NoSQL?
• Relational databases have been the default
choice for serious data storage, especially in the
world of enterprise applications; often your only
choice is which relational database to use.
• After such a long period of dominance, the
current excitement about NoSQL databases
comes as a surprise.
• Now we’ll explore why
relational databases became so dominant.
The Value of Relational Databases
1. Getting at Persistent Data
Two areas of memory:
• Fast, small, volatile main memory
• Larger, slower, non volatile backing store
• Since main memory is volatile, to keep data around we
write it to a backing store, commonly a disk, which
provides persistent storage.
The backing store can be:
• File system
• Database
• The database allows more flexibility than a file system
in storing large amounts of data in a way that allows
an application program to get information quickly and
easily.
2. Concurrency
• Enterprise applications tend to have many people using
the same data at once, possibly modifying that data. We
have to worry about coordinating interactions between
them to avoid things like double booking of hotel
rooms.
• Since enterprise applications can have lots of users and
other systems all working concurrently, there’s a lot of
room for bad things to happen. Relational databases
help to handle this by controlling all access to their
data through transactions.
3. Integration
• Enterprises require multiple applications, written by
different teams, to collaborate in order to get things
done. Applications often need to use the same data, and
updates made through one application have to be
visible to others.
• A common way to do this is shared database
integration where multiple applications store their data
in a single database.
• Using a single database allows all the applications to use
each others’ data easily, while the database’s
concurrency control handles multiple applications in
the same way as it handles multiple users in a single
application.
4. A (Mostly) Standard
Model
• Relational databases have succeeded because they
provide the core benefits in a (mostly) standard way.
• As a result, developers can learn the basic relational
model and apply it in many projects.
• Although there are differences between different
relational databases, the core mechanisms remain the
same.
Impedance Mismatch
• For Application developers using relational databases, the
biggest frustration has been what’s commonly called the
impedance mismatch: the difference between the relational
model and the in-memory data structures.
• The relational data model organizes data into a structure of
tables, where a tuple is a set of name-value pairs and a relation
is a set of tuples.
• The values in a relational tuple have to be simple—they
cannot contain any structure, such as a nested record or a
list. This limitation isn’t true for in-memory data structures,
which can take on much richer structures than relations.
• So if you want to use a richer in-memory data structure, you
have to translate it to a relational representation to store it on
disk. Hence the impedance mismatch—two different
representations that require translation.
Figure: An order, which looks like a single
aggregate structure in the UI, is split into many
rows from many tables in a relational database
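To make the mismatch concrete, here is a small, hypothetical Python sketch (the field and table layouts are illustrative, not taken from the figure) of a rich in-memory order structure and the flat rows it must be split into for relational storage:

# A rich in-memory structure: nested records and lists.
order = {
    "id": 99,
    "customer": {"id": 1, "name": "Martin"},
    "line_items": [
        {"product_id": 27, "price": 32.45},
        {"product_id": 35, "price": 19.99},
    ],
}

# To store it relationally, the nesting must be flattened into
# simple tuples spread across several tables.
orders_rows    = [(99, 1)]                             # orders(id, customer_id)
line_item_rows = [(99, 27, 32.45), (99, 35, 19.99)]    # line_items(order_id, product_id, price)
customers_rows = [(1, "Martin")]                       # customers(id, name)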
• In the 1990s, the impedance mismatch led to a belief that
relational databases would be replaced with databases that
replicate the in-memory data structures to disk. That decade
was marked by the growth of object-oriented programming
languages, and with them came object-oriented
databases—both looking to be the dominant
environment for software development in the new
millennium. However, while object-oriented languages
succeeded in becoming the major force in programming,
object-oriented databases faded into obscurity.
• The impedance mismatch has been made much easier to deal
with by the wide availability of object-relational
mapping frameworks, such as Hibernate and iBATIS,
that implement well-known mapping patterns, but the
mapping problem is still an issue.
• Relational databases continued to dominate the
enterprise computing world in the 2000s, but during that
decade cracks began to open in their dominance.
Application and Integration Databases
• In relational databases, the database acts as an integration
database—where multiple applications developed by
separate teams store their data in a common
database. This improves communication because all the
applications are operating on a consistent set of persistent
data.
There are downsides to shared database integration.
• A structure that’s designed to integrate many applications
is more complex than any single application needs.
• If an application wants to make changes to its data
storage, it needs to coordinate with all the other
applications using the database.
• Different applications have different structural and
performance needs, so an index required by one
application may cause a problematic hit on inserts for
another.
• A different approach is to treat your database as an
application database—which is only accessed by a
single application codebase that’s looked after by a
single team.
Advantages:
• With an application database, only the team using the
application needs to know about the database
structure, which makes it much easier to maintain and
evolve the schema.
• Since the application team controls both the database and
the application code, the responsibility for database
integrity can be put in the application code.
Web
Services
• During the 2000s we saw a distinct shift to web services
where applications would communicate over HTTP.
• If you communicate with SQL, the data must be
structured as relations. However, with a service, you are
able to use richer data structures with nested records
and lists. These are usually represented as documents in
XML or, more recently, JSON.
• In general, with remote communication you want to
reduce the number of round trips involved in the
interaction, so it’s useful to be able to put a rich structure
of information into a single request or response.
• If you are going to use services for integration, most
of the time web services —using text over HTTP—
is the way to go. However, if you are dealing with
highly performance-sensitive interactions, you may
need a binary protocol. Only do this if you are sure
you have the need, as text protocols are easier to
work with—consider the example of the Internet.
• Once you have made the decision to use an
application database, you get more freedom of
choosing a database. Since there is a decoupling
between your internal database and the services with
which you talk to the outside world, the outside
world doesn’t have to care how you store your
data, allowing you to consider non-relational
options.
Attack of the Clusters
• In the 2000s, several large web properties dramatically
increased in scale. This increase in scale was
happening along many dimensions.
Websites
• Started tracking activity and structure in a very
detailed way.
• Large sets of data appeared: links, social networks,
activity in logs, mapping data.
• With growth in data came a growth in users.
Coping with the increase in data and traffic required
more computing resources. To handle this kind of
increase, you have two choices:
1. Scaling up implies:
• bigger machines
• more processors
• more disk storage
• more memory
Scaling up disadvantages:
• But bigger machines get more and more expensive.
• There are real limits as size increases.
2. Use lots of small machines in a cluster:
• A cluster of small machines can use commodity
hardware and ends up being cheaper at these kinds of
scales.
• more resilient—while individual machine failures are
common, the overall cluster can be built to keep
going despite such failures, providing high
reliability.
Cluster disadvantages
• Relational databases are not designed to be run on
clusters.
• Clustered relational databases, such as Oracle RAC or
Microsoft SQL Server, work on the concept of a
shared disk subsystem, where the cluster still has the
disk subsystem as a single point of failure.
• Relational databases could also be run as separate
servers for different sets of data, effectively sharding
the database. Even though this separates the load, all
the sharding has to be controlled by the
application, which has to keep track of which
database server to talk to for each bit of data (see
the sketch after this list).
• We lose any querying, referential integrity,
transactions, or consistency controls that cross shards.
• Commercial relational databases (licensed) are
usually priced on a single-server assumption, so
running on a cluster raised prices.
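The following is a minimal, hypothetical sketch of the application-controlled sharding described above; the shard list and hashing scheme are illustrative assumptions, not part of any particular product:

# Application-level sharding: the application, not the database,
# decides which server holds each customer's data.
import hashlib

SHARDS = ["db-server-0", "db-server-1", "db-server-2"]   # hypothetical servers

def shard_for(customer_id: str) -> str:
    # Hash the key so customers spread evenly across servers.
    digest = hashlib.md5(customer_id.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# The application must route every read and write itself;
# queries, joins, and transactions across shards are lost.
print(shard_for("customer-1234"))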
This mismatch between relational databases and
clusters led some organizations to consider an
alternative route to data storage. Two companies in
particular stood out:
1. Google
2. Amazon
• Both were running large clusters
• These things gave them the motive. Both were successful and
growing companies with strong technical components, which
gave them the means and opportunity. It was no wonder they had
murder in mind for their relational databases. As the 2000s
drew on, both companies produced brief but highly influential
papers about their efforts:
– BigTable from Google
– Dynamo from Amazon
• It’s often said that Amazon and Google operate at scales far
removed from most organizations, so the solutions they needed
may not be relevant to an average organization. But more and
more organizations are beginning to explore what they can do
by capturing and processing more data—and to run into the same
problems. So people began to explore making databases along
similar lines—explicitly designed to live in a world of clusters.
The Emergence of NoSQL
For NoSQL there is no generally accepted definition, nor an
authority to provide one, so all we can do is discuss some
common characteristics of the databases that tend to be called
“NoSQL.”
• The term “NoSQL” first appeared in the late 1990s as the
name of an open-source relational database that did not use
SQL; instead, it was manipulated through shell scripts that
could be combined into the usual UNIX pipelines. Today’s
NoSQL databases generally do not use SQL as their query
language; each offers its own query interface or API.
• They are generally open-source projects.
• Most NoSQL databases are driven by the need to run on
clusters. Relational databases use ACID transactions to
handle consistency across the whole database. This
inherently clashes with a cluster environment, so NoSQL
databases offer a range of options for consistency and
distribution.
• Not all NoSQL databases are strongly oriented
towards running on clusters. Graph databases are
one style of NoSQL databases that uses a distribution
model similar to relational databases but offers a
different data model that makes it better at handling
data with complex relationships.
• NoSQL databases operate without a schema,
allowing you to freely add fields to database records
without having to define any changes in structure
first. This is particularly useful when dealing with
nonuniform data and custom fields, which force
relational databases to use names like customField6
or custom-field tables that are awkward to process
and understand.
• When you first hear “NoSQL,” an immediate
question is what does it stand for—a “no” to SQL?
Most people who talk about NoSQL say that it really
means “Not Only SQL,” but this interpretation has
a couple of problems. Most people write “NoSQL”
whereas “Not Only SQL” would be written
“NOSQL.”
• To resolve these problems, don’t worry about what
the term stands for, but rather about what it means.
Thus, when “NoSQL” is applied to a database, it
refers to an ill-defined set of mostly open-source
databases, mostly developed in the early 21st
century, and mostly not using SQL.
• It’s better to think of NoSQL as a movement rather than a
technology. We don’t think that relational databases are going
away—they are still going to be the most common form of
database in use. Their familiarity, stability, feature set, and
available support are compelling arguments for most projects.
• The change is that now we see relational databases as one
option for data storage. This point of view is often referred to as
polyglot persistence—using different data stores in different
circumstances.
• We need to understand the nature of the data we’re storing and
how we want to manipulate it. The result is that most
organizations will have a mix of data storage technologies for
different circumstances. In order to make this polyglot world
work, our view is that organizations also need to shift from
integration databases to application databases.
• In our account of the history of NoSQL development,
we’ve concentrated on big data running on clusters.
The big data concerns have created an opportunity for
people to think freshly about their data storage needs,
and some development teams see that using a
NoSQL database can help their productivity by
simplifying their database access even if they have
no need to scale beyond a single machine.
Two primary reasons for considering NoSQL:
1) To handle data access with sizes and performance
that demand a cluster
2) To improve the productivity of application
development by using a more convenient data
interaction style.
A NoSQL database provides a mechanism for storage and
retrieval of data. NoSQL databases are used in real-time
web applications and big data, and their use is
increasing over time.
Many NoSQL stores compromise consistency in favor of
availability, speed, and partition tolerance.
Advantages of NoSQL:
1. High Scalability
NoSQL databases use sharding for horizontal scaling. They
can handle huge amounts of data; as the data grows, a
NoSQL database scales out to handle that data in an
efficient manner.
2. High Availability
The auto-replication feature in NoSQL databases makes them
highly available.
Disadvantages of NoSQL:
1. Narrow Focus: NoSQL databases are designed mainly for
storage and provide comparatively little functionality beyond it.
2. Open Source: NoSQL covers many open-source projects with
no common standard, so two NoSQL database systems are
likely to be quite different from each other.
3. Management Challenge: Big data management in
NoSQL is much more complex than in a relational
database.
4. GUI is not available: Flexible GUI tools for accessing
NoSQL databases are not widely available in the market.
5. Backup: Backup is a weak point for some NoSQL
databases, such as MongoDB.
6. Large document size: Data is stored as JSON-like
documents, which can make individual records large.
When should NoSQL be
used
• When huge amount of data need to be stored and
retrieved.
• The relationship between data you store is not that
important.
• The data is changing over time and is not structured.
• Support for constraints and joins is not required at the
database level.
• The data is growing continuously and you need to
scale the database regularly to handle it.
Key Points
• Relational databases have been a successful
technology for twenty years, providing persistence,
concurrency control, and an integration mechanism.
• Application developers have been frustrated with
the impedance mismatch between the relational
model and the in-memory data structures.
• There is a movement away from using integration
databases towards encapsulating databases within
applications and integrating through services.
• The vital factor for a change in data storage was the
need to support large volumes of data by running
on clusters. Relational databases are not designed to
run efficiently on clusters.
The common characteristics of NoSQL databases
1. Not using the relational model
2. Running well on clusters
3. Open-source
4. Built for the 21st century web estates
5. Schemaless
6. The most important result of the rise of NoSQL is
Polyglot Persistence.
Aggregate Data Models
Data Model: Model through which we identify and manipulate
our data. It describes how we interact with the data in the
database.
Storage model: Model which describes how the database
stores and manipulates the data internally.
In NoSQL, “data model” refers to the model by which the
database organizes data, more formally called a metamodel.
The dominant data model is the relational data model, which
uses a set of tables:
• Each table has rows
• Each row represents an entity
• Columns describe the entity
NoSQL solutions move away from the relational model.
Each NoSQL solution has a different model that it
uses:
1. Key-value
2. Document
3. Column-family
4. Graph
The first three of these share a common characteristic
in their data models, which is called aggregate
orientation.
Aggregates
The relational model takes the information to store and
divides it into tuples.
A tuple is a limited data structure:
• You cannot nest one tuple within another to get nested
records.
• You cannot put a list of values or tuples within another.
The aggregate model recognizes that we often need to operate
on data that has a more complex structure than a set of
tuples.
• It has a complex record that allows lists and other record
structures to be nested inside it.
• Key-value, document, and column-family databases all use
this aggregate orientation.
Definition:
• In Domain-Driven Design, an aggregate is a collection
of related objects that we wish to treat as a unit. It is a
unit for data manipulation and management of
consistency. Typically, we like to update aggregates
with atomic operations and communicate with our data
storage in terms of aggregates.
Advantages of Aggregate:
• Dealing in aggregates makes it easier to operate
on a cluster, since the aggregate makes a natural unit
for replication and sharding.
• Aggregates are also often easier for application
programmers to work with, since they often
manipulate data through aggregate structures.
Example of Relations and Aggregates
• Let’s assume we have to build an e-commerce website; we are
going to be selling items directly to customers over the web, and
we will have to store information about users, our product catalog,
orders, shipping addresses, billing addresses, and payment data.
• Data model for a relational database:
Sample data for Relational Data Model
Everything is properly normalized, no data is repeated in multiple
tables. We also have referential integrity.
An aggregate data model
Sample Data for aggregate data model
// in customers
{
  "id": 1,
  "name": "Martin",
  "billingAddress": [{"city": "Chicago"}]
}
// in orders
{
  "id": 99,
  "customerId": 1,
  "orderItems": [
    {
      "productId": 27,
      "price": 32.45,
      "productName": "NoSQL Distilled"
    }
  ],
  "shippingAddress": [{"city": "Chicago"}],
  "orderPayment": [
    {
      "ccinfo": "1000-1000-1000-1000",
      "txnId": "abelif879rft",
      "billingAddress": {"city": "Chicago"}
    }
  ]
}
• We’ve used the black-diamond composition
marker in UML to show how data fits into the
aggregation structure.
• The customer aggregate contains a list of
billing addresses.
• The order aggregate contains a list of order
items, a shipping address, and payments.
• The payment itself contains a billing address
for that payment.
• Here a single logical address record appears three times,
but instead of using IDs it’s treated as a value and
copied each time. This fits the domain, where we
would not want the shipping address or the
payment’s billing address to change.
• The link between the customer and the order isn’t
within either aggregate—it’s a relationship between
aggregates. We’ve shown the product name as part of
the order item here—this kind of denormalization is
similar to the tradeoffs with relational databases, but is
more common with aggregates because we want to
minimize the number of aggregates we access
during a data interaction.
• To draw an aggregate boundary, you have to think about how you will
access that data—and make that part of your thinking when
developing the application data model.
• Indeed, we could draw our aggregate boundaries differently,
putting all the orders for a customer into the customer aggregate.
Embed all the objects for customer and the customer’s orders
Sample Data for above aggregate data model
// in customers
{
  "customer": {
    "id": 1,
    "name": "Martin",
    "billingAddress": [{"city": "Chicago"}],
    "orders": [
      {
        "id": 99,
        "customerId": 1,
        "orderItems": [
          {
            "productId": 27,
            "price": 32.45,
            "productName": "NoSQL Distilled"
          }
        ],
        "shippingAddress": [{"city": "Chicago"}],
        "orderPayment": [
          {
            "ccinfo": "1000-1000-1000-1000",
            "txnId": "abelif879rft",
            "billingAddress": {"city": "Chicago"}
          }
        ]
      }
    ]
  }
}
• There’s no universal answer for how to draw your
aggregate boundaries. It depends entirely on how you
tend to manipulate your data.
• If you tend to access a customer together with all of
that customer’s orders at once, then you would prefer
a single aggregate.
• However, if you tend to focus on accessing a single
order at a time, then you should prefer having separate
aggregates for each order.
Consequences of Aggregate
Orientation
• Relational databases have no concept of aggregate within their data
model, so we call them aggregate-ignorant. In the NoSQL world,
graph databases are also aggregate-ignorant. Being aggregate-
ignorant is not a bad thing. It’s often difficult to draw aggregate
boundaries well, particularly if the same data is used in many
different contexts.
• An order makes a good aggregate when a customer is making and
reviewing orders, and when the retailer is processing orders.
• However, if a retailer wants to analyze its product sales over the
last few months, then an order aggregate becomes a trouble. To
get to product sales history, you’ll have to dig into every aggregate
in the database. So an aggregate structure may help with some
data interactions but be an obstacle for others.
• An aggregate-ignorant model allows you to easily look at
the data in different ways, so it is a better choice when
you don’t have a primary structure for manipulating
your data.
• The aggregate orientation helps greatly with running on
a cluster.
• If we’re running on a cluster, we need to minimize how
many nodes we need to query when we are gathering
data.
• By explicitly including aggregates, we give the database
important information about which bits of data will be
manipulated together, and thus should live on the same
node.
Aggregates have an important consequence for transactions:
• Relational databases allow you to manipulate any combination of
rows from any tables in a single transaction. Such transactions are
called ACID transactions.
• Many rows spanning many tables are updated as a single operation.
This operation either succeeds or fails in its entirety, and concurrent
operations are isolated from each other so they cannot see a partial
update.
• It’s often said that NoSQL databases don’t support ACID
transactions and thus sacrifice consistency, but they support
atomic manipulation of a single aggregate at a time.
• This means that if we need to manipulate multiple aggregates in an
atomic way, we have to manage that ourselves in the application
code. Graph and other aggregate-ignorant databases usually do
support ACID transactions similar to relational databases.
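As a hedged illustration of atomic manipulation of a single aggregate, here is a minimal sketch using MongoDB’s Python driver (pymongo); the database, collection, and field names follow the earlier order example and are assumptions:

from pymongo import MongoClient

client = MongoClient()                 # assumes a local MongoDB instance
orders = client["shop"]["orders"]      # hypothetical database/collection names

# Adding an item to one order aggregate is atomic: the whole document
# is updated in a single operation, with no partial update visible.
orders.update_one(
    {"id": 99},
    {"$push": {"orderItems": {"productId": 31, "price": 21.50}}},
)

# Updating two different aggregates (say, an order and a customer) is
# NOT atomic here; the application code has to handle a failure in between.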
Key-Value and Document Data
Models
• Key-value and document databases are strongly
aggregate-oriented, meaning we think of these databases
as primarily constructed through aggregates.
• Both of these types of databases consist of lots of
aggregates, with each aggregate having a key or ID
that’s used to get at the data.
• Riak and Redis are examples of key-value
databases.
• MongoDB and CouchDB are the most popular document
databases.
Key-Value Data
Model
• Key-value databases are the simplest of the NoSQL
databases: The basic data structure is a dictionary or map.
You can store a value, such as an integer, string, a JSON
structure, or an array, along with a key used to reference that
value.
• For example, a simple key-value database might have a value
such as "Douglas Adams". This value is then assigned an ID,
such as cust1237.
• Using a JSON structure adds complexity to the database. For
example, the database could store a full mailing address in
addition to a person's name. In the previous example, key
cust1237 could point to the following information:
{
  "name": "Douglas Adams",
  "street": "782 Southwest St.",
  "city": "Austin",
  "state": "TX"
}
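A minimal sketch of this key-value usage with the redis-py client (the key name and a locally running Redis server are assumptions):

import json
import redis

r = redis.Redis()   # assumes Redis running on localhost:6379

# Store the whole aggregate as an opaque value under one key.
customer = {
    "name": "Douglas Adams",
    "street": "782 Southwest St.",
    "city": "Austin",
    "state": "TX",
}
r.set("cust1237", json.dumps(customer))

# The only way back in is a lookup by key; the database cannot
# query inside the value.
stored = json.loads(r.get("cust1237"))
print(stored["city"])   # -> Austin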
Weakness of key-value database
• This model does not provide traditional database
capabilities such as atomicity across transactions, or
consistency when multiple transactions are executed
simultaneously. Such capabilities must be provided by
the application itself.
• As the volume of data increases, maintaining
unique values as keys may become more difficult;
addressing this issue requires introducing
some complexity in generating character
strings that will remain unique among an
extremely large set of keys.
Document Data Model
• It is a type of non-relational database designed to store and query data
as JSON-like documents, which makes it easier for developers to store and query
data in a database.
• It works well with use cases such as catalogs and user profiles.
• In a document database, data that is a collection of key-value pairs is
stored as a document.
• The flexible, semi-structured, hierarchical nature of documents and
document databases allows them to evolve with an application’s needs.
• Example: Book document
{
  "id": "98765432",
  "type": "book",
  "ISBN": "987-6-543-21012-3",
  "Author": {
    "Lname": "Roe",
    "MI": "T",
    "Fname": "Richard"
  },
  "Title": "Understanding document databases"
}
Difference between key-value and document database
1. Opacity
• In key-value database, the aggregate is opaque to
the database—just some big blob of mostly
meaningless bits. The advantage of opacity is that
we can store whatever we like in the aggregate.
The database may impose some general size limit,
but other than that we have complete freedom.
• In contrast, a document database is able to see a
structure in the aggregate. A document database
imposes limits on what we can place in it, defining
allowable structures and types. In return, however,
we get more flexibility in access.
2. Access
• With a key-value store, we can only access an
aggregate by lookup based on its key.
• With a document database, we can submit
queries to the database based on the fields in
the aggregate.
• In a document database we can retrieve part of the
aggregate rather than the whole thing, and the
database can create indexes based on the
contents of the aggregate.
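A minimal sketch of these access patterns with pymongo (the collection and field names follow the earlier order example and are assumptions):

from pymongo import MongoClient

orders = MongoClient()["shop"]["orders"]   # hypothetical database/collection

# Query by a field inside the aggregate...
matching = orders.find({"orderItems.productId": 27})

# ...retrieve only part of the aggregate (a projection)...
shipping_only = orders.find({"id": 99}, {"shippingAddress": 1, "_id": 0})

# ...and let the database index the aggregate's contents.
orders.create_index("orderItems.productId")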
Column-Family Stores
• One of the early and influential NoSQL databases was
Google’s BigTable; it is a two-level map. It has been a
model that influenced later databases such as HBase and
Cassandra.
• These databases with a BigTable-style data model are
often referred to as column stores. The thing that made
them different was the way in which they physically
stored data.
• Most databases have a row as a unit of storage which,
in particular, helps write performance. However, there
are many scenarios where writes are rare, but you
often need to read a few columns of many rows at
once.
• In this situation, it’s better to store groups of columns
for all rows as the basic storage unit—which is why
these databases are called column stores.
• BigTable and its next generation follow this notion of
storing groups of columns (column families)
together; we refer to these as column-family databases.
• Column-family model is a two-level aggregate
structure. As with key-value stores, the first key is
often described as a row identifier, picking up the
aggregate of interest. The difference with column-
family structures is that this row aggregate is itself
formed of a map of more detailed values. These
second-level values are referred to as columns. As
well as accessing the row as a whole, operations also
allow picking out a particular column, so to get a
particular customer’s name you could do
something like get('1234', 'name').
Fig. Representing customer info in a column-family structure
Column-family databases organize their columns into column
families. Each column has to be part of a single column family, and
the column acts as the unit for access, with the assumption that data
for a particular column family will usually be accessed together.
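A minimal, hypothetical sketch of this two-level aggregate structure using plain Python dictionaries (the row key, column families, and columns are illustrative):

# First level: row key -> row aggregate.
# Second level: column family -> columns (name -> value).
customers = {
    "1234": {
        "profile": {"name": "Martin", "billingAddress": "Chicago", "payment": "debit"},
        "orders":  {"order99": "Nov-20-2011", "order102": "Dec-02-2011"},
    }
}

def get(row_key, column, family="profile"):
    # Pick out one column from one row, much like get('1234', 'name').
    return customers[row_key][family][column]

print(get("1234", "name"))   # -> Martin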
• This also gives you a couple of ways to think about how
the data is structured.
• Row-oriented: Each row is an aggregate (for example,
customer with the ID of 1234) with column families
representing useful chunks of data (profile, order history)
within that aggregate.
• Column-oriented: Each column family defines a record
type (e.g., customer profiles) with rows for each of the
records. You then think of a row as the join of records in
all column families.
• This latter aspect reflects the columnar nature of
column-family databases. Since the database knows
about these common groupings of data, it can use this
information for its storage and access behavior.
• Cassandra uses the terms “wide” and “skinny.”
• Skinny rows have few columns, with the same
columns used across many different rows.
• In this case, the column family defines a
record type, each row is a record, and each
column is a field.
• A wide row has many columns (perhaps
thousands), with rows having very different
columns.
• A wide column family models a list, with each
column being one element in that list.
Summarizing Aggregate-Oriented
Databases
• These are the three different styles of aggregate-
oriented data models. What they all share is the
notion of an aggregate indexed by a key that you
can use for lookup. This aggregate is central to
running on a cluster, as the database will ensure that
all the data for an aggregate is stored together on
one node. The aggregate also acts as the atomic
unit for updates, providing a useful, if limited,
amount of transactional control.
• Within that notion of aggregate, we have some
differences. The key-value data model treats the
aggregate as an opaque whole, which means you
can only do key lookup for the whole aggregate—
you cannot run a query nor retrieve a part of the
aggregate.
• The document model makes the aggregate
transparent to the database allowing you to do
queries and partial retrievals. However, since the
document has no schema, the database cannot
act much on the structure of the document to
optimize the storage and retrieval of parts of the
aggregate.
• Column-family models divide the aggregate into
column families, allowing the database to treat
them as units of data within the row aggregate.
This imposes some structure on the aggregate
but allows the database to take advantage of that
structure to improve its accessibility.
Key Points
• An aggregate is a collection of data that we interact
with as a unit. Aggregates form the boundaries for
ACID operations with the database.
• Key-value, document, and column-family databases
can all be seen as forms of aggregate oriented
database.
• Aggregates make it easier for the database to
manage data storage over clusters.
• Aggregate-oriented databases work best when most
data interaction is done with the same aggregate;
aggregate-ignorant databases are better when
interactions use data organized in many different ways.
More Details on Data Models
Relationships
• Aggregates are useful because they put together data
that is commonly accessed together. But there are still
lots of cases where data that’s related is accessed
differently.
• Consider the relationship between a customer and all of
his orders. Some applications will want to access the
order history whenever they access the customer; this
fits in well with combining the customer with his
order history into a single aggregate.
• Other applications, however, want to process orders
individually and thus model orders as independent
aggregates.
• In this case, you’ll want separate order and customer
aggregates but with some kind of relationship between
them so that any work on an order can look up customer
data. The simplest way to provide such a link is to embed
the ID of the customer within the order’s aggregate
data.
• That way, if you need data from the customer record, you
read the order, look up the customer ID, and make
another call to the database to read the customer data (see
the sketch below). This will work, and will be just fine in
many scenarios—but the database will be ignorant of the
relationship in the data. This can be important because
there are times when it’s useful for the database to know
about these links.
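A minimal sketch of that two-step lookup with pymongo (the database, collection, and field names follow the earlier examples and are assumptions):

from pymongo import MongoClient

db = MongoClient()["shop"]     # hypothetical database name

# Read the order aggregate, then follow the embedded customerId
# with a second call; the database itself knows nothing about the link.
order = db.orders.find_one({"orderId": 99})
customer = db.customers.find_one({"customerId": order["customerId"]})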
• As a result, many databases—even key-value stores—
provide ways to make these relationships visible to the
database. Document stores make the content of the
aggregate available to the database to form indexes and
queries.
• An important aspect of relationships between aggregates
is how they handle updates. Aggregate oriented
databases treat the aggregate as the unit of data-
retrieval. Consequently, atomicity is only supported
within the contents of a single aggregate.
• If you update multiple aggregates at once, you have
to deal yourself with a failure partway through.
• Relational databases help you with this by allowing
you to modify multiple records in a single
transaction, providing ACID guarantees while altering
many rows.
• All of this means that aggregate-oriented databases
become more awkward as you need to operate across
multiple aggregates.
• This may imply that if you have data based on lots
of relationships, you should prefer a relational
database over a NoSQL store.
• While that’s true for aggregate-oriented databases,
it’s worth remembering that relational databases
aren’t all that stellar with complex relationships
either.
• This makes it a good moment to introduce another
category of databases that’s often lumped into the
NoSQL pile.
Graph Databases
• Graph databases are an odd fish in the NoSQL
pond.
• Most NoSQL databases were inspired by the
need to run on clusters, which led to
aggregate-oriented data models of large
records with simple connections.
• Graph databases are motivated by a different
frustration with relational databases and thus
have an opposite model—small records with
complex interconnections, something like the structure in the figure below.
Fig: An example graph structure
In this context, a graph isn’t a bar chart or histogram;
instead, we refer to a graph data structure of nodes
connected by edges.
• In Fig: we have a web of information whose nodes are
very small (nothing more than a name) but there is a
rich structure of interconnections between them. With
this structure, we can ask questions such as “find the
books in the Databases category that are written by
someone whom a friend of mine likes.”
• Graph databases are ideal for capturing any data
consisting of complex relationships such as social
networks, product preferences, or eligibility rules.
• The fundamental data model of a graph database is
very simple: nodes connected by edges (also called
arcs).
Difference between Graph & Relational databases
• Although relational databases can implement
relationships using foreign keys, the joins required to
navigate around can get quite expensive—which
means performance is often poor for highly connected
data models.
• Graph databases make traversal along the
relationships very cheap. A large part of this is
because graph databases shift most of the work of
navigating relationships from query time to insert
time. This naturally pays off for situations where
querying performance is more important than insert
speed.
• The emphasis on relationships makes graph
databases very different from aggregate-
oriented databases.
• Graph databases are more likely to run on a
single server rather than distributed across
clusters.
• ACID transactions need to cover multiple
nodes and edges to maintain consistency.
• The only thing graph databases have in common
with aggregate-oriented databases is their
rejection of the relational model.
Schemaless Databases
• A common theme across all the forms of NoSQL
databases is that they are schemaless.
• When you want to store data in a relational
database, you first have to define a schema—a
defined structure for the database which says what
tables exist, which columns exist, and what data
types each column can hold.
• Before you store some data, you have to have the
schema defined for it in a relational database.
With NoSQL databases, the ways of storing
data are as follows:
• A key-value store allows you to store any data you
like under a key.
• A document database effectively does the same
thing, since it makes no restrictions on the
structure of the documents you store.
• Column-family databases allow you to store any
data under any column you like.
• Graph databases allow you to freely add new edges
and freely add properties to nodes and edges as you
wish.
With a schema:
• You have to figure out in advance what you need to
store, but that can be hard to do.
Without a schema:
• You can easily store whatever you need.
• This allows you to easily change your data storage as
you learn more about your project.
• You can easily add new things as you discover them.
• If you find you don’t need some things anymore, you
can just stop storing them, without worrying about
losing old data as you would if you delete columns in a
relational schema.
• A schema puts all rows of a table into a
straightjacket, which becomes awkward if you
have different kinds of data in different rows.
You either end up with lots of columns that are
usually null (a sparse table), or you end up
with meaningless columns like custom
column 4.
• A schemaless store also makes it easier to deal
with nonuniform data: data where each record
has a different set of fields. It allows each
record to contain just what it needs—no more,
no less.
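A minimal sketch of storing nonuniform records in a schemaless store, again using pymongo (the collection and fields are illustrative assumptions):

from pymongo import MongoClient

customers = MongoClient()["shop"]["customers"]   # hypothetical names

# Each record carries just the fields it needs; no schema change required.
customers.insert_one({"name": "Martin", "billingAddress": [{"city": "Chicago"}]})
customers.insert_one({"name": "Pramod", "loyaltyTier": "gold"})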
Problems in Schemaless:
• If you are storing some data and displaying it in
a report as a simple list of fieldName: value
lines then a schema is only going to get in the
way.
• But usually we do more with our data than this,
and we do it with programs that need to know
that the billing address is called
billingAddress and not addressForBilling,
and that the quantity field is going to be the
integer 5 and not the string "five".
Fact is that whenever we write a program that accesses data,
that program almost always relies on some form of implicit
schema. Unless it just says something like
//pseudo code
foreach (Record r in records) {
    foreach (Field f in r.fields) {
        print(f.name, f.value)
    }
}
Unless the program is that generic, it will assume that certain
field names are present and carry data with a certain meaning,
and it will assume something about the type of data stored
within each field.
• Programs are not humans; they cannot read “qty” and conclude
that it must be the same as “quantity”. So, however schemaless
our database is, there is usually an implicit schema present.
Having the implicit schema in the application code results in some
problems.
• In order to understand what data is present you have to dig into
the application code.
• The database remains ignorant of the schema—it can’t use the
schema to help it decide how to store and retrieve data efficiently. It
can’t apply its own validations upon that data to ensure that
different applications don’t manipulate data in an inconsistent way.
These are the reasons why relational databases have a fixed schema.
• Schemaless database shifts the schema into the application code
that accesses it. This becomes problematic if multiple
applications, developed by different people, access the same
database.
These problems can be reduced with a couple of approaches:
• Encapsulate all database interaction within a single
application and integrate it with other applications using
web services.
• Another approach is to clearly define different areas of an
aggregate for access by different applications. These
could be different sections in a document database or
different column families in column-family database.
Relational schemas can also be changed at any time with
standard SQL commands. If necessary, you can create new
columns in an ad-hoc way to store nonuniform data. We have
only rarely seen this done.
Materialized Views
• When we talked about aggregate-oriented data models,
we stressed their advantages. If you want to access
orders, it’s useful to have all the data for an order
contained in a single aggregate that can be stored and
accessed as a unit.
• But aggregate-orientation has a corresponding
disadvantage: What happens if a product manager
wants to know how much a particular item has sold
over the last couple of weeks?
• Now the aggregate-orientation works against you,
forcing you to potentially read every order in the
database to answer the question. You can reduce this
burden by building an index on the product, but you’re
still working against the aggregate structure.
• Relational databases support accessing data in
different ways. Furthermore, they provide a
convenient mechanism that allows you to look at
data differently from the way it’s stored—views.
View:
• A view is like a relational table (it is a relation) but
it’s defined by computation over the base tables.
When you access a view, the database computes the
data in the view—a handy form of encapsulation.
• Views provide a mechanism to hide from the client
whether data is derived data or base data.
• But some views are expensive to compute.
Materialized Views:
• To cope with this, materialized views were
invented, which are views that are computed in
advance and cached on disk. Materialized views
are effective for data that is read heavily but can
stand being somewhat stale.
• Although NoSQL databases don’t have views,
they may have precomputed and cached queries,
and they reuse the term “materialized view” to
describe them. Often, NoSQL databases create
materialized views using a map-reduce
computation.
There are two strategies for building a materialized view:
• The first is the eager approach where you update
the materialized view at the same time you update
the base data for it. In this case, adding an order
would also update the purchase history aggregates for
each product.
• This approach is good when you have more frequent
reads of the materialized view than you have writes
and you want the materialized views to be as fresh as
possible. The application database approach is
valuable here as it makes it easier to ensure that
any updates to base data also update materialized
views.
• If you don’t want to pay that overhead on each
update, you can run batch jobs to update the
materialized views at regular intervals as per
requirements.
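As a hedged illustration of the eager approach above, here is a minimal pymongo sketch in which adding an order also updates a per-product purchase-history aggregate (the database, collection, and field names are assumptions):

from pymongo import MongoClient

db = MongoClient()["shop"]          # hypothetical database name

def add_order(order):
    # Write the base data...
    db.orders.insert_one(order)
    # ...and eagerly update the materialized view at the same time.
    for item in order["orderItems"]:
        db.product_sales.update_one(
            {"productId": item["productId"]},
            {"$inc": {"unitsSold": 1, "revenue": item["price"]}},
            upsert=True,
        )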
• You can build materialized views outside of the
database by reading the data, computing the view,
and saving it back to the database.
• More often databases will support building
materialized views themselves.
• In this case, you provide the computation that needs
to be done, and the database executes the
computation when needed according to some
parameters that you configure. This is particularly
handy for eager updates of views with incremental
map-reduce.
Modeling for Data
Access
As mentioned earlier, when modeling data aggregates we
need to consider how the data is going to be read as well as
the side effects on data related to those aggregates.
1. Let’s start with the model where all the data for the customer
is embedded using a key-value store.
Fig: Embed all the objects for customer and their orders.
• In this scenario, the application can read the
customer’s information and all the related data by
using the key.
• If the requirements are to read the orders or the
products sold in each order, the whole object has to
be read and then parsed on the client side to build
the results.
• When references are needed, we could switch to
document stores and then query inside the
documents, or even change the data for the key-value
store to split the value object into Customer and
Order objects and then maintain these objects’
references to each other.
With the references (see Figure), we can now find the orders independently from the
Customer, and with the orderId reference in the Customer we can find all Orders for the
Customer.
# Customer object
{ "customerId": 1,
"customer": {
"name": "Martin",
"billingAddress": [{"city": "Chicago"}],
"payment": [{"type": "debit","ccinfo": "1000-1000-1000-1000"}],
"orders":[{"orderId":99}]
}
}
# Order object
{ "customerId": 1,
"orderId": 99,
"order":{
"orderDate":"Nov-20-2011",
"orderItems":[{"productId":27, "price": 32.45}],
"orderPayment":[{"ccinfo":"1000-1000-1000-1000",
"txnId":"abelif879rft"}],
"shippingAddress":{"city":"Chicago"} } }
Fig: Customer is stored separately from Order
2. In document stores, since we can query inside documents, removing references
to Orders from the Customer object is possible. This change allows us to not
update the Customer object when new orders are placed by the Customer.
# Customer object
{ "customerId": 1,
"name": "Martin",
"billingAddress": [{"city": "Chicago"}],
"payment": [
{"type": "debit",
"ccinfo": "1000-1000-1000-1000"}
]
}
#Order object
{ "orderId": 99,
"customerId": 1,
"orderDate":"Nov-20-2011",
"orderItems":[{"productId":27, "price": 32.45}],
"orderPayment":[{"ccinfo":"1000-1000-1000-1000",
"txnId":"abelif879rft"}],
"shippingAddress":{"city":"Chicago"}
}
• Since document data stores allow you to query by
attributes inside the document, searches such as
“find all orders that include the Refactoring
Databases product” are possible, but the decision to
create an aggregate of items and orders they belong
to is not based on the database’s query capability
but on the read optimization desired by the
application.
3. When using column families to model the
data, it is important to model it according to
your query requirements, not for the convenience
of writing; the general rule is to make it easy to
query and to denormalize the data at write time.
• There are multiple ways to model the data; one
way is to store the Customer and Order in
different column families (see Figure).
Here, it is important to note that references to all the
orders placed by the customer are kept in the Customer
column family.
Fig: Conceptual view into a column data store
4. When using graph databases to model the same data, we
model all objects as nodes and relations within them as
relationships; these relationships have types and
directional significance.
• Each node has independent relationships with other
nodes. These relationships have names like
PURCHASED, PAID_WITH, or BELONGS_TO (see
Figure); these relationship names let you traverse
the graph.
• Let’s say you want to find all the Customers who
PURCHASED a product with the name
Refactoring Databases. All we need to do is query
for the product node Refactoring Databases and
look for all the Customers with the incoming
PURCHASED relationship.
Fig: Graph model of e-commerce data
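A minimal sketch of that traversal, assuming a Neo4j graph store and its official Python driver (the node labels, relationship name, and connection details are illustrative):

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

query = """
MATCH (c:Customer)-[:PURCHASED]->(p:Product {name: $name})
RETURN c.name AS name
"""

# Follow the incoming PURCHASED relationships of the product node.
with driver.session() as session:
    for record in session.run(query, name="Refactoring Databases"):
        print(record["name"])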
Key Points
• Aggregate-oriented databases make inter-aggregate
relationships more difficult to handle than intra-
aggregate relationships.
• Graph databases organize data into node and edge
graphs; they work best for data that has complex
relationship structures.
• Schemaless databases allow you to freely add fields
to records, but there is usually an implicit schema
expected by users of the data.
• Aggregate-oriented databases often compute
materialized views to provide data organized
differently from their primary aggregates. This is
often done with map-reduce computations.
More Related Content

PDF
Database-Technology_introduction and feature.pdf
PPTX
Relational databases store data in tables
PPTX
History and Introduction to NoSQL over Traditional Rdbms
PPTX
Nosql-Module 1 PPT.pptx
PPTX
Introduction to NoSQL Databases and Types of NOSQL Databases.pptx
PPTX
Module-1.pptx63.pptx
PPTX
What Is a Database Powerpoint Presentation.pptx
PPTX
UNIT-2.pptx
Database-Technology_introduction and feature.pdf
Relational databases store data in tables
History and Introduction to NoSQL over Traditional Rdbms
Nosql-Module 1 PPT.pptx
Introduction to NoSQL Databases and Types of NOSQL Databases.pptx
Module-1.pptx63.pptx
What Is a Database Powerpoint Presentation.pptx
UNIT-2.pptx

Similar to NOSQL DATAbASES INTRDUCTION powerpoint presentaion (20)

PPTX
dbms introduction.pptx
PPTX
Current trends in dbms
PDF
Big data rmoug
DOCX
Report 1.0.docx
PPTX
Database management system
DOCX
Report 2.0.docx
PDF
PDF
MongoDB vs Firebase
PPTX
DBMS basics and normalizations unit.pptx
PPTX
Database management system
PPTX
Computer applications.pptx
PDF
ITI015En-The evolution of databases (I)
PDF
IBM Data Analytics Module 2 Overview of data Repositories.
PPT
CouchBase The Complete NoSql Solution for Big Data
PPT
Big Data Analytics Materials, Chapter: 1
PPT
Intro Duction of Database and its fundamentals .ppt
PPT
0001 introduction to database management system
PDF
Types of data bases
PPTX
Distributed RDBMS: Data Distribution Policy: Part 2 - Creating a Data Distrib...
PDF
Complete dbms notes
dbms introduction.pptx
Current trends in dbms
Big data rmoug
Report 1.0.docx
Database management system
Report 2.0.docx
MongoDB vs Firebase
DBMS basics and normalizations unit.pptx
Database management system
Computer applications.pptx
ITI015En-The evolution of databases (I)
IBM Data Analytics Module 2 Overview of data Repositories.
CouchBase The Complete NoSql Solution for Big Data
Big Data Analytics Materials, Chapter: 1
Intro Duction of Database and its fundamentals .ppt
0001 introduction to database management system
Types of data bases
Distributed RDBMS: Data Distribution Policy: Part 2 - Creating a Data Distrib...
Complete dbms notes
Ad

Recently uploaded (20)

PDF
Automation-in-Manufacturing-Chapter-Introduction.pdf
PDF
Evaluating the Democratization of the Turkish Armed Forces from a Normative P...
PPTX
Foundation to blockchain - A guide to Blockchain Tech
PDF
Operating System & Kernel Study Guide-1 - converted.pdf
PDF
Mohammad Mahdi Farshadian CV - Prospective PhD Student 2026
PDF
R24 SURVEYING LAB MANUAL for civil enggi
PPTX
Sustainable Sites - Green Building Construction
PPTX
KTU 2019 -S7-MCN 401 MODULE 2-VINAY.pptx
PDF
Digital Logic Computer Design lecture notes
PDF
Model Code of Practice - Construction Work - 21102022 .pdf
PDF
BMEC211 - INTRODUCTION TO MECHATRONICS-1.pdf
PDF
composite construction of structures.pdf
PPTX
bas. eng. economics group 4 presentation 1.pptx
PPTX
OOP with Java - Java Introduction (Basics)
PPTX
additive manufacturing of ss316l using mig welding
PDF
SM_6th-Sem__Cse_Internet-of-Things.pdf IOT
PDF
PRIZ Academy - 9 Windows Thinking Where to Invest Today to Win Tomorrow.pdf
PPTX
CYBER-CRIMES AND SECURITY A guide to understanding
PPTX
Lecture Notes Electrical Wiring System Components
PDF
The CXO Playbook 2025 – Future-Ready Strategies for C-Suite Leaders Cerebrai...
Automation-in-Manufacturing-Chapter-Introduction.pdf
Evaluating the Democratization of the Turkish Armed Forces from a Normative P...
Foundation to blockchain - A guide to Blockchain Tech
Operating System & Kernel Study Guide-1 - converted.pdf
Mohammad Mahdi Farshadian CV - Prospective PhD Student 2026
R24 SURVEYING LAB MANUAL for civil enggi
Sustainable Sites - Green Building Construction
KTU 2019 -S7-MCN 401 MODULE 2-VINAY.pptx
Digital Logic Computer Design lecture notes
Model Code of Practice - Construction Work - 21102022 .pdf
BMEC211 - INTRODUCTION TO MECHATRONICS-1.pdf
composite construction of structures.pdf
bas. eng. economics group 4 presentation 1.pptx
OOP with Java - Java Introduction (Basics)
additive manufacturing of ss316l using mig welding
SM_6th-Sem__Cse_Internet-of-Things.pdf IOT
PRIZ Academy - 9 Windows Thinking Where to Invest Today to Win Tomorrow.pdf
CYBER-CRIMES AND SECURITY A guide to understanding
Lecture Notes Electrical Wiring System Components
The CXO Playbook 2025 – Future-Ready Strategies for C-Suite Leaders Cerebrai...
Ad

NOSQL DATAbASES INTRDUCTION powerpoint presentaion

  • 3. Why NoSQL? • Relational databases have been the default choice for serious data storage, especially in the world of enterprise applications your only choice can be which relational database to use. • After such a long period of dominance, the current excitement about NoSQL databases comes as a surprise. • Now we’ll explore why relational databases
  • 4. The Value of Relational Databases 1. Getting at Persistent Data Two areas of memory: • Fast, small, volatile main memory • Larger, slower, non volatile backing store • Since main memory is volatile to keep data around, we write it to a backing store, commonly seen a disk which can be persistent memory. The backing store can be: • File system • Database
  • 5. • The database allows more flexibility than a file system in storing large amounts of data in a way that allows an application program to get information quickly and easily. 2. Concurrency • Enterprise applications tend to have many people using same data at once, possibly modifying that data. We have to worry about coordinating interactions between them to avoid things like double booking of hotel rooms. • Since enterprise applications can have lots of users and other systems all working concurrently, there’s a lot of room for bad things to happen. Relational databases help to handle this by controlling all access to their data through transactions.
  • 6. 3. Integration • Enterprise requires multiple applications, written by different teams, to collaborate in order to get things done. Applications often need to use the same data and updates made through one application have to be visible to others. • A common way to do this is shared database integration where multiple applications store their data in a single database. • Using a single database allows all the applications to use each others’ data easily, while the database’s concurrency control handles multiple applications in the same way as it handles multiple users in a single application.
  • 7. 4. A (Mostly) Standard Model • Relational databases have succeeded because they provide the core benefits in a (mostly) standard way. • As a result, developers can learn the basic relational model and apply it in many projects. • Although there are differences between different relational databases, the core mechanisms remain the same.
  • 8. Impedance Mismatch • For Application developers using relational databases, the biggest frustration has been what’s commonly called the impedance mismatch: the difference between the relational model and the in-memory data structures. • The relational data model organizes data into a structure of tables. Where a tuple is a set of name-value pairs and a relation is a set of tuples. • The values in a relational tuple have to be simple—they cannot contain any structure, such as a nested record or a list. This limitation isn’t true for in-memory data structures, which can take on much richer structures than relations. • So if you want to use a richer in-memory data structure, you have to translate it to a relational representation to store it on disk. Hence the impedance mismatch—two different representations that require translation.
  • 9. Figure: An order, which looks like a single aggregate structure in the UI, is split into many rows from many tables in a relational database
  • 10. • The impedance mismatch lead to relational databases being replaced with databases that replicate the in- memory data structures to disk. That decade was marked with the growth of object-oriented programming languages, and with them came object-oriented databases—both looking to be the dominant environment for software development in the new millennium. However, while object-oriented languages succeeded in becoming the major force in programming, object-oriented databases faded into obscurity. • Impedance mismatch has been made much easier to deal with by the wide availability of object relational mapping frameworks, such as Hibernate and iBATIS that implement well-known mapping patterns, but the mapping problem is still an issue. • Relational databases continued to dominate the enterprise computing world in the 2000s, but during that decade cracks began to open in their dominance.
  • 11. Application and Integration Databases • In relational databases, the database acts as an integration database—where multiple applications developed by separate teams storing their data in a common database. This improves communication because all the applications are operating on a consistent set of persistent data. There are downsides to shared database integration. • A structure that’s designed to integrate many applications is more complex than any single application needs. • If an application wants to make changes to its data storage, it needs to coordinate with all the other applications using the database. • Different applications have different structural and performance needs, so an index required by one application may cause a problematic hit on inserts for another.
  • 12. • A different approach is to treat your database as an application database—which is only accessed by a single application codebase that’s looked after by a single team. Advantages: • With an application database, only the team using the application needs to know about the database structure, which makes it much easier to maintain and evolve the schema. • Since the application team controls both the database and the application code, the responsibility for database integrity can be put in the application code.
  • 13. Web Services • During the 2000s we saw a distinct shift to web services where applications would communicate over HTTP. • If you communicate with SQL, the data must be structured as relations. However, with a service, you are able to use richer data structures with nested records and lists. These are usually represented as documents in XML or, more recently, JSON. • In general, with remote communication you want to reduce the number of round trips involved in the interaction, so it’s useful to be able to put a rich structure of information into a single request or response.
  • 14. • If you are going to use services for integration, most of the time web services —using text over HTTP— is the way to go. However, if you are dealing with highly performance-sensitive interactions, you may need a binary protocol. Only do this if you are sure you have the need, as text protocols are easier to work with—consider the example of the Internet. • Once you have made the decision to use an application database, you get more freedom of choosing a database. Since there is a decoupling between your internal database and the services with which you talk to the outside world, the outside world doesn’t have to care how you store your data, allowing you to consider non-relational options.
  • 15. Attack of the Clusters • In the 2000s, several large web properties dramatically increased in scale. This increase in scale was happening along many dimensions. Websites • Started tracking activity and structure in a very detailed way. • Large sets of data appeared: links, social networks, activity in logs, mapping data. • With the growth in data came a growth in users.
  • 16. Coping with the increase in data and traffic required more computing resources. To handle this kind of increase, you have two choices: 1. Scaling up implies: • bigger machines • more processors • more disk storage • more memory Scaling up disadvantages: • But bigger machines get more and more expensive. • There are real limits as size increases.
  • 17. 2. Use lots of small machines in a cluster: • A cluster of small machines can use commodity hardware and ends up being cheaper at these kinds of scales. • more resilient—while individual machine failures are common, the overall cluster can be built to keep going despite such failures, providing high reliability.
  • 18. Cluster disadvantages • Relational databases are not designed to be run on clusters. • Clustered relational databases, such as Oracle RAC or Microsoft SQL Server, work on the concept of a shared disk subsystem, where the cluster still has the disk subsystem as a single point of failure. • Relational databases could also be run as separate servers for different sets of data, effectively sharding the database. Even though this separates the load, all the sharding has to be controlled by the application, which has to keep track of which database server to talk to for each bit of data.
  • 19. • We lose any querying, referential integrity, transactions, or consistency controls that cross shards. • Commercial relational databases (licensed) are usually priced on a single-server assumption, so running on a cluster raised prices. This mismatch between relational databases and clusters led some organizations to consider an alternative route to data storage. Two companies in particular 1. Google 2. Amazon • Both were running large clusters
  • 20. • These things gave them the motive. Both were successful and growing companies with strong technical components, which gave them the means and opportunity. It was no wonder they had murder in mind for their relational databases. As the 2000s drew on, both companies produced brief but highly influential papers about their efforts: – BigTable from Google – Dynamo from Amazon • It’s often said that Amazon and Google operate at scales far removed from most organizations, so the solutions they needed may not be relevant to an average organization. But more and more organizations are beginning to explore what they can do by capturing and processing more data—and to run into the same problems. So people began to explore making databases along similar lines—explicitly designed to live in a world of clusters.
  • 21. The Emergence of NoSQL For NoSQL there is no generally accepted definition, nor an authority to provide one, so all we can do is discuss some common characteristics of the databases that tend to be called “NoSQL.” • The name NoSQL was originally applied to a database that didn't use SQL as a query language; instead, that database was manipulated through shell scripts that could be combined into the usual UNIX pipelines. • They are generally open-source projects. • Most NoSQL databases are driven by the need to run on clusters. Relational databases use ACID transactions to handle consistency across the whole database. This inherently clashes with a cluster environment, so NoSQL databases offer a range of options for consistency and distribution.
  • 22. • Not all NoSQL databases are strongly oriented towards running on clusters. Graph databases are one style of NoSQL database that uses a distribution model similar to relational databases but offers a different data model that makes it better at handling data with complex relationships. • NoSQL databases operate without a schema, allowing you to freely add fields to database records without having to define any changes in structure first. This is particularly useful when dealing with nonuniform data and custom fields, which forced relational databases to use names like customField6 or custom field tables that are awkward to process and understand.
  • 23. • When you first hear “NoSQL,” an immediate question is what does it stand for—a “no” to SQL? Most people who talk about NoSQL say that it really means “Not Only SQL,” but this interpretation has a couple of problems. Most people write “NoSQL” whereas “Not Only SQL” would be written “NOSQL.” • To resolve these problems, don’t worry about what the term stands for, but rather about what it means. Thus, when “NoSQL” is applied to a database, it refers to an ill-defined set of mostly open-source databases, mostly developed in the early 21st century, and mostly not using SQL.
  • 24. • It’s better to think of NoSQL as a movement rather than a technology. We don’t think that relational databases are going away—they are still going to be the most common form of database in use. Their familiarity, stability, feature set, and available support are compelling arguments for most projects. • The change is that now we see relational databases as one option for data storage. This point of view is often referred to as polyglot persistence—using different data stores in different circumstances. • We need to understand the nature of the data we’re storing and how we want to manipulate it. The result is that most organizations will have a mix of data storage technologies for different circumstances. In order to make this polyglot world work, our view is that organizations also need to shift from integration databases to application databases.
  • 25. • In our account of the history of NoSQL development, we’ve concentrated on big data running on clusters. The big data concerns have created an opportunity for people to think freshly about their data storage needs, and some development teams see that using a NoSQL database can help their productivity by simplifying their database access even if they have no need to scale beyond a single machine. Two primary reasons for considering NoSQL: 1) To handle data access with sizes and performance that demand a cluster 2) To improve the productivity of application development by using a more convenient data interaction style.
  • 26. A NoSQL database provides a mechanism for storage and retrieval of data; such databases are used in real-time web applications and big data, and their use is increasing over time. Many NoSQL stores compromise consistency in favor of availability, speed, and partition tolerance. Advantages of NoSQL: 1. High Scalability NoSQL databases use sharding for horizontal scaling. They can handle huge amounts of data; as the data grows, a NoSQL database scales to handle it in an efficient manner. 2. High Availability The auto-replication feature in NoSQL databases makes them highly available.
  • 27. Disadvantages of NoSQL: 1. Narrow Focus: mainly designed for storage, with comparatively little other functionality. 2. Open Source: NoSQL databases are open source with no single standard, so two database systems are likely to differ considerably. 3. Management Challenge: big data management in NoSQL is much more complex than in a relational database. 4. GUI not available: flexible GUI tools for accessing these databases are not widely available. 5. Backup: backup is a weak point for some NoSQL databases such as MongoDB. 6. Large document size: data is stored as JSON-like documents, which increases document size.
  • 28. When should NoSQL be used • When a huge amount of data needs to be stored and retrieved. • The relationships between the data you store are not that important. • The data changes over time and is not structured. • Support for constraints and joins is not required at the database level. • The data is growing continuously and you need to scale the database regularly to handle it.
  • 29. Key Points • Relational databases have been a successful technology for twenty years, providing persistence, concurrency control, and an integration mechanism. • Application developers have been frustrated with the impedance mismatch between the relational model and the in-memory data structures. • There is a movement away from using integration databases towards encapsulating databases within applications and integrating through services. • The vital factor for a change in data storage was the need to support large volumes of data by running on clusters. Relational databases are not designed to run efficiently on clusters.
  • 30. The common characteristics of NoSQL databases 1. Not using the relational model 2. Running well on clusters 3. Open-source 4. Built for the 21st century web estates 5. Schemaless 6. The most important result of the rise of NoSQL is Polyglot Persistence.
  • 31. Aggregate Data Models Data model: the model through which we identify and manipulate our data. It describes how we interact with the data in the database. Storage model: the model which describes how the database stores and manipulates the data internally. In NoSQL, “data model” refers to the model by which the database organizes data—more formally called a metamodel. The dominant data model is the relational data model, which uses a set of tables: • Each table has rows • Each row represents an entity • Columns describe the entity
  • 32. NoSQL databases move away from the relational model. Each NoSQL solution has a different model that it uses: 1. Key-value 2. Document 3. Column-family 4. Graph Of these, the first three share a common characteristic of their data models, which is called aggregate orientation.
  • 33. Aggregates The relational model takes the information to store and divides it into tuples. A tuple is a limited data structure: • You cannot nest one tuple within another to get nested records. • You cannot put a list of values or tuples within another. The aggregate model recognizes that often we need to operate on data that has a more complex structure than a set of tuples. • It has a complex record that allows lists and other record structures to be nested inside it. • Key-value, document, and column-family databases all share this aggregate orientation.
  • 34. Definition: • In Domain-Driven Design, an aggregate is a collection of related objects that we wish to treat as a unit. It is a unit for data manipulation and management of consistency. Typically, we like to update aggregates with atomic operations and communicate with our data storage in terms of aggregates. Advantages of aggregates: • Dealing in aggregates makes it easier to operate on a cluster, since the aggregate makes a natural unit for replication and sharding. • Aggregates are also often easier for application programmers to work with, since they often manipulate data through aggregate structures.
  • 35. Example of Relations and Aggregates • Let’s assume we have to build an e-commerce website; we are going to be selling items directly to customers over the web, and we will have to store information about users, our product catalog, orders, shipping addresses, billing addresses, and payment data. • Data model for a relational database:
  • 36. Sample data for Relational Data Model Everything is properly normalized, no data is repeated in multiple tables. We also have referential integrity.
  • 38. Sample Data for aggregate data model // in customers { "id":1, "name":"Martin", "billingAddress":[{"city":"Chicago"}] } // in orders { "id":99, "customerId":1, "orderItems":[ { "productId":27, "price": 32.45, "productName": "NoSQL Distilled" } ], "shippingAddress":[{"city":"Chicago"}], "orderPayment":[ { "ccinfo":"1000-1000-1000-1000", "txnId":"abelif879rft", "billingAddress": {"city": "Chicago"} } ] }
  • 39. • We’ve used the black-diamond composition marker in UML to show how data fits into the aggregation structure. • The customer aggregate contains a list of billing addresses. • The order aggregate contains a list of order items, a shipping address, and payments. • The payment itself contains a billing address for that payment.
  • 40. • Here a single logical address record appears three times, but instead of using IDs it's treated as a value and copied each time. This fits the domain, where we would not want the shipping address, nor the payment's billing address, to change. • The link between the customer and the order isn't within either aggregate—it's a relationship between aggregates. We've shown the product name as part of the order item here—this kind of denormalization is similar to the tradeoffs with relational databases, but is more common with aggregates because we want to minimize the number of aggregates we access during a data interaction.
  • 41. • To draw an aggregate boundary, you have to think about accessing that data—and make that part of your thinking when developing the application data model. • Indeed, we could draw our aggregate boundaries differently, putting all the orders for a customer into the customer aggregate. Fig: Embed all the objects for customer and the customer's orders
  • 42. Sample Data for above aggregate data model // in customers { "customer": { "id": 1, "name": "Martin", "billingAddress": [{"city": "Chicago"}], "orders": [ { "id":99, "customerId":1, "orderItems":[ { "productId":27, "price": 32.45, "productName": "NoSQL Distilled" } ], "shippingAddress":[{"city":"Chicago"}], "orderPayment":[ { "ccinfo":"1000-1000-1000-1000", "txnId":"abelif879rft", "billingAddress": {"city": "Chicago"} } ] } ] } }
  • 43. • There’s no universal answer for how to draw your aggregate boundaries. It depends entirely on how you tend to manipulate your data. • If you tend to access a customer together with all of that customer’s orders at once, then you would prefer a single aggregate. • However, if you tend to focus on accessing a single order at a time, then you should prefer having separate aggregates for each order.
  • 44. Consequences of Aggregate Orientation • Relational databases have no concept of aggregate within their data model, so we call them aggregate-ignorant. In the NoSQL world, graph databases are also aggregate-ignorant. Being aggregate-ignorant is not a bad thing. It's often difficult to draw aggregate boundaries well, particularly if the same data is used in many different contexts. • An order makes a good aggregate when a customer is making and reviewing orders, and when the retailer is processing orders. • However, if a retailer wants to analyze its product sales over the last few months, then an order aggregate becomes a problem. To get to product sales history, you'll have to dig into every aggregate in the database. So an aggregate structure may help with some data interactions but be an obstacle for others.
  • 45. • An aggregate-ignorant model allows you to easily look at the data in different ways, so it is a better choice when you don’t have a primary structure for manipulating your data. • The aggregate orientation helps greatly with running on a cluster. • If we’re running on a cluster, we need to minimize how many nodes we need to query when we are gathering data. • By explicitly including aggregates, we give the database important information about which bits of data will be manipulated together, and thus should live on the same node.
  • 46. Aggregates have an important consequence for transactions: • Relational databases allow you to manipulate any combination of rows from any tables in a single transaction. Such transactions are called ACID transactions. • Many rows spanning many tables are updated as a single operation. This operation either succeeds or fails in its entirety, and concurrent operations are isolated from each other so they cannot see a partial update. • It’s often said that NoSQL databases don’t support ACID transactions and thus sacrifice consistency, but they support atomic manipulation of a single aggregate at a time. • This means that if we need to manipulate multiple aggregates in an atomic way, we have to manage that ourselves in the application code. Graph and other aggregate-ignorant databases usually do support ACID transactions similar to relational databases.
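A minimal sketch of this difference, assuming MongoDB accessed via pymongo and the order/customer aggregates used in the examples in this chapter (the database, collection, and field names are illustrative):

from pymongo import MongoClient

db = MongoClient()["shop"]   # assumes a local MongoDB instance

# Updating a single aggregate (one document) is atomic on its own:
db.orders.update_one({"id": 99}, {"$push": {"orderItems": {"productId": 31, "price": 21.50}}})

# Touching two aggregates means two separate operations; if the second one fails,
# the application code has to detect and repair the inconsistency itself.
db.orders.update_one({"id": 99}, {"$set": {"status": "paid"}})
db.customers.update_one({"id": 1}, {"$inc": {"openOrders": -1}})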
  • 47. Key-Value and Document Data Models • Key-value and document databases are strongly aggregate-oriented, meaning we think of these databases as primarily constructed through aggregates. • Both of these types of databases consist of lots of aggregates, with each aggregate having a key or ID that's used to get at the data. • Riak and Redis are examples of key-value databases. • MongoDB and CouchDB are the most popular document databases.
  • 48. Key-Value Data Model • Key-value databases are the simplest of the NoSQL databases: The basic data structure is a dictionary or map. You can store a value, such as an integer, a string, a JSON structure, or an array, along with a key used to reference that value. • For example, a simple key-value database might have a value such as "Douglas Adams". This value is then assigned an ID, such as cust1237. • Using a JSON structure adds complexity to the database. For example, the database could store a full mailing address in addition to a person's name. In the previous example, key cust1237 could point to the following information: { name: "Douglas Adams", street: "782 Southwest St.", city: "Austin", state: "TX" }
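A minimal sketch of this interaction, assuming a local Redis server and the redis-py client; the key cust1237 comes from the example above, and serializing the record as a JSON string is one common convention rather than the only option:

import json
import redis

r = redis.Redis()   # assumes Redis running on localhost:6379

customer = {"name": "Douglas Adams", "street": "782 Southwest St.", "city": "Austin", "state": "TX"}
r.set("cust1237", json.dumps(customer))   # the value is just an opaque blob to Redis

raw = r.get("cust1237")                   # the only way back in is by key
print(json.loads(raw)["name"])            # parsing happens in the application, not the database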
  • 49. Weakness of key-value databases • This model does not provide traditional database capabilities such as atomicity of transactions, or consistency when multiple transactions are executed simultaneously. Such capabilities must be provided by the application itself. • As the volume of data increases, maintaining unique keys may become more difficult; addressing this issue requires introducing some complexity in generating character strings that will remain unique among an extremely large set of keys.
  • 50. Document Data Model • A document database is a type of non-relational database designed to store and query data as JSON-like documents, which makes it easier for developers to store and query data. • It works well for use cases such as catalogs and user profiles. • In a document store, each record is a collection of key-value pairs stored together as a document. • The flexible, semi-structured, hierarchical nature of documents and document databases allows them to evolve with an application's needs. • Example: Book document { "id": "98765432", "type": "book", "ISBN": "987-6-543-21012-3", "Author": { "Lname": "Roe", "MI": "T", "Fname": "Richard" }, "Title": "Understanding document databases" }
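A minimal sketch of storing and querying such a document, assuming MongoDB via pymongo; the database and collection names (catalog, books) are illustrative:

from pymongo import MongoClient

db = MongoClient()["catalog"]   # assumes a local MongoDB instance

book = {
    "id": "98765432",
    "type": "book",
    "ISBN": "987-6-543-21012-3",
    "Author": {"Lname": "Roe", "MI": "T", "Fname": "Richard"},
    "Title": "Understanding document databases",
}
db.books.insert_one(book)                                   # the whole aggregate goes in as one document
print(db.books.find_one({"Author.Lname": "Roe"})["Title"])  # query on a field nested inside the aggregate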
  • 51. Difference between key-value and document databases 1. Opacity • In a key-value database, the aggregate is opaque to the database—just some big blob of mostly meaningless bits. The advantage of opacity is that we can store whatever we like in the aggregate. The database may impose some general size limit, but other than that we have complete freedom. • In contrast, a document database is able to see a structure in the aggregate. A document database imposes limits on what we can place in it, defining allowable structures and types. In return, however, we get more flexibility in access.
  • 52. 2. Access • With a key-value store, we can only access an aggregate by lookup based on its key. • With a document database, we can submit queries to the database based on the fields in the aggregate. • In a document database we can retrieve part of the aggregate rather than the whole thing, and the database can create indexes based on the contents of the aggregate.
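A small sketch of the contrast, assuming the customer data from earlier, Redis via redis-py for the key-value side and MongoDB via pymongo for the document side (the key and collection names are illustrative):

import json
import redis
from pymongo import MongoClient

# Key-value store: the only operation is fetching the whole opaque blob by key.
r = redis.Redis()
raw = r.get("cust1237")
customer = json.loads(raw) if raw else None   # the application does all the interpretation

# Document store: query by a field inside the aggregate and retrieve only part of it.
db = MongoClient()["shop"]
billing = db.customers.find_one({"name": "Martin"}, {"billingAddress": 1})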
  • 53. Column-Family Stores • One of the early and influential NoSQL databases was Google's BigTable; its data model is a two-level map. It influenced later databases such as HBase and Cassandra. • Databases with a BigTable-style data model are often referred to as column stores. The thing that made them different was the way in which they physically stored data. • Most databases have a row as a unit of storage, which, in particular, helps write performance. However, there are many scenarios where writes are rare, but you often need to read a few columns of many rows at once. • In this situation, it's better to store groups of columns for all rows as the basic storage unit—which is why these databases are called column stores.
  • 54. • BigTable and its successors follow this notion of storing groups of columns (column families) together; we refer to these as column-family databases. • The column-family model is a two-level aggregate structure. As with key-value stores, the first key is often described as a row identifier, picking up the aggregate of interest. The difference with column-family structures is that this row aggregate is itself formed of a map of more detailed values. These second-level values are referred to as columns. As well as accessing the row as a whole, operations also allow picking out a particular column, so to get a particular customer's name you could do something like get('1234', 'name').
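A minimal sketch of that two-level lookup, assuming Apache Cassandra accessed through the DataStax Python driver; the keyspace and table names are illustrative, and the CQL SELECT stands in for the get('1234', 'name') style of access:

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("shop")   # assumes a local Cassandra node and a 'shop' keyspace

# The row key picks out the aggregate; the selected column picks one value inside it.
row = session.execute("SELECT name FROM customers WHERE id = %s", ("1234",)).one()
print(row.name)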
  • 55. Fig. Representing customer info in a column-family structure Column-family databases organize their columns into column families. Each column has to be part of a single column family, and the column acts as a unit for access, with the assumption that data for a particular column family will usually be accessed together.
  • 56. • This also gives you a couple of ways to think about how the data is structured. • Row-oriented: Each row is an aggregate (for example, customer with the ID of 1234) with column families representing useful chunks of data (profile, order history) within that aggregate. • Column-oriented: Each column family defines a record type (e.g., customer profiles) with rows for each of the records. You then think of a row as the join of records in all column families. • This latter aspect reflects the columnar nature of column-family databases. Since the database knows about these common groupings of data, it can use this information for its storage and access behavior.
  • 57. • Cassandra uses the terms “wide” and “skinny.” • Skinny rows have few columns, with the same columns used across many different rows. • In this case, the column family defines a record type, each row is a record, and each column is a field. • A wide row has many columns (perhaps thousands), with rows having very different columns. • A wide column family models a list, with each column being one element in that list. A sketch of both shapes follows below.
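A hedged sketch of the two shapes in CQL, run through the DataStax Python driver; the keyspace and table definitions are illustrative assumptions (in CQL 3 a “wide row” corresponds to a partition with clustering columns):

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("shop")   # assumes a local node and a 'shop' keyspace

# Skinny rows: the same few columns on every row -- each row is one record.
session.execute("""
    CREATE TABLE IF NOT EXISTS customer_profiles (
        customer_id text PRIMARY KEY,
        name text,
        city text)""")

# Wide rows: one partition per customer, one entry per order -- the row models a list.
session.execute("""
    CREATE TABLE IF NOT EXISTS customer_orders (
        customer_id text,
        order_id text,
        total decimal,
        PRIMARY KEY (customer_id, order_id))""")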
  • 58. Summarizing Aggregate-Oriented Databases • These are the three different styles of aggregate- oriented data models. What they all share is the notion of an aggregate indexed by a key that you can use for lookup. This aggregate is central to running on a cluster, as the database will ensure that all the data for an aggregate is stored together on one node. The aggregate also acts as the atomic unit for updates, providing a useful, if limited, amount of transactional control. • Within that notion of aggregate, we have some differences. The key-value data model treats the aggregate as an opaque whole, which means you can only do key lookup for the whole aggregate— you cannot run a query nor retrieve a part of the aggregate.
  • 59. • The document model makes the aggregate transparent to the database allowing you to do queries and partial retrievals. However, since the document has no schema, the database cannot act much on the structure of the document to optimize the storage and retrieval of parts of the aggregate. • Column-family models divide the aggregate into column families, allowing the database to treat them as units of data within the row aggregate. This imposes some structure on the aggregate but allows the database to take advantage of that structure to improve its accessibility.
  • 60. Key Points • An aggregate is a collection of data that we interact with as a unit. Aggregates form the boundaries for ACID operations with the database. • Key-value, document, and column-family databases can all be seen as forms of aggregate-oriented database. • Aggregates make it easier for the database to manage data storage over clusters. • Aggregate-oriented databases work best when most data interaction is done with the same aggregate; aggregate-ignorant databases are better when interactions use data organized in many different formations.
  • 61. More Details on Data Models Relationships • Aggregates are useful because they put together data that is commonly accessed together. But there are still lots of cases where data that’s related is accessed differently. • Consider the relationship between a customer and all of his orders. Some applications will want to access the order history whenever they access the customer; this fits in well with combining the customer with his order history into a single aggregate. • Other applications, however, want to process orders individually and thus model orders as independent aggregates.
  • 62. • In this case, you’ll want separate order and customer aggregates but with some kind of relationship between them so that any work on an order can look up customer data. The simplest way to provide such a link is to embed the ID of the customer within the order’s aggregate data. • That way, if you need data from the customer record, you read the order, search out the customer ID, and make another call to the database to read the customer data. This will work, and will be just fine in many scenarios—but the database will be ignorant of the relationship in the data. This can be important because there are times when it’s useful for the database to know about these links. • As a result, many databases—even key-value stores— provide ways to make these relationships visible to the database. Document stores make the content of the aggregate available to the database to form indexes and queries.
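A minimal sketch of that pattern, assuming MongoDB via pymongo and the separate customer and order aggregates shown later in this chapter (the collection names are illustrative); the relationship lives only in the application code:

from pymongo import MongoClient

db = MongoClient()["shop"]   # assumes a local MongoDB instance

# Read the order aggregate, pull out the embedded customer ID,
# then make a second call to fetch the customer aggregate.
order = db.orders.find_one({"orderId": 99})
customer = db.customers.find_one({"customerId": order["customerId"]})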
  • 63. • An important aspect of relationships between aggregates is how they handle updates. Aggregate-oriented databases treat the aggregate as the unit of data retrieval. Consequently, atomicity is only supported within the contents of a single aggregate. • If you update multiple aggregates at once, you have to deal yourself with a failure partway through. • Relational databases help you with this by allowing you to modify multiple records in a single transaction, providing ACID guarantees while altering many rows. • All of this means that aggregate-oriented databases become more awkward as you need to operate across multiple aggregates.
  • 64. • This may imply that if you have data based on lots of relationships, you should prefer a relational database over a NoSQL store. • While that’s true for aggregate-oriented databases, it’s worth remembering that relational databases aren’t all that stellar with complex relationships either. • This makes it a good moment to introduce another category of databases that’s often lumped into the NoSQL pile.
  • 65. Graph Databases • Graph databases are an odd fish in the NoSQL pond. • Most NoSQL databases were inspired by the need to run on clusters, which led to aggregate-oriented data models of large records with simple connections. • Graph databases are motivated by a different frustration with relational databases and thus have an opposite model—small records with complex interconnections, something like the structure shown in the next figure.
  • 66. Fig: An example graph structure In this context, a graph isn’t a bar chart or histogram; instead, we refer to a graph data structure of nodes connected by edges.
  • 67. • In the figure we have a web of information whose nodes are very small (nothing more than a name) but there is a rich structure of interconnections between them. With this structure, we can ask questions such as “find the books in the Databases category that are written by someone whom a friend of mine likes.” • Graph databases are ideal for capturing any data consisting of complex relationships, such as social networks, product preferences, or eligibility rules. • The fundamental data model of a graph database is very simple: nodes connected by edges (also called arcs).
  • 68. Difference between Graph & Relational databases • Although relational databases can implement relationships using foreign keys, the joins required to navigate around can get quite expensive—which means performance is often poor for highly connected data models. • Graph databases make traversal along the relationships very cheap. A large part of this is because graph databases shift most of the work of navigating relationships from query time to insert time. This naturally pays off for situations where querying performance is more important than insert speed.
  • 69. • The emphasis on relationships makes graph databases very different from aggregate-oriented databases. • Graph databases are more likely to run on a single server rather than distributed across clusters. • ACID transactions need to cover multiple nodes and edges to maintain consistency. • The only thing graph databases have in common with aggregate-oriented databases is their rejection of the relational model.
  • 70. Schemaless Databases • A common theme across all the forms of NoSQL databases is that they are schemaless. • When you want to store data in a relational database, you first have to define a schema—a defined structure for the database which says what tables exist, which columns exist, and what data types each column can hold. • In a relational database, you have to have the schema defined before you can store any data.
  • 71. With NoSQL databases, way of storing data • A key-value store allows you to store any data you like under a key. • A document database effectively does the same thing, since it makes no restrictions on the structure of the documents you store. • Column-family databases allow you to store any data under any column you like. • Graph databases allow you to freely add new edges and freely add properties to nodes and edges as you wish.
  • 72. With a schema: • You have to figure out in advance what you need to store, but that can be hard to do. Without a schema: • You can easily store whatever you need. • This allows you to easily change your data storage as you learn more about your project. • You can easily add new things as you discover them. • If you find you don’t need some things anymore, you can just stop storing them, without worrying about losing old data as you would if you delete columns in a relational schema.
  • 73. • A schema puts all rows of a table into a straitjacket, which becomes awkward if you have different kinds of data in different rows. You either end up with lots of columns that are usually null (a sparse table), or you end up with meaningless columns like custom column 4. • A schemaless store also makes it easier to deal with nonuniform data: data where each record has a different set of fields. It allows each record to contain just what it needs—no more, no less.
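A small sketch of how nonuniform records look in a schemaless store, assuming MongoDB via pymongo (the collection and field names are illustrative):

from pymongo import MongoClient

db = MongoClient()["shop"]   # assumes a local MongoDB instance

# Records in the same collection can carry different fields --
# no ALTER TABLE, no sparse columns, no 'custom column 4'.
db.customers.insert_one({"name": "Martin", "billingAddress": [{"city": "Chicago"}]})
db.customers.insert_one({"name": "Ann", "loyaltyTier": "gold", "preferredLanguage": "de"})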
  • 74. Problems with schemalessness: • If you are storing some data and displaying it in a report as a simple list of fieldName: value lines, then a schema is only going to get in the way. • But usually we do more with our data than this, and we do it with programs that need to know that the billing address is called billingAddress and not addressForBilling, and that the quantity field is going to be the integer 5 and not the string "five".
  • 75. Fact is that whenever we write a program that accesses data, that program almost always relies on some form of implicit schema. Unless it just says something like //pseudo code foreach (Record r in records) { foreach (Field f in r.fields) { print (f.name, f.value) } } Here it will assume that certain field names are present and carry data with a certain meaning, and assume something about the type of data stored within that field.
  • 76. • Programs are not humans; they cannot read “qty” and conclude that it must be the same as “quantity”. So, however schemaless our database is, there is usually an implicit schema present. Having the implicit schema in the application code results in some problems. • In order to understand what data is present you have to dig into the application code. • The database remains ignorant of the schema—it can't use the schema to help it decide how to store and retrieve data efficiently. It can't apply its own validations upon that data to ensure that different applications don't manipulate data in an inconsistent way. These are the reasons why relational databases have a fixed schema. • A schemaless database shifts the schema into the application code that accesses it. This becomes problematic if multiple applications, developed by different people, access the same database.
  • 77. These problems can be reduced with a couple of approaches: • Encapsulate all database interaction within a single application and integrate it with other applications using web services. • Another approach is to clearly define different areas of an aggregate for access by different applications. These could be different sections in a document database or different column families in column-family database. Relational schemas can also be changed at any time with standard SQL commands. If necessary, you can create new columns in an ad-hoc way to store nonuniform data. We have only rarely seen this done.
  • 78. Materialized Views • When we talked about aggregate-oriented data models, we stressed their advantages. If you want to access orders, it’s useful to have all the data for an order contained in a single aggregate that can be stored and accessed as a unit. • But aggregate-orientation has a corresponding disadvantage: What happens if a product manager wants to know how much a particular item has sold over the last couple of weeks? • Now the aggregate-orientation works against you, forcing you to potentially read every order in the database to answer the question. You can reduce this burden by building an index on the product, but you’re still working against the aggregate structure.
  • 79. • Relational databases support accessing data in different ways. Furthermore, they provide a convenient mechanism that allows you to look at data differently from the way it’s stored—views. View: • A view is like a relational table (it is a relation) but it’s defined by computation over the base tables. When you access a view, the database computes the data in the view—a handy form of encapsulation. • Views provide a mechanism to hide from the client whether data is derived data or base data. • But some views are expensive to compute.
  • 80. Materialized Views: • To cope with this, materialized views were invented, which are views that are computed in advance and cached on disk. Materialized views are effective for data that is read heavily but can stand being somewhat stale. • Although NoSQL databases don’t have views, they may have precomputed and cached queries, and they reuse the term “materialized view” to describe them. Often, NoSQL databases create materialized views using a map-reduce computation.
  • 81. There are two strategies to building a materialized view • The first is the eager approach where you update the materialized view at the same time you update the base data for it. In this case, adding an order would also update the purchase history aggregates for each product. • This approach is good when you have more frequent reads of the materialized view than you have writes and you want the materialized views to be as fresh as possible. The application database approach is valuable here as it makes it easier to ensure that any updates to base data also update materialized views. • If you don’t want to pay that overhead on each update, you can run batch jobs to update the materialized views at regular intervals as per requirements.
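A minimal sketch of the eager approach, assuming MongoDB via pymongo; the product_sales collection acts as the precomputed view and is updated in the same code path that writes the base order data (all names are illustrative):

from pymongo import MongoClient

db = MongoClient()["shop"]   # assumes a local MongoDB instance

def place_order(order):
    db.orders.insert_one(order)               # base data
    # Eager update of the materialized view: per-product sales totals.
    for item in order["orderItems"]:
        db.product_sales.update_one(
            {"productId": item["productId"]},
            {"$inc": {"unitsSold": 1, "revenue": item["price"]}},
            upsert=True)

The batch alternative would drop this per-write work and instead rebuild product_sales from the orders collection at regular intervals.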
  • 82. • You can build materialized views outside of the database by reading the data, computing the view, and saving it back to the database. • More often databases will support building materialized views themselves. • In this case, you provide the computation that needs to be done, and the database executes the computation when needed according to some parameters that you configure. This is particularly handy for eager updates of views with incremental map-reduce.
  • 83. Modeling for Data Access As mentioned earlier, when modeling data aggregates we need to consider how the data is going to be read as well as the side effects on data related to those aggregates. 1. Let's start with the model where all the data for the customer is embedded, using a key-value store. Fig: Embed all the objects for customer and their orders.
  • 84. • In this scenario, the application can read the customer’s information and all the related data by using the key. • If the requirements are to read the orders or the products sold in each order, the whole object has to be read and then parsed on the client side to build the results. • When references are needed, we could switch to document stores and then query inside the documents, or even change the data for the key-value store to split the value object into Customer and Order objects and then maintain these objects’ references to each other.
  • 85. With the references (see Figure), we can now find the orders independently from the Customer, and with the orderId reference in the Customer we can find all Orders for the Customer. # Customer object { "customerId": 1, "customer": { "name": "Martin", "billingAddress": [{"city": "Chicago"}], "payment": [{"type": "debit","ccinfo": "1000-1000-1000-1000"}], "orders":[{"orderId":99}] } } # Order object { "customerId": 1, "orderId": 99, "order":{ "orderDate":"Nov-20-2011", "orderItems":[{"productId":27, "price": 32.45}], "orderPayment":[{"ccinfo":"1000-1000-1000-1000", "txnId":"abelif879rft"}], "shippingAddress":{"city":"Chicago"} } }
  • 86. Fig: Customer is stored separately from Order
  • 87. 2. In document stores, since we can query inside documents, removing references to Orders from the Customer object is possible. This change allows us to not update the Customer object when new orders are placed by the Customer. # Customer object { "customerId": 1, "name": "Martin", "billingAddress": [{"city": "Chicago"}], "payment": [ {"type": "debit", "ccinfo": "1000-1000-1000-1000"} ] } #Order object { "orderId": 99, "customerId": 1, "orderDate":"Nov-20-2011", "orderItems":[{"productId":27, "price": 32.45}], "orderPayment":[{"ccinfo":"1000-1000-1000-1000", "txnId":"abelif879rft"}], "shippingAddress":{"city":"Chicago"} }
  • 88. • Since document data stores allow you to query by attributes inside the document, searches such as “find all orders that include the Refactoring Databases product” are possible, but the decision to create an aggregate of items and orders they belong to is not based on the database’s query capability but on the read optimization desired by the application.
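A minimal sketch of such a query, assuming MongoDB via pymongo and the Order documents shown above; since those order items carry only a productId, the sketch assumes 27 stands for the “Refactoring Databases” product:

from pymongo import MongoClient

db = MongoClient()["shop"]   # assumes a local MongoDB instance

# Find all orders whose nested orderItems include the given product.
for order in db.orders.find({"orderItems.productId": 27}):
    print(order["orderId"], order["customerId"])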
  • 89. 3. When using column families to model the data, it is important to remember to do it as per your query requirements and not for the purpose of writing; the general rule is to make it easy to query and to denormalize the data during write. • There are multiple ways to model the data; one way is to store the Customer and Order in different column families (see Figure). Here, it is important to note that the references to all the orders placed by the customer are in the Customer column family.
  • 90. Fig: Conceptual view into a column data store 4. When using graph databases to model the same data, we model all objects as nodes and the relations between them as relationships; these relationships have types and directional significance.
  • 91. • Each node has independent relationships with other nodes. These relationships have names like PURCHASED, PAID_WITH, or BELONGS_TO (see Figure); these relationship names let you traverse the graph. • Let’s say you want to find all the Customers who PURCHASED a product with the name Refactoring Database. All we need to do is query for the product node Refactoring Databases and look for all the Customers with the incoming PURCHASED relationship.
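A minimal sketch of that traversal, assuming Neo4j accessed through the official Python driver; the connection details, node labels, and property names are illustrative assumptions layered on the relationship names from the figure:

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))  # assumed local instance

query = """
MATCH (c:Customer)-[:PURCHASED]->(p:Product {name: $title})
RETURN c.name AS customer
"""

with driver.session() as session:
    # Start from the product node and walk the incoming PURCHASED relationships.
    for record in session.run(query, title="Refactoring Databases"):
        print(record["customer"])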
  • 92. Fig: Graph model of e-commerce data
  • 93. Key Points • Aggregate-oriented databases make inter-aggregate relationships more difficult to handle than intra- aggregate relationships. • Graph databases organize data into node and edge graphs; they work best for data that has complex relationship structures. • Schemaless databases allow you to freely add fields to records, but there is usually an implicit schema expected by users of the data. • Aggregate-oriented databases often compute materialized views to provide data organized differently from their primary aggregates. This is often done with map-reduce computations.