NoSQL and The Big Data Hullabaloo

A Practical Look at the
NOSQL and Big Data Hullabaloo
Andrew J. Brust
CEO and Founder, Blue Badge Insights

Sam Bisbee (In Absentia)
Senior Doing Stuff Person, Cloudant

Level: Intermediate

Meet Andrew

• CEO and Founder, Blue Badge Insights
• Big Data blogger for ZDNet
• Microsoft Regional Director, MVP
• Co-chair VSLive! and 17 years as a speaker
• Founder, Microsoft BI User Group of NYC
– http://www.msbinyc.com
• Co-moderator, NYC .NET Developers Group
– http://www.nycdotnetdev.com
• “Redmond Review” columnist for
Visual Studio Magazine and Redmond
Developer News
• brustblog.com, Twitter: @andrewbrust

My New Blog (bit.ly/bigondata)

Meet Sam
• Wait…you can’t. He’s not here.
• Sam Bisbee
– Director of Technical Business Development,
Cloudant
– He prefers “Senior Doing Stuff Person”
Which is ironic
• I’ve preserved a few of his slides.
• Look for: From Sam in upper-right-hand corner

Agenda
• Why NoSQL?
• NoSQL Definition(s)
• Concepts
• NoSQL Categories
• Provisioning, market, applicability
• Take-aways

NoSQL Data Fodder

• Addresses
• Preferences
• Documents
• Friends, Followers
• Notes

“Web Scale”
• This is the term used to
justify NoSQL
• The scenario: simple needs,
but “made up for in
volume”
– Millions of concurrent users
• Think of sites like Amazon
or Google
• Think of non-transactional
tasks like loading catalog
data to display product
page, or environment
preferences

From Sam
What is NOSQL?
• “Not Only SQL” - this is not a holy war

• 1870: Modern study of set theory begins

• 1970: Codd writes “A Relational Model of
Data for Large Shared Data Banks”

• 1970 – 1980: Commercial implementations
of Codd's theory are released

From Sam
What is NOSQL?
• 1970 - ~2000: the same sorts of databases
were made (plus a few niche products)

• Dot-Com Bubble forced the same data tier
problems but at a new scale (Amazon),
forcing innovation out of necessity

• 2000 – present: innovations are becoming
open source and “mainstream” (Hadoop)

From Sam
So What is NOSQL Really?

New ways of looking at dynamic data storage
and querying for larger scale systems.

(scale = concurrent users and data size)

NoSQL Common Traits

• Non-relational
• Non-schematized/schema-free
• Open source
• Distributed
• Eventual consistency
• “Web scale”
• Developed at big Internet companies

Consistency
• CAP Theorem
– A distributed database can fully guarantee only two of the
following three attributes: consistency, availability and partition
tolerance
• Most NoSQL databases do not offer full “ACID” guarantees
– Atomicity, consistency, isolation and durability
• Instead offers “eventual consistency”
– Similar to DNS propagation

Consistency
• Things like inventory, account balances should be
consistent
– Imagine updating a server in Seattle that stock was depleted
– Imagine not updating the server in NY
– Customer in NY goes to order 50 pieces of the item
– Order processed even though no stock
• Things like catalog information don’t have to be,
at least not immediately
– If a new item is entered into the catalog, it’s OK for some
customers to see it even before the other customers’ server
knows about it
• But catalog info must come up quickly
– Therefore don’t lock data in one location while waiting to
update the other
• Therefore, OK to sacrifice consistency for speed,
in some cases
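
The Seattle/NY scenario above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names, the `pending` queue and the `propagate` step are hypothetical and do not model any particular NoSQL product.

```python
# Hypothetical two-replica store; names and behavior are illustrative only.
class Replica:
    def __init__(self, name):
        self.name = name
        self.data = {}


class EventuallyConsistentStore:
    def __init__(self, replica_names):
        self.replicas = {name: Replica(name) for name in replica_names}
        self.pending = []  # (target replica, key, value) updates not yet delivered

    def write(self, replica_name, key, value):
        # Apply locally right away; never lock or wait on the other replicas.
        self.replicas[replica_name].data[key] = value
        for name in self.replicas:
            if name != replica_name:
                self.pending.append((name, key, value))

    def read(self, replica_name, key):
        return self.replicas[replica_name].data.get(key)

    def propagate(self):
        # Deliver queued updates in order; afterwards all replicas agree.
        for name, key, value in self.pending:
            self.replicas[name].data[key] = value
        self.pending.clear()


store = EventuallyConsistentStore(["Seattle", "NY"])
store.write("Seattle", "stock:widget", 50)
store.propagate()

store.write("Seattle", "stock:widget", 0)   # stock depleted in Seattle
stale = store.read("NY", "stock:widget")    # NY has not heard yet: still 50
store.propagate()
fresh = store.read("NY", "stock:widget")    # now 0 everywhere
```

Between the second write and the second `propagate`, an NY customer could still order 50 pieces. That window is harmless for catalog data and unacceptable for inventory.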

CAP Theorem

[Diagram: CAP triangle with corners Consistency, Availability and Partition Tolerance; the labels “Relational” and “NoSQL” mark where each family sits]

Indexing
• Most NoSQL databases are indexed by key
• Some allow so-called “secondary” indexes
• Often the primary key indexes are
clustered
• HBase uses HDFS (the Hadoop Distributed
File System), which is append-only
– Writes are logged
– Logged writes are batched
– File is re-created and sorted
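
The log, batch, re-create-and-sort cycle above can be sketched as a toy log-structured table. This is not HBase's actual implementation: the class, its `batch_size` parameter and the in-memory lists standing in for files are all assumptions made for illustration.

```python
# Toy log-structured table; lists stand in for append-only files.
class LogStructuredTable:
    def __init__(self, batch_size=3):
        self.batch_size = batch_size
        self.log = []          # append-only record of every write
        self.buffer = {}       # batched writes waiting to be flushed
        self.sorted_file = []  # key-sorted (key, value) pairs, never edited in place

    def put(self, key, value):
        self.log.append((key, value))   # 1. write is logged
        self.buffer[key] = value        # 2. logged writes are batched
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        # 3. the file is re-created and sorted, not updated in place
        merged = dict(self.sorted_file)
        merged.update(self.buffer)
        self.sorted_file = sorted(merged.items())
        self.buffer = {}

    def get(self, key):
        # Newest value wins: check the unflushed batch first.
        if key in self.buffer:
            return self.buffer[key]
        return dict(self.sorted_file).get(key)


table = LogStructuredTable(batch_size=3)
for key, value in [("row3", "c"), ("row1", "a"), ("row2", "b"), ("row1", "A")]:
    table.put(key, value)
```

After the third `put` the batch is flushed into a key-sorted file; the fourth write sits in the buffer until the next flush, yet reads already see it.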

Queries
• Typically no query language
• Instead, create procedural program
• Sometimes SQL is supported
• Sometimes MapReduce code is used…

MapReduce
• Map step: pre-processes data
• Reduce step: summarizes/aggregates data
• Most typical of Hadoop and used with
Wide Column Stores, esp. HBase
• Amazon Web Services’ Elastic MapReduce
(EMR) can read/write DynamoDB, S3,
Relational Database Service (RDS)
• “Hive” offers a HiveQL (SQL-like)
abstraction over MR
– Use with Hive tables
– Use with HBase
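
The map and reduce steps can be sketched in plain Python, with no Hadoop involved. The order records and function names below are assumptions for illustration; a real job would distribute each step across many machines.

```python
from collections import defaultdict

# Hypothetical order records; the slides do not supply this data.
orders = [
    {"customer": 101, "price": 300},
    {"customer": 202, "price": 2500},
    {"customer": 101, "price": 120},
]


def map_step(order):
    # Pre-process one record into (key, value) pairs.
    yield order["customer"], order["price"]


def shuffle(pairs):
    # Group values by key; the framework does this between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_step(key, values):
    # Summarize all values that share a key.
    return key, sum(values)


mapped = [pair for order in orders for pair in map_step(order)]
totals = dict(reduce_step(key, values) for key, values in shuffle(mapped).items())
```

Here `totals` holds the total spend per customer, roughly what `SELECT customer, SUM(price) ... GROUP BY customer` would express in a SQL-like layer such as HiveQL.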

Sharding
• A partitioning pattern where separate
servers store partitions
• Fan-out queries supported
• Partitions may be duplicated, so
replication also provided
– Good for disaster recovery
• Since “shards” can be geographically
distributed, sharding can act like a CDN
• Good for keeping data close to processing
– Reduces network traffic when MapReduce splitting
takes place
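
A minimal sketch of the pattern, assuming hash-based placement: each key is routed to one shard, and a fan-out query visits every shard. The class and method names are hypothetical, the shards are plain dicts rather than servers, and replication is left out.

```python
import zlib


class ShardedStore:
    def __init__(self, shard_count):
        # Each dict stands in for a separate server holding one partition.
        self.shards = [{} for _ in range(shard_count)]

    def _shard_for(self, key):
        # Stable hash, so the same key always lands on the same shard.
        return zlib.crc32(str(key).encode("utf-8")) % len(self.shards)

    def put(self, key, row):
        self.shards[self._shard_for(key)][key] = row

    def get(self, key):
        # Single-key reads touch exactly one shard.
        return self.shards[self._shard_for(key)].get(key)

    def fan_out(self, predicate):
        # Non-key queries must ask every shard and merge the answers.
        results = []
        for shard in self.shards:
            results.extend(row for row in shard.values() if predicate(row))
        return results


store = ShardedStore(shard_count=3)
store.put(101, {"First_Name": "Andrew", "Last_Name": "Brust"})
store.put(202, {"First_Name": "Jane", "Last_Name": "Doe"})
does = store.fan_out(lambda row: row["Last_Name"] == "Doe")
```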

Key-Value Stores
• The most common; not necessarily the most
popular
• Has rows, each with something like a big
dictionary/associative array
– Schema may differ from row to row
• Common on cloud platforms
– e.g. Amazon SimpleDB, Azure Table Storage
• MemcacheDB, Voldemort, Couchbase
• DynamoDB (AWS), Dynomite, Redis and Riak

Key-Value Stores
Database

Table: Customers
Row ID: 101
  First_Name: Andrew
  Last_Name: Brust
  Address: 123 Main Street
  Last_Order: 1501
Row ID: 202
  First_Name: Jane
  Last_Name: Doe
  Address: 321 Elm Street
  Last_Order: 1502

Table: Orders
Row ID: 1501
  Price: 300 USD
  Item1: 52134
  Item2: 24457
Row ID: 1502
  Price: 2500 GBP
  Item1: 98456
  Item2: 59428
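
The example rows map naturally onto dictionaries keyed by row ID. This is a sketch only: plain Python dicts stand in for a store, and the `Twitter` property is a made-up addition to show that schema may differ from row to row.

```python
# Each table maps a row ID to a dictionary of properties.
customers = {
    101: {"First_Name": "Andrew", "Last_Name": "Brust",
          "Address": "123 Main Street", "Last_Order": 1501},
    202: {"First_Name": "Jane", "Last_Name": "Doe",
          "Address": "321 Elm Street", "Last_Order": 1502},
}
orders = {
    1501: {"Price": "300 USD", "Item1": 52134, "Item2": 24457},
    1502: {"Price": "2500 GBP", "Item1": 98456, "Item2": 59428},
}

# No shared schema: one row can gain a property the others lack.
customers[202]["Twitter"] = "@janedoe"  # hypothetical property

# There are no joins; the application follows the reference itself.
last_order = orders[customers[101]["Last_Order"]]
```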

Wide Column Stores
• Has tables with declared column families
– Each column family has “columns” which are KV pairs that
can vary from row to row
• These are the most foundational for large
sites
– BigTable (Google)
– HBase (Originally part of Yahoo-dominated Hadoop project)
– Cassandra (Facebook)
Calls column families “super columns” and tables “super
column families”
• They are the most “Big Data”-ready
– Especially HBase + Hadoop

Wide Column Stores
Table: Customers
Row ID: 101
  Super Column: Name
    Column: First_Name: Andrew
    Column: Last_Name: Brust
  Super Column: Address
    Column: Number: 123
    Column: Street: Main Street
  Super Column: Orders
    Column: Last_Order: 1501
Row ID: 202
  Super Column: Name
    Column: First_Name: Jane
    Column: Last_Name: Doe
  Super Column: Address
    Column: Number: 321
    Column: Street: Elm Street
  Super Column: Orders
    Column: Last_Order: 1502

Table: Orders
Row ID: 1501
  Super Column: Pricing
    Column: Price: 300 USD
  Super Column: Items
    Column: Item1: 52134
    Column: Item2: 24457
Row ID: 1502
  Super Column: Pricing
    Column: Price: 2500 GBP
  Super Column: Items
    Column: Item1: 98456
    Column: Item2: 59428
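
One way to picture the Customers table is as a nested map: row ID, then column family (super column), then column. The sketch below uses plain dicts, and the `read_family` helper is hypothetical; it only illustrates why grouping columns into families lets a reader fetch one family without touching the rest.

```python
# row ID -> column family -> column -> value
customers = {
    101: {
        "Name": {"First_Name": "Andrew", "Last_Name": "Brust"},
        "Address": {"Number": 123, "Street": "Main Street"},
        "Orders": {"Last_Order": 1501},
    },
    202: {
        "Name": {"First_Name": "Jane", "Last_Name": "Doe"},
        "Address": {"Number": 321, "Street": "Elm Street"},
        "Orders": {"Last_Order": 1502},
    },
}


def read_family(table, family):
    # Return a single column family for every row, skipping the others.
    return {row_id: row.get(family, {}) for row_id, row in table.items()}


names = read_family(customers, "Name")
```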

Document Stores
• Have “databases,” which are akin to tables
• Have “documents,” akin to rows
– Documents are typically JSON objects
– Each document has properties and values
– Values can be scalars, arrays, links to documents in other databases
or sub-documents (i.e. contained JSON objects, which allow for
hierarchical storage)
– Can have attachments as well
• Old versions are retained
– So Doc Stores work well for content management
• Some view doc stores as specialized KV stores
• Most popular with developers, startups, VCs
• The biggies:
– CouchDB and derivatives
– MongoDB

Document Store
Application Orientation
• Documents can each be addressed by
URIs
• CouchDB supports full REST interface
• Very geared towards JavaScript and JSON
– Documents are JSON objects
– CouchDB/MongoDB use JavaScript as native
language
• In CouchDB, “view functions” also have
unique URIs and they return HTML
– So you can build entire applications in the database

Document Stores
Database: Customers
Document ID: 101
  First_Name: Andrew
  Last_Name: Brust
  Address:
    Number: 123
    Street: Main Street
  Orders:
    Most_recent: 1501
Document ID: 202
  First_Name: Jane
  Last_Name: Doe
  Address:
    Number: 321
    Street: Elm Street
  Orders:
    Most_recent: 1502

Database: Orders
Document ID: 1501
  Price: 300 USD
  Item1: 52134
  Item2: 24457
Document ID: 1502
  Price: 2500 GBP
  Item1: 98456
  Item2: 59428
Graph Databases
• Great for social network applications and
others where relationships are important
• Nodes and edges
– Edge like a join
– Nodes like rows in a table
• Nodes can also have properties and
values
• Neo4j is a popular graph db

Graph Databases
Database

Nodes:
– George Washington
– Andrew Brust
– Joe Smith
– Jane Doe
– Address (Street: 123 Main Street; City: New York; State: NY; Zip: 10014)
– Order (ID: 252; Total Price: 300 USD)
– Item (ID: 52134; Type: Dress; Color: Blue)
– Item (ID: 24457; Type: Shirt; Color: Red)

Edges:
– Friend of
– Address
– Placed order
– Item1
– Item2
– Commented on photo by
– Sent invitation to
PROVISIONING, MARKET,
APPLICABILITY

NoSQL on Windows Azure
• Platform as a Service
– Cloudant: https://cloudant.com/azure/
– MongoDB (via MongoLab):
http://blog.mongolab.com/2012/10/azure/
• MongoDB, DIY:
– On an Azure Worker Role:
http://www.mongodb.org/display/DOCS/MongoDB+on+Azure+Worker+Roles
– On a Windows VM:
e+VM+-+Windows+Installer
– On a Linux VM:
e+VM+-+Linux+Tutorial
http://www.windowsazure.com/en-us/manage/linux/common-tasks/mongodb-on-a-linux-vm/

NoSQL on Windows Azure
• Others, DIY (Linux VMs):
– Couchbase: http://blog.couchbase.com/couchbase-server-new-windows-azure
– CouchDB:
http://ossonazure.interoperabilitybridges.com/articles/couchdb-installer-for-windows-azure
– Riak: http://basho.com/blog/technical/2012/10/09/Riak-on-Microsoft-Azure/
– Redis:
http://blogs.msdn.com/b/tconte/archive/2012/06/08/running-redis-on-a-centos-linux-vm-in-windows-azure.aspx
– Cassandra: http://www.windowsazure.com/en-us/manage/linux/other-resources/how-to-run-cassandra-with-linux/

From Sam
The High-Level Shake Out
• Hadoop will continue to crush data
warehousing

• MongoDB will be the top MySQL / on-prem
alternative

• Cloudant will be the top as-a-Service /
Cloud database

• Basho [Riak] is pivoting toward cloud
object store

NoSQL + BI
• NoSQL databases are bad for ad hoc
query and data warehousing
• BI applications involve models; models
rely on schema
• Extract, transform and load (ETL) may be
your friend
• Wide-column stores, however, are good for
“Big Data”
– See next slide
• Wide-column stores and column-oriented
databases are similar technologically

NoSQL + Big Data
• Big Data and NoSQL are interrelated
• Typically, wide-column stores are used in Big
Data scenarios
• Prime example:
– HBase and Hadoop
• Why?
– Lack of indexing not a problem
– Consistency not an issue
– Fast reads very important
– Distributed file systems important too
– Commodity hardware and disk assumptions also
important
– Not Web scale but massive scale-out, so similar
concerns

Compromises
• Eventual consistency
• Write buffering
• Often only primary keys can be indexed
• Queries must be written as programs
• Tooling
– Productivity (= money)

Summing Up
• Line of Business -> Relational
• Large, public (consumer)-facing sites ->
NoSQL

• Complex data structures -> Relational
• Big Data -> NoSQL

• Transactional -> Relational
• Content Management -> NoSQL

• Enterprise -> Relational
• Consumer Web -> NoSQL

Thank you

• andrew.brust@bluebadgeinsights.com
• @andrewbrust on twitter
• Want to get on Blue Badge Insights’ list?
Text “bluebadge” to 22828
