NoSQL, SQL, NewSQL - methods of structuring data.

Multiple ways of storing
-> Data <-
SQL -> NOSQL -> NEWSQL
Tony Rogerson
@tonyrogerson
tonyrogerson@torver.net
dataidol.com/tonyrogerson

Agenda
Data structures
◦ Relational, Key/Value pair, Document, Graph, Column/Column Family Store
◦ Key Concepts
◦ Hashing, Partitioning, Sharding, ACID, BASE
Technology Areas
◦ SQL, NoSQL, NewSQL

Who-am-I
Freelance SQL Server professional and Data Specialist
Fellow BCS, MSc in BI, PGCert in Data Science
Started out in 1986 – VSAM, System W, Application System, DB2, Oracle, SQL Server since 4.21a
Awarded SQL Server MVP yearly since ’97
Founded UK SQL Server User Group back in ’99, founder member of DDD, SQL Bits, SQL Relay, SQL Santa
Interested in commodity-based distributed processing of data.

Data Structures
WAYS OF STRUCTURING DATA

What is data?
Tony Rogerson
Harpenden
36 on 2014-01-01, 36 on 2014-05-01, 38 on 2014-10-15
46
44

Data needs context and structure
FullName: Tony Rogerson
Email: tonyrogerson@torver.net
PostalTown: Harpenden
{WaistInches, RecordedOn}: 36 on 2014-01-01, 36 on 2014-05-01, 38 on 2014-10-15
ChestInches: 46
Age: 44
Schema gives context

Relational [Tables]
People
FullName (PK) | Email | PostalTown | WaistInches | ChestInches | AgeYears
Tony Rogerson | tonyrogerson@torver.net | Harpenden | | 46 | 44

WaistInches
FullName (FK) | WaistInches | RecordedDate
Tony Rogerson | 36 | 2014-01-01
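A minimal sketch of the relational layout on this slide, using Python's built-in sqlite3 module. The column types, the in-memory database and the join query are assumptions made for illustration. Waist readings live only in their own table, and the second and third readings are taken from the raw data rather than from the table shown.

```python
import sqlite3

# In-memory database, used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# One row per person; FullName is the primary key, as on the slide.
conn.execute("""
    CREATE TABLE People (
        FullName    TEXT PRIMARY KEY,
        Email       TEXT,
        PostalTown  TEXT,
        ChestInches INTEGER,
        AgeYears    INTEGER
    )""")

# One row per waist measurement; FullName is the foreign key back to People.
conn.execute("""
    CREATE TABLE WaistInches (
        FullName     TEXT REFERENCES People(FullName),
        WaistInches  INTEGER,
        RecordedDate TEXT
    )""")

conn.execute(
    "INSERT INTO People VALUES (?, ?, ?, ?, ?)",
    ("Tony Rogerson", "tonyrogerson@torver.net", "Harpenden", 46, 44),
)
conn.executemany(
    "INSERT INTO WaistInches VALUES (?, ?, ?)",
    [
        ("Tony Rogerson", 36, "2014-01-01"),
        ("Tony Rogerson", 36, "2014-05-01"),
        ("Tony Rogerson", 38, "2014-10-15"),
    ],
)

# Reassemble the person and the measurement history with a join.
rows = conn.execute("""
    SELECT p.FullName, w.WaistInches, w.RecordedDate
    FROM People AS p
    JOIN WaistInches AS w ON w.FullName = p.FullName
    ORDER BY w.RecordedDate""").fetchall()

for row in rows:
    print(row)
```

The repeating group (waist over time) gets its own table, so the data has to be joined back together at query time.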

Key/Value pair (EAV)
Entity | Attribute | Value
Person | FullName | Tony Rogerson
Person | Email | tonyrogerson@torver.net
Person | PostalTown | Harpenden
Person | ChestInches | 46
Person | Age | 44
WaistInches | FullName | Tony Rogerson
WaistInches | WaistInches | 36
WaistInches | RecordedDate | 2014-01-01
Examples: Riak, Dynamo, Redis, FoundationDB etc.
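A small sketch of how the Entity/Attribute/Value rows on this slide might be folded back into one record per entity. The `pivot` helper is hypothetical and the store is a plain in-memory list; a real key/value product has its own API.

```python
from collections import defaultdict

# (entity, attribute, value) rows, as listed on the slide.
eav_rows = [
    ("Person", "FullName", "Tony Rogerson"),
    ("Person", "Email", "tonyrogerson@torver.net"),
    ("Person", "PostalTown", "Harpenden"),
    ("Person", "ChestInches", 46),
    ("Person", "Age", 44),
    ("WaistInches", "FullName", "Tony Rogerson"),
    ("WaistInches", "WaistInches", 36),
    ("WaistInches", "RecordedDate", "2014-01-01"),
]


def pivot(rows):
    """Group attribute/value pairs under their entity name."""
    entities = defaultdict(dict)
    for entity, attribute, value in rows:
        entities[entity][attribute] = value
    return dict(entities)


records = pivot(eav_rows)
print(records["Person"]["PostalTown"])
```

The store itself knows nothing about which attributes belong together; that context lives in the application code that does the pivot.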

Document
JSON Schema:
{
  "FullName" : "string",
  "Email" : "string",
  "PostalTown" : "string",
  "WaistInches" : {
    "WaistInches" : "number",
    "RecordedDate" : "string" },
  "ChestInches" : "number",
  "Age" : "number"
}
JSON Document:
{
  "FullName" : "Tony Rogerson",
  "Email" : "tonyrogerson@torver.net",
  "PostalTown" : "Harpenden",
  "WaistInches" : [ {
    "WaistInches" : 36,
    "RecordedDate" : "2014-01-01" },
    {
    "WaistInches" : 36,
    "RecordedDate" : "2014-05-01" } ],
  "ChestInches" : 46,
  "Age" : 44
}
Examples: MongoDB, Couchbase, CouchDB etc.
JSON vs XML discussion: http://stackoverflow.com/questions/4862310/json-and-xml-comparison
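A short sketch of reading the document from this slide with Python's standard json module. The variable names are illustrative; a document database would normally hand back an equivalent nested structure through its own driver.

```python
import json

# The JSON document from the slide, held as a single nested object.
document = json.loads("""
{
  "FullName": "Tony Rogerson",
  "Email": "tonyrogerson@torver.net",
  "PostalTown": "Harpenden",
  "WaistInches": [
    {"WaistInches": 36, "RecordedDate": "2014-01-01"},
    {"WaistInches": 36, "RecordedDate": "2014-05-01"}
  ],
  "ChestInches": 46,
  "Age": 44
}
""")

# The measurement history travels with the person: no join is needed.
waist_history = [
    (entry["RecordedDate"], entry["WaistInches"])
    for entry in document["WaistInches"]
]

# ISO dates sort correctly as strings, so max() gives the latest reading.
latest = max(waist_history)
print(document["FullName"], latest)
```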

Schema Design
E.g. 100 machine cluster
Document Database Normal Form (Relational)
{
"firstName": "John",
"lastName": "Smith",
"isAlive": true,
"age": 25,
"height_cm": 167.6,
"address": {
"streetAddress": "21 2nd Street",
"city": "New York",
"state": "NY",
"postalCode": "10021-3100"
},
"phoneNumbers": [
{ "type": "home", "number": "212 555-1234" },
{ "type": "office", "number": "646 555-4567" }
]
}
Document database: object data stored together (collection)
Normal form (relational): object data stored separately (tables: person, address, phoneNumbers)

MongoDB Example
Use ESTER for MongoVUE
What do documents look like?

Graph
SQL (inherently very poor performance):
◦ Nested Sets
◦ Recursive CTE
Represents “connected” data
All about understanding and exploring relationships
Examples: Neo4j, Virtuoso, AllegroGraph.
[Diagram: nodes Tony, Dave, Fred and Sid, joined by relationships]
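A minimal sketch of exploring relationships between nodes, using an adjacency map and a breadth-first search. The node names come from the slide, but the edges between them are hypothetical, since the text does not say who is connected to whom.

```python
from collections import deque

# Hypothetical relationships; only the node names come from the slide.
edges = [("Tony", "Dave"), ("Tony", "Fred"), ("Fred", "Sid")]

# Build an undirected adjacency map: node -> set of neighbouring nodes.
graph = {}
for a, b in edges:
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)


def shortest_path(graph, start, goal):
    """Breadth-first search; returns the list of nodes on a shortest path."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbour in sorted(graph[node]):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None  # the two nodes are not connected


print(shortest_path(graph, "Dave", "Sid"))
```

A graph database stores the relationships directly, so a traversal like this follows links instead of repeatedly joining a table to itself.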

Column
Values stored as a key-value pair:
◦ Column Name (unique)
◦ Value
◦ Timestamp
Important bit: a column may not appear in every row!
Column Family: a container for columns and rows (like, but not, a relational table)
Relational Table: fixed columns
Column Family: columns determined by the application – flexible
Examples: Cassandra, Druid, HBase
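A rough sketch of a column family as a nested map: row key, then column name, then a (value, timestamp) pair. The `put` and `get` helpers, the row keys and the rule that the newest timestamp wins are assumptions for illustration, not any product's API.

```python
import time

# row key -> {column name -> (value, timestamp)}
column_family = {}


def put(row_key, column_name, value, timestamp=None):
    """Write a column; an older timestamp never overwrites a newer one."""
    ts = time.time() if timestamp is None else timestamp
    row = column_family.setdefault(row_key, {})
    current = row.get(column_name)
    if current is None or ts >= current[1]:
        row[column_name] = (value, ts)


def get(row_key, column_name):
    """Return the stored value, or None if this row lacks the column."""
    cell = column_family.get(row_key, {}).get(column_name)
    return None if cell is None else cell[0]


# Rows in the same column family need not share the same columns.
put("tony", "PostalTown", "Harpenden", timestamp=1)
put("tony", "ChestInches", 46, timestamp=1)
put("dave", "Email", "dave@example.com", timestamp=1)

print(get("tony", "PostalTown"))
print(get("dave", "PostalTown"))  # column absent from this row
```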

Column storage
Examples: Cassandra, Druid, HBase
http://www.datastax.com/docs/1.1/ddl/column_family
Stored as…

SQL Server Columnstore
Table is sliced into rowgroups (a rowgroup is a batch of rows)
Each rowgroup is compressed column-wise
A column segment is one column's data from within a rowgroup
There is one column segment per table column in each rowgroup, and each segment is compressed onto storage
So: a table has rows (sliced into rowgroups), and each rowgroup has one column segment per column
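A toy sketch of the rowgroup and column segment idea. The rowgroup size of 2 is deliberately tiny, and run-length encoding is only a stand-in for compression; it is not a description of how SQL Server actually compresses segments. All function names are hypothetical.

```python
def to_rowgroups(rows, rowgroup_size):
    """Slice the table's rows into consecutive batches (rowgroups)."""
    return [rows[i:i + rowgroup_size] for i in range(0, len(rows), rowgroup_size)]


def to_column_segments(rowgroup, column_names):
    """Pivot one rowgroup into one list of values per column."""
    return {name: [row[name] for row in rowgroup] for name in column_names}


def run_length_encode(values):
    """Stand-in compression: collapse runs of equal values into (value, count)."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return [tuple(pair) for pair in encoded]


columns = ["FullName", "WaistInches", "RecordedDate"]
rows = [
    {"FullName": "Tony Rogerson", "WaistInches": 36, "RecordedDate": "2014-01-01"},
    {"FullName": "Tony Rogerson", "WaistInches": 36, "RecordedDate": "2014-05-01"},
    {"FullName": "Tony Rogerson", "WaistInches": 38, "RecordedDate": "2014-10-15"},
]

rowgroups = to_rowgroups(rows, rowgroup_size=2)
segments = [to_column_segments(group, columns) for group in rowgroups]

# Values within a single column tend to repeat, which is why they compress well.
print(run_length_encode(segments[0]["WaistInches"]))
```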

Key Concepts
SHARDING, PARTITIONING, HASHING

Hashing
Distributed Database Cluster has fixed number of data nodes
Your data is spread across the database cluster
◦ 10 node cluster; each data item may reside on 3 nodes
◦ Which 3 nodes?
Data key is hashed to a number – the hashing algorithm is deterministic
data-node = f( data-key )
◦ PRINT ABS( CHECKSUM( 'All hale to the ale' ) % 10 )
◦ PRINT ABS( CHECKSUM( 'And a glass of wine for the ladies' ) % 10 )
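The same idea sketched in Python: hash the data key deterministically, take it modulo the node count, and pick 3 of the 10 nodes. Using MD5 as the hash and placing the extra copies on the next nodes in sequence are assumptions for illustration; real systems use their own placement schemes.

```python
import hashlib

NODE_COUNT = 10  # fixed number of data nodes in the cluster
REPLICAS = 3     # each data item resides on this many nodes


def nodes_for_key(data_key, node_count=NODE_COUNT, replicas=REPLICAS):
    """data-node = f(data-key): the same key always maps to the same nodes."""
    digest = hashlib.md5(data_key.encode("utf-8")).digest()
    first = int.from_bytes(digest[:8], "big") % node_count
    # Assumption: replicas go on the nodes that follow, wrapping around.
    return [(first + i) % node_count for i in range(replicas)]


for key in ("All hale to the ale", "And a glass of wine for the ladies"):
    print(key, "->", nodes_for_key(key))
```

Because the function is deterministic, any client can work out where a key lives without asking a central directory.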

Partitioning
Chop a big table up into “horizontal partitions”
Partition key required
Each partition is self-contained, binding rows by the partitioning key
Access all data through a logical view over all partitions
Done on a table-by-table basis
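A minimal sketch of horizontal partitioning: rows are placed by a partition key, and a logical view reads across every partition. Partitioning by the month of RecordedDate, and the function names, are assumptions chosen for illustration.

```python
from itertools import chain

# partition key (year-month) -> rows belonging to that partition
partitions = {}


def insert(row):
    """Place the row in the partition chosen by its partition key."""
    key = row["RecordedDate"][:7]  # e.g. "2014-05"
    partitions.setdefault(key, []).append(row)


def all_rows():
    """Logical view: every row from every partition, in key order."""
    return chain.from_iterable(partitions[key] for key in sorted(partitions))


def rows_for_month(month):
    """A query that names the partition key touches only one partition."""
    return partitions.get(month, [])


for waist, recorded in [(36, "2014-01-01"), (36, "2014-05-01"), (38, "2014-10-15")]:
    insert({"FullName": "Tony Rogerson", "WaistInches": waist, "RecordedDate": recorded})

print(sorted(partitions))
print(rows_for_month("2014-10"))
```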

Shared Nothing
Partitioning+
Each shard is self-contained and has all the procs, meta-data and, of course, your partition of data
Shard key is common to multiple tables, for example CustomerID or Email Address
Greater autonomy across the distributed database
Seeing the entire database as a logical unit is more difficult – joining is a nightmare
[Diagram: Node 1, Node 2, Node 3]
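A sketch of a shard key shared by several tables: every table's rows for a given CustomerID land on the same shard, so a join for one customer stays local to one node. The table names, the SHA-256 routing and the helper functions are hypothetical.

```python
import hashlib

SHARD_COUNT = 3  # Node 1, Node 2, Node 3

# Each shard is self-contained and holds its own copy of every table.
shards = [{"customers": {}, "orders": []} for _ in range(SHARD_COUNT)]


def shard_for(customer_id):
    """Route by the shard key; the same key always picks the same shard."""
    digest = hashlib.sha256(str(customer_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % SHARD_COUNT


def add_customer(customer_id, email):
    shards[shard_for(customer_id)]["customers"][customer_id] = email


def add_order(customer_id, item):
    shards[shard_for(customer_id)]["orders"].append((customer_id, item))


def orders_with_email(customer_id):
    """Join customers to orders inside one shard only."""
    shard = shards[shard_for(customer_id)]
    email = shard["customers"][customer_id]
    return [(email, item) for cid, item in shard["orders"] if cid == customer_id]


add_customer(1, "one@example.com")
add_customer(2, "two@example.com")
add_order(1, "ale")
add_order(2, "wine")
add_order(1, "crisps")

print(orders_with_email(1))
```

A query that spans customers has to visit every shard and merge the results itself, which is where the joining pain comes from.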

Sharding Sync
[Diagram: Node 1, Node 2, Node 3; legend: full copy of data, subset of data, replication]

ACID (Atomicity, Consistency, Isolation, Durability)
BASE (Basically Available, Soft-state, Eventually Consistent)
ACID is a transactional model
Not specific to the relational database
◦ e.g. Hive (interface to Hadoop) provides ACID facilities
Durability: write-ahead logging is expensive (latency from serialisation of writes)
Distributed transactions – Two Phase Commit (2PC)
◦ Poor scalability because of latency
◦ ACID across distributed nodes is a bad design choice
◦ Partition/shard the database and keep ACID in-node only
[Diagram: 2PC transaction – a Coordinator sends an INSERT to two Subordinates; all or nothing]
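A simplified sketch of the two-phase commit idea: the coordinator asks every subordinate to prepare, and commits only if all of them vote yes. The class and method names are hypothetical, and the sketch leaves out logging, timeouts and crash recovery, which a real protocol needs.

```python
class Subordinate:
    """A participant node that can vote on, commit, or roll back a write."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.pending = None
        self.committed = []

    def prepare(self, value):
        # Phase 1: stage the write and vote yes or no.
        if self.can_commit:
            self.pending = value
        return self.can_commit

    def commit(self):
        # Phase 2 (success): make the staged write permanent.
        self.committed.append(self.pending)
        self.pending = None

    def rollback(self):
        # Phase 2 (failure): discard the staged write.
        self.pending = None


def two_phase_commit(subordinates, value):
    """Coordinator logic: all or nothing."""
    votes = [node.prepare(value) for node in subordinates]
    if all(votes):
        for node in subordinates:
            node.commit()
        return True
    for node in subordinates:
        node.rollback()
    return False


healthy = [Subordinate("Node 1"), Subordinate("Node 2")]
print(two_phase_commit(healthy, "INSERT row"))

mixed = [Subordinate("Node 1"), Subordinate("Node 2", can_commit=False)]
print(two_phase_commit(mixed, "INSERT row"))
```

Every write waits on a round trip to every participant in both phases, which is the latency cost noted above.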

BASE is a transactional model(ish)
Specific to the distributed database model
Basically Available – all or some of the system is available
[Diagram: Node 1, Node 2, Node 3]

Soft-state
Eventually Consistent
System may change over time [as replicas become up-to-date (consistent)]
[Diagram: Node 1, Node 2, Node 3; insert value ‘A’]
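A toy sketch of eventual consistency: a write lands on one node straight away and reaches the other replicas only when replication runs, so a read in between can return stale data. The classes, the explicit `replicate` step and the key name are hypothetical simplifications.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.data = {}

    def read(self, key):
        return self.data.get(key)


class Cluster:
    def __init__(self, names):
        self.nodes = [Node(name) for name in names]
        self.pending = []  # (target node, key, value) not yet applied

    def write(self, key, value):
        """Apply the write on the first node; queue it for the others."""
        first, *others = self.nodes
        first.data[key] = value
        self.pending.extend((node, key, value) for node in others)

    def replicate(self):
        """Deliver the queued writes; afterwards all replicas agree."""
        for node, key, value in self.pending:
            node.data[key] = value
        self.pending.clear()


cluster = Cluster(["Node 1", "Node 2", "Node 3"])
cluster.write("k", "A")

before = [node.read("k") for node in cluster.nodes]
cluster.replicate()
after = [node.read("k") for node in cluster.nodes]

print(before)
print(after)
```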

SQL
AH – THE COMMON DENOMINATOR OF AN ACCESS LAYER

What is SQL?
SQL is NOT a method of storing data!
SQL is a language, it’s just syntax
Relational Theory = thinking in sets
SQL is a language that follows (but does not obey) relational theory
With SQL we associate ACID (but durability is now optional in SQL Server 2014, via delayed durability)

Origins NoSQL?
First NoSQL database was an open source relational database
NoSQL (really NoREL) started in the mid-2000s
Realisation that ACID doesn’t scale easily
Should really be NoACID (mutually exclusive for some ’70s developers)
Hadoop – came out of Yahoo
Cassandra, Riak and others are derivatives of Amazon Dynamo
NoSQL basically means: ACID doesn’t scale, SQL is too restrictive, and I’m a developer and I like complexity.

But why the need for “NoSQL”?
Feb 2001
◦ BigData - http://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf
Basically Scale-Up (SAN) costs too much and doesn’t scale well
Sick of vendor lock-in and associated costs – open source software running across cheap commodity machines (Redundant Array of Inexpensive Servers)
Availability, Resilience – by design – by software and not expensive hardware
Existing relational databases (with SQL as their only language) are expensive and too slow (ACID)
BASE v ACID
SQL implements a rigid and inflexible framework (or does it?)

Eventual Consistency in SQL Server
Asynchronous Availability Groups/Database Mirroring
Replication
Eventual / Causal Consistency
◦ Eventual no good for order specific [and important] transactions
◦ Like Merge replication
◦ Causal: deliver messages in correct order [e.g. service broker]
◦ Like Transactional Replication

MongoDB – Replica Set
$ mongo --host 10.0.0.1 --port 27017
[Diagram: primary ESTER (10.0.0.1) replicates to secondaries ROSIE (10.0.0.2) and HAZEL (10.0.0.3); heart-beat between members]
• 1 Master (primary) – multiple secondaries
• 1 R/W – multiple readers
• Setup:
  • Use replication.replSetName in mongo config file
  • On Primary:
    • rs.initiate()
    • rs.add( "---secondary address" )
    • rs.add( "---secondary address" )
    • rs.status()

MongoDB - Sharding
Shards of data (data chopped up into multiple ranges; the range a key falls into determines which shard it sits on)
Shards are standalone or Replica-Set MongoDB instances (data storage)
Config servers store configuration information about the shards.

MongoDB – Sharding (with Replica-Set)
[Diagram: queries pass through a query balancer and the mongos routers to two replica-set shards; config servers hold the shard information]
mongos: port 27020 (on ESTER, HAZEL, ROSIE)
config servers: port 27019 (shard information, pointing to the replica sets)
mongod: port 27017, replSet: rsDemo
mongod: port 27017, replSet: rsDemoRS2
Hosts shown: ESTER 10.0.0.1 (primary), ROSIE 10.0.0.2, HAZEL 10.0.0.3, DAISY 10.0.0.4, POPPY 10.0.0.5, KARLI 10.0.0.6, CONISTON 10.0.0.11, ULLSWATER 10.0.0.12, THIRLMERE 10.0.0.13 (primary)
Each replica set has one primary and its secondaries, with a heart-beat between members
Relational Databases catch up
Maintains ACID
Same scalability and performance as NoSQL systems
Some vendors: Clustrix, MemSQL, NuoDB, VoltDB, Postgres-XL
Auto-sharding, auto-partitioning
Queries need to take place on the same box to save latency
http://www.postgres-xl.org/overview/
