Effective design patterns with NewSQL
Jags Ramnarayan, Chief Architect, GemFire/SQLFire, vFabric
Guillermo Tantachuco, Regional Sr. Systems Engineer, vFabric

© 2012 SpringOne 2GX. All rights reserved. Do not distribute without permission.
We challenge the traditional RDBMS design, NOT SQL

(Diagram: buffers primarily tuned for IO; first write goes to the log, second write to the data files)

        • Too much I/O
        • Design roots don't necessarily apply today
        • Too much focus on ACID
        • Disk synchronization bottlenecks
Achieving consistent response times is challenging

    – Resources (memory, IO) consumed can vary a lot
    – A highly selective query using an index can be very fast one moment
        • a high cache hit rate most of the time
    – But complex concurrent queries may wipe out the buffers, causing a huge spike in IO the next moment

                   http://blog.tonybain.com/tony_bain/2009/05/the-problem-with-the-relational-database-part-2-predictability.html
Common themes in next-gen DB architectures

    • "Shared nothing" commodity clusters – the focus shifts to memory, distributing data and clustering

    • Scale by partitioning the data and moving behavior to the data nodes

    • HA within the cluster and across data centers

    • Add capacity to scale dynamically

NoSQL, Data Grids, Data Fabrics, NewSQL
But, what about sharding?

•   Sharding works but can be a huge burden over time
•   Querying across partitions
     – A simple nested loop join can be very expensive
     – Aggregations, ordering and grouping have to be hand coded
     – Managing large intermediate data sets becomes an app problem
•   Transactions
     – Cross-partition transactions are not possible
     – Loss of atomicity/isolation means compensatory code needs to be built
•   Management, elasticity
     – Cannot expand cluster size on demand
     – Management in general is difficult
NewSQL Concepts with VMware SQLFire

• Main-memory oriented, clustered SQL DB

• NoSQL characteristics of scalability, performance and availability, but retains support for distributed transactions and SQL querying

• It is also designed so you can use it as an operational layer in front of your legacy databases through a caching framework
SQLFire at a glance

(Diagram highlights:)
• Tables can be replicated or partitioned; replication within the cluster is synchronous
• Expand the cluster on demand
• Caching framework – write-through, write-behind to an RDBMS
• Shared-nothing 'append only' disk persistence
Partitioning & Replication
Explore features using a simple STAR schema

FLIGHTS
---------------------------------------------
FLIGHT_ID CHAR(6) NOT NULL ,
SEGMENT_NUMBER INTEGER NOT NULL ,
ORIG_AIRPORT CHAR(3),
DEPART_TIME TIME,
…..
PRIMARY KEY (FLIGHT_ID, SEGMENT_NUMBER)

FLIGHTAVAILABILITY  (1–M with FLIGHTS)
---------------------------------------------
FLIGHT_ID CHAR(6) NOT NULL ,
SEGMENT_NUMBER INTEGER NOT NULL ,
FLIGHT_DATE DATE NOT NULL ,
ECONOMY_SEATS_TAKEN INTEGER ,
…..
PRIMARY KEY (FLIGHT_ID, SEGMENT_NUMBER, FLIGHT_DATE)
FOREIGN KEY (FLIGHT_ID, SEGMENT_NUMBER) REFERENCES FLIGHTS (FLIGHT_ID, SEGMENT_NUMBER)

FLIGHTHISTORY  (1–1 with FLIGHTS)
---------------------------------------------
FLIGHT_ID CHAR(6),
SEGMENT_NUMBER INTEGER,
ORIG_AIRPORT CHAR(3),
DEPART_TIME TIME,
DEST_AIRPORT CHAR(3),
…..

SEVERAL CODE/DIMENSION TABLES
---------------------------------------------
AIRLINES: airline information (very static)
COUNTRIES: list of countries served by flights
CITIES:
MAPS: photos of regions served

Assume thousands of FLIGHTS rows and millions of FLIGHTAVAILABILITY records
Creating tables
     CREATE TABLE AIRLINES (
       AIRLINE CHAR(2) NOT NULL PRIMARY KEY,
       AIRLINE_FULL VARCHAR(24),
       BASIC_RATE DOUBLE PRECISION,
       DISTANCE_DISCOUNT DOUBLE PRECISION,…. );




(Diagram: the table hosted across three SQLF cluster members)
Replicated tables

       CREATE TABLE AIRLINES (
         AIRLINE CHAR(2) NOT NULL PRIMARY KEY,
         AIRLINE_FULL VARCHAR(24),
         BASIC_RATE DOUBLE PRECISION,
         DISTANCE_DISCOUNT DOUBLE PRECISION,…. )
         REPLICATE;

Design Pattern: Replicate reference tables in STAR schemas (they seldom change, and are often referenced in queries)

(Diagram: the replicated table is copied in full to every SQLF member)
Partitioned tables

         CREATE TABLE FLIGHTS (
           FLIGHT_ID CHAR(6) NOT NULL ,
           SEGMENT_NUMBER INTEGER NOT NULL ,
           ORIG_AIRPORT CHAR(3),
           DEST_AIRPORT CHAR(3),
           DEPART_TIME TIME,
           FLIGHT_MILES INTEGER NOT NULL)
           PARTITION BY COLUMN (FLIGHT_ID);

Design Pattern: Partition fact tables in STAR schemas for load balancing (they are large and write heavy)

(Diagram: each SQLF member hosts the replicated tables plus its share of the partitioned table)
Partitioned but highly available

        CREATE TABLE FLIGHTS (
          FLIGHT_ID CHAR(6) NOT NULL ,
          SEGMENT_NUMBER INTEGER NOT NULL ,
          ORIG_AIRPORT CHAR(3),
          DEST_AIRPORT CHAR(3),
          DEPART_TIME TIME,
          FLIGHT_MILES INTEGER NOT NULL)
          PARTITION BY COLUMN (FLIGHT_ID) REDUNDANCY 1;

Design Pattern: Increase redundant copies for HA and for load balancing queries across replicas

(Diagram: each member now holds its primary partitions plus a redundant copy of another member's partitions)
Disk resident tables

       CREATE TABLE FLIGHTS (
         FLIGHT_ID CHAR(6) NOT NULL ,
         SEGMENT_NUMBER INTEGER NOT NULL ,
         …..
         PARTITION BY COLUMN (FLIGHT_ID)
         PERSISTENT;

The data dictionary is always persisted in each server.

       sqlf backup /export/fileServerDirectory/sqlfireBackupLocation

(Diagram: each SQLF member persists its replicated, partitioned, colocated and redundant data to its local disk)
Partition by Primary Key

To partition using the primary key, use:

        PARTITION BY PRIMARY KEY

- A consistent hash on the key resolves to a logical bucket
- Buckets map to physical processes (nodes)

 CREATE TABLE FLIGHTS (
   FLIGHT_ID CHAR(6) NOT NULL ,
   SEGMENT_NUMBER INTEGER NOT NULL ,
   ORIG_AIRPORT CHAR(3),
   DEST_AIRPORT CHAR(3),
   DEPART_TIME TIME,
   FLIGHT_MILES INTEGER NOT NULL,
   PRIMARY KEY (FLIGHT_ID, SEGMENT_NUMBER) )
   PARTITION BY PRIMARY KEY;
Partition by Column(s)

To partition using a column or columns, use:

  PARTITION BY COLUMN (column-name [ , column-name ]*)

- The hash key uses all partition columns

CREATE TABLE FLIGHTS (
  FLIGHT_ID CHAR(6) NOT NULL ,
  SEGMENT_NUMBER INTEGER NOT NULL ,
  ORIG_AIRPORT CHAR(3),
  DEST_AIRPORT CHAR(3),
  DEPART_TIME TIME,
  FLIGHT_MILES INTEGER NOT NULL,
  PRIMARY KEY (FLIGHT_ID, SEGMENT_NUMBER) )
  PARTITION BY COLUMN (FLIGHT_ID);
Partition by List

To partition based on specific column values:

      PARTITION BY LIST (column-name)
        VALUES ( value [ , value ]* )
          [ , VALUES ( value [ , value ]* ) ]*

CREATE TABLE FLIGHTS (
  FLIGHT_ID CHAR(6) NOT NULL ,
  SEGMENT_NUMBER INTEGER NOT NULL ,
  …..
  PRIMARY KEY (FLIGHT_ID, SEGMENT_NUMBER) )
  PARTITION BY LIST (ORIG_AIRPORT)
   (VALUES ('PDX', 'LAX'),    -- partitioned table on Node 1
    VALUES ('AMS', 'DUB'));   -- partitioned table on Node 2
Partition by Range

To partition based on a range of values of a specific column:

  PARTITION BY RANGE (column-name)
    ( VALUES BETWEEN value AND value
      [ , VALUES BETWEEN value AND value ]* )

CREATE TABLE FLIGHTS (
  FLIGHT_ID CHAR(6) NOT NULL ,
  SEGMENT_NUMBER INTEGER NOT NULL ,
  …..
  PRIMARY KEY (FLIGHT_ID, SEGMENT_NUMBER) )
  PARTITION BY RANGE (FLIGHT_MILES)
   (VALUES BETWEEN 0 AND 100,      -- partitioned table on Node 1
    VALUES BETWEEN 100 AND 500,    -- partitioned table on Node 2
    VALUES BETWEEN 500 AND 1000 ); -- partitioned table on Node 3
Partition by Expression

To partition on a derived value:

      PARTITION BY (expression)

      CREATE TABLE FLIGHTS (
        FLIGHT_ID CHAR(6) NOT NULL ,
        SEGMENT_NUMBER INTEGER NOT NULL ,
        ….
        PRIMARY KEY (FLIGHT_ID, SEGMENT_NUMBER) )
        PARTITION BY (HOUR(DEPART_TIME));
Demo environment

SQLFire Locator:          sqlf locator start -client-bind-address=loc1 -client-port=1527

SQLFire server 1:         sqlf server start -locators=loc1[10101]
                            -client-bind-address=server1 -client-port=1528
SQLFire servers 2 and 3:  started the same way, each with -locators=loc1[10101]

JMX agent:                sqlf agent start -locators=loc1[10101]

(A SQL client connects to the cluster through the locator.)
Scaling with partitioned tables
Hash partitioning for linear scalability

  Key hashing provides single-hop access to the key's partition.
  But what if the access is not based on the key … say, joins are involved?
Hash partitioning only goes so far

•   Consider this query:
                  Select * from flights, flightAvailability
            where <equijoin flights with flightAvailability>
                          and flightId = 'AA1116';

•   If both tables are hash partitioned, the join logic will need to execute on all nodes where flightAvailability data is stored

•   Distributed joins are expensive and inhibit scaling
     • joins across distributed nodes could involve distributed locks and potentially a lot of intermediate data transfer across nodes

    Equi-join is supported only for colocated data in SQLFire 1.0
Partition aware DB design

       The designer thinks about how data access maps to logical partitions

       For scaling, try to:
  1)     minimize excessive data distribution by keeping the most frequently accessed and joined data collocated on partitions

  2)     collocate the transaction working set on partitions so complex 2-phase commit / Paxos commit is eliminated or minimized


            Read Pat Helland's "Life beyond Distributed Transactions" and the Google MegaStore paper
Partition aware DB design

   – Identify partition key for “Entity Group”
       • "entity groups": set of entities across several related tables that
         can all share a single identifier
           – flightID is shared between the parent and child tables
           – CustomerID shared between customer, order and
              shipment tables

    CREATE TABLE FLIGHTAVAILABILITY (
      FLIGHT_ID CHAR(6) NOT NULL ,
      SEGMENT_NUMBER INTEGER NOT NULL ,
     …)

       PARTITION BY COLUMN (FLIGHT_ID)
       COLOCATE WITH (FLIGHTS);
Partition aware DB design


                            Select * from Flights where flight_id = 'UA326'


                                Select * from Flights f, flightAvailability fa
                                         where <JOIN clause> and
                                             flight_id = 'UA326'


                                   Select * from Flights f, flightAvailability fa
                                             where <JOIN clause> and
                                          flight_id IN ('UA326', 'AA400')


                            Select * from Flights f where orig_airport = 'SFO'
Partition Aware DB design

•   STAR schema design is the norm in OLTP design
•   Fact tables (fast changing) are natural partitioning candidates
     – Partition by: FlightID … Availability, history rows colocated with Flights
•   Dimension tables are natural replicated table candidates
     – Replicate Airlines, Countries, Cities on all nodes



•   Dealing with Joins involving M-M relationships
     – Can the one side of the M-M become a replicated table?
     – If not, run the Join logic in a parallel stored procedure to minimize distribution
     – Else, split the query into multiple queries in application
APPLICATION DESIGN PATTERNS
1. "Write thru" Distributed caching

                             "Write thru" – participate in the container transaction

                           Lazily load using a "RowLoader" for PK queries

                           Trade-off: throttled by the legacy database
2. Distributed caching with Async writes to DB
                               Queues reside in memory
                               redundantly & persistent
                                  on multiple nodes

                             Primary / Secondary listeners

                                  Store-and-forward
Demo

     Write-behind to MySQL using the DBSynchronizer (AsyncEventListener)
3. As a scalable OLTP data store




        High throughput, response time, linear scale

Redundant copies, shared-nothing persistence, online backups

    Reduce maintenance cost and operational overhead
4. As an embedded, clustered Java database

                       Just deploy a JAR or WAR into clustered app nodes

                   Just like H2 or Derby, except the data can be sync'd with a DB and is partitioned or replicated across the cluster

                        Simply switch the URL from
                        jdbc:sqlfire://myHostName:1527/
                                        to
                  jdbc:sqlfire:;mcast-port=33666;host-data=true
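A minimal sketch of what the URL switch looks like from plain JDBC, using the two URLs from the slide. The host name, port and query are placeholders, and it assumes the appropriate SQLFire JDBC driver JAR is on the classpath so the driver auto-registers.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SqlFireConnectExample {
    public static void main(String[] args) throws Exception {
        // Thin-client connection through a SQLFire server (URL from the slide;
        // "myHostName" and the query below are illustrative placeholders).
        Connection conn = DriverManager.getConnection("jdbc:sqlfire://myHostName:1527/");

        // Embedded peer alternative: this JVM itself joins the cluster and hosts data.
        // Connection conn = DriverManager.getConnection(
        //         "jdbc:sqlfire:;mcast-port=33666;host-data=true");

        try (Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT FLIGHT_ID, ORIG_AIRPORT FROM FLIGHTS")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + " departs from " + rs.getString(2));
            }
        }
        conn.close();
    }
}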
5. To process app behavior in parallel




     Map-reduce but based on simpler RPC
Scaling application logic with parallel "Data Aware procedures"
Procedures

Java stored procedures may be created according to the SQL standard:

CREATE PROCEDURE getOverBookedFlights ()
LANGUAGE JAVA PARAMETER STYLE JAVA
READS SQL DATA DYNAMIC RESULT SETS 1
EXTERNAL NAME
 'examples.OverBookedStatus.getOverBookedStatus';

      SQLFire also supports the JDBC type Types.JAVA_OBJECT. A parameter of type JAVA_OBJECT supports an arbitrary Serializable Java object.

               In this case, the procedure will be executed on the server to which the client is connected (or locally for peer clients).
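A rough sketch of what the Java body behind that CREATE PROCEDURE could look like. It follows the usual Derby-style convention (SQLFire is Derby-derived): dynamic result sets are returned through a ResultSet[] parameter, and "jdbc:default:connection" gives the nested connection. The query and the over-booking threshold are invented for illustration; the real examples.OverBookedStatus class ships with the product samples.

package examples;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class OverBookedStatus {
    // Matches the declaration above: no IN parameters, one dynamic result set
    // handed back through the ResultSet[] argument.
    public static void getOverBookedStatus(ResultSet[] outResults) throws Exception {
        // Nested connection to the database the procedure is running in.
        Connection conn = DriverManager.getConnection("jdbc:default:connection");
        PreparedStatement ps = conn.prepareStatement(
            "SELECT FLIGHT_ID, SEGMENT_NUMBER, ECONOMY_SEATS_TAKEN " +
            "FROM FLIGHTAVAILABILITY WHERE ECONOMY_SEATS_TAKEN > ?");
        ps.setInt(1, 100);                   // illustrative over-booking threshold
        outResults[0] = ps.executeQuery();   // left open; the engine streams it to the caller
    }
}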
Data Aware Procedures

                 Parallelize the procedure and prune execution to the nodes with the required data

Extend the procedure call with the following syntax:

 CALL [PROCEDURE] procedure_name
 ( [ expression [, expression ]* ] )
 [ WITH RESULT PROCESSOR processor_name ]
 [ { ON TABLE table_name [ WHERE whereClause ] }        |
     { ON {ALL | SERVER GROUPS
       (server_group_name [, server_group_name ]*) }}
 ]

 CALL getOverBookedFlights( )
 ON TABLE FLIGHTAVAILABILITY
 WHERE FLIGHT_ID = 'AA1116';

   The ON TABLE … WHERE clause hints at the data the procedure depends on.
   If the table is partitioned by the columns in the WHERE clause, the procedure execution is pruned to the nodes with the data (the node holding 'AA1116' in this case).

(Diagram: the client's call is routed from Fabric Server 1 / Fabric Server 2 to the member(s) owning that data.)
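A minimal client-side sketch of invoking the data-aware call above, assuming the extended ON TABLE … WHERE clause can be sent through a standard JDBC CallableStatement and that the procedure's result set exposes a FLIGHT_ID column (both assumptions here, not confirmed by the slide).

import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;

public class DataAwareCallExample {
    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection("jdbc:sqlfire://myHostName:1527/");

        // The ON TABLE ... WHERE clause is a routing hint only: it prunes execution
        // to the member(s) holding FLIGHT_ID 'AA1116'; it does not filter rows itself.
        CallableStatement cs = conn.prepareCall(
            "CALL getOverBookedFlights() " +
            "ON TABLE FLIGHTAVAILABILITY WHERE FLIGHT_ID = 'AA1116'");

        boolean hasResults = cs.execute();
        while (hasResults) {
            try (ResultSet rs = cs.getResultSet()) {
                while (rs.next()) {
                    System.out.println("over-booked: " + rs.getString("FLIGHT_ID"));
                }
            }
            hasResults = cs.getMoreResults();
        }
        conn.close();
    }
}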
Parallelize procedure, then aggregate (reduce)

Register a Java result processor (optional in some cases):

  CALL [PROCEDURE] procedure_name
  ( [ expression [, expression ]* ] )
  [ WITH RESULT PROCESSOR processor_name ]
  [ { ON TABLE table_name [ WHERE whereClause ] }  |
      { ON {ALL | SERVER GROUPS
        (server_group_name [, server_group_name ]*) }}
  ]

(Diagram: the client's call fans out to Fabric Server 1, 2 and 3; the result processor aggregates the partial result sets.)
Demo


     Data Aware procedure demo




6. To make data visible across sites in real time
Consistency model
Consistency Model without Transactions
   – Replication within cluster is always eager and synchronous
   – Row updates are always atomic; No need to use transactions
   – FIFO consistency: writes performed by a single thread are seen by all
     other processes in the order in which they were issued
Consistency Model without Transactions
   – Consistency in partitioned tables
       • a partitioned table row is owned by one member at a point in time
       • all updates are serialized to replicas through the owner
       • "total ordering" at a row level: atomic and isolated


   – Membership changes and consistency – need another hour 

   – Pessimistic concurrency support using 'SELECT … FOR UPDATE' (see the sketch below)
   – Support for referential integrity
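A small sketch of the pessimistic 'SELECT … FOR UPDATE' pattern from plain JDBC. The flight, segment and seat-count update are illustrative values; it only assumes the FLIGHTAVAILABILITY columns shown in the schema earlier.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class SelectForUpdateExample {
    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection("jdbc:sqlfire://myHostName:1527/");
        conn.setAutoCommit(false);   // FOR UPDATE only makes sense inside a transaction

        // Lock the availability row(s) for this flight segment before adjusting
        // seat counts, so concurrent writers block instead of clobbering each other.
        PreparedStatement lock = conn.prepareStatement(
            "SELECT ECONOMY_SEATS_TAKEN FROM FLIGHTAVAILABILITY " +
            "WHERE FLIGHT_ID = ? AND SEGMENT_NUMBER = ? FOR UPDATE");
        lock.setString(1, "AA1116");
        lock.setInt(2, 1);
        try (ResultSet rs = lock.executeQuery()) {
            if (rs.next()) {
                int taken = rs.getInt(1);
                PreparedStatement upd = conn.prepareStatement(
                    "UPDATE FLIGHTAVAILABILITY SET ECONOMY_SEATS_TAKEN = ? " +
                    "WHERE FLIGHT_ID = ? AND SEGMENT_NUMBER = ?");
                upd.setInt(1, taken + 1);
                upd.setString(2, "AA1116");
                upd.setInt(3, 1);
                upd.executeUpdate();
            }
        }
        conn.commit();
        conn.close();
    }
}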
Distributed Transactions
   •   Full support for distributed transactions
        • Supports READ_COMMITTED and REPEATABLE_READ
   •   Highly scalable without any centralized coordinator or lock manager
   •   We make some important assumptions
         • Most OLTP transactions are small in duration and size
         • W-W conflicts are very rare in practice
Distributed Transactions
   • How does it work? (sketched below)
       • Each data node has a sub-coordinator to track TX state
       • Eagerly acquire local "write" locks on each replica
           • An object is owned by a single primary at a point in time
       • Fail fast if the lock cannot be obtained
   • Atomic, and works with the cluster failure-detection system
   • Isolated until commit for READ_COMMITTED
       • Only local isolation is supported during commit
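From the application side this is ordinary JDBC: pick the isolation level, turn off auto-commit, and handle a failed-fast conflict by rolling back and retrying. A minimal sketch, with the flight values and the retry policy as assumptions; the colocation comment refers to the entity-group pattern discussed earlier.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class DistributedTxExample {
    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection("jdbc:sqlfire://myHostName:1527/");
        conn.setTransactionIsolation(Connection.TRANSACTION_READ_COMMITTED);
        conn.setAutoCommit(false);

        try {
            // The write touches rows partitioned/colocated on FLIGHT_ID, so the
            // transaction's working set stays on one partition where possible.
            PreparedStatement ps = conn.prepareStatement(
                "UPDATE FLIGHTAVAILABILITY " +
                "SET ECONOMY_SEATS_TAKEN = ECONOMY_SEATS_TAKEN + 1 " +
                "WHERE FLIGHT_ID = ? AND SEGMENT_NUMBER = ?");
            ps.setString(1, "AA1116");
            ps.setInt(2, 1);
            ps.executeUpdate();
            conn.commit();
        } catch (SQLException conflict) {
            // Write-write conflicts fail fast; a typical response is rollback and retry.
            conn.rollback();
        } finally {
            conn.close();
        }
    }
}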
Parallel disk persistence
Why is disk latency so high?
•   Challenges
     – Disk seek times are still > 2 ms
     – OLTP transactions are small writes
         • Flushing to disk will result in a seek
         • Best rates are in the 100s per second
•   RDBs and NoSQL try to avoid the problem
     – Append to transaction logs; out-of-band writes to data files
     – But reads can cause seeks to disk
Disk persistence in SQLF

(Diagram: on each member, in-memory tables feed a LOG compressor; records pass through the OS buffers into append-only operation logs.)

•    Parallel log-structured storage
•    Each partition writes in parallel
•    Backups write to disk also
      –   Increase reliability against h/w loss
•    Don't seek to disk
•    Don't flush all the way to disk
      –   Use the OS scheduler to time the write
•    Do this on primary + secondary
•    Realize very high throughput
Performance benchmark
How does it perform? Scale?

•   Scale from 2 to 10 servers (one per host)
•   Scale from 200 to 1200 simulated clients (10 hosts)
•   Single partitioned table: int PK, 40 fields (20 ints, 20 strings)
How does it perform? Scale?

•   CPU% remained low per server – about 30%, indicating many more clients could be handled
Is latency low with scale?
•   Latency decreases with server capacity
•   50-70% take < 1 millisecond
•   About 90% take less than 2 milliseconds
Thank you:

  You can reach us at …

  Jags Ramnarayan: jramnara@vmware.com

  Guillermo Tantachuco: gtantachuco@vmware.com

http://communities.vmware.com/community/vmtn/appplatform/vfabric_sqlfire




                         Q&A

Editor's Notes

  • #4: Relational databases are not predictable or reliable in terms of consistent performance for a number of reasons. Firstly, every query uses a different amount of resources. A query could consume 1 or 2 I/Os or 1 or 2 million I/Os depending on how the query is written, what data is selected, and factors such as how the database is indexed. Performance is further varied by how the database is maintained (fragmentation). What makes matters more complex is that different predicate values for a query can hit vastly different data distributions: the same query executed with different constants can have vastly different resource requirements. Because every query has a different "footprint", running a query in isolation does not provide indicative statistics on how that query will perform under concurrent load. In fact it becomes impossible to predict the exact execution duration of a relational database query, as its performance depends on what else is being executed at that exact moment. Cost-based optimization can also change plans dynamically, resulting in, again, a variance in execution times. Essentially, a lot gets done to reduce disk IO, which becomes a serious bottleneck.
  • #23, #32, #33, #35, #36, #37, #43: There are a lot of different ways to partition data in SQLFire. By default SQLFire will try to evenly distribute data at random across all servers. If that's not good enough, you can exert a lot of control over how data is divided and distributed using list, range or expression based partitioning.