SQL & NOSQL
David Simons
@SwamWithTurtles
WHO AM I?
• Tech Lead/Consultant at Softwire
• Background in Statistics & Computer Simulation
WHAT DO WE DO?
• Business Analysis/Mapping
• Architecture
• Project Management
• Design (UI and User Workflows)
• Development
• QA
• Warranty
What problems are we solving?
How do we solve them?
Solving them now!
Are they still solving the problem?
TODAY WE’RE GOING TO TALK ABOUT
• Business Analysis/Mapping
• Architecture
• Project Management
• Design (UI and User Workflows)
• Development
• QA
• Warranty
HOW TO DO ARCHITECTURE
EVOLVING DESIGN
UP-FRONT DECISION MAKING
TODAY…
• Part 1: Looking at some SQL & Database Theory
• Part 2: Looking at a lot of NoSQL databases
WHAT IS A DATABASE?
PART 1: THEORY
“A database is a collection of information organized to provide efficient retrieval.”
- UNIVERSITY OF GEORGIA
THE MYTHICAL DATABASE DIVIDE
SQL | NOSQL
THE MYTHICAL DATABASE DIVIDE
• NoSQL (apparently) has always meant Not Only SQL
• Considering databases that don’t meet the SQL standard, which covers a wide range of databases
THE SQL STANDARD
PART 1: THEORY
HISTORY
• First defined by ANSI in 1986 (though around before then)
• Structured Query Language
• Different databases have implemented this standard way of storing, inserting and retrieving data
EXAMPLES OF SQL DATABASES
• MySQL
• Microsoft SQL Server
• Oracle
• PostgreSQL (mostly)
• IBM DB2 and more…
WHAT’S IN THE STANDARD?
• Rules for how the language works
• No opinion as to what the database looks like
BUT…
• ‘SQL’ has come to mean a lot more than the language (especially in the context of NoSQL)
• Family of RDBMS databases that follow a set of rules
WHAT’S IN AN RDBMS?
• Prescriptive Schema
• Set-based Operations
• Table-driven & Normalised
• ACID Transactions
SCHEMA DRIVEN
Name Species
READ DATA OUT WITH
SET-BASED OPERATIONS
EVERY ROW IS A “THING”
Name Species
1 Puss
2 Dinah
3 Einstein
4 Jess
“WHERE” (INTERSECTION)
Name Species
1 Puss
2 Dinah
3 Einstein
4 Jess
UNIONS
Name Species
1 Puss
2 Dinah
3 Einstein
4 Jess
5 Nemo
6 Moby Dick
7 Wanda
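A minimal sketch of the two set operations on the slides above, using Python's built-in sqlite3 module. The table names (pets, sea_creatures) and the species values are assumptions made for illustration; the slides only give the names.

```python
import sqlite3

# In-memory database; table names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pets (id INTEGER PRIMARY KEY, name TEXT NOT NULL, species TEXT NOT NULL)")
conn.execute("CREATE TABLE sea_creatures (id INTEGER PRIMARY KEY, name TEXT NOT NULL, species TEXT NOT NULL)")

# Species values are assumed; they are not readable from the slides.
conn.executemany("INSERT INTO pets VALUES (?, ?, ?)", [
    (1, "Puss", "Cat"), (2, "Dinah", "Cat"), (3, "Einstein", "Dog"), (4, "Jess", "Cat"),
])
conn.executemany("INSERT INTO sea_creatures VALUES (?, ?, ?)", [
    (5, "Nemo", "Fish"), (6, "Moby Dick", "Whale"), (7, "Wanda", "Fish"),
])

# WHERE keeps only the rows of the set that satisfy a predicate.
cats = [row[0] for row in conn.execute(
    "SELECT name FROM pets WHERE species = 'Cat' ORDER BY id")]

# UNION combines two row sets that have the same column shape.
everyone = [row[1] for row in conn.execute(
    "SELECT id, name FROM pets UNION SELECT id, name FROM sea_creatures ORDER BY id")]
```

Both queries describe the whole result set at once; no row-by-row loop is written on the database side.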
“Cursors are evil.”
– RON ERNEST (& THE SQL COMMUNITY AT LARGE)
NORMAL FORMS
JOINS
Name Species Species Coolness Rating
1 Puss 0
2 Dinah 0
3 Einstein 10
4 Jess 0
RELATIONS BETWEEN DATA
• We don’t like duplicating data
• Goes out of sync
• May not be the same everywhere
RELATIONS BETWEEN DATA
• Objects have properties that come in groups
• For example: Landmarks have cities and countries.
• The same city will always have the same country
WE SOLVE THAT WITH…
• Normalisation
• Store each linked group as its own row in a separate table
• And store pointers to that table
• These are combined by query-time joins
JOINS
Name Species
1 Puss
2 Dinah
3 Einstein
4 Jess
Species Coolness Rating
1 0
2 10
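The two-table layout above can be sketched with sqlite3 as follows. The table and column names (animals, species, species_id, coolness_rating) are hypothetical, and the assignment of animals to species ids is inferred from the coolness ratings on the earlier slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE species (
        id INTEGER PRIMARY KEY,
        coolness_rating INTEGER NOT NULL
    );
    CREATE TABLE animals (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        species_id INTEGER NOT NULL REFERENCES species(id)  -- pointer to the other table
    );
    INSERT INTO species VALUES (1, 0), (2, 10);
    INSERT INTO animals VALUES (1, 'Puss', 1), (2, 'Dinah', 1), (3, 'Einstein', 2), (4, 'Jess', 1);
""")

# The join recombines the two tables at query time.
JOIN_QUERY = """
    SELECT a.name, s.coolness_rating
    FROM animals AS a
    JOIN species AS s ON s.id = a.species_id
    ORDER BY a.id
"""
rows = conn.execute(JOIN_QUERY).fetchall()

# The rating is stored once, so a single-row update reaches every animal of that species.
conn.execute("UPDATE species SET coolness_rating = 1 WHERE id = 1")
updated = conn.execute(JOIN_QUERY).fetchall()
```

Because the rating lives in one row, it cannot go out of sync between animals of the same species.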
WRITE DATA IN WITH
TRANSACTIONS
“A unit of work you want to treat as a whole”
Name Species
1 Puss
2 Dinah
3 Einstein
4 Jess
{ Donald, Pluto, Mickey }
Ducks aren’t mammals
Name Species
1 Puss
2 Dinah
3 Einstein
4 Jess
The database is always in a valid state, as defined
by a whole number of queries
regardless of:
(1) invalid data;
(2) concurrent requests;
(3) system failures
ACID
• Atomicity
• Consistency
• Isolation
• Durability
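A rough sketch of the "Ducks aren't mammals" example as an atomic transaction, again with sqlite3. The mammals table, its CHECK constraint and the species values are assumptions chosen to make one insert in the batch invalid.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: the CHECK constraint encodes "ducks aren't mammals".
conn.execute("""
    CREATE TABLE mammals (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        species TEXT NOT NULL CHECK (species <> 'Duck')
    )
""")
conn.executemany(
    "INSERT INTO mammals (name, species) VALUES (?, ?)",
    [("Puss", "Cat"), ("Dinah", "Cat"), ("Einstein", "Dog"), ("Jess", "Cat")],
)
conn.commit()

newcomers = [("Mickey", "Mouse"), ("Pluto", "Dog"), ("Donald", "Duck")]
try:
    # "with conn" commits if the block succeeds and rolls back if it raises.
    with conn:
        conn.executemany("INSERT INTO mammals (name, species) VALUES (?, ?)", newcomers)
    rolled_back = False
except sqlite3.IntegrityError:
    rolled_back = True

# Mickey and Pluto were valid, but the unit of work failed as a whole.
names = [row[0] for row in conn.execute("SELECT name FROM mammals ORDER BY id")]
```

The table ends up exactly as it started: none of the three inserts survive, even though only one of them was invalid.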
WHAT’S IN AN RDBMS?
• Prescriptive Schema
• Set-based Operations
• Table-driven & Normalised
• ACID Transactions
CAPACITY & SCALABILITY
PART 1: THEORY
ASKING A SYSTEM TO DO SOMETHING USES RESOURCES
WHAT HAPPENS AS MORE REQUESTS COME IN?
TRUTHFULLY
SQL IS PRETTY GOOD FOR LARGE AMOUNTS OF DATA
THE HARD TRUTH
WITH ENOUGH DATA, YOU HAVE TO SCALE
YOUR CURRENT SYSTEM
DATABASE APPLICATION USERS
AS IT GROWS
DATABASE APPLICATION USERS
HORIZONTAL SCALABILITY
DATABASE
APPLICATION
USERS
DATABASE
DATABASE
VERTICAL SCALABILITY
MORE POWERFUL DATABASE
APPLICATION
USERS
THE HARD TRUTH
SQL CAN SCALE…
SQL CAN SCALE VERTICALLY
AND…
• Scaling to meet the needs of read operations is very doable
• Master-Slave replication
BUT…
• Scaling writes is problematic
• How do atomic transactions work on a scaled database?
• How can SQL enforce constraints across multiple databases?
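To make the last question concrete, here is a small sketch in which two independent sqlite3 databases stand in for two shards. The users table, the modulo routing rule and the email address are all hypothetical; the point is only that a UNIQUE constraint is checked per database, not across them.

```python
import sqlite3

def make_shard():
    """Create one independent database with a UNIQUE constraint on email."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL UNIQUE)")
    return conn

shards = [make_shard(), make_shard()]

def insert_user(user_id, email):
    """Route the row to a shard by id; return True if that shard accepted it."""
    conn = shards[user_id % len(shards)]
    try:
        with conn:
            conn.execute("INSERT INTO users (id, email) VALUES (?, ?)", (user_id, email))
        return True
    except sqlite3.IntegrityError:
        return False

first = insert_user(1, "puss@example.com")        # lands on shard 1
same_shard = insert_user(3, "puss@example.com")   # shard 1 again: constraint fires
other_shard = insert_user(2, "puss@example.com")  # shard 0: duplicate slips through

copies = sum(
    conn.execute("SELECT COUNT(*) FROM users WHERE email = ?", ("puss@example.com",)).fetchone()[0]
    for conn in shards
)
```

Each shard is individually consistent, yet the system as a whole now holds two rows with the same "unique" email.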
“To scale up write operations or the number of nodes in a cluster beyond a certain point you have to be able to relax some of the ACID requirements”
- JOERI SEBRACHTS
THE CAP THEOREM
PART 1: THEORY
THE COST OF SCALING
• You become vulnerable to network failures
CAP THEOREM
• Choose Two:
• Consistency
• Availability
• Partition Tolerance
• WARNING: These have specific definitions
PROVISO
There is a lot of thought in this area; I am giving a simplified description that would make many database people pull their hair out.
https://martin.kleppmann.com/2015/05/11/please-stop-calling-databases-cp-or-ap.html
CAP THEOREM
CP: Consistent & Partition Tolerant
AP: Available & Partition Tolerant
CAP THEOREM
A
B C
Data = “Cat”
Data = “Cat”
Data = “Cat”
CAP THEOREM
A
B C
Data = “Cat”
Data = “Dog”
Data = “Cat”
CAP THEOREM
A
B C
Data = “Dog”
Data = “Dog”
Data = “Dog”
AP SYSTEMS
CAP THEOREM
A
B C
Data = “Dog”
Data = “Dog” Data = “Dog”
AVAILABLE (“AP”) SYSTEMS
A
B C
Data = “Wolf”
Data = “Dog” Data = “Dog”
AVAILABLE (“AP”) SYSTEMS
A
B C
Data = “Wolf”
Data = “Dog” Data = “Wolf”
CP SYSTEMS
CONSISTENT (“CP”) SYSTEM
A
B C
Data = “Dog”
Data = “Dog” Data = “Dog”
CONSISTENT (“CP”) SYSTEM
A
B C
Data = “Dog”
Data = “Dog” Data = “Dog”
CONSISTENT (“CP”) SYSTEM
A
B C
Data = “Wolf”
Data = “Dog” Data = “Wolf”
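The difference between the two families can be sketched as a toy simulation. This is a deliberately simplified model in the spirit of the proviso above, not how any real database works: the Cluster class, its "AP"/"CP" modes and the majority rule are assumptions, and reconciliation after the partition heals is left out.

```python
class Unavailable(Exception):
    """Raised when a CP cluster refuses a request it cannot serve consistently."""

class Cluster:
    def __init__(self, names, mode, initial):
        self.mode = mode                          # "AP" or "CP"
        self.data = {name: initial for name in names}
        self.groups = [set(names)]                # one group means no partition

    def partition(self, *groups):
        """Split the nodes into groups that cannot talk to each other."""
        self.groups = [set(group) for group in groups]

    def _group_of(self, node):
        return next(group for group in self.groups if node in group)

    def _check(self, group):
        # A CP cluster only serves requests from a majority of the nodes.
        if self.mode == "CP" and len(group) * 2 <= len(self.data):
            raise Unavailable("node cannot reach a majority")

    def write(self, node, value):
        group = self._group_of(node)
        self._check(group)
        for peer in group:                        # replicate to reachable peers only
            self.data[peer] = value

    def read(self, node):
        self._check(self._group_of(node))
        return self.data[node]

# AP: the isolated node keeps accepting writes, so the two sides diverge.
ap = Cluster(["A", "B", "C"], mode="AP", initial="Dog")
ap.partition({"A"}, {"B", "C"})
ap.write("A", "Wolf")
ap_reads = (ap.read("A"), ap.read("B"))

# CP: the isolated node refuses the write; the majority side carries on.
cp = Cluster(["A", "B", "C"], mode="CP", initial="Dog")
cp.partition({"A"}, {"B", "C"})
try:
    cp.write("A", "Wolf")
    cp_minority_write = "accepted"
except Unavailable:
    cp_minority_write = "rejected"
cp.write("B", "Wolf")
cp_majority_read = cp.read("C")
```

In the AP run a client talking to A and a client talking to B see different data; in the CP run nobody sees stale data, at the price of A being unusable during the partition.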
Part 1 done
What shape is your data?
Are you happy to pay?
What uses your data?
• Databases store data in an accessible way
• SQL databases meet a defined standard; NoSQL is a movement towards considering databases that don’t
• SQL uses tables and schemas to store data, and acts on it like sets in a transactional way.
INCONSISTENT DATABASES
PART 2: EXAMPLES
THERE’S A LOT OF VALUE IN CONSISTENCY…
“Reliability at massive scale is one of the biggest challenges we face at Amazon.com. Even the slightest outage has significant financial consequences and impacts customer trust.”
– DYNAMO: AMAZON’S HIGHLY AVAILABLE KEY-VALUE STORE
“Dynamo targets applications that operate with weaker consistency if this results in high availability.”
– DYNAMO: AMAZON’S HIGHLY AVAILABLE KEY-VALUE STORE
DYNAMO IMPLEMENTATIONS
THE COST?
NOT GUARANTEED CONSISTENCY
AMAZON SHOPPING
IS THAT HONESTLY OKAY?
SMS HISTORIC LOG
IS THAT HONESTLY OKAY?
WE USED…
CASSANDRA
• All nodes communicate with each other through a Gossip protocol similar to Dynamo and Riak, exchanging information about themselves and other nodes they have gossiped with.
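The idea behind a gossip protocol can be sketched in a few lines. This is a toy model, not Cassandra's actual implementation: the node names, the version counters and the "one random peer per round" rule are assumptions for illustration.

```python
import random

def gossip_round(views, rng):
    """Each node swaps what it knows with one randomly chosen peer."""
    nodes = list(views)
    for node in nodes:
        peer = rng.choice([other for other in nodes if other != node])
        known = views[node].keys() | views[peer].keys()
        # For every node either side has heard of, keep the newest version seen.
        merged = {
            name: max(views[node].get(name, 0), views[peer].get(name, 0))
            for name in known
        }
        views[node] = dict(merged)
        views[peer] = dict(merged)

def rounds_until_converged(node_names, seed=0, max_rounds=100):
    """Gossip until every node has heard of every other node."""
    rng = random.Random(seed)
    # Initially each node only knows about itself (version 1).
    views = {name: {name: 1} for name in node_names}
    for round_number in range(1, max_rounds + 1):
        gossip_round(views, rng)
        if all(set(view) == set(node_names) for view in views.values()):
            return round_number, views
    raise RuntimeError("cluster did not converge")

rounds, views = rounds_until_converged(["A", "B", "C", "D", "E"])
```

No node acts as a coordinator: information spreads peer to peer, which is what allows a cluster built this way to have no single point of failure.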
CASSANDRA
No single point of failure
WHY CASSANDRA
• We needed fast and high availability writes
• Data didn’t need to be real time - it was aggregate analytics so eventually consistent was enough.
CASSANDRA: THE CONS
• Data is only eventually consistent - so if you need 100% accuracy it’s not great
• Not as wide a range of support as SQL (but nothing has)
• Flexible schema makes it harder to integrate with OO languages
CASSANDRA: THE PROS
• Very fast write throughput
• SQL-like query language so you don’t need to relearn things
• Wide range of language drivers
• Highly available
HIGHLY RELATIONAL DATA
PART 2: EXAMPLES
EVERY ROW IS A “THING”
Name Species
1 Puss
2 Dinah
3 Einstein
4 Jess
WHAT SQL DOES WELL
• Modelling objects:
• With a fixed structure and shape
• With a limited number of relations
• With no opinion of any deeper underlying domain
RDBMS (RELATIONAL DATABASE MANAGEMENT SYSTEM)
BUT…
THERE ARE PROBLEMS THIS IS BAD FOR
SIX DEGREES OF…
KEVIN BACON
Bristol Uni - Use Cases of NoSQL
ELECTION DATA
WORLD’S LEADING GRAPH DB:
"embedded, disk-based, fully transactional Java
persistence engine that stores data structured in
graphs rather than in tables"
DATA STORAGE
• Nodes and edges are all:
• Stored as first-class objects on the file system
• “typed”
• Key-value stores
DATA IN THE RELATIONS
• “Joins” are first class objects in the database that can be queried at no additional cost
• Certain queries become trivial (e.g. Joins)
• At a price: a high write-time cost
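As a sketch of why stored relations make the "six degrees" question cheap, here is a breadth-first search over an in-memory adjacency map. This is plain Python rather than Neo4j or its query language, and the co-star pairs ("Actor A" and so on) are placeholder data, not real filmographies.

```python
from collections import deque

# Hypothetical "appeared in a film together" relations.
appeared_together = [
    ("Kevin Bacon", "Actor A"),
    ("Actor A", "Actor B"),
    ("Actor B", "Actor C"),
    ("Kevin Bacon", "Actor D"),
]

# Store each relation directly on both nodes it connects.
graph = {}
for left, right in appeared_together:
    graph.setdefault(left, set()).add(right)
    graph.setdefault(right, set()).add(left)

def degrees_of_separation(graph, start, goal):
    """Return the number of hops from start to goal, or None if unconnected."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        person, distance = queue.popleft()
        # Following a relation is a direct lookup, not a table-wide join.
        for neighbour in graph.get(person, ()):
            if neighbour == goal:
                return distance + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, distance + 1))
    return None

far = degrees_of_separation(graph, "Actor C", "Kevin Bacon")
near = degrees_of_separation(graph, "Actor D", "Kevin Bacon")
missing = degrees_of_separation(graph, "Actor E", "Kevin Bacon")
```

In a relational schema each extra hop would be another self-join on a co-star table; here each hop only touches the neighbours of the nodes already reached.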
PROTOTYPING
• Easy to see and work with data
• Schemaless
• Active community with a lot of libraries
NEO4J USERS
NEO4J: THE CONS
• More expensive writes to the database
• Not scalable
• Less mature tooling (especially in non-Java ecosystems)
NEO4J: THE PROS
• Models certain data domains very well
• Avoids costly queries when working with lots of data
• Schemalessness allows for fast prototyping and flexible data models
• Commercial buy-in means language support is not far behind
SCHEMALESSNESS
PART 2: EXAMPLES
NB: MongoDB claims there are a lot of use cases; we’re only covering this one
MONGODB: THE CONS
• Mongo was the first famous NoSQL database and got used before it was tested and mature. There are lots of articles about missing features and bugs
• Schemalessness makes data integrity checks and OO language integration tricky
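The integrity point can be illustrated without any database at all. In this sketch plain Python dicts stand in for schemaless documents; the field names and the validate_pet helper are hypothetical and are not part of any MongoDB API.

```python
def validate_pet(doc):
    """Return a list of problems; an empty list means the document is usable."""
    problems = []
    if not isinstance(doc.get("name"), str):
        problems.append("name must be a string")
    coolness = doc.get("coolness")
    if coolness is not None and not isinstance(coolness, int):
        problems.append("coolness must be an integer")
    return problems

# A schemaless store accepts all four of these without complaint.
collection = [
    {"name": "Puss", "coolness": 0},
    {"name": "Einstein", "coolness": 10, "hobbies": ["time travel"]},  # extra field
    {"name": "Jess", "coolness": "zero"},                              # wrong type
    {"nickname": "Dinah"},                                             # missing name
]

# The checks a schema would have enforced on write now run in application code.
report = {index: validate_pet(doc) for index, doc in enumerate(collection)}
```

The flexibility is real (the extra hobbies field costs nothing), but so is the burden: every reader of the collection has to defend itself against documents of an unexpected shape.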
MONGODB: THE PROS
• Schemalessness - if you want flexible data models
• People have used it for a while, and so library support is not bad
HOW DO YOU RETRIEVE YOUR DATA
PART 2: EXAMPLES
FREE-TEXT SEARCH
DOCUMENT STORE
ElasticSearch
EVERY DOCUMENT IS A “THING”
NAME = PUSS, COOLNESS = 0
NAME = JESS, COOLNESS = 0
NAME = DINAH, COOLNESS = 0
NAME = EINSTEIN, COOLNESS = 10
APACHE LUCENE
“Apache Lucene is a high-performance, full-featured text search engine library … It is a technology suitable for nearly any application that requires full-text search”
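Full-text engines of this kind are built around an inverted index, a map from each term to the documents that contain it. The sketch below shows only that core idea in plain Python; the documents are made up, and real engines add analysis, scoring and on-disk structures that are not modelled here.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lower-case the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map every term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return the ids of documents containing every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results

# Hypothetical documents.
docs = {
    1: "Puss likes skateboarding and napping",
    2: "Einstein likes chasing cars",
    3: "Jess likes napping",
}
index = build_index(docs)
nappers = search(index, "likes napping")
nobody = search(index, "cars napping")
```

A lookup touches only the entries for the query terms, which is why this layout suits "which documents mention X?" far better than scanning every row.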
QUERIES
FOCUSED AROUND TEXT SEARCHING
QUERIES ARE TAILORED TO THE QUESTIONS YOU’LL BE ASKING
{
  "query": {
    "match": {"hobbies": "skateboard"}
  }
}
{
  "query": {
    "fuzzy": {"hobbies": "skateboarig"}
  }
}
{
  "query": {
    "match": {"hobbies": {"query": "writing reddit comments", "type": "phrase"}}
  }
}
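Fuzzy matching of the kind used in the second query is typically based on edit distance: how many single-character changes turn one term into another. The following is a minimal standalone sketch of that idea, not ElasticSearch's implementation; the vocabulary and the max_edits default of 2 are assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute, free if equal
            ))
        previous = current
    return previous[-1]

def fuzzy_match(term, vocabulary, max_edits=2):
    """Return the indexed terms within max_edits of the (possibly misspelt) term."""
    return sorted(word for word in vocabulary if edit_distance(term, word) <= max_edits)

# Hypothetical terms that might appear in a "hobbies" field.
vocabulary = ["skateboard", "skating", "writing"]
matches = fuzzy_match("skateboarig", vocabulary)
```

The misspelt "skateboarig" is two edits away from "skateboard" (one substitution, one deletion), so it still matches, while unrelated terms are rejected.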
WHAT CONSUMES YOUR DATA?
END USER: What is the average age of …?
WHAT CONSUMES YOUR DATA?
END USER: Er… I think it was something like “Campbell”?
REMEMBER THAT
OUR CHOICE IS INFORMED BY OUR PLANS FOR THE APPLICATION
ELASTICSEARCH: THE CONS
• It only does one thing (even if it does it well)
ELASTICSEARCH: THE PROS
• It has a lot of search-related queries built into it: fuzzy, phonetic and sentence matching
• A lot of people use this, so support is mature
• Integration with a large number of other languages and frameworks - this is the industry standard
WHEN IT GOES WRONG
PART 2: EXAMPLES
SQL: THE CONS
• It’s very hard to scale writes
• It has a specific data model - not every data domain fits into it
• e.g. highly relational models, schemalessness
• Domain non-specific query languages
SQL: THE PROS
• If a library exists for anything, it exists for SQL
• ACID transactions make everything easy
• Constraints and Schemas allow for automated data integrity checking
• Easy normalisation of data
Part 2 done
What shape is your data?
Are you happy to pay?
What uses your data?
• Some sites are happy to sacrifice consistency for availability - Dynamo is a design that databases can follow to achieve that
• If you’ll be doing lots of joins, graph databases such as Neo4j improve performance
• Sometimes you want the flexibility to store any objects - there is a range of schemaless databases available
• Consider what will retrieve your data, and ensure you have a database efficient for your use case.
ANY QUESTIONS?
David Simons
@SwamWithTurtles
