NoSQL seminar

Agenda
 Introduction to NOSQL
 Objective
 Examples of NOSQL databases
 NOSQL vs SQL
 Conclusion

Basic Concepts

 Database – an organized collection of data.
 Database Management System (DBMS) – a software
package of computer programs that controls the
creation, maintenance & use of a database.
 We interact with a DBMS using a structured language.
 Ex. Oracle, IBM DB2, MS Access, MySQL, FoxPro, etc.
 Relational DBMS - A relational database is a
collection of data items organized as a set of formally
described tables from which data can be accessed easily.
A relational database is created using the relational
model. The software used in a relational database is
called a relational database management
system (RDBMS).

SQL

 Structured Query Language
 Special-purpose programming language designed for
managing data in an RDBMS.
 Originally based on relational algebra & tuple relational
calculus.
 SQL’s scope includes data insert, update & delete, schema
creation and modification, and data access control.
 It is statically and strongly typed.
 The most widely used database language.
 The query is the most important operation in SQL.
 Ex. SELECT *
FROM Book
WHERE price > 100.00
ORDER BY title;
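The query above can be tried without installing a database server. The following is a minimal sketch using Python's built-in sqlite3 module; the Book table layout and the three sample rows are assumptions made up for illustration.

```python
import sqlite3

# In-memory database; the Book table and its rows are made-up sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Book (title TEXT, price REAL)")
conn.executemany(
    "INSERT INTO Book (title, price) VALUES (?, ?)",
    [
        ("SQL Basics", 120.0),
        ("Cheap Reads", 45.5),
        ("Advanced Databases", 310.0),
    ],
)

# The query from the slide: books priced above 100.00, sorted by title.
rows = conn.execute(
    "SELECT * FROM Book WHERE price > 100.00 ORDER BY title"
).fetchall()

for title, price in rows:
    print(title, price)
```

Only the two books priced above 100.00 are returned, in alphabetical order by title.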

NOSQL

 Stands for Not Only SQL
 Class of non-relational data storage systems
 Usually do not require a fixed table schema nor do
they use the concept of joins
 All NOSQL offerings relax one or more of the ACID
properties.
 Atomicity, Consistency, Isolation, Durability (ACID)
 “NOSQL” = “Not Only SQL” =
Not Only using traditional relational DBMS

NOSQL

• Alternative to traditional relational DBMS
• Flexible schema
• Quicker/cheaper to set up
• Massive scalability
• Relaxed consistency → higher performance &
availability

* No declarative query language → more programming
* Relaxed consistency → fewer guarantees

Why NOSQL?

 Not every problem can be solved by a traditional
relational database system alone.
 Handles huge databases.
 Redundancy: data is fairly safe even on commodity
hardware
 Super flexible queries using map/reduce
 Rapid development (no fixed schema, yeah!)
 Very fast for common use cases

Contd..

 Inspired by Distributed Data Storage problems
 Scale easily by adding servers
 Not suited to all problem types, but super-suited to
certain large problem types
 High-write situations (e.g. activity tracking or timeline
rendering for millions of users)
 Many uses of relational databases are very simple (e.g.
fetch by primary key, then update)

How does it work?

 Clients know:
How to send items to servers (consistent hashing)
What to do when a server fails
How to fetch keys from servers
How to weight servers by capacity

 Servers know how to:
Store items they receive
Expire them from the cache
No inter-server communication: servers are unaware of each other
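The client-side routing described above can be sketched as a consistent hash ring. This is an illustrative sketch only, not the code of any particular client library: the HashRing class, its method names, the replicas count, and the use of SHA-256 are all assumptions.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string to a large integer position on the ring."""
    return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Hypothetical client-side ring: servers never talk to each other."""

    def __init__(self, servers, replicas=100):
        self.replicas = replicas  # virtual points per unit of weight
        self._ring = []           # sorted list of (point, server)
        for server in servers:
            self.add(server)

    def add(self, server, weight=1):
        # A heavier server gets more points, so it receives more keys.
        for i in range(self.replicas * weight):
            bisect.insort(self._ring, (_hash(f"{server}#{i}"), server))

    def remove(self, server):
        # Called by the client when it decides a server has failed.
        self._ring = [(p, s) for p, s in self._ring if s != server]

    def server_for(self, key):
        if not self._ring:
            raise LookupError("no servers available")
        # First point clockwise from the key's position, wrapping around.
        idx = bisect.bisect_left(self._ring, (_hash(key), ""))
        if idx == len(self._ring):
            idx = 0
        return self._ring[idx][1]


ring = HashRing(["cache-a", "cache-b", "cache-c"])
print(ring.server_for("user:42"))
ring.remove("cache-c")  # only keys that lived on cache-c move elsewhere
print(ring.server_for("user:42"))
```

The point of the design is that removing one server only remaps the keys that server held; keys on the remaining servers stay where they were.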

Performance

 An RDBMS uses buffers to ensure ACID properties
 NoSQL does not guarantee ACID and can therefore be
much faster
 We don’t need ACID everywhere!
 Ex. Data processing (every minute) is 4x faster with
MongoDB, despite being a lot more detailed (due to
much simpler development)

Why is NOSQL faster than SQL? - Scaling

 Simple web application with not much traffic
 Application server, database server all on one machine

Scaling contd..

 More traffic comes in
 Application server

 Database server

 Even more traffic comes in
 Load balancer

 Application server x2

 Database server

Scaling contd..

 Even more traffic comes in
 Load balancer x N
 easy
 Application server x N
 easy
 Database server xN
 hard for SQL databases

SQL Slowdown

 Not linear!

Scaling contd..

 NoSQL Scaling -
 Need more storage?
 Add more servers!

 Need higher performance?

 Need better reliability?

Scaling Summary

 You can scale SQL databases (Oracle, MySQL, SQL
Server…)
 This will cost you dearly
 If you don’t have a lot of money, you will reach limits quickly
 You can scale NoSQL databases
 Very easy horizontal scaling

 Lots of open-source solutions

 Scaling is one of the basic design goals, so it is well
handled
 Scaling forces trade-offs, such as having to use
map/reduce

Characteristics

 Almost infinite horizontal scaling
 Very fast
 Performance doesn’t deteriorate with growth (much)
 No fixed table schemas
 No join operations
 Ad-hoc queries difficult or impossible
 Structured storage
 Almost everything happens in RAM

NOSQL Types

 Wide Column Store / Column Families
 Document Store
 Key Value / Tuple Store
 Graph Databases
 Object Databases
 XML Databases
 Multivalue Databases

Main types -

 Key-Value Stores
 Map Reduce Framework
 Document Databases
 Graph Databases

Key Value Stores

 Lineage: Amazon's Dynamo paper and distributed
hash tables.
 Data model: A global collection of key-value pairs
 Example systems
 Google BigTable, Amazon Dynamo, Cassandra,
Voldemort, HBase, …
 Implementation: efficiency, scalability, fault-tolerance
 Records distributed to nodes based on key
 Replication

 Single-record transactions, “eventual consistency”

Document Databases

 Lineage: Inspired by Lotus Notes.
 Data model: Collections of documents, which
contain key-value collections (called "documents").
 Example: CouchDB, MongoDB, Riak

Graph Database

 Lineage: Draws from Euler and graph theory.
 Data model: Nodes & relationships, both of which can
hold key-value pairs
 Example: AllegroGraph, InfoGrid, Neo4j
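The data model above can be sketched with plain Python structures. This is not how AllegroGraph, InfoGrid, or Neo4j store data; the node names, the KNOWS relationship type, and the helper functions are hypothetical.

```python
from collections import deque

# Nodes and relationships both carry key-value properties (sample data).
nodes = {
    "alice": {"kind": "person", "age": 30},
    "bob": {"kind": "person", "age": 25},
    "carol": {"kind": "person", "age": 35},
}
relationships = [
    ("alice", "bob", {"type": "KNOWS", "since": 2009}),
    ("bob", "carol", {"type": "KNOWS", "since": 2011}),
]


def neighbours(node, rel_type):
    """Nodes reached from `node` by one outgoing relationship of rel_type."""
    return [
        dst
        for src, dst, props in relationships
        if src == node and props["type"] == rel_type
    ]


def reachable(start, rel_type):
    """Breadth-first traversal: everything reachable via rel_type edges."""
    seen, queue, order = {start}, deque([start]), []
    while queue:
        current = queue.popleft()
        for nxt in neighbours(current, rel_type):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


print(reachable("alice", "KNOWS"))  # ['bob', 'carol']
```

A traversal like this replaces what would be repeated self-joins in a relational schema.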

Map Reduce Framework

 Google’s framework for processing highly
distributable problems across huge datasets
using a large number of computers
 Let’s define “large number of computers”
 Cluster: all of them have the same hardware
 Grid: anything that is not a cluster (if !Cluster, for old-style programmers)
 Process split into two phases
 Map
 Take the input, partition it, and delegate the parts to other machines
 Other machines can repeat the process, leading to a tree structure
 Each machine returns results to the machine that gave it the task

Map Reduce Framework contd..

 Reduce
 Collect results from the machines you gave tasks to
 Combine the results and return them to the requester

 Slower than sequential data processing, but massively parallel
 Can sort a petabyte of data in a few hours
 Input, Map, Shuffle, Reduce, Output
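The Input, Map, Shuffle, Reduce, Output pipeline can be sketched on a single machine with a word count. This is a toy illustration, not Google's or Hadoop's implementation; the function names and sample documents are assumptions, and a real framework would run the map and reduce calls on different machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def map_reduce(documents):
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = map_reduce(["NoSQL scales out", "SQL scales up", "nosql and sql"])
print(counts)
```

Because each map call sees only one document and each reduce call sees only one key, both phases can be spread across many machines.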

Popular NoSQL

 Hadoop / HBase
 Cassandra
 Amazon SimpleDB
 MongoDB
 CouchDB
 Redis
 MemcacheDB
 Voldemort
 Hypertable
 Cloudata
 IBM Lotus/Domino

Real World Use

 Cassandra
 Facebook (original developer, used it till late 2010)
 Twitter
 Digg
 Reddit
 Rackspace
 Cisco

 BigTable
 Google (HBase is an open-source implementation modeled on it)

 MongoDB
 Foursquare
 Craigslist
 Bit.ly
 SourceForge
 GitHub

MONGODB

 Document store
 Basic support for dynamic (ad hoc) queries
 Query by example (nice!)

 Conditional Operators
 <, <=, >, >=
 $all, $exists, $mod, $ne, $in, $nin, $nor, $or, $and, $size, $type
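Query by example means the query is itself a document describing the fields to match. The sketch below is a toy matcher written in plain Python to show the idea; it is not MongoDB's implementation and does not use a MongoDB driver. The matches and find helpers and the sample collection are hypothetical, only a few of the operators are covered, and the comparison operators <, <=, >, >= are written in their MongoDB spellings $lt, $lte, $gt, $gte.

```python
_MISSING = object()  # sentinel for "field not present in the document"

# A small subset of the operators, implemented for illustration only.
OPERATORS = {
    "$gt": lambda v, a: v is not _MISSING and v > a,
    "$gte": lambda v, a: v is not _MISSING and v >= a,
    "$lt": lambda v, a: v is not _MISSING and v < a,
    "$lte": lambda v, a: v is not _MISSING and v <= a,
    "$ne": lambda v, a: v != a,
    "$in": lambda v, a: v in a,
    "$nin": lambda v, a: v not in a,
    "$exists": lambda v, a: (v is not _MISSING) == bool(a),
}


def matches(doc, query):
    """True if every field condition in the example query holds for doc."""
    for field, condition in query.items():
        value = doc.get(field, _MISSING)
        if isinstance(condition, dict):
            # Operator form, e.g. {"price": {"$gt": 100}}
            if not all(OPERATORS[op](value, arg) for op, arg in condition.items()):
                return False
        elif value != condition:
            # Plain form, e.g. {"title": "B"} means equality
            return False
    return True


def find(collection, query):
    return [doc for doc in collection if matches(doc, query)]


# Documents need not share a schema: "B" has no tags field.
books = [
    {"title": "A", "price": 120, "tags": ["db"]},
    {"title": "B", "price": 45},
    {"title": "C", "price": 310, "tags": ["db", "nosql"]},
]

print([b["title"] for b in find(books, {"price": {"$gt": 100}})])  # ['A', 'C']
print([b["title"] for b in find(books, {"tags": {"$exists": False}})])  # ['B']
```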

MONGODB

 Data is stored as BSON (binary JSON)
 Makes it very well suited for languages with native JSON support
 Map/Reduce written in JavaScript
 Slow! There is one single thread of execution in JavaScript
 Master/slave replication (auto failover with replica sets)
 Sharding built-in
 Uses memory mapped files for data storage
 Performance over features
 On 32-bit systems, limited to ~2.5 GB
 An empty database takes up 192 MB
 GridFS to store big data + metadata (not actually an FS)

CASSANDRA

 Written in: Java
 Protocol: Custom, binary (Thrift)
 Tunable trade-offs for distribution and replication
(N, R, W)
 Querying by column, range of keys
 BigTable-like features: columns, column families
 Writes are much faster than reads (!)
 Constant write time regardless of database size
 Map/reduce possible with Apache Hadoop
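The (N, R, W) trade-off can be sketched with the usual Dynamo-style reading, which is an assumption here since the slide does not define the letters: N is the number of replicas of each item, R the number of replicas that must answer a read, and W the number that must acknowledge a write. The function names below are hypothetical.

```python
def reads_see_latest_write(n, r, w):
    """True when every read set overlaps every write set (R + W > N)."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("need 1 <= R <= N and 1 <= W <= N")
    return r + w > n


def quorum(n):
    """Smallest majority of n replicas."""
    return n // 2 + 1


n = 3
print(reads_see_latest_write(n, quorum(n), quorum(n)))  # True: 2 + 2 > 3
print(reads_see_latest_write(n, 1, 1))                  # False: fast, but eventual
print(reads_see_latest_write(n, 1, n))                  # True: slow writes, fast reads
```

Lowering R and W buys latency and availability at the cost of possibly reading stale data; raising them does the reverse.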

Some more info about Cassandra at Facebook

 Cassandra is an open-source DBMS from the Apache
Software Foundation.
 Cassandra provides a structured key-value
store with tunable consistency
 Cassandra is a distributed storage system for
managing structured data that is designed to scale to
a very large size across many commodity
servers, with no single point of failure
 It is a NoSQL solution that was initially developed
by Facebook and powered their Inbox Search feature
until late 2010

HBASE

 Written in: Java
 Main point: Billions of rows X millions of columns
 Modeled after BigTable
 Map/reduce with Hadoop
 Query predicate push down via server side scan and get filters
 Optimizations for real time queries
 A high performance Thrift gateway
 HTTP supports XML, Protobuf, and binary
 Cascading, Hive, and Pig source and sink modules
 No single point of failure
 While Hadoop streams data efficiently, it has overhead for
starting map/reduce jobs. HBase is a column-oriented
key/value store that allows low-latency reads and writes.
 Random access performance is comparable to MySQL

COUCHDB

 Written in: Erlang
 Main point: DB consistency, ease of use
 Bi-directional (!) replication, continuous or ad-hoc, with conflict
detection, and thus master-master replication (!)
 MVCC - write operations do not block reads
 Previous versions of documents are available
 Crash-only (reliable) design
 Needs compacting from time to time
 Views: embedded map/reduce
 Formatting views: lists & shows
 Server-side document validation possible
 Authentication possible
 Real-time updates via _changes (!)
 Attachment handling
 CouchApps (standalone JS apps)
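The MVCC and conflict-detection points can be sketched as a store in which every update must name the revision it was based on. This is a simplified illustration, not CouchDB's code or HTTP API: the RevisionStore class and ConflictError are hypothetical, and plain integers stand in for CouchDB's revision identifiers.

```python
class ConflictError(Exception):
    """Raised when an update is based on a stale revision."""


class RevisionStore:
    def __init__(self):
        self._docs = {}  # doc_id -> list of (rev, body), oldest first

    def get(self, doc_id):
        """Return the latest (rev, body); readers are never blocked."""
        rev, body = self._docs[doc_id][-1]
        return rev, dict(body)

    def revisions(self, doc_id):
        """Previous versions stay available until compaction."""
        return [rev for rev, _ in self._docs.get(doc_id, [])]

    def put(self, doc_id, body, rev=None):
        history = self._docs.get(doc_id, [])
        current = history[-1][0] if history else None
        if rev != current:
            raise ConflictError(f"{doc_id}: expected rev {current}, got {rev}")
        new_rev = 1 if current is None else current + 1
        # Append a new version instead of overwriting the old one.
        self._docs[doc_id] = history + [(new_rev, dict(body))]
        return new_rev


store = RevisionStore()
rev1 = store.put("note", {"text": "draft"})
rev2 = store.put("note", {"text": "final"}, rev=rev1)
try:
    store.put("note", {"text": "late edit"}, rev=rev1)  # based on stale rev
except ConflictError as err:
    print("conflict detected:", err)
```

Because old versions pile up in the history, a store like this needs the periodic compaction mentioned above.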

HADOOP

 Apache project
 A framework that allows for the distributed processing of
large data sets across clusters of computers
 Designed to scale up from single servers to thousands of
machines
 Designed to detect and handle failures at the application
layer, instead of relying on hardware for it
 Created by Doug Cutting, who named it after his son's toy
elephant
 Hadoop subprojects and related Apache projects
 Cassandra
 HBase
 Pig
 Hive was a Hadoop subproject, but is now a top-level Apache project

HADOOP contd..

 Scales to hundreds or thousands of computers, each with several
processor cores
 Designed to efficiently distribute large amounts of work across a
set of machines
 Hundreds of gigabytes of data constitute the low end of Hadoop-
scale
 Built to process "web-scale" data on the order of hundreds of
gigabytes to terabytes or petabytes
 Uses Java, but allows streaming so other languages can easily
send and accept data items to/from Hadoop

HADOOP contd..

 Uses distributed file system (HDFS)
 Designed to hold very large amounts of data (terabytes or even
petabytes)
 Files are stored in a redundant fashion across multiple
machines to ensure their durability against failure and high
availability to very parallel applications
 Data organized into directories and files

 Files are divided into blocks (64 MB by default) and distributed
across nodes
 Design of HDFS is based on the design of the Google
File System
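The block arithmetic above can be sketched as follows. The 64 MB block size comes from the slide; the replication factor of 3 and the round-robin placement are assumptions for illustration and do not reproduce HDFS's actual placement policy. All function and node names are hypothetical.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the default quoted on the slide


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) for each block of a file of file_size bytes."""
    blocks, offset = [], 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Simplified stand-in for a real placement policy (assumption).
    """
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication")
    return {
        block: [nodes[(block + i) % len(nodes)] for i in range(replication)]
        for block in range(num_blocks)
    }


file_size = 200 * 1024 * 1024  # a 200 MB file
blocks = split_into_blocks(file_size)
print(len(blocks))  # 4: three full 64 MB blocks plus one 8 MB block

placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"])
print(placement[0])  # ['node1', 'node2', 'node3']
```

With every block stored on several machines, losing one node leaves at least two copies of each block it held.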

HIVE

 A petabyte-scale data warehouse system for Hadoop
 Easy data summarization, ad-hoc queries
 Query the data using a SQL-like language called
HiveQL
 Hive compiler generates map-reduce jobs for most
queries

Conclusion

 NoSQL is a great problem solver if you need it
 Choose your NoSQL platform carefully, as each is
designed for a specific purpose
 Get used to Map/Reduce
 It’s not a sin to use NoSQL alongside a (yes)SQL
database

