Nick Dimiduk - @xefyr
Founder, Drawn to Scale
nick@drawntoscalehq.com

April 28, 2010
Agenda

 • what NoSQL is not
 • motivation
 • Hadoop
 • HBase
whoami
Computer Science & Engineering at Ohio State: Artificial Intelligence, Programming Languages, Systems Engineering
Applied Technical Systems: Hierarchical, non-relational data storage and analysis systems (no-sql before there was NoSQL). Information Retrieval, Wire Serialization/RPC (before there was Thrift/Avro), Data Visualization (GB's)
Visible Technologies: Social Media Storage, Processing, Analytics. Monitoring, Engagement, Warehousing, and BI. (TB's)
Drawn to Scale: Big Data Storage, Processing, Retrieval, Analytics (TB's, PB's)
What NoSQL is not.

movement - no ANSI NoSQL-2010
one-size-fits-all - it’s about choice
silver bullet - guarantees are hard

It’s not Anti-RDBMS
It’s about Choice!
   http://www.flickr.com/photos/zakh/337938459/
motivation
more, More, MORE Data!
ACID Burns
BASE is good enough
Life’s too short
“typical” application
[Diagram: Data Server, App Server, Village People]
growing pains
[Diagram: Data Server, App Servers, Villages of People]
vertical partitioning
[Diagram: several separate stacks, each with its own Data Server, App Servers, and Villages of People]
“typical” application
growing pains
[Diagram: Data Servers, App Servers, Villages of People]
horizontal partitioning
[Diagram: Data Layer, Application Layer, Villages of People]
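One way to picture horizontal partitioning: every record is routed to a shard by a function of its key, so the data layer can grow by adding shards while the application layer talks to it as a single store. The sketch below is only an illustration under that assumption. The names `shard_for` and `PartitionedStore` are hypothetical, and the hash-modulo routing is one simple choice among several (range boundaries would be another).

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a stable hash (hypothetical helper)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class PartitionedStore:
    """Toy data layer: each shard is a plain dict standing in for a data server."""

    def __init__(self, num_shards: int):
        self.shards = [dict() for _ in range(num_shards)]

    def put(self, key: str, value) -> None:
        # The caller never names a shard; routing is derived from the key.
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key: str):
        return self.shards[shard_for(key, len(self.shards))].get(key)


store = PartitionedStore(num_shards=4)
store.put("alice", {"village": "north"})
store.put("bob", {"village": "south"})
print(store.get("alice"))
```

Because routing depends only on the key, two users on different shards still live in the same logical store, which is the property the vertical split lacked.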
Hadoop
“open source, reliable, distributed computing”
MapReduce - API for parallel computing
HDFS - distributed, replicated file system
ZooKeeper - distributed synchronization
Avro - Data Serialization / RPC
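To make the MapReduce bullet concrete, here is a single-process word count written in the map / shuffle / reduce shape. It is a sketch of the programming model only, not the Hadoop API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for this example, and a real job would run the phases in parallel across machines with input read from HDFS.

```python
from collections import defaultdict


def word_mapper(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def sum_reducer(word, counts):
    """Reduce step: combine all values emitted for one key."""
    return sum(counts)


def map_phase(records, mapper):
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group intermediate values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    return {key: reducer(key, values) for key, values in groups.items()}


def word_count(lines):
    return reduce_phase(shuffle(map_phase(lines, word_mapper)), sum_reducer)


print(word_count(["big data", "Big table"]))
```

The user writes only the mapper and the reducer; grouping by key and distributing the work is the framework's job, which is what makes the model approachable.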
HBase
structured, distributed database for your horizontally scalable FS
random access
real-time reads/writes
simple API
big table
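The "big table" bullet points at the data model from the Google BigTable paper (see references): a sparse, sorted map indexed by row key, column, and timestamp. A toy in-memory version, assuming nothing about the real HBase client API, might look like the following. `TinyTable` and its `put`/`get`/`scan` methods are hypothetical names chosen for this sketch.

```python
import bisect
import time


class TinyTable:
    """Toy sorted map: row key -> column -> list of (timestamp, value) versions."""

    def __init__(self):
        self._rows = {}         # row key -> {column: [(timestamp, value), ...]}
        self._sorted_keys = []  # row keys kept sorted so range scans are cheap

    def put(self, row, column, value, timestamp=None):
        ts = time.time() if timestamp is None else timestamp
        if row not in self._rows:
            bisect.insort(self._sorted_keys, row)
            self._rows[row] = {}
        versions = self._rows[row].setdefault(column, [])
        versions.append((ts, value))
        # Newest version first, so reads return the latest write.
        versions.sort(key=lambda tv: tv[0], reverse=True)

    def get(self, row, column):
        """Random access by key: return the newest value, or None if absent."""
        versions = self._rows.get(row, {}).get(column)
        return versions[0][1] if versions else None

    def scan(self, start_row, stop_row):
        """Yield rows with start_row <= key < stop_row, in key order."""
        lo = bisect.bisect_left(self._sorted_keys, start_row)
        hi = bisect.bisect_left(self._sorted_keys, stop_row)
        for row in self._sorted_keys[lo:hi]:
            yield row, {col: vers[0][1] for col, vers in self._rows[row].items()}


table = TinyTable()
table.put("user1", "info:name", "Alice", timestamp=2)
table.put("user2", "info:name", "Bob", timestamp=1)
print(table.get("user1", "info:name"))
print(list(table.scan("user1", "user3")))
```

Rows that lack a column simply store nothing for it, which is why such a table can be very wide without paying for empty cells.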
references
http://www.nosql-database.org
Eventually Consistent: http://www.allthingsdistributed.com/2007/12/eventually_consistent.html
Soft State: http://mercury.lcs.mit.edu/~jnc/tech/hard_soft.html
Accuracy and Precision: http://en.wikipedia.org/wiki/Accuracy_and_precision
Compare and Swap: http://en.wikipedia.org/wiki/Compare-and-swap
Apache Hadoop: http://hadoop.apache.org
Google MapReduce: http://labs.google.com/papers/mapreduce.html
Google FS: http://labs.google.com/papers/gfs.html
Apache Thrift: http://incubator.apache.org/thrift/
Protobuf: http://code.google.com/p/protobuf/
Google BigTable: http://labs.google.com/papers/bigtable.html
Google Chubby: http://labs.google.com/papers/chubby.html
Questions?





Introduction to Hadoop, HBase, and NoSQL

Editor's Notes

  • #4: I’m Not an RDBMS Guy!
  • #5: squish the FUD
  • #6: no central point of organization; no committee or standardizing body; no plan/strategy/illuminati to take down the RDBMS; lots of "in-fighting"
  • #7: central tenet - there IS NO one-size-fits-all. Unlike RDBMS assumptions, each engineering effort must be evaluated for its data needs
  • #8: is it “anti-RDBMS”?
  • #9: not so much
  • #11: will not magically solve all your data or performance problems; applications won’t magically stop crashing, corrupting data, etc. Big Data is still hard. These tools make it possible/affordable/approachable
  • #12: data persistence comes down to guarantees
  • #13: why are we here?
  • #14: "web scale": more users, content, connections; more trends, insight, knowledge
  • #15: Atomicity: fault-tolerance is moving to the application layer - smaller atomic units. Consistency: yes! but not necessarily immediate - "availability" (latency, reads) is more important. Isolation: smaller atomic units (multi-step transaction vs. compare-and-swap), greater availability, denormalization => reduced dependency on isolation. Durability: some things are more important than getting every last detail, i.e. latency of response, view in aggregate
  • #16: Basically Available: is the data layer up or not? are we serving content to our users or not? Soft State: shifting burden of "correctness" up to application layer. availability is more important than precision. accuracy (correct) vs. precision (repeatable). Eventual Consistency: all operations are recorded and ordered, then played back as resources permit.
  • #17: agile dev moves too fast for schema and constraints - this isn’t waterfall. data models change quickly. up-front schema modeling is akin to waterfall development - not always practical/feasible/possible. data is messy - record what you have and leave constraints up to the application
  • #18: at scale, data services look like a DHT anyway! isolated independent services; introduced caching layers; partitioned data by logical and range boundaries.
  • #19: webapp
  • #21: app servers/session self-contained - load-balanced; data’s in one spot - what do you do?
  • #22: 37-signals approach - DHH “scaling is a good thing because scaling => users => $$$”
  • #23: more users, more instances. easy!
  • #24: doesn’t work for social applications: - users cannot interact - old MMO’s vs. new social games
  • #26: redesign data server as “data services”: separate, independent logical components
  • #27: knowing each service by name becomes “vexing”
  • #28: configuration/logistical nightmare!
  • #29: abstractions! wouldn’t it be nice if...
  • #31: Distributed Computing Made Easy Less Hard
  • #33: programming model/API for parallel computing. Google's MapReduce paper
  • #34: replicated, high throughput, fairly UNIX-y (not POSIX). Google FS Paper
  • #35: Distributed Group Services - coordination, synchronization, configuration, naming. Google Chubby Paper
  • #36: efficient, cross-language messaging. Facebook/Apache Thrift, Google Protobufs
  • #38: Google BigTable
  • #39: Addresses limitations of Raw M/R, HDFS access
  • #40: request by key vs. HDFS sequential reads
  • #41: low-latency, ms response times vs. m/r high-latency
  • #42: row/column concepts; DHT semantics; Java, REST, Thrift
  • #43: Billions of rows, millions of columns
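
Note #15 contrasts multi-step transactions with compare-and-swap as a smaller atomic unit. A rough sketch of the idea follows, assuming a single cell whose update is atomic. `CasCell` and `increment` are hypothetical names, and the lock merely stands in for whatever atomic primitive a real store provides.

```python
import threading


class CasCell:
    """One value with an atomic compare-and-swap (toy stand-in for a data store cell)."""

    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()  # models the store's atomic primitive

    def read(self):
        with self._lock:
            return self._value

    def compare_and_swap(self, expected, new):
        """Write `new` only if the cell still holds `expected`; report success."""
        with self._lock:
            if self._value != expected:
                return False
            self._value = new
            return True


def increment(cell):
    """Application-level retry loop: no transaction, just read, compute, CAS."""
    while True:
        current = cell.read()
        if cell.compare_and_swap(current, current + 1):
            return current + 1


cell = CasCell(0)
workers = [
    threading.Thread(target=lambda: [increment(cell) for _ in range(100)])
    for _ in range(8)
]
for w in workers:
    w.start()
for w in workers:
    w.join()
print(cell.read())
```

The retry loop lives in the application, which matches the note's point that fault-tolerance and correctness are moving up from the data layer.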