Brisk: Truly peer-to-peer Hadoop
High-order bits from Cassandra & Hadoop


srisatish ambati
@srisatish
How many in audience…
NoSQL -
Know your queries.
points
• Usecases
• Why Cassandra?
• Usecase: Hadoop, Brisk
• FUD: Consistency
    – Why is Facebook not using Cassandra?
• Anti-patterns
• Community, Code, Tools
• Q&A
Users. Netflix.
Key by Customer, read-heavy
Key by Customer:Movie, write-heavy
TimeSeries: (several customers)
periodic readings: dev0, dev1…
deviceID:metric:timestamp -> value

Metrics are typically a much larger dataset than users.
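The `deviceID:metric:timestamp -> value` layout can be sketched in plain Java. This is only an illustration, not Cassandra client code: the class name `MetricKey`, the zero-padded timestamp width, and the in-memory `TreeMap` standing in for a sorted column family are all assumptions made for the example.

```java
import java.util.Locale;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical helper: builds "deviceID:metric:timestamp" keys as on the slide.
final class MetricKey {
    // Zero-pad the timestamp (assumed width: 13 digits) so that
    // lexicographic key order matches time order.
    static String of(String deviceId, String metric, long timestamp) {
        return String.format(Locale.ROOT, "%s:%s:%013d", deviceId, metric, timestamp);
    }

    // Returns every reading for one device and metric, oldest first.
    static SortedMap<String, Double> readings(TreeMap<String, Double> store,
                                              String deviceId, String metric) {
        String from = deviceId + ":" + metric + ":";
        // ';' is the character right after ':', so [from, to) covers the whole prefix.
        String to = deviceId + ":" + metric + ";";
        return store.subMap(from, to);
    }

    public static void main(String[] args) {
        TreeMap<String, Double> store = new TreeMap<>();
        store.put(of("dev0", "temp", 1300000060L), 21.5);
        store.put(of("dev0", "temp", 1300000000L), 21.0);
        store.put(of("dev1", "temp", 1300000000L), 19.0);
        // Prints the two dev0 readings in time order; dev1 is not included.
        for (Map.Entry<String, Double> e : readings(store, "dev0", "temp").entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

Putting the timestamp last keeps all readings of one device and metric adjacent, so a time-range query becomes a single contiguous scan.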
Why Cassandra?
Operational simplicity
peer-to-peer
[Figures: write, read]
Replication:
Multi-datacenter
Multi-region ec2, aws
Multi-availability zones
reads local
[Figure: dc1, dc2]
4.21.2011, Amazon Web Services outage:




“Movie marathons on Netflix awaiting AWS to
come back up.” #ec2disabled
Netflix was running on AWS.
fast durable writes.
fast reads.
Writes
Sequential, append-only.
~1-5ms

On cloud: ephemeral disks rock!
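A toy sketch of why sequential, append-only writes are cheap: each write is one append to a log plus an in-memory update, and nothing on disk is rewritten in place. The class name `AppendOnlyStore`, the tab-separated record format, and the temp-file log are assumptions for illustration; this is not Cassandra's actual commit log or memtable code.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.HashMap;
import java.util.Map;

// Toy model of an append-only write path (hypothetical; not Cassandra's code).
final class AppendOnlyStore {
    private final Path log;                                       // stand-in for a commit log
    private final Map<String, String> memtable = new HashMap<>(); // stand-in for a memtable

    AppendOnlyStore(Path log) {
        this.log = log;
    }

    // Convenience factory for the demo: the log lives in a temp file.
    static AppendOnlyStore onTempFile() {
        try {
            return new AppendOnlyStore(Files.createTempFile("commitlog", ".txt"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // One sequential append plus an in-memory update.
    // Assumes keys and values contain no tab or newline; a real log would also sync to disk.
    void put(String key, String value) {
        byte[] record = (key + "\t" + value + "\n").getBytes(StandardCharsets.UTF_8);
        try {
            Files.write(log, record, StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        memtable.put(key, value);
    }

    String get(String key) {
        return memtable.get(key);
    }

    // Rebuilds state by replaying the log in order; the last record for a key wins.
    Map<String, String> replay() {
        Map<String, String> rebuilt = new HashMap<>();
        try {
            for (String line : Files.readAllLines(log, StandardCharsets.UTF_8)) {
                int tab = line.indexOf('\t');
                rebuilt.put(line.substring(0, tab), line.substring(tab + 1));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return rebuilt;
    }

    public static void main(String[] args) {
        AppendOnlyStore store = onTempFile();
        store.put("customer:42", "v1");
        store.put("customer:42", "v2"); // an update is just another append
        System.out.println(store.replay().get("customer:42"));
    }
}
```

Because an update never seeks back to an old record, the disk only ever sees sequential writes; the cost is that superseded records pile up until something like compaction cleans them out.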
Reads
Local
Key & row caches, (also, jna-based 0xffheap)
indexes, materialized

ssds: improved read performance!
amortize
Replication over writes
Repair over reads
Distribution between nodes
Gossip
Anti-entropy
Failure-detector


Lightweight
Clients: cql, thrift
pycassa, phpcassa
hector, pelops
(scala, ruby, clojure)
Usecase #3: hadoop
Hdfs  cassandra  hive
Logs     stats     analytics
Brisk
Truly peer-to-peer hadoop.
mv computation
not data
Java one2011 brisk-and_high_order_bits_from_cassandra_and_hadoop
map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");


reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));


word count in MapReduce
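The word-count pseudocode can be mirrored in plain Java to make the map, shuffle, and reduce steps concrete. This sketch runs in a single JVM and does not use the Hadoop API; the class `WordCountSketch` and its method names are made up for illustration, and splitting on whitespace is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Single-process imitation of the map / shuffle / reduce phases (illustrative only).
final class WordCountSketch {
    // map: emit (word, "1") for every word in the document contents.
    static List<String[]> map(String documentName, String contents) {
        List<String[]> emitted = new ArrayList<>();
        for (String w : contents.trim().split("\\s+")) {
            if (!w.isEmpty()) {
                emitted.add(new String[] {w, "1"});
            }
        }
        return emitted;
    }

    // reduce: sum the list of counts for one word.
    static String reduce(String word, List<String> counts) {
        int result = 0;
        for (String v : counts) {
            result += Integer.parseInt(v);
        }
        return Integer.toString(result);
    }

    // Driver: map every document, group values by key (the shuffle), reduce each group.
    static Map<String, String> run(Map<String, String> documents) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (Map.Entry<String, String> doc : documents.entrySet()) {
            for (String[] kv : map(doc.getKey(), doc.getValue())) {
                grouped.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv[1]);
            }
        }
        Map<String, String> result = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new TreeMap<>();
        docs.put("a.txt", "brisk cassandra hadoop");
        docs.put("b.txt", "cassandra hadoop cassandra");
        System.out.println(run(docs)); // {brisk=1, cassandra=3, hadoop=2}
    }
}
```

In a real cluster the grouping step is the network shuffle between map and reduce tasks; here it is just a sorted map.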
Parallel Execution View
immutable data
write-once-read-many!
Files once created, written & closed..

not changing!
jobtracker, tasktracker
hdfs: namenode, datanode
cloudera
amazon: elastic map reduce
hortonworks
mapR
brisk
Tools & Analytics
Hive, Pig, R
Karmasphere
Datameer
… dozens of stealth startups!
“However, given that there is only a single master, its failure is unlikely;”
The MapReduce paper, 2004. Jeffrey Dean and Sanjay Ghemawat, Google.
Namenode decomposition, explained.
NameNode:
Single Master node
Single Machine Address space
Single Point of failure
Use column families (tables)
inode
sblock
One kind of node
no master node, no spof
peer-to-peer
near-real time hadoop
Low latency: cassandra_dc nodes
Batch Analytics: brisk_dc nodes
BriskSimpleSnitch.java

if (TrackerInitializer.isTrackerNode)
{
    myDC = BRISK_DC;
    logger.info("Detected Hadoop trackers are enabled, setting my DC to " + myDC);
}
else
{
    myDC = CASSANDRA_DC;
    logger.info("Looks like Vanilla Cassandra nodes, setting my DC to " + myDC);
}
Hive: SQL-like access
cli, hwi, jdbc, metastore
Pushdown predicates (v beta2)
hive> CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);

hive> LOAD DATA LOCAL INPATH '$BRISK_HOME/resources/hive/examples/files/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');

hive> SELECT count(*), ds FROM invites GROUP BY ds;



http://www.datastax.com/docs/0.8/brisk/about_hive
ETL
Real-time
Cassandra CFs
DataCenters
Scale
No me in team!
   Ben Coverston         Michael Allen
   Ben Werther           Mike Bulman
   Brandon Williams      Nate McCall
   Cathy Daw             Nick M Bailey
   Jackson Chung         Patricio Echague
   Jake Luciani          Tyler Hobbs
   Joaquin Casares       SriSatish Ambati
   Jonathan Ellis        Yewei Zhang
100-node Brisk Cluster on Opscenter
FUD,
acronym: fear, uncertainty, doubt.
Consistency: R + W > N
ORACLE, 2-node: R=1, W=2, N=2,(T=2)
DNS




* N is replication factor. Not to be confused with T=total #of nodes
Tunable, flexible.
For High Consistency:
  read:quorum, write:quorum
For High Availability:
  high W, low R.
Consistency: R + W > N
ORACLE, 2-node: R=1, W=2, N=2,(T=2)
DNS
"brisk.consistencylevel.read", "QUORUM";
"brisk.consistencylevel.write", "QUORUM";



* N is replication factor. Not to be confused with T=total #of nodes
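The rule R + W > N can be checked mechanically. The sketch below uses a made-up class name, `ConsistencyCheck`, and assumes QUORUM means floor(N/2) + 1 replicas, which is how Cassandra defines it.

```java
// Checks whether a read/write level pair guarantees that reads see the latest write.
final class ConsistencyCheck {
    // Number of replicas in a quorum for replication factor n.
    static int quorum(int n) {
        return n / 2 + 1;
    }

    // True when every read set of r replicas must intersect every write set of w replicas.
    static boolean overlaps(int r, int w, int n) {
        if (r < 1 || w < 1 || r > n || w > n) {
            throw new IllegalArgumentException("need 1 <= r, w <= n");
        }
        return r + w > n;
    }

    public static void main(String[] args) {
        int n = 3; // replication factor, not the total number of nodes
        System.out.println("QUORUM/QUORUM, N=3: " + overlaps(quorum(n), quorum(n), n)); // true
        System.out.println("ONE/ONE, N=3: " + overlaps(1, 1, n));                       // false
        System.out.println("R=1, W=2, N=2: " + overlaps(1, 2, 2));                      // true
    }
}
```

With N=3, QUORUM reads and writes each touch 2 replicas, so any read shares at least one replica with the most recent write; ONE/ONE gives that up in exchange for lower latency and higher availability.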
Inbox Search:
600+ cores, 120+ TB (2008)
Went from 100m to 500m users.



Average NoSQL deployment size: ~6-12 nodes.
Usecase #5: search
Apache Solr + Cassandra = Solandra

Other inbox/file Searches:
  xobni, c3



github.com/tjake/solandra
“Eventual consistency is harder to program.”
mostly immutable data.
complex systems at scale.
Miscellaneous,
Myth: data-loss, partial rows.
writes are durable.
Anti-Patterns
Transactions
Joins
Read before write
Anti-Patterns for cloud
ebs
jvm, virtualized
single region
A few more good reasons for Cassandra...
Tools
AMIs, OpsCenter, DataStax
AppDynamics

Getting Started with brisk ami


Netflix just builds AMIs for deployment!
Beautiful C0de

= new code(); //less is more
~90k.java.concurrent.@annotate.
bloomfilters, merkletrees.
non-blocking, staged-event-driven.
bigtable, dynamo.
Current & Future Focus:
Distributed Counters, CQL.
Simple client.
operational smoothening.
   compaction.
Community
Robust. Rapid. Brisk #
Professional support from DataStax.
git clone git@github.com:riptano/brisk.git

engineers: independent, startups, large companies,
Rackspace, Twitter, Netflix..


Come join the efforts!
Usecase #4: first NoSQL, then scale!
simpledb → Cassandra
mongodb → Cassandra
Copyright: xkcd
Copyright: plantoys
… more than one way to do it!
Summary -
high scale peer-to-peer datastore

best friend for
multi-region, multi-zone availability.

Hadoop – HDFS engulfing the DataWorld

Brisk – best of both worlds!
@srisatish

Q&A
[Figure: Cassandra lineage. Dynamo (2007) + Bigtable (2006); OSS, 2008; Incubator, 2009; TLP, 2010; Cassandra; Brisk]
NoSQL -
Know your queries.
