Brisk: More Powerful Hadoop
    Powered by Cassandra
    jbellis@datastax.com




Monday, July 25, 2011
The evolution of Analytics




                        Analytics + Realtime


The evolution of Analytics




                                    replication




                        Analytics                 Realtime



The evolution of Analytics




                         ETL




Brisk re-unifies realtime and analytics




The Traditional Hadoop Stack

                 Master Nodes:   Name Node, Secondary Name Node, Job Tracker,
                                 HBase Master, ZooKeeper, MetaStore
                 Slave Nodes:    Data Node, Task Tracker, Region Server
                 Client Nodes:   Pig, Hive, Region Server



Brisk Architecture




Brisk Highlights

          ✤    Easy to deploy and operate
          ✤    No single points of failure
          ✤    Scale and change nodes with no downtime
          ✤    Cross-DC, multi-master clusters
          ✤    Allocate resources for OLAP vs OLTP
                ✤       With no ETL




Cassandra data model

          ✤    ColumnFamilies contain rows + columns
          ✤    (Not really schemaless for a while now)


                                  password   name               site
                        zznate    *          Nate McCall
                        driftx    *          Brandon Williams
                        jbellis   *          Jonathan Ellis     datastax.com




Sparse

                        zznate    password   name
                                  *          Nate McCall

                        driftx    password   name
                                  *          Brandon Williams

                        jbellis   password   name             site
                                  *          Jonathan Ellis   datastax.com
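
The sparse layout can be sketched in Python. This is a minimal illustration, assuming each row is modelled as its own dictionary; `users`, `columns_of`, and `get` are hypothetical names, not Cassandra APIs.

```python
# Illustrative model only: a column family as a dict of rows, where each
# row stores just the columns it actually has (no NULL padding).
users = {
    "zznate": {"password": "*", "name": "Nate McCall"},
    "driftx": {"password": "*", "name": "Brandon Williams"},
    "jbellis": {"password": "*", "name": "Jonathan Ellis", "site": "datastax.com"},
}


def columns_of(column_family, row_key):
    """Return the sorted column names present in one row."""
    return sorted(column_family[row_key])


def get(column_family, row_key, column, default=None):
    """Read one column; a column absent from the row yields the default."""
    return column_family.get(row_key, {}).get(column, default)
```

Only the `jbellis` row carries a `site` column; the other rows simply do not store it.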




Rows as containers / materialized views

                        circle1   driftx   thobbs    pcmanus   jbellis   zznate
                        circle2   xedin    mdennis
                        circle3   xedin    pcmanus   ymorishita
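
A rough Python sketch of the row-as-container idea, assuming the member names are stored as column names and kept sorted within the row. The `circles` store and both helpers are hypothetical names for illustration only.

```python
# Illustrative model only: each row is a container whose column names are
# the members; values are unused. Column names are kept sorted within a row.
import bisect

circles = {}


def add_member(row_key, member):
    """Insert a member as a column name, keeping the row sorted and unique."""
    row = circles.setdefault(row_key, [])
    i = bisect.bisect_left(row, member)
    if i == len(row) or row[i] != member:
        row.insert(i, member)


def members(row_key):
    """Read the whole container with a single row lookup."""
    return list(circles.get(row_key, []))


for name in ["driftx", "thobbs", "pcmanus", "jbellis", "zznate"]:
    add_member("circle1", name)
for name in ["xedin", "mdennis"]:
    add_member("circle2", name)
for name in ["xedin", "pcmanus", "ymorishita"]:
    add_member("circle3", name)
```

Reading a circle is one row lookup, which is what makes the row behave like a materialized view of its members.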




CassandraFS

          ✤    Data stored as ByteBuffer internally -- excellent fit for blocks
          ✤    Local reads mmap data directly (no RPC)
          ✤    Blocks are compressed with Google Snappy
          ✤    hadoop distcp hdfs:///mydata cfs:///mydata
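
The block idea can be sketched in Python under loud assumptions: this is not the CassandraFS implementation, `zlib` from the standard library stands in for Snappy, and the block size and function names are hypothetical.

```python
import zlib

BLOCK_SIZE = 4  # hypothetical and tiny, so the example is easy to follow


def to_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Split a byte string into fixed-size blocks and compress each one.

    zlib is a stand-in here; the slide names Snappy as the real codec.
    """
    return [
        zlib.compress(data[i:i + block_size])
        for i in range(0, len(data), block_size)
    ]


def from_blocks(blocks):
    """Decompress the blocks in order and reassemble the original bytes."""
    return b"".join(zlib.decompress(block) for block in blocks)
```

Each compressed block is an opaque byte string, which is why a store that already holds raw bytes suits it.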




Hive support

          ✤    Hive MetaStore in Cassandra
                ✤       Unified schema view from any node, with no external systems
                        and no SPOF
                ✤       Automatically maps Cassandra column families to Hive tables
          ✤    Supports static and dynamic column families (and supercolumns)




Hive: CFS and ColumnFamilies

         -- Hive table stored in CFS
         CREATE TABLE users (name STRING, zip INT);
         LOAD DATA LOCAL INPATH 'kv2.txt' OVERWRITE INTO TABLE users;

         -- Static column family mapped to a Hive table
         CREATE EXTERNAL TABLE Keyspace1.Users (name STRING, zip INT)
         STORED BY
         'org.apache.hadoop.hive.cassandra.CassandraStorageHandler';

         -- Dynamic column family mapped as (row_key, column_name, value) rows
         CREATE EXTERNAL TABLE Keyspace1.Users
         (row_key STRING, column_name STRING, value STRING)
         STORED BY
         'org.apache.hadoop.hive.cassandra.CassandraStorageHandler';




Pig Support

    ✤    With standard Cassandra:
         $ export PIG_HOME=/path/to/pig
         $ export PIG_INITIAL_ADDRESS=localhost
         $ export PIG_RPC_PORT=9160
         $ export PIG_PARTITIONER=org.apache.cassandra.dht.RandomPartitioner
         $ contrib/pig/bin/pig_cassandra
         grunt>

    ✤    With Brisk:
         $ bin/brisk pig
         grunt>


Pig: CFS and ColumnFamilies

         grunt> data = LOAD 'cfs:///example.txt' using PigStorage()
                as (name:chararray, value:long);

         data = LOAD 'cassandra://Demo1/Scores' using CassandraStorage()
                AS (key, columns: {T: tuple(name, value)});

         data = LOAD 'cassandra://Demo1/Scores?slice_start=M&slice_end=S'
                using CassandraStorage()
                AS (key, columns: {T: tuple(name, value)});
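
What the `slice_start` / `slice_end` parameters select can be sketched in Python. This is a minimal model, assuming column names compare as plain strings and both bounds are inclusive; `slice_columns` is a hypothetical helper, not part of CassandraStorage.

```python
import bisect


def slice_columns(row, slice_start, slice_end):
    """Return (name, value) pairs with slice_start <= name <= slice_end.

    `row` is a dict of column name -> value; names are compared as strings,
    and the result is ordered by column name.
    """
    names = sorted(row)
    lo = bisect.bisect_left(names, slice_start)
    hi = bisect.bisect_right(names, slice_end)
    return [(name, row[name]) for name in names[lo:hi]]
```

With bounds `M` and `S`, a column named `Mallory` falls inside the slice while `Alice` and `Trent` fall outside it.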




Data model: Realtime
               LiveStocks
                                     last
                        GOOG         $95.52
                        AAPL         $186.10
                        AMZN         $112.98


               Portfolios
                                     GOOG   LNKD   P    AMZN   AAPL
                        Portfolio1   80     20     40   100    20


               StockHist
                                     2011-01-01   2011-01-02   2011-01-03
                        GOOG         $79.85       $75.23       $82.11



Data model: Analytics
               HistLoss
                                     worst_date    loss
                        Portfolio1   2011-07-23   -$34.81
                        Portfolio2   2011-03-11 -$11432.24
                        Portfolio3   2011-05-21 -$1476.93




Data model: Analytics
               10dayreturns
                   ticker      rdate     return
                   GOOG     2011-07-25   $8.23
                   GOOG     2011-07-24   $6.14
                   GOOG     2011-07-23   $7.78
                   AAPL     2011-07-25   $15.32
                   AAPL     2011-07-24   $12.68


              INSERT OVERWRITE TABLE 10dayreturns
              SELECT a.row_key ticker,
                     b.column_name rdate,
                     b.value - a.value
              FROM StockHist a
              JOIN StockHist b
              ON (a.row_key = b.row_key
                  AND date_add(a.column_name,10) = b.column_name);
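
The self-join in this query can be mirrored in Python. This is a minimal sketch, assuming StockHist is modelled as a dictionary of ticker to dated prices held as plain numbers; `ten_day_returns` and its parameters are hypothetical names.

```python
from datetime import date, timedelta


def ten_day_returns(stock_hist, days=10):
    """Self-join on ticker: pair each date d with d + `days`, emit the delta.

    `stock_hist` maps ticker -> {ISO date string -> price}.
    Returns sorted (ticker, rdate, return) tuples, where rdate is the later
    date of each pair.
    """
    out = []
    for ticker, prices in stock_hist.items():
        for day, start_price in prices.items():
            rdate = (date.fromisoformat(day) + timedelta(days=days)).isoformat()
            if rdate in prices:  # inner join: skip dates with no match
                out.append((ticker, rdate, round(prices[rdate] - start_price, 2)))
    return sorted(out)
```

Dates with no price ten days later drop out, just as unmatched rows drop out of the inner join.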



                           2011-01-01   2011-01-02   2011-01-03
                GOOG       $79.85       $75.23       $82.11




             row_key   column_name   value
             GOOG      2011-01-01    $8.23
             GOOG      2011-01-02    $6.14
             GOOG      2011-01-03    $7.78
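
How a wide row flattens into (row_key, column_name, value) rows can be sketched in Python. The `transpose` helper is a hypothetical illustration, and the sample data reuses the StockHist prices from the realtime data model slide as plain numbers.

```python
def transpose(column_family):
    """Flatten {row_key: {column_name: value}} into sorted 3-tuples."""
    return [
        (row_key, column_name, value)
        for row_key, row in sorted(column_family.items())
        for column_name, value in sorted(row.items())
    ]


# One wide row: the column names are dates, the values are prices.
stock_hist = {
    "GOOG": {"2011-01-01": 79.85, "2011-01-02": 75.23, "2011-01-03": 82.11},
}
```

Each column of the wide row becomes one narrow row, which is the shape the Hive queries join on.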




Data model: Analytics
               portfolio_returns
                    portfolio       rdate      preturn
                    Portfolio1   2011-07-25    $118.21
                    Portfolio1   2011-07-24     $60.78
                    Portfolio1   2011-07-23    -$34.81
                    Portfolio2   2011-07-25   $2143.92
                    Portfolio3   2011-07-24    -$10.19


               INSERT OVERWRITE TABLE portfolio_returns
               SELECT row_key portfolio,
                      rdate,
                      SUM(b.return)
               FROM portfolios a JOIN 10dayreturns b
               ON (a.column_name = b.ticker)
               GROUP BY row_key, rdate;
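
The join and GROUP BY in this query can be mirrored in Python. This is a minimal sketch, assuming portfolios are a dictionary of holdings and returns a list of tuples; `portfolio_returns` and its parameters are hypothetical names. Like the query as written, it sums per-ticker returns without weighting by the number of shares held.

```python
from collections import defaultdict


def portfolio_returns(portfolios, returns):
    """Join holdings to per-ticker returns and sum per (portfolio, rdate).

    `portfolios` maps portfolio -> {ticker: shares}; `returns` is a list of
    (ticker, rdate, ret) tuples. Share counts are not used, as in the query.
    """
    by_ticker = defaultdict(list)
    for ticker, rdate, ret in returns:
        by_ticker[ticker].append((rdate, ret))

    totals = defaultdict(float)
    for portfolio, holdings in portfolios.items():
        for ticker in holdings:
            for rdate, ret in by_ticker.get(ticker, []):
                totals[(portfolio, rdate)] += ret
    return {key: round(total, 2) for key, total in totals.items()}
```

Tickers held in a portfolio but absent from the returns contribute nothing, matching inner-join behaviour.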




Data model: Analytics
               HistLoss
                                     worst_date    loss
                        Portfolio1   2011-07-23   -$34.81
                        Portfolio2   2011-03-11 -$11432.24
                        Portfolio3   2011-05-21 -$1476.93



               INSERT OVERWRITE TABLE HistLoss
               SELECT a.portfolio, rdate, minp
               FROM (
                 SELECT portfolio, min(preturn) as minp
                 FROM portfolio_returns
                 GROUP BY portfolio
               ) a
               JOIN portfolio_returns b
               ON (a.portfolio = b.portfolio and a.minp = b.preturn);
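
The subquery-plus-join pattern here (find each group's minimum, then join back to recover the date) can be sketched in Python. The `hist_loss` helper and its `rows` parameter are hypothetical names, with values held as plain numbers.

```python
def hist_loss(rows):
    """For each portfolio, find the minimum return and the date(s) it occurred.

    `rows` is a list of (portfolio, rdate, preturn) tuples. Ties produce one
    output row per matching date, as the join would.
    """
    # Step 1: the inner GROUP BY, giving min(preturn) per portfolio.
    minp = {}
    for portfolio, _, preturn in rows:
        if portfolio not in minp or preturn < minp[portfolio]:
            minp[portfolio] = preturn
    # Step 2: join back to the full rows to pick up the matching date.
    return sorted(
        (portfolio, rdate, preturn)
        for portfolio, rdate, preturn in rows
        if preturn == minp[portfolio]
    )
```

Applied to the portfolio_returns sample rows, Portfolio1's worst date comes out as 2011-07-23 with a loss of -34.81, the same as in the HistLoss table.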



Portfolio Demo dataflow


     Portfolios               Web-based Portfolios
     Historical Prices        Live Prices for today
     Intermediate Results
     Largest loss             Largest loss




OpsCenter




Where to get it

    ✤    http://www.datastax.com/brisk





