London
Our sponsors: Acunu
But first, a short back story…
[Timeline slides: a grid numbered 1 to 36, filled in twelve at a time; "GC HELL!" appears alongside 13 to 24]
Please volunteer if you would like to give a talk; Internet fame awaits
 My experience with Cassandra in production is positive
 Analytics is more difficult than it could be
 Welcome Brisk!
 Brisk combines Hadoop, Hive and Cassandra in a “distribution”

In a nutshell
 CassandraFS as HDFS-compatible layer; no namenode, no SPOF
 Can split cluster for OLAP and OLTP workloads, scaling up either as required

Demonstrating Brisk…
Building an Ad Network!
The plan:
 Simple data model – segment users into buckets
 System to put users in buckets via a pixel
 Real-time queries
 Analytics

We Have Your Kidneys
The ad-network for the paranoid generation
 Cookie based identification
 API provides:
  Add user to a bucket (including ability to define expiry time)
  Get buckets a user belongs to

Setup Brisk
http://www.datastax.com/docs/0.8/brisk/install_brisk_ami
 Step-by-step guide with pictures!
 Ubuntu 10.10 image with RAID 0 ephemeral disks
 Jairam has been bug-fixing some minor issues

Data model
 CF = users
  [userUUID] [segmentID] = 1
 CF = segments
  [segmentID] [userUUID] = 1

Data model
create keyspace whyk
    with placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy'
    and strategy_options = [{replication_factor:1}];
create column family users
    with comparator = 'AsciiType'
    and rows_cached = 5000;
create column family segments
    with comparator = 'AsciiType'
    and rows_cached = 5000;
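
The two column families in the data model mirror each other, so every membership is written twice and can be read from either the user side or the segment side. A minimal in-memory sketch of that idea, assuming plain Python dictionaries as stand-ins for the column families (the function name and sample IDs are made up for illustration and are not from the talk):

```python
from collections import defaultdict

# In-memory stand-ins for the two column families (not Cassandra itself).
users = defaultdict(dict)     # users[userUUID][segmentID] = 1
segments = defaultdict(dict)  # segments[segmentID][userUUID] = 1


def add_user_to_segment(user_uuid, segment_id):
    """Write the membership twice so it can be read from either side."""
    users[user_uuid][segment_id] = 1
    segments[segment_id][user_uuid] = 1


# Hypothetical sample data.
add_user_to_segment("user-a", "sports")
add_user_to_segment("user-a", "news")
add_user_to_segment("user-b", "sports")

print(sorted(users["user-a"]))     # segments one user belongs to
print(sorted(segments["sports"]))  # users inside one segment
```

The cost of this layout is a second write per membership; the benefit is that both lookups are a single row read.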

Our pixel
http://wehaveyourkidneys.com/add.php?segment=<alphaNumericCode>&expire=<numberOfSeconds>
 We’ll use Cassandra’s expiring columns feature
 PHP code – uses phpcassa

$pool = new ConnectionPool('whyk', array('localhost'));
$users = new ColumnFamily($pool, 'users');
$segments = new ColumnFamily($pool, 'segments');
$users->insert(
    $userUuid,
    array($segment => 1),
    NULL,    // default TS
    $expires
);
$segments->insert(
    $segment,
    array($userUuid => 1),
    NULL,    // default TS
    $expires
);
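
The pixel relies on expiring columns: each column is written with a time-to-live in seconds and stops being returned once that time has passed. The sketch below only illustrates that behaviour in memory; the class and method names are hypothetical, the clock is passed in explicitly to keep the example deterministic, and the exact boundary handling inside Cassandra may differ.

```python
class ExpiringRow:
    """One row whose columns each carry an optional expiry time."""

    def __init__(self):
        # column name -> (value, expires_at or None)
        self._columns = {}

    def insert(self, name, value, now, ttl=None):
        """Store a column; with a ttl it expires `ttl` seconds after `now`."""
        expires_at = None if ttl is None else now + ttl
        self._columns[name] = (value, expires_at)

    def get(self, now):
        """Return only the columns that have not expired at time `now`."""
        return {
            name: value
            for name, (value, expires_at) in self._columns.items()
            if expires_at is None or now < expires_at
        }


row = ExpiringRow()
row.insert("sports", 1, now=0, ttl=60)  # expires 60 seconds after insertion
row.insert("news", 1, now=0)            # no TTL, so it never expires

print(row.get(now=30))  # both columns are still visible
print(row.get(now=60))  # only the column without a TTL remains
```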

Real-time access
http://wehaveyourkidneys.com/show.php

$pool = new ConnectionPool('whyk', array('localhost'));
$users = new ColumnFamily($pool, 'users');
// @todo this only gets first 100!
$segments = $users->get($userUuid);
header('Content-Type: application/json');
echo json_encode(array_keys($segments));

Analytics
 How many users in each segment?
 Launch HIVE (very easy!)

root@brisk-01:~# brisk hive

CREATE EXTERNAL TABLE whyk.users
    (userUuid string, segmentId string, value string)
STORED BY
    'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
WITH SERDEPROPERTIES ("cassandra.columns.mapping" = ":key,:column,:value");

select segmentId, count(1) as total
from whyk.users
group by segmentId
order by total desc;
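
The query above groups rows by segmentId and orders segments by how many rows each has. Assuming the column mapping exposes one row per (userUuid, segmentId, value) triple, the same aggregation can be sketched on a small in-memory sample; the rows and the function name below are hypothetical and only illustrate what the query computes, not how Hive executes it.

```python
from collections import Counter

# Hypothetical sample rows in the (userUuid, segmentId, value) shape.
rows = [
    ("user-a", "sports", "1"),
    ("user-a", "news", "1"),
    ("user-b", "sports", "1"),
    ("user-c", "sports", "1"),
    ("user-c", "news", "1"),
    ("user-c", "travel", "1"),
]


def users_per_segment(rows):
    """Count rows per segment and return (segment, total) pairs, largest first."""
    counts = Counter(segment_id for _user, segment_id, _value in rows)
    return counts.most_common()


for segment_id, total in users_per_segment(rows):
    print(segment_id, total)
```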

Summary
http://www.flickr.com/photos/sovietuk/2956044892/sizes/o/in/photostream/

Cassandra + Hadoop = Brisk

Editor's Notes

  • #4: Started at Imagini; May 2010.
      New ad-targeting product! Lots of users.
      MySQL DB for profiles, MySQL-based server for events reporting.
      Profile DB cannot update rows so we only insert; this means clients have to merge together all rows for a user on every read.
      MySQL DB has a habit of dying, requiring a repair and downtime; having 2 DBs managed to put off total death but not for long.
  • #5: Choosing Cassandra after some research; no single point of failure attractive, high write throughput attractive, linear scaling attractive.
      Welcome to GC hell!
      Start Cassandra London – like alcoholics anonymous; a support network.
  • #6: Batch analytics; how? No Hive support, no support for streaming jar.
      Pig input reader.
      No output reader; require HDFS.
  • #7: Keep up the meetups.
      Acunu generous at providing speakers; downside is hearing sales pitch!
      0.7 comes along; downside is not compatible with 0.6; Thrift interface changes.
      0.8 comes along; CQL, counters.
      Brisk!
  • #9: A summary
  • #10: Some points about “distribution”.
      Some points about Cloudera and reaction.
  • #11: Realtime + batch analytics combined.
      No single point of failure; we don’t need Hadoop’s namenode anymore.
      Cross DC clusters.
  • #14: No ads. No network. No publishers. Cool domain name.
  • #15, #16, #22, #23: User segments / buckets; have some ID, can expire (we’ll use Cassandra’s expiring columns).
      Real-time updates via a simple script (via CQL?).
      Real-time queries (is user in this segment? What segments is user X in?).
      Analytics (how many users in each segment? How many segments are users in on average? Std dev?).