Storing Time Series Metrics

         Implementing Multi-Dimensional
           Aggregate Composites with
           Counters For Reporting
         /*
         Joe Stein
         http://www.linkedin.com/in/charmalloc
         @allthingshadoop
         @cassandranosql
         @allthingsscala
         @charmalloc
         */

         Sample code project up at
           https://github.com/joestein/apophis



Medialets

What we do




Medialets
• Largest deployment of rich media ads for mobile devices
• Over 300,000,000 devices supported
• 3-4 TB of new data every day
• Thousands of services in production
• Hundreds of Thousands of simultaneous requests per second
• Keeping track of what is and was going on when and where
  used to be difficult before we started using Cassandra
• What do I do for Medialets?
   – Chief Architect and Head of Server Engineering
     Development & Operations.




What does the schema look like?

CREATE COLUMN FAMILY ByDay
WITH default_validation_class=CounterColumnType
AND key_validation_class=UTF8Type AND comparator=UTF8Type;

CREATE COLUMN FAMILY ByHour
WITH default_validation_class=CounterColumnType
AND key_validation_class=UTF8Type AND comparator=UTF8Type;

CREATE COLUMN FAMILY ByMinute
WITH default_validation_class=CounterColumnType
AND key_validation_class=UTF8Type AND comparator=UTF8Type;

CREATE COLUMN FAMILY BySecond
WITH default_validation_class=CounterColumnType
AND key_validation_class=UTF8Type AND comparator=UTF8Type;

Column families hold your rows of data. Each row within each column
family will be equal to the time period you are dealing with. So an
"event" occurring at 10/20/2011 11:22:41 will become 4 rows:

BySecond = 20111020112141
ByMinute = 201110201122
ByHour   = 2011102011
ByDay    = 20111020
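One way to picture this is a small helper that derives the four row keys from an event's timestamp. A minimal sketch; the `bucketKeys` helper is a hypothetical name, not part of the deck's code:

```scala
import java.text.SimpleDateFormat
import java.util.Date

// Hypothetical helper: derive the row key for each counter column family
// from a single event timestamp.
def bucketKeys(ts: Date): Map[String, String] = {
  def fmt(pattern: String) = new SimpleDateFormat(pattern).format(ts)
  Map(
    "BySecond" -> fmt("yyyyMMddHHmmss"),
    "ByMinute" -> fmt("yyyyMMddHHmm"),
    "ByHour"   -> fmt("yyyyMMddHH"),
    "ByDay"    -> fmt("yyyyMMdd")
  )
}

// The 10/20/2011 11:22:41 event from the slide:
val event = new SimpleDateFormat("MM/dd/yyyy HH:mm:ss").parse("10/20/2011 11:22:41")
bucketKeys(event)  // BySecond -> 20111020112141, ..., ByDay -> 20111020
```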




Why multiple column families?
http://www.datastax.com/docs/1.0/configuration/storage_configuration




OK, now how do we keep track of what?
             Let's set up a quick example data set first

• The Animal Logger – fictitious logger of the world around us
  – animal
  – food
  – sound
  – home

• YYYY/MM/DD HH:MM:SS GET /sample?animal=X&food=Y
  – animal=duck&sound=quack&home=pond
  – animal=cat&sound=meow&home=house
  – animal=cat&sound=meow&home=street
  – animal=pigeon&sound=coo&home=street



Now what?
      Columns babe, columns make your aggregates work

• Setup your code for columns you want aggregated
  – animal=
  – animal#sound=
  – animal#home=
  – animal#food=
  – animal#food#home=
  – animal#food#sound=
  – animal#sound#home=
  – food#sound=
  – home#food=
  – sound#animal=
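Each aggregate above becomes a column name: the dimension names joined with `#`, an `=`, then the event's values joined with `#`. A minimal sketch of that concatenation; the `compositeName` helper is hypothetical, not from the apophis code:

```scala
// Hypothetical helper: build a composite counter column name from an
// event's dimensions, e.g. animal#sound#home=duck#quack#pond.
def compositeName(dims: Map[String, String], parts: String*): String =
  parts.mkString("#") + "=" + parts.map(dims).mkString("#")

val event = Map("animal" -> "duck", "sound" -> "quack", "home" -> "pond")
compositeName(event, "animal", "sound", "home")  // animal#sound#home=duck#quack#pond
compositeName(event, "sound", "animal")          // sound#animal=quack#duck
```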



Inserting data
                   Column aggregate concatenated with values
      2011/10/29 11:22:43 GET /sample?animal=duck&home=pond&sound=quack
•   mutator.insertCounter("20111029112243", "BySecond",
    HFactory.createCounterColumn("animal#sound#home=duck#quack#pond", 1))
•   mutator.insertCounter("20111029112243", "BySecond",
    HFactory.createCounterColumn("animal#home=duck#pond", 1))
•   mutator.insertCounter("20111029112243", "BySecond", HFactory.createCounterColumn("animal=duck", 1))

•   mutator.insertCounter("201110291122", "ByMinute",
    HFactory.createCounterColumn("animal#sound#home=duck#quack#pond", 1))
•   mutator.insertCounter("201110291122", "ByMinute",
    HFactory.createCounterColumn("animal#home=duck#pond", 1))
•   mutator.insertCounter("201110291122", "ByMinute", HFactory.createCounterColumn("animal=duck", 1))

•   mutator.insertCounter("2011102911", "ByHour", HFactory.createCounterColumn("animal#home=duck#pond", 1))
•   mutator.insertCounter("2011102911", "ByHour",
    HFactory.createCounterColumn("animal#sound#home=duck#quack#pond", 1))
•   mutator.insertCounter("2011102911", "ByHour", HFactory.createCounterColumn("animal=duck", 1))

•   mutator.insertCounter("20111029", "ByDay",
    HFactory.createCounterColumn("animal#sound#home=duck#quack#pond", 1))
•   mutator.insertCounter("20111029", "ByDay", HFactory.createCounterColumn("animal#home=duck#pond", 1))
•   mutator.insertCounter("20111029", "ByDay", HFactory.createCounterColumn("animal=duck", 1))
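Taken together, one event fans out to (time buckets × aggregates) counter increments. A self-contained sketch of that fan-out, using an in-memory map in place of the real Hector mutator (assumption: no Cassandra involved here):

```scala
import scala.collection.mutable

// In-memory stand-in for the counter column families,
// keyed by (columnFamily, rowKey, columnName).
val counters = mutable.Map.empty[(String, String, String), Long].withDefaultValue(0L)

def insertCounter(row: String, cf: String, col: String): Unit =
  counters((cf, row, col)) += 1

val buckets = Seq("BySecond" -> "20111029112243", "ByMinute" -> "201110291122",
                  "ByHour"   -> "2011102911",     "ByDay"    -> "20111029")
val columns = Seq("animal#sound#home=duck#quack#pond", "animal#home=duck#pond", "animal=duck")

// One event: 4 buckets x 3 aggregates = 12 increments.
for ((cf, row) <- buckets; col <- columns) insertCounter(row, cf, col)
```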




The implementation, it's functional
       kind of like "it's electric" but without the boogie woogie woogie

def r(columnName: String): Unit = {
  aggregateKeys.foreach { tuple: (ColumnFamily, String) => {
    val (columnFamily, row) = tuple
    if (row != null && row.size > 0)
      rows add (columnFamily -> row has columnName inc) //increment the counter
    }
  }
}

def ccAnimal(c: (String) => Unit) = {
  c(aggregateColumnNames("Animal") + animal)
}

//rows we are going to write to
aggregateKeys(KEYSPACE  "ByDay") = day
aggregateKeys(KEYSPACE  "ByHour") = hour
aggregateKeys(KEYSPACE  "ByMinute") = minute

aggregateColumnNames("Animal") = "animal="

ccAnimal(r)


Retrieving Data
                      MultigetSliceCounterQuery

•   setColumnFamily("ByDay")
•   setKeys("20111029")
•   setRange("animal#sound=", "animal#sound=~", false, 1000)
•   We will get all animals, all of their sounds, and their counts for
    that day

•   setRange("sound#animal=purr#", "sound#animal=purr#~", false, 1000)
•   We will get all animals that purr and their counts


• What is with the tilde?


Sort for success
Not magic, just Cassandra
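A sketch of why those slice bounds work, assuming the UTF8Type comparator from the schema: columns sort lexically, and `~` (0x7e) sorts after ASCII letters and digits, so the range [prefix, prefix + "~"] covers every column that starts with the prefix:

```scala
// Columns as Cassandra keeps them under a UTF8 comparator: sorted lexically.
val columns = Seq(
  "animal#sound=cat#purr", "animal#sound=duck#quack",
  "animal=cat", "sound#animal=purr#cat").sorted

// A slice query is just the columns between start and finish, inclusive.
def slice(start: String, finish: String): Seq[String] =
  columns.filter(c => c >= start && c <= finish)

slice("animal#sound=", "animal#sound=~")
// -> animal#sound=cat#purr, animal#sound=duck#quack
// (animal=cat is excluded, because '=' sorts after '#')
```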




What it looks like in Cassandra
val sample1: String = "10/12/2011 11:22:33   GET   /sample?animal=duck&sound=quack&home=pond"
val sample4: String = "10/12/2011 11:22:33   GET   /sample?animal=cat&sound=purr&home=house"
val sample5: String = "10/12/2011 11:22:33   GET   /sample?animal=lion&sound=purr&home=zoo"
val sample6: String = "10/12/2011 11:22:33   GET   /sample?animal=dog&sound=woof&home=street"

[default@FixtureTestApophis] get ByDay[20111012];
=> (counter=animal#sound#home=cat#purr#house, value=70)
=> (counter=animal#sound#home=dog#woof#street, value=20)
=> (counter=animal#sound#home=duck#quack#pond, value=98)
=> (counter=animal#sound#home=lion#purr#zoo, value=70)
=> (counter=animal#sound=cat#purr, value=70)
=> (counter=animal#sound=dog#woof, value=20)
=> (counter=animal#sound=duck#quack, value=98)
=> (counter=animal#sound=lion#purr, value=70)
=> (counter=animal=cat, value=70)
=> (counter=animal=dog, value=20)
=> (counter=animal=duck, value=98)
=> (counter=animal=lion, value=70)
=> (counter=sound#animal=purr#cat, value=42)
=> (counter=sound#animal=purr#lion, value=42)
=> (counter=sound#animal=quack#duck, value=43)
=> (counter=sound#animal=woof#dog, value=20)
=> (counter=total=, value=258)

https://github.com/joestein/apophis


A few more things about retrieving data

• You need to start backwards from here: know the queries you want
  before you write the data.
• If you want to do things ad hoc then map/reduce is better.
• Sometimes more rows is better, allowing more nodes to do work.
   – If you need to look at 100,000 metrics it is better to pull them
     out of 100 rows than out of 1.
   – Don't be afraid to make column families and composite keys out of
     Time + Aggregate data.
       • 20111023#animal=duck
       • This could be the row that holds ALL of the duck information
         for that day; if you want to look at 100 animals at once with
         1,000 metrics each per time period, this is the way to go.
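That composite row key is just concatenation. A tiny sketch, following the slide's 20111023#animal=duck example (the `dayRow` helper is hypothetical):

```scala
// Hypothetical helper: fold the day and one aggregate into the row key so
// each animal's metrics for a day live in their own row, letting wide reads
// spread across more nodes.
def dayRow(day: String, dimension: String, value: String): String =
  s"$day#$dimension=$value"

dayRow("20111023", "animal", "duck")  // 20111023#animal=duck
```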




Q&A



Medialets
The rich media
ad platform for mobile.
            connect@medialets.com
            www.medialets.com/showcase




