SlideShare a Scribd company logo
Realtime Analytics
  with Apache
   Cassandra
        Tom Wilkie
 Founder & CTO, Acunu Ltd
       @tom_wilkie
Combining “big” and “real-time” is hard

    Live & historical                    Drill downs
                         Trends...
      aggregates...                      and roll ups




2
                                                        Analytics
Solution              Con

                       Scalability
                         $$$


                      Not realtime


               Spartan query semantics =>
                 complex, DIY solutions

3
                                            Analytics
Example I
    eg “show me the number of mentions of
        ‘Acunu’ per day, between May and
          November 2011, on Twitter”


    Batch (Hadoop) approach would require
    processing ~30 billion tweets, or ~4.2
                 TB of data
                  http://blog.twitter.com/2011/03/numbers.html


4
                                                                 Analytics
Okay, so how are we going to
                   do it?

    For each tweet,
    increment a bunch of counters,
    such that answering a query
    is as easy as reading some counters


5
                                          Analytics
Preparing the data
                              12:32:15 I like #trafficlights
Step 1: Get a feed of    12:33:43 Nobody expects...
        the tweets     12:33:49 I ate a #bee; woe is...
                      12:34:04 Man, @acunu rocks!

Step 2: Tokenise the
        tweet

Step 3: Increment counters            [1234, man]   +1
        in time buckets for           [1234, acunu] +1
        each token                    [1234, rock] +1
6
                                                              Analytics
Querying
                            start: [01/05/11, acunu]
Step 1: Do a range query    end:   [30/05/11, acunu]

                                       Key            #Mentions
                              [01/05/11 00:01, acunu]    3
Step 2: Result table          [01/05/11 00:02, acunu]    5
                                        ...              ...


                              90

Step 3: Plot pretty graph     45
                               0
                                   May Jun Jul Aug Sept Oct Nov
7
                                                              Analytics
Instead of this...
                                  Key            #Mentions
                         [01/05/11 00:01, acunu]    3
                         [01/05/11 00:02, acunu]    5
                                   ...              ...




                         We do this
                      Key           00:01       00:02        ...
                [01/05/11, acunu]     3           5          ...
                [02/05/11, acunu]    12           4          ...
                        ...           ...                    ...

    Row key is ‘big’                     Column key is ‘small’
     time bucket                             time bucket
8
                                                                   Analytics
Towards a more
    general solution...
      (Example II)



9
                          Analytics
count
                grouped by ...
                    day
  count
 distinct
(session)
     count       ... geography

avg(duration)
                  ... browser


10
                          Analytics
21:00      all→1345    :00→45      :01→62      :02→87       ...

                         22:00      all→3221    :00→22      :00→19     :02→104       ...
{
     cust_id: user01,      ...                                                       ...

     session_id: 102,      UK        all→228    user01→1   user14→12   user99→7      ...
     geography: UK,
                           US        all→354    user01→4   user04→8    user56→17     ...
     browser: IE,
     time: 22:02,          ...

}                       UK, 22:00   all→1904       ...

                           ∅        all→87314   UK→238     US→354         ...




11
                                                                                 Analytics
21:00      all→1345     :00→45     :01→62      :02→87       ...

                         22:00      all→3222     :00→22     :00→19     :02→105       ...
{
     cust_id: user01,      ...                                                       ...

     session_id: 102,      UK        all→229    user01→2   user14→12   user99→7      ...
     geography: UK,
                           US        all→354    user01→4   user04→8    user56→17     ...
     browser: IE,
     time: 22:02,          ...

}                       UK, 22:00   all→1905       ...

                           ∅        all→87315   UK→239     US→354         ...




12
                                                                                 Analytics
21:00      all→1345    :00→45      :01→62      :02→87       ...

      22:00      all→3221    :00→22      :00→19     :02→104       ...

        ...                                                       ...

        UK        all→228    user01→1   user14→12   user99→7      ...

        US        all→354    user01→4   user04→8    user56→17     ...

        ...

     UK, 22:00   all→1904       ...

        ∅        all→87314   UK→238     US→354         ...




13
                                                              Analytics
where time 21:00-22:00
 count(*)
                          21:00      all→1345    :00→45      :01→62      :02→87       ...

                          22:00      all→3222    :00→22      :01→19     :02→105       ...

                            ...                                                       ...

                            UK        all→229    user01→2   user14→12   user99→7      ...

                            US        all→354    user01→4   user04→8    user56→17     ...

                            ...

                         UK, 22:00   all→1905       ...

                            ∅        all→87315   UK→239     US→354         ...




14
                                                                                  Analytics
where time 21:00-22:00
 count(*)
                           21:00      all→1345    :00→45      :01→62      :02→87       ...


where time 22:00-23:00,    22:00      all→3222    :00→22      :01→19     :02→105       ...


 group by minute             ...                                                       ...

                             UK        all→229    user01→2   user14→12   user99→7      ...

                             US        all→354    user01→4   user04→8    user56→17     ...

                             ...

                          UK, 22:00   all→1905       ...

                             ∅        all→87315   UK→239     US→354         ...




15
                                                                                   Analytics
where time 21:00-22:00
 count(*)
                           21:00      all→1345     :00→45     :01→62      :02→87       ...


where time 22:00-23:00,    22:00      all→3222    :00→22      :01→19     :02→105       ...


 group by minute             ...                                                       ...

                             UK        all→229    user01→2   user14→12   user99→7      ...


where geography=UK           US        all→354    user01→4   user04→8    user56→17     ...


 group all by user,          ...

                          UK, 22:00   all→1905       ...

                             ∅        all→87315   UK→239      US→354        ...




16
                                                                                   Analytics
where time 21:00-22:00
 count(*)
                           21:00      all→1345     :00→45     :01→62      :02→87       ...


where time 22:00-23:00,    22:00      all→3222    :00→22      :01→19     :02→105       ...


 group by minute             ...                                                       ...

                             UK        all→229    user01→2   user14→12   user99→7      ...


where geography=UK           US        all→354    user01→4   user04→8    user56→17     ...


 group all by user,          ...

                          UK, 22:00   all→1905       ...

count all                    ∅        all→87315   UK→239      US→354        ...




17
                                                                                   Analytics
where time 21:00-22:00
 count(*)
                           21:00      all→1345     :00→45     :01→62      :02→87       ...


where time 22:00-23:00,    22:00      all→3222    :00→22      :01→19     :02→105       ...


 group by minute             ...                                                       ...

                             UK        all→229    user01→2   user14→12   user99→7      ...


where geography=UK           US        all→354    user01→4   user04→8    user56→17     ...


 group all by user,          ...

                          UK, 22:00   all→1905       ...

count all                    ∅        all→87315   UK→239      US→354        ...




group all by geo
18
                                                                                   Analytics
What about more than
       just aggregates?



19
                            Analytics
Approximate Analytics
                 Exact




     Real-time           Large Scale


20
                                       Analytics
Count Distinct

     Plan A: keep a list of all the things you’ve seen
               count them at query time


                Quick to update
                  ... but at scale ...
                Takes lots of space
                Takes a long time to query
21
                                                         Analytics
Approximate Distinct

     max # leading zeroes seen so far
         item          hash        leading zeroes   max so far

         x        00101001110...          2            2
         y        11010100111...          0            2
         z        00011101011...          3            3
                       ...
     ... to see a max of M takes about        2M    items

22
                                                                 Analytics
Approximate Distinct

            to reduce var, average over m=2k sub-streams

     item          hash          index, zeroes   max so far

     x       00101001110...          0, 0        0,0,0,0
     y       11010100111...          3, 1        0,0,0,1
     z       00011101011...          0, 1        1,0,0,1
                   ...
            take the harmonic mean
23
                                                              Analytics
Okay... now what?




                    Analytics
Analytics

                                     counter
                                     updates
Click stream    events
                          Acunu
Sensor data
                         Analytics
     etc




     •   Aggregate incrementally, on the fly
     •   Store live + historical aggregates
10x vs MySQL...




                  Analytics
Dashboard UI




27
                    Analytics
“Up and running in about 4 hours”
“We found out a competitor
  was scraping our data”

                      “We keep discovering use cases
                         we hadn’t thought of ”




                                                 Analytics
"Quick, efficient and easy to
        get started"
                       "We're still finding new and
                     interesting use cases, which just
                         aren't possible with our
                           current datastores."

                                                         Analytics
Thanks!

     Questions?


30
                  Analytics

More Related Content

PDF
Realtime Analytics with Apache Cassandra
PDF
Acunu Analytics
PDF
Users & Ad Click Costs By Country April2010
PDF
Lista items en minecraft
PPTX
Imagens, minecraft, royal treasure slides
PDF
Xeriscape Plant Selections and Ideas - North Dakota State University
PPTX
Kafka Connect: Real-time Data Integration at Scale with Apache Kafka, Ewen Ch...
PDF
Realtime Analytics on the Twitter Firehose with Apache Cassandra - Denormaliz...
Realtime Analytics with Apache Cassandra
Acunu Analytics
Users & Ad Click Costs By Country April2010
Lista items en minecraft
Imagens, minecraft, royal treasure slides
Xeriscape Plant Selections and Ideas - North Dakota State University
Kafka Connect: Real-time Data Integration at Scale with Apache Kafka, Ewen Ch...
Realtime Analytics on the Twitter Firehose with Apache Cassandra - Denormaliz...

More from JAX London (20)

PDF
Everything I know about software in spaghetti bolognese: managing complexity
PDF
Devops with the S for Sharing - Patrick Debois
PPT
Busy Developer's Guide to Windows 8 HTML/JavaScript Apps
PDF
It's code but not as we know: Infrastructure as Code - Patrick Debois
KEY
Locks? We Don't Need No Stinkin' Locks - Michael Barker
PDF
Worse is better, for better or for worse - Kevlin Henney
PDF
Java performance: What's the big deal? - Trisha Gee
PDF
Clojure made-simple - John Stevenson
PDF
HTML alchemy: the secrets of mixing JavaScript and Java EE - Matthias Wessendorf
PDF
Play framework 2 : Peter Hilton
PDF
Complexity theory and software development : Tim Berglund
PDF
Why FLOSS is a Java developer's best friend: Dave Gruber
PDF
Akka in Action: Heiko Seeburger
PDF
NoSQL Smackdown 2012 : Tim Berglund
PDF
Closures, the next "Big Thing" in Java: Russel Winder
KEY
Java and the machine - Martijn Verburg and Kirk Pepperdine
PDF
Mongo DB on the JVM - Brendan McAdams
PDF
New opportunities for connected data - Ian Robinson
PDF
HTML5 Websockets and Java - Arun Gupta
PDF
The Big Data Con: Why Big Data is a Problem, not a Solution - Ian Plosker
Everything I know about software in spaghetti bolognese: managing complexity
Devops with the S for Sharing - Patrick Debois
Busy Developer's Guide to Windows 8 HTML/JavaScript Apps
It's code but not as we know: Infrastructure as Code - Patrick Debois
Locks? We Don't Need No Stinkin' Locks - Michael Barker
Worse is better, for better or for worse - Kevlin Henney
Java performance: What's the big deal? - Trisha Gee
Clojure made-simple - John Stevenson
HTML alchemy: the secrets of mixing JavaScript and Java EE - Matthias Wessendorf
Play framework 2 : Peter Hilton
Complexity theory and software development : Tim Berglund
Why FLOSS is a Java developer's best friend: Dave Gruber
Akka in Action: Heiko Seeburger
NoSQL Smackdown 2012 : Tim Berglund
Closures, the next "Big Thing" in Java: Russel Winder
Java and the machine - Martijn Verburg and Kirk Pepperdine
Mongo DB on the JVM - Brendan McAdams
New opportunities for connected data - Ian Robinson
HTML5 Websockets and Java - Arun Gupta
The Big Data Con: Why Big Data is a Problem, not a Solution - Ian Plosker
Ad

Recently uploaded (20)

PDF
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
PPTX
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
PPTX
Big Data Technologies - Introduction.pptx
DOCX
The AUB Centre for AI in Media Proposal.docx
PDF
Chapter 3 Spatial Domain Image Processing.pdf
PDF
NewMind AI Monthly Chronicles - July 2025
PDF
Dropbox Q2 2025 Financial Results & Investor Presentation
PDF
Shreyas Phanse Resume: Experienced Backend Engineer | Java • Spring Boot • Ka...
PDF
Modernizing your data center with Dell and AMD
PDF
Machine learning based COVID-19 study performance prediction
PDF
Spectral efficient network and resource selection model in 5G networks
PDF
Build a system with the filesystem maintained by OSTree @ COSCUP 2025
PPT
“AI and Expert System Decision Support & Business Intelligence Systems”
PDF
Unlocking AI with Model Context Protocol (MCP)
PPTX
20250228 LYD VKU AI Blended-Learning.pptx
PDF
The Rise and Fall of 3GPP – Time for a Sabbatical?
PDF
Empathic Computing: Creating Shared Understanding
PPTX
Cloud computing and distributed systems.
PDF
Approach and Philosophy of On baking technology
PDF
Network Security Unit 5.pdf for BCA BBA.
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
Big Data Technologies - Introduction.pptx
The AUB Centre for AI in Media Proposal.docx
Chapter 3 Spatial Domain Image Processing.pdf
NewMind AI Monthly Chronicles - July 2025
Dropbox Q2 2025 Financial Results & Investor Presentation
Shreyas Phanse Resume: Experienced Backend Engineer | Java • Spring Boot • Ka...
Modernizing your data center with Dell and AMD
Machine learning based COVID-19 study performance prediction
Spectral efficient network and resource selection model in 5G networks
Build a system with the filesystem maintained by OSTree @ COSCUP 2025
“AI and Expert System Decision Support & Business Intelligence Systems”
Unlocking AI with Model Context Protocol (MCP)
20250228 LYD VKU AI Blended-Learning.pptx
The Rise and Fall of 3GPP – Time for a Sabbatical?
Empathic Computing: Creating Shared Understanding
Cloud computing and distributed systems.
Approach and Philosophy of On baking technology
Network Security Unit 5.pdf for BCA BBA.
Ad

Realtime analytics with Apache Cassandra - Tom Wilkie

  • 1. Realtime Analytics with Apache Cassandra Tom Wilkie Founder & CTO, Acunu Ltd @tom_wilkie
  • 2. Combining “big” and “real-time” is hard Live & historical Drill downs Trends... aggregates... and roll ups 2 Analytics
  • 3. Solution Con Scalability $$$ Not realtime Spartan query semantics => complex, DIY solutions 3 Analytics
  • 4. Example I eg “show me the number of mentions of ‘Acunu’ per day, between May and November 2011, on Twitter” Batch (Hadoop) approach would require processing ~30 billion tweets, or ~4.2 TB of data http://blog.twitter.com/2011/03/numbers.html 4 Analytics
  • 5. Okay, so how are we going to do it? For each tweet, increment a bunch of counters, such that answering a query is as easy as reading some counters 5 Analytics
  • 6. Preparing the data 12:32:15 I like #trafficlights Step 1: Get a feed of 12:33:43 Nobody expects... the tweets 12:33:49 I ate a #bee; woe is... 12:34:04 Man, @acunu rocks! Step 2: Tokenise the tweet Step 3: Increment counters [1234, man] +1 in time buckets for [1234, acunu] +1 each token [1234, rock] +1 6 Analytics
  • 7. Querying start: [01/05/11, acunu] Step 1: Do a range query end: [30/05/11, acunu] Key #Mentions [01/05/11 00:01, acunu] 3 Step 2: Result table [01/05/11 00:02, acunu] 5 ... ... 90 Step 3: Plot pretty graph 45 0 May Jun Jul Aug Sept Oct Nov 7 Analytics
  • 8. Instead of this... Key #Mentions [01/05/11 00:01, acunu] 3 [01/05/11 00:02, acunu] 5 ... ... We do this Key 00:01 00:02 ... [01/05/11, acunu] 3 5 ... [02/05/11, acunu] 12 4 ... ... ... ... Row key is ‘big’ Column key is ‘small’ time bucket time bucket 8 Analytics
  • 9. Towards a more general solution... (Example II) 9 Analytics
  • 10. count grouped by ... day count distinct (session) count ... geography avg(duration) ... browser 10 Analytics
  • 11. 21:00 all→1345 :00→45 :01→62 :02→87 ... 22:00 all→3221 :00→22 :00→19 :02→104 ... { cust_id: user01, ... ... session_id: 102, UK all→228 user01→1 user14→12 user99→7 ... geography: UK, US all→354 user01→4 user04→8 user56→17 ... browser: IE, time: 22:02, ... } UK, 22:00 all→1904 ... ∅ all→87314 UK→238 US→354 ... 11 Analytics
  • 12. 21:00 all→1345 :00→45 :01→62 :02→87 ... 22:00 all→3222 :00→22 :00→19 :02→105 ... { cust_id: user01, ... ... session_id: 102, UK all→229 user01→2 user14→12 user99→7 ... geography: UK, US all→354 user01→4 user04→8 user56→17 ... browser: IE, time: 22:02, ... } UK, 22:00 all→1905 ... ∅ all→87315 UK→239 US→354 ... 12 Analytics
  • 13. 21:00 all→1345 :00→45 :01→62 :02→87 ... 22:00 all→3221 :00→22 :00→19 :02→104 ... ... ... UK all→228 user01→1 user14→12 user99→7 ... US all→354 user01→4 user04→8 user56→17 ... ... UK, 22:00 all→1904 ... ∅ all→87314 UK→238 US→354 ... 13 Analytics
  • 14. where time 21:00-22:00 count(*) 21:00 all→1345 :00→45 :01→62 :02→87 ... 22:00 all→3222 :00→22 :01→19 :02→105 ... ... ... UK all→229 user01→2 user14→12 user99→7 ... US all→354 user01→4 user04→8 user56→17 ... ... UK, 22:00 all→1905 ... ∅ all→87315 UK→239 US→354 ... 14 Analytics
  • 15. where time 21:00-22:00 count(*) 21:00 all→1345 :00→45 :01→62 :02→87 ... where time 22:00-23:00, 22:00 all→3222 :00→22 :01→19 :02→105 ... group by minute ... ... UK all→229 user01→2 user14→12 user99→7 ... US all→354 user01→4 user04→8 user56→17 ... ... UK, 22:00 all→1905 ... ∅ all→87315 UK→239 US→354 ... 15 Analytics
  • 16. where time 21:00-22:00 count(*) 21:00 all→1345 :00→45 :01→62 :02→87 ... where time 22:00-23:00, 22:00 all→3222 :00→22 :01→19 :02→105 ... group by minute ... ... UK all→229 user01→2 user14→12 user99→7 ... where geography=UK US all→354 user01→4 user04→8 user56→17 ... group all by user, ... UK, 22:00 all→1905 ... ∅ all→87315 UK→239 US→354 ... 16 Analytics
  • 17. where time 21:00-22:00 count(*) 21:00 all→1345 :00→45 :01→62 :02→87 ... where time 22:00-23:00, 22:00 all→3222 :00→22 :01→19 :02→105 ... group by minute ... ... UK all→229 user01→2 user14→12 user99→7 ... where geography=UK US all→354 user01→4 user04→8 user56→17 ... group all by user, ... UK, 22:00 all→1905 ... count all ∅ all→87315 UK→239 US→354 ... 17 Analytics
  • 18. where time 21:00-22:00 count(*) 21:00 all→1345 :00→45 :01→62 :02→87 ... where time 22:00-23:00, 22:00 all→3222 :00→22 :01→19 :02→105 ... group by minute ... ... UK all→229 user01→2 user14→12 user99→7 ... where geography=UK US all→354 user01→4 user04→8 user56→17 ... group all by user, ... UK, 22:00 all→1905 ... count all ∅ all→87315 UK→239 US→354 ... group all by geo 18 Analytics
  • 19. What about more than just aggregates? 19 Analytics
  • 20. Approximate Analytics Exact Real-time Large Scale 20 Analytics
  • 21. Count Distinct Plan A: keep a list of all the things you’ve seen count them at query time Quick to update ... but at scale ... Takes lots of space Takes a long time to query 21 Analytics
  • 22. Approximate Distinct max # leading zeroes seen so far item hash leading zeroes max so far x 00101001110... 2 2 y 11010100111... 0 2 z 00011101011... 3 3 ... ... to see a max of M takes about 2M items 22 Analytics
  • 23. Approximate Distinct to reduce var, average over m=2k sub-streams item hash index, zeroes max so far x 00101001110... 0, 0 0,0,0,0 y 11010100111... 3, 1 0,0,0,1 z 00011101011... 0, 1 1,0,0,1 ... take the harmonic mean 23 Analytics
  • 24. Okay... now what? Analytics
  • 25. Analytics counter updates Click stream events Acunu Sensor data Analytics etc • Aggregate incrementally, on the fly • Store live + historical aggregates
  • 26. 10x vs MySQL... Analytics
  • 27. Dashboard UI 27 Analytics
  • 28. “Up and running in about 4 hours” “We found out a competitor was scraping our data” “We keep discovering use cases we hadn’t thought of ” Analytics
  • 29. "Quick, efficient and easy to get started" "We're still finding new and interesting use cases, which just aren't possible with our current datastores." Analytics
  • 30. Thanks! Questions? 30 Analytics