Leveraging Big Data and Real-Time Analytics at Cxense
Simon Lia-Jonassen
08/04/15
About Cxense
Our mission is to help companies understand their audience and build great online user experiences.
–  Stay longer on the site.
–  Sign up for subscriptions.
–  Find interesting articles.
–  Buy recommended products.
About Cxense
Founded in 2010, ~100 employees in 2015.
Offices
–  Melbourne, Tokyo, Singapore, Stockholm, Copenhagen, Oslo*, London, Buenos Aires, Rio de Janeiro, Miami, New York, San Francisco.
Some of our customers
Our solutions
How does it work!?
Event (example)
Content Profile (example)
Challenges
Data Volume and Traffic
–  5K+ web sites
–  50M+ pages (last month)
–  500M+ users (last month)
–  10B+ events/month (20K events/sec peak)
Heterogeneity and Reliability
–  Hundreds of mobile and desktop platforms, browsers, internet providers, etc.
–  Multiple devices per user, cross-domain tracking (third-party cookies are dying).
–  Web pages (articles, image/video galleries, chats, search/front pages) and human language.
–  The Internet is Broken™
Constraints and Requirements
–  Online and real-time processing
•  Show and analyze what is happening right now.
–  High and sustainable performance
•  Throughput: 10K+ requests/sec at peak load.
•  Latency: 100 ms constraint for ads and recommendations.
–  Fault-tolerance and durability
Architecture and Data Flow (simplified)
Data Flow and Feeding
Communication
–  HTTP with JSON payload.
–  Durable and idempotent.
Local storage
–  Atomically append to file.
–  Use a new file each hour.
–  Use a separate directory for each partition.
–  Tail files and/or directories.
Metadata
–  Keeps the state.
–  Can go backwards and re-feed when needed.
System
–  Semi-automatic configuration via Upstart and Crontab.
–  Monitoring via Graphite and log files.
–  Automatic alerting and centralized log search.
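The local storage scheme on this slide (append-only files, one file per hour, one directory per partition) could be sketched roughly as follows. This is only an illustration under assumptions: the names `hourly_path` and `append_event`, the `.log` file naming and the one-JSON-object-per-line layout are hypothetical and not taken from the talk.

```python
import json
import os
from datetime import datetime, timezone


def hourly_path(root, partition, ts_ms):
    """Return <root>/<partition>/<YYYY-MM-DD-HH>.log for a millisecond timestamp.

    The naming scheme is an assumption made for this sketch.
    """
    hour = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc).strftime("%Y-%m-%d-%H")
    return os.path.join(root, str(partition), hour + ".log")


def append_event(root, partition, event):
    """Append one event as a single JSON line to the hourly file of its partition."""
    path = hourly_path(root, partition, event["time"])
    os.makedirs(os.path.dirname(path), exist_ok=True)
    line = (json.dumps(event, sort_keys=True) + "\n").encode("utf-8")
    # O_APPEND asks the OS to position every write at the end of the file;
    # issuing one write() per event keeps each line in one piece on local file systems.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        os.write(fd, line)
    finally:
        os.close(fd)
    return path
```

Because a file is only ever appended to and a new one starts each hour, a downstream reader can tail the newest file in each partition directory and keep its byte offset as the metadata needed to re-feed from an earlier point.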
The Cube
What is The Cube?
–  Partitioned column-store database.
–  Uses efficient string handling and integer compression.
–  Provides fast filtering and aggregation over 50B data points.
–  Guarantees low update latency (100 ms).
–  Exists in multiple variants:
•  Disk or memory based.
•  Partitioned by site, by user or by both.
–  Low-level API.
Example:
time           user    rnd     siteid  url                                browser
1409425329634  “4szi”  “xzst”  “9978”  “cxnews.com”                       “Chrome”
1409425329634  “zthp”  “fd0z”  “9978”  “cxnews.com/seahawks-win-again…”   “Firefox”
1409425329635  “4szi”  “tzdt”  “9978”  “cxnews.com/tesla-model-3-will-…”  “Chrome”
1409425329640  “4szi”  “aext”  “9978”  “cxnews.com/elon-musk-is-awes…”    “Chrome”
1409425329640  “zx5t”  “dxrf”  “9978”  “cxnews.com/tesla-model-3-will-…”  “Safari”
The Cube – Integer Columns
Frame of Reference Compression
–  Compress the numbers in groups of 64.
–  If the sequence is increasing – use the first number as the reference and compute the differences between each pair of consecutive numbers (deltas).
–  Find the maximum number of bits (width) needed to represent the largest delta and compress the deltas using a fixed bit width.
–  For non-increasing sequences, use the smallest number as the reference and the differences between the numbers and the reference as deltas.
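The frame-of-reference steps on this slide can be illustrated with a small sketch. Several things here are assumptions rather than details from the talk: the function names, the tuple layout of the encoded block, treating "increasing" as non-decreasing, and using one Python integer as a stand-in for the packed bit words a real column store would use.

```python
def bit_width(n):
    """Bits needed to represent a non-negative integer (at least 1)."""
    return max(1, n.bit_length())


def for_encode(block):
    """Encode up to 64 integers as (mode, reference, width, count, packed)."""
    assert 0 < len(block) <= 64
    increasing = all(a <= b for a, b in zip(block, block[1:]))
    if increasing:
        # Reference is the first number; deltas are consecutive differences.
        mode, ref = "delta", block[0]
        deltas = [b - a for a, b in zip(block, block[1:])]
    else:
        # Reference is the smallest number; deltas are offsets from it.
        mode, ref = "offset", min(block)
        deltas = [x - ref for x in block]
    width = bit_width(max(deltas, default=0))
    packed = 0
    for i, d in enumerate(deltas):
        packed |= d << (i * width)  # every delta occupies exactly `width` bits
    return mode, ref, width, len(block), packed


def for_decode(encoded):
    """Invert for_encode and return the original list of integers."""
    mode, ref, width, count, packed = encoded
    mask = (1 << width) - 1
    if mode == "delta":
        out = [ref]
        for i in range(count - 1):
            out.append(out[-1] + ((packed >> (i * width)) & mask))
        return out
    return [ref + ((packed >> (i * width)) & mask) for i in range(count)]
```

For the five timestamps in the example table, the deltas are 0, 1, 5 and 0, so each fits in 3 bits instead of a full 64-bit integer.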
The Cube – String Columns
–  A global lexicon maps all strings to numbers and back.
–  For each column, we map global keys to a smaller set of numbers and back.
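A minimal sketch of the two-level string mapping described here, assuming dense ids assigned in insertion order. The class names `IdMap` and `StringColumn` are hypothetical; the talk does not say how the lexicon or the per-column mapping are implemented.

```python
class IdMap:
    """Bidirectional mapping between hashable values and dense integer ids."""

    def __init__(self):
        self._ids = {}
        self._values = []

    def encode(self, value):
        if value not in self._ids:
            self._ids[value] = len(self._values)
            self._values.append(value)
        return self._ids[value]

    def decode(self, ident):
        return self._values[ident]


class StringColumn:
    """Stores small per-column ids; strings are resolved through the global lexicon."""

    def __init__(self, lexicon):
        self.lexicon = lexicon  # shared: string <-> global key
        self.local = IdMap()    # per column: global key <-> small local id
        self.rows = []

    def append(self, s):
        self.rows.append(self.local.encode(self.lexicon.encode(s)))

    def get(self, row):
        return self.lexicon.decode(self.local.decode(self.rows[row]))
```

The point of the second level is that a column such as browser has few distinct values, so its local ids stay small and compress well even when the global lexicon holds many strings.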
The Cube – Filtering
Filter
–  Keep a bit-filter over a particular range of rows as a state.
Filtering
–  By number or range – pass through a column and update the filter. Use binary search for ordered columns such as time, and an inverted index for user id.
–  By key – map the key to a number and filter by the number.
–  By set of keys – map the keys to a bit-set and filter using the bit-set.
–  By pattern – filter by the set of keys matching the pattern.
Logical operations
–  AND, OR, NOT – use unary negation, binary intersection/join and a stack of filters.
Advanced operations
–  Use aggregation output as filtering input (e.g., top-list, explosion, histogram, etc.).
–  Join between different cubes on one or multiple dimensions.
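The filtering ideas on this slide can be sketched with a Python integer used as the bit-filter. Everything named here (`Filter`, `by_value`, `by_key_set`, `by_range_sorted`, `evaluate`) is hypothetical, "join" is read as union (OR), and columns hold plain values for brevity, where the real system would first map keys to numbers.

```python
from bisect import bisect_left


class Filter:
    """Bit filter over rows [0, n): bit i is set when row i passes."""

    def __init__(self, n, bits=None):
        self.n = n
        self.full = (1 << n) - 1
        self.bits = self.full if bits is None else bits & self.full

    def __and__(self, other):
        return Filter(self.n, self.bits & other.bits)

    def __or__(self, other):
        return Filter(self.n, self.bits | other.bits)

    def __invert__(self):
        return Filter(self.n, ~self.bits)  # masked to n bits in __init__

    def rows(self):
        return [i for i in range(self.n) if self.bits >> i & 1]


def by_value(column, value):
    """Pass through the column and set the bit of every matching row."""
    bits = 0
    for i, v in enumerate(column):
        if v == value:
            bits |= 1 << i
    return Filter(len(column), bits)


def by_key_set(column, keys):
    """Keep the rows whose value belongs to a set of keys."""
    keys = set(keys)
    bits = 0
    for i, v in enumerate(column):
        if v in keys:
            bits |= 1 << i
    return Filter(len(column), bits)


def by_range_sorted(column, lo, hi):
    """Rows with lo <= value < hi in an ordered column, found by binary search."""
    start = bisect_left(column, lo)
    end = bisect_left(column, hi)
    return Filter(len(column), ((1 << (end - start)) - 1) << start)


def evaluate(program):
    """Evaluate a postfix list of Filters and 'AND' / 'OR' / 'NOT' using a stack."""
    stack = []
    for item in program:
        if isinstance(item, Filter):
            stack.append(item)
        elif item == "NOT":
            stack.append(~stack.pop())
        else:
            right, left = stack.pop(), stack.pop()
            stack.append(left & right if item == "AND" else left | right)
    return stack.pop()
```

With the example table, `evaluate([by_range_sorted(time, t0, t1), by_value(browser, "Chrome"), "NOT", "AND"])` would select the rows in the time range that were not served to Chrome.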
The Cube – Aggregation
Operations
–  Count – count the number of bits in the filter.
–  Sum – sum the numbers where filter bit is set.
–  Cardinality – count the number of distinct keys/numbers.
–  CardinalityEstimator – create a HyperLogLog cardinality estimator.
–  Frequency – create a map of keys/numbers with the associated count.
–  TopList – create a frequency map with only the k most popular keys/numbers.
–  SumBy – create a map of keys/numbers with the associated sum.
–  CardinalityMap – create a map of keys/numbers with the associated cardinality.
–  FrequencyDistribution – create a histogram over frequencies.
–  CardinalityDistribution – create a histogram over cardinalities.
–  SumByDistribution – create a histogram over sums.
–  NumericalStatistics – compute distribution statistics for numbers (min, max, percentiles).
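A few of these aggregations are easy to sketch over an integer bit-filter. The function names below are hypothetical, the cardinality is exact rather than a HyperLogLog estimate, and the `active_time` column in the usage note is invented for illustration.

```python
from collections import Counter, defaultdict


def selected(column, filter_bits):
    """Yield the column values whose row bit is set in the filter."""
    for i, v in enumerate(column):
        if filter_bits >> i & 1:
            yield v


def count(filter_bits):
    """Count: number of set bits in the filter."""
    return bin(filter_bits).count("1")


def total(column, filter_bits):
    """Sum: add up the numbers where the filter bit is set."""
    return sum(selected(column, filter_bits))


def cardinality(column, filter_bits):
    """Cardinality: exact number of distinct selected values."""
    return len(set(selected(column, filter_bits)))


def frequency(column, filter_bits):
    """Frequency: map each selected value to its count."""
    return Counter(selected(column, filter_bits))


def top_list(column, filter_bits, k):
    """TopList: the k most frequent selected values with their counts."""
    return frequency(column, filter_bits).most_common(k)


def sum_by(keys, values, filter_bits):
    """SumBy: map each selected key to the sum of its values."""
    sums = defaultdict(int)
    for k, v in zip(selected(keys, filter_bits), selected(values, filter_bits)):
        sums[k] += v
    return dict(sums)
```

For example, with `user = ["4szi", "zthp", "4szi", "4szi", "zx5t"]` and a made-up `active_time = [10, 20, 30, 40, 50]`, `sum_by(user, active_time, 0b11111)` gives the total active time per user.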
The Cube – Updates
Partitioning
–  Most of the data structures are partitioned into chunks of data in order to improve memory allocation, materialization, skipping, compression and locking.
Static and dynamic parts
–  Each data column, lexicon or mapping consists of a static and a dynamic part.
–  The static part is ordered – can use binary search and Minimal Perfect Hashing.
–  The dynamic part is read-write – it has to be searched exhaustively, but this is improved using Wavelet Trees.
Locking
–  Distinct Read and Read-Write Locks with different granularity/scope.
–  The updates are mostly appends, but some of the columns might be updated later (e.g., active time, exit query, etc.).
Maintenance
–  Periodically flush the dynamic part into the static part.
–  Remove the old data, delete unused strings, optimize the mapping.
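The static/dynamic split can be illustrated with a deliberately simplified sketch: a sorted static list searched by binary search, an unsorted dynamic list searched exhaustively, and a periodic flush. The class name `TwoPartSet` is hypothetical, and Minimal Perfect Hashing, Wavelet Trees and locking are all left out.

```python
from bisect import bisect_left


class TwoPartSet:
    """String set split into a sorted static part and an append-only dynamic part."""

    def __init__(self):
        self.static = []   # ordered, only rewritten by flush()
        self.dynamic = []  # unordered, receives new strings

    def contains(self, s):
        i = bisect_left(self.static, s)  # binary search in the ordered part
        if i < len(self.static) and self.static[i] == s:
            return True
        return s in self.dynamic  # exhaustive scan of the dynamic part

    def add(self, s):
        """Append a new string to the dynamic part; known strings are ignored."""
        if not self.contains(s):
            self.dynamic.append(s)

    def flush(self):
        """Maintenance step: merge the dynamic part into the static part."""
        self.static = sorted(self.static + self.dynamic)
        self.dynamic = []
```

Keeping the dynamic part small through periodic flushes bounds the cost of the exhaustive scan, which is the trade-off the maintenance bullets describe.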
The Cube – Advanced Data Types
Keyword vectors
–  Represent user and document profiles.
–  Each contains a document id, a version and a set of group-item pairs with a weight.
–  Stored in a separate, highly partitioned set of containers.
–  Each container keeps multiple groups.
–  Each group contains document ids, items and weights as columns.
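One way to picture the columnar layout of keyword vectors is a single container holding one set of parallel columns per group. This is a sketch under assumptions: the class name, the group names in the usage note, and the omission of versions and container partitioning are all simplifications, not details from the talk.

```python
from collections import defaultdict


class KeywordVectorStore:
    """One container: each group keeps document ids, items and weights as columns."""

    def __init__(self):
        self.groups = defaultdict(lambda: {"doc": [], "item": [], "weight": []})

    def add(self, doc_id, vector):
        """Store a profile given as a list of (group, item, weight) triples."""
        for group, item, weight in vector:
            cols = self.groups[group]
            cols["doc"].append(doc_id)
            cols["item"].append(item)
            cols["weight"].append(weight)

    def vector(self, doc_id):
        """Reassemble the (group, item, weight) triples of one profile."""
        out = []
        for group, cols in self.groups.items():
            for d, item, w in zip(cols["doc"], cols["item"], cols["weight"]):
                if d == doc_id:
                    out.append((group, item, w))
        return out
```

A call such as `store.add("doc1", [("entity", "tesla", 0.9), ("category", "cars", 0.5)])` appends one row to the columns of each group the profile touches.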
The Cube – Advanced Data Types
Structured data
–  Can represent any simple JSON object (document).
–  Node types: Null, Object, Array, Integer, Float, String, Boolean.
–  Stored in a separate container, separate columns for each node type.
–  Each document is decomposed into a list of paths and nodes.
–  Each node is added to the corresponding column.
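The decomposition of a JSON document into paths and typed nodes might look like the sketch below. The function names, the tuple-based path representation and the choice to store the child count as the payload of Object and Array nodes are assumptions made for illustration.

```python
import json


def decompose(value, path=()):
    """Yield (path, node_type, payload) for every node of a parsed JSON value."""
    if value is None:
        yield path, "Null", None
    elif isinstance(value, bool):  # checked before int: bool is a subclass of int
        yield path, "Boolean", value
    elif isinstance(value, int):
        yield path, "Integer", value
    elif isinstance(value, float):
        yield path, "Float", value
    elif isinstance(value, str):
        yield path, "String", value
    elif isinstance(value, list):
        yield path, "Array", len(value)
        for i, item in enumerate(value):
            yield from decompose(item, path + (i,))
    elif isinstance(value, dict):
        yield path, "Object", len(value)
        for key, item in value.items():
            yield from decompose(item, path + (key,))
    else:
        raise TypeError("not a JSON value: %r" % (value,))


def to_columns(doc_id, document, columns):
    """Append every node of one document to the column of its node type."""
    for path, node_type, payload in decompose(document):
        columns.setdefault(node_type, []).append((doc_id, path, payload))
    return columns
```

Calling `to_columns(1, json.loads('{"title": "Tesla", "tags": ["cars"]}'), {})` puts the root in the Object column, `tags` in the Array column and both strings in the String column, each tagged with its path.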
Analytics API and UI
Analytics API
–  RESTful API – client-server, HTTP requests and response codes, stateless, cacheable, etc.
–  API resource paths, JSON in, JSON out.
–  Most of the APIs require authentication.
–  Simple integration via cx.py, Java/JavaScript/C#/Python/Perl/PHP or direct HTTP calls.
Traffic API
–  A rich set of high-level APIs.
–  Powerful ad-hoc syntax – types, groups, items, filters, fields, etc.
–  See the demo!
Analytics UI
–  HTML and JavaScript.
–  Is built on top of the Analytics API.
–  Has multiple fixed, functional views which can be combined with arbitrary filters.
–  Premium users have a workspace area for dynamic, configurable widgets.
Demo Session
Thank you!
Questions?
Credits: Erik Gorset & Oslo Dev Team
Connect with Cxense
…btw, we are hiring!
www.cxense.com
https://twitter.com/cxense
www.facebook.com/cxense
www.linkedin.com/company/cxense
simon.jonassen@cxense.com
