Using approximate data structures
for small, insightful analytics.
Ben Kornmeier, Engineer
©2016 ProtectWise, Inc. All rights reserved. Proprietary & Confidential.
About ProtectWise
● Cloud security platform that aims to make threats
actionable and obvious.
● Aims to cut down on the amount of “noise” that a
network can create, and only show the most important
details.
● Has a big emphasis on real time data.
● Ingests and processes terabytes of data a day.
Goals Of Count Sumula
● Quick report generation.
● Support high cardinality data.
● Compute averages, min, and max.
● Easy to add additional aggregations.
Challenge: Daily Data Ingestion
● 2 billion netflow updates.
● Ingests 20TB of raw network traffic.
● Generates 150 million observations.
Challenge: Costs of Processing Data.
● Traditional batch processing is accurate, but slow.
○ We want results in seconds not hours or days.
● Compute resources are very expensive at our scale.
Challenge: Making a Great User Experience
● A user should expect:
○ Hardly any waiting for reports to generate.
○ Up to date reports.
○ Meaningful reports that are actionable and concise.
○ Reports that are persisted forever and can be
recombined after the fact to gain additional insights.
Some Use Cases
● Show me a count of all the hosts that had a threat on
them in the past year.
● Show me the hosts with the most threats encountered
over the course of a year.
Use Cases Examined
● Show me a count of all the hosts that had a threat on
them in the past year.
○ IP addresses have very high cardinality: about 340 undecillion possible values for IPv6 (2^128).
■ Or: 340,282,366,920,938,463,463,374,607,431,768,211,456 (WOW!)
○ Storage costs for exact counting could be high.
Use Cases Examined Continued
● Show me the hosts with the most threats encountered
over the course of a year.
○ Once again, high cardinality.
○ Same storage costs as the previous example, but now we also have to sort,
which is O(n log n) and tough at this cardinality.
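For comparison, an exact baseline for this use case might look like the sketch below. It is not from the talk: the function name and the sample addresses are made up for illustration. It keeps one counter per distinct host, so memory grows with cardinality, which is the cost the slide is pointing at.

```python
from collections import Counter
import heapq


def exact_top_hosts(events, k):
    """Exact top-k hosts by threat count (hypothetical helper, for illustration).

    Keeps one counter per distinct host, so memory is O(distinct hosts).
    """
    counts = Counter(events)
    # nlargest keeps only k candidates at a time, avoiding a full sort.
    return heapq.nlargest(k, counts.items(), key=lambda kv: kv[1])


# Each event is the address of a host that encountered a threat (sample data).
events = ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.3", "10.0.0.1", "10.0.0.2"]
top = exact_top_hosts(events, 2)
distinct_hosts = len(set(events))
```

When only the top k hosts are needed, `heapq.nlargest` avoids sorting every host, but the per-host counters still have to be stored.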
Considerations For Our Solution
● Be real time.
● Cannot grow without bounds.
● Data must be around for decades or more.
● Be able to return queries for large time ranges.
● Be actionable and concise.
The Realization
● In general, users can live with an approximate result!
○ Approximate results use less space.
○ Can be computed in memory.
○ The size of an approximation can be bounded by trading accuracy for space.
○ Approximate results are fast enough to compute in real time.
○ Meets two of our goals.
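To make the accuracy-for-space trade concrete, here is a small sketch using the standard textbook bounds for two of the structures discussed in this talk. The function names are hypothetical, and the talk does not say which parameters ProtectWise actually used.

```python
import math


def hll_standard_error(num_registers):
    """Relative standard error of HyperLogLog with m registers: about 1.04 / sqrt(m)."""
    return 1.04 / math.sqrt(num_registers)


def cms_dimensions(epsilon, delta):
    """Count-Min Sketch size for a target error.

    With these dimensions an estimate overshoots the true count by at most
    epsilon * N (N = total of all inserted counts) with probability >= 1 - delta.
    """
    width = math.ceil(math.e / epsilon)
    depth = math.ceil(math.log(1.0 / delta))
    return width, depth


error_16k_registers = hll_standard_error(16384)
width, depth = cms_dimensions(epsilon=0.001, delta=0.01)
```

In both cases the size depends only on the chosen error target, not on how many distinct elements are seen, which is what keeps storage bounded.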
Some Approximations We Used
● HyperLogLog
● Count Min Sketch
● Stream Summary
● Bloom Filter
● Layered Bloom Filter
● Compound Approximations
HyperLogLog
● Only tracks the number of consecutive 0 bits in each element's hash, not the elements themselves.
● Uses the longest run of 0 bits seen, and the probability
of such a run occurring, to estimate the number of unique
elements seen.
● Assumes a good hashing function (e.g. MurmurHash3).
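A minimal sketch of the idea, not the implementation used in the talk. Assumptions: it follows the usual published formulation, which counts leading 0 bits per register; it uses SHA-1 from the standard library as a stand-in for Murmur 3; and the class and parameter names are illustrative.

```python
import hashlib
import math


class HyperLogLog:
    """Toy HyperLogLog with 2**p registers (illustrative, not production code)."""

    def __init__(self, p=12):
        if not 7 <= p <= 16:
            raise ValueError("p must be between 7 and 16 for this sketch")
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)  # bias constant for m >= 128

    def _hash64(self, item):
        # Assumption: SHA-1 stands in for Murmur 3, which is not in the stdlib.
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, item):
        x = self._hash64(item)
        index = x >> (64 - self.p)                # first p bits choose a register
        rest = x & ((1 << (64 - self.p)) - 1)     # remaining bits
        # Rank = number of leading 0 bits in `rest`, plus one.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[index]:
            self.registers[index] = rank

    def count(self):
        z = sum(2.0 ** -r for r in self.registers)
        estimate = self.alpha * self.m * self.m / z
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting).
            estimate = self.m * math.log(self.m / zeros)
        return estimate

    def merge(self, other):
        # Register-wise max gives the sketch of the union of both inputs.
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]


hll = HyperLogLog()
for i in range(10000):
    hll.add(f"host-{i}")
    hll.add(f"host-{i}")  # duplicates do not change the estimate
approx_unique = hll.count()
```

Because merging is a register-wise maximum, sketches stored per time interval can be combined later into an estimate for a longer range.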
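Example: HyperLogLog
Assume our hashing function returns only 4 bits (16
combinations). Grouping the patterns by their longest run of consecutive 0 bits:
Longest run of 0s   Bit pattern(s)                                Chance of occurrence
4                   0000                                          1 / 16
3                   1000, 0001                                    2 / 16 or 1 / 8
2                   0011, 1001, 1100, 0100, 0010                  5 / 16
1                   0111, 1011, 1101, 1110, 1010, 0110, 0101      7 / 16
0                   1111                                          1 / 16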
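The probabilities in this example can be checked by enumerating all 16 four-bit patterns and grouping them by their longest run of consecutive 0 bits. This is a small verification sketch; the helper name is made up for illustration.

```python
from collections import defaultdict
from fractions import Fraction


def longest_zero_run(bits):
    """Length of the longest run of consecutive '0' characters in a bit string."""
    return max(len(run) for run in bits.split("1"))


groups = defaultdict(list)
for value in range(16):
    bits = format(value, "04b")          # e.g. 5 -> "0101"
    groups[longest_zero_run(bits)].append(bits)

# Chance of each run length, assuming every pattern is equally likely.
table = {run: Fraction(len(patterns), 16) for run, patterns in groups.items()}

for run in sorted(table, reverse=True):
    print(run, ", ".join(groups[run]), table[run])
```

The enumeration also covers 0101 (longest run of 1) and 1111 (no 0 bits at all), so the chances add up to 1.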
CountMinSketch
● Essentially a matrix.
● Inserts are duplicated across rows.
● Inserts are hashed differently per row.
● Counts can only be added to; elements are never removed.
● Used for frequency estimation.
● Can be used for averages, min, max as well.
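Example: CountMinSketch
Inserting an element
After inserting “Ben”:
row 1:  1     null  null  null  null
row 2:  null  null  1     null  null
After inserting “Eric” (which shares a cell with “Ben” in row 2):
row 1:  1     null  1     null  null
row 2:  null  null  2     null  null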
Example: CountMinSketch Continued
Retrieving the count for “Ben”
row 1:  1     null  1     null  null   (“Ben” maps to the first cell: 1)
row 2:  null  null  2     null  null   (“Ben” maps to the third cell: 2)
Compare the values returned and take the minimum, in this case 1.
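The insert and retrieve steps above can be sketched as follows. This is a minimal illustration, not the talk's implementation: the class name, the default dimensions, and the use of SHA-256 salted with the row number as the per-row hash are all assumptions.

```python
import hashlib


class CountMinSketch:
    """Toy Count-Min Sketch: a depth x width matrix of counters."""

    def __init__(self, width=2048, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _column(self, row, item):
        # Salting the hash with the row number gives each row its own hash function.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        # The insert is duplicated across rows, hashed differently per row.
        for row in range(self.depth):
            self.table[row][self._column(row, item)] += count

    def estimate(self, item):
        # Collisions can only inflate a cell, so the minimum is the best guess.
        return min(
            self.table[row][self._column(row, item)] for row in range(self.depth)
        )


sketch = CountMinSketch()
sketch.add("Ben")
sketch.add("Eric")
sketch.add("Eric")
ben_count = sketch.estimate("Ben")
eric_count = sketch.estimate("Eric")
```

An estimate is never below the true count; it is too high only when the element collides with other elements in every row.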
How Did We Store The Approximations?
● We generate enough approximations to create
about 1 GB of data each month.
○ Much less than would be stored for full-fidelity data.
● First approach: just use Redis.
● Second approach: Redis and Cassandra.
First Approach Redis Only
Advantages
● Easy
● Fast
Disadvantages
● Ticking time bomb since Redis is memory only.
Second Approach C* And Redis
Advantages
● C* scales horizontally, so storage can keep growing.
● Redis can be used when speed is important.
● Not a ticking time bomb.
Disadvantages
● Not as easy as previous solution.
How We Use Redis With Cassandra
● Elements are placed in Redis and keyed on bucket
name and time.
● Once an element from the next time interval is
encountered, data is moved from Redis to Cassandra.
Incoming Updates
Three updates arrive in this order:
{"bucket": "observation", "time": 1, "value": 1}
{"bucket": "observation", "time": 1, "value": 2}
{"bucket": "observation", "time": 2, "value": 10}
● Redis and Cassandra both start empty.
● The first update is placed in Redis: {"bucket": "observation", "time": 1, "value": 1}
● The second update has the same bucket and time, so the elements are summed. Redis now holds {"bucket": "observation", "time": 1, "value": 3}
● The third update belongs to the next time interval and is placed in Redis alongside it: {"bucket": "observation", "time": 2, "value": 10}
● The element from time 1 is determined to be expired and written to Cassandra.
● Final state: Cassandra holds {"bucket": "observation", "time": 1, "value": 3}; Redis holds {"bucket": "observation", "time": 2, "value": 10}
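The flow in this walkthrough can be simulated with plain dictionaries. This is only a sketch under stated assumptions: `hot` stands in for Redis and `durable` for Cassandra, no real client library is used, the class name is hypothetical, and values are simple sums rather than serialized approximations.

```python
class BucketAggregator:
    """Simulates the hot/durable split with two dicts (illustrative only)."""

    def __init__(self):
        self.hot = {}      # (bucket, time) -> running sum; plays the role of Redis
        self.durable = {}  # (bucket, time) -> final value; plays the role of Cassandra

    def update(self, bucket, time, value):
        # An element from a later interval expires older intervals of the same
        # bucket, which are then written out in their finalized form.
        expired = [key for key in self.hot if key[0] == bucket and key[1] < time]
        for key in expired:
            self.durable[key] = self.hot.pop(key)
        # Elements with the same bucket and time are summed.
        self.hot[(bucket, time)] = self.hot.get((bucket, time), 0) + value


agg = BucketAggregator()
agg.update("observation", 1, 1)
agg.update("observation", 1, 2)
agg.update("observation", 2, 10)
```

After the three updates, `agg.durable` holds the summed element for time 1 and `agg.hot` holds only the element for time 2. Late-arriving updates for an interval that was already flushed are not handled in this sketch.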
Cassandra Schema
CREATE TABLE buckets (
  name text,             // bucket name
  time_bucket timestamp, // time floored on the next interval up
  time_unit int,         // 1 = minute, 2 = hour, 3 = day
  algorithm text,        // HyperLogLog, CountMinSketch, etc.
  time timestamp,        // the actual time
  d blob,                // serialized data
  PRIMARY KEY ((name, time_bucket, time_unit, algorithm), time)
);
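One possible reading of "time floored on next interval up" is that minute rows are grouped by their hour and hour rows by their day. The sketch below builds a partition key under that reading. This interpretation is an assumption, as is grouping day rows by month; the talk does not spell out either, and the function names are hypothetical.

```python
from datetime import datetime, timezone

# Mirrors the time_unit comment in the schema: 1 = minute, 2 = hour, 3 = day.
MINUTE, HOUR, DAY = 1, 2, 3


def time_bucket(ts, time_unit):
    """Floor ts to the interval one level above time_unit (assumed reading)."""
    if time_unit == MINUTE:
        return ts.replace(minute=0, second=0, microsecond=0)       # start of hour
    if time_unit == HOUR:
        return ts.replace(hour=0, minute=0, second=0, microsecond=0)  # start of day
    if time_unit == DAY:
        # Assumption: day rows are grouped by calendar month.
        return ts.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    raise ValueError(f"unknown time_unit: {time_unit}")


def partition_key(name, ts, time_unit, algorithm):
    """Tuple matching the partition key (name, time_bucket, time_unit, algorithm)."""
    return (name, time_bucket(ts, time_unit), time_unit, algorithm)


ts = datetime(2016, 9, 8, 14, 37, 21, tzinfo=timezone.utc)
key = partition_key("observation", ts, MINUTE, "HyperLogLog")
```

Under this reading, all minute-level rows for one hour land in the same partition, ordered by the `time` clustering column, so a range query over that hour reads a single partition.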
Advantages of using Cassandra and Redis
● Elements are written in their finalized form to Cassandra.
○ Compactor friendly.
● Updates can happen very fast since Redis is fast.
● Redis memory consumption is no longer unbounded.
Caveats
● Approximations are just that: approximate.
● It takes time to understand how they work.
● Tuning needs up-front knowledge of usage.
https://www.protectwise.com/careers.html
Especially if you’re in Denver!
We’re Hiring!

Using Approximate Data for Small, Insightful Analytics (Ben Kornmeier, ProtectWise) | Cassandra Summit 2016
