ADVANCED DATA MODELING AND BITMAP INDEXES
Matt Stump
mstump@kissmetrics.com
Monday, May 6, 13
Who Are Your Customers?
Where Do They Hang Out?
How Should You Engage?
What is User Experience?
What Is My Data?
Form Follows Function
Data Follows Queries
Primary Key
CREATE TABLE users (
  username text PRIMARY KEY,
  first_name text,
  last_name text,
  postal_code text,
  last_login timestamp
);
INSERT INTO users
  (username, first_name, last_name, postal_code, last_login)
VALUES ('cstar', 'Cassandra', 'Database', '11111', '2013-04-04');
SELECT first_name, last_name
FROM users WHERE username = 'cstar';
Primary Key
RowKey   username   first_name   last_name   postal_code
cstar    cstar      Cassandra    Database    11111
user2    user2      Some         Guy         22222
Secondary Index
CREATE INDEX user_zipcode ON users(postal_code);
postal_code   usernames
11111         cstar
22222         user2, user3, user456, ...
Where Secondary Indexes Break
1. High-cardinality data
2. Only one index per query
3. Indexes are distributed
4. Only some datatypes; no counters
5. Range queries are expensive
Roll Your Own Using Wide Rows
RowKey   05/02/2012   02/01/2013   05/02/2013   ...
user2    JSON         JSON         JSON         JSON
All events for “user2”, indexed by time
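The wide-row layout can be pictured as one sorted map per user. The sketch below is only an in-memory illustration in Python, not Cassandra code: the `rows` structure, the function names, the event payloads, and the month/day reading of the slide's dates are all assumptions made for the example.

```python
import bisect
import json
from collections import defaultdict
from datetime import date

# Hypothetical stand-in for a wide row: one row per user, with columns
# kept sorted by event date. All names here are illustrative.
rows = defaultdict(list)

def add_event(user, day, payload):
    """Insert an event into the user's row, keeping columns in date order."""
    bisect.insort(rows[user], (day, json.dumps(payload, sort_keys=True)))

def events_between(user, start, end):
    """Slice one user's row: events with start <= date <= end."""
    columns = rows[user]
    lo = bisect.bisect_left(columns, (start, ""))
    hi = bisect.bisect_right(columns, (end, "\uffff"))
    return [(day, json.loads(raw)) for day, raw in columns[lo:hi]]

add_event("user2", date(2013, 5, 2), {"event": "login"})
add_event("user2", date(2012, 5, 2), {"event": "signup"})
add_event("user2", date(2013, 2, 1), {"event": "purchase"})

events_2013 = events_between("user2", date(2013, 1, 1), date(2013, 12, 31))
```

A time slice within one row is cheap, but every lookup starts from a single row key, so a question that spans users has to visit each row in turn.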
Limitations to Rolling Your Own
1. Can’t query across rows
2. Only some datatypes; no counters
3. Requires lots of work in the application
4. No complex queries
What Do I Need?
A Query Engine Wishlist
1. High-cardinality data; counters
2. Complex queries, multiple clauses
3. Results in < 500ms for billions of rows
4. Sub-field searching; regex
5. Range queries
First Iteration: Ginormous String Sets
postal_code   usernames
11111         cstar
22222         user2, user3, user456, ...
11111 22222
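A minimal Python sketch of the string-set approach, assuming a hand-rolled index that maps each postal code to a set of usernames. The `postal_index` name and the helper function are hypothetical.

```python
# Hypothetical hand-rolled index: postal code -> set of usernames.
postal_index = {
    "11111": {"cstar"},
    "22222": {"user2", "user3", "user456"},
}

def users_in_any(codes):
    """Union of the username sets for the given postal codes."""
    result = set()
    for code in codes:
        result |= postal_index.get(code, set())
    return result

matches = users_in_any(["11111", "22222"])
```

Each set stores full username strings, so both storage and set operations grow with the number and length of the names.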
Bitmaps
Bitmaps: How Do They Work?
postal_code   Bits 0-7   Bits 8-15   Bits 16-23   Bits 24-31
11111         11010011   1011011     1010000      00000000
22222         00000000   0011011     00000000     00000000
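One way to read the table: every user gets an integer position, and each distinct value owns a bitmap with a 1 at the positions of the users that have that value. The sketch below uses Python integers as bitmaps. The user positions are made up for illustration and do not reproduce the bit patterns on the slide.

```python
from collections import defaultdict

# Hypothetical assignment of users to bit positions.
user_ids = {"cstar": 0, "user2": 9, "user3": 10, "user456": 12}
postal_codes = {
    "cstar": "11111",
    "user2": "22222",
    "user3": "22222",
    "user456": "22222",
}

# One bitmap (a Python int) per distinct postal code.
bitmaps = defaultdict(int)
for user, code in postal_codes.items():
    bitmaps[code] |= 1 << user_ids[user]

def render(bitmap, width=32):
    """Show a bitmap as groups of 8 bits, with bit 0 on the left."""
    bits = "".join("1" if (bitmap >> i) & 1 else "0" for i in range(width))
    return " ".join(bits[i:i + 8] for i in range(0, width, 8))
```

Calling `render(bitmaps["22222"])` prints the row for that postal code in the same left-to-right layout as the table.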
Bitmaps: Equality
postal_code   Bits 0-7   Bits 8-15   Bits 16-23   Bits 24-31
11111         11010011   1011011     1010000      00000000
22222         00000000   0011011     00000000     00000000
SELECT * FROM users WHERE postal_code IN ('11111','22222');
IN matches either value, so the two bitmaps are combined with a bitwise OR:
                Bits 0-7   Bits 8-15   Bits 16-23   Bits 24-31
11111 | 22222   11010011   1011011     1010000      00000000
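An IN clause over several values is a union, so it maps to a bitwise OR of the per-value bitmaps, while separate clauses joined by AND map to a bitwise AND. A small Python sketch, assuming hypothetical user positions and a made-up second index named `active`:

```python
def bitmap_of(ids):
    """Build a bitmap (Python int) with a 1 at each given position."""
    bitmap = 0
    for i in ids:
        bitmap |= 1 << i
    return bitmap

def ids_of(bitmap):
    """List the positions of the set bits, lowest first."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

# Hypothetical data: which user positions hold each postal code.
postal = {
    "11111": bitmap_of([0, 1, 3]),
    "22222": bitmap_of([10, 11, 13]),
}
# Hypothetical second index, to show combining two clauses.
active = bitmap_of([1, 11, 20])

in_either_code = postal["11111"] | postal["22222"]   # postal_code IN (...)
in_code_and_active = in_either_code & active          # ... AND active
```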
Bitmaps: Range, or How Do I Query Counters?
Field    Value   Bits 0-7   Bits 8-15   Bits 16-23   Bits 24-31
Event2   1       11010011   1011011     1010000      00000000
Event2   4       00000000   0011011     00000000     00000000
SELECT * FROM users WHERE Event2 > 0 AND Event2 < 5;
The range is answered by ORing the bitmaps of every stored value inside it (here 1 and 4):
        Bits 0-7   Bits 8-15   Bits 16-23   Bits 24-31
1 | 4   11010011   1011011     1010000      00000000
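A range over a counter can be sketched as the OR of the bitmaps for each stored value that falls inside the range. The Python below is an illustration only: the counter values, the user positions, and the function names are assumptions.

```python
def bitmap_of(ids):
    """Build a bitmap (Python int) with a 1 at each given position."""
    bitmap = 0
    for i in ids:
        bitmap |= 1 << i
    return bitmap

def ids_of(bitmap):
    """List the positions of the set bits, lowest first."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

# Hypothetical index for the counter field Event2: value -> bitmap of users.
event2 = {
    1: bitmap_of([0, 1, 4]),
    4: bitmap_of([9, 10]),
    7: bitmap_of([2]),
}

def range_query(index, low, high):
    """Users whose value satisfies low < value < high (exclusive bounds)."""
    result = 0
    for value, bitmap in index.items():
        if low < value < high:
            result |= bitmap
    return result

hits = range_query(event2, 0, 5)  # Event2 > 0 AND Event2 < 5
```

The cost depends on how many distinct values fall inside the range, not on how many users match.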
Trigrams; AKA You Promised REGEX
Field       Value   Bits 0-7   Bits 8-15   Bits 16-23   Bits 24-31
last_name   “foo”   11010011   1011011     1010000      00000000
last_name   “bar”   00000000   0011011     00000000     00000000
SELECT * FROM users WHERE last_name ~= 'foo.*bar';
                Bits 0-7   Bits 8-15   Bits 16-23   Bits 24-31
“foo” & “bar”   00000000   0011011     00000000     00000000
INSERT INTO users
  (username, first_name, last_name, postal_code, last_login)
VALUES ('foobar82', 'johnny', 'foobar', '94110', '2013-04-04');
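A trigram index can narrow a regex search to a few candidates before the regex itself runs. The Python sketch below assumes a tiny, made-up set of last names, and it takes the required trigrams as an explicit argument. Deriving them from the pattern automatically is the harder part and is not attempted here.

```python
import re
from collections import defaultdict

def trigrams(text):
    """All 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}

# Hypothetical data: user position -> last_name.
last_names = {0: "foobar", 1: "foo", 2: "barfoo", 3: "smith"}

# Trigram -> bitmap of the users whose last_name contains it.
index = defaultdict(int)
for uid, name in last_names.items():
    for tri in trigrams(name):
        index[tri] |= 1 << uid

def candidates(required_trigrams):
    """AND the bitmaps of every required trigram."""
    bitmap = (1 << len(last_names)) - 1  # start with all users
    for tri in required_trigrams:
        bitmap &= index.get(tri, 0)
    return [uid for uid in sorted(last_names) if (bitmap >> uid) & 1]

def search(pattern, required_trigrams):
    """Filter by trigram bitmaps, then confirm with the real regex."""
    regex = re.compile(pattern)
    return [uid for uid in candidates(required_trigrams)
            if regex.search(last_names[uid])]
```

The bitmap step can return false positives such as "barfoo", which contains both trigrams in the wrong order, so the final regex check is still needed.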
Not Everything is Roses and Honey
1. Indexes can be huge
2. Requires a read before write
3. Requires synchronization
Compression
RLE Compression: How it Works
Segment                 Bits
Header                  1010
Fill, 11 blocks of 1s   10000000001011
Literal, 15 bits        111010000100101
Fill, 18 blocks of 0s   000000000010010
Literal, 15 bits        000000010000011
Example taken from PWAH: http://www.sjvs.nl/?p=72
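The idea behind the fill and literal segments can be sketched in Python as follows. This is a simplified block-level run-length scheme, assuming 15-bit blocks as in the example. It is not the PWAH wire format: the header word and the packed bit layout are left out, and the tuple representation is invented for readability.

```python
BLOCK = 15  # bits per block, matching the 15-bit segments in the example

def compress(bits):
    """Turn a bit string into fill and literal segments.

    Runs of all-0 or all-1 blocks become ("fill", bit, block_count);
    any other block is stored verbatim as ("literal", block).
    """
    if len(bits) % BLOCK != 0:
        raise ValueError("bit string length must be a multiple of BLOCK")
    segments = []
    for start in range(0, len(bits), BLOCK):
        block = bits[start:start + BLOCK]
        if block in ("0" * BLOCK, "1" * BLOCK):
            last = segments[-1] if segments else None
            if last and last[0] == "fill" and last[1] == block[0]:
                segments[-1] = ("fill", block[0], last[2] + 1)
            else:
                segments.append(("fill", block[0], 1))
        else:
            segments.append(("literal", block))
    return segments

def decompress(segments):
    """Expand fill and literal segments back into the original bit string."""
    parts = []
    for segment in segments:
        if segment[0] == "fill":
            parts.append(segment[1] * BLOCK * segment[2])
        else:
            parts.append(segment[1])
    return "".join(parts)

# The four segments from the example, expanded to raw bits.
raw = ("1" * (11 * BLOCK) + "111010000100101"
       + "0" * (18 * BLOCK) + "000000010000011")
segments = compress(raw)
```

Here 465 raw bits collapse into four segments, which is where the savings come from when bitmaps contain long runs of identical bits.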
Dealing with Read Before Write
Partition Index Using a Ring
{
  "product": 124,
  "user": 22,
  "event": "event2",
  "value": "Name=Jonathan+Doe&Age=23"
}
Apply a hash to a user-configured field:
hash(:product) = c62fb32eadd5a0fcceb1ddf2697e2345c604f451
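Routing by a hashed field might look like the Python sketch below. The slide does not name the hash function, so SHA-1 is an assumption here (its 40-hex-digit output matches the length of the digest shown). The `Ring` class, the node names, and the way the field value is serialized before hashing are also hypothetical.

```python
import bisect
import hashlib

def field_hash(event, field):
    """Hex SHA-1 digest of the configured field's value (SHA-1 is an assumption)."""
    return hashlib.sha1(str(event[field]).encode("utf-8")).hexdigest()

class Ring:
    """Minimal hash ring: each node owns the arc ending at its own hash point."""

    def __init__(self, nodes):
        self._points = sorted(
            (int(hashlib.sha1(name.encode("utf-8")).hexdigest(), 16), name)
            for name in nodes
        )
        self._keys = [point for point, _ in self._points]

    def node_for(self, event, field):
        """Pick the first node clockwise from the event's hash."""
        position = int(field_hash(event, field), 16)
        index = bisect.bisect_left(self._keys, position) % len(self._points)
        return self._points[index][1]

nodes = ["node-a", "node-b", "node-c"]  # hypothetical node names
ring = Ring(nodes)

event = {"product": 124, "user": 22, "event": "event2",
         "value": "Name=Jonathan+Doe&Age=23"}
other = {"product": 124, "user": 99, "event": "event7", "value": ""}

owner = ring.node_for(event, "product")
```

Every event with the same `product` lands on the same node, so one node sees all updates to that product's bitmaps.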
Ring Partitioning
1. Solves read before write
2. Solves synchronization issues
3. Ensures index locality
4. Easy to isolate big customers
5. Index size is limited to the largest customer
Sparse Indexes
         Offset 0x00        Offset 0x01        Offset 0xA0        Offset 0xF0
Field1   0111010101101111   1001010100100101   0111010000100101   0111011100100101
Only Store the Set Bits
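A sparse bitmap can be sketched as a map from word offset to a 16-bit word, where words that are entirely zero are never stored. The Python below is illustrative only: the 16-bit word size follows the table, and the function names and sample positions are assumptions.

```python
WORD = 16  # bits per stored word, matching the 16-bit words in the table

def to_sparse(bit_positions):
    """Build {word_offset: word}, storing only words that contain a set bit."""
    words = {}
    for position in bit_positions:
        offset, bit = divmod(position, WORD)
        words[offset] = words.get(offset, 0) | (1 << bit)
    return words

def sparse_and(a, b):
    """AND two sparse bitmaps; only offsets present in both can survive."""
    result = {}
    for offset in a.keys() & b.keys():
        word = a[offset] & b[offset]
        if word:
            result[offset] = word
    return result

def set_positions(words):
    """List the set bit positions, lowest first."""
    return sorted(offset * WORD + bit
                  for offset, word in words.items()
                  for bit in range(WORD) if (word >> bit) & 1)

field1 = to_sparse([3, 17, 2560, 3850])   # lands in offsets 0x00, 0x01, 0xA0, 0xF0
field2 = to_sparse([17, 3850, 99999])
both = sparse_and(field1, field2)
```

Storage and intersection cost then scale with the number of non-empty words rather than with the highest bit position.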
The Whole Enchilada
[Diagram labels: Queries and Events; Query & Indexing Engine]
Goals
1. Core query and index engine, wrapped
2. Extensible events and queries via Lua
3. Equality, range and REGEX queries
4. Distributed, <500ms for billions of rows
5. No single point of failure
Resources
Lots of Papers on Bitmap Compression
http://www-users.cs.umn.edu/~kewu/annotated.html
How Google Code Search Worked
http://swtch.com/~rsc/regexp/regexp4.html
Got Any Questions?
Thanks
Eric Tschetter of the Druid Project and the Cassandra devs, for answering my questions
Thank You!
Matt Stump
www.matthewstump.com
@mattstump
