Design Patterns for 360º Views
using HBase and Kiji
Jonathan Natkins
Who am I?
Jon “Natty” Natkins
Field Engineer at WibiData
Formerly at Cloudera/Vertica
What is a 360º View?
What is a 360º View For?
Past
What interactions has a customer had in the past?
Present
What is the customer doing right now?
Future
What is the customer likely to do next?
Past and present inform the future
What If I Don’t Care About Customers?
Generalizing the 360º View:
Entity-Centric Systems
Goal of an Entity-Centric System
“Show me everything I know about Natty”
What Data Do I Need to Store?
Static data
Event-oriented data
Derived data
Building Entity-Centric Systems
Often, this is an EDW with a star schema
[Diagram: star schema, one Fact table linked to four Dim (dimension) tables]
Challenges With Star Schemas
How do we answer the original question?
Full table scan + joins
OLTP systems will likely fall over from the
volume
OLAP systems are usually not optimized for
single-row lookups
Need Something
Else…
Why HBase?
HBase rows can store both static and
event-oriented data
Cell versions are key
Single-row lookups are extremely fast
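The idea that one row can hold both static and event-oriented data through cell versions can be sketched with a toy in-memory model. This is only an illustration, not HBase client code: the `Row` class, its method names, and the sample values are all hypothetical.

```python
from collections import defaultdict


class Row:
    """Toy model of one wide row: column -> {timestamp: value}."""

    def __init__(self):
        self.cells = defaultdict(dict)

    def put(self, column, timestamp, value):
        # Writing at a new timestamp adds a version instead of overwriting.
        self.cells[column][timestamp] = value

    def get_latest(self, column):
        versions = self.cells[column]
        return versions[max(versions)]

    def get_versions(self, column):
        # Newest first, which is the order HBase returns versions in.
        return [v for _, v in sorted(self.cells[column].items(), reverse=True)]


row = Row()
row.put("info:name", 1, "Natty")                # static data: a single version
row.put("info:purchases", 100, "book")          # event data: one version per event
row.put("info:purchases", 200, "headphones")
row.put("info:purchases", 300, "coffee")
```

Reading the whole entity is then a single-row lookup: `row.get_latest("info:name")` for the static value and `row.get_versions("info:purchases")` for the event history.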
Kiji is for Building Entity-Centric Systems
Often used for:
Building recommendation systems
Personalized search
Real-time HBase applications
Underlying technologies:
Designing an Entity-Centric
Datastore
Ask yourself this: what is the entity?
Determine your entity by determining how
you want to analyze the data
It’s ok to have data organized in multiple
ways
Schema Management with Kiji
Sometimes you actually want a schema layer
Defining a schema allows for data discoverability
Column Families in Kiji
Kiji has two types of column families
Group families are similar to relational
tables
Predefined set of columns
Each column has its own data type
Map families specify columns at runtime
Every column has the same data type
Knowing When To Use Different
Family Types
Do you know all of your columns up front?
Then use a group family
Map families are for when you don’t know
your columns ahead of time
[Diagram: example row with columns info:name, info:email, info:purchases, sessions:1234, sessions:2345]
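The difference between the two family types can be sketched as two small validators. This is a hypothetical model, not the Kiji API: the class names, the `validate` method, and the Python types chosen for each column are assumptions made for illustration.

```python
class GroupFamily:
    """Fixed set of columns declared up front; each column has its own type."""

    def __init__(self, name, columns):
        self.name = name
        self.columns = dict(columns)  # qualifier -> expected Python type

    def validate(self, qualifier, value):
        if qualifier not in self.columns:
            raise KeyError(f"{self.name}:{qualifier} is not a declared column")
        expected = self.columns[qualifier]
        if not isinstance(value, expected):
            raise TypeError(f"{self.name}:{qualifier} expects {expected.__name__}")


class MapFamily:
    """Qualifiers are chosen at write time; every value shares one type."""

    def __init__(self, name, value_type):
        self.name = name
        self.value_type = value_type

    def validate(self, qualifier, value):
        if not isinstance(value, self.value_type):
            raise TypeError(f"{self.name}:{qualifier} expects {self.value_type.__name__}")


# Column types here are assumptions, not taken from the slides.
info = GroupFamily("info", {"name": str, "email": str, "purchases": str})
sessions = MapFamily("sessions", dict)

info.validate("name", "Natty")             # declared column, correct type
sessions.validate("1234", {"pages": 3})    # qualifier invented at write time
```

A write to an undeclared column such as `info:nickname` is rejected by the group family, while the map family accepts any qualifier as long as the value type matches.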
Choosing a Row Key
Row keys in Kiji are componentized
[ ‘component1’, ‘component2’, 1234 ]
More efficient than byte arrays
Consider ‘1234567890’ versus [ 1234567890 ]
Good for scanning areas of the keyspace
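A rough way to see why a typed component can beat a plain string: the same number takes fewer bytes in a fixed-width binary encoding, and big-endian encodings of non-negative numbers sort in numeric order. The slides do not say how Kiji encodes components, so the `struct` formats below are an assumption used only to illustrate the point.

```python
import struct

as_text = "1234567890".encode("utf-8")    # one byte per digit
as_int = struct.pack(">i", 1234567890)    # 4-byte big-endian integer
as_long = struct.pack(">q", 1234567890)   # 8-byte big-endian integer

sizes = (len(as_text), len(as_int), len(as_long))

# Byte-wise ordering of text puts "9" after "10"; fixed-width
# big-endian integers keep 9 before 10, which helps range scans.
text_order_is_numeric = b"9" < b"10"
binary_order_is_numeric = struct.pack(">q", 9) < struct.pack(">q", 10)
```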
A Common Use for
Components
Known user IDs versus unknown IDs
On a website, how do you differentiate
between a logged-in or cookie’d user and a
brand new visitor?
[ ‘K’, ‘user1234’ ] or [ ‘U’, ‘unknown2345’ ]
Physically and logically separate rows
Run jobs over all known or unknown users
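Because the user type is the first key component, all known users sort together and all unknown users sort together, so a job can cover one group with a single contiguous scan. The sketch below models this with a plain dictionary; `table` and `scan_by_type` are hypothetical names, and a real scan would use a key range rather than filtering every key.

```python
table = {
    ("K", "user1234"): {},
    ("U", "unknown2345"): {},
    ("K", "user5678"): {},
    ("U", "unknown9999"): {},
}


def scan_by_type(table, user_type):
    """Return row keys whose first component matches, in sorted key order."""
    return [key for key in sorted(table) if key[0] == user_type]


known = scan_by_type(table, "K")
unknown = scan_by_type(table, "U")
```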
Identifying Known Users
Problem: Users have many cookies over
time.
Challenge: Ideally, we would have a single
row for each user. How do we ensure that
new data goes to the right row?
Finding Known Users With
Lookup Tables
HBase get operations are fast
It’s easy enough to create a table that
contains a mapping of cookies to known
user IDs
When data is loaded, check the lookup
table to determine if you should write data
to an existing row or a new one
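The load-time check can be sketched as a single lookup followed by a choice of row key. The dictionary stands in for the lookup table, and the function name, cookie values, and key layout are assumptions for illustration.

```python
# Stand-in for the lookup table: cookie -> known user ID.
cookie_to_user = {
    "cookie-abc": "user1234",
    "cookie-def": "user1234",  # the same user seen with a second cookie
}


def resolve_row_key(cookie, lookup):
    """Pick the row that new data for this cookie should be written to."""
    user_id = lookup.get(cookie)  # one fast get against the lookup table
    if user_id is not None:
        return ("K", user_id)     # existing row for a known user
    return ("U", cookie)          # new row keyed by the unknown cookie
```

Both cookies for `user1234` resolve to the same row, so that user's events land in one place regardless of which cookie produced them.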
Avoiding Hotspots
Unhashed Row Keys
[Diagram: Nodes 1-3 hosting regions A-B, B-C, D-E, F-G, H-I, J-K]
Hash-Prefixed Row Keys
[Diagram: Nodes 1-3 hosting regions 00A-0fK, 10A-1fK, 20A-2fK, 30A-3fK, 40A-4fK, 50A-5fK]
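A minimal sketch of hash-prefixing, assuming an MD5 digest and a one-byte prefix (the slides do not state which hash or prefix width Kiji uses). Sequential IDs, which would otherwise pile into one region, end up spread across the prefix space.

```python
import hashlib


def hash_prefixed_key(user_id, prefix_bytes=1):
    """Prepend a short hash of the ID so adjacent IDs land in different regions."""
    digest = hashlib.md5(user_id.encode("utf-8")).digest()
    return digest[:prefix_bytes].hex() + user_id


# Sequential IDs share a long common prefix when unhashed.
ids = [f"user{n:06d}" for n in range(1000)]
keys = [hash_prefixed_key(i) for i in ids]
distinct_prefixes = {k[:2] for k in keys}
```

The trade-off is that a range scan over raw IDs is no longer contiguous. Single-row gets still work because the prefix can be recomputed from the ID.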
Storing Event Series
360º views need easy access to all the
transactions and events for a user
HBase cells may contain more than one
version
Kiji leverages this to store event series
data like clicks or purchases
[Diagram: example row with columns info:name, info:email, info:purchases, sessions:1234, sessions:2345]
How Many Events is Too Many?
The HBase book warns that too many
versions of a cell can cause StoreFile
bloat
HBase will never split a row
Common tactic is to add a timestamp
range to the row key
Kiji makes this easy with componentized row keys
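One way to picture the timestamp-range tactic is to add a time bucket as an extra key component, so each user's events are spread over several bounded rows rather than one ever-growing row. The monthly bucket and the function name below are assumptions; the right bucket size depends on event volume.

```python
from datetime import datetime, timezone


def bucketed_row_key(user_id, event_time_ms):
    """Build a key such as ('K', 'user1234', '2013-06') from an event time."""
    when = datetime.fromtimestamp(event_time_ms / 1000, tz=timezone.utc)
    return ("K", user_id, when.strftime("%Y-%m"))


june = int(datetime(2013, 6, 15, tzinfo=timezone.utc).timestamp() * 1000)
july = int(datetime(2013, 7, 2, tzinfo=timezone.utc).timestamp() * 1000)
```

Events from the same month share a row, and reading a user's full history becomes a short scan over that user's buckets.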
Beware of Timestamp Misuse
A major reason the HBase book warns
against mucking with timestamps is that
they can be dangerous
What happens if you use a sequence number
as a timestamp? Think about TTLs
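To make the TTL question concrete, here is a simplified model of expiry: a cell is dropped once its timestamp is older than the TTL relative to the current time. The function is a sketch of that rule, not HBase code. A sequence number such as 42, read as milliseconds since the epoch, looks like a moment in 1970 and is therefore expired immediately.

```python
import time


def is_expired(cell_timestamp_ms, ttl_seconds, now_ms):
    """Simplified TTL rule: expired when the cell is older than the TTL."""
    return now_ms - cell_timestamp_ms > ttl_seconds * 1000


now_ms = int(time.time() * 1000)
thirty_days = 30 * 24 * 3600

real_timestamp_expired = is_expired(now_ms, thirty_days, now_ms)  # just written
sequence_number_expired = is_expired(42, thirty_days, now_ms)     # "written" in 1970
```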
Iterate and Evolve
Why is Evolution Necessary?
No entity-centric system will be the end-all,
be-all the first time around
Data sources in large enterprises are
usually heavily siloed
Start small
Incorporate new data sources over time
Putting it Together
Kiji includes a shell for creating tables
with DDL
Many of the features discussed so far
can be declared in the DDL
Users Table
CREATE TABLE 'user_events' WITH DESCRIPTION 'Events table for online users.'
ROW KEY FORMAT (type STRING, user_id STRING NOT NULL, HASH(THROUGH user_id))
PROPERTIES (NUMREGIONS = 32)
WITH LOCALITY GROUP default WITH DESCRIPTION 'main storage' (
  MAXVERSIONS = INFINITY,
  TTL = FOREVER,
  INMEMORY = false,
  MAP TYPE FAMILY events CLASS com.kiji.avro.Event WITH DESCRIPTION 'events'
),
LOCALITY GROUP memory WITH DESCRIPTION 'recs storage' (
  MAXVERSIONS = 10,
  TTL = FOREVER,
  INMEMORY = true,
  FAMILY recs (
    recommended CLASS com.kiji.avro.ProductRecList
      WITH DESCRIPTION 'Recommended products.'
  )
);
In Summary…
Designing applications in an entity-centric
fashion can make them easier to build and
more efficient
Kiji can speed up the development
process of 360º views
Questions?
Contact me
natty@wibidata.com
@nattyice
The Kiji Project: kiji.org
