HBase and Drill: How loosely typed SQL is ideal for NoSQL

© 2014 MapR Technologies
Contact Information
Ted Dunning
Chief Applications Architect at MapR Technologies
Committer & PMC member for Apache Drill, ZooKeeper & Mahout
VP of Incubator at the Apache Software Foundation
Email tdunning@apache.org tdunning@maprtech.com
Twitter @ted_dunning
Hashtag today: #BDE2015

Agenda
• What does good mean?
• What do we mean by loose typing?
• Examples of what you can do
• Real database with 10-20x fewer tables
• Looking forward
• Questions

What Does Good Mean (for a DB)?
• Expressive
– Must express the concepts we need
• Efficient
– Must run fast enough on cheap enough hardware
• Introspectable
– Must be able to inspect the data and schema and gain understanding

What is New Here
• Introspection is better when
– A minimal number of data entities is used to describe our model
– There is no name overflow
– Referential scoping helps narrow our focus to a simpler problem
– Many-to-one relations can be in-lined
• Introspection was not a goal for the design of the relational model
• Introspection was therefore not a result either

Older than Dirt
• Relational theory is old (1970)
– Predates data structures
– Predates mainstream recursive procedures
– Predates lexical scoping
– Predates logic programming
– Predates real functional programming (Church, McCarthy, Iverson and Backus notwithstanding)
• Some updates are in order to enhance introspection

Contrast Relational and HBase Style noSQL
Relational
• Rows containing fields
• Fields contain primitive types
• Structure is fixed and uniform
• Structure is pre-defined
• Referential integrity (optional)
• Expressions over sets of rows
HBase / MapR DB
• Rows contain fields
• Fields contain bytes
• Structure is flexible
• No pre-defined structure
• Single key
• Column families
• Timestamps
• Versions

Contrast relational and HBase with Structuring
Relational
• Rows containing fields
• Structure is fixed and uniform
• Structure is pre-defined
• Referential integrity (optional)
• Expressions over sets of rows
HBase + Structuring
• Rows contain fields
– Or objects, or lists
• Structure is flexible, ragged
• No pre-defined structure
• Single key

Turtle Models for Databases
• Allow complex objects in field values
– JSON-style lists and objects
• Allow references to objects via join
– Includes references localized within lists
• Lists of objects and objects of lists are isomorphic to tables, so we get
– Complex data in tables
– But also tables in complex data
– Even tables containing complex data containing tables

Proviso and Warning
• This is not your father’s BLOB
• And not the same as arrays with lateral view joins
• Rationale to come as we talk about idioms

A Catalog of noSQL Idioms

Tables as Objects, Objects as Tables
Row-wise form (columns c1, c2, c3): list of objects
[ { c1:v1, c2:v2, c3:v3 },
  { c1:v1, c2:v2, c3:v3 },
  { c1:v1, c2:v2, c3:v3 } ]
Column-wise form (columns c1, c2, c3): object containing lists
{ c1:[v1, v2, v3],
  c2:[v1, v2, v3],
  c3:[v1, v2, v3] }
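
The equivalence between the two forms can be sketched in a few lines of Python. This is only an illustration: the function names `rows_to_columns` and `columns_to_rows` and the sample values are hypothetical, and the sketch assumes every row has the same keys.

```python
def rows_to_columns(rows):
    """Turn a list of objects (row-wise) into an object of lists (column-wise)."""
    if not rows:
        return {}
    return {key: [row[key] for row in rows] for key in rows[0]}


def columns_to_rows(columns):
    """Turn an object of lists (column-wise) back into a list of objects."""
    if not columns:
        return []
    keys = list(columns)
    return [dict(zip(keys, values)) for values in zip(*columns.values())]


# Hypothetical sample table with columns c1, c2, c3.
rows = [
    {"c1": 1, "c2": "a", "c3": True},
    {"c1": 2, "c2": "b", "c3": False},
    {"c1": 3, "c2": "c", "c3": True},
]
columns = rows_to_columns(rows)
round_trip = columns_to_rows(columns)
```

Converting to the column-wise form and back returns the original rows, which is what makes either form usable as a table-valued field.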

Micro Columnar Formats
• An entire table stored in columnar form can be a first-class value using these techniques
• This is very powerful for in-lining one-to-many relations

Note
• If embedded tables are first-class, schema becomes data
• If schema is data-driven when embedded, constructs that elevate tables to top-level cannot have a schema fixed in advance
• Thus, embedded first-class objects imply late discovery of schema information
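
A minimal sketch of what late schema discovery can look like, assuming records arrive as plain Python dictionaries. The function name `discover_schema` and the sample records are hypothetical, and this is not how any particular engine does it.

```python
def discover_schema(records):
    """Collect, per field name, the set of value types actually observed."""
    schema = {}
    for record in records:
        for field, value in record.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema


# Hypothetical ragged records: fields and types vary from row to row.
records = [
    {"id": 1, "tags": ["rock"]},
    {"id": 2, "tags": ["pop", "soul"], "rating": 4.5},
    {"id": "x3"},
]
schema = discover_schema(records)
```

The schema here is a result of reading the data, not an input to it, which is the point of the note above.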

A first example:
Time-series data

Column names as data
• When column names are not pre-defined, they can convey information
• Examples
– Time offsets within a window for time series
– Top-level domains for web crawlers
– Vendor IDs for customer purchase profiles
• A predefined schema is impossible for this idiom
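
As a rough sketch of the time-offset example, the snippet below uses a Python dictionary as a stand-in for a wide table. The names `WINDOW_MS`, `table` and `put_sample`, the one-hour window and the sample values are all assumptions made for illustration.

```python
from collections import defaultdict

WINDOW_MS = 3_600_000  # assumed window size: one hour, in milliseconds

# Stand-in for a wide table: row key -> {column name -> value}.
table = defaultdict(dict)


def put_sample(series, timestamp_ms, value):
    """Store one sample; the column name is the offset within the window."""
    window_start = timestamp_ms - timestamp_ms % WINDOW_MS
    row_key = (series, window_start)
    offset = timestamp_ms - window_start
    table[row_key][offset] = value
    return row_key, offset


put_sample("sensor-1", 7_200_000, 20.5)
put_sample("sensor-1", 7_200_100, 20.7)
put_sample("sensor-1", 10_800_250, 21.0)
```

Each row ends up with a different set of column names, and those names (the offsets) are themselves part of the data.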

Relational Model for Time-series

Table Design: Point-by-Point

Table Design: Hybrid Point-by-Point + Sub-table
After the window closes, the data in the row is restated as a column-oriented tabular value in a different column family.
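
One way to picture the restating step is sketched below. The column family names "raw" and "compact", the function name `compact_row` and the use of JSON as the stored encoding are assumptions for this sketch; a real system would more likely use a compact binary encoding.

```python
import json


def compact_row(row):
    """Restate point-by-point cells as one column-oriented value.

    `row` maps column family -> {column name -> value}. The family names
    "raw" and "compact" and the JSON encoding are assumptions for this sketch.
    """
    points = sorted(row["raw"].items())  # (offset, value) pairs in time order
    blob = {
        "offset": [offset for offset, _ in points],
        "value": [value for _, value in points],
    }
    return {"raw": {}, "compact": {"data": json.dumps(blob)}}


row = {"raw": {100: 20.7, 0: 20.5, 250: 21.0}, "compact": {}}
compacted = compact_row(row)
restored = json.loads(compacted["compact"]["data"])
```

After compaction the row holds a single cell whose value is itself a small columnar table.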

Compression Results
• Samples are a 64-bit time-stamp plus a 16-bit sample value
• Sampled at 10 kHz
• Sample time jitter makes it important to keep the original time-stamp
• How much overhead is needed to retain the time-stamp?
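
The question can be explored with a small experiment like the one below. It does not reproduce the results from the talk: the jitter model (up to 5 microseconds either way), the microsecond units, delta encoding and zlib are all assumptions chosen only to show one way of measuring time-stamp overhead.

```python
import random
import struct
import zlib

random.seed(0)  # fixed seed so the sketch is repeatable
PERIOD_US = 100  # nominal sample period at 10 kHz, in microseconds
N = 10_000

# Hypothetical jittered time-stamps: nominal time plus up to +/-5 microseconds.
times = [i * PERIOD_US + random.randint(-5, 5) for i in range(N)]

# Delta-encode: keep the first time-stamp, then successive differences.
deltas = [times[0]] + [b - a for a, b in zip(times, times[1:])]

raw = struct.pack(f"<{N}q", *times)  # 64-bit time-stamps as stored
packed_deltas = zlib.compress(struct.pack(f"<{N}q", *deltas))

# Decoding the deltas must give back the original time-stamps exactly.
decoded = []
total = 0
for d in deltas:
    total += d
    decoded.append(total)

bytes_per_sample = len(packed_deltas) / N
```

Comparing `bytes_per_sample` with the 8 bytes of an unencoded time-stamp gives a rough feel for the overhead under these assumptions.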

A second example:
Music meta-data

MusicBrainz on NoSQL
• Artists, albums, tracks and labels are key objects
• Reality check:
– Add works (compositions), recordings, release, release group
• 7 tables for artist alone
• 12 for place, 7 for label, 17 for release/group, 8 for work
– (but only 4 for recording!)
– Total of 12 + 7 + 17 + 8 + 4 = 48 tables
• But wait, there’s more!
– 10 annotation tables, 10 edit tables, 19 tag tables, 5 rating tables, 86 link tables, 5 cover art tables and 3 tables for CD timing info (138 total)
– And 50 more tables that aren’t documented yet

180 tables not shown

236 tables to describe 7 kinds of things

Can we do better?

artist
id
gid
name
sort_name
begin_date
end_date
ended
type
gender
area
begin_area
end_area
comment
list<ipi>
list<isni>
list<alias>
list<release_id>
list<recording_id>
list<alias> element: { name, begin_date, end_date }

list<track> element: {id, recording_id, name, list<credit>, length}
recording
id
gid
list<credit>
name
list<track_ref>
release
id
gid
release_group_id
list<credit>
name
barcode
status
packaging
language
script
list<medium>
list<medium> element: {id, format, name, list<track>}
release_group
id
gid
name
list<credit>
type
list<release_id>

27 tables reduce to 4 so far

Further Reductions
• All 86 link tables become properties on artists, releases and other entities
• All 44 annotation, edit, tag and rating tables become list properties
• All 5 cover art tables become lists of file references
• Current score: 162 tables become 4
• You get the idea

Is This Good?
• Expressivity
– The JSON data model is at least as expressive as the original relational model
• Many cases are easier to describe in nested data
• No cases are harder
• Efficiency
– Inlining can increase data size, but locality improves
– Sessionizing can substantially decrease data size
– Inlining back-references is more efficient than ordinary indexes
– Inlined columnar data allows 1000x speedup for time series
• Introspection (you decide)

But How Can We Query This?
• Can’t use SQL
– SQL is strongly typed
– SQL is heavily tied into the original relational model
– SQL generating tools require relational model
• Must use SQL
– Vast numbers of tools and people understand how to write SQL
– SQL is the lingua franca of databases

Squaring the Circle
• Enter Apache Drill
• Drill is SQL compliant
– Uses standard syntax and semantics
• Drill extends SQL
– First class treatment of objects, lists
– Full support for destructuring, flattening
– Full power of relational model can be applied to complex data

Drill Provides Scalable and Extended SQL

Sample Query
• Find Elvis
-- flatten first, then filter on the alias name in the outer query
select distinct id, name, a.alias.name as alias from (
  select id, name, flatten(alias) as alias from artist
) a
where a.alias.name like 'Elvis%Presley'
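
For readers unfamiliar with flattening, the sketch below imitates the idea in plain Python: each element of a list-valued field becomes its own record, with the other fields copied alongside. The `flatten` helper and the artist records are hypothetical stand-ins, not Drill code or MusicBrainz data.

```python
from fnmatch import fnmatchcase


def flatten(records, field):
    """Yield one record per element of the list in `field`.

    Mirrors the general idea of a flatten operation: the other fields are
    copied onto every output record.
    """
    for record in records:
        for element in record.get(field, []):
            yield {**record, field: element}


# Hypothetical artist records with a nested list of aliases.
artists = [
    {"id": 1, "name": "Elvis Presley",
     "alias": [{"name": "Elvis Aaron Presley"}, {"name": "The King"}]},
    {"id": 2, "name": "Elvis Costello",
     "alias": [{"name": "Declan MacManus"}]},
]

# SQL LIKE 'Elvis%Presley' corresponds to the glob pattern 'Elvis*Presley'.
matches = sorted({
    (row["id"], row["name"], row["alias"]["name"])
    for row in flatten(artists, "alias")
    if fnmatchcase(row["alias"]["name"], "Elvis*Presley")
})
```

The set comprehension plays the role of `select distinct`, and the filter is applied only after the aliases have been turned into separate records.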

Example Query
• Find discs where Elvis was credited
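select distinct albums.album_id, albums.name
from
(
  -- one row per (release, credit); assumes each credit carries an artist_id
  select r.id album_id, r.name, r.credit.artist_id artist_id
  from (select id, name, flatten(credit) credit from release) r
) albums
join
(
  select distinct a.id artist_id from (
    select id, flatten(alias) alias from artist
  ) a
  where a.alias.name like 'Elvis%Presley'
) artists
on albums.artist_id = artists.artist_id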

Summary
• Extended relational model allows massive simplification
– On a real example, we see >20x reduction in number of tables
• Simplification drives improved introspection
– This is good
• Apache Drill gives very high performance execution for extended relational problems
• You can try this out today

Short Books by Ted Dunning & Ellen Friedman
• Published by O’Reilly in 2014 and 2015
• For sale from Amazon or O’Reilly
• Free e-books currently available courtesy of MapR
http://bit.ly/ebook-real-world-hadoop
http://bit.ly/mapr-tsdb-ebook
http://bit.ly/ebook-anomaly
http://bit.ly/recommendation-ebook

Thank You!

Q&A
@mapr maprtech
tdunning@maprtech.com
Engage with us!
MapR
maprtech
mapr-technologies
