© 2014 MapR Technologies
Contact Information
Ted Dunning
Chief Applications Architect at MapR Technologies
Committer & PMC member for Apache Drill, ZooKeeper & Mahout
VP of Incubator at the Apache Software Foundation
Email: tdunning@apache.org, tdunning@maprtech.com
Twitter: @ted_dunning
Hashtag today: #BDE2015
Agenda
• What does good mean?
• What do we mean by loose typing?
• Examples of what you can do
• Real database with 10-20x fewer tables
• Looking forward
• Questions
What Does Good Mean (for a DB)?
• Expressive
– Must express the concepts we need
• Efficient
– Must run fast enough on cheap enough hardware
• Introspectable
– Must be able to inspect the data and schema and gain understanding
What is New Here
• Introspection is better when
– A minimum of data entities are used to describe our model
– No name overflow
– Referential scoping helps narrow our focus to a simpler problem
– Many-to-one relations can be in-lined
• Introspection was not a goal for the design of the relational model
• Introspection was therefore not a result either
Older than Dirt
• Relational theory is old (1970)
– Predates data structures
– Predates mainstream recursive procedures
– Predates lexical scoping
– Predates logic programming
– Predates real functional programming (Church, McCarthy, Iverson and Backus notwithstanding)
• Some updates are in order to enhance introspection
Contrast Relational and HBase Style noSQL
Relational
• Rows containing fields
• Fields contain primitive types
• Structure is fixed and uniform
• Structure is pre-defined
• Referential integrity (optional)
• Expressions over sets of rows
HBase / MapR DB
• Rows contain fields
• Fields contain bytes
• Structure is flexible
• No pre-defined structure
• Single key
• Column families
• Timestamps
• Versions
Contrast Relational and HBase with Structuring
Relational
• Rows containing fields
• Fields contain primitive types
• Structure is fixed and uniform
• Structure is pre-defined
• Referential integrity (optional)
• Expressions over sets of rows
HBase + Structuring
• Rows contain fields
• Fields contain primitive types
– Or objects, or lists
• Structure is flexible, ragged
• No pre-defined structure
• Single key
Turtle Models for Databases
• Allow complex objects in field values
– JSON style lists and objects
• Allow references to objects via join
– Includes references localized within lists
• Lists of objects and objects of lists are isomorphic to tables so …
• Complex data in tables,
• But also tables in complex data,
• Even tables containing complex data containing tables
Proviso and Warning
• This is not your father’s BLOB
• And not the same as arrays with lateral view joins
• Rationale to come as we talk about idioms
A Catalog of noSQL Idioms
Tables as Objects, Objects as Tables
Row-wise form (list of objects):
[ { c1:v1, c2:v2, c3:v3 },
  { c1:v1, c2:v2, c3:v3 },
  { c1:v1, c2:v2, c3:v3 } ]
Column-wise form (object containing lists):
{ c1:[v1, v2, v3],
  c2:[v1, v2, v3],
  c3:[v1, v2, v3] }
Micro Columnar Formats
An entire table stored in
columnar form can be a
first-class value using
these techniques
This is very powerful for
in-lining one-to-many
relations.
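The equivalence between the row-wise and column-wise forms can be sketched in a few lines of Python. This is only an illustration, not code from Drill, HBase or MapR DB: the function names `rows_to_columns` and `columns_to_rows` are hypothetical, and the sketch assumes every row carries the same keys.

```python
def rows_to_columns(rows):
    """Turn a list of objects (row-wise) into an object of lists (column-wise).

    Assumes every row has the same keys, as in a uniform table.
    """
    if not rows:
        return {}
    return {key: [row[key] for row in rows] for key in rows[0]}


def columns_to_rows(columns):
    """Turn an object of lists (column-wise) back into a list of objects."""
    names = list(columns)
    return [dict(zip(names, values)) for values in zip(*columns.values())]


rows = [
    {"c1": 1, "c2": "a", "c3": True},
    {"c1": 2, "c2": "b", "c3": False},
    {"c1": 3, "c2": "c", "c3": True},
]
columns = rows_to_columns(rows)
# The round trip loses nothing, which is the sense in which the two forms
# are interchangeable.
assert columns_to_rows(columns) == rows
```

Because the column-wise form is an ordinary value, it can be stored in a single field of a larger row, which is what makes the in-lining described above possible.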
Note
• If embedded tables are first-class, schema becomes data
• If schema is data-driven when embedded, constructs that elevate tables to top-level are impossible
• Thus, embedded first-class objects imply late discovery of schema information
A first example:
Time-series data
Column names as data
• When column names are not pre-defined, they can convey
information
• Examples
– Time offsets within a window for time series
– Top-level domains for web crawlers
– Vendor IDs for customer purchase profiles
• Predefined schema is impossible for this idiom
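The time-series case of this idiom can be sketched as follows. The one-second window, the millisecond time-stamps, the name `locate` and the in-memory dictionary standing in for a table are all assumptions made for illustration, not details from the talk.

```python
from collections import defaultdict

WINDOW_MS = 1_000  # assumed window length: one second, in milliseconds


def locate(series_id, timestamp_ms):
    """Return (row_key, column_name) for one sample.

    The row key names the series and the start of its time window; the
    column name is the offset of the sample inside that window.
    """
    window_start = timestamp_ms - timestamp_ms % WINDOW_MS
    return (series_id, window_start), timestamp_ms - window_start


# An in-memory stand-in for a wide table: row key -> {column name: value}.
table = defaultdict(dict)
for t, value in [(5_000, 17), (5_250, 18), (5_900, 21), (6_100, 19)]:
    row_key, column = locate("sensor-7", t)
    table[row_key][column] = value

# Two rows result; the column names (0, 250, 900 and 100) are themselves data.
```

No fixed list of columns could be declared in advance here, since the set of offsets depends on when samples happen to arrive.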
Relational Model for Time-series
Table Design: Point-by-Point
Table Design: Hybrid Point-by-Point + Sub-table
After close of window, data in row is restated as column-oriented
tabular value in different column family.
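One way to picture the restatement step is the sketch below. The column family names "points" and "table", the column name "window" and the function `close_window` are hypothetical; the real layout is not given on the slide.

```python
def close_window(row):
    """Restate a point-by-point row as one column-oriented value.

    `row` maps a column family name to {column name: value}. The family
    names "points" and "table" are made up for this sketch.
    """
    points = row.get("points", {})
    offsets = sorted(points)
    restated = {
        "offset": offsets,
        "value": [points[o] for o in offsets],
    }
    # The sub-table lands in a different column family; the individual
    # point columns are no longer needed once it exists.
    return {"points": {}, "table": {"window": restated}}


open_row = {"points": {250: 18, 0: 17, 900: 21}}
closed_row = close_window(open_row)
```

After the call, the three separate point columns have become a single value holding two parallel lists.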
Compression Results
Samples are 64-bit time, 16-bit sample
Sampled at 10 kHz
Sample time jitter makes it important to keep original time-stamp
How much overhead to
retain time-stamp?
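The uncompressed cost behind this question follows directly from the figures above. The delta encoding in the second half is an assumption of this sketch: it is one common way to shrink time-stamps, not necessarily the method behind the results shown in the talk, and the sample time-stamps are made up.

```python
TIME_BITS = 64
SAMPLE_BITS = 16
RATE_HZ = 10_000

raw_bytes_per_second = RATE_HZ * (TIME_BITS + SAMPLE_BITS) // 8
timestamp_share = TIME_BITS / (TIME_BITS + SAMPLE_BITS)
# Uncompressed, the stream is 100,000 bytes per second and the time-stamps
# account for 80% of it.


def delta_encode(timestamps):
    """Keep the first time-stamp and store every later one as a difference."""
    return timestamps[:1] + [b - a for a, b in zip(timestamps, timestamps[1:])]


def delta_decode(deltas):
    """Rebuild the original time-stamps exactly from the differences."""
    timestamps = []
    for d in deltas:
        timestamps.append(d if not timestamps else timestamps[-1] + d)
    return timestamps


# Made-up time-stamps in microseconds: nominally 100 apart, with jitter.
stamps = [1_000_000, 1_000_101, 1_000_199, 1_000_300, 1_000_402]
deltas = delta_encode(stamps)
assert delta_decode(deltas) == stamps
```

In this made-up data the differences stay close to 100, so the jittered time-stamps are kept exactly while each one needs far fewer than 64 bits.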
A second example:
Music meta-data
MusicBrainz on NoSQL
• Artists, albums, tracks and labels are key objects
• Reality check:
– Add works (compositions), recordings, release, release group
• 7 tables for artist alone
• 12 for place, 7 for label, 17 for release/group, 8 for work
– (but only 4 for recording!)
– Total of 12 + 7 + 17 + 8 + 4 = 48 tables
• But wait, there’s more!
– 10 annotation tables, 10 edit tables, 19 tag tables, 5 rating tables, 86
link tables, 5 cover art tables and 3 tables for CD timing info (138 total)
– And 50 more tables that aren’t documented yet
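The counts in these bullets can be added up mechanically. The numbers are copied from the bullets; the dictionary names are only labels for this sketch, and the first sum follows the slide's own total, which lists place, label, release/group, work and recording.

```python
entity_tables = {"place": 12, "label": 7, "release/group": 17, "work": 8, "recording": 4}
extra_tables = {
    "annotation": 10, "edit": 10, "tag": 19, "rating": 5,
    "link": 86, "cover art": 5, "CD timing": 3,
}
undocumented = 50

entity_total = sum(entity_tables.values())   # the slide's "48 tables"
extra_total = sum(extra_tables.values())     # the slide's "138 total"
grand_total = entity_total + extra_total + undocumented
```

With these inputs `grand_total` comes to 236.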
180 tables not shown
236 tables to describe 7 kinds of things
Can we do better?
artist
id
gid
name
sort_name
begin_date
end_date
ended
type
gender
area
begin_area
end_area
comment
list<ipi>
list<isni>
list<alias>
list<release_id>
list<recording_id>
{ name, begin_date,
end_date }
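A single artist in this shape might look like the sketch below. All values are made up, only some of the fields are shown, and the sketch assumes the { name, begin_date, end_date } structure describes each entry of list<alias>. The helper names `alias_names` and `known_as` are hypothetical.

```python
artist = {
    "id": 1,  # made-up values throughout
    "name": "Example Artist",
    "sort_name": "Artist, Example",
    "alias": [
        {"name": "E. Artist", "begin_date": None, "end_date": None},
        {"name": "The Example", "begin_date": "1990", "end_date": "1995"},
    ],
    "release_id": [101, 102],
    "recording_id": [9001],
}


def alias_names(doc):
    """All names an artist is known by, without touching another table."""
    return [doc["name"]] + [a["name"] for a in doc["alias"]]


def known_as(doc, name):
    """True if `name` is the artist's main name or one of the aliases."""
    return name in alias_names(doc)
```

The aliases and the back-references to releases and recordings travel with the artist, so questions about one artist can be answered from one record.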
track: {id, recording_id, name, list<credit>, length}
recording
id
gid
list<credit>
name
list<track_ref>
release
id
gid
release_group_id
list<credit>
name
barcode
status
packaging
language
script
list<medium>
medium: {id, format, name, list<track>}
release_group
id
gid
name
list<credit>
type
list<release_id>
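The nesting of media and tracks inside a release can be sketched like this. The values are made up, credits are left out for brevity, and `track_rows` is a hypothetical helper that unrolls the nested record into the flat rows that separate medium and track tables would hold.

```python
release = {
    "id": 101,
    "name": "Example Album",
    "medium": [
        {"id": 1, "format": "CD", "name": "Disc 1", "track": [
            {"id": 11, "recording_id": 9001, "name": "First", "length": 180},
            {"id": 12, "recording_id": 9002, "name": "Second", "length": 200},
        ]},
        {"id": 2, "format": "CD", "name": "Disc 2", "track": [
            {"id": 21, "recording_id": 9003, "name": "Third", "length": 240},
        ]},
    ],
}


def track_rows(doc):
    """Unroll release -> medium -> track into flat rows.

    Each row repeats the release and medium ids, which is what the
    foreign keys in separate medium and track tables would carry.
    """
    return [
        {"release_id": doc["id"], "medium_id": m["id"], "track_id": t["id"],
         "name": t["name"], "length": t["length"]}
        for m in doc["medium"]
        for t in m["track"]
    ]


rows = track_rows(release)
total_length = sum(r["length"] for r in rows)
```

The nested record and the flat rows carry the same information; the nested form simply keeps the release, its media and their tracks in one place.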
27 tables reduce to 4 so far
Further Reductions
• All 86 link tables become properties on artists, releases and
other entities
• All 44 tag, rating and annotation tables become list properties
• All 5 cover art tables become lists of file references
• Current score: 162 tables become 4
• You get the idea
Is This Good?
• Expressivity
– The JSON data model is at least as expressive as the original relational
model
• Many cases easier to describe in nested data
• No cases are harder
• Efficiency
– Inlining can increase data size. Locality improves, however
– Sessionizing can substantially decrease data size
– Inlining back-references is more efficient than ordinary indexes
– Inlined columnar data allows 1000x speedup for time series
• Introspection (you decide)
But How Can We Query This?
• Can’t use SQL
– SQL is strongly typed
– SQL is heavily tied into the original relational model
– SQL generating tools require relational model
• Must use SQL
– Vast numbers of tools and people understand how to write SQL
– SQL is the lingua franca of databases
Squaring the Circle
• Enter Apache Drill
• Drill is SQL compliant
– Uses standard syntax and semantics
• Drill extends SQL
– First class treatment of objects, lists
– Full support for destructuring, flattening
– Full power of relational model can be applied to complex data
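The general idea of flattening can be mimicked in plain Python. This is a sketch of the concept only, not Drill's implementation; the function name `flatten` is reused here for readability, the sample data is made up, and dropping rows with empty lists is a choice made for this sketch.

```python
def flatten(rows, column):
    """Emit one output row per element of the list stored in `column`.

    All other columns are repeated unchanged. Rows whose list is empty
    produce no output in this sketch.
    """
    out = []
    for row in rows:
        for element in row[column]:
            flat = dict(row)
            flat[column] = element
            out.append(flat)
    return out


artists = [
    {"id": 1, "name": "A", "alias": ["A1", "A2"]},
    {"id": 2, "name": "B", "alias": []},
    {"id": 3, "name": "C", "alias": ["C1"]},
]
flat = flatten(artists, "alias")
# flat now holds three rows: two for artist 1 and one for artist 3.
```

Once a list has been flattened into ordinary rows, the usual relational operations (filter, join, distinct) apply to it unchanged.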
Drill Provides Scalable and Extended SQL
Sample Query
• Find Elvis
select distinct id, name, alias from (
  select id, name, flatten(alias.name) alias from artist
)
where alias like 'Elvis%Presley'
Example Query
• Find discs where Elvis was credited
select distinct album_id, name
from
(
  select id album_id, name, flatten(credit)
  from release
) albums
join
(
  select distinct artist_id from (
    select id artist_id, flatten(alias) from artist
    where name like 'Elvis%Presley'
  )
) artists
using (artist_id)
Summary
• Extended relational model allows massive simplification
– On a real example, we see >20x reduction in number of tables
• Simplification drives improved introspection
– This is good
• Apache Drill gives very high performance execution for extended
relational problems
• You can try this out today
Short Books by Ted Dunning & Ellen Friedman
• Published by O’Reilly in 2014 and 2015
• For sale from Amazon or O’Reilly
• Free e-books currently available courtesy of MapR
http://bit.ly/ebook-real-world-hadoop
http://bit.ly/mapr-tsdb-ebook
http://bit.ly/ebook-anomaly
http://bit.ly/recommendation-ebook
Real World Hadoop
by Ted Dunning and Ellen Friedman © Feb 2015 (published by O’Reilly)
Free copies at book signing today
Thank You!
Q&A
@mapr maprtech
tdunning@mapr.tech.com
Engage with us!
MapR
maprtech
mapr-technologies

HBase and Drill: How loosely typed SQL is ideal for NoSQL


Editor's Notes

  • #20: Key ideas: Each row has a unique row key based on an id for each time series (looked up from a separate look-up table). An important part of the efficiency of the design is to have each column be a time offset from the start time shown in the row key. Note that data is stored point-by-point in this wide table design. Ted’s notes from his original slide: One technique for increasing the rate at which data can be retrieved from a time series database is to store many values in each row. Doing this allows data points to be retrieved at a higher speed. Because both HBase and MapR-DB store data ordered by the primary key, this design will cause rows containing data from a single time series to wind up near one another on disk. Retrieving data from a particular time series for a time range will involve largely sequential disk operations and therefore will be much faster than would be the case if the rows were widely scattered. Typically, the time window is adjusted so that 100–1,000 samples are in each row.
  • #21: Ted’s notes from original slide: The table design is improved by collapsing all of the data for a row into a single data structure known as a blob. This blob can be highly compressed so that less data needs to be read from disk. Also, having a single column per row decreases the per-column overhead incurred by the on-disk format that HBase uses, which further increases performance. Data can be progressively converted to the compressed format as soon as it is known that little or no new data is likely to arrive for that time series and time window. Commonly, once the time window ends, new data will only arrive for a few more seconds, and the compression of the data can begin. Since compressed and uncompressed data can coexist in the same row, if a few samples arrive after the row is compressed, the row can simply be compressed again to merge the blob and the late-arriving samples.
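The compress-then-recompress cycle described in this note can be sketched as below. The row layout, the JSON encoding, the use of zlib and the names `compress_row` and `read_row` are all assumptions of the sketch; the note does not say how the blob is encoded.

```python
import json
import zlib


def compress_row(row):
    """Collapse all point columns and any existing blob into a single blob.

    `row` is {"points": {offset: value}, "blob": bytes or None}. The layout,
    the JSON encoding and the use of zlib are assumptions of this sketch.
    """
    merged = {}
    if row.get("blob"):
        old = json.loads(zlib.decompress(row["blob"]))
        merged.update(zip(old["offset"], old["value"]))
    merged.update(row.get("points", {}))  # late samples join the old ones
    offsets = sorted(merged)
    columnar = {"offset": offsets, "value": [merged[o] for o in offsets]}
    blob = zlib.compress(json.dumps(columnar).encode("utf-8"))
    return {"points": {}, "blob": blob}


def read_row(row):
    """Return {offset: value} from a row that may mix blob and loose points."""
    samples = {}
    if row.get("blob"):
        data = json.loads(zlib.decompress(row["blob"]))
        samples.update(zip(data["offset"], data["value"]))
    samples.update(row.get("points", {}))
    return samples


row = compress_row({"points": {0: 17, 250: 18, 900: 21}, "blob": None})
row["points"][950] = 22          # a sample arrives after compression
row = compress_row(row)          # compress again to merge it in
```

Because `read_row` accepts a mix of blob and loose points, readers see the late sample both before and after the second compression.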