Drilling into Data with Apache Drill
Tomer Shiran, Apache Drill Founder and PMC Member
Jacques Nadeau, Apache Drill PMC Chair
Tomer Shiran: tshiran@apache.org, @tshiran, Drill founder and PMC Member
Jacques Nadeau: jnadeau@apache.org, @intjesus, Drill PMC Chair (VP, Apache Drill)
Apache Drill
• Open source SQL query engine for non-relational datastores
– JSON document model
– Columnar
• Key advantages:
– Query any non-relational datastore
– No overhead (creating and maintaining schemas, transforming data, …)
– Treat your data like a table even when it’s not
– Keep using the BI tools you love
– Scales from one laptop to 1000s of servers
– Great performance and scalability
Omni-SQL (“SQL-on-Everything”)
Drill: Omni-SQL
“Whereas the other engines we're discussing here create a relational database
environment on top of Hadoop, Drill instead enables a SQL language interface to
data in numerous formats, without requiring a formal schema to be declared. This
enables plug-and-play discovery over a huge universe of data without
prerequisites and preparation. So while Drill uses SQL, and can connect to
Hadoop, calling it SQL-on-Hadoop kind of misses the point. A better name might
be SQL-on-Everything, with very low setup requirements.”
Andrew Brust
Any Non-Relational Datastore
• File systems
– Traditional: Local files and NAS
– Hadoop: HDFS and MapR-FS
– Cloud storage: Amazon S3, Google Cloud Storage, Azure Blob Storage
• NoSQL databases
– MongoDB
– HBase
– MapR-DB
– Hive
• And you can add new datastores
Any Client
• Multiple interfaces: ODBC, JDBC, REST, C, Java
• BI tools
– Tableau
– Qlik
– MicroStrategy
– TIBCO Spotfire
– Excel
• Command line (Drill shell)
• Web and mobile apps
– Many JSON-powered chart libraries (see D3.js)
• SAS, R, …
Drill Integrates With What You Have
Achieving “End-to-End Performance”
Execute fast
• Standard SQL
• Read data fast
• Leverage columnar encodings and execution
• Execute operations quickly
• Scale out, not up
Iterate fast
• Work without prep
• Decentralize data management
• In-situ security
• Explore + query
• Access multiple sources
• Avoid the ETL rinse cycle
JSON Model, Columnar Speed
[Figure: data formats arranged by schema (schema-less vs. fixed schema) and by structure (flat vs. complex): JSON, BSON, Mongo, HBase, NoSQL, Parquet, Avro, CSV, TSV]
RDBMS/SQL-on-Hadoop table:
Name Gender Age
Michael M 6
Jennifer F 3
Apache Drill table:
{
  name: {
    first: Michael,
    last: Smith
  },
  hobbies: [ski, soccer],
  district: Los Altos
}
{
  name: {
    first: Jennifer,
    last: Gates
  },
  hobbies: [sing],
  preschool: CCLC
}
Apache Drill Provides the Best of Both Worlds
Acts Like a Database
• ANSI SQL: SELECT, FROM, WHERE, JOIN, HAVING, ORDER BY, WITH, CTAS, ALL, EXISTS, ANY, IN, SOME
• VarChar, Int, BigInt, Decimal, VarBinary, Timestamp, Float, Double, etc.
• Subqueries, scalar subqueries, partition pruning, CTE
• Data warehouse offload
• Tableau, ODBC, JDBC
• TPC-H & TPC-DS-like workloads
• Supports Hive SerDes
• Supports Hive UDFs
• Supports Hive Metastore
Even When Your Data Doesn’t
• Path-based queries and wildcards
– select * from /my/logs/
– select * from /revenue/*/q2
• Modern data types
– Map, Array, Any
• Complex functions and relational operators
– FLATTEN, kvgen, convert_from, convert_to, repeated_count, etc.
• JSON sensor analytics
• Complex data analysis
• Alternative DSLs
Why? To Support the Changing Data Organization
Data Dev Circa 2000
1. Developer comes up with requirements
2. DBA defines tables
3. DBA defines indices
4. DBA defines FK relationships
5. Developer stores data
6. BI builds reports
7. Analyst views reports
8. DBA adds materialized views
Data Today
1. Developer builds app, defines schema, stores data
2. Analyst queries data
3. Data engineer fixes performance problems or fills functionality gaps
HOW DOES IT WORK?
Everything Starts With a Drillbit…
• High-performance query executor
• In-memory columnar execution
• Directly interacts with data, acquiring knowledge as it reads
• Built to leverage large amounts of memory
• Networked or not
• Exposes ODBC, JDBC, REST
• Built-in Web UI and CLI
• Extensible
Drillbit: a single process (daemon or CLI)
Data Lake, More Like Data Maelstrom
[Diagram: clustered servers (an HBase & HDFS cluster, an HDFS cluster, a MongoDB cluster, and a Cassandra cluster) alongside Windows and Mac desktops]
Run Drillbits Wherever; Whatever Your Data
[Diagram: the same clusters and desktops, with a Drillbit running next to each HDFS, HBase, mongod, and Cassandra node and on the Windows and Mac desktops]
Connect to Any Drillbit with ODBC, JDBC, C, Java, REST
1. User connects to Drillbit
2. That Drillbit becomes Foreman
– Foreman generates execution plan
– Cost-based query optimization & locality
3. Execution fragments are farmed to other Drillbits
4. Drillbits exchange data as necessary to guarantee relational algebra
5. Results are returned to user through Foreman Drillbit
[Diagram: User connected to a Drillbit (foreman), which coordinates the other Drillbits]
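The foreman flow above can be pictured with a toy scatter/gather model in Python. This is only an illustration of the idea (fragments compute partial results, the foreman merges them), not Drill's planner or execution code; the function names `run_fragment` and `foreman` and the sample splits are made up for the sketch.

```python
from collections import Counter

def run_fragment(rows):
    """Work done by one Drillbit: a partial COUNT(*) per star rating."""
    return Counter(row["stars"] for row in rows)

def foreman(splits):
    """Toy foreman: farm out one fragment per split, then merge the partials."""
    # In a real cluster the fragments would run in parallel on other Drillbits.
    partials = [run_fragment(rows) for rows in splits]
    total = Counter()
    for partial in partials:
        total.update(partial)  # adds the partial counts together
    return sorted(total.items())

# Hypothetical data: two splits of a review file, each handled by one fragment.
splits = [
    [{"stars": 5}, {"stars": 4}],
    [{"stars": 5}, {"stars": 1}],
]
result = foreman(splits)
print(result)
```

Merging partial aggregates is what lets a GROUP BY scale out: each node only ships a small summary back to the foreman instead of raw rows.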
ANALYZING YELP DATA
1. DOWNLOAD AND INSTALL DRILL
Run Drill in Embedded Mode (drill-embedded)
$ tar xf apache-drill-1.0.0.tar.gz
$ cd apache-drill-1.0.0
$ bin/drill-embedded
> SELECT * FROM dfs.root.`/Users/tshiran/yelp/user.json` LIMIT 1;
+----------------+----------------------------------+---------------+-------+
| yelping_since | votes | review_count | name |
+----------------+----------------------------------+---------------+-------+
| 2012-02 | {"funny":1,"useful":5,"cool":0} | 6 | Lee |
+----------------+----------------------------------+---------------+-------+
• drillbit (Drill daemon) starts automatically in embedded mode
• No ZooKeeper in embedded mode
• Web UI is available at localhost:8047
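The same port that serves the Web UI also accepts queries over REST. The sketch below builds such a request in Python without sending it. It assumes the `/query.json` endpoint and the `queryType`/`query` payload fields described in the Drill REST API documentation; the helper name `build_drill_query` is invented here.

```python
import json
import urllib.request

def build_drill_query(sql, host="localhost", port=8047):
    """Build (but do not send) a POST request for Drill's REST query endpoint."""
    body = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        f"http://{host}:{port}/query.json",  # assumed endpoint, see lead-in
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

request = build_drill_query(
    "SELECT * FROM dfs.root.`/Users/tshiran/yelp/user.json` LIMIT 1"
)

# With a Drillbit running locally, the request could be sent like this:
# with urllib.request.urlopen(request) as response:
#     result = json.load(response)
```

Keeping request construction separate from sending makes the payload easy to inspect before pointing it at a live Drillbit.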
Review the Query Profile in the Web UI (localhost:8047)
Run Drill in Distributed Mode
$ zkServer start # ZooKeeper maintains the list of drillbits in the cluster
$ bin/drillbit.sh start # conf/drill-override.conf includes cluster name and ZK nodes
$ bin/drill-conf # or bin/drill-localhost to skip ZK lookup
> SELECT stars, count(*)
FROM dfs.root.`/Users/tshiran/yelp/review.json`
GROUP BY stars ORDER BY stars;
+--------+---------+
| stars | EXPR$1 |
+--------+---------+
| 1 | 110772 |
| 2 | 102737 |
| 3 | 163761 |
| 4 | 342143 |
| 5 | 406045 |
+--------+---------+
5 rows selected (3.739 seconds)
2. CONFIGURE DATASTORES (STORAGE PLUGINS)
Enable MongoDB Storage Plugin
Define Workspaces in the File Storage Plugin
3. EXPLORE THE DATA
The Data: Files
{
"votes": {"funny": 0, "useful": 2, "cool": 1},
"user_id": "Xqd0DzHaiyRqVH3WRG7hzg",
"review_id": "15SdjuK7DmYqUAj6rjGowg",
"stars": 5,
"date": "2007-05-17",
"text": "dr. goldberg offers everything ...",
"type": "review",
"business_id": "vcNAWiLM4dR7D2nwwJ7nCA"
}
The Data: MongoDB Collections
$ mongo
MongoDB shell version: 2.6.5
> show databases;
admin (empty)
local 0.078GB
yelp 0.453GB
> use yelp
> db.users.findOne()
{
"_id" : ObjectId("54566cdf3237149de181a92a"),
"yelping_since" : "2012-02",
"votes" : {
"funny" : 1,
"useful" : 5,
"cool" : 0
},
"review_count" : 6,
"name" : "Lee",
"user_id" : "qtrmBGNqCvupHMHL_bKFgQ",
"friends" : [ ]
}
Are There More 5-Star or 1-Star Reviews?
> SELECT stars, count(*)
FROM dfs.root.`/Users/tshiran/yelp/review.json`
GROUP BY stars ORDER BY stars;
+--------+---------+
| stars | EXPR$1 |
+--------+---------+
| 1 | 110772 |
| 2 | 102737 |
| 3 | 163761 |
| 4 | 342143 |
| 5 | 406045 |
+--------+---------+
5 rows selected (3.739 seconds)
Using Storage Plugins and Workspaces
> SELECT * FROM dfs.root.`/Users/tshiran/data/yelp/review.json`
LIMIT 1;
> SELECT * FROM dfs.demo.`yelp/review.json` LIMIT 1;
> SELECT * FROM mongo.yelp.users LIMIT 1;
> USE mongo.yelp;
> SELECT * FROM users LIMIT 1;
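In dfs.demo.`yelp/review.json`, dfs is the storage plugin, demo is the workspace, and yelp/review.json is the path relative to the workspace.
Storage Plugin | Workspace | Table
dfs | Path | Path relative to workspace
mongo | Database | Collection
hive | Database | Table
hbase | Namespace | Table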
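To make the three-part naming concrete, here is a small Python helper that splits a qualified table name into storage plugin, workspace, and table. It is a sketch for illustration only: `split_table_name` is a made-up function, and it handles just the simple three-part, backtick-quoted form used on these slides, not Drill's full identifier grammar.

```python
import re

def split_table_name(qualified):
    """Split 'plugin.workspace.table', keeping backtick-quoted parts whole."""
    # Either a backtick-quoted chunk (which may contain dots) or a bare identifier.
    parts = re.findall(r"`[^`]*`|[^.`]+", qualified)
    parts = [part.strip("`") for part in parts]
    if len(parts) != 3:
        raise ValueError(f"expected plugin.workspace.table, got {qualified!r}")
    return tuple(parts)

file_table = split_table_name("dfs.demo.`yelp/review.json`")
mongo_table = split_table_name("mongo.yelp.users")
print(file_table)
print(mongo_table)
```

The backticks matter: without them, the dot in `review.json` would be read as another name separator.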
Most Common User Names (MongoDB)
> SELECT name, count(*) AS users
FROM mongo.yelp.users
GROUP BY name
ORDER BY users DESC LIMIT 10;
+------------+------------+
| name | users |
+------------+------------+
| David | 2453 |
| John | 2378 |
| Michael | 2322 |
| Chris | 2202 |
| Mike | 2037 |
| Jennifer | 1867 |
| Jessica | 1463 |
| Jason | 1457 |
| Michelle | 1439 |
| Brian | 1436 |
+------------+------------+
Cities with the Most Businesses
> SELECT state, city, count(*) AS businesses
FROM dfs.demo.`/yelp/business.json`
GROUP BY state, city
ORDER BY businesses DESC LIMIT 10;
+------------+------------+-------------+
| state | city | businesses |
+------------+------------+-------------+
| NV | Las Vegas | 12021 |
| AZ | Phoenix | 7499 |
| AZ | Scottsdale | 3605 |
| EDH | Edinburgh | 2804 |
| AZ | Mesa | 2041 |
| AZ | Tempe | 2025 |
| NV | Henderson | 1914 |
| AZ | Chandler | 1637 |
| WI | Madison | 1630 |
| AZ | Glendale | 1196 |
+------------+------------+-------------+
3. EXPLORING COMPLEX DATA
business.json (1)
{
"business_id": "4bEjOyTaDG24SY5TxsaUNQ",
"full_address": "3655 Las Vegas Blvd S\nThe Strip\nLas Vegas, NV 89109",
"hours": {
"Monday": {"close": "23:00", "open": "07:00"},
"Tuesday": {"close": "23:00", "open": "07:00"},
"Friday": {"close": "00:00", "open": "07:00"},
"Wednesday": {"close": "23:00", "open": "07:00"},
"Thursday": {"close": "23:00", "open": "07:00"},
"Sunday": {"close": "23:00", "open": "07:00"},
"Saturday": {"close": "00:00", "open": "07:00"}
},
"open": true,
"categories": ["Breakfast & Brunch", "Steakhouses", "French", "Restaurants"],
"city": "Las Vegas",
"review_count": 4084,
"name": "Mon Ami Gabi",
"neighborhoods": ["The Strip"],
"longitude": -115.172588519464,
business.json (2)
"state": "NV",
"stars": 4.0,
"attributes": {
"Alcohol": "full_bar",
"Noise Level": "average",
"Has TV": false,
"Attire": "casual",
"Ambience": {
"romantic": true,
"intimate": false,
"touristy": false,
"hipster": false,
"classy": true,
"trendy": false,
"casual": false
},
"Good For": {"dessert": false, "latenight": false, "lunch": false,
"dinner": true, "breakfast": false, "brunch": false}
}
}
Which Places Are Open Right Now (22:00)?
> SELECT name, b.hours
FROM dfs.demo.`yelp/business.json` b
WHERE b.hours.Saturday.`open` < '22:00' AND
b.hours.Saturday.`close` > '22:00'
LIMIT 2;
+------------------------------+------------------------------------------------+
| name | hours |
+------------------------------+------------------------------------------------+
| Chang Jiang Chinese Kitchen | {"Saturday":{"close":"22:30","open":"11:00"}} |
| Grand China Restaurant | {"Saturday":{"close":"23:00","open":"11:00"}} |
+------------------------------+------------------------------------------------+
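The WHERE clause above compares times as strings, which works because zero-padded HH:MM values sort in the same order as the times they represent. The Python sketch below mirrors that predicate and also shows a caveat: a business that closes at midnight ("00:00"), like Mon Ami Gabi on Saturdays in business.json, does not pass the `close > '22:00'` test. The helper name `is_open` is made up for this example.

```python
def is_open(hours, now):
    """Mirror the SQL predicate: open < now AND close > now, compared as strings."""
    return hours["open"] < now and hours["close"] > now

# Saturday hours taken from the query output and from business.json above.
grand_china = {"close": "23:00", "open": "11:00"}
mon_ami_gabi = {"close": "00:00", "open": "07:00"}

grand_china_open = is_open(grand_china, "22:00")
mon_ami_gabi_open = is_open(mon_ami_gabi, "22:00")

print(grand_china_open)    # True: "11:00" < "22:00" < "23:00"
print(mon_ami_gabi_open)   # False: "00:00" is not greater than "22:00"
```

A production query would need an extra condition for closing times that wrap past midnight; the slide's version keeps the predicate simple for demonstration.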
It’s 10pm in Vegas and I Want Good Hummus!
> SELECT name, b.hours.Friday AS friday, categories
FROM dfs.demo.`yelp/business.json` b
WHERE b.hours.Friday.`open` < '22:00' AND
b.hours.Friday.`close` > '22:00' AND
REPEATED_CONTAINS(categories, 'Mediterranean') AND
city = 'Las Vegas'
ORDER BY stars DESC
LIMIT 2;
+--------------------------------+-----------------------------------+--------------------------------------------------------------+
| name | friday | categories |
+--------------------------------+-----------------------------------+--------------------------------------------------------------+
| Olives | {"close":"22:30","open":"11:00"} | ["Mediterranean","Restaurants"] |
| Marrakech Moroccan Restaurant | {"close":"23:00","open":"17:30"} | ["Mediterranean","Middle Eastern","Moroccan","Restaurants"] |
+--------------------------------+-----------------------------------+--------------------------------------------------------------+
Flatten Repeated Values
> SELECT name, categories
FROM dfs.demo.`yelp/business.json` LIMIT 3;
+-----------------------------+-------------------------------------------+
| name | categories |
+-----------------------------+-------------------------------------------+
| Eric Goldberg, MD | ["Doctors","Health & Medical"] |
| Pine Cone Restaurant | ["Restaurants"] |
| Deforest Family Restaurant | ["American (Traditional)","Restaurants"] |
+-----------------------------+-------------------------------------------+
> SELECT name, FLATTEN(categories) AS categories
FROM dfs.demo.`yelp/business.json` LIMIT 5;
+-----------------------------+-------------------------+
| name | categories |
+-----------------------------+-------------------------+
| Eric Goldberg, MD | Doctors |
| Eric Goldberg, MD | Health & Medical |
| Pine Cone Restaurant | Restaurants |
| Deforest Family Restaurant | American (Traditional) |
| Deforest Family Restaurant | Restaurants |
+-----------------------------+-------------------------+
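As a rough mental model, FLATTEN emits one output row per element of the repeated column and copies the other columns alongside it. The Python generator below imitates that behavior on two of the rows shown above; it is a sketch of the semantics, not of how Drill implements the operator, and the function name `flatten` is chosen here for illustration.

```python
def flatten(rows, column):
    """Yield one row per element of the list stored in `column`."""
    for row in rows:
        for value in row[column]:
            # Copy the row, replacing the list with a single element.
            yield {**row, column: value}

businesses = [
    {"name": "Eric Goldberg, MD", "categories": ["Doctors", "Health & Medical"]},
    {"name": "Pine Cone Restaurant", "categories": ["Restaurants"]},
]

flat = list(flatten(businesses, "categories"))
for row in flat:
    print(row["name"], "|", row["categories"])
```

Once the array is flattened into ordinary rows, standard SQL operators such as GROUP BY can be applied to the individual values.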
Most and Least Common Business Categories
> SELECT category, count(*) AS businesses
FROM (SELECT name, FLATTEN(categories) AS category
FROM dfs.demo.`yelp/business.json`) c
GROUP BY category ORDER BY businesses DESC;
+-----------------------------------+-------------+
| category | businesses |
+-----------------------------------+-------------+
| Restaurants | 14303 |
| Shopping | 6428 |
…
| Australian | 1 |
| Boat Dealers | 1 |
| Firewood | 1 |
+-----------------------------------+-------------+
715 rows selected (3.439 seconds)
> SELECT name, categories FROM dfs.demo.`yelp/business.json` WHERE true and
REPEATED_CONTAINS(categories, 'Australian');
+------+------------+
| name | categories |
+------+------------+
| The Australian AZ | ["Bars","Burgers","Nightlife","Australian","Sports Bars","Restaurants"] |
+------+------------+
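The two queries above, counting businesses per flattened category and then finding rows whose array contains a value, can be imitated in a few lines of Python. This is an illustrative sketch over three rows copied from the slides; `repeated_contains` here only does exact matching and is not a reimplementation of Drill's REPEATED_CONTAINS.

```python
from collections import Counter

def repeated_contains(values, wanted):
    """Exact-match stand-in for REPEATED_CONTAINS on a list column."""
    return wanted in values

businesses = [
    {"name": "Pine Cone Restaurant", "categories": ["Restaurants"]},
    {"name": "Deforest Family Restaurant",
     "categories": ["American (Traditional)", "Restaurants"]},
    {"name": "The Australian AZ",
     "categories": ["Bars", "Burgers", "Nightlife", "Australian",
                    "Sports Bars", "Restaurants"]},
]

# FLATTEN + GROUP BY category + count(*)
counts = Counter(cat for b in businesses for cat in b["categories"])

# WHERE REPEATED_CONTAINS(categories, 'Australian')
matches = [b["name"] for b in businesses
           if repeated_contains(b["categories"], "Australian")]

print(counts.most_common(1))
print(matches)
```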
4. LEVERAGING VIEWS
Create a View for Name-Gender Mapping
> CREATE VIEW dfs.tmp.`names` AS
SELECT columns[0] AS name, columns[4] AS gender
FROM dfs.demo.`names.csv`;
> USE dfs.tmp;
> CREATE VIEW names1 AS
SELECT columns[0] AS name, columns[4] AS gender
FROM dfs.demo.`names.csv`;
> SELECT * FROM dfs.tmp.names WHERE name = 'John';
+------------+------------+
| name | gender |
+------------+------------+
| John | Male |
+------------+------------+
names.csv: columns[0] is the name and columns[4] is the gender
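The view simply gives names to positional CSV columns. A comparable projection in Python looks like the sketch below. The sample rows are hypothetical, since the slides do not show the contents of names.csv beyond the fact that column 0 is the name and column 4 is the gender; the placeholder "x" fields and the `names_view` function are assumptions.

```python
import csv
import io

# Hypothetical file contents: only columns 0 and 4 are meaningful here.
NAMES_CSV = "John,x,x,x,Male\nJennifer,x,x,x,Female\n"

def names_view(text):
    """Project columns[0] AS name and columns[4] AS gender, like the view."""
    for columns in csv.reader(io.StringIO(text)):
        yield {"name": columns[0], "gender": columns[4]}

# SELECT * FROM names WHERE name = 'John'
rows = [row for row in names_view(NAMES_CSV) if row["name"] == "John"]
print(rows)
```

Like the Drill view, the generator stores no data of its own; it is evaluated against the file each time it is read.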
Most Common Names (and their Genders) on Yelp
> SELECT u.name, n.gender, count(*) AS number
FROM mongo.yelp.users u, dfs.tmp.names n
WHERE u.name = n.name
GROUP BY u.name, n.gender
ORDER BY number DESC LIMIT 10;
+------------+------------+------------+
| name | gender | number |
+------------+------------+------------+
| David | Male | 2453 |
| John | Male | 2378 |
| Michael | Male | 2322 |
| Chris | Unknown | 2202 |
| Mike | Male | 2037 |
| Jennifer | Female | 1867 |
| Jessica | Female | 1463 |
| Jason | Male | 1457 |
| Michelle | Female | 1439 |
| Brian | Male | 1436 |
+------------+------------+------------+
Who Rates Higher – Men or Women?
> SELECT n.gender, count(*) AS users, round(avg(average_stars), 2) stars
FROM mongo.yelp.users u, dfs.tmp.names n
WHERE u.name = n.name
GROUP BY n.gender;
+------------+------------+------------+
| gender | users | stars |
+------------+------------+------------+
| Female | 103684 | 3.77 |
| Male | 97430 | 3.696 |
| Unknown | 18409 | 3.727 |
+------------+------------+------------+
Who Writes Longer Reviews – Men or Women?
> SELECT n.gender, round(avg(length(r.text))) AS review_length
FROM dfs.demo.`yelp/review.json` r,
mongo.yelp.users u,
dfs.tmp.names n
WHERE u.name = n.name AND r.user_id = u.user_id
GROUP BY n.gender;
+------------+---------------+
| gender | review_length |
+------------+---------------+
| Male | 665 |
| Female | 730 |
| Unknown | 711 |
+------------+---------------+
It takes a 3-way join to find out…
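The three-way join can be sketched in Python as two dictionary lookups followed by an average per group. All data below is made up for illustration, and the sketch assumes each name and each user_id appears only once on the lookup side, which a SQL join does not require. It shows the shape of the computation, not how Drill executes the join.

```python
from collections import defaultdict

# Hypothetical stand-ins for the three sources in the query.
reviews = [  # dfs.demo.`yelp/review.json`
    {"user_id": "u1", "text": "Great hummus"},
    {"user_id": "u2", "text": "OK"},
    {"user_id": "u1", "text": "Nice"},
]
users = [  # mongo.yelp.users
    {"user_id": "u1", "name": "Jennifer"},
    {"user_id": "u2", "name": "John"},
]
names = [  # dfs.tmp.names
    {"name": "Jennifer", "gender": "Female"},
    {"name": "John", "gender": "Male"},
]

# Build lookup tables for the two join keys.
name_by_user = {u["user_id"]: u["name"] for u in users}
gender_by_name = {n["name"]: n["gender"] for n in names}

# Probe with each review and collect text lengths per gender.
lengths = defaultdict(list)
for review in reviews:
    name = name_by_user.get(review["user_id"])
    gender = gender_by_name.get(name)
    if gender is not None:  # inner join: drop rows without a match
        lengths[gender].append(len(review["text"]))

review_length = {g: round(sum(v) / len(v)) for g, v in lengths.items()}
print(review_length)
```

The point of the slide is that the three inputs live in different systems (a JSON file, a MongoDB collection, and a view over a CSV file), yet a single SQL statement joins them.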
Thank You!
• Download at drill.apache.org
• Get in touch:
• tshiran@apache.org
• jnadeau@apache.org
• Ask questions:
• user@drill.apache.org
• Tweet: @ApacheDrill

Drilling into Data with Apache Drill

  • 1. Drilling into Data with Apache Drill Tomer Shiran, Apache Drill Founder and PMC Member Jacques Nadeau, Apache Drill PMC Chair
  • 2. Tomer Shiran Jacques Nadeau tshiran@apache.org jnadeau@apache.org @tshiran @intjesus Drill founder and PMC Member Drill PMC Chair (VP, Apache Drill)
  • 3. Apache Drill • Open source SQL query engine for non-relational datastores – JSON document model – Columnar • Key advantages: – Query any non-relational datastore – No overhead (creating and maintaining schemas, transforming data, …) – Treat your data like a table even when it’s not – Keep using the BI tools you love – Scales from one laptop to 1000s of servers – Great performance and scalability
  • 4. Omni-SQL (“SQL-on-Everything”) Drill: Omni-SQL Whereas the other engines we're discussing here create a relational database environment on top of Hadoop, Drill instead enables a SQL language interface to data in numerous formats, without requiring a formal schema to be declared. This enables plug-and-play discovery over a huge universe of data without prerequisites and preparation. So while Drill uses SQL, and can connect to Hadoop, calling it SQL-on-Hadoop kind of misses the point. A better name might be SQL-on-Everything, with very low setup requirements. Andrew Brust, “ ”
  • 5. Any Non-Relational Datastore • File systems – Traditional: Local files and NAS – Hadoop: HDFS and MapR-FS – Cloud storage: Amazon S3, Google Cloud Storage, Azure Blob Storage • NoSQL databases – MongoDB – HBase – MapR-DB – Hive • And you can add new datastores Any Client • Multiple interfaces: ODBC, JDBC, REST, C, Java • BI tools – Tableau – Qlik – MicroStrategy – TIBCO Spotfire – Excel • Command line (Drill shell) • Web and mobile apps – Many JSON-powered chart libraries (see D3.js) • SAS, R, … Drill Integrates With What You Have
  • 6. Achieving “End-to-End Performance” Execute fast • Standard SQL • Read data fast • Leverage columnar encodings and execution • Execute operations quickly • Scale out, not up Iterate fast • Work without prep • Decentralize data management • In-situ security • Explore + query • Access multiple sources • Avoid the ETL rinse cycle
  • 7. JSON Model, Columnar Speed JSON BSON Mongo HBase NoSQL Parquet Avro CSV TSV Schema-lessFixed schema Flat Complex Name Gender Age Michael M 6 Jennifer F 3 { name: { first: Michael, last: Smith }, hobbies: [ski, soccer], district: Los Altos } { name: { first: Jennifer, last: Gates }, hobbies: [sing], preschool: CCLC } RDBMS/SQL-on-Hadoop table Apache Drill table
  • 8. Apache Drill Provides the Best of Both Worlds Acts Like a Database • ANSI SQL: SELECT, FROM, WHERE, JOIN, HAVING, ORDER BY, WITH, CTAS, ALL, EXISTS, ANY, IN, SOME • VarChar, Int, BigInt, Decimal, VarBinary, Timestamp, Float, Double, etc. • Subqueries, scalar subqueries, partition pruning, CTE • Data warehouse offload • Tableau, ODBC, JDBC • TPC-H & TPC-DS-like workloads • Supports Hive SerDes • Supports Hive UDFs • Supports Hive Metastore Even When Your Data Doesn’t • Path based queries and wildcards – select * from /my/logs/ – select * from /revenue/*/q2 • Modern data types – Map, Array, Any • Complex Functions and Relational Operators – FLATTEN, kvgen, convert_from, convert_to, repeated_count, etc • JSON Sensor analytics • Complex data analysis • Alternative DSLs
  • 9. Why? To Support the Changing Data Organization Data Dev Circa 2000 1. Developer comes up with requirements 2. DBA defines tables 3. DBA defines indices 4. DBA defines FK relationships 5. Developer stores data 6. BI builds reports 7. Analyst views reports 8. DBA adds materialized views Data Today 1. Developer builds app, defines schema, stores data 2. Analyst queries data 3. Data engineer fixes performance problems or fills functionality gaps
  • 10. HOW DOES IT WORK?
  • 11. Everything Starts With a Drillbit… • High performance query executor • In-memory columnar execution • Directly interacts with data, acquiring knowledge as it reads • Built to leverage large amounts of memory • Networked or not • Exposes ODBC, JDBC, REST • Built-in Web UI and CLI • Extensible Drillbit Single process (daemon or CLI)
  • 12. Data Lake, More Like Data Maelstrom HDFS HDFS mongod mongod HDFS HDFS HBase HBase Cassandra Cassandra HDFS HDFS HBase Windows Desktop Mac Desktop HBase & HDFS Cluster HDFS Cluster MongoDB Cluster Cassandra Cluster DesktopClustered Servers
  • 13. Run Drillbits Wherever; Whatever Your Data Drillbit HDFS HDFS mongod mongod HDFS HDFS HBase HBase Drillbit DrillbitDrillbit Drillbit Drillbit Cassandra Cassandra Drillbit Drillbit HDFS HDFS HBase Drillbit Drillbit Windows Desktop Drillbit Mac Desktop Drillbit
  • 14. Connect to Any Drillbit with ODBC, JDBC, C, Java, REST 1. User connects to Drillbit 2. That Drillbit becomes Foreman – Foreman generates execution plan – Cost-based query optimization & locality 3. Execution fragments are farmed to other Drillbits 4. Drillbits exchange data as necessary to guarantee relational algebra 5. Results are returned to user through Foreman Drillbit User Drillbit Drillbit (foreman)
  • 16. 1. DOWNLOAD AND INSTALL DRILL
  • 17. Run Drill in Embedded Mode (drill-embedded) $ tar xf apache-drill-1.0.0.tar.gz $ cd apache-drill-1.0.0 $ bin/drill-embedded > SELECT * FROM dfs.root.`/Users/tshiran/yelp/user.json` LIMIT 1; +----------------+----------------------------------+---------------+-------+ | yelping_since | votes | review_count | name | +----------------+----------------------------------+---------------+-------+ | 2012-02 | {"funny":1,"useful":5,"cool":0} | 6 | Lee | +----------------+----------------------------------+---------------+-------+ • drillbit (Drill daemon) starts automatically in embedded mode • No ZooKeeper in embedded mode • Web UI is available at localhost:8047
  • 18. Review the Query Profile in the Web UI (localhost:8047)
  • 19. Run Drill in Distributed Mode $ zkServer start # ZooKeeper maintains the list of drillbits in the cluster $ bin/drillbit.sh start # conf/drill-override.conf includes cluster name and ZK nodes $ bin/drill-conf # or bin/drill-localhost to skip ZK lookup > SELECT stars, count(*) FROM dfs.root.`/Users/tshiran/yelp/review.json` GROUP BY stars ORDER BY stars; +--------+---------+ | stars | EXPR$1 | +--------+---------+ | 1 | 110772 | | 2 | 102737 | | 3 | 163761 | | 4 | 342143 | | 5 | 406045 | +--------+---------+ 5 rows selected (3.739 seconds)
  • 22. Define Workspaces in the File Storage Plugin
  • 24. The Data: Files { "votes": {"funny": 0, "useful": 2, "cool": 1}, "user_id": "Xqd0DzHaiyRqVH3WRG7hzg", "review_id": "15SdjuK7DmYqUAj6rjGowg", "stars": 5, "date": "2007-05-17", "text": "dr. goldberg offers everything ...", "type": "review", "business_id": "vcNAWiLM4dR7D2nwwJ7nCA" }
  • 25. The Data: MongoDB Collections
      $ mongo
      MongoDB shell version: 2.6.5
      > show databases;
      admin  (empty)
      local  0.078GB
      yelp   0.453GB
      > use yelp
      > db.users.findOne()
      {
        "_id" : ObjectId("54566cdf3237149de181a92a"),
        "yelping_since" : "2012-02",
        "votes" : { "funny" : 1, "useful" : 5, "cool" : 0 },
        "review_count" : 6,
        "name" : "Lee",
        "user_id" : "qtrmBGNqCvupHMHL_bKFgQ",
        "friends" : [ ]
      }
  • 26. Are There More 5-Star or 1-Star Reviews?
      > SELECT stars, count(*) FROM dfs.root.`/Users/tshiran/yelp/review.json` GROUP BY stars ORDER BY stars;
      +--------+---------+
      | stars  | EXPR$1  |
      +--------+---------+
      | 1      | 110772  |
      | 2      | 102737  |
      | 3      | 163761  |
      | 4      | 342143  |
      | 5      | 406045  |
      +--------+---------+
      5 rows selected (3.739 seconds)
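  • 27. Using Storage Plugins and Workspaces
      > SELECT * FROM dfs.root.`/Users/tshiran/data/yelp/review.json` LIMIT 1;
      > SELECT * FROM dfs.demo.`yelp/review.json` LIMIT 1;
      > SELECT * FROM mongo.yelp.users LIMIT 1;
      > USE mongo.yelp;
      > SELECT * FROM users LIMIT 1;
      [Callouts on the second query: Storage plugin (dfs), Workspace (demo), Path relative to workspace (yelp/review.json)]
      +-----------------+------------+-----------------------------+
      | Storage Plugin  | Workspace  | Table                       |
      +-----------------+------------+-----------------------------+
      | dfs             | Path       | Path relative to workspace  |
      | mongo           | Database   | Collection                  |
      | hive            | Database   | Table                       |
      | hbase           | Namespace  | Table                       |
      +-----------------+------------+-----------------------------+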
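The first two queries on this slide read the same file, which implies that the `demo` workspace is rooted at `/Users/tshiran/data`. The Python sketch below models that name resolution. The `WORKSPACES` mapping and the `resolve` helper are hypothetical, and treating a leading slash as relative to the workspace root is an assumption about Drill's behaviour rather than something the slides state.

```python
import posixpath

# Hypothetical workspace roots; "demo" is inferred from the two equivalent queries.
WORKSPACES = {
    "root": "/",
    "demo": "/Users/tshiran/data",
}


def resolve(workspace, path):
    """Map a (workspace, table path) pair to an absolute file system path."""
    root = WORKSPACES[workspace]
    # Assumption: a leading slash is still interpreted relative to the workspace root.
    return posixpath.normpath(posixpath.join(root, path.lstrip("/")))


via_root = resolve("root", "/Users/tshiran/data/yelp/review.json")
via_demo = resolve("demo", "yelp/review.json")
```

Both calls produce the same absolute path, which is why the shorter `dfs.demo` form can replace the full path in later queries.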
  • 28. Most Common User Names (MongoDB)
      > SELECT name, count(*) AS users FROM mongo.yelp.users GROUP BY name ORDER BY users DESC LIMIT 10;
      +------------+------------+
      | name       | users      |
      +------------+------------+
      | David      | 2453       |
      | John       | 2378       |
      | Michael    | 2322       |
      | Chris      | 2202       |
      | Mike       | 2037       |
      | Jennifer   | 1867       |
      | Jessica    | 1463       |
      | Jason      | 1457       |
      | Michelle   | 1439       |
      | Brian      | 1436       |
      +------------+------------+
  • 29. Cities with the Most Businesses
      > SELECT state, city, count(*) AS businesses FROM dfs.demo.`/yelp/business.json` GROUP BY state, city ORDER BY businesses DESC LIMIT 10;
      +------------+------------+-------------+
      | state      | city       | businesses  |
      +------------+------------+-------------+
      | NV         | Las Vegas  | 12021       |
      | AZ         | Phoenix    | 7499        |
      | AZ         | Scottsdale | 3605        |
      | EDH        | Edinburgh  | 2804        |
      | AZ         | Mesa       | 2041        |
      | AZ         | Tempe      | 2025        |
      | NV         | Henderson  | 1914        |
      | AZ         | Chandler   | 1637        |
      | WI         | Madison    | 1630        |
      | AZ         | Glendale   | 1196        |
      +------------+------------+-------------+
  • 31. business.json (1) { "business_id": "4bEjOyTaDG24SY5TxsaUNQ", "full_address": "3655 Las Vegas Blvd S\nThe Strip\nLas Vegas, NV 89109", "hours": { "Monday": {"close": "23:00", "open": "07:00"}, "Tuesday": {"close": "23:00", "open": "07:00"}, "Friday": {"close": "00:00", "open": "07:00"}, "Wednesday": {"close": "23:00", "open": "07:00"}, "Thursday": {"close": "23:00", "open": "07:00"}, "Sunday": {"close": "23:00", "open": "07:00"}, "Saturday": {"close": "00:00", "open": "07:00"} }, "open": true, "categories": ["Breakfast & Brunch", "Steakhouses", "French", "Restaurants"], "city": "Las Vegas", "review_count": 4084, "name": "Mon Ami Gabi", "neighborhoods": ["The Strip"], "longitude": -115.172588519464,
  • 32. business.json (2) "state": "NV", "stars": 4.0, "attributes": { "Alcohol": "full_bar", "Noise Level": "average", "Has TV": false, "Attire": "casual", "Ambience": { "romantic": true, "intimate": false, "touristy": false, "hipster": false, "classy": true, "trendy": false, "casual": false }, "Good For": {"dessert": false, "latenight": false, "lunch": false, "dinner": true, "breakfast": false, "brunch": false} } }
  • 33. Which Places Are Open Right Now (22:00)?
      > SELECT name, b.hours
        FROM dfs.demo.`yelp/business.json` b
        WHERE b.hours.Saturday.`open` < '22:00' AND b.hours.Saturday.`close` > '22:00'
        LIMIT 2;
      +------------------------------+------------------------------------------------+
      | name                         | hours                                          |
      +------------------------------+------------------------------------------------+
      | Chang Jiang Chinese Kitchen  | {"Saturday":{"close":"22:30","open":"11:00"}}  |
      | Grand China Restaurant       | {"Saturday":{"close":"23:00","open":"11:00"}}  |
      +------------------------------+------------------------------------------------+
  • 34. It’s 10pm in Vegas and I Want Good Hummus!
      > SELECT name, b.hours.Friday AS friday, categories
        FROM dfs.demo.`yelp/business.json` b
        WHERE b.hours.Friday.`open` < '22:00' AND b.hours.Friday.`close` > '22:00'
          AND REPEATED_CONTAINS(categories, 'Mediterranean') AND city = 'Las Vegas'
        ORDER BY stars DESC LIMIT 2;
      +--------------------------------+-----------------------------------+--------------------------------------------------------------+
      | name                           | friday                            | categories                                                   |
      +--------------------------------+-----------------------------------+--------------------------------------------------------------+
      | Olives                         | {"close":"22:30","open":"11:00"}  | ["Mediterranean","Restaurants"]                              |
      | Marrakech Moroccan Restaurant  | {"close":"23:00","open":"17:30"}  | ["Mediterranean","Middle Eastern","Moroccan","Restaurants"]  |
      +--------------------------------+-----------------------------------+--------------------------------------------------------------+
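The filters in the last two queries combine nested-field access, string comparison of zero-padded "HH:MM" times, and membership in a repeated field. The Python sketch below mimics that logic on records taken from the slides. The helper names are hypothetical, and `repeated_contains` does exact matching only, which simplifies whatever the real `REPEATED_CONTAINS` function supports.

```python
def get_path(record, *keys):
    """Follow nested keys, returning None as soon as a level is missing."""
    value = record
    for key in keys:
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


def open_at(record, day, time):
    """True if opening hours bracket `time`, compared as "HH:MM" strings."""
    opens = get_path(record, "hours", day, "open")
    closes = get_path(record, "hours", day, "close")
    if opens is None or closes is None:
        return False  # mirrors a SQL comparison against a missing (NULL) value
    return opens < time and closes > time


def repeated_contains(values, needle):
    """Exact-match membership test over a repeated (list) field."""
    return needle in (values or [])


olives = {
    "name": "Olives",
    "hours": {"Friday": {"close": "22:30", "open": "11:00"}},
    "categories": ["Mediterranean", "Restaurants"],
}
mon_ami_gabi = {
    "name": "Mon Ami Gabi",
    "hours": {"Friday": {"close": "00:00", "open": "07:00"}},
    "categories": ["Breakfast & Brunch", "Steakhouses", "French", "Restaurants"],
}
```

One thing the sketch makes visible: Mon Ami Gabi closes at "00:00" on Fridays, so the comparison `close > '22:00'` is false for it even though the restaurant is open at 22:00. A filter that must handle closing times after midnight needs an extra condition.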
  • 35. Flatten Repeated Values
      > SELECT name, categories FROM dfs.demo.`yelp/business.json` LIMIT 3;
      +-----------------------------+-------------------------------------------+
      | name                        | categories                                |
      +-----------------------------+-------------------------------------------+
      | Eric Goldberg, MD           | ["Doctors","Health & Medical"]            |
      | Pine Cone Restaurant        | ["Restaurants"]                           |
      | Deforest Family Restaurant  | ["American (Traditional)","Restaurants"]  |
      +-----------------------------+-------------------------------------------+
      > SELECT name, FLATTEN(categories) AS categories FROM dfs.demo.`yelp/business.json` LIMIT 5;
      +-----------------------------+-------------------------+
      | name                        | categories              |
      +-----------------------------+-------------------------+
      | Eric Goldberg, MD           | Doctors                 |
      | Eric Goldberg, MD           | Health & Medical        |
      | Pine Cone Restaurant        | Restaurants             |
      | Deforest Family Restaurant  | American (Traditional)  |
      | Deforest Family Restaurant  | Restaurants             |
      +-----------------------------+-------------------------+
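What FLATTEN does to a repeated field can be modelled in a few lines of Python. This is a sketch of the semantics shown in the two result tables, not Drill's implementation; the `flatten` helper is hypothetical, and dropping records whose list is empty or missing is an assumption of this sketch.

```python
def flatten(records, field):
    """Yield one copy of each record per element of the repeated field."""
    for record in records:
        # Records with an empty or missing list produce no rows in this sketch.
        for element in record.get(field) or []:
            yield {**record, field: element}


businesses = [
    {"name": "Eric Goldberg, MD", "categories": ["Doctors", "Health & Medical"]},
    {"name": "Pine Cone Restaurant", "categories": ["Restaurants"]},
    {"name": "Deforest Family Restaurant",
     "categories": ["American (Traditional)", "Restaurants"]},
]
rows = list(flatten(businesses, "categories"))
```

Three input records become five rows, one per category, matching the second table on the slide.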
  • 36. Most and Least Common Business Categories
      > SELECT category, count(*) AS businesses
        FROM (SELECT name, FLATTEN(categories) AS category FROM dfs.demo.`yelp/business.json`) c
        GROUP BY category ORDER BY businesses DESC;
      +-----------------------------------+-------------+
      | category                          | businesses  |
      +-----------------------------------+-------------+
      | Restaurants                       | 14303       |
      | Shopping                          | 6428        |
      …
      | Australian                        | 1           |
      | Boat Dealers                      | 1           |
      | Firewood                          | 1           |
      +-----------------------------------+-------------+
      715 rows selected (3.439 seconds)
      > SELECT name, categories FROM dfs.demo.`yelp/business.json` WHERE true and REPEATED_CONTAINS(categories, 'Australian');
      +--------------------+--------------------------------------------------------------------------+
      | name               | categories                                                               |
      +--------------------+--------------------------------------------------------------------------+
      | The Australian AZ  | ["Bars","Burgers","Nightlife","Australian","Sports Bars","Restaurants"]  |
      +--------------------+--------------------------------------------------------------------------+
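The first query on this slide is a flatten followed by a group-by count. A rough Python equivalent, using the three sample businesses from slide 35 rather than the full data set, might look like this; `category_counts` is a hypothetical helper.

```python
from collections import Counter


def category_counts(businesses):
    """Count businesses per category, most common first."""
    counts = Counter(
        category
        for business in businesses
        for category in business.get("categories") or []
    )
    return counts.most_common()


sample = [
    {"name": "Eric Goldberg, MD", "categories": ["Doctors", "Health & Medical"]},
    {"name": "Pine Cone Restaurant", "categories": ["Restaurants"]},
    {"name": "Deforest Family Restaurant",
     "categories": ["American (Traditional)", "Restaurants"]},
]
counts = category_counts(sample)
```

The nested generator plays the part of the FLATTEN subquery, and `Counter` performs the GROUP BY and ORDER BY steps.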
  • 38. Create a View for Name-Gender Mapping
      > CREATE VIEW dfs.tmp.`names` AS SELECT columns[0] AS name, columns[4] AS gender FROM dfs.demo.`names.csv`;
      > USE dfs.tmp;
      > CREATE VIEW names1 AS SELECT columns[0] AS name, columns[4] AS gender FROM dfs.demo.`names.csv`;
      > SELECT * FROM dfs.tmp.names WHERE name = 'John';
      +------------+------------+
      | name       | gender     |
      +------------+------------+
      | John       | Male       |
      +------------+------------+
      [Figure: names.csv, with columns[0] and columns[4] labeled]
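The view gives names to positional CSV fields: each row arrives as a `columns` array, and the view projects elements 0 and 4. A Python sketch of the same projection follows. The sample rows are hypothetical (the slides do not show the contents of the middle columns of names.csv), as is the `names_view` helper.

```python
import csv
import io


def names_view(csv_text):
    """Project columns[0] and columns[4] of each CSV row as name and gender."""
    for columns in csv.reader(io.StringIO(csv_text)):
        yield {"name": columns[0], "gender": columns[4]}


# Hypothetical file contents; only the first and fifth fields matter here.
sample_csv = "John,a,b,c,Male\nJennifer,a,b,c,Female\n"
view = list(names_view(sample_csv))
john = [row for row in view if row["name"] == "John"]
```

Like the view, the generator stores nothing itself; it recomputes the projection each time it is read.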
  • 39. Most Common Names (and their Genders) on Yelp
      > SELECT u.name, n.gender, count(*) AS number
        FROM mongo.yelp.users u, dfs.tmp.names n
        WHERE u.name = n.name
        GROUP BY u.name, n.gender ORDER BY number DESC LIMIT 10;
      +------------+------------+------------+
      | name       | gender     | number     |
      +------------+------------+------------+
      | David      | Male       | 2453       |
      | John       | Male       | 2378       |
      | Michael    | Male       | 2322       |
      | Chris      | Unknown    | 2202       |
      | Mike       | Male       | 2037       |
      | Jennifer   | Female     | 1867       |
      | Jessica    | Female     | 1463       |
      | Jason      | Male       | 1457       |
      | Michelle   | Female     | 1439       |
      | Brian      | Male       | 1436       |
      +------------+------------+------------+
  • 40. Who Rates Higher – Men or Women?
      > SELECT n.gender, count(*) AS users, round(avg(average_stars), 2) stars
        FROM mongo.yelp.users u, dfs.tmp.names n
        WHERE u.name = n.name
        GROUP BY n.gender;
      +------------+------------+------------+
      | gender     | users      | stars      |
      +------------+------------+------------+
      | Female     | 103684     | 3.77       |
      | Male       | 97430      | 3.696      |
      | Unknown    | 18409      | 3.727      |
      +------------+------------+------------+
  • 41. Who Writes Longer Reviews – Men or Women?
      It takes a 3-way join to find out…
      > SELECT n.gender, round(avg(length(r.text))) AS review_length
        FROM dfs.demo.`yelp/review.json` r, mongo.yelp.users u, dfs.tmp.names n
        WHERE u.name = n.name AND r.user_id = u.user_id
        GROUP BY n.gender;
      +------------+---------------+
      | gender     | review_length |
      +------------+---------------+
      | Male       | 665           |
      | Female     | 730           |
      | Unknown    | 711           |
      +------------+---------------+
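The 3-way join links a JSON file, a MongoDB collection, and a CSV-backed view through two equality conditions. The Python sketch below reproduces the shape of that computation with dictionary lookups standing in for the joins. All sample data and the `average_review_length` helper are hypothetical, and the sketch assumes each name appears once in the names view and each `user_id` once in the users collection.

```python
from collections import defaultdict


def average_review_length(reviews, users, names):
    """Average review text length per gender, joining reviews -> users -> names."""
    # Join 1: users to names on the name field.
    gender_by_name = {n["name"]: n["gender"] for n in names}
    gender_by_user_id = {
        u["user_id"]: gender_by_name[u["name"]]
        for u in users
        if u["name"] in gender_by_name  # inner join: unmatched users drop out
    }
    # Join 2: reviews to users on user_id, then group by gender.
    lengths = defaultdict(list)
    for review in reviews:
        gender = gender_by_user_id.get(review["user_id"])
        if gender is not None:
            lengths[gender].append(len(review["text"]))
    return {gender: round(sum(v) / len(v)) for gender, v in lengths.items()}


names = [{"name": "John", "gender": "Male"}, {"name": "Jennifer", "gender": "Female"}]
users = [
    {"user_id": "u1", "name": "John"},
    {"user_id": "u2", "name": "Jennifer"},
    {"user_id": "u3", "name": "Zed"},  # no entry in the names view
]
reviews = [
    {"user_id": "u1", "text": "abcd"},
    {"user_id": "u1", "text": "ab"},
    {"user_id": "u2", "text": "abcdef"},
    {"user_id": "u3", "text": "x"},
]
result = average_review_length(reviews, users, names)
```

Building the two lookup tables first and then streaming the reviews past them is a hash-join pattern; which join strategy Drill actually picks for this query is not stated on the slide.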
  • 42. Thank You!
      • Download at drill.apache.org
      • Get in touch:
          • tshiran@apache.org
          • jnadeau@apache.org
      • Ask questions:
          • user@drill.apache.org
      • Tweet: @ApacheDrill

Editor's Notes

  • #8: All other SQL engines (traditional or SQL-on-Hadoop) view tables as spreadsheet-like data structures with rows and columns: all records have the same structure, and there is no support for nested data or repeating fields. Drill instead views tables conceptually as collections of JSON documents (with additional types), so each record can have a different structure (hence, schema-less). This is revolutionary and has never been done before. Of the four data models shown in the 2x2, every one can be represented by the complex, schema-less model (JSON) because it is the most flexible, whereas the flat, fixed-schema model cannot represent any of the others. Therefore, with any SQL engine except Drill, the data has to be transformed before it can be queried.