Apache Drill - Why, What, How
© MapR Technologies, confidential 
© 2014 MapR Technologies 
M. C. Srivas, CTO and Founder
MapR is Unbiased Open Source
Linux Is Unbiased 
• Linux provides choice 
– MySQL 
– PostgreSQL 
– SQLite 
• Linux provides choice 
– Apache httpd 
– Nginx 
– Lighttpd
MapR Is Unbiased 
• MapR provides choice 
                 MapR Distribution for Hadoop        Distribution C        Distribution H
Spark            Spark (all of it) and SparkSQL      Spark only            No
Interactive SQL  Impala, Drill, Hive/Tez, SparkSQL   One option (Impala)   One option (Hive/Tez)
Scheduler        YARN, Mesos                         One option (YARN)     One option (YARN)
Versions         Hive 0.10, 0.11, 0.12, 0.13;        One version           One version
                 Pig 0.11, 0.12; HBase 0.94, 0.98
MapR Distribution for Apache Hadoop

APACHE HADOOP AND OSS ECOSYSTEM (execution engines, data governance and operations)
• Batch: Pig, Cascading, Spark
• Streaming: Spark Streaming, Storm*
• NoSQL & Search: HBase, Solr, Accumulo*
• SQL: Hive, Impala, Shark, Drill*, Tez*
• ML, Graph: Mahout, MLLib, GraphX
• Execution: MapReduce v1 & v2, YARN
• Provisioning & Coordination: Juju, Savannah*, ZooKeeper, Whirr
• Workflow & Data Governance: Oozie, Falcon*
• Data Integration & Access: Sqoop, Flume, HttpFS, Hue
• Security: Sentry*, Knox*

MapR Data Platform (Random Read/Write), an enterprise-grade data hub
• MapR-FS (POSIX): NFS, HDFS API
• MapR-DB (High-Performance NoSQL): HBase API, JSON API

MapR Control System (Management and Monitoring): CLI, REST API, GUI

* In roadmap for inclusion/certification
Hadoop an augmentation for EDW—Why?
But inside, it looks like this …
And this …
And this …
Consolidating schemas is very hard, and it causes silos
Silos make analysis very difficult 
• How do I identify a unique {customer, trade} across data sets? 
• How can I guarantee the lack of anomalous behavior if I can’t see all data?
Hard to know what’s of value a priori
Why Hadoop
Rethink SQL for Big Data 

Preserve 
• ANSI SQL: familiar and ubiquitous 
• Performance: interactive nature crucial for BI/analytics 
• One technology: painful to manage different technologies 
• Enterprise ready: system-of-record, HA, DR, security, multi-tenancy, … 

Invent 
• Flexible data model: allow schemas to evolve rapidly; support semi-structured data types 
• Agility: self-service is possible when the developer and the DBA are the same person 
• Scalability: in all dimensions — data, speed, schemas, processes, management
SQL is here to stay
Hadoop is here to stay
YOU CAN’T HANDLE REAL SQL
SQL 
select * from A 
where exists ( 
select 1 from B where B.b < 100 ); 
• Did you know Apache Hive cannot compute it? 
  – Neither can its variants, e.g., Impala and Spark/Shark
Self-described Data 
select cf.month, cf.year 
from hbase.table1; 
• Did you know standard SQL cannot handle the above? 
• Neither can Hive or its variants, like Impala and Shark
Self-described Data 
select cf.month, cf.year 
from hbase.table1; 
• Why? 
• Because there’s no meta-store definition available
Self-Describing Data Ubiquitous 

Centralized schema 
- Static 
- Managed by the DBAs 
- In a centralized repository 
- Long, meticulous data preparation process (ETL, create/alter schema, etc.) can take 6-18 months 

Self-describing, or schema-less, data 
- Dynamic/evolving 
- Managed by the applications 
- Embedded in the data 
- Less schema; more suitable for data with higher volume, variety and velocity 

Apache Drill
A Quick Tour through Apache Drill
Data Source is in the Query 

select timestamp, message 
from dfs1.logs.`AppServerLogs/2014/Jan/p001.parquet` 
where errorLevel > 2 

• dfs1 is a cluster in Apache Drill: DFS, HBase, or a Hive meta-store 
• logs is a work-space: typically a sub-directory, or a Hive database 
• AppServerLogs/2014/Jan/p001.parquet is a table: a pathname, an HBase table, or a Hive table
Combine data sources on the fly 
• JSON 
• CSV 
• ORC (i.e., all Hive types) 
• Parquet 
• HBase tables 
• … and you can combine them: 

select USERS.name, USERS.emails.work 
from 
  dfs.logs.`/data/logs` LOGS, 
  dfs.users.`/profiles.json` USERS 
where 
  LOGS.uid = USERS.uid and 
  errorLevel > 5 
order by count(*);
Can be an entire directory tree 

// On a file 
select errorLevel, count(*) 
from dfs.logs.`/AppServerLogs/2014/Jan/part0001.parquet` 
group by errorLevel; 

// On the entire data collection: all years, all months 
// dirs[1] and dirs[2] refer to the path components below the root (here, year and month) 
select errorLevel, count(*) 
from dfs.logs.`/AppServerLogs` 
where dirs[1] > 2012 
group by errorLevel, dirs[2];
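The dirs[] pruning above can be sketched in plain Python. This is an illustration of the idea only, not Drill itself, and the paths are hypothetical:

```python
# Sketch of Drill's dirs[] idea: dirs[1] and dirs[2] are the directory
# components below the queried root (here, year and month).
paths = [
    "/AppServerLogs/2012/Dec/part0001.parquet",
    "/AppServerLogs/2013/Jan/part0001.parquet",
    "/AppServerLogs/2014/Jan/part0001.parquet",
    "/AppServerLogs/2014/Feb/part0002.parquet",
]

def dir_component(path, root, n):
    """Return the n-th directory level below root (1-based, like dirs[n] here)."""
    rel = path[len(root):].strip("/").split("/")
    return rel[n - 1]

# "where dirs[1] > 2012": keep only files whose year component exceeds 2012
kept = [p for p in paths if int(dir_component(p, "/AppServerLogs", 1)) > 2012]
print(kept)
```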
Querying JSON 

donuts.json: 

{ "name": "classic", 
  "fillings": [ 
    { "name": "sugar", "cal": 400 } ] } 
{ "name": "choco", 
  "fillings": [ 
    { "name": "sugar", "cal": 400 }, 
    { "name": "chocolate", "cal": 300 } ] } 
{ "name": "bostoncreme", 
  "fillings": [ 
    { "name": "sugar", "cal": 400 }, 
    { "name": "cream", "cal": 1000 }, 
    { "name": "jelly", "cal": 600 } ] }
Cursors inside Drill 

// Reading donuts.json (above) through the client API 
DrillClient drill = new DrillClient().connect( … ); 
ResultReader r = drill.runSqlQuery("select * from `donuts.json`"); 
while (r.next()) { 
  String donutName = r.reader("name").readString(); 
  ListReader fillings = r.reader("fillings"); 
  while (fillings.next()) { 
    int calories = fillings.reader("cal").readInteger(); 
    if (calories > 400) 
      print(donutName, calories, fillings.reader("name").readString()); 
  } 
}
Direct queries on nested data 

// Flattening maps in JSON, Parquet and other nested records, 
// again against donuts.json (above) 
select name, flatten(fillings) as f 
from dfs.users.`/donuts.json` 
where f.cal < 300; 
// lists the fillings < 300 calories
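What flatten does can be sketched in Python. This is a conceptual illustration, not Drill; the sample data is the donuts.json variant from the next slide, reading its garbled choco record as a plain filling at 280 calories:

```python
# Sketch of "select name, flatten(fillings) as f ... where f.cal < 300".
donuts = [
    {"name": "classic", "fillings": [{"name": "sugar", "cal": 400}]},
    {"name": "choco", "fillings": [{"name": "sugar", "cal": 400},
                                   {"name": "plain", "cal": 280}]},
    {"name": "bostoncreme", "fillings": [{"name": "sugar", "cal": 400},
                                         {"name": "cream", "cal": 1000},
                                         {"name": "jelly", "cal": 600}]},
]

# flatten(fillings): one output row per (donut, filling) pair
rows = [(d["name"], f) for d in donuts for f in d["fillings"]]

# where f.cal < 300
result = [(name, f["name"], f["cal"]) for name, f in rows if f["cal"] < 300]
print(result)
```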
Complex Data Using SQL or Fluent API 

// SQL 
Result r = drill.sql("select name, flatten(fillings) from `donuts.json` where fillings.cal < 300"); 

// or Fluent API 
Result r = drill.table("donuts.json").lt("fillings.cal", 300).all(); 

while (r.next()) { 
  String name = r.get("name").string(); 
  ListReader fillings = r.get("fillings").list(); 
  while (fillings.next()) { 
    print(name, fillings.get("cal").integer(), fillings.get("name").string()); 
  } 
} 

donuts.json on this slide (choco has a 280-calorie plain filling): 

{ "name": "classic", 
  "fillings": [ 
    { "name": "sugar", "cal": 400 } ] } 
{ "name": "choco", 
  "fillings": [ 
    { "name": "sugar", "cal": 400 }, 
    { "name": "plain", "cal": 280 } ] } 
{ "name": "bostoncreme", 
  "fillings": [ 
    { "name": "sugar", "cal": 400 }, 
    { "name": "cream", "cal": 1000 }, 
    { "name": "jelly", "cal": 600 } ] }
Queries on embedded data 

// embedded JSON value inside column donut-json inside column-family cf1 
// of an hbase table donuts 
select d.name, count( d.fillings) 
from ( 
  select convert_from( cf1.donut-json, json) as d 
  from hbase.user.`donuts` );
Queries inside JSON records 

// Each JSON record itself can be a whole database 
// example: get all donuts with at least 1 filling with > 300 calories 
select d.name, count( d.fillings), 
  max(d.fillings.cal) within record as maxcal 
from ( select convert_from( cf1.donut-json, json) as d 
  from hbase.user.`donuts` ) 
where maxcal > 300;
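The "within record" aggregation can be sketched in Python: aggregate inside each record first, then filter on that per-record value. A conceptual illustration, not Drill:

```python
# "max(d.fillings.cal) within record": per-record max, then filter on it.
donuts = [
    {"name": "classic", "fillings": [{"name": "sugar", "cal": 400}]},
    {"name": "choco", "fillings": [{"name": "sugar", "cal": 400},
                                   {"name": "chocolate", "cal": 300}]},
    {"name": "bostoncreme", "fillings": [{"name": "sugar", "cal": 400},
                                         {"name": "cream", "cal": 1000},
                                         {"name": "jelly", "cal": 600}]},
]

# donuts with at least one filling over 300 calories
result = []
for d in donuts:
    maxcal = max(f["cal"] for f in d["fillings"])  # aggregate within record
    if maxcal > 300:
        result.append((d["name"], len(d["fillings"]), maxcal))
print(result)
```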
Schema changes on the fly 
• Schema can change over the course of a query 
• Operators are able to reconfigure themselves on schema change events 
  – Minimize flexibility overhead 
  – Support more advanced execution optimization based on actual data characteristics
De-centralized metadata 

// count the number of tweets per customer, where the customers are in Hive 
// and their tweets are in HBase; note the HBase data has no meta-data information 
select c.customerName, hb.tweets.count 
from hive.CustomersDB.`Customers` c 
join hbase.user.`SocialData` hb 
  on c.customerId = convert_from( hb.rowkey, UTF-8);
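The cross-store join amounts to the following, sketched with in-memory stand-ins. The data is hypothetical; the byte-string keys mimic HBase row keys, which carry no metadata:

```python
# Customers live in one store; tweet counts are keyed by raw row-key bytes
# in another. Joining requires converting the raw key (convert_from ... UTF-8).
customers = [{"customerId": "u1", "customerName": "alice"},
             {"customerId": "u2", "customerName": "bob"}]
tweet_counts = {b"u1": 42, b"u3": 7}   # rowkey bytes -> count

result = [
    (c["customerName"], tweet_counts[c["customerId"].encode("utf-8")])
    for c in customers
    if c["customerId"].encode("utf-8") in tweet_counts
]
print(result)
```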
So what does this all mean?
A Drill Database 
• What is a database with Drill/MapR? 
• Just a directory with a bunch of related files 
• There’s no need for artificial boundaries 
  – No need to bunch a set of tables together to call it a “database”
A Drill Database 

/user/srivas/work/bugs 

BugList: 
symptom        version   date      bugid   dump-name 
impala crash   3.1.1     14/7/14   12345   cust1.tgz 
cldb slow      3.1.0     12/7/14   45678   cust2.tgz 

Customers: 
name   rep     se        dump-name 
xxxx   dkim    junhyuk   cust1.tgz 
yyyy   yoshi   aki       cust2.tgz
Queries are simple 

select b.bugid, b.symptom, b.date 
from dfs.bugs.`/Customers` c, dfs.bugs.`/BugList` b 
where c.dump-name = b.dump-name 

Let’s say I want to cross-reference against your list: 

select bugid, symptom 
from dfs.bugs.`/BugList` b, dfs.yourbugs.`/YourBugFile` b2 
where b.bugid = b2.xxx
What does it mean? 
• No ETL 
• Reach out directly to the particular table/file 
• As long as the permissions are fine, you can do it 
• No meta-data needed
Another example 

select d.name, count( d.fillings) 
from ( select convert_from( cf1.donut-json, json) as d 
  from hbase.user.`donuts` ); 

• convert_from( xx, json) invokes the JSON parser inside Drill 
• What if you could plug in any parser? 
  – XML? 
  – Semi-conductor yield-analysis files? Oil-exploration readings? 
  – Telescope readings of stars? RFIDs of various things?
No ETL 
• Basically, Drill is querying the raw data directly 
• … and joining it with processed data 
• Folks, this is very, very powerful: no ETL
Seamless integration with Apache Hive 
• Low latency queries on Hive tables 
• Support for hundreds of Hive file formats 
• Ability to reuse Hive UDFs 
• Support for multiple Hive metastores in a single query
Underneath the Covers
Basic Process 

ZooKeeper coordinates a set of Drillbits; each Drillbit runs next to the data (DFS/HBase) and shares a distributed cache with its peers. 

1. Query comes to any Drillbit (JDBC, ODBC, CLI, protobuf) 
2. Drillbit generates an execution plan based on query optimization & locality 
3. Fragments are farmed out to individual nodes 
4. Result is returned to the driving node
Stages of Query Planning 

SQL Query → Parser → Logical Planner (heuristic and cost based) → Physical Planner (cost based) → Foreman → plan fragments sent to Drillbits
Query Execution 
• Client endpoints: JDBC, ODBC, RPC 
• Parsers: SQL, HiveQL, Pig 
• Planning: Logical Plan → Optimizer → Physical Plan → Scheduler 
• Execution: Foreman, Operators, Distributed Cache 
• Storage Engine Interface: HDFS, HBase, Mongo, Cassandra
A Query engine that is… 
• Columnar/Vectorized 
• Optimistic/pipelined 
• Runtime compilation 
• Late binding 
• Extensible
Columnar representation 

A record is a row (A B C D E); on disk, the columnar layout stores all values of column A together, then all of B, and so on.
Columnar Encoding 
• Values in a column are stored next to one another 
  – Better compression 
  – Range-map: save min-max per block, so a scan can skip blocks where the value can’t be present 
• Only retrieve columns participating in the query 
• Aggregations can be performed without decoding
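The range-map bullet can be made concrete with a small sketch. The block layout here is illustrative, not Drill's actual storage format:

```python
# Each block keeps min/max of its values; a scan for "> threshold" can
# skip any block whose max rules out a match, without decoding it.
blocks = [
    {"min": 1,  "max": 9,  "values": [3, 1, 9, 7]},
    {"min": 10, "max": 19, "values": [12, 15, 11]},
    {"min": 20, "max": 29, "values": [22, 28]},
]

def scan_greater_than(blocks, threshold):
    hits, skipped = [], 0
    for b in blocks:
        if b["max"] <= threshold:   # no value in this block can qualify
            skipped += 1
            continue
        hits.extend(v for v in b["values"] if v > threshold)
    return hits, skipped

hits, skipped = scan_greater_than(blocks, 19)
print(hits, skipped)   # two blocks skipped without being decoded
```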
Run-length-encoding & Sum 
• Dataset encoded as <val> <run-length>: 
  – 2, 4 (four 2’s) 
  – 8, 10 (ten 8’s) 
• Goal: sum all the records 
• Normally: 
  – Decompress: 2, 2, 2, 2, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8 
  – Add: 2 + 2 + 2 + 2 + 8 + 8 + 8 + 8 + 8 + 8 + 8 + 8 + 8 + 8 
• Optimized work: 2 * 4 + 8 * 10 
  – Less memory, fewer operations
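The optimized sum above, as runnable Python. A sketch of the idea, not Drill code:

```python
# RLE dataset: (value, run-length) pairs -> four 2's followed by ten 8's.
runs = [(2, 4), (8, 10)]

# Normal path: decompress, then add every element.
naive = sum(v for v, n in runs for _ in range(n))

# Optimized path: one multiply-add per run, no decompression.
fast = sum(v * n for v, n in runs)   # 2*4 + 8*10

print(naive, fast)   # both 88
```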
Bit-packed Dictionary Sort 
• Dataset encoded with a dictionary and bit-positions: 
– Dictionary: [Rupert, Bill, Larry] {0, 1, 2} 
– Values: [1,0,1,2,1,2,1,0] 
• Normal work 
– Decompress & store: Bill, Rupert, Bill, Larry, Bill, Larry, Bill, Rupert 
– Sort: ~24 comparisons of variable width strings 
• Optimized work 
– Sort dictionary: {Bill: 1, Larry: 2, Rupert: 0} 
– Sort bit-packed values 
– Work: max 3 string comparisons, ~24 comparisons of fixed-width dictionary bits
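The dictionary-sort trick, sketched in Python. Illustrative only, not Drill's implementation:

```python
# Sort the 3-entry dictionary once, then sort fixed-width codes instead of
# variable-width strings; decode only at the end.
dictionary = ["Rupert", "Bill", "Larry"]      # codes 0, 1, 2
codes = [1, 0, 1, 2, 1, 2, 1, 0]

# Rank each dictionary entry by its sorted position: Bill < Larry < Rupert.
order = sorted(range(len(dictionary)), key=lambda c: dictionary[c])
rank = {old: new for new, old in enumerate(order)}    # {1: 0, 2: 1, 0: 2}

sorted_codes = sorted(rank[c] for c in codes)          # cheap fixed-width sort
sorted_dict = [dictionary[c] for c in order]           # the 3 string compares

decoded = [sorted_dict[c] for c in sorted_codes]
print(decoded)
```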
Drill 4-value semantics 
• SQL’s 3-valued semantics: True, False, Unknown 
• Drill adds a fourth: Repeated
Vectorization 
• Drill operates on more than one record at a time 
– Word-sized manipulations 
– SIMD-like instructions 
• GCC, LLVM and JVM all do various optimizations automatically 
– Manually code algorithms 
• Logical Vectorization 
– Bitmaps allow lightning fast null-checks 
– Avoid branching to speed CPU pipeline
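The null-bitmap point can be illustrated with a tiny sketch. This is conceptual; Drill's actual value vectors are off-heap buffers:

```python
# One bit per value: set bit i when value i is non-null. Checking a whole
# batch for nulls is then a single integer comparison, with no per-row branch.
values = [7, None, 3, None, 5, 2, None, 1]

bitmap = 0
for i, v in enumerate(values):
    if v is not None:
        bitmap |= 1 << i

full = (1 << len(values)) - 1     # mask with every bit set
all_valid = bitmap == full        # "no nulls in this batch?"
print(bin(bitmap), all_valid)
```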
Runtime Compilation is Faster 
• The JIT is smart, but runtime compilation yields more gains 
• Janino: a Java-based Java compiler 
(Benchmark from http://bit.ly/16Xk32x)
Drill compiler 
1. CodeModel generates code 
2. Janino compiles the runtime byte-code 
3. The runtime byte-code is merged with precompiled byte-code templates 
4. The merged class is loaded
Optimistic 

(Chart: speed vs. check-pointing. Apache Drill has no need to checkpoint; pessimistic engines checkpoint frequently.)
Optimistic Execution 
• Recovery code trivial 
– Running instances discard the failed query’s intermediate state 
• Pipelining possible 
– Send results as soon as batch is large enough 
– Requires barrier-less decomposition of query
Batches of Values 
• Value vectors 
– List of values, with same schema 
– With the 4-value semantics for each value 
• Shipped around in batches 
– max 256k bytes in a batch 
– max 64K rows in a batch 
• RPC designed for multiple replies to a request
Pipelining 
• Record batches are pipelined between nodes 
  – ~256kB usually 
• A batch is the unit of work for Drill 
  – Operators work on one batch at a time 
• Operator reconfiguration happens at batch boundaries
Pipelining Record Batches 

Record batches flow from the Storage Engine Interface (HDFS, HBase, Mongo, Cassandra) through the Operators and the Foreman out to the JDBC/ODBC/RPC endpoints.
Pipelining 
• Random access: sort without copy or restructuring 
• Avoids serialization/deserialization 
• Off-heap (no GC woes when lots of memory) 
• Full specification + off-heap + batch 
  – Enables C/C++ operators (fast!) 
• Read/write to disk 
  – When data is larger than memory, overflow in a Drillbit uses disk
Cost-based Optimization 
• Uses Optiq, an extensible framework 
  – Pluggable rules, and a pluggable cost model 
• Rules for distributed plan generation 
  – Insert Exchange operators into the physical plan 
  – Optiq enhanced to explore parallel query plans 
• Pluggable cost model 
  – CPU, IO, memory, network cost (data locality) 
  – Storage engine features (HDFS vs Hive vs HBase)
Distributed Plan Cost 
• Operators have a distribution property: Hash, Broadcast, Singleton, … 
• Exchange operators enforce distributions 
  – Hash: HashToRandomExchange 
  – Broadcast: BroadcastExchange 
  – Singleton: UnionExchange, SingleMergeExchange 
• Enumerate all alternatives, use cost to pick the best 
  – Merge join vs hash join 
  – Partition-based join vs broadcast-based join 
  – Streaming aggregation vs hash aggregation 
  – Aggregation in one phase or two phases (partial local aggregation followed by final aggregation)
Interactive SQL-on-Hadoop options 

                 Drill 1.0                      Hive 0.13 w/ Tez             Impala 1.x 
Latency          Low                            Medium                       Low 
Files            Yes (all Hive file formats,    Yes (all Hive file formats)  Yes (Parquet, Sequence, …) 
                 plus JSON, Text, …) 
HBase/MapR-DB    Yes                            Yes, perf issues             Yes, with issues 
Schema           Hive or schema-less            Hive                         Hive 
SQL support      ANSI SQL                       HiveQL                       HiveQL (subset) 
Client support   ODBC/JDBC                      ODBC/JDBC                    ODBC/JDBC 
Hive compat      High                           High                         Low 
Large datasets   Yes                            Yes                          Limited 
Nested data      Yes                            Limited                      No 
Concurrency      High                           Limited                      Medium
Apache Drill Roadmap 

1.0: Data exploration/ad-hoc queries 
• Low-latency SQL 
• Schema-less execution 
• Files & HBase/M7 support 
• Hive integration 
• BI and SQL tool support via ODBC/JDBC 

1.1: Advanced analytics and operational data 
• HBase query speedup 
• Nested data functions 
• Advanced SQL functionality 

2.0: Operational SQL 
• Ultra low latency queries 
• Single row insert/update/delete 
• Workload management
Apache Drill Resources 
• Drill 0.5 released last week 
• Getting started with Drill is easy 
  – Just download the tarball and start running SQL queries on local files 
• Mailing lists 
  – drill-user@incubator.apache.org 
  – drill-dev@incubator.apache.org 
• Docs: https://cwiki.apache.org/confluence/display/DRILL/Apache+Drill+Wiki 
• Fork us on GitHub: http://github.com/apache/incubator-drill/ 
• Create a JIRA: https://issues.apache.org/jira/browse/DRILL
Active Drill Community 
• Large community, growing rapidly 
– 35-40 contributors, 16 committers 
– Microsoft, LinkedIn, Oracle, Facebook, Visa, Lucidworks, Concurrent, many universities 
• In 2014 
– over 20 meet-ups, many more coming soon 
– 3 hackathons, with 40+ participants 
• Encourage you to join, learn, contribute and have fun …
Drill at MapR 
• World-class SQL team, ~20 people 
• 150+ years combined experience building commercial 
databases 
• Oracle, DB2, ParAccel, Teradata, SQLServer, Vertica 
• Team works on Drill, Hive, Impala 
• Fixed some of the toughest problems in Apache Hive
Thank you! 
M. C. Srivas 
srivas@mapr.com 
Did I mention we are hiring…

More Related Content

PPTX
Working with Delimited Data in Apache Drill 1.6.0
PPTX
Using Apache Drill
PDF
Apache Drill: Building Highly Flexible, High Performance Query Engines by M.C...
PPTX
Spark SQL versus Apache Drill: Different Tools with Different Rules
DOCX
Apache Drill with Oracle, Hive and HBase
PPTX
Analyzing Real-World Data with Apache Drill
PPTX
Apache drill
PPTX
Apache Drill
Working with Delimited Data in Apache Drill 1.6.0
Using Apache Drill
Apache Drill: Building Highly Flexible, High Performance Query Engines by M.C...
Spark SQL versus Apache Drill: Different Tools with Different Rules
Apache Drill with Oracle, Hive and HBase
Analyzing Real-World Data with Apache Drill
Apache drill
Apache Drill

What's hot (20)

PPTX
Drilling into Data with Apache Drill
PDF
Scaling HDFS to Manage Billions of Files with Key-Value Stores
PPTX
M7 and Apache Drill, Micheal Hausenblas
PPTX
Hadoop & HDFS for Beginners
PDF
An introduction to apache drill presentation
PPTX
Introduction to Apache Drill
PPTX
Putting Apache Drill into Production
PDF
Hive Anatomy
PPTX
Hadoop and Spark for the SAS Developer
PPTX
Introduction to Apache HBase, MapR Tables and Security
PPTX
Understanding the Value and Architecture of Apache Drill
PPT
Hadoop 1.x vs 2
PPTX
PPTX
Hadoop architecture by ajay
PPT
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
PDF
Difference between hadoop 2 vs hadoop 3
PDF
Drill into Drill – How Providing Flexibility and Performance is Possible
PDF
Cloudera Impala, updated for v1.0
PPTX
Hadoop And Their Ecosystem
Drilling into Data with Apache Drill
Scaling HDFS to Manage Billions of Files with Key-Value Stores
M7 and Apache Drill, Micheal Hausenblas
Hadoop & HDFS for Beginners
An introduction to apache drill presentation
Introduction to Apache Drill
Putting Apache Drill into Production
Hive Anatomy
Hadoop and Spark for the SAS Developer
Introduction to Apache HBase, MapR Tables and Security
Understanding the Value and Architecture of Apache Drill
Hadoop 1.x vs 2
Hadoop architecture by ajay
Hadoop architecture (Delhi Hadoop User Group Meetup 10 Sep 2011)
Difference between hadoop 2 vs hadoop 3
Drill into Drill – How Providing Flexibility and Performance is Possible
Cloudera Impala, updated for v1.0
Hadoop And Their Ecosystem
Ad

Similar to Apache Drill - Why, What, How (20)

PDF
Webinar: Selecting the Right SQL-on-Hadoop Solution
PPTX
Big Data Everywhere Chicago: SQL on Hadoop
PDF
Analyzing Real-World Data with Apache Drill
PDF
2014 08-20-pit-hug
PPTX
Real Time and Big Data – It’s About Time
PPTX
Real Time and Big Data – It’s About Time
PPTX
Big Data Everywhere Chicago: Getting Real with the MapR Platform (MapR)
PPTX
Drilling on JSON
PPTX
The Future of Hadoop: MapR VP of Product Management, Tomer Shiran
PPTX
Get most out of Spark on YARN
PDF
Tez: Accelerating Data Pipelines - fifthel
PDF
Hadoop and NoSQL joining forces by Dale Kim of MapR
PDF
Apache Spark Overview
PPTX
Cleveland Hadoop Users Group - Spark
PDF
Self-Service Data Exploration with Apache Drill
PPTX
Spark One Platform Webinar
PPTX
Tez Data Processing over Yarn
PPTX
Visual Mapping of Clickstream Data
PDF
PDF
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
Webinar: Selecting the Right SQL-on-Hadoop Solution
Big Data Everywhere Chicago: SQL on Hadoop
Analyzing Real-World Data with Apache Drill
2014 08-20-pit-hug
Real Time and Big Data – It’s About Time
Real Time and Big Data – It’s About Time
Big Data Everywhere Chicago: Getting Real with the MapR Platform (MapR)
Drilling on JSON
The Future of Hadoop: MapR VP of Product Management, Tomer Shiran
Get most out of Spark on YARN
Tez: Accelerating Data Pipelines - fifthel
Hadoop and NoSQL joining forces by Dale Kim of MapR
Apache Spark Overview
Cleveland Hadoop Users Group - Spark
Self-Service Data Exploration with Apache Drill
Spark One Platform Webinar
Tez Data Processing over Yarn
Visual Mapping of Clickstream Data
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
Ad

Recently uploaded (20)

PPTX
The THESIS FINAL-DEFENSE-PRESENTATION.pptx
PDF
Mega Projects Data Mega Projects Data
PPT
Miokarditis (Inflamasi pada Otot Jantung)
PDF
Lecture1 pattern recognition............
PPTX
05. PRACTICAL GUIDE TO MICROSOFT EXCEL.pptx
PPTX
IB Computer Science - Internal Assessment.pptx
PPTX
Introduction to Firewall Analytics - Interfirewall and Transfirewall.pptx
PDF
TRAFFIC-MANAGEMENT-AND-ACCIDENT-INVESTIGATION-WITH-DRIVING-PDF-FILE.pdf
PPTX
Business Ppt On Nestle.pptx huunnnhhgfvu
PDF
.pdf is not working space design for the following data for the following dat...
PPTX
DISORDERS OF THE LIVER, GALLBLADDER AND PANCREASE (1).pptx
PPTX
STUDY DESIGN details- Lt Col Maksud (21).pptx
PPTX
Database Infoormation System (DBIS).pptx
PPTX
advance b rammar.pptxfdgdfgdfsgdfgsdgfdfgdfgsdfgdfgdfg
PPT
Reliability_Chapter_ presentation 1221.5784
PDF
BF and FI - Blockchain, fintech and Financial Innovation Lesson 2.pdf
PPTX
Data_Analytics_and_PowerBI_Presentation.pptx
PDF
Recruitment and Placement PPT.pdfbjfibjdfbjfobj
PPT
Chapter 2 METAL FORMINGhhhhhhhjjjjmmmmmmmmm
The THESIS FINAL-DEFENSE-PRESENTATION.pptx
Mega Projects Data Mega Projects Data
Miokarditis (Inflamasi pada Otot Jantung)
Lecture1 pattern recognition............
05. PRACTICAL GUIDE TO MICROSOFT EXCEL.pptx
IB Computer Science - Internal Assessment.pptx
Introduction to Firewall Analytics - Interfirewall and Transfirewall.pptx
TRAFFIC-MANAGEMENT-AND-ACCIDENT-INVESTIGATION-WITH-DRIVING-PDF-FILE.pdf
Business Ppt On Nestle.pptx huunnnhhgfvu
.pdf is not working space design for the following data for the following dat...
DISORDERS OF THE LIVER, GALLBLADDER AND PANCREASE (1).pptx
STUDY DESIGN details- Lt Col Maksud (21).pptx
Database Infoormation System (DBIS).pptx
advance b rammar.pptxfdgdfgdfsgdfgsdgfdfgdfgsdfgdfgdfg
Reliability_Chapter_ presentation 1221.5784
BF and FI - Blockchain, fintech and Financial Innovation Lesson 2.pdf
Data_Analytics_and_PowerBI_Presentation.pptx
Recruitment and Placement PPT.pdfbjfibjdfbjfobj
Chapter 2 METAL FORMINGhhhhhhhjjjjmmmmmmmmm

Apache Drill - Why, What, How

  • 1. © MapR Technologies, confidential © 2014 MapR Technologies M. C. Srivas, CTO and Founder
  • 2. © MapR Technologies, confidential MapR is Unbiased Open Source
  • 3. © MapR Technologies, confidential Linux Is Unbiased • Linux provides choice – MySQL – PostgreSQL – SQLite • Linux provides choice – Apache httpd – Nginx – Lighttpd
  • 4. © MapR Technologies, confidential MapR Is Unbiased • MapR provides choice MapR Distribution for Hadoop Distribution C Distribution H Spark Spark (all of it) and SparkSQL Spark only No Interactive SQL Impala, Drill, Hive/Tez, SparkSQL One option (Impala) One option (Hive/Tez) Scheduler YARN, Mesos One option (YARN) One option (YARN) Versions Hive 0.10, 0.11, 0.12, 0.13 Pig 0.11, 012 HBase 0.94, 0.98 One version One version
  • 5. © MapR Technologies, confidential MapR Distribution for Apache Hadoop MapR Data Platform (Random Read/Write) Enterprise Grade Data Hub Operational MapR-FS (POSIX) MapR-DB (High-Performance NoSQL) Security YARN Pig Cascading Spark Batch Spark Streaming Storm* Streaming HBase Solr NoSQL & Search Juju Provisioning & Coordination Savannah* Mahout MLLib ML, Graph GraphX MapReduc e v1 & v2 APACHE HADOOP AND OSS ECOSYSTEM EXECUTION ENGINES DATA GOVERNANCE AND OPERATIONS Workflow & Data Tez* Governance Accumulo* Hive Impala Shark Drill* SQL Sqoop Sentry* Oozie ZooKeeper Flume Knox* Falcon* Whirr Data Integration & Access HttpFS Hue NFS HDFS API HBase API JSON API MapR Control System (Management and Monitoring) * In Roadmap for inclusion/certification CLI REST API GUI
  • 6. © MapR Technologies, confidential
  • 7. © MapR Technologies, confidential Hadoop an augmentation for EDW—Why?
  • 8. © MapR Technologies, confidential
  • 9. © MapR Technologies, confidential
  • 10. © MapR Technologies, confidential
  • 11. © MapR Technologies, confidential
  • 12. © MapR Technologies, confidential
  • 13. © MapR Technologies, confidential But inside, it looks like this …
  • 14. © MapR Technologies, confidential And this …
  • 15. © MapR Technologies, confidential And this …
  • 16. © MapR Technologies, confidential Consolidating schemas is very hard.
  • 17. © MapR Technologies, confidential Consolidating schemas is very hard, causes SILOs
  • 18. © MapR Technologies, confidential Silos make analysis very difficult • How do I identify a unique {customer, trade} across data sets? • How can I guarantee the lack of anomalous behavior if I can’t see all data?
  • 19. © 2014 MapR Technologies 19 Hard to know what’s of value a priori
  • 20. © 2014 MapR Technologies 20 Hard to know what’s of value a priori
  • 21. © 2014 MapR Technologies 21 Why Hadoop
  • 22. © MapR Technologies, confidential Rethink SQL for Big Data Preserve •ANSI SQL • Familiar and ubiquitous • Performance • Interactive nature crucial for BI/Analytics • One technology • Painful to manage different technologies • Enterprise ready • System-of-record, HA, DR, Security, Multi-tenancy, …
  • 23. © MapR Technologies, confidential Rethink SQL for Big Data Preserve • ANSI SQL • Familiar and ubiquitous • Performance • Interactive nature crucial for BI/Analytics • One technology • Painful to manage different technologies • Enterprise ready • System-of-record, HA, DR, Security, Multi-tenancy, … Invent • Flexible data-model • Allow schemas to evolve rapidly • Support semi-structured data types • Agility • Self-service possible when developer and DBA are the same person • Scalability • In all dimensions: data, speed, schemas, processes, management
  • 24. © MapR Technologies, confidential SQL is here to stay
  • 25. © MapR Technologies, confidential Hadoop is here to stay
  • 26. © MapR Technologies, confidential YOU CAN’T HANDLE REAL SQL
  • 27. © MapR Technologies, confidential SQL select * from A where exists ( select 1 from B where B.b < 100 ); • Did you know Hive and its variants cannot compute this? – e.g., Hive, Impala, Spark/Shark
  • 28. © MapR Technologies, confidential Self-described Data select cf.month, cf.year from hbase.table1; • Did you know normal SQL cannot handle the above? • Nor can Hive and its variants like Impala and Shark
  • 29. © MapR Technologies, confidential Self-described Data select cf.month, cf.year from hbase.table1; • Why? • Because there’s no meta-store definition available
  • 30. © MapR Technologies, confidential Self-Describing Data Ubiquitous. Centralized schema – static, managed by the DBAs, in a centralized repository; a long, meticulous data preparation process (ETL, create/alter schema, etc.) can take 6-18 months. Self-describing, or schema-less, data – dynamic/evolving, managed by the applications, embedded in the data; less schema, more suitable for data that has higher volume, variety and velocity. Apache Drill
  • 31. © MapR Technologies, confidential A Quick Tour through Apache Drill
  • 32. © MapR Technologies, confidential Data Source is in the Query select timestamp, message from dfs1.logs.`AppServerLogs/2014/Jan/p001.parquet` where errorLevel > 2
  • 33. © MapR Technologies, confidential Data Source is in the Query select timestamp, message from dfs1.logs.`AppServerLogs/2014/Jan/p001.parquet` where errorLevel > 2 This is a cluster in Apache Drill - DFS - HBase - Hive meta-store
  • 34. © MapR Technologies, confidential Data Source is in the Query select timestamp, message from dfs1.logs.`AppServerLogs/2014/Jan/p001.parquet` where errorLevel > 2 This is a cluster in Apache Drill - DFS - HBase - Hive meta-store A work-space - Typically a sub-directory - HIVE database
  • 35. © MapR Technologies, confidential Data Source is in the Query select timestamp, message from dfs1.logs.`AppServerLogs/2014/Jan/p001.parquet` where errorLevel > 2 This is a cluster in Apache Drill - DFS - HBase - Hive meta-store A work-space - Typically a sub-directory - HIVE database A table - pathnames - Hbase table - Hive table
  • 36. © MapR Technologies, confidential Combine data sources on the fly • JSON • CSV • ORC (i.e., all Hive types) • Parquet • HBase tables • … can combine them Select USERS.name, USERS.emails.work from dfs.logs.`/data/logs` LOGS, dfs.users.`/profiles.json` USERS where LOGS.uid = USERS.uid and errorLevel > 5 order by count(*);
  • 37. © MapR Technologies, confidential Can be an entire directory tree // On a file select errorLevel, count(*) from dfs.logs.`/AppServerLogs/2014/Jan/part0001.parquet` group by errorLevel;
  • 38. © MapR Technologies, confidential Can be an entire directory tree // On a file select errorLevel, count(*) from dfs.logs.`/AppServerLogs/2014/Jan/part0001.parquet` group by errorLevel; // On the entire data collection: all years, all months select errorLevel, count(*) from dfs.logs.`/AppServerLogs` group by errorLevel
  • 39. © MapR Technologies, confidential Can be an entire directory tree // On a file select errorLevel, count(*) from dfs.logs.`/AppServerLogs/2014/Jan/part0001.parquet` group by errorLevel; // On the entire data collection: all years, all months select errorLevel, count(*) from dfs.logs.`/AppServerLogs` group by errorLevel (dirs[1] and dirs[2] refer to the path components, e.g. year and month)
  • 40. © MapR Technologies, confidential Can be an entire directory tree // On a file select errorLevel, count(*) from dfs.logs.`/AppServerLogs/2014/Jan/part0001.parquet` group by errorLevel; // On a subtree: only years after 2012 select errorLevel, count(*) from dfs.logs.`/AppServerLogs` where dirs[1] > 2012 group by errorLevel (dirs[1] and dirs[2] refer to the path components, e.g. year and month)
  • 41. © MapR Technologies, confidential Querying JSON { name: classic, fillings: [ { name: sugar, cal: 400 }]} { name: choco, fillings: [ { name: sugar, cal: 400 }, { name: chocolate, cal: 300 }]} { name: bostoncreme, fillings: [ { name: sugar, cal: 400 }, { name: cream, cal: 1000 }, { name: jelly, cal: 600 }]} donuts.json
  • 42. © MapR Technologies, confidential Cursors inside Drill DrillClient drill = new DrillClient().connect( …); ResultReader r = drill.runSqlQuery( "select * from `donuts.json`"); while( r.next()) { String donutName = r.reader( "name").readString(); ListReader fillings = r.reader( "fillings"); while( fillings.next()) { int calories = fillings.reader( "cal").readInteger(); if (calories > 400) print( donutName, calories, fillings.reader( "name").readString()); } } { name: classic fillings: [ { name: sugar cal: 400 }]} { name: choco fillings: [ { name: sugar cal: 400 } { name: chocolate cal: 300 }]} { name: bostoncreme fillings: [ { name: sugar cal: 400 } { name: cream cal: 1000 } { name: jelly cal: 600 }]}
  • 43. © MapR Technologies, confidential Direct queries on nested data // Flattening maps in JSON, parquet and other nested records select name, flatten(fillings) as f from dfs.users.`/donuts.json` where f.cal < 300; // lists the fillings < 300 calories { name: classic fillings: [ { name: sugar cal: 400 }]} { name: choco fillings: [ { name: sugar cal: 400 } { name: chocolate cal: 300 }]} { name: bostoncreme fillings: [ { name: sugar cal: 400 } { name: cream cal: 1000 } { name: jelly cal: 600 }]}
  • 44. © MapR Technologies, confidential Complex Data Using SQL or Fluent API // SQL Result r = drill.sql( "select name, flatten(fillings) from `donuts.json` where fillings.cal < 300"); // or Fluent API Result r = drill.table("donuts.json") .lt("fillings.cal", 300).all(); while( r.next()) { String name = r.get( "name").string(); List fillings = r.get( "fillings").list(); while(fillings.next()) { print(name, calories, fillings.get("name").string()); } } { name: classic fillings: [ { name: sugar cal: 400 }]} { name: choco fillings: [ { name: sugar cal: 400 } { name: plain, cal: 280 }]} { name: bostoncreme fillings: [ { name: sugar cal: 400 } { name: cream cal: 1000 } { name: jelly cal: 600 }]}
  • 45. © MapR Technologies, confidential Queries on embedded data // embedded JSON value inside column donut-json inside column-family cf1 of an hbase table donuts select d.name, count( d.fillings) from ( select convert_from( cf1.donut-json, json) as d from hbase.user.`donuts` );
  • 46. © MapR Technologies, confidential Queries inside JSON records // Each JSON record itself can be a whole database // example: get all donuts with at least 1 filling with > 300 calories select d.name, count( d.fillings), max(d.fillings.cal) within record as maxcal from ( select convert_from( cf1.donut-json, json) as d from hbase.user.`donuts` ) where maxcal > 300;
  • 47. © MapR Technologies, confidential • Schema can change over the course of a query • Operators are able to reconfigure themselves on schema change events – Minimize flexibility overhead – Support more advanced execution optimization based on actual data characteristics
  • 48. © MapR Technologies, confidential De-centralized metadata // count the number of tweets per customer, where the customers are in Hive, and their tweets are in HBase. Note that the hbase data has no meta-data information select c.customerName, hb.tweets.count from hive.CustomersDB.`Customers` c join hbase.user.`SocialData` hb on c.customerId = convert_from( hb.rowkey, 'UTF8');
  • 49. © MapR Technologies, confidential So what does this all mean?
  • 50. © MapR Technologies, confidential A Drill Database • What is a database with Drill/MapR?
  • 51. © MapR Technologies, confidential A Drill Database • What is a database with Drill/MapR? • Just a directory, with a bunch of related files
  • 52. © MapR Technologies, confidential A Drill Database • What is a database with Drill/MapR? • Just a directory, with a bunch of related files • There’s no need for artificial boundaries – No need to bunch a set of tables together to call it a “database”
  • 53. © MapR Technologies, confidential A Drill Database /user/srivas/work/bugs BugList (symptom, version, date, bugid, dump-name): impala crash, 3.1.1, 14/7/14, 12345, cust1.tgz; cldb slow, 3.1.0, 12/7/14, 45678, cust2.tgz. Customers (name, rep, se, dump-name): xxxx, dkim, junhyuk, cust1.tgz; yyyy, yoshi, aki, cust2.tgz
  • 54. © MapR Technologies, confidential Queries are simple select b.bugid, b.symptom, b.date from dfs.bugs.`/Customers` c, dfs.bugs.`/BugList` b where c.dump-name = b.dump-name
  • 55. © MapR Technologies, confidential Queries are simple select b.bugid, b.symptom, b.date from dfs.bugs.`/Customers` c, dfs.bugs.`/BugList` b where c.dump-name = b.dump-name Let’s say I want to cross-reference against your list: select bugid, symptom from dfs.bugs.`/BugList` b, dfs.yourbugs.`/YourBugFile` b2 where b.bugid = b2.xxx
  • 56. © MapR Technologies, confidential What does it mean?
  • 57. © MapR Technologies, confidential What does it mean? • No ETL • Reach out directly to the particular table/file • As long as the permissions are fine, you can do it • No need to have the meta-data – None needed
  • 58. © MapR Technologies, confidential Another example select d.name, count( d.fillings) from ( select convert_from( cf1.donut-json, json) as d from hbase.user.`donuts` ); • convert_from( xx, json) invokes the json parser inside Drill
  • 59. © MapR Technologies, confidential Another example select d.name, count( d.fillings) from ( select convert_from( cf1.donut-json, json) as d from hbase.user.`donuts` ); • convert_from( xx, json) invokes the json parser inside Drill • What if you could plug in any parser?
  • 60. © MapR Technologies, confidential Another example select d.name, count( d.fillings) from ( select convert_from( cf1.donut-json, json) as d from hbase.user.`donuts` ); • convert_from( xx, json) invokes the json parser inside Drill • What if you could plug in any parser? – XML? – Semi-conductor yield-analysis files? Oil-exploration readings? – Telescope readings of stars? – RFIDs of various things?
  • 61. © MapR Technologies, confidential No ETL • Basically, Drill is querying the raw data directly • Joining with processed data • NO ETL • Folks, this is very, very powerful • NO ETL
  • 62. © MapR Technologies, confidential Seamless integration with Apache Hive • Low latency queries on Hive tables • Support for 100s of Hive file formats • Ability to reuse Hive UDFs • Support for multiple Hive metastores in a single query
  • 63. © MapR Technologies, confidential Underneath the Covers
  • 64. © MapR Technologies, confidential Basic Process Zookeeper DFS/HBase DFS/HBase DFS/HBase Drillbit Distributed Cache Drillbit Distributed Cache Drillbit Distributed Cache Query 1. Query comes to any Drillbit (JDBC, ODBC, CLI, protobuf) 2. Drillbit generates execution plan based on query optimization & locality 3. Fragments are farmed out to individual nodes 4. Result is returned to the driving node
  • 65. © MapR Technologies, confidential Stages of Query Planning Parser Logical Planner Physical Planner Query Foreman Plan fragments sent to Drillbits SQL Query Heuristic and cost based Cost based
  • 66. © MapR Technologies, confidential Query Execution SQL Parser Optimizer Scheduler Pig Parser Physical Plan Mongo Cassandra HiveQL Parser RPC Endpoint Distributed Cache Storage Engine Interface Operators Foreman Logical Plan HDFS HBase JDBC Endpoint ODBC Endpoint
  • 67. © MapR Technologies, confidential A Query engine that is… • Columnar/Vectorized • Optimistic/pipelined • Runtime compilation • Late binding • Extensible
  • 68. © MapR Technologies, confidential Columnar representation A B C D E A B C D On disk E
  • 69. © MapR Technologies, confidential Columnar Encoding • Values in a column are stored next to one another – Better compression – Range-map: save min-max per chunk, can skip a chunk if the predicate can’t match its range • Only retrieve columns participating in the query • Aggregations can be performed without decoding A B C D On disk E
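The range-map idea above can be sketched in a few lines of Java (hypothetical `Chunk`/`RangeMapDemo` classes, not Drill's actual internals): each column chunk carries its min and max, so a filter such as `errorLevel > 2` can skip whole chunks without decoding them.

```java
// Sketch of range-map (min/max) pruning for a columnar store.
// Hypothetical classes for illustration -- not Drill's actual code.
public class RangeMapDemo {
    static final class Chunk {
        final int[] values;
        final int min, max;          // stored alongside the chunk on disk
        Chunk(int[] values) {
            this.values = values;
            int lo = Integer.MAX_VALUE, hi = Integer.MIN_VALUE;
            for (int v : values) { lo = Math.min(lo, v); hi = Math.max(hi, v); }
            this.min = lo;
            this.max = hi;
        }
    }

    // Count rows with value > threshold, skipping chunks the range-map rules out.
    static int countGreaterThan(Chunk[] chunks, int threshold) {
        int count = 0;
        for (Chunk c : chunks) {
            if (c.max <= threshold) continue;   // whole chunk skipped without decoding
            for (int v : c.values) if (v > threshold) count++;
        }
        return count;
    }
}
```

A real engine keeps the min/max in file metadata (as Parquet does), so pruning happens before the chunk is even read from disk.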
  • 70. © MapR Technologies, confidential Run-length-encoding & Sum • Dataset encoded as <val> <run-length>: – 2, 4 (4 2’s) – 8, 10 (10 8’s) • Goal: sum all the records • Normally: – Decompress: 2, 2, 2, 2, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8 – Add: 2 + 2 + 2 + 2 + 8 + 8 + 8 + 8 + 8 + 8 + 8 + 8 + 8 + 8 • Optimized work: 2 * 4 + 8 * 10 – Less memory, fewer operations
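The optimized sum on the slide is just a dot product over the (value, run-length) pairs; a minimal Java sketch (hypothetical `RleSum` helper):

```java
// Sum an RLE-encoded column without decompressing it:
// total = sum over i of values[i] * runLengths[i].
public class RleSum {
    static long sum(int[] values, int[] runLengths) {
        long total = 0;
        for (int i = 0; i < values.length; i++) {
            total += (long) values[i] * runLengths[i];  // one multiply per run, not one add per row
        }
        return total;
    }
}
```

For the slide's data — runs (2, 4) and (8, 10) — this does 2 multiplies and 1 add instead of 14 additions.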
  • 71. © MapR Technologies, confidential Bit-packed Dictionary Sort • Dataset encoded with a dictionary and bit-positions: – Dictionary: [Rupert, Bill, Larry] {0, 1, 2} – Values: [1,0,1,2,1,2,1,0] • Normal work – Decompress & store: Bill, Rupert, Bill, Larry, Bill, Larry, Bill, Rupert – Sort: ~24 comparisons of variable width strings • Optimized work – Sort dictionary: {Bill: 1, Larry: 2, Rupert: 0} – Sort bit-packed values – Work: max 3 string comparisons, ~24 comparisons of fixed-width dictionary bits
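The dictionary-sort trick above can be sketched in Java (hypothetical `DictSort` helper): sort the small dictionary once with string comparisons, then sort the fixed-width ids with cheap integer comparisons, materializing strings only at the end.

```java
import java.util.*;

// Bit-packed dictionary sort sketch: sort dictionary ids instead of strings.
public class DictSort {
    static List<String> sortValues(String[] dictionary, int[] ids) {
        // 1. Sort the dictionary once (at most |dictionary| string comparisons).
        Integer[] order = new Integer[dictionary.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, Comparator.comparing(i -> dictionary[i]));

        // rank[id] = position of that dictionary entry in sorted order
        int[] rank = new int[dictionary.length];
        for (int pos = 0; pos < order.length; pos++) rank[order[pos]] = pos;

        // 2. Sort the fixed-width ids by rank (integer comparisons only).
        Integer[] sortedIds = Arrays.stream(ids).boxed().toArray(Integer[]::new);
        Arrays.sort(sortedIds, Comparator.comparingInt(id -> rank[id]));

        // 3. Materialize the strings only at the end.
        List<String> out = new ArrayList<>();
        for (int id : sortedIds) out.add(dictionary[id]);
        return out;
    }
}
```

With the slide's data — dictionary [Rupert, Bill, Larry] and ids [1,0,1,2,1,2,1,0] — only 3 variable-width strings are ever compared; the per-row work is all on small integers.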
  • 72. © MapR Technologies, confidential Drill 4-value semantics • SQL’s 3-valued semantics – True – False – Unknown • Drill adds fourth – Repeated
  • 73. © MapR Technologies, confidential Vectorization • Drill operates on more than one record at a time – Word-sized manipulations – SIMD-like instructions • GCC, LLVM and JVM all do various optimizations automatically – Manually code algorithms • Logical Vectorization – Bitmaps allow lightning fast null-checks – Avoid branching to speed CPU pipeline
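The bitmap null-check point above can be illustrated with a small Java sketch (hypothetical `NullBitmap` helper, not Drill's actual code): one 64-bit validity word covers 64 values, so "are any of these null?" becomes a single comparison instead of 64 branches.

```java
// Logical vectorization sketch: a 64-bit validity word answers null
// questions for 64 values at once. Bit i set => value i is non-null.
public class NullBitmap {
    static boolean allNonNull(long validityWord) {
        return validityWord == -1L;            // all 64 bits set
    }
    static boolean anyNonNull(long validityWord) {
        return validityWord != 0L;             // at least one bit set
    }
    static boolean isNonNull(long validityWord, int i) {
        return ((validityWord >>> i) & 1L) != 0;
    }
}
```

An operator can test `allNonNull` once per 64-value word and take a branch-free fast path for the common fully-valid case, which keeps the CPU pipeline full.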
  • 74. © MapR Technologies, confidential Runtime Compilation is Faster • JIT is smart, but more gains with runtime compilation • Janino: Java-based Java compiler From http://bit.ly/16Xk32x
  • 75. © MapR Technologies, confidential Drill compiler: precompiled byte-code templates + CodeModel-generated code → Janino compiles the runtime byte-code → merge the byte-code of the two classes → loaded class
  • 76. © MapR Technologies, confidential Optimistic (chart: speed vs. check-pointing — Apache Drill, with no need to checkpoint, vs. systems that checkpoint frequently)
  • 77. © MapR Technologies, confidential Optimistic Execution • Recovery code trivial – Running instances discard the failed query’s intermediate state • Pipelining possible – Send results as soon as batch is large enough – Requires barrier-less decomposition of query
  • 78. © MapR Technologies, confidential Batches of Values • Value vectors – List of values, with same schema – With the 4-value semantics for each value • Shipped around in batches – max 256k bytes in a batch – max 64K rows in a batch • RPC designed for multiple replies to a request
  • 79. © MapR Technologies, confidential Pipelining • Record batches are pipelined between nodes – ~256kB usually • Unit of work for Drill – Operators works on a batch • Operator reconfiguration happens at batch boundaries DrillBit DrillBit DrillBit
  • 80. © MapR Technologies, confidential Pipelining Record Batches SQL Parser Optimizer Scheduler Pig Parser Physical Plan Mongo Cassandra HiveQL Parser RPC Endpoint Distributed Cache Storage Engine Interface Operators Foreman Logical Plan HDFS HBase JDBC Endpoint ODBC Endpoint
  • 81. © MapR Technologies, confidential Pipelining • Random access: sort without copy or restructuring • Avoids serialization/deserialization • Off-heap (no GC woes when lots of memory) • Full specification + off-heap + batch – Enables C/C++ operators (fast!) • Read/write to disk when data larger than memory (Drillbit memory overflow spills to disk)
  • 82. © MapR Technologies, confidential Cost-based Optimization • Using Optiq, an extensible framework • Pluggable rules, and cost model • Rules for distributed plan generation • Insert Exchange operator into physical plan • Optiq enhanced to explore parallel query plans • Pluggable cost model – CPU, IO, memory, network cost (data locality) – Storage engine features (HDFS vs HIVE vs HBase) Query Optimizer Pluggable rules Pluggable cost model
  • 83. © MapR Technologies, confidential Distributed Plan Cost • Operators have distribution property • Hash, Broadcast, Singleton, … • Exchange operator to enforce distributions • Hash: HashToRandomExchange • Broadcast: BroadcastExchange • Singleton: UnionExchange, SingleMergeExchange • Enumerate all, use cost to pick best • Merge Join vs Hash Join • Partition-based join vs Broadcast-based join • Streaming Aggregation vs Hash Aggregation • Aggregation in one phase or two phases • partial local aggregation followed by final aggregation HashToRandomExchange Sort Streaming-Aggregation Data Data Data
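The broadcast-vs-partition choice on the slide comes down to comparing estimated data movement; a toy Java sketch of such a cost comparison (hypothetical `JoinChoice` helper and deliberately simplified cost formulas, not Drill's actual optimizer model):

```java
// Toy cost model for picking a distributed join strategy.
// Broadcast join: replicate the smaller side to every node.
// Partition join: hash-reshuffle both sides once across the network.
public class JoinChoice {
    static String chooseJoin(long leftRows, long rightRows, int numNodes) {
        long broadcastCost = Math.min(leftRows, rightRows) * numNodes; // small side shipped to each node
        long partitionCost = leftRows + rightRows;                     // each row moves at most once
        return broadcastCost < partitionCost ? "broadcast" : "partition";
    }
}
```

A real planner (Drill uses Optiq's cost framework) would fold in CPU, memory, and locality as well, but the shape of the decision is the same: enumerate the physical alternatives and pick the cheapest.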
  • 84. © MapR Technologies, confidential Interactive SQL-on-Hadoop options — Drill 1.0 vs Hive 0.13 w/ Tez vs Impala 1.x: Latency: Low / Medium / Low. Files: Yes (all Hive file formats, plus JSON, Text, …) / Yes (all Hive file formats) / Yes (Parquet, Sequence, …). HBase/MapR-DB: Yes / Yes, perf issues / Yes, with issues. Schema: Hive or schema-less / Hive / Hive. SQL support: ANSI SQL / HiveQL / HiveQL (subset). Client support: ODBC/JDBC for all three. Hive compat: High / High / Low. Large datasets: Yes / Yes / Limited. Nested data: Yes / Limited / No. Concurrency: High / Limited / Medium.
  • 85. © MapR Technologies, confidential Apache Drill Roadmap •Low-latency SQL •Schema-less execution •Files & HBase/M7 support •Hive integration •BI and SQL tool support via ODBC/JDBC Data exploration/ad-hoc queries 1.0 •HBase query speedup •Nested data functions •Advanced SQL functionality Advanced analytics and operational data 1.1 •Ultra low latency queries •Single row insert/update/delete •Workload management Operational SQL 2.0
  • 86. © MapR Technologies, confidential MapR Distribution for Apache Hadoop MapR Data Platform (Random Read/Write) Enterprise Grade Data Hub Operational MapR-FS (POSIX) MapR-DB (High-Performance NoSQL) Security YARN Pig Cascading Spark Batch Spark Streaming Storm* Streaming HBase Solr NoSQL & Search Juju Provisioning & Coordination Savannah* Mahout MLLib ML, Graph GraphX MapReduce v1 & v2 APACHE HADOOP AND OSS ECOSYSTEM EXECUTION ENGINES DATA GOVERNANCE AND OPERATIONS Workflow & Data Governance Tez* Accumulo* Hive Impala Shark Drill* SQL Sqoop Sentry* Oozie ZooKeeper Flume Knox* Falcon* Whirr Data Integration & Access HttpFS Hue NFS HDFS API HBase API JSON API MapR Control System (Management and Monitoring) * In Roadmap for inclusion/certification CLI REST API GUI
  • 87. © MapR Technologies, confidential Apache Drill Resources • Drill 0.5 released last week • Getting started with Drill is easy – just download the tarball and start running SQL queries on local files • Mailing lists – drill-user@incubator.apache.org – drill-dev@incubator.apache.org • Docs: https://cwiki.apache.org/confluence/display/DRILL/Apache+Drill+Wiki • Fork us on GitHub: http://github.com/apache/incubator-drill/ • Create a JIRA: https://issues.apache.org/jira/browse/DRILL
  • 88. © MapR Technologies, confidential Active Drill Community • Large community, growing rapidly – 35-40 contributors, 16 committers – Microsoft, LinkedIn, Oracle, Facebook, Visa, Lucidworks, Concurrent, many universities • In 2014 – over 20 meet-ups, many more coming soon – 3 hackathons, with 40+ participants • Encourage you to join, learn, contribute and have fun …
  • 89. © MapR Technologies, confidential Drill at MapR • World-class SQL team, ~20 people • 150+ years combined experience building commercial databases • Oracle, DB2, ParAccel, Teradata, SQLServer, Vertica • Team works on Drill, Hive, Impala • Fixed some of the toughest problems in Apache Hive
  • 90. © MapR Technologies, confidential Thank you! M. C. Srivas srivas@mapr.com Did I mention we are hiring…