Self Service Data Exploration with Apache Drill
© 2015 MapR Technologies

Data is doubling in size every two years

By 2020, the world is estimated to hold 44 zettabytes of data
Year  Data volume
2011  1.8 zettabytes
2013  4.4 zettabytes
2020  44 zettabytes*
*A zettabyte is 1 billion terabytes.
Source: IDC Digital Universe

44 zettabytes would fill roughly 700 billion 64 GB iPhones, about 100 for every person in the world.
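
A quick check of the iPhone comparison can be done in a few lines. This is a minimal sketch assuming decimal units (1 zettabyte = 10^21 bytes, 1 GB = 10^9 bytes) and a world population of about 7.3 billion; the population figure and the name phones_needed are assumptions for illustration, not taken from the slide.

```python
ZETTABYTE = 10**21  # bytes; equals 1 billion terabytes
GIGABYTE = 10**9    # bytes (decimal definition)


def phones_needed(total_zb, phone_gb):
    """Number of phones of a given capacity needed to hold total_zb zettabytes."""
    return total_zb * ZETTABYTE / (phone_gb * GIGABYTE)


phones = phones_needed(44, 64)
per_person = phones / 7.3e9  # assumed world population

print(f"{phones:.3e} phones, about {per_person:.0f} per person")
```

Under these assumptions the result is on the order of 700 billion phones, or roughly 100 per person.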

Unstructured data will account for more than 80% of the data collected by organizations
[Chart: total data stored, 1980 to 2020, structured vs. unstructured data]
Source: Human-Computer Interaction & Knowledge Discovery in Complex Unstructured, Big Data

Evolving distance to data
Existing approaches require a middleman (IT)
Map/Reduce: Business (analysts, developers) -> “Plumbing” development -> Data
Traditional SQL-on-Hadoop: Business (analysts, developers) -> Modeling and transformations -> Data
New SQL-on-Hadoop: Business (analysts, developers) -> Data

SQL in a NoSchema World
WANT
•  SQL
•  BI (Tableau, MicroStrategy, etc.)
•  Low latency
•  Scalability
DON’T WANT
•  Create and maintain schemas on:
–  HDFS (Parquet, JSON, etc.)
–  HBase
–  MongoDB
•  Transform or copy data

APACHE DRILL
• Schema-free scale-out query engine for Hadoop and NoSQL
• Low latency
• Extreme ease of use
• Industry-standard APIs: ANSI SQL, ODBC/JDBC, RESTful APIs

Drill’s Data Model is Flexible
[Chart: data formats (HBase, JSON, BSON, CSV, TSV, Parquet, Avro) placed by flexibility along two axes, fixed schema to schema-less and flat to complex]

RDBMS/SQL-on-Hadoop table
Name      Gender  Age
Michael   M       6
Jennifer  F       3

Apache Drill table
{
  name: {
    first: Michael,
    last: Smith
  },
  hobbies: [ski, soccer],
  district: Los Altos
}
{
  name: {
    first: Jennifer,
    last: Gates
  },
  hobbies: [sing],
  preschool: CCLC
}

Running Drill takes 10 minutes
DOWNLOAD
http://drill.apache.org/download
EXTRACT
$ tar xf apache-drill-0.7.0.tar.gz
$ cd apache-drill-0.7.0
RUN
$ bin/sqlline -u jdbc:drill:zk=local
QUERY
> SELECT full_name, position_title, salary
  FROM cp.`employee.json`
  LIMIT 1;
+--------------+----------------+---------+
| full_name    | position_title | salary  |
+--------------+----------------+---------+
| Sheri Nowmer | President      | 80000.0 |
+--------------+----------------+---------+
1 row selected (0.417 seconds)

Introduce external data sources to Drill
Coordinates: <storage plugin>.<workspace>.<table>
Currently supported providers:
Provider  Workspace  Table
files     Path       Path relative to workspace
mongo     Database   Collection
hive      Database   Table
hbase     Namespace  Table
…
Example:
> SELECT * FROM dfs.root.`/E:/drill/data/yelp/review.json`;
> SELECT * FROM dfs.yelp.`review.json` LIMIT 1;
> USE dfs.yelp;
> SELECT * FROM `review.json` LIMIT 1;
> SELECT * FROM hbase.users LIMIT 1;
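
To make the coordinate scheme concrete, the sketch below splits a dotted table reference into its parts while keeping backtick-quoted segments (such as file paths containing dots) intact. It is plain Python for illustration only and says nothing about how Drill parses names internally; the function name split_coordinates is hypothetical.

```python
from typing import List


def split_coordinates(name: str) -> List[str]:
    """Split a dotted reference on dots that are outside backticks."""
    parts, current, quoted = [], [], False
    for ch in name:
        if ch == "`":
            quoted = not quoted          # toggle quoted state, drop the backtick
        elif ch == "." and not quoted:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts


print(split_coordinates("dfs.yelp.`review.json`"))
print(split_coordinates("hbase.users"))
```

The first call separates the storage plugin (dfs), the workspace (yelp), and the table (review.json); the second shows a reference that carries only two parts.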

Inventory: DFS Files
{
  "votes": {"funny": 0, "useful": 2, "cool": 1},
  "user_id": "Xqd0DzHaiyRqVH3WRG7hzg",
  "review_id": "15SdjuK7DmYqUAj6rjGowg",
  "stars": 5,
  "date": "2007-05-17",
  "text": "dr. goldberg offers everything ...",
  "type": "review",
  "business_id": "vcNAWiLM4dR7D2nwwJ7nCA"
}

business.json (1)
{
  "business_id": "4bEjOyTaDG24SY5TxsaUNQ",
  "full_address": "3655 Las Vegas Blvd S\nThe Strip\nLas Vegas, NV 89109",
  "hours": {
    "Monday": {"close": "23:00", "open": "07:00"},
    "Tuesday": {"close": "23:00", "open": "07:00"},
    "Friday": {"close": "00:00", "open": "07:00"},
    "Wednesday": {"close": "23:00", "open": "07:00"},
    "Thursday": {"close": "23:00", "open": "07:00"},
    "Sunday": {"close": "23:00", "open": "07:00"},
    "Saturday": {"close": "00:00", "open": "07:00"}
  },
  "open": true,
  "categories": ["Breakfast & Brunch", "Steakhouses", "French", "Restaurants"],
  "city": "Las Vegas",
  "review_count": 4084,
  "name": "Mon Ami Gabi",
  "neighborhoods": ["The Strip"],
  "longitude": -115.172588519464,

business.json (2)
  "state": "NV",
  "stars": 4.0,
  "attributes": {
    "Alcohol": "full_bar",
    "Noise Level": "average",
    "Has TV": false,
    "Attire": "casual",
    "Ambience": {
      "romantic": true,
      "intimate": false,
      "touristy": false,
      "hipster": false,
      "classy": true,
      "trendy": false,
      "casual": false
    },
    "Good For": {"dessert": false, "latenight": false, "lunch": false,
                 "dinner": true, "breakfast": false, "brunch": false}
  }
}

Use cases
LAS VEGAS
NEW RESTAURANT

NEW RESTAURANT
Customers for opening party
> SELECT name, review_count
  FROM dfs.yelp.`user.json`
  ORDER BY review_count DESC
  LIMIT 50;

+----------+--------------+
| name     | review_count |
+----------+--------------+
| Victor   | 8062         |
| Jennifer | 4244         |
| Anita    | 3829         |
| ......   | ....         |
| Eileen   | 1947         |
| J        | 1946         |
| Matt     | 1942         |
+----------+--------------+
50 rows selected (1.16 seconds)

NEW RESTAURANT
Cities with most businesses
> SELECT state, city, COUNT(*) AS businesses
    FROM dfs.yelp.`business.json`
    GROUP BY state, city
    ORDER BY businesses DESC LIMIT 10;
+-------+------------+------------+
| state | city       | businesses |
+-------+------------+------------+
| NV    | Las Vegas  | 12021      |
| AZ    | Phoenix    | 7499       |
| AZ    | Scottsdale | 3605       |
| EDH   | Edinburgh  | 2804       |
| AZ    | Mesa       | 2041       |
| AZ    | Tempe      | 2025       |
| NV    | Henderson  | 1914       |
| AZ    | Chandler   | 1637       |
| WI    | Madison    | 1630       |
| AZ    | Glendale   | 1196       |
+-------+------------+------------+

Use cases
LAS VEGAS
LAS VEGAS RESTAURANT

LAS VEGAS RESTAURANT
Open restaurants at 22:00
> SELECT name, b.hours
  FROM dfs.yelp.`business.json` b
  WHERE b.hours.Saturday.`open` < '22:00' AND
        b.hours.Saturday.`close` > '22:00'
  LIMIT 1;

+-----------------------------+-------+
| name                        | hours |
+-----------------------------+-------+
| Chang Jiang Chinese Kitchen | {"Tuesday":{"close":"22:00","open":"11:00"},"Friday":{"close":"22:30","open":"11:00"},"Monday":{"close":"22:00","open":"11:00"},"Wednesday":{"close":"22:00","open":"11:00"},"Thursday":{"close":"22:00","open":"11:00"},"Sunday":{"close":"21:00","open":"16:00"},"Saturday":{"close":"22:30","open":"11:00"}} |
+-----------------------------+-------+
1 row selected (0.013 seconds)
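
The WHERE clause above compares times as strings, which works because the values are zero-padded "HH:MM". The following plain-Python sketch mimics that predicate to show one consequence: a business whose closing time is stored as "00:00" (as in the sample business.json record) does not match, because "00:00" sorts before "22:00". The function name is_open_at and the handling of a missing day are assumptions made for illustration.

```python
def is_open_at(hours, day, time):
    """Mimic: hours.<day>.open < time AND hours.<day>.close > time."""
    slot = hours.get(day)
    if slot is None:
        return False  # assumption: a missing day counts as closed
    # Lexicographic comparison is valid only for zero-padded HH:MM strings.
    return slot["open"] < time and slot["close"] > time


chang_jiang = {"Saturday": {"close": "22:30", "open": "11:00"}}
mon_ami_gabi = {"Saturday": {"close": "00:00", "open": "07:00"}}

print(is_open_at(chang_jiang, "Saturday", "22:00"))
print(is_open_at(mon_ami_gabi, "Saturday", "22:00"))
```

If places that close at or after midnight should be included, the predicate would need an extra condition for closing times earlier than opening times.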
  
	
  

LAS VEGAS RESTAURANT
Finding hummus at 22:00
> SELECT name, stars, b.hours.Wednesday, categories
  FROM dfs.yelp.`business.json` b
  WHERE b.hours.Wednesday.`open` < '22:00' AND
        b.hours.Wednesday.`close` > '22:00' AND
        REPEATED_CONTAINS(categories, 'Mediterranean') AND
        city = 'Las Vegas'
    ORDER BY stars DESC
    LIMIT 1;

+-------------------------------+-------+----------------------------------+--------------------------------------------------------------+
| name                          | stars | EXPR$2                           | categories                                                   |
+-------------------------------+-------+----------------------------------+--------------------------------------------------------------+
| Marrakech Moroccan Restaurant | 4.0   | {"close":"23:00","open":"17:30"} | ["Mediterranean","Middle Eastern","Moroccan","Restaurants"] |
+-------------------------------+-------+----------------------------------+--------------------------------------------------------------+
1 row selected (2.185 seconds)

APACHE DRILL
Unique benefits
• Working with repeated values

Flatten Repeated Values
> SELECT name, categories
  FROM dfs.yelp.`business.json` LIMIT 3;
+----------------------------+----------------------------------------+
| name                       | categories                             |
+----------------------------+----------------------------------------+
| Eric Goldberg, MD          | ["Doctors","Health & Medical"]         |
| Pine Cone Restaurant       | ["Restaurants"]                        |
| Deforest Family Restaurant | ["American (Traditional)","Restaurants"] |
+----------------------------+----------------------------------------+

> SELECT name, FLATTEN(categories) AS categories
  FROM dfs.yelp.`business.json` LIMIT 3;
+----------------------+------------------+
| name                 | categories       |
+----------------------+------------------+
| Eric Goldberg, MD    | Doctors          |
| Eric Goldberg, MD    | Health & Medical |
| Pine Cone Restaurant | Restaurants      |
+----------------------+------------------+
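
The effect of FLATTEN in the output above can be sketched in plain Python: each element of the repeated column produces its own output row, with the other columns copied. This is an illustration of the semantics shown on the slide, not Drill's implementation; the helper name flatten is hypothetical and the sample rows are taken from the result tables above.

```python
def flatten(rows, column):
    """Yield one output row per element of the list stored in `column`."""
    for row in rows:
        for value in row[column]:
            out = dict(row)      # copy the other columns unchanged
            out[column] = value  # replace the list with a single element
            yield out


businesses = [
    {"name": "Eric Goldberg, MD", "categories": ["Doctors", "Health & Medical"]},
    {"name": "Pine Cone Restaurant", "categories": ["Restaurants"]},
    {"name": "Deforest Family Restaurant",
     "categories": ["American (Traditional)", "Restaurants"]},
]

first_three = list(flatten(businesses, "categories"))[:3]  # like LIMIT 3
for row in first_three:
    print(row["name"], "|", row["categories"])
```

Taking the first three flattened rows reproduces the three rows of the second result table.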
  

Most and Least Common Business Categories
> SELECT category, COUNT(*) AS businesses
  FROM (SELECT name, FLATTEN(categories) AS category
        FROM dfs.yelp.`business.json`)
  GROUP BY category ORDER BY businesses DESC;
+-------------+------------+
| category    | businesses |
+-------------+------------+
| Restaurants | 14303      |
| ...............          |
| Firewood    | 1          |
+-------------+------------+
715 rows selected (3.439 seconds)

> SELECT name, categories FROM dfs.yelp.`business.json`
  WHERE true AND REPEATED_CONTAINS(categories, 'Australian');
+-------------------+----------------------------------------------------------------------+
| name              | categories                                                           |
+-------------------+----------------------------------------------------------------------+
| The Australian AZ | ["Bars","Burgers","Nightlife","Australian","Sports Bars","Restaurants"] |
+-------------------+----------------------------------------------------------------------+
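
The two queries above combine flattening with grouping, and filtering on a repeated column. A rough plain-Python equivalent is sketched below on a handful of rows copied from the slides. The helper names are hypothetical, and repeated_contains here is a simplified exact-match stand-in for illustration, not a description of everything Drill's REPEATED_CONTAINS does.

```python
from collections import Counter


def category_counts(businesses):
    """Count businesses per category, most common first."""
    counts = Counter(c for b in businesses for c in b["categories"])
    return counts.most_common()


def repeated_contains(values, keyword):
    """Simplified stand-in: true if keyword is one of the list elements."""
    return keyword in values


sample = [
    {"name": "Eric Goldberg, MD", "categories": ["Doctors", "Health & Medical"]},
    {"name": "Pine Cone Restaurant", "categories": ["Restaurants"]},
    {"name": "Deforest Family Restaurant",
     "categories": ["American (Traditional)", "Restaurants"]},
    {"name": "The Australian AZ",
     "categories": ["Bars", "Burgers", "Nightlife", "Australian",
                    "Sports Bars", "Restaurants"]},
]

top_category = category_counts(sample)[0]
australian = [b["name"] for b in sample
              if repeated_contains(b["categories"], "Australian")]
print(top_category, australian)
```

On this small sample the most common category is Restaurants and only one business carries the Australian category, which mirrors the shape of the results on the slide.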
  

APACHE DRILL
• Views - Dynamic and Materialized

Create a view combining business and reviews datasets.
> CREATE OR REPLACE VIEW dfs.tmp.BusinessReviews AS
    SELECT b.name, b.stars, r.votes.funny,
           r.votes.useful, r.votes.cool, r.`date`
      FROM dfs.yelp.`business.json` b, dfs.yelp.`review.json` r
      WHERE r.business_id = b.business_id;

+------+-----------------------------------------------------------------+
| ok   | summary                                                         |
+------+-----------------------------------------------------------------+
| true | View 'BusinessReviews' created successfully in 'dfs.tmp' schema |
+------+-----------------------------------------------------------------+

> SELECT COUNT(*) AS Total FROM dfs.tmp.BusinessReviews;

+------------+
| Total      |
+------------+
| 1125458    |
+------------+
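
The view joins each review to its business on business_id and projects a few nested fields. As a rough illustration of that logic outside Drill, the sketch below performs the same inner join in plain Python with a dictionary lookup. The function name and the two sample records are hypothetical, and the sketch assumes business_id is unique among businesses.

```python
def join_business_reviews(businesses, reviews):
    """Inner join on business_id, projecting the columns used in the view."""
    by_id = {b["business_id"]: b for b in businesses}  # assumes unique ids
    rows = []
    for r in reviews:
        b = by_id.get(r["business_id"])
        if b is None:
            continue  # inner join: drop reviews without a matching business
        rows.append({
            "name": b["name"],
            "stars": b["stars"],
            "funny": r["votes"]["funny"],
            "useful": r["votes"]["useful"],
            "cool": r["votes"]["cool"],
            "date": r["date"],
        })
    return rows


# Hypothetical sample data for illustration only.
businesses = [{"business_id": "b1", "name": "Example Cafe", "stars": 4.0}]
reviews = [
    {"business_id": "b1", "votes": {"funny": 0, "useful": 2, "cool": 1},
     "date": "2007-05-17"},
    {"business_id": "b2", "votes": {"funny": 1, "useful": 0, "cool": 0},
     "date": "2008-01-01"},
]

joined = join_business_reviews(businesses, reviews)
print(joined)
```

Only the review whose business_id matches a business appears in the result, which is the behavior of the join condition in the WHERE clause.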

Materialized Views AKA Tables
> ALTER SESSION SET `store.format` = 'parquet';

> CREATE TABLE dfs.tmp.BusinessReviewsTbl AS
    SELECT b.name, b.stars, r.votes.funny funny,
           r.votes.useful useful, r.votes.cool cool, r.`date`
      FROM dfs.yelp.`business.json` b, dfs.yelp.`review.json` r
      WHERE r.business_id = b.business_id;

+----------+---------------------------+
| Fragment | Number of records written |
+----------+---------------------------+
| 1_0      | 176448                    |
| 1_1      | 192439                    |
| 1_2      | 198625                    |
| 1_3      | 200863                    |
| 1_4      | 181420                    |
| 1_5      | 175663                    |
+----------+---------------------------+

DRILL ARCHITECTURE
Under the hood

High Level Architecture
Cluster of commodity servers
–  Daemon (drillbit) on each node
ZooKeeper maintains ephemeral cluster membership information
–  Drillbit uses ZooKeeper to find other drillbits in the cluster
–  Client uses ZooKeeper to find drillbits
Built-in, optimistic query execution engine. Doesn’t require a particular storage or execution system (MapReduce, Spark, Tez)
–  Better performance and manageability
Data processing unit is columnar record batches
–  Enables schema flexibility with negligible performance impact

Drill Maximizes Data Locality
Data Source        Best Practice
HDFS or MapR-FS    drillbit on each DataNode
HBase or MapR-DB   drillbit on each RegionServer
MongoDB            drillbit on each mongod node (when using replicas, run it on the replica node)
[Diagram: a drillbit co-located with each DataNode/RegionServer/mongod node, alongside a ZooKeeper ensemble]

Core Modules within drillbit
[Diagram: RPC Endpoint, SQL Parser, Logical Plan, Optimizer, Physical Plan, Execution, Distributed Cache, and Storage Plugins (DFS, HBase, Hive, MongoDB)]

SELECT * Query Execution
1.  Find drillbits (once per session)
2.  Submit query to drillbit
3.  Create logical and physical execution plans
4.  Farm out execution of fragments to cluster (completely distributed execution)
5.  Return results to client
* CTAS (CREATE TABLE AS SELECT) queries include steps 1-4
[Diagram: client (JDBC, ODBC, REST), ZooKeeper ensemble, and drillbits]

Participate
•  Learn: http://drill.apache.org/
•  Download: http://drill.apache.org/download/
•  Ask Questions: user@drill.apache.org
•  Engage on Twitter: @ApacheDrill

Thank You
@mapr maprtech
MapRTechnologies
maprtech
mapr-technologies

Or Run Drill in Distributed Mode…
•  Make sure ZooKeeper (zkServer) is running:
$ zkServer start
•  Not sure if ZooKeeper is running? Run telnet localhost 2181 and make sure it connects
•  Define the Drill cluster name and ZooKeeper nodes in conf/drill-override.conf
•  Start drillbit:
$ bin/drillbit.sh start
•  Access the Web UI: http://localhost:8047
•  Connect a client to the cluster (e.g., sqlline):
$ bin/sqlline -u jdbc:drill:zk=localhost:2181
•  Clients (like sqlline) connect to ZooKeeper to discover the cluster nodes
•  If you have multiple Drill clusters registered in one ZooKeeper ensemble, specify the desired cluster in the JDBC connection string: jdbc:drill:zk=localhost:2181/drill/<clustername>
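
The connection strings on this slide follow a simple pattern: the ZooKeeper address, optionally followed by /drill/ and the cluster name. The small helper below assembles such a string. It is only a sketch: the function name drill_jdbc_url and the cluster name used in the example are hypothetical, and joining several ZooKeeper nodes with commas is an assumption that goes beyond what the slide shows.

```python
def drill_jdbc_url(zk_nodes, cluster_name=None):
    """Build a Drill JDBC URL from ZooKeeper host:port entries."""
    url = "jdbc:drill:zk=" + ",".join(zk_nodes)  # comma join is an assumption
    if cluster_name:
        # Needed when several Drill clusters share one ZooKeeper ensemble.
        url += "/drill/" + cluster_name
    return url


print(drill_jdbc_url(["localhost:2181"]))
print(drill_jdbc_url(["localhost:2181"], cluster_name="mycluster"))
```

The first call yields the string passed to sqlline on the slide; the second shows the form used to select one cluster among several.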
  

user.json
{
  "yelping_since": "2007-08",
  "votes": {
    "funny": 198,
    "useful": 415,
    "cool": 206
  },
  "review_count": 283,
  "name": "Adele",
  "user_id": "9NJdKpRNwwaL4cvKq0cN6g",
  "friends": ["DrKQzBFAvxhyjLgbPSW2Qw", "ebXx-G5eFqWkfDuk22f81w", "qWLezzHxOXN-GQdInixZzw"],
  "fans": 10,
  "average_stars": 3.6499999999999999,
  "compliments": {
    "funny": 4,
    "hot": 17,
    "cool": 20
  },
  "elite": [2008, 2009, 2010, 2011, 2012, 2013, 2014]
}

Self service data exploration with apache drill

  • 1. ® © 2015 MapR Technologies 1 ® © 2015 MapR Technologies Self Service Data Exploration with Apache Drill
  • 2. ® © 2015 MapR Technologies 2 Data is doubling in size every two years
  • 3. ® © 2015 MapR Technologies 3 2011 2013 In 2020 it is estimated to be 44 zettabytes of data in the world 2020 Source: IDC Digital Universe 44ZETTABYTES* 4.4ZETTABYTES 1.8ZETTABYTES … *A zettabyte is 1 BILLION terabytes.
  • 4. ® © 2015 MapR Technologies 4 2011 2013 In 2020 it is estimated to be 44 zettabytes of data in the world 2020 Source: IDC Digital Universe 44ZETTABYTES* 4.4ZETTABYTES 1.8ZETTABYTES … 700 trillion *A zettabyte is 1 BILLION terabytes. 64GB Each person in the world had 100k of these iPhones
  • 5. ® © 2015 MapR Technologies 5 UNSTRUCTURED DATA 1980 2000 20101990 2020 Unstructured data will account for more than 80% of the data collected by organizations Source: Human-Computer Interaction & Knowledge Discovery in Complex Unstructured, Big Data TotalDataStored STRUCTURED DATA
  • 6. ® © 2015 MapR Technologies 6 Evolving distance to data Business (analysts, developers) “Plumbing” development Business (analysts, developers) Existing approaches require a middleman (IT) Data Data Data Business (analysts, developers) Modeling and transformations Map/Reduce Traditional SQL-on-Hadoop New SQL-on-Hadoop
  • 7. ® © 2015 MapR Technologies 7 SQL in a NoSchema World •  SQL •  BI (Tableau, MicroStrategy, etc.) •  Low latency •  Scalability •  Create and maintain schemas on: –  HDFS (Parquet, JSON, etc.) –  HBase –  MongoDB •  Transform or copy data 2 DON’T WANT WANT
  • 8. ® © 2015 MapR Technologies 8 • Schema-free scale-out query engine for Hadoop and NoSQL • Low latency • Extreme ease of use • Industry-standard APIs: ANSI SQL, ODBC/JDBC, RESTful APIs APACHE DRILL
  • 9. ® © 2015 MapR Technologies 9 Drill’s Data Model is Flexible HBase JSON BSON CSV TSV Parquet Avro Schema-lessFixed schema Flat Complex Flexibility Name! Gender! Age! Michael! M! 6! Jennifer! F! 3! {! name: {! first: Michael,! last: Smith! },! hobbies: [ski, soccer],! district: Los Altos! }! {! name: {! first: Jennifer,! last: Gates! },! hobbies: [sing],! preschool: CCLC! }! RDBMS/SQL-on-Hadoop table Apache Drill table Flexibility
  • 10. ® © 2015 MapR Technologies 10 Running Drill takes 10 minutes   +-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+   |  full_name        |  position_title  |      salary      |   +-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+   |  Sheri  Nowmer  |  President            |  80000.0        |   +-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐+   1  row  selected  (0.417  seconds)   DOWNLOAD HTTP://DRILL.APACHE.ORG/DOWNLOAD EXTRACT $  tar  xf  apache-­‐drill-­‐0.7.0.tar.gz   $  cd  apache-­‐drill-­‐0.7.0   RUN $  bin/sqlline  -­‐u  jdbc:drill:zk=local   >  SELECT  full_name,  position_title,  salary      FROM  cp.`employee.json  `      LIMIT  1;  QUERY & step by step In SQL format
  • 11. Introduce external data sources to Drill
      > SELECT * FROM dfs.root.`/E:/drill/data/yelp/review.json`;
      > SELECT * FROM dfs.yelp.`review.json` LIMIT 1;
      > USE dfs.yelp;
      > SELECT * FROM `review.json` LIMIT 1;
      > SELECT * FROM hbase.users LIMIT 1;
      Coordinates: storage plugin, workspace, table
      Currently supported providers:
        Provider  Workspace  Table
        files     Path       Path relative to workspace
        mongo     Database   Collection
        hive      Database   Table
        hbase     Namespace  Table
        …
  • 12. Introduce external data sources to Drill
      Example:
      > SELECT * FROM dfs.root.`/E:/drill/data/yelp/review.json`;
      > SELECT * FROM dfs.yelp.`review.json` LIMIT 1;
      > USE dfs.yelp;
      > SELECT * FROM `review.json` LIMIT 1;
      > SELECT * FROM hbase.users LIMIT 1;
      Coordinates: storage plugin, workspace, table
      Currently supported providers:
        Provider  Workspace  Table
        files     Path       Path relative to workspace
        mongo     Database   Collection
        hive      Database   Table
        hbase     Namespace  Table
        …
  • 13. Inventory: DFS Files
      {
        "votes": {"funny": 0, "useful": 2, "cool": 1},
        "user_id": "Xqd0DzHaiyRqVH3WRG7hzg",
        "review_id": "15SdjuK7DmYqUAj6rjGowg",
        "stars": 5,
        "date": "2007-05-17",
        "text": "dr. goldberg offers everything ...",
        "type": "review",
        "business_id": "vcNAWiLM4dR7D2nwwJ7nCA"
      }
  • 14. business.json (1)
      {
        "business_id": "4bEjOyTaDG24SY5TxsaUNQ",
        "full_address": "3655 Las Vegas Blvd S\nThe Strip\nLas Vegas, NV 89109",
        "hours": {
          "Monday": {"close": "23:00", "open": "07:00"},
          "Tuesday": {"close": "23:00", "open": "07:00"},
          "Friday": {"close": "00:00", "open": "07:00"},
          "Wednesday": {"close": "23:00", "open": "07:00"},
          "Thursday": {"close": "23:00", "open": "07:00"},
          "Sunday": {"close": "23:00", "open": "07:00"},
          "Saturday": {"close": "00:00", "open": "07:00"}
        },
        "open": true,
        "categories": ["Breakfast & Brunch", "Steakhouses", "French", "Restaurants"],
        "city": "Las Vegas",
        "review_count": 4084,
        "name": "Mon Ami Gabi",
        "neighborhoods": ["The Strip"],
        "longitude": -115.172588519464,
  • 15. business.json (2)
        "state": "NV",
        "stars": 4.0,
        "attributes": {
          "Alcohol": "full_bar",
          "Noise Level": "average",
          "Has TV": false,
          "Attire": "casual",
          "Ambience": {
            "romantic": true,
            "intimate": false,
            "touristy": false,
            "hipster": false,
            "classy": true,
            "trendy": false,
            "casual": false
          },
          "Good For": {"dessert": false, "latenight": false, "lunch": false,
                       "dinner": true, "breakfast": false, "brunch": false}
        }
      }
  • 16. Use cases LAS VEGAS NEW RESTAURANT
  • 17. NEW RESTAURANT: Customers for opening party
      > SELECT name, review_count
        FROM dfs.yelp.`user.json`
        ORDER BY review_count DESC
        LIMIT 50;
      +----------+--------------+
      | name     | review_count |
      +----------+--------------+
      | Victor   | 8062         |
      | Jennifer | 4244         |
      | Anita    | 3829         |
      | ......   | ....         |
      | Eileen   | 1947         |
      | J        | 1946         |
      | Matt     | 1942         |
      +----------+--------------+
      50 rows selected (1.16 seconds)
  • 18. NEW RESTAURANT: Cities with most businesses
      > SELECT state, city, COUNT(*) AS businesses
        FROM dfs.yelp.`business.json`
        GROUP BY state, city
        ORDER BY businesses DESC LIMIT 10;
      +-------+------------+------------+
      | state | city       | businesses |
      +-------+------------+------------+
      | NV    | Las Vegas  | 12021      |
      | AZ    | Phoenix    | 7499       |
      | AZ    | Scottsdale | 3605       |
      | EDH   | Edinburgh  | 2804       |
      | AZ    | Mesa       | 2041       |
      | AZ    | Tempe      | 2025       |
      | NV    | Henderson  | 1914       |
      | AZ    | Chandler   | 1637       |
      | WI    | Madison    | 1630       |
      | AZ    | Glendale   | 1196       |
      +-------+------------+------------+
  • 19. Use cases LAS VEGAS LAS VEGAS RESTAURANT
  • 20. LAS VEGAS RESTAURANT: Open restaurants at 22:00
      > SELECT name, b.hours
        FROM dfs.yelp.`business.json` b
        WHERE b.hours.Saturday.`open` < '22:00' AND
              b.hours.Saturday.`close` > '22:00'
        LIMIT 1;
      +------+-------+
      | name | hours |
      +------+-------+
      | Chang Jiang Chinese Kitchen | {"Tuesday":{"close":"22:00","open":"11:00"},"Friday":{"close":"22:30","open":"11:00"},"Monday":{"close":"22:00","open":"11:00"},"Wednesday":{"close":"22:00","open":"11:00"},"Thursday":{"close":"22:00","open":"11:00"},"Sunday":{"close":"21:00","open":"16:00"},"Saturday":{"close":"22:30","open":"11:00"}} |
      +------+-------+
      1 row selected (0.013 seconds)
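The WHERE clause above compares times as strings, which works because zero-padded HH:MM values sort lexicographically in clock order. The sketch below mirrors that predicate in Python using hours that appear in this deck; `is_open_at` is a hypothetical helper, not a Drill function.

```python
def is_open_at(hours, at):
    """Mirror the slide's predicate: open < at AND close > at, compared as strings."""
    return hours["open"] < at and hours["close"] > at


# Saturday hours taken from the query result and from business.json (1).
chang_jiang_saturday = {"close": "22:30", "open": "11:00"}
mon_ami_gabi_saturday = {"close": "00:00", "open": "07:00"}

print(is_open_at(chang_jiang_saturday, "22:00"))   # True
print(is_open_at(mon_ami_gabi_saturday, "22:00"))  # False
```

The second result shows a limit of the simple predicate: a business that closes at midnight ("00:00") is open at 22:00 but is filtered out, so hours that wrap past midnight would need separate handling.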
  • 21. LAS VEGAS RESTAURANT: Finding hummus at 22:00
      > SELECT name, stars, b.hours.Wednesday, categories
        FROM dfs.yelp.`business.json` b
        WHERE b.hours.Wednesday.`open` < '22:00' AND
              b.hours.Wednesday.`close` > '22:00' AND
              REPEATED_CONTAINS(categories, 'Mediterranean') AND
              city = 'Las Vegas'
        ORDER BY stars DESC
        LIMIT 1;
      +------+-------+--------+------------+
      | name | stars | EXPR$2 | categories |
      +------+-------+--------+------------+
      | Marrakech Moroccan Restaurant | 4.0 | {"close":"23:00","open":"17:30"} | ["Mediterranean","Middle Eastern","Moroccan","Restaurants"] |
      +------+-------+--------+------------+
      1 row selected (2.185 seconds)
  • 22. APACHE DRILL Unique benefits
      • Working with repeated values
  • 23. Flatten Repeated Values
      > SELECT name, categories
        FROM dfs.yelp.`business.json` LIMIT 3;
      +----------------------------+------------------------------------------+
      | name                       | categories                               |
      +----------------------------+------------------------------------------+
      | Eric Goldberg, MD          | ["Doctors","Health & Medical"]           |
      | Pine Cone Restaurant       | ["Restaurants"]                          |
      | Deforest Family Restaurant | ["American (Traditional)","Restaurants"] |
      +----------------------------+------------------------------------------+

      > SELECT name, FLATTEN(categories) AS categories
        FROM dfs.yelp.`business.json` LIMIT 3;
      +----------------------+------------------+
      | name                 | categories       |
      +----------------------+------------------+
      | Eric Goldberg, MD    | Doctors          |
      | Eric Goldberg, MD    | Health & Medical |
      | Pine Cone Restaurant | Restaurants      |
      +----------------------+------------------+
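As a rough mental model of FLATTEN, each input row is repeated once per element of the array column. The sketch below emulates that in plain Python on the three rows shown on the slide. It describes the observable result only, not how Drill executes it; `flatten` here is a hypothetical Python helper.

```python
def flatten(rows, column):
    """Yield one output row per element of the repeated (array) column."""
    for row in rows:
        for value in row[column]:
            # Copy the row and replace the array with a single element.
            yield {**row, column: value}


businesses = [
    {"name": "Eric Goldberg, MD", "categories": ["Doctors", "Health & Medical"]},
    {"name": "Pine Cone Restaurant", "categories": ["Restaurants"]},
    {"name": "Deforest Family Restaurant",
     "categories": ["American (Traditional)", "Restaurants"]},
]

flat = list(flatten(businesses, "categories"))[:3]  # emulate LIMIT 3
for row in flat:
    print(row["name"], "|", row["categories"])
```

Three input rows expand to five flattened rows in total; the limit keeps the first three, which match the second result table on the slide.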
  • 24. Most and Least Common Business Categories
      > SELECT category, COUNT(*) AS businesses
        FROM (SELECT name, FLATTEN(categories) AS category
              FROM dfs.yelp.`business.json`)
        GROUP BY category ORDER BY businesses DESC;
      +-----------------+------------+
      | category        | businesses |
      +-----------------+------------+
      | Restaurants     | 14303      |
      | ............... |            |
      | Firewood        | 1          |
      +-----------------+------------+
      715 rows selected (3.439 seconds)

      > SELECT name, categories FROM dfs.yelp.`business.json`
        WHERE true AND REPEATED_CONTAINS(categories, 'Australian');
      +------+------------+
      | name | categories |
      +------+------------+
      | The Australian AZ | ["Bars","Burgers","Nightlife","Australian","Sports Bars","Restaurants"] |
      +------+------------+
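The two queries on this slide can be mirrored on a handful of rows to see what they compute: counting businesses per flattened category, and keeping rows whose array contains a given value. The sketch below uses only rows that appear in this deck, so its counts are for this tiny sample and not for the Yelp dataset. Plain membership (`in`) stands in for REPEATED_CONTAINS as a simplifying assumption.

```python
from collections import Counter

businesses = [
    {"name": "Eric Goldberg, MD", "categories": ["Doctors", "Health & Medical"]},
    {"name": "Pine Cone Restaurant", "categories": ["Restaurants"]},
    {"name": "Deforest Family Restaurant",
     "categories": ["American (Traditional)", "Restaurants"]},
    {"name": "The Australian AZ",
     "categories": ["Bars", "Burgers", "Nightlife", "Australian",
                    "Sports Bars", "Restaurants"]},
]

# FLATTEN + GROUP BY category + COUNT(*), ordered by count descending.
counts = Counter(c for b in businesses for c in b["categories"])
ranked = counts.most_common()

# REPEATED_CONTAINS(categories, 'Australian'), emulated as exact membership.
australian = [b["name"] for b in businesses if "Australian" in b["categories"]]

print(ranked[0])
print(australian)
```

Flattening before grouping is what lets a business with several categories be counted once under each of them.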
  • 25. APACHE DRILL
      • Views - Dynamic and Materialized
  • 26. Create a view combining business and reviews datasets.
      > CREATE OR REPLACE VIEW dfs.tmp.BusinessReviews AS
        SELECT b.name, b.stars, r.votes.funny,
               r.votes.useful, r.votes.cool, r.`date`
        FROM dfs.yelp.`business.json` b, dfs.yelp.`review.json` r
        WHERE r.business_id = b.business_id;
      +------+-----------------------------------------------------------------+
      | ok   | summary                                                         |
      +------+-----------------------------------------------------------------+
      | true | View 'BusinessReviews' created successfully in 'dfs.tmp' schema |
      +------+-----------------------------------------------------------------+

      > SELECT COUNT(*) AS Total FROM dfs.tmp.BusinessReviews;
      +---------+
      | Total   |
      +---------+
      | 1125458 |
      +---------+
  • 27. Materialized Views AKA Tables
      > ALTER SESSION SET `store.format` = 'parquet';

      > CREATE TABLE dfs.tmp.BusinessReviewsTbl AS
        SELECT b.name, b.stars, r.votes.funny funny,
               r.votes.useful useful, r.votes.cool cool, r.`date`
        FROM dfs.yelp.`business.json` b, dfs.yelp.`review.json` r
        WHERE r.business_id = b.business_id;
      +----------+---------------------------+
      | Fragment | Number of records written |
      +----------+---------------------------+
      | 1_0      | 176448                    |
      | 1_1      | 192439                    |
      | 1_2      | 198625                    |
      | 1_3      | 200863                    |
      | 1_4      | 181420                    |
      | 1_5      | 175663                    |
      +----------+---------------------------+
  • 28. DRILL ARCHITECTURE Under the hood
  • 29. High Level Architecture
      • Cluster of commodity servers
        – Daemon (drillbit) on each node
      • ZooKeeper maintains ephemeral cluster membership information
        – Drillbit uses ZooKeeper to find other drillbits in the cluster
        – Client uses ZooKeeper to find drillbits
      • Built-in, optimistic query execution engine. Doesn’t require a particular storage or execution system (MapReduce, Spark, Tez)
        – Better performance and manageability
      • Data processing unit is columnar record batches
        – Enables schema flexibility with negligible performance impact
  • 30. Drill Maximizes Data Locality
      Data Source        Best Practice
      HDFS or MapR-FS    drillbit on each DataNode
      HBase or MapR-DB   drillbit on each RegionServer
      MongoDB            drillbit on each mongod node (when using replicas, run it on the replica node)
      Diagram: each node pairs a drillbit with a DataNode/RegionServer/mongod; a ZooKeeper ensemble coordinates the cluster.
  • 31. Core Modules within drillbit
      RPC Endpoint, SQL Parser, Logical Plan, Optimizer, Physical Plan, Execution, Storage Plugins (DFS, HBase, Hive, MongoDB), Distributed Cache
  • 32. SELECT Query Execution (*)
      Components: client (JDBC, ODBC, REST), ZooKeeper ensemble, drillbits
      1. Find drillbits (once per session)
      2. Submit query to drillbit
      3. Create logical and physical execution plans
      4. Farm out execution of fragments to cluster (completely distributed execution)
      5. Return results to client
      (*) CTAS (CREATE TABLE AS SELECT) queries include steps 1-4
  • 33. Participate
      • Learn: http://drill.apache.org/
      • Download: http://drill.apache.org/download/
      • Ask Questions: user@drill.apache.org
      • Engage on Twitter: @ApacheDrill
  • 34. Thank You @mapr maprtech MapRTechnologies maprtech mapr-technologies
  • 35. Or Run Drill in Distributed Mode…
      • Make sure ZooKeeper (zkServer) is running:
          $ zkServer start
        – Not sure if ZooKeeper is running? Run telnet localhost 2181 and make sure it connects
      • Define the Drill cluster name and ZooKeeper nodes in conf/drill-override.conf
      • Start drillbit:
          $ bin/drillbit.sh start
      • Access the Web UI: http://localhost:8047
      • Connect a client to the cluster (e.g., sqlline):
          $ bin/sqlline -u jdbc:drill:zk=localhost:2181
        – Clients (like sqlline) connect to ZooKeeper to discover the cluster nodes
        – If you have multiple Drill clusters registered in one ZooKeeper ensemble, specify the desired cluster in the JDBC connection string: jdbc:drill:zk=localhost:2181/drill/<clustername>
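The two connection-string shapes on this slide can be captured in a small helper, sketched below. `drill_jdbc_url` is a hypothetical function. The `drill` path segment is copied from the slide's example, and the comma-separated form for several ZooKeeper hosts is an assumption to confirm against the Drill JDBC documentation.

```python
def drill_jdbc_url(zk_hosts, cluster=None, drill_root="drill"):
    """Compose a ZooKeeper-based Drill JDBC URL.

    zk_hosts:   list of "host:port" strings for the ZooKeeper ensemble.
    cluster:    optional cluster name, needed when several Drill clusters
                share one ensemble.
    drill_root: path segment before the cluster name ("drill" on the slide).
    """
    url = "jdbc:drill:zk=" + ",".join(zk_hosts)
    if cluster:
        url += f"/{drill_root}/{cluster}"
    return url


# "mycluster" is a placeholder, not a name taken from the deck.
print(drill_jdbc_url(["localhost:2181"]))
print(drill_jdbc_url(["localhost:2181"], cluster="mycluster"))
```

The resulting string is what gets passed to `bin/sqlline -u` in the step above.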
  • 36. user.json
      {
        "yelping_since": "2007-08",
        "votes": {
          "funny": 198,
          "useful": 415,
          "cool": 206
        },
        "review_count": 283,
        "name": "Adele",
        "user_id": "9NJdKpRNwwaL4cvKq0cN6g",
        "friends": ["DrKQzBFAvxhyjLgbPSW2Qw", "ebXx-G5eFqWkfDuk22f81w", "qWLezzHxOXN-GQdInixZzw"],
        "fans": 10,
        "average_stars": 3.6499999999999999,
        "compliments": {
          "funny": 4,
          "hot": 17,
          "cool": 20
        },
        "elite": [2008, 2009, 2010, 2011, 2012, 2013, 2014]
      }