Apache Drill
Interactive Analysis of Large-Scale Datasets


              Jason Frantz
             Architect, MapR
My Background
•   Caltech
•   Clustrix
•   MapR
•   Founding member of Apache Drill
MapR Technologies
• The open enterprise-grade distribution for Hadoop
   – Easy, dependable and fast
   – Open source with standards-based extensions

• MapR is deployed at thousands of companies
   – From small Internet startups to the world’s largest enterprises

• MapR customers analyze massive amounts of data:
   – Hundreds of billions of events daily
   – 90% of the world’s Internet population monthly
   – $1 trillion in retail purchases annually

• MapR has partnered with Google to provide Hadoop on Google
  Compute Engine
Latency Matters
• Ad-hoc analysis with interactive tools

• Real-time dashboards

• Event/trend detection and analysis
  – Network intrusions
  – Fraud
  – Failures
Big Data Processing
                      Batch processing   Interactive analysis      Stream processing
Query runtime         Minutes to hours   Milliseconds to minutes   Never-ending
Data volume           TBs to PBs         GBs to PBs                Continuous stream
Programming model     MapReduce          Queries                   DAG
Users                 Developers         Analysts and developers   Developers
Google project        MapReduce          Dremel
Open source project   Hadoop MapReduce                             Storm and S4


                 Introducing Apache Drill…
GOOGLE DREMEL
Google Dremel
• Interactive analysis of large-scale datasets
    –   Trillion records at interactive speeds
    –   Complementary to MapReduce
    –   Used by thousands of Google employees
    –   Paper published at VLDB 2010
         • Authors: Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva
           Shivakumar, Matt Tolton, Theo Vassilakis


• Model
    – Nested data model with schema
         • Most data at Google is stored/transferred in Protocol Buffers
         • Normalization (to relational) is prohibitive
    – SQL-like query language with nested data support

• Implementation
    – Column-based storage and processing
    – In-situ data access (GFS and Bigtable)
    – Tree architecture as in Web search (and databases)
Google BigQuery
• Hosted Dremel (Dremel as a Service)
• CLI (bq) and Web UI
• Import data from Google Cloud Storage or local files
   – Files must be in CSV format
       • Nested data not supported [yet] except built-in datasets
   – Schema definition required
APACHE DRILL
Architecture



• Only the execution engine knows the physical attributes of the cluster
    – # nodes, hardware, file locations, …

• Public interfaces enable extensibility
    – Developers can build parsers for new query languages
    – Developers can provide an execution plan directly

• Each level of the plan has a human readable representation
    – Facilitates debugging and unit testing
Architecture (2)
Execution Engine Layers
• Drill execution engine has two layers
   – Operator layer is serialization-aware
       • Processes individual records
   – Execution layer is not serialization-aware
       • Processes batches of records (blobs)
       • Responsible for communication, dependencies and fault tolerance
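
The split between the two layers can be sketched roughly as follows. This is only an illustration, not Drill's implementation: JSON is assumed as the serialization format, and the names encode_batch, decode_batch, filter_operator and execute are hypothetical.

```python
import json
from typing import Callable, Iterable, Iterator, List

Record = dict
Blob = bytes  # an opaque batch of records, as seen by the execution layer


def encode_batch(records: List[Record]) -> Blob:
    """Serialize a batch of records (JSON is an assumption made for this sketch)."""
    return json.dumps(records).encode("utf-8")


def decode_batch(blob: Blob) -> List[Record]:
    """Deserialize a blob back into individual records."""
    return json.loads(blob.decode("utf-8"))


def filter_operator(predicate: Callable[[Record], bool]) -> Callable[[Blob], Blob]:
    """Operator layer: serialization-aware, processes individual records."""
    def run(blob: Blob) -> Blob:
        kept = [r for r in decode_batch(blob) if predicate(r)]
        return encode_batch(kept)
    return run


def execute(blobs: Iterable[Blob], operator: Callable[[Blob], Blob]) -> Iterator[Blob]:
    """Execution layer: moves blobs through an operator without ever decoding them."""
    for blob in blobs:
        yield operator(blob)


batches = [
    encode_batch([{"x": 1}, {"x": 5}]),
    encode_batch([{"x": 7}]),
]
results = [decode_batch(b) for b in execute(batches, filter_operator(lambda r: r["x"] > 2))]
```

In this sketch only the operator knows how to read a record, so communication and fault tolerance could in principle be handled in execute() without depending on any data format.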
Data Flow
Nested Query Languages
• DrQL
   – SQL-like query language for nested data
   – Compatible with Google BigQuery/Dremel
      • BigQuery applications should work with Drill
   – Designed to support efficient column-based processing
      • No record assembly during query processing


• Mongo Query Language
   – {$query: {x: 3, y: "abc"}, $orderby: {x: 1}}

• Other languages/programming models can plug in
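
To make the Mongo-style query above concrete, here is a minimal sketch that evaluates it against in-memory documents. It assumes only equality criteria and simple ascending (1) or descending (-1) ordering; run_mongo_query and the sample documents are hypothetical and not part of Drill or MongoDB.

```python
from typing import Any, Dict, List

Doc = Dict[str, Any]


def run_mongo_query(docs: List[Doc], spec: Dict[str, Any]) -> List[Doc]:
    """Evaluate {"$query": {...}, "$orderby": {...}} with equality matching only."""
    criteria = spec.get("$query", {})
    matched = [d for d in docs if all(d.get(k) == v for k, v in criteria.items())]
    # Apply sort keys from least to most significant; list.sort is stable.
    for field, direction in reversed(list(spec.get("$orderby", {}).items())):
        matched.sort(key=lambda d: d[field], reverse=(direction == -1))
    return matched


docs = [
    {"id": 1, "x": 3, "y": "abc"},
    {"id": 2, "x": 4, "y": "abc"},
    {"id": 3, "x": 3, "y": "abc"},
    {"id": 4, "x": 3, "y": "zzz"},
]
spec = {"$query": {"x": 3, "y": "abc"}, "$orderby": {"x": 1}}
matched_ids = [d["id"] for d in run_mongo_query(docs, spec)]
```

A pluggable front end would translate such a specification into the same logical operators (scan, filter, sort) that a DrQL query produces, rather than evaluating it directly as done here.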
Nested Data Model
•   The data model in Dremel is Protocol Buffers
     – Nested
     – Schema
•   Apache Drill is designed to support multiple data models
     – Schema: Protocol Buffers, Apache Avro, …
     – Schema-less: JSON, BSON, …
•   Flat records are supported as a special case of nested data
     – CSV, TSV, …

Avro IDL:

     enum Gender {
       MALE, FEMALE
     }

     record User {
       string name;
       Gender gender;
       long followers;
     }

JSON:

     {
       "name": "Srivas",
       "gender": "Male",
       "followers": 100
     }
     {
       "name": "Raina",
       "gender": "Female",
       "followers": 200,
       "zip": "94305"
     }
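
A small sketch of what treating flat records as a special case of nested data could look like, using the two JSON records above plus a made-up CSV line. The helper names read_json, read_csv and field are hypothetical, and the lookup rule (a missing field yields a default) is an assumption for illustration.

```python
import csv
import io
import json

json_lines = (
    '{"name": "Srivas", "gender": "Male", "followers": 100}\n'
    '{"name": "Raina", "gender": "Female", "followers": 200, "zip": "94305"}'
)
csv_text = "name,gender,followers\nSrivas,Male,100\n"


def read_json(text):
    """Schema-less input: each line is one self-describing record."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def read_csv(text):
    """Flat input: each row becomes a record with only top-level fields."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


def field(record, path, default=None):
    """Look up a dotted path; the same call works for flat and nested records."""
    current = record
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


users = read_json(json_lines)
flat_users = read_csv(csv_text)
zips = [field(u, "zip") for u in users]
```

Without a schema, the second record simply carries an extra "zip" field and the first one reports it as missing. Note that the CSV reader returns every value as a string, which is one place where a schema would help.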
DrQL Example

SELECT DocId AS Id,
  COUNT(Name.Language.Code) WITHIN Name AS Cnt,
  Name.Url + ',' + Name.Language.Code AS Str
FROM t
WHERE REGEXP(Name.Url, '^http')
  AND DocId < 20;




                                    * Example from the Dremel paper
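
The query above can be approximated over an in-memory record as follows. This is a simplified reading of the semantics, not the engine's algorithm: the record is hypothetical, the output layout is chosen for this sketch, and a document whose Names are all filtered out is still emitted with an empty Name list, which is an assumption.

```python
import re

# Hypothetical nested record with the fields the query refers to.
doc = {
    "DocId": 10,
    "Name": [
        {"Url": "http://A", "Language": [{"Code": "en-us"}, {"Code": "en"}]},
        {"Url": "http://B", "Language": []},
        {"Language": [{"Code": "en-gb"}]},  # no Url, so the WHERE clause drops it
    ],
}


def run_query(docs):
    out = []
    for d in docs:
        if not d["DocId"] < 20:  # AND DocId < 20
            continue
        names = []
        for name in d.get("Name", []):
            url = name.get("Url")
            if url is None or not re.search(r"^http", url):  # REGEXP(Name.Url, '^http')
                continue
            codes = [lang["Code"] for lang in name.get("Language", []) if "Code" in lang]
            names.append({
                "Cnt": len(codes),                       # COUNT(...) WITHIN Name
                "Str": [url + "," + c for c in codes],   # one value per Language
            })
        out.append({"Id": d["DocId"], "Name": names})
    return out


result = run_query([doc])
```

WITHIN Name scopes the count to each Name group, so the first Name yields a count of 2 and the second a count of 0 instead of one total for the whole document.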
Query Components
• Query components:
   –   SELECT
   –   FROM
   –   WHERE
   –   GROUP BY
   –   HAVING
   –   (JOIN)

• Key logical operators:
   –   Scan
   –   Filter
   –   Aggregate
   –   (Join)
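
As a rough illustration of how the query components map onto the logical operators, the sketch below chains Scan, Filter and Aggregate for a query of the form SELECT country, SUM(clicks) FROM t WHERE clicks > 0 GROUP BY country. The table, column names and function names are made up for the example.

```python
from collections import defaultdict


def scan(rows):
    """Scan: produce records from a source (FROM)."""
    yield from rows


def filter_rows(rows, predicate):
    """Filter: keep records that satisfy the predicate (WHERE)."""
    return (r for r in rows if predicate(r))


def aggregate(rows, key, value):
    """Aggregate: sum one field per group (GROUP BY with SUM)."""
    totals = defaultdict(int)
    for r in rows:
        totals[r[key]] += r[value]
    return dict(totals)


t = [
    {"country": "US", "clicks": 3},
    {"country": "US", "clicks": 0},
    {"country": "FR", "clicks": 2},
    {"country": "US", "clicks": 4},
]
totals = aggregate(filter_rows(scan(t), lambda r: r["clicks"] > 0), "country", "clicks")
```

Because scan and filter_rows are lazy, records stream through the pipeline one at a time and only the aggregate holds state.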
Extensibility
•   Nested query languages
     –   Pluggable model
     –   DrQL
     –   Mongo Query Language
     –   Cascading

•   Distributed execution engine
     – Extensible model (e.g., Dryad)
     – Low-latency
     – Fault tolerant

•   Nested data formats
     – Pluggable model
     – Column-based (ColumnIO/Dremel, Trevni, RCFile) and row-based (RecordIO, Avro, JSON, CSV)
     – Schema (Protocol Buffers, Avro, CSV) and schema-less (JSON, BSON)

•   Scalable data sources
     – Pluggable model
     – Hadoop
     – HBase
Scan Operators
• Drill supports multiple data formats by having per-format scan operators
   • Queries involving multiple data formats/sources are supported

• Fields and predicates can be pushed down into the scan operator

• Scan operators may have adaptive side-effects (database cracking)
   • Produce ColumnIO from RecordIO
   • Google PowerDrill stores materialized expressions with the data
                         Scan with schema                          Scan without schema
Operator output          Protocol Buffers                          JSON-like (MessagePack)
Supported data formats   ColumnIO (column-based protobuf/Dremel)   JSON
                         RecordIO (row-based protobuf)             HBase
                         CSV
SELECT … FROM …          ColumnIO(proto URI, data URI)             Json(data URI)
                         RecordIO(proto URI, data URI)             HBase(table name)
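
Pushing fields and predicates down into a scan operator might look like the sketch below, shown for a CSV source. The csv_scan function and its parameters are hypothetical and do not reflect Drill's operator interface.

```python
import csv
import io


def csv_scan(text, fields=None, predicate=None):
    """Per-format scan operator for CSV with optional pushdown.

    fields:    names of the columns to emit (projection pushdown)
    predicate: function applied to each raw row (predicate pushdown)
    """
    for row in csv.DictReader(io.StringIO(text)):
        if predicate is not None and not predicate(row):
            continue  # rejected inside the scan; never reaches later operators
        yield {f: row[f] for f in fields} if fields else dict(row)


data = "name,gender,followers\nSrivas,Male,100\nRaina,Female,200\n"
popular = list(csv_scan(data, fields=["name"],
                        predicate=lambda r: int(r["followers"]) > 150))
```

A column-based format benefits more from the same interface, since a scan that is told which fields are needed can skip reading the other columns entirely.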
Design Principles
Flexible
• Pluggable query languages
• Extensible execution engine
• Pluggable data formats
  • Column-based and row-based
  • Schema and schema-less
• Pluggable data sources

Easy
• Unzip and run
• Zero configuration
• Reverse DNS not needed
• IP addresses can change
• Clear and concise log messages

Dependable
• No SPOF
• Instant recovery from crashes

Fast
• C/C++ core with Java support
  • Google C++ style guide
• Min latency and max throughput (limited only by hardware)
Hadoop Integration
• Hadoop data sources
   – Hadoop FileSystem API (HDFS/MapR-FS)
   – HBase
• Hadoop data formats
   – Apache Avro
   – RCFile
• MapReduce-based tools to create column-based formats
• Table registry in HCatalog
• Run long-running services in YARN
Get Involved!
• Download these slides
    – http://www.mapr.com/company/events/bay-area-hug/9-19-2012


• Join the mailing list
    – drill-dev-subscribe@incubator.apache.org


• Join MapR
    – jobs@mapr.com

Sep 2012 HUG: Apache Drill for Interactive Analysis
