Cloudera Impala:
A Modern SQL Engine for Apache Hadoop
Mark Grover
Software Engineer, Cloudera
February 27, 2013
What is Impala?
●   General-purpose SQL engine
●   Real-time queries in Apache Hadoop
●   Beta version available since October 2012
●   General availability (GA) release slated for April 2013
●   Open source under Apache license
Overview
●   User View of Impala
●   Architecture of Impala
●   Comparing Impala with Dremel
●   Comparing Impala with Hive
●   Impala Roadmap
Impala Overview: Goals
●   General-purpose SQL query engine:
     ●   should work for both analytical and transactional workloads
     ●   will support queries that take from milliseconds to hours
●   Runs directly within Hadoop:
     ●   reads widely used Hadoop file formats
     ●   talks to widely used Hadoop storage managers
     ●   runs on same nodes that run Hadoop processes
●   High performance:
     ●   C++ instead of Java
     ●   runtime code generation
     ●   completely new execution engine that doesn't build on MapReduce
User View of Impala: Overview
●   Runs as a distributed service in cluster: one Impala daemon on each
    node with data
●   User submits query via ODBC/JDBC to any of the daemons
●   Query is distributed to all nodes with relevant data
●   If any node fails, the query fails
●   Impala uses Hive's metadata interface, connects to Hive's metastore
●   Supported file formats:
     ●   uncompressed/lzo-compressed text files
     ●   sequence files and RCFile with snappy/gzip compression
     ●   GA: Avro data files
     ●   GA: columnar format (more on that later)
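To make "query is distributed to all nodes with relevant data" concrete, here is a minimal Python sketch of locality-aware scan assignment. It is an illustration only: the function `assign_blocks`, the block names and the node names are hypothetical, and the slides do not describe Impala's actual scheduling policy.

```python
def assign_blocks(block_locations):
    """Pick one replica-holding node per block, balancing the load."""
    load = {}        # node -> number of blocks assigned so far
    assignment = {}  # block -> node chosen to scan it locally
    for block, nodes in sorted(block_locations.items()):
        # Choose the replica holder with the fewest blocks so far;
        # ties are broken by node name to keep the result deterministic.
        chosen = min(nodes, key=lambda node: (load.get(node, 0), node))
        assignment[block] = chosen
        load[chosen] = load.get(chosen, 0) + 1
    return assignment

# Hypothetical block-to-replica map for one table.
block_locations = {
    "b1": ["n1", "n2"],
    "b2": ["n1", "n3"],
    "b3": ["n2", "n3"],
}
assignment = assign_blocks(block_locations)
```

Every block is read by a node that already stores it, so no scan data crosses the network in this toy model.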
User View of Impala: SQL
●   SQL support:
     ●   patterned after Hive's version of SQL
     ●   essentially SQL-92, minus correlated subqueries
     ●   limited to Select, Project, Join, Union, Subqueries, Aggregation and
         Insert
     ●   only equi-joins; no non-equi joins, no cross products
     ●   Order By only with Limit
     ●   GA: DDL support (CREATE, ALTER)
●   Functional limitations:
     ●   no custom UDFs, file formats, SerDes
     ●   no extensions beyond SQL (buckets, samples, transforms, arrays, structs, maps,
         xpath, json)
     ●   only hash joins; joined table has to fit in memory:
           ● beta: in memory of a single node
           ● GA: in aggregate memory of all (executing) nodes
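The memory limitation follows from how a hash join works: one input is loaded into an in-memory hash table and the other input is streamed past it. A minimal Python sketch of an equi-join, assuming rows are plain dicts; the name `hash_join` and the sample tables are illustrative and not taken from Impala:

```python
from collections import defaultdict

def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Equi-join two lists of dict rows using an in-memory hash table."""
    # Build phase: the entire build side is held in memory.
    table = defaultdict(list)
    for row in build_rows:
        table[row[build_key]].append(row)
    # Probe phase: stream the other side and look up matching rows.
    joined = []
    for row in probe_rows:
        for match in table.get(row[probe_key], []):
            joined.append({**match, **row})
    return joined

customers = [{"cust_id": 1, "state": "CA"}, {"cust_id": 2, "state": "NY"}]
orders = [
    {"cust_id": 1, "revenue": 10},
    {"cust_id": 1, "revenue": 5},
    {"cust_id": 3, "revenue": 7},
]
result = hash_join(customers, orders, "cust_id", "cust_id")
```

Only equality predicates can be answered by a hash lookup, which is consistent with the equi-join restriction listed above.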
User View of Impala: Apache HBase
●   HBase functionality:
     ●   uses Hive's mapping of HBase table into metastore table
     ●   predicates on rowkey columns are mapped into start/stop
         row
     ●   predicates on other columns are mapped into
         SingleColumnValueFilters
●   HBase functional limitations:
     ●   no nested-loop joins
     ●   all data stored as text
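The rowkey mapping can be pictured as folding comparison predicates into a single scan range. The Python sketch below assumes byte-string rowkeys compared lexicographically, an inclusive start row and an exclusive stop row; `rowkey_range` is a hypothetical helper, not Impala or HBase code.

```python
def rowkey_range(predicates):
    """Fold rowkey predicates into (start_row, stop_row).

    start_row is inclusive, stop_row is exclusive; None means unbounded.
    """
    start, stop = None, None
    for op, value in predicates:
        # value + b"\x00" is the smallest key strictly greater than value.
        if op == "=":
            lo, hi = value, value + b"\x00"
        elif op == ">=":
            lo, hi = value, None
        elif op == ">":
            lo, hi = value + b"\x00", None
        elif op == "<":
            lo, hi = None, value
        elif op == "<=":
            lo, hi = None, value + b"\x00"
        else:
            raise ValueError(f"unsupported operator: {op}")
        # Keep the tightest bounds seen so far.
        if lo is not None and (start is None or lo > start):
            start = lo
        if hi is not None and (stop is None or hi < stop):
            stop = hi
    return start, stop

scan_range = rowkey_range([(">=", b"user100"), ("<", b"user200")])
```

Predicates on non-key columns cannot narrow the range this way, which is why they are pushed down as filters instead.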
Impala Architecture
●   Two binaries: impalad and statestored
●   Impala daemon (impalad)
     ●   handles client requests and all internal requests related to
         query execution
     ●   exports Thrift services for these two roles
●   State store daemon (statestored)
     ●   provides name service and metadata distribution
     ●   also exports a Thrift service
Impala Architecture
●   Query execution phases
     ●   request arrives via ODBC/JDBC
     ●   planner turns request into collections of plan fragments
     ●   coordinator initiates execution on remote impalad's
     ●   during execution
           ● intermediate results are streamed between executors
           ● query results are streamed back to client
           ● subject to limitations imposed by blocking operators
             (top-n, aggregation)
Impala Architecture: Planner
●   2-phase planning process:
     ●   single-node plan: left-deep tree of plan operators
     ●   plan partitioning: partition single-node plan to maximize scan locality,
         minimize data movement
●   Plan operators: Scan, HashJoin, HashAggregation, Union, TopN,
    Exchange
●   Distributed aggregation: pre-aggregation in all nodes, merge
    aggregation in single node
     ●   GA: hash-partitioned aggregation: re-partition aggregation
         input on grouping columns in order to reduce per-node
         memory requirement
●   Join order = FROM clause order
     ●   GA target: rudimentary cost-based optimizer
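The two aggregation strategies can be sketched in a few lines of Python. This is a toy model for SUM(revenue) grouped by state: the function names, the two-node input and the use of `zlib.crc32` as the partitioning hash are all assumptions made for illustration.

```python
import zlib
from collections import defaultdict

def pre_aggregate(rows):
    """Per-node step: partial SUM(revenue) for each state."""
    sums = defaultdict(int)
    for state, revenue in rows:
        sums[state] += revenue
    return dict(sums)

def merge_aggregate(partials):
    """Single-node merge step: add up the partial sums."""
    sums = defaultdict(int)
    for partial in partials:
        for state, subtotal in partial.items():
            sums[state] += subtotal
    return dict(sums)

def repartition(partials, num_nodes):
    """Hash-partition partial results on the grouping column."""
    buckets = [[] for _ in range(num_nodes)]
    for partial in partials:
        for state, subtotal in partial.items():
            target = zlib.crc32(state.encode()) % num_nodes
            buckets[target].append((state, subtotal))
    return buckets

node_inputs = [
    [("CA", 10), ("NY", 4), ("CA", 1)],
    [("NY", 6), ("TX", 3)],
]
partials = [pre_aggregate(rows) for rows in node_inputs]

# Merge aggregation in a single node.
single_node_result = merge_aggregate(partials)

# Hash-partitioned variant: each node merges only its own groups.
partitioned_result = {}
for bucket in repartition(partials, num_nodes=2):
    partitioned_result.update(pre_aggregate(bucket))
```

Both paths give the same sums; in the partitioned variant each merging node holds only the groups hashed to it, which is the per-node memory saving the bullet refers to.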
Impala Architecture: Planner
●   Example: query with join and aggregation
    SELECT state, SUM(revenue)
    FROM HdfsTbl h JOIN HbaseTbl b ON (...)
    GROUP BY 1 ORDER BY 2 desc LIMIT 10

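    [Figure: two plan trees for the query. Left: TopN over Agg over Hash Join,
    with HDFS Scan and HBase Scan as inputs. Right, split into fragments:
    TopN, Agg and Exch at coordinator; Agg, Hash Join, HDFS Scan and Exch at
    DataNodes; HBase Scan at region servers.]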
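The ORDER BY ... LIMIT 10 in the example query corresponds to the TopN operator. One common way to implement such an operator is a bounded heap, which never holds more than the limit's worth of rows. The Python sketch below is an assumption about the general technique, not a description of Impala's implementation; `top_n` and the sample rows are hypothetical.

```python
import heapq

def top_n(rows, n, key):
    """Return the n rows with the largest key, largest first."""
    heap = []  # min-heap of (key value, arrival index, row)
    for index, row in enumerate(rows):
        entry = (key(row), index, row)
        if len(heap) < n:
            heapq.heappush(heap, entry)
        elif entry[0] > heap[0][0]:
            # New row beats the smallest row kept so far: swap it in.
            heapq.heapreplace(heap, entry)
    ordered = sorted(heap, key=lambda e: (-e[0], e[1]))
    return [row for _, _, row in ordered]

revenue_by_state = [("CA", 11), ("NY", 10), ("TX", 3), ("WA", 7)]
best = top_n(revenue_by_state, 2, key=lambda row: row[1])
```

Memory use depends on the limit rather than on the input size, which may be why ORDER BY is only offered together with LIMIT at this stage.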
Impala Architecture: Query Execution
Request arrives via ODBC/JDBC

[Figure: a SQL App sends a SQL request over ODBC to one of three nodes. Each
node runs a Query Planner, Query Coordinator and Query Executor next to an
HDFS DN and HBase. Hive Metastore, HDFS NN and Statestore are shown above the
nodes.]
Impala Architecture: Query Execution
Planner turns request into collections of plan fragments
Coordinator initiates execution on remote impalad's
[Figure: SQL App with ODBC; Hive Metastore, HDFS NN and Statestore; three
nodes, each with Query Planner, Query Coordinator, Query Executor, HDFS DN
and HBase.]
Impala Architecture: Query Execution
Intermediate results are streamed between impalad's
Query results are streamed back to client

[Figure: query results flow back to the SQL App over ODBC. Hive Metastore,
HDFS NN and Statestore; three nodes, each with Query Planner, Query
Coordinator, Query Executor, HDFS DN and HBase.]
Impala Architecture
●   Metadata handling:
     ●   utilizes Hive's metastore
     ●   caches metadata: no synchronous metastore API calls
         during query execution
     ●   beta: impalad's read metadata from metastore at startup
     ●   Post-GA: metadata distribution through statestore
     ●   Post-GA: HCatalog
Impala Architecture
●   Execution engine
     ●   written in C++
     ●   runtime code generation for "big loops"
     ●   internal in-memory tuple format puts fixed-width data at
         fixed offsets
     ●   uses intrinsics/special cpu instructions for text parsing,
         crc32 computation, etc.
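Fixed-width data at fixed offsets means a column value can be read with simple address arithmetic, with no per-row parsing. The Python sketch below uses the standard `struct` module; the 16-byte layout (an int32 id at offset 0 and a float64 revenue at offset 8) is a hypothetical example, not Impala's real tuple format.

```python
import struct

# Hypothetical layout: int32 id at offset 0, float64 revenue at offset 8.
ID_OFFSET, REVENUE_OFFSET, TUPLE_SIZE = 0, 8, 16

def pack_tuple(buffer, slot, row_id, revenue):
    """Write one tuple into its slot of a preallocated batch buffer."""
    base = slot * TUPLE_SIZE
    struct.pack_into("<i", buffer, base + ID_OFFSET, row_id)
    struct.pack_into("<d", buffer, base + REVENUE_OFFSET, revenue)

def read_revenue(buffer, slot):
    """Read the revenue column of one tuple by offset arithmetic alone."""
    offset = slot * TUPLE_SIZE + REVENUE_OFFSET
    return struct.unpack_from("<d", buffer, offset)[0]

batch = bytearray(3 * TUPLE_SIZE)
for slot, (row_id, revenue) in enumerate([(1, 2.5), (2, 4.0), (3, 1.25)]):
    pack_tuple(batch, slot, row_id, revenue)

total = sum(read_revenue(batch, slot) for slot in range(3))
```

Because the offsets are constants for a given tuple layout, they are also natural inputs for runtime code generation.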
Impala Execution Engine
●   More on runtime code generation
     ●   example of "big loop": insert batch of rows into hash table
     ●   known at query compile time: number of tuples in a batch, tuple
         layout, column types, etc.
     ●   generated at query compile time: unrolled loop that inlines all
         function calls, contains no dead code, minimizes branches
     ●   code generated using LLVM
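The idea of specializing a loop once the query is known can be imitated in Python by generating source text and compiling it with `exec`. This is only an analogy: Impala emits native code through LLVM, whereas the sketch below builds a Python function, and the names `interpreted_sum` and `generate_sum` are hypothetical.

```python
def interpreted_sum(rows, column_indexes):
    """Generic version: walks the column list again for every row."""
    total = 0
    for row in rows:
        for index in column_indexes:
            total += row[index]
    return total

def generate_sum(column_indexes):
    """Emit and compile a function with the column loop unrolled."""
    # The column list is known when the query is compiled, so the inner
    # loop can be replaced by one straight-line expression.
    terms = " + ".join(f"row[{index}]" for index in column_indexes) or "0"
    source = (
        "def specialized_sum(rows):\n"
        "    total = 0\n"
        "    for row in rows:\n"
        f"        total += {terms}\n"
        "    return total\n"
    )
    namespace = {}
    exec(source, namespace)
    return namespace["specialized_sum"]

rows = [(1, 10, 100), (2, 20, 200)]
specialized = generate_sum([0, 2])
generic_total = interpreted_sum(rows, [0, 2])
specialized_total = specialized(rows)
```

The generated function contains no inner loop and no lookups into the column list, which mirrors the "unrolled, no dead code, fewer branches" goals in a much simplified form.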
Impala's Statestore
●   Central system state repository
     ●   name service (membership)
     ●   Post-GA: metadata
     ●   Post-GA: other scheduling-relevant or diagnostic state
●   Soft-state
     ●   all data can be reconstructed from the rest of the system
     ●   cluster continues to function when statestore fails, but per-node state
         becomes increasingly stale
●   Sends periodic heartbeats
     ●   pushes new data
     ●   checks for liveness
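A toy model of the heartbeat behaviour, written in Python under loud assumptions: subscribers are plain callbacks, a failed push is signalled by `ConnectionError`, and a node is dropped after two consecutive misses. None of these details come from the slides, and the real statestore speaks Thrift.

```python
class Statestore:
    """Toy membership service: pushes updates and tracks liveness."""

    def __init__(self, max_missed=2):
        self.subscribers = {}  # name -> callback receiving the member list
        self.missed = {}       # name -> consecutive failed heartbeats
        self.max_missed = max_missed

    def register(self, name, callback):
        self.subscribers[name] = callback
        self.missed[name] = 0

    def heartbeat(self):
        """One round: push membership to all, then drop dead nodes."""
        members = sorted(self.subscribers)
        for name in members:
            try:
                self.subscribers[name](members)  # push new data
                self.missed[name] = 0
            except ConnectionError:              # liveness check failed
                self.missed[name] += 1
        for name in members:
            if self.missed[name] >= self.max_missed:
                del self.subscribers[name]
                del self.missed[name]

views = {}  # what each healthy node last heard from the statestore

def healthy(name):
    def receive(members):
        views[name] = list(members)
    return receive

def unreachable(members):
    raise ConnectionError("node down")

store = Statestore(max_missed=2)
store.register("impalad-1", healthy("impalad-1"))
store.register("impalad-2", unreachable)
for _ in range(3):
    store.heartbeat()
```

After the third round the healthy node's view no longer contains the unreachable one. If the statestore itself stopped, `views` would simply keep its last contents, which matches the soft-state point above about increasingly stale per-node state.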
Statestore: Why not ZooKeeper?
●   ZK is not a good pub-sub system
     ●   Watch API is awkward and requires a lot of client logic
     ●   multiple round-trips required to get data for changes to
         node's children
     ●   push model is more natural for our use case
●   Don't need all the guarantees ZK provides:
     ●   serializability
     ●   persistence
     ●   prefer to avoid complexity where possible
●   ZK is bad at the things we care about and good at the
    things we don't
Comparing Impala to Dremel
●   What is Dremel?
     ●   columnar storage for data with nested structures
     ●   distributed scalable aggregation on top of that
●   Columnar storage in Hadoop: joint project between Cloudera
    and Twitter
     ●   new columnar format: Parquet; derived from Doug Cutting's Trevni
     ●   stores data in appropriate native/binary types
     ●   can also store nested structures similar to Dremel's ColumnIO
●   Distributed aggregation: Impala
●   Impala plus Parquet: a superset of the published version of
    Dremel (which didn't support joins)
More about Parquet
●   What is it:
     ●   container format for all popular serialization formats: Avro, Thrift,
         Protocol Buffers
     ●   successor to Trevni
     ●   jointly developed between Cloudera and Twitter
     ●   open source; hosted on github
●   Features
     ●   rowgroup format: file contains multiple horiz. slices
     ●   supports storing each column in separate file
     ●   supports fully shredded nested data; repetition and definition levels
         similar to Dremel's ColumnIO
     ●   column values stored in native types (bool, int<x>, float, double, byte
         array)
     ●   support for index pages for fast lookup
     ●   extensible value encodings
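The rowgroup idea (horizontal slices, each stored column by column) can be shown with plain Python lists. This sketch deliberately ignores pages, encodings and repetition/definition levels, so it is not the Parquet file layout; `to_row_groups` and `read_column` are hypothetical names.

```python
def to_row_groups(rows, columns, rows_per_group):
    """Slice rows horizontally, then store each slice column by column."""
    groups = []
    for start in range(0, len(rows), rows_per_group):
        chunk = rows[start:start + rows_per_group]
        groups.append({name: [row[name] for row in chunk] for name in columns})
    return groups

def read_column(groups, name):
    """Scan a single column without touching the others."""
    return [value for group in groups for value in group[name]]

rows = [
    {"state": "CA", "revenue": 10},
    {"state": "NY", "revenue": 4},
    {"state": "TX", "revenue": 3},
]
groups = to_row_groups(rows, ["state", "revenue"], rows_per_group=2)
revenues = read_column(groups, "revenue")
```

A query that only needs revenue reads one list per row group and skips the state values entirely, which is the main attraction of a columnar layout for analytical scans.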
Comparing Impala to Hive
●   Hive: MapReduce as an execution engine
     ●   High latency, low throughput queries
     ●   Fault-tolerance model based on MapReduce's on-disk
         checkpointing; materializes all intermediate results
     ●   Java runtime allows for easy late-binding of functionality:
         file formats and UDFs.
     ●   Extensive layering imposes high runtime overhead
●   Impala:
     ●   direct, process-to-process data exchange
     ●   no fault tolerance
     ●   an execution engine designed for low runtime overhead
Comparing Impala to Hive
●   Impala's performance advantage over Hive: no hard
    numbers, but
     ●   Impala can get full disk throughput (~100MB/sec/disk);
         I/O-bound workloads often faster by 3-4x
     ●   queries that require multiple map-reduce phases in Hive
         see a higher speedup
     ●   queries that run against in-memory data see a higher
         speedup (observed up to 100x)
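The ~100MB/sec/disk figure allows a rough lower bound on scan time for an I/O-bound query. The Python sketch below is back-of-the-envelope arithmetic only; the cluster size (10 nodes with 12 disks each) and the 1 TB table are hypothetical numbers chosen for illustration.

```python
def scan_seconds(data_bytes, nodes, disks_per_node, mb_per_sec_per_disk=100):
    """Lower bound on scan time if every disk is read at full throughput."""
    aggregate_bytes_per_sec = nodes * disks_per_node * mb_per_sec_per_disk * 10**6
    return data_bytes / aggregate_bytes_per_sec

# Hypothetical cluster: 10 nodes with 12 disks each, scanning 1 TB.
estimate = scan_seconds(10**12, nodes=10, disks_per_node=12)
```

With these assumed numbers the aggregate rate is 12 GB/sec, so the scan takes a little over 83 seconds at best; an engine reaching a third of the disk throughput would need about three times as long.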
Impala Roadmap: GA – April 2013
●   New data formats:
     ●   Avro
     ●   Parquet
●   Improved query execution: partitioned joins
●   Further performance improvements
●   Guidelines for production deployment:
     ●   load balancing across impalad's
     ●   resource isolation within MR cluster
Impala Roadmap: 2013
●   Additional SQL:
     ●   UDFs
     ●   SQL authorization and DDL
     ●   ORDER BY without LIMIT
     ●   window functions
     ●   support for structured data types
●   Improved HBase support:
     ●   composite keys, complex types in columns,
         index nested-loop joins,
         INSERT/UPDATE/DELETE
Impala Roadmap: 2013
●   Runtime optimizations:
     ●   straggler handling
     ●   join order optimization
     ●   improved cache management
     ●   data collocation for improved join performance
●   Better metadata handling:
     ●   automatic metadata distribution through statestore
●   Resource management:
     ●   goal: run exploratory and production workloads in same
         cluster, against same data, w/o impacting production jobs
Try it out!
●   Beta version available since 10/24/12
●   Latest version is 0.6
●   We have packages for:
     ●   RHEL 6.2/5.7
     ●   Ubuntu 10.04 and 12.04
     ●   SLES 11
     ●   Debian 6
●   We are targeting GA for April 2013
●   Questions/comments? impala-user@cloudera.org
●   My email address: mgrover@cloudera.com
●   My twitter handle: mark_grover
