Near-realtime
Big Data Analytics
using Impala

David Lauzon
Big Data Montreal #8
January 10th 2013
                       1 / 18
Plan

•   What is Impala?
•   Why did Google build Dremel?
•   Use cases for Impala
•   Use cases for Map-Reduce
•   Cloudera Customer Survey
•   Impala Features
•   Impala Performance Expectations and Benchmarks
•   Impala Components
•   Impala Architecture
•   Impala Development Roadmap
•   Where to learn more and get started


                                                     2 / 18
Disclaimer


• To keep the descriptions of Dremel and Impala
  as accurate as possible, most of the content in
  this presentation comes from the authors of the
  respective technologies. References are listed
  at the end of the presentation.
• I am not affiliated with or sponsored by Cloudera or
  Google.

                                               3 / 18
What is Impala?

  “An impala is an athletic, graceful
  African antelope, famous for its
  speed and its leaping ability”
  - Wikipedia




                                         4 / 18
Seriously, what is Impala?


• “Impala enables real-time, interactive,
  analytical queries of the data stored in
  HBase or HDFS” – Cloudera

• Inspired by Google’s Dremel paper (2010)
  – BigQuery is a hosted service that implements Dremel
     • It is proprietary, not free, and requires you to upload your
       data to Google’s servers



                                                                 5 / 18
Why did Google build Dremel?

• Problems with data warehouse solutions for OLAP/BI:
   – Relational OLAP (ROLAP):
       • Needs indices built for every possible query (for performance
         reasons)
            → The indices can grow large enough to fill all available RAM
   – Multi-dimensional OLAP (MOLAP):
       • Requires extensive time and money to design and build the data cubes
   – Ad-hoc queries (specific, non-optimised queries):
       • Needed when you don’t know in advance what you’ll ask, or when
         you work in iterations, i.e. quite often!


• Solution:
   – Increase full-scan speed without requiring indexing or
     pre-aggregated values


                                                                           6 / 18
Come on, give me some use cases!

• Finding particular records that match specified
  conditions:
   – “Find all the locations from which account ‘ABC’
     was accessed.”
• Quick aggregation of statistics with dynamically
  changing conditions:
   – “Can you give me yesterday’s number of impressions
     for Google AdWords display ads – but only in the
     Tokyo region?”
• Trial-and-error data analysis:
   – “And between 11am and 1pm?”

                                                          7 / 18
Use cases for which you should stick
with Map-Reduce based applications

• Very long running, batch-oriented tasks
  such as ETL:
  – e.g. exporting large amounts of data after processing
• Complex event processing:
  – e.g. stream-processing
• “Complex data mining on Big Data which requires
  multiple iterations and paths of data processing
  with programmed algorithms” - Google


                                                           8 / 18
Integration with Hadoop


• Cloudera Customer Survey (Aug. 2012)
  – 80% need faster queries on Hadoop data
  – 65% query Hadoop using Hive
  – 70% move data from Hadoop to RDBMS for
    interactive SQL
  – 60% see value today in consolidating to a single
    platform




                                                       9 / 18
Impala Features


• Shared with Hive:
   –   Hive MetaStore
   –   Hive SQL (most common SQL-92 features)
   –   ODBC Driver
   –   User Interface (Hue Beeswax)
• Specific to Impala:
   –   No MapReduce; in-memory data transfers instead
   –   Host and disk awareness (data locality)
   –   Table data caching in RAM
   –   No virtual columns or locking

                                                 10 / 18
Impala Performance Expectations


• Performance improvements over Hive
  – 3 - 4X for purely I/O bound queries
  – 7 - 45X for queries with at least one join
  – 20 - 90X when the data is available in the cache




                                                 11 / 18
External Benchmarks


• Searching log files at 37signals
  (creators of the Ruby on Rails web framework)

Workload                                                    Impala      Hive        MySQL
                                                            query time  query time  query time
5.2 GB HAProxy log – top IPs by request count               3.1s        65.4s       146.0s
5.2 GB HAProxy log – top IPs by total request time          3.3s        65.2s       164.0s
800 MB parsed Rails log – slowest accounts                  1.0s        33.2s       48.1s
800 MB parsed Rails log – highest database time paths       1.1s        33.7s       49.6s
8 GB pageview table – daily pageviews and unique visitors   22.4s       92.2s       180.0s

http://37signals.com/svn/posts/3315-how-i-came-to-love-big-data-or-at-least-acknowledge-its-existence
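
To compare these timings with the expectations on the previous slide, the speedup factors can be computed directly. The sketch below only divides the query times reported in the table; the dictionary keys are shortened labels chosen here, not names from the source.

```python
# Query times in seconds (Impala, Hive, MySQL), copied from the table above.
benchmarks = {
    "HAProxy log, top IPs by request count": (3.1, 65.4, 146.0),
    "HAProxy log, top IPs by total request time": (3.3, 65.2, 164.0),
    "Rails log, slowest accounts": (1.0, 33.2, 48.1),
    "Rails log, highest database time paths": (1.1, 33.7, 49.6),
    "Pageview table, daily pageviews and unique visitors": (22.4, 92.2, 180.0),
}


def speedup(baseline_s, impala_s):
    """How many times faster Impala was than the baseline engine."""
    return baseline_s / impala_s


rows = []
for name, (impala_s, hive_s, mysql_s) in benchmarks.items():
    rows.append((name, speedup(hive_s, impala_s), speedup(mysql_s, impala_s)))

for name, vs_hive, vs_mysql in rows:
    print(f"{name}: {vs_hive:.1f}x vs Hive, {vs_mysql:.1f}x vs MySQL")
```

On these five workloads the speedup over Hive ranges from roughly 4x (pageview table) to roughly 33x (slowest accounts).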



                                                                                                        12 / 18
Impala Components


• Impala State Store : 1 per cluster
   – Coordinates information (location and status) about
     all the running impalad instances
• Impala Daemon (impalad) : 1 per DataNode
   – Coordinates and executes queries
   – Distributes query fragments to other Impala Daemons
• Impala Shell : 1 per node
   – Provides a command-line interface for
     interacting with Impala

                                                           13 / 18
Impala Architecture




                      14 / 18
Roadmap : 0.3 Beta Version

• Operating System:
    – Only RHEL/CentOS 6.2 is supported
• File formats:
    – Text files, SequenceFiles, HBase tables
• Compression:
    – Snappy, Gzip, BZip
•   No UDFs or user extensibility *
•   Largest table in joins must be specified first *
•   Right side of a join must fit in RAM *
•   No support for complex nested structures * :
    – e.g. maps, structs and arrays.
               * Top requests for releases after 1.0 GA
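
The two join restrictions above are what you would expect from an in-memory hash join: the right-hand table is loaded into a hash table, and the left-hand table is streamed past it. The slides do not describe Impala's join implementation, so treat this as an assumption. The sketch shows the general technique only, and the function name and sample data are hypothetical.

```python
from collections import defaultdict


def hash_join(left_rows, right_rows, left_key, right_key):
    """Inner-join two iterables of dicts on left_key == right_key."""
    # Build phase: the entire right side is held in memory,
    # which is why it must fit in RAM.
    table = defaultdict(list)
    for row in right_rows:
        table[row[right_key]].append(row)
    # Probe phase: the left side is streamed one row at a time,
    # so it can be much larger than memory.
    for row in left_rows:
        for match in table.get(row[left_key], []):
            yield {**row, **match}


# Hypothetical data: a large fact table on the left, a small table on the right.
pageviews = [
    {"user_id": 1, "url": "/a"},
    {"user_id": 2, "url": "/b"},
    {"user_id": 1, "url": "/c"},
    {"user_id": 3, "url": "/d"},  # no matching user, dropped by the inner join
]
users = [
    {"user_id": 1, "name": "ann"},
    {"user_id": 2, "name": "bob"},
]

joined = list(hash_join(pageviews, users, "user_id", "user_id"))
for row in joined:
    print(row)
```

Under this model, listing the largest table first keeps it on the streamed side, so only the smaller table has to be held in memory.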
                                                       15 / 18
Roadmap : 1.0 General Availability
(Q1 2013)

• File Formats:
   – RCFile, Avro, LZO
   – Trevni : new columnar file-format by Doug Cutting
• More OS Support:
   – Same as those supported by CDH4
• Performance:
   – Faster, bigger, and more memory efficient joins and
     aggregations
   – Straggler handling :
       • more work to faster machines, and less to slower machines
• DDL : enables users to create tables from Impala
• JDBC Driver (shared with Hive)
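
One generic way to give "more work to faster machines" is to let idle workers pull the next chunk of work from a shared queue instead of splitting the work evenly up front. The slides do not say how Impala plans to do this, so the simulation below is an assumption used to show the effect. Worker names, speeds, and chunk counts are hypothetical.

```python
import heapq


def assign_dynamically(num_chunks, speeds):
    """Simulate workers pulling equal-sized chunks from a shared queue.

    speeds maps a worker name to chunks processed per second.
    Returns (chunks completed per worker, total elapsed seconds).
    """
    done = {name: 0 for name in speeds}
    # Min-heap of (time at which the worker becomes idle, worker name).
    idle_at = [(0.0, name) for name in sorted(speeds)]
    heapq.heapify(idle_at)
    finish = 0.0
    for _ in range(num_chunks):
        # The worker that becomes idle first takes the next chunk.
        now, name = heapq.heappop(idle_at)
        end = now + 1.0 / speeds[name]
        done[name] += 1
        finish = max(finish, end)
        heapq.heappush(idle_at, (end, name))
    return done, finish


speeds = {"fast": 4.0, "slow": 1.0}  # hypothetical machines
done, elapsed = assign_dynamically(20, speeds)

# Static alternative: 10 chunks per machine, so the slow one is the straggler.
static_elapsed = max(10 / s for s in speeds.values())

print(done, elapsed, static_elapsed)
```

In this toy setup the fast machine ends up with 16 of the 20 chunks and the query finishes in 4 seconds, against 10 seconds for the even split.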

                                                                     16 / 18
Where to learn more and get started

• Impala Documentation
   https://ccp.cloudera.com/display/IMPALA10BETADOC/Cloudera+Impala+1.0+Beta+Documentation
• Cloudera’s Impala Demo VM
   https://ccp.cloudera.com/display/SUPPORT/Cloudera's+Impala+Demo+VM
• Cloudera Blog
   http://blog.cloudera.com/blog/category/impala/
• Impala-user Google Group
   https://groups.google.com/a/cloudera.org/forum/?fromgroups=#!forum/impala-user
• (Unofficial) presentation at Apache Asia Road Show
   http://sizeofvoid.net/wp-content/uploads/ImpalaIntroduction.pdf
• Official announcement of Impala at Strata Conference NY 2012
   http://www.slideshare.net/cloudera/2012-1025-hadoop-world-impala-16x9
• Dremel: Interactive Analysis of Web-Scale Datasets
   http://research.google.com/pubs/pub36632.html
• BigQuery Technical White Paper
   https://cloud.google.com/files/BigQueryTechnicalWP.pdf


                                                                                             17 / 18
Conclusion


• Use Impala when you need to quickly
  find or compute a small result from a large
  data source
• Impala does not replace batch-oriented jobs
• The Impala beta and its documentation are quite
  good for a beta
  – If you can’t wait for Impala v1.0, try BigQuery



                                                      18 / 18

Editor's Notes

  • #3: Database containing the test-analysis data for patients’ specimens, with the results. Running analytical queries on the production database is very slow and can interfere with normal operation with
  • #6: Marcel Kornacker is the architect of Impala. Prior to joining Cloudera, he was the lead developer for the query engine of Google’s F1 project
  • #8: You may well have an OLAP cube, but not for this specific use case…
  • #12: Impala uses SSE4.2 for checksumming (2X faster than without SSE4.2), e.g. on Intel Nehalem+ and AMD Bulldozer+ CPUs
  • #13: 37 signals – Web Application Company (where Ruby on Rails originated)
  • #19: Recall the objective of the talk: Key points: Meaningful, easy-to-remember message: