Performance Evaluation of Cloudera Impala 1.0
May 1, 2013
CELLANT Corp. R&D Strategy Division
Yukinori SUDA
@sudabon
Copyright © CELLANT Corp. All Rights Reserved. http://www.cellant.jp/

Cloudera Impala GA was released!!
• Support for a subset of ANSI-92 SQL
  – CREATE, ALTER, SELECT, INSERT, JOIN, and subqueries
• Support for partitioned joins, fully distributed aggregations, and fully distributed top-n queries
• Support for a variety of data formats:
  – Hadoop native (Apache Avro, SequenceFile, RCFile with Snappy, GZIP, BZIP, or uncompressed)
  – text (uncompressed or LZO-compressed)
  – Parquet (Snappy or uncompressed)
• Support for all CDH4 64-bit packages:
  – RHEL 6.2/5.7, Ubuntu, Debian, SLES
• Connectivity via JDBC, ODBC, Hue GUI, or command-line shell
• Kerberos authentication and MR/Impala resource isolation
• etc.

Our System Environment
• Installed using Cloudera Manager Free Edition 4.5.2
• All servers are connected with 1 Gbps Ethernet through an L2 switch
• Master: 3 servers
  – Roles: Active NameNode, Stand-by NameNode, JobTracker, statestored
• Slave: 11 servers
  – Each runs DataNode, TaskTracker, and Impalad

Our “wimpy” Server Specification
• CPU
  – Intel Core 2 Duo 2.13 GHz with Hyper Threading
• Memory
  – 4 GB
• Disk
  – 7,200 rpm SATA mechanical hard disk drive × 1
• OS
  – CentOS 6.2

Benchmark
• Used CDH4.2.1 + Impala version 1.0
• Used hivebench from the open-source benchmark tool “HiBench”
  – https://github.com/hibench
• Scaled the datasets down to 1/10
  – The default configuration generates a table with 1 billion rows
• Modified the query
  – Deleted “INSERT INTO TABLE …” to evaluate read-only performance
• Combined several storage formats with several compression methods
  – TextFile, SequenceFile, RCFile, ParquetFile
  – No compression, Gzip, Snappy
• Compared configurations by query job latency
• Reported the average job latency over 5 measurements

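The "average job latency over 5 measurements" procedure can be sketched roughly as follows. This is a minimal illustration, not the harness used for the slides: `run_query` is a hypothetical callable standing in for whatever submits the benchmark query and waits for its result, and wall-clock time is assumed to be the latency of interest.

```python
import statistics
import time
from typing import Callable, List


def measure_latencies(run_query: Callable[[], None], runs: int = 5) -> List[float]:
    """Run the query `runs` times and return each wall-clock latency in seconds."""
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        run_query()  # hypothetical: submits the query and blocks until it finishes
        latencies.append(time.perf_counter() - start)
    return latencies


def average_latency(latencies: List[float]) -> float:
    """Arithmetic mean of the measured latencies."""
    return statistics.mean(latencies)


if __name__ == "__main__":
    # Stand-in workload; replace the lambda with a call that runs the real query.
    samples = measure_latencies(lambda: time.sleep(0.01), runs=5)
    print(f"avg latency: {average_latency(samples):.3f} sec over {len(samples)} runs")
```

Keeping the individual samples, rather than only the mean, makes it easy to spot an outlier run (for example a cold cache on the first execution).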
Modified Datasets
• Uservisits table
  – 100 million rows
  – 16,895 MB as TextFile
  – Table definition
    • sourceIP string
    • destURL string
    • visitDate string
    • adRevenue double
    • userAgent string
    • countryCode string
    • languageCode string
    • searchWord string
    • duration int
• Rankings table
  – 12 million rows
  – 744 MB as TextFile
  – Table definition
    • pageURL string
    • pageRank int
    • avgDuration int

Modified Query
SELECT
  sourceIP,
  sum(adRevenue) AS totalRevenue,
  avg(pageRank)
FROM
  rankings_t R
JOIN (
  SELECT
    sourceIP,
    destURL,
    adRevenue
  FROM
    uservisits_t UV
  WHERE
    (datediff(UV.visitDate, '1999-01-01')>=0
    AND
    datediff(UV.visitDate, '2000-01-01')<=0)
  ) NUV
ON
  (R.pageURL = NUV.destURL)
GROUP BY sourceIP
ORDER BY totalRevenue DESC
LIMIT 1;

Benchmark Result (Hive)
Cited from “Performance evaluation of Cloudera impala 0.6 beta...”
Storage format   Compression   Avg. Job Latency [sec]
TextFile         No Comp.      235.843
SequenceFile     Gzip          227.883
SequenceFile     Snappy        213.616
RCFile           Gzip          234.289
RCFile           Snappy        197.894

Benchmark Result (Impala)
Storage format   Compression   Avg. Job Latency [sec]
TextFile         No Comp.      36.61
SequenceFile     Gzip          29.736
SequenceFile     Snappy        24.024
RCFile           Gzip          26.083
RCFile           Snappy        19.586
ParquetFile      Snappy        16.2

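To put the Hive and Impala results side by side, the ratio of the two latencies can be computed per configuration. The sketch below copies the figures from the two result slides; pairing each value with its format and compression label follows the order of the chart labels, and it assumes the Hive figures (cited from the earlier evaluation) were measured on a comparable setup. ParquetFile has no Hive counterpart on the slides, so it is left out of the ratios.

```python
# Average job latencies in seconds, copied from the two result slides.
HIVE = {
    ("TextFile", "No Comp."): 235.843,
    ("SequenceFile", "Gzip"): 227.883,
    ("SequenceFile", "Snappy"): 213.616,
    ("RCFile", "Gzip"): 234.289,
    ("RCFile", "Snappy"): 197.894,
}
IMPALA = {
    ("TextFile", "No Comp."): 36.61,
    ("SequenceFile", "Gzip"): 29.736,
    ("SequenceFile", "Snappy"): 24.024,
    ("RCFile", "Gzip"): 26.083,
    ("RCFile", "Snappy"): 19.586,
    ("ParquetFile", "Snappy"): 16.2,
}


def speedups(hive, impala):
    """Hive latency divided by Impala latency for every configuration present in both."""
    return {cfg: hive[cfg] / impala[cfg] for cfg in hive if cfg in impala}


if __name__ == "__main__":
    for (fmt, comp), ratio in speedups(HIVE, IMPALA).items():
        print(f"{fmt:<14}{comp:<10}{ratio:5.1f}x")
```

On these figures Impala comes out roughly 6x to 10x faster than Hive, with the largest ratio for RCFile with Snappy.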
Additional Experiments
• Swapped the order of the JOINed tables, as shown below
SELECT
  sourceIP, sum(adRevenue) AS totalRevenue, avg(pageRank)
FROM
  (SELECT sourceIP, destURL, adRevenue FROM uservisits_ps UV WHERE
   (datediff(UV.visitDate, '1999-01-01')>=0 AND datediff(UV.visitDate, '2000-01-01')<=0)) NUV
JOIN
  rankings_ps R
ON
  (R.pageURL = NUV.destURL)
GROUP BY sourceIP
ORDER BY totalRevenue DESC
LIMIT 1;
• Result
  – Parquet compressed with Snappy: 34.374 sec

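Swapping the two sides of an inner join should not change the answer, only how the engine executes it. The sketch below checks that on a toy dataset. Everything in it is an assumption made for illustration: it uses SQLite rather than Impala, the table names are simplified to `rankings` and `uservisits`, the rows are invented, and because SQLite has no `datediff()` the date filter is replaced by an equivalent comparison of ISO date strings.

```python
import sqlite3

# Stand-in for the datediff() filter: ISO date strings compare correctly as text.
NUV = """
    SELECT sourceIP, destURL, adRevenue
    FROM uservisits UV
    WHERE UV.visitDate BETWEEN '1999-01-01' AND '2000-01-01'
"""

ORIGINAL_ORDER = f"""
    SELECT sourceIP, sum(adRevenue) AS totalRevenue, avg(pageRank)
    FROM rankings R JOIN ({NUV}) NUV ON (R.pageURL = NUV.destURL)
    GROUP BY sourceIP ORDER BY totalRevenue DESC LIMIT 1
"""

SWAPPED_ORDER = f"""
    SELECT sourceIP, sum(adRevenue) AS totalRevenue, avg(pageRank)
    FROM ({NUV}) NUV JOIN rankings R ON (R.pageURL = NUV.destURL)
    GROUP BY sourceIP ORDER BY totalRevenue DESC LIMIT 1
"""


def make_toy_db() -> sqlite3.Connection:
    """Create an in-memory database with a few invented rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE rankings (pageURL TEXT, pageRank INTEGER, avgDuration INTEGER);
        CREATE TABLE uservisits (sourceIP TEXT, destURL TEXT, visitDate TEXT, adRevenue REAL);
        INSERT INTO rankings VALUES ('a.com', 10, 1), ('b.com', 20, 2);
        INSERT INTO uservisits VALUES
            ('1.1.1.1', 'a.com', '1999-05-01', 1.0),
            ('1.1.1.1', 'b.com', '1999-06-01', 2.0),
            ('2.2.2.2', 'a.com', '1999-07-01', 2.5),
            ('2.2.2.2', 'b.com', '2001-01-01', 100.0);  -- outside the date range
    """)
    return conn


if __name__ == "__main__":
    conn = make_toy_db()
    print(conn.execute(ORIGINAL_ORDER).fetchall())
    print(conn.execute(SWAPPED_ORDER).fetchall())
```

Both statements return the same single row here, which suggests the gap between 16.2 sec and 34.374 sec comes from how the join is executed rather than from a different result.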
Conclusion
• Parquet + Snappy is the fastest
• Specifically,
  – ParquetFile compressed with Snappy: 16.2 sec
• Need to take care with the order of JOINed tables
• Hopes for future extensions
  – UDF support
  – Window functions
  – etc.

Let's try it out in your environment!!
Thanks!
