Using BigBench to compare Hive
and Spark
Alejandro Montero, Nicolas Poggi
What is BigBench (TPCx-BB [1])?
• Specification-based benchmark with an open-source implementation, proposed as
the first Big Data benchmark standard.
• BigBench covers all major Big Data characteristics:
  • Volume -> Scale factor.
  • Velocity -> Table refresh.
  • Variety -> Data type disparity.
• Extension of TPC-DS:
  • Borrows 10 pure QL queries from TPC-DS.
  • Adds Big Data tables and use cases: Machine Learning, Natural Language Processing, ...
• Can support multiple implementations, multiple Big Data engines and table formats.
• Can execute multiple parallel streams.
• Defines scale factors for data.
  • Tested: 100 GB.
[1]: http://www.tpc.org/tpc_documents_current_versions/pdf/tpcx-bb_v1.2.0.pdf
BigBench – Overview
[Figure: BigBench data model, spanning structured, semi-structured and unstructured data. Tables shown: Marketprice, Items, Sales, Web Page, Customers, Reviews, Web Log.]
Workload:
• 14 Pure QL queries (10 borrowed from TPC-DS).
• 4 Queries with MapReduce pre-processing.
• 7 Natural Language Processing Queries.
• 5 Machine Learning Queries.
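As a quick sanity check on the workload breakdown, the four categories can be tallied in a few lines of Python. This is only a sketch: the dictionary keys and the helper names `total_queries` and `category_share` are illustrative and not part of the BigBench kit. The counts are the ones listed on the slide, and they add up to the 30 queries of the BigBench workload.

```python
# Query counts per category, as listed on the slide.
WORKLOAD = {
    "pure_ql": 14,                 # 10 of these are borrowed from TPC-DS
    "mapreduce_preprocessing": 4,
    "nlp": 7,
    "machine_learning": 5,
}


def total_queries(workload):
    """Return the total number of queries across all categories."""
    return sum(workload.values())


def category_share(workload):
    """Return each category's share of the workload as a fraction of the total."""
    total = total_queries(workload)
    return {name: count / total for name, count in workload.items()}


if __name__ == "__main__":
    print("total queries:", total_queries(WORKLOAD))
    for name, share in category_share(WORKLOAD).items():
        print(f"{name}: {share:.0%}")
```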
BigBench v1.2 – Reference Implementation
[Figure: layered stack of the reference implementation]
• Filesystem: HDFS
• Table Metastore: Hive Metastore
• Execution Engine: MapReduce, Tez, Spark (on Yarn)
• SQL Engine: Hive, Spark SQL
• ML: Mahout, Spark MLlib
• Custom Application
Benchmarked systems:
• Hive + MapReduce + Mahout
• Hive + MapReduce + Spark_MLlib
• Hive + Tez + Mahout
• Hive + Tez + Spark_MLlib
• Spark SQL + Mahout
• Spark SQL + Spark_MLlib
• Spark 2 SQL + Mahout
Work in progress:
• Hive 2
• Spark 2 SQL + Spark_MLlib
The cluster – HDInsight PaaS
• Model: HDInsight D4v3
• Head nodes: 2
• Working nodes: 4
• Zookeeper nodes: 3
• CPU: Intel(R) Xeon(R) CPU E5-2673 v3, 8 x 2.4 GHz cores
• RAM: 28 GB
• HDFS: Remote
• Software: HortonWorks Data Platform 2.5
• Spark config: 1 executor/working node, 3 cores/executor
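To make the Spark sizing concrete: with 4 working nodes, one executor per node and 3 cores per executor, the cluster runs 4 executors with 12 executor cores in total. The Python sketch below derives those numbers and maps them onto the standard Spark properties `spark.executor.instances` and `spark.executor.cores`. The helper names are hypothetical, and it is an assumption that the benchmark was configured through exactly these properties; the slide only states the per-node layout. Memory settings are not given in the slides, so none are shown.

```python
def spark_executor_settings(worker_nodes, executors_per_node, cores_per_executor):
    """Derive Spark executor properties and total executor cores (illustrative helper)."""
    executors = worker_nodes * executors_per_node
    properties = {
        "spark.executor.instances": str(executors),
        "spark.executor.cores": str(cores_per_executor),
    }
    total_cores = executors * cores_per_executor
    return properties, total_cores


def to_submit_args(properties):
    """Render properties as `--conf key=value` arguments for spark-submit."""
    args = []
    for key, value in sorted(properties.items()):
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    # Values from the cluster slide: 4 working nodes, 1 executor each, 3 cores each.
    props, cores = spark_executor_settings(4, 1, 3)
    print(" ".join(to_submit_args(props)))
    print("total executor cores:", cores)
```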
Pure QL
Time in seconds (average of three executions using 100 GB Scale Factor):
• Hive_MR: 6223
• Hive_Tez: 1601
• Spark_1.6.2: 1848
• Spark_2.0.1: 1457
Query 12 CPU behavior
[Figure: CPU behavior during query 12 for Tez, Spark 1.6.2 and Spark 2.0.2. Average of three executions using 100 GB Scale Factor.]
Custom Reducers
Time in seconds (average of three executions using 100 GB Scale Factor):
• Hive_MR: 2815
• Hive_Tez: 1122
• Spark_1.6.2: 1629
• Spark_2.0.1: 1466
Natural Language Processing
Time in seconds (average of three executions using 100 GB Scale Factor):
• Hive_MR: 5004
• Hive_Tez: 1100
• Spark_1.6.2: 2289
• Spark_2.0.1: 1913
Machine Learning
Time in seconds (average of three executions using 100 GB Scale Factor):
• Hive_MR + Mahout: 3550
• Hive_MR + Spark_ML: 1613
• Hive_Tez + Mahout: 3769
• Hive_Tez + Spark_ML: 1898
• Spark_1.6.2 + Mahout: 4045
• Spark_1.6.2 + Spark_ML: 1937
• Spark_2.0.1 + Mahout: 3390
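The Machine Learning timings can be turned into Mahout-to-MLlib speedups per SQL engine. The Python sketch below is illustrative: it assumes the seven values map to the seven system labels in the order shown (an ordering consistent with the per-system totals in the aggregated results), and the names `ML_SECONDS` and `mllib_speedup` are made up for this example. Spark 2.0.1 is left out because only its Mahout run was measured.

```python
# ML query times in seconds, grouped by SQL engine (values from the chart).
ML_SECONDS = {
    "Hive_MR": {"Mahout": 3550, "Spark_ML": 1613},
    "Hive_Tez": {"Mahout": 3769, "Spark_ML": 1898},
    "Spark_1.6.2": {"Mahout": 4045, "Spark_ML": 1937},
}


def mllib_speedup(timings):
    """Return Mahout time divided by Spark MLlib time for each SQL engine."""
    return {
        engine: round(times["Mahout"] / times["Spark_ML"], 2)
        for engine, times in timings.items()
    }


if __name__ == "__main__":
    for engine, speedup in mllib_speedup(ML_SECONDS).items():
        print(f"{engine}: Spark MLlib is {speedup}x faster than Mahout")
```

Under these assumptions, every engine shows a speedup of roughly 2x when Mahout is swapped for Spark MLlib.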
Aggregated Results
Time in seconds (average of three executions using 100 GB Scale Factor):

System                    Pure-QL   Custom Reducers   NLP    ML     Total
Hive_MR + Mahout          6223      2815              5004   3550   17592
Hive_MR + Spark_ML        6223      2815              5004   1613   15655
Hive_Tez + Mahout         1601      1122              1100   3769   7592
Hive_Tez + Spark_ML       1623      1137              1082   1898   5740
Spark_1.6.2 + Mahout      1848      1629              2289   4045   9811
Spark_1.6.2 + Spark_ML    1951      1489              2356   1937   7733
Spark_2.0.1 + Mahout      1457      1466              1913   3390   8226
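The aggregated figures can be cross-checked by summing the four per-category times for each system and comparing the result with the reported total. The Python sketch below does that and ranks the systems by total time. The per-system attribution of the category values assumes the chart's data labels read in system order; the fact that all seven sums match the reported totals supports that reading. The names `RESULTS`, `REPORTED_TOTALS`, `computed_totals` and `ranking` are illustrative only.

```python
# Seconds per category: (Pure-QL, Custom Reducers, NLP, ML), from the chart.
RESULTS = {
    "Hive_MR + Mahout": (6223, 2815, 5004, 3550),
    "Hive_MR + Spark_ML": (6223, 2815, 5004, 1613),
    "Hive_Tez + Mahout": (1601, 1122, 1100, 3769),
    "Hive_Tez + Spark_ML": (1623, 1137, 1082, 1898),
    "Spark_1.6.2 + Mahout": (1848, 1629, 2289, 4045),
    "Spark_1.6.2 + Spark_ML": (1951, 1489, 2356, 1937),
    "Spark_2.0.1 + Mahout": (1457, 1466, 1913, 3390),
}

# Totals as printed on the slide.
REPORTED_TOTALS = {
    "Hive_MR + Mahout": 17592,
    "Hive_MR + Spark_ML": 15655,
    "Hive_Tez + Mahout": 7592,
    "Hive_Tez + Spark_ML": 5740,
    "Spark_1.6.2 + Mahout": 9811,
    "Spark_1.6.2 + Spark_ML": 7733,
    "Spark_2.0.1 + Mahout": 8226,
}


def computed_totals(results):
    """Sum the per-category times for each system."""
    return {system: sum(times) for system, times in results.items()}


def ranking(results):
    """Return system names ordered from fastest to slowest total time."""
    totals = computed_totals(results)
    return sorted(totals, key=totals.get)


if __name__ == "__main__":
    totals = computed_totals(RESULTS)
    for system in ranking(RESULTS):
        status = "matches" if totals[system] == REPORTED_TOTALS[system] else "differs from"
        print(f"{system}: {totals[system]} s ({status} reported total)")
```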
Conclusions
• Hive on Tez greatly improves SQL performance over Hive on MapReduce.
• It is also faster than Hive on Spark 1.
• Hive on Spark 2 is slightly faster than Tez on the pure QL queries (1457 s vs 1601 s).
• The Spark implementation is based on Hive…
• Spark MLlib performs far better than Mahout (roughly 2x faster on the ML queries).
• Best production combination: Apache Tez for SQL + Spark MLlib for Machine Learning.
Thanks, questions?
Follow up / feedback : Alejandro.montero@bsc.es
