Using BigBench to compare Hive and Spark (short version)

Alejandro Montero, Nicolas Poggi

What is BigBench (TPCx-BB [1])?
• Specification-based benchmark with an open-source implementation, proposed as the first Big Data benchmark standard.
• BigBench covers all major Big Data characteristics:
  • Volume -> scale factor.
  • Velocity -> table refresh.
  • Variety -> data type disparity.
• Extension of TPC-DS:
  • Borrows 10 pure QL queries from TPC-DS.
  • Adds Big Data tables and use cases: Machine Learning, Natural Language Processing, ...
• Can support multiple implementations, multiple Big Data engines and table formats.
• Can execute multiple parallel streams.
• Defines scale factors for data.
  • Tested: 100 GB.
[1]: http://www.tpc.org/tpc_documents_current_versions/pdf/tpcx-bb_v1.2.0.pdf

BigBench – Overview
Data model:
• Structured data: Marketprice, Items, Sales, Web Page, Customers.
• Semi-structured data: Web Log.
• Unstructured data: Reviews.
Workload:
• 14 pure QL queries.
  • 10 borrowed from TPC-DS.
• 4 queries with MapReduce pre-processing.
• 7 Natural Language Processing queries.
• 5 Machine Learning queries.

BigBench v1.2 – Reference Implementation
Software stack:
• Application: Mahout ML Custom Spark MLlib
• SQL engine: Hive, Spark SQL
• Table metastore: Hive Metastore
• Execution engine: MapReduce, Tez, Spark
• Resource management: YARN
• Filesystem: HDFS
Benchmarked systems:
• Hive + MapReduce + Mahout
• Hive + MapReduce + Spark_MLlib
• Hive + Tez + Mahout
• Hive + Tez + Spark_MLlib
• Spark SQL + Mahout
• Spark SQL + Spark_MLlib
• Spark 2 SQL + Mahout
Work in progress:
• Hive 2
• Spark 2 SQL + Spark_MLlib

The cluster – HDInsight PaaS
• Model: HDInsight D4v3
• Head nodes: 2
• Worker nodes: 4
• ZooKeeper nodes: 3
• CPU: Intel(R) Xeon(R) CPU E5-2673 v3, 8 x 2.4 GHz cores
• RAM: 28 GB
• HDFS: Remote
• Software: Hortonworks Data Platform 2.5
• Spark config: 1 executor per worker node, 3 cores per executor
Pure QL
Time in seconds (average of three executions using the 100 GB scale factor):
• Hive_MR: 6223
• Hive_tez: 1601
• Spark_1.6.2: 1848
• Spark_2.0.1: 1457
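
To make the gaps between engines easier to read, the pure QL times from this slide can be turned into speedup factors relative to Hive on MapReduce. This is only an illustrative sketch; the `speedups` helper is a hypothetical name, not part of BigBench.

```python
# Pure QL execution times in seconds, as reported on the slide.
pure_ql_seconds = {
    "Hive_MR": 6223,
    "Hive_tez": 1601,
    "Spark_1.6.2": 1848,
    "Spark_2.0.1": 1457,
}


def speedups(times, baseline):
    """Return how many times faster each system is than the baseline system."""
    base = times[baseline]
    return {name: round(base / seconds, 2) for name, seconds in times.items()}


pure_ql_speedups = speedups(pure_ql_seconds, "Hive_MR")
for name, factor in pure_ql_speedups.items():
    print(f"{name}: {factor}x")
```

The same helper works for the Custom Reducers and Natural Language Processing slides by swapping in their values.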

Query 12 CPU behavior
[Figure: CPU usage charts for Query 12. Panels: Tez, Spark 1.6.2, Spark 2.0.2]

Custom Reducers
Time in seconds:
• Hive_MR: 2815
• Hive_tez: 1122
• Spark_1.6.2: 1629
• Spark_2.0.1: 1466

Natural Language Processing
Time in seconds:
• Hive_MR: 5004
• Hive_Tez: 1100
• Spark_1.6.2: 2289
• Spark_2.0.1: 1913

Machine Learning
Time in seconds:
• Hive_MR + Mahout: 3550
• Hive_MR + Spark_ML: 1613
• Hive_Tez + Mahout: 3769
• Hive_Tez + Spark_ML: 1898
• Spark_1.6.2 + Mahout: 4045
• Spark_1.6.2 + Spark_ML: 1937
• Spark_2.0.1 + Mahout: 3390
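
As a rough sketch, the Machine Learning times on this slide can be paired by SQL engine to see how much faster Spark MLlib runs than Mahout. The `mllib_speedup` helper is a hypothetical name; Spark 2.0.1 has no MLlib figure on the slide, so it is skipped rather than estimated.

```python
# Machine Learning query times in seconds, grouped by SQL engine.
ml_seconds = {
    "Hive_MR": {"Mahout": 3550, "Spark_MLlib": 1613},
    "Hive_Tez": {"Mahout": 3769, "Spark_MLlib": 1898},
    "Spark_1.6.2": {"Mahout": 4045, "Spark_MLlib": 1937},
    "Spark_2.0.1": {"Mahout": 3390},  # no Spark MLlib result reported
}


def mllib_speedup(times):
    """Ratio of Mahout time to Spark MLlib time for engines that report both."""
    result = {}
    for engine, by_library in times.items():
        if "Mahout" in by_library and "Spark_MLlib" in by_library:
            result[engine] = by_library["Mahout"] / by_library["Spark_MLlib"]
    return result


ml_ratios = mllib_speedup(ml_seconds)
for engine, ratio in ml_ratios.items():
    print(f"{engine}: Spark MLlib is {ratio:.2f}x faster than Mahout")
```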

Aggregated Results
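Time in seconds per query group and in total:
• Hive_MR + Mahout: Pure QL 6223, Custom Reducers 2815, NLP 5004, ML 3550, Total 17592
• Hive_MR + Spark_ML: Pure QL 6223, Custom Reducers 2815, NLP 5004, ML 1613, Total 15655
• Hive_Tez + Mahout: Pure QL 1601, Custom Reducers 1122, NLP 1100, ML 3769, Total 7592
• Hive_Tez + Spark_ML: Pure QL 1623, Custom Reducers 1137, NLP 1082, ML 1898, Total 5740
• Spark_1.6.2 + Mahout: Pure QL 1848, Custom Reducers 1629, NLP 2289, ML 4045, Total 9811
• Spark_1.6.2 + Spark_ML: Pure QL 1951, Custom Reducers 1489, NLP 2356, ML 1937, Total 7733
• Spark_2.0.1 + Mahout: Pure QL 1457, Custom Reducers 1466, NLP 1913, ML 3390, Total 8226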
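
A short sketch that cross-checks the aggregated figures: it sums the four query groups for each system, compares the sums with the totals reported on the slide, and picks the fastest combination. The variable names are illustrative only.

```python
# (Pure QL, Custom Reducers, NLP, ML) times in seconds from the aggregated slide.
results = {
    "Hive_MR + Mahout": (6223, 2815, 5004, 3550),
    "Hive_MR + Spark_ML": (6223, 2815, 5004, 1613),
    "Hive_Tez + Mahout": (1601, 1122, 1100, 3769),
    "Hive_Tez + Spark_ML": (1623, 1137, 1082, 1898),
    "Spark_1.6.2 + Mahout": (1848, 1629, 2289, 4045),
    "Spark_1.6.2 + Spark_ML": (1951, 1489, 2356, 1937),
    "Spark_2.0.1 + Mahout": (1457, 1466, 1913, 3390),
}

# Totals as printed on the slide.
reported_totals = {
    "Hive_MR + Mahout": 17592,
    "Hive_MR + Spark_ML": 15655,
    "Hive_Tez + Mahout": 7592,
    "Hive_Tez + Spark_ML": 5740,
    "Spark_1.6.2 + Mahout": 9811,
    "Spark_1.6.2 + Spark_ML": 7733,
    "Spark_2.0.1 + Mahout": 8226,
}

# Recompute each total from its four components.
totals = {name: sum(parts) for name, parts in results.items()}
mismatches = [name for name in totals if totals[name] != reported_totals[name]]
fastest = min(totals, key=totals.get)

print("mismatches:", mismatches)
print("fastest:", fastest, totals[fastest], "seconds")
```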

Conclusions
• Hive on Tez greatly improves SQL performance over Hive on MapReduce (1601 s vs. 6223 s on the pure QL queries).
• It is also faster than Spark SQL 1.6.2 (1848 s).
• Spark SQL 2.0.1 is slightly faster than Tez (1457 s).
• The Spark implementation is based on Hive…
• Spark MLlib performs far better than Mahout, running the Machine Learning queries about twice as fast.
• Best production combination: Apache Tez for SQL + Spark MLlib for Machine Learning.

Thanks, questions?
Follow up / feedback : Alejandro.montero@bsc.es