Overview of the Hive Stinger Initiative
Eric N. Hanson
Principal Software Development Engineer
Microsoft HDInsight Team
30 June 2014
What is Stinger?  Umbrella term for…
• Faster query in Hive
–ORC
–Vectorization
–Tez
• Better language features for analysis
–Window functions etc.
Why Stinger?
• Hive has good functionality
• But it started out sloooowww
• Need to speed it up
–keep it competitive
–make it fun to use
ORC
• A good columnstore format
• Run length encoding, value encoding, dictionary encoding
• Layers stream compression over the top
• Written by Owen O’Malley
• http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.0.2/ds_Hive/orcfile.html
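To make two of the encodings named above concrete, here is a minimal Java sketch of run-length and dictionary encoding applied to a single column. It only illustrates the ideas: the class and method names are invented for this example and do not reflect ORC's actual on-disk layout or API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: NOT the ORC file format or its API.
class ColumnEncodingSketch {

    // Run-length encoding: collapse each run of equal values into a {value, runLength} pair.
    static List<long[]> runLengthEncode(long[] values) {
        List<long[]> runs = new ArrayList<>();
        for (long v : values) {
            if (!runs.isEmpty() && runs.get(runs.size() - 1)[0] == v) {
                runs.get(runs.size() - 1)[1]++;      // extend the current run
            } else {
                runs.add(new long[] {v, 1});         // start a new run
            }
        }
        return runs;
    }

    // Dictionary encoding: store each distinct string once and replace values with integer ids.
    static int[] dictionaryEncode(String[] values, Map<String, Integer> dictionary) {
        int[] ids = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            Integer id = dictionary.get(values[i]);
            if (id == null) {
                id = dictionary.size();              // next unused id
                dictionary.put(values[i], id);
            }
            ids[i] = id;
        }
        return ids;
    }

    public static void main(String[] args) {
        for (long[] run : runLengthEncode(new long[] {7, 7, 7, 3, 3, 9})) {
            System.out.println(run[0] + " x" + run[1]);
        }
        Map<String, Integer> dict = new LinkedHashMap<>();
        int[] ids = dictionaryEncode(new String[] {"CA", "WA", "CA", "CA"}, dict);
        System.out.println(dict + " " + Arrays.toString(ids));
    }
}
```

Both encodings shrink columns with repeated values, which is why a columnar layout pairs well with them; a general-purpose stream compressor can then run over the encoded bytes.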
Using ORC
• create table Tbl (col int) stored as orc;
• orc.compress defaults to ZLIB
• See http://www.slideshare.net/oom65/orc-andvectorizationhadoopsummit
TPC-DS File Sizes
*Courtesy of Hortonworks
Vectorization
How the code works (simplified)
class LongColumnAddLongScalarExpression {
  int inputColumn;
  int outputColumn;
  long scalar;

  // Adds the scalar to every active row of the input column, one batch at a time.
  void evaluate(VectorizedRowBatch batch) {
    long[] inVector =
        ((LongColumnVector) batch.columns[inputColumn]).vector;
    long[] outVector =
        ((LongColumnVector) batch.columns[outputColumn]).vector;
    if (batch.selectedInUse) {
      // Only the rows listed in batch.selected are still active.
      for (int j = 0; j < batch.size; j++) {
        int i = batch.selected[j];
        outVector[i] = inVector[i] + scalar;
      }
    } else {
      // All rows in the batch are active.
      for (int i = 0; i < batch.size; i++) {
        outVector[i] = inVector[i] + scalar;
      }
    }
  }
}
No method calls
Low instruction count
Cache locality to 1024 values
No pipeline stalls
SIMD in Java 8
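The simplified class on this slide depends on Hive's batch classes, so it cannot be run on its own. The sketch below keeps the same loop shape but swaps in a small stand-in batch type so the selection-vector behavior can be tried directly. SketchBatch and AddScalarSketch are hypothetical names for this example, not Hive classes.

```java
import java.util.Arrays;

// Simplified stand-in for illustration; this is NOT Hive's VectorizedRowBatch.
class SketchBatch {
    static final int DEFAULT_SIZE = 1024;            // batch capacity mentioned on the slide
    final long[][] columns;                          // one long[] per column
    int size;                                        // number of active rows
    boolean selectedInUse;                           // true when a filter has dropped some rows
    final int[] selected = new int[DEFAULT_SIZE];    // indexes of the rows that remain active

    SketchBatch(int numColumns) {
        columns = new long[numColumns][DEFAULT_SIZE];
    }
}

class AddScalarSketch {
    // Same loop shape as the slide: out[i] = in[i] + scalar, honoring the selection vector.
    static void evaluate(SketchBatch batch, int inputColumn, int outputColumn, long scalar) {
        long[] in = batch.columns[inputColumn];
        long[] out = batch.columns[outputColumn];
        if (batch.selectedInUse) {
            for (int j = 0; j < batch.size; j++) {
                int i = batch.selected[j];
                out[i] = in[i] + scalar;
            }
        } else {
            for (int i = 0; i < batch.size; i++) {
                out[i] = in[i] + scalar;
            }
        }
    }

    public static void main(String[] args) {
        SketchBatch batch = new SketchBatch(2);
        batch.size = 4;
        for (int i = 0; i < 4; i++) {
            batch.columns[0][i] = i * 10;            // input column: 0, 10, 20, 30
        }
        evaluate(batch, 0, 1, 5);
        System.out.println(Arrays.toString(Arrays.copyOf(batch.columns[1], 4)));

        // Pretend an earlier filter kept only rows 1 and 3.
        batch.selectedInUse = true;
        batch.selected[0] = 1;
        batch.selected[1] = 3;
        batch.size = 2;
        evaluate(batch, 0, 1, 100);
        System.out.println(Arrays.toString(Arrays.copyOf(batch.columns[1], 4)));
    }
}
```

In the second call only rows 1 and 3 are rewritten; the other output slots keep their earlier values, which is how a filter can drop rows without copying the batch.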
Vectorization and Compilation
• Vectorization “instructions” generated from templates
• Examples:
–Int add col-col
–Int add col-scalar
–Int add scalar-col
–Double add col-col
–Double add col-scalar
–Double add scalar-col
–And hundreds more!
• Pre-compilation of expressions
• Reduces # of function calls and instructions at runtime
• Expressions like (a + 2) / b are interpreted with these primitives
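As a rough illustration of the last bullet, the sketch below evaluates (a + 2) / b by chaining two primitives through a scratch column: a col-scalar add followed by a col-col divide. The method names are hypothetical and chosen for this example; Hive's generated classes are organized differently.

```java
import java.util.Arrays;

// Hypothetical primitives for illustration; names do not correspond to Hive classes.
class ExpressionChainSketch {

    // col-scalar add: out[i] = col[i] + scalar
    static void addColScalar(double[] col, double scalar, double[] out, int n) {
        for (int i = 0; i < n; i++) {
            out[i] = col[i] + scalar;
        }
    }

    // col-col divide: out[i] = left[i] / right[i]
    static void divideColCol(double[] left, double[] right, double[] out, int n) {
        for (int i = 0; i < n; i++) {
            out[i] = left[i] / right[i];
        }
    }

    // Evaluates (a + 2) / b for n rows, using a scratch column for the intermediate a + 2.
    static double[] evaluate(double[] a, double[] b, int n) {
        double[] scratch = new double[n];
        double[] result = new double[n];
        addColScalar(a, 2.0, scratch, n);
        divideColCol(scratch, b, result, n);
        return result;
    }

    public static void main(String[] args) {
        double[] a = {2, 4, 6};
        double[] b = {2, 3, 4};
        System.out.println(Arrays.toString(evaluate(a, b, 3)));
    }
}
```

Each primitive is a tight loop over a whole column, so the per-row cost of walking the expression tree is paid once per batch instead of once per row.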
Example of vectorized template code
} else {
  if (batch.selectedInUse) {
    for(int j = 0; j != n; j++) {
      int i = sel[j];
      outputVector[i] = vector1[i] <OperatorSymbol> vector2[i];
    }
  } else {
    for(int i = 0; i != n; i++) {
      outputVector[i] = vector1[i] <OperatorSymbol> vector2[i];
    }
  }
}
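One way to picture how such a template becomes concrete code is plain text substitution. The sketch below is an assumption about the general approach, not Hive's actual generator: the template text is shortened, and the <ClassName> placeholder and the class names passed in are invented for this example (only <OperatorSymbol> appears on the slide).

```java
// Illustrative template expansion by string substitution; not Hive's code generator.
class TemplateExpansionSketch {

    // Shortened, hypothetical template. <OperatorSymbol> is from the slide; <ClassName> is assumed.
    static final String TEMPLATE =
          "class <ClassName> {\n"
        + "  static void evaluate(long[] vector1, long[] vector2, long[] outputVector, int n) {\n"
        + "    for (int i = 0; i != n; i++) {\n"
        + "      outputVector[i] = vector1[i] <OperatorSymbol> vector2[i];\n"
        + "    }\n"
        + "  }\n"
        + "}\n";

    // Produces Java source text for one operator by filling in the placeholders.
    static String expand(String className, String operatorSymbol) {
        return TEMPLATE
            .replace("<ClassName>", className)
            .replace("<OperatorSymbol>", operatorSymbol);
    }

    public static void main(String[] args) {
        System.out.println(expand("LongColAddLongColSketch", "+"));
        System.out.println(expand("LongColSubtractLongColSketch", "-"));
    }
}
```

Generating one class per operator and type combination keeps each inner loop free of branching on the operator, at the cost of many near-identical classes.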
Using vectorization in Hive
• set hive.vectorized.execution.enabled = true;
• Run query over ORC
• Only works for scalar types
• https://cwiki.apache.org/confluence/display/Hive/Vectorized+Query+Execution
• ~5X CPU reduction
Apache Tez (“Speed”)
• Replaces MapReduce as primitive for Pig, Hive, Cascading etc.
– Smaller latency for interactive queries
– Higher throughput for batch queries
– 22 contributors: Hortonworks (13), Facebook, Twitter, Yahoo, Microsoft
YARN ApplicationMaster to run DAG of Tez Tasks
Task with pluggable Input, Processor and Output
Tez Task - <Input, Processor, Output>
*Courtesy of Hortonworks
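The idea of a task assembled from a pluggable Input, Processor and Output can be sketched with three small interfaces. Everything below is hypothetical and written for this example; it is not the Apache Tez API, which has its own interfaces and lifecycle.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical interfaces for illustration; these are NOT the Apache Tez API.
interface SketchInput {
    List<String> read();
}

interface SketchProcessor {
    List<String> process(List<String> records);
}

interface SketchOutput {
    void write(List<String> records);
}

// A task is just the composition of one input, one processor and one output.
class SketchTask {
    private final SketchInput input;
    private final SketchProcessor processor;
    private final SketchOutput output;

    SketchTask(SketchInput input, SketchProcessor processor, SketchOutput output) {
        this.input = input;
        this.processor = processor;
        this.output = output;
    }

    void run() {
        output.write(processor.process(input.read()));
    }
}

class TaskSketchDemo {
    // Wires an in-memory input and output around an upper-casing processor.
    static List<String> runUpperCaseTask(List<String> source) {
        List<String> sink = new ArrayList<>();
        SketchTask task = new SketchTask(
            () -> source,
            records -> {
                List<String> out = new ArrayList<>();
                for (String r : records) {
                    out.add(r.toUpperCase());
                }
                return out;
            },
            sink::addAll);
        task.run();
        return sink;
    }

    public static void main(String[] args) {
        System.out.println(runUpperCaseTask(Arrays.asList("map", "reduce")));
    }
}
```

Because the processor never sees where records come from or go to, the same processing logic can be paired with a file-backed input in one task and a shuffle-backed input in another.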
Tez: Building blocks for scalable data processing
• Classical ‘Map’: HDFS Input, Map Processor, Sorted Output
• Classical ‘Reduce’: Shuffle Input, Reduce Processor, HDFS Output
• Intermediate ‘Reduce’ for Map-Reduce-Reduce: Shuffle Input, Reduce Processor, Sorted Output
*Courtesy of Hortonworks
Hive-on-MR vs. Hive-on-Tez
SELECT a.x, AVG(b.y) AS avg
FROM a JOIN b ON (a.id = b.id) GROUP BY a.x
UNION SELECT x, AVG(y) AS avg
FROM c GROUP BY x
ORDER BY avg;
[Figure: two execution plans for a query that joins a with c and a with b and groups by a.state. Hive – MR: a chain of separate map (M) and reduce (R) jobs with HDFS writes between them. Hive – Tez: a single graph of M and R tasks.]
Tez avoids unneeded writes to HDFS
*Courtesy of Hortonworks
Tez Sessions
… because Map/Reduce query startup is expensive
• Tez Sessions
–Hot containers ready for immediate use
–Removes task and job launch overhead (~5s – 30s)
• Hive
–Session launch/shutdown in background (seamless, user not aware)
–Submits query plan directly to Tez Session
Native Hadoop service, not ad-hoc
*Courtesy of Hortonworks
Stinger Phase 3: Interactive Query In Hadoop
• TPC-DS Query 27: 1400s (Hive 10), 39s (Hive 0.11, Phase 1), 7.2s (Trunk, Phase 3): 190x improvement
• TPC-DS Query 82: 3200s (Hive 10), 65s (Hive 0.11, Phase 1), 14.9s (Trunk, Phase 3): 200x improvement
Query 27: Pricing Analytics using Star Schema Join
Query 82: Inventory Analytics Joining 2 Large Fact Tables
All Results at Scale Factor 200 (Approximately 200GB Data)
*Courtesy of Hortonworks
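The improvement factors on this slide can be checked with a line of arithmetic. The sketch below assumes that, for each query, the largest time is the baseline and the smallest is the Phase 3 result; that pairing is read off the chart labels and is not stated explicitly in the text.

```java
// Quick arithmetic check of the quoted speedups; the baseline/result pairing is an assumption.
class SpeedupSketch {

    // Speedup factor = time before / time after.
    static double speedup(double beforeSeconds, double afterSeconds) {
        return beforeSeconds / afterSeconds;
    }

    public static void main(String[] args) {
        System.out.printf("Query 27: %.1fx%n", speedup(1400, 7.2));
        System.out.printf("Query 82: %.1fx%n", speedup(3200, 14.9));
    }
}
```

Under that assumption the ratios come out near 194 and 215, so the 190x and 200x figures on the slide read as rounded-down values.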
How you can use Stinger enhancements
• Use Hive 0.13
• Use ORC: create table … stored as ORC
• Enable vectorization: set hive.vectorized.execution.enabled=true
• Enable Tez: set hive.execution.engine=tez
• See http://hortonworks.com/hadoop-tutorial/supercharging-interactive-queries-hive-tez/
Reference(s)
• Stinger overview, Strata, fall 2013:
http://www.slideshare.net/alanfgates/strata-stingertalk-oct2013?qid=09d16028-bd7e-47d8-8438-34f3242c6f0e&v=qf1&b=&from_search=1
Slides marked “Courtesy of Hortonworks” are from Hortonworks talks

