Copyright 2013 by Hortonworks and Microsoft
ORC File & Vectorization
Improving Hive Data Storage and Query Performance
June 2013
Page 1
Owen O’Malley
owen@hortonworks.com
@owen_omalley
Jitendra Pandey
jitendra@hortonworks.com
Eric Hanson
ehans@microsoft.com
ORC – Optimized RC File
Page 2
History
Page 3
Remaining Challenges
Page 4
Requirements
Page 5
File Structure
Page 6
Stripe Structure
Page 7
File Layout
Page 8
[Figure: file layout. The file is a series of 256 MB stripes, each containing Index Data, Row Data, and a Stripe Footer, plus a File Footer and a Postscript. Index Data and Row Data are divided by column (Column 1 through Column 8), and a column is divided into streams (Stream 2.1 through Stream 2.4).]
Compression
Page 9
Integer Column Serialization
Page 10
String Column Serialization
Page 11
Hive Compound Types
Page 12
[Figure: type tree whose nodes carry column ids: 0 Struct, 1 Int, 2 Map, 3 String, 4 Struct, 5 String, 6 Double, 7 Time]
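The ids in the type tree are consistent with a pre-order numbering, where a parent gets its id before its children. The sketch below is one way to reproduce that numbering. The `TypeNode` class is a hypothetical stand-in, not an ORC class, and the nesting used in `exampleSchema` (a struct of an int, a map from string to a struct of string and double, and a time) is an assumption reconstructed from the ids, since the slide text only preserves the labels.

```java
import java.util.ArrayList;
import java.util.List;

public class TypeTreeIds {
    /** Hypothetical stand-in for a type description; not ORC's own class. */
    static final class TypeNode {
        final String kind;
        final List<TypeNode> children = new ArrayList<>();
        int id = -1;

        TypeNode(String kind, TypeNode... kids) {
            this.kind = kind;
            for (TypeNode k : kids) {
                children.add(k);
            }
        }
    }

    /** Assigns ids in pre-order (parent, then children left to right); returns the next free id. */
    static int assignIds(TypeNode node, int nextId) {
        node.id = nextId++;
        for (TypeNode child : node.children) {
            nextId = assignIds(child, nextId);
        }
        return nextId;
    }

    /** Collects "id kind" labels by walking the tree in the same pre-order. */
    static List<String> labels(TypeNode node, List<String> out) {
        out.add(node.id + " " + node.kind);
        for (TypeNode child : node.children) {
            labels(child, out);
        }
        return out;
    }

    /** Assumed nesting for the slide's example; only the labels come from the slide. */
    static TypeNode exampleSchema() {
        return new TypeNode("Struct",
            new TypeNode("Int"),
            new TypeNode("Map",
                new TypeNode("String"),
                new TypeNode("Struct",
                    new TypeNode("String"),
                    new TypeNode("Double"))),
            new TypeNode("Time"));
    }

    public static void main(String[] args) {
        TypeNode root = exampleSchema();
        assignIds(root, 0);
        for (String label : labels(root, new ArrayList<>())) {
            System.out.println(label);
        }
    }
}
```

Under the assumed nesting, the printed labels match the slide: 0 Struct, 1 Int, 2 Map, 3 String, 4 Struct, 5 String, 6 Double, 7 Time.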
Compound Type Serialization
Page 13
Generic Compression
Page 14
Column Projection
Page 15
How Do You Use ORC
Page 16
Managing Memory
Page 17
TPC-DS File Sizes
Page 18
ORC Predicate Pushdown
Page 19
Additional Details
Page 20
Current work for Hive 0.12
Page 21
Future Work
Page 22
Comparison
Page 23
Feature                      RC File  Trevni  Parquet    ORC
Hive Integration             Y        N       N          Y
Active Development           N        N       Y          Y
Hive Type Model              N        N       N          Y
Shred complex columns        N        Y       Y          Y
Splits found quickly         N        Y       Y          Y
Files per bucket             1        many    1 or many  1
Versioned metadata           N        Y       Y          Y
Run length data encoding     N        N       Y          Y
Store strings in dictionary  N        N       Y          Y
Store min, max, sum, count   N        N       N          Y
Store internal indexes       N        N       N          Y
No overhead for non-null     N        N       N          Y (≥ 0.12)
Predicate Pushdown           N        N       N          Y (≥ 0.12)
Vectorization
Page 24
Vectorization
Page 25
Why row-at-a-time execution is slow
Page 26
• Hive uses Object Inspectors to work on a row
• Enables a level of abstraction
• Costs major performance
• Exacerbated by using lazy serdes
• Inner loop has many method calls, new() allocations, and if-then-else branches
• Lots of CPU instructions
• Pipeline stalls
• Poor instructions/cycle
• Poor cache locality
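To make the costs listed above concrete, here is a minimal sketch of row-at-a-time evaluation. `FieldReader` and `RowExpression` are hypothetical stand-ins for an object-inspector style of access, not Hive's actual interfaces. The point is only that every row pays for an interface call, a null check, a cast, and a boxed result.

```java
import java.util.Arrays;

public class RowAtATimeSketch {
    /** Hypothetical generic accessor, standing in for an object-inspector style API. */
    interface FieldReader {
        Object read(Object row, int field);
    }

    /** Hypothetical expression that is evaluated once per row. */
    interface RowExpression {
        Object evaluate(Object row);
    }

    /** Adds a constant to one field, going through the generic accessor and boxed values. */
    static final class AddScalar implements RowExpression {
        private final FieldReader reader;
        private final int field;
        private final long scalar;

        AddScalar(FieldReader reader, int field, long scalar) {
            this.reader = reader;
            this.field = field;
            this.scalar = scalar;
        }

        @Override
        public Object evaluate(Object row) {
            Object value = reader.read(row, field);        // interface call per row
            if (value == null) {                           // branch per row
                return null;
            }
            return Long.valueOf(((Long) value) + scalar);  // cast, unbox, box again
        }
    }

    /** The inner loop: one expression call per row. */
    static Object[] evaluateAll(RowExpression expr, Object[][] rows) {
        Object[] out = new Object[rows.length];
        for (int r = 0; r < rows.length; r++) {
            out[r] = expr.evaluate(rows[r]);
        }
        return out;
    }

    static Object[] demo() {
        FieldReader arrayReader = (row, field) -> ((Object[]) row)[field];
        Object[][] rows = { {1L, "a"}, {null, "b"}, {40L, "c"} };
        return evaluateAll(new AddScalar(arrayReader, 0, 2L), rows);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(demo()));
    }
}
```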
How the code works (simplified)
Page 27
class LongColumnAddLongScalarExpression {
  int inputColumn;
  int outputColumn;
  long scalar;

  void evaluate(VectorizedRowBatch batch) {
    // Work directly on the primitive arrays backing the input and output columns.
    long[] inVector =
        ((LongColumnVector) batch.columns[inputColumn]).vector;
    long[] outVector =
        ((LongColumnVector) batch.columns[outputColumn]).vector;
    if (batch.selectedInUse) {
      // Only the row positions listed in selected[] are live.
      for (int j = 0; j < batch.size; j++) {
        int i = batch.selected[j];
        outVector[i] = inVector[i] + scalar;
      }
    } else {
      // The first batch.size rows are all live.
      for (int i = 0; i < batch.size; i++) {
        outVector[i] = inVector[i] + scalar;
      }
    }
  }
}
No method calls
Low instruction count
Cache locality to 1024 values
No pipeline stalls
SIMD in Java 8
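The slide code relies on `VectorizedRowBatch` and `LongColumnVector`, which the deck does not show. The sketch below supplies minimal stand-ins so the expression can be run on its own. The field names follow the slide (`columns`, `size`, `selected`, `selectedInUse`, `vector`); the constructors, the `DEFAULT_SIZE` constant, and the `demo` helper are assumptions for illustration and should not be read as Hive's actual classes.

```java
import java.util.Arrays;

public class VectorizedAddSketch {
    /** Stand-in base type for a column of values. */
    static class ColumnVector {}

    /** Stand-in column of longs backed by a primitive array. */
    static final class LongColumnVector extends ColumnVector {
        final long[] vector;

        LongColumnVector(int capacity) {
            vector = new long[capacity];
        }
    }

    /** Stand-in batch: a set of columns plus an optional list of selected row positions. */
    static final class VectorizedRowBatch {
        static final int DEFAULT_SIZE = 1024; // assumed, matching "1024 values" on the slide
        final ColumnVector[] columns;
        final int[] selected = new int[DEFAULT_SIZE];
        boolean selectedInUse;
        int size;

        VectorizedRowBatch(int numColumns) {
            columns = new ColumnVector[numColumns];
            for (int c = 0; c < numColumns; c++) {
                columns[c] = new LongColumnVector(DEFAULT_SIZE);
            }
        }
    }

    /** Same logic as the slide: out = in + scalar over the live rows of the batch. */
    static final class LongColumnAddLongScalarExpression {
        final int inputColumn;
        final int outputColumn;
        final long scalar;

        LongColumnAddLongScalarExpression(int inputColumn, int outputColumn, long scalar) {
            this.inputColumn = inputColumn;
            this.outputColumn = outputColumn;
            this.scalar = scalar;
        }

        void evaluate(VectorizedRowBatch batch) {
            long[] in = ((LongColumnVector) batch.columns[inputColumn]).vector;
            long[] out = ((LongColumnVector) batch.columns[outputColumn]).vector;
            if (batch.selectedInUse) {
                for (int j = 0; j < batch.size; j++) {
                    int i = batch.selected[j];
                    out[i] = in[i] + scalar;
                }
            } else {
                for (int i = 0; i < batch.size; i++) {
                    out[i] = in[i] + scalar;
                }
            }
        }
    }

    /** Adds 5 to four input values; with filtered=true only row positions 1 and 3 are live. */
    static long[] demo(boolean filtered) {
        VectorizedRowBatch batch = new VectorizedRowBatch(2);
        long[] in = ((LongColumnVector) batch.columns[0]).vector;
        in[0] = 10; in[1] = 20; in[2] = 30; in[3] = 40;
        batch.size = 4;
        if (filtered) {
            batch.selectedInUse = true;
            batch.selected[0] = 1;
            batch.selected[1] = 3;
            batch.size = 2;
        }
        new LongColumnAddLongScalarExpression(0, 1, 5).evaluate(batch);
        return Arrays.copyOf(((LongColumnVector) batch.columns[1]).vector, 4);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(demo(false)));
        System.out.println(Arrays.toString(demo(true)));
    }
}
```

In the filtered case the output column is written only at the selected positions, so rows that were filtered out are left untouched rather than compacted.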
Vectorization project
Page 28
Preliminary performance results
• NOT a benchmark
• 218 million row fact table of real data, 25 columns
• 18GB raw data
• 6 core, 12 thread workstation, 1 disk, 16GB RAM
• select a, b, count(*) from t
where c >= const group by a, b -- 53 row result
Page 29
Warm start times   RC non-vectorized          ORC non-vectorized     ORC vectorized
                   (default, not compressed)  (default, compressed)  (default, compressed)
Runtime (sec)      261                        58                     43
Total CPU (sec)    381                        159                    42
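One way to read these figures is as ratios between configurations. The sketch below only does that arithmetic on the numbers from the table; the class and method names are made up for illustration.

```java
public class SpeedupArithmetic {
    /** How many times larger the baseline figure is than the improved one. */
    static double ratio(double baseline, double improved) {
        return baseline / improved;
    }

    public static void main(String[] args) {
        // Figures (seconds) are taken from the table above.
        System.out.printf("Runtime, RC vs ORC vectorized:              %.2fx%n", ratio(261, 43));
        System.out.printf("Runtime, ORC non-vectorized vs vectorized:  %.2fx%n", ratio(58, 43));
        System.out.printf("CPU, RC vs ORC vectorized:                  %.2fx%n", ratio(381, 42));
        System.out.printf("CPU, ORC non-vectorized vs vectorized:      %.2fx%n", ratio(159, 42));
    }
}
```

On these numbers, vectorized ORC takes about a sixth of the runtime and about a ninth of the CPU time of non-vectorized RC. As the slide stresses, this is a single query on one workstation and not a benchmark.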
Thanks to contributors!
Page 30
• Microsoft Big Data:
• Eric Hanson, Remus Rusanu, Sarvesh Sakalanaga, Tony Murphy, Ashit Gosalia
• Hortonworks:
• Jitendra Pandey, Owen O’Malley, Gopal V
• Others:
• Teddy Choi, Tim Chen
Jitendra/Eric are joint leads

ORC File and Vectorization - Hadoop Summit 2013
