1 © Hortonworks Inc. 2011–2018. All rights reserved
ORC Improvement & Roadmap
in Apache Spark 2.3 and 2.4
Dongjoon Hyun
Principal Software Engineer @ Hortonworks Data Science Team
June 2018
Dongjoon Hyun
• Hortonworks
− Principal Software Engineer @ Data Science Team
• Apache Project
− Apache REEF Project Management Committee(PMC) Member & Committer
− Apache Spark Project Contributor
• GitHub
− https://github.com/dongjoon-hyun
HDP 2.6.5 (May 2018)
• Apache Spark
− 2.3.0 (2018 FEB)
• Apache ORC
− 1.4.3 (2018 FEB)
• Apache KAFKA
− 1.0.0 (2017 NOV)
Apache Spark 2.3.x
Spark 2.3.0 has 1409 JIRA issues and Spark 2.3.1 has 134.
Major Features
• Vectorized ORC Reader
• Structured Streaming with ORC
• Schema evolution with ORC
• PySpark Performance Enhancements with Apache Arrow and ORC
• Structured stream-stream joins
• Spark History Server V2
Experimental Features
• Spark on Kubernetes
• Data source API V2
• Streaming API V2
• Continuous Structured Streaming Processing
Spark’s built-in file-based data sources
• TEXT The simplest one with one string column schema
• CSV Popular for data science workloads
• JSON The most flexible one for schema changes
• PARQUET The only one with vectorized reader
• ORC Storage-efficient and popular for shared Hive tables
Motivation
Fast
Flexible
Hive Table Access
The story of Spark, ORC, and Hive
• Before Apache ORC
− Hive 1.2.1 (2015 JUN) → SPARK-2883 Hive 1.2.1 Spark 1.4
• After Apache ORC
− v1.0.0 (2016 JAN)
− v1.3.3 (2017 FEB) → HIVE-15841 Hive 2.3.0 ~ 2.3.3
− v1.4.1 (2017 OCT) → SPARK-22300 Spark 2.3.0 (FEB)
− v1.4.3 (2018 FEB) → SPARK-23340, HIVE-18674 Hive 3.0 (MAY)
− v1.4.4 (2018 MAY) → SPARK-24322 Spark 2.3.1 (JUN)
− v1.5.1 (2018 MAY) → SPARK-24576, HIVE-19669 Hive 3.1 Spark 2.4
Previous ORC Issues in Spark
Six Issue Categories
• ORC Writer Versions
• Performance
• Structured streaming
• Column names
• Hive tables and schema evolution
• Robustness
Category 1 – ORC Writer Versions
• ORIGINAL
• HIVE_8732 (2014) ORC string statistics are not merged correctly
• HIVE_4243 (2015) Use real column names from Hive tables
• HIVE_12055 (2015) Vectorized Writer
• HIVE_13083 (2016) Decimals write present stream correctly
• ORC_101 (2016) Correct the use of the default charset in bloomfilter
• ORC_135 (2018) PPD for timestamp is wrong when reader/writer timezones are different
Category 2 – Performance
• Vectorized ORC Reader (SPARK-16060)
• Fast reading partition-columns (SPARK-22712)
• Pushing down filters for DateType (SPARK-21787)
Category 3 – Structured streaming
• Write (SPARK-15474): `FileNotFoundException` at writing empty partitions as ORC
• Read (SPARK-22781): Create structured stream with ORC files
spark.readStream.orc(path)
Category 4 – Column names
• Unicode column names (SPARK-23072)
• Column names with dot (SPARK-21791)
• Should not create invalid column names (SPARK-21912)
Category 5 – Hive tables and schema evolution
• Support `ALTER TABLE ADD COLUMNS` (SPARK-21929)
− Introduced at Spark 2.2, but throws AnalysisException for ORC
• Support column positional mismatch (SPARK-22267)
− Returns wrong results if the column order in the ORC file schema differs from the Hive Metastore schema
• Support table properties during `convertMetastoreOrc/Parquet` (SPARK-23355, Spark 2.4)
− For ORC/Parquet Hive tables, `convertMetastore` ignores table properties
Category 6 – Robustness
• ORC metadata exceeds ProtoBuf message size limit (SPARK-19109)
• NullPointerException on zero-size ORC file (SPARK-19809)
• Support `ignoreCorruptFiles` (SPARK-23049)
• Support `ignoreMissingFiles` (SPARK-23305)
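The last two items correspond to session-level settings. A minimal PySpark sketch, assuming the full configuration keys are `spark.sql.files.ignoreCorruptFiles` and `spark.sql.files.ignoreMissingFiles` (the slide gives only the short names); the helper names `robust_read_conf` and `read_orc_tolerantly` are hypothetical and exist only for illustration:

```python
def robust_read_conf(ignore_corrupt=True, ignore_missing=True):
    """Return Spark SQL settings that skip corrupt or missing files.

    The two keys are assumptions based on the short option names on the slide.
    """
    return {
        "spark.sql.files.ignoreCorruptFiles": str(ignore_corrupt).lower(),
        "spark.sql.files.ignoreMissingFiles": str(ignore_missing).lower(),
    }


def read_orc_tolerantly(spark, path):
    """Apply the settings to an existing SparkSession, then read ORC files."""
    for key, value in robust_read_conf().items():
        spark.conf.set(key, value)
    return spark.read.orc(path)
```

Calling `read_orc_tolerantly(spark, "/tmp/orc_dir")` with a live session would return a DataFrame built from whichever files remain readable; the path here is only a placeholder.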
Current Approach
Supports two ORC file formats
• Adding a new OrcFileFormat (SPARK-20682)
FileFormat
  − TextBasedFileFormat
    − JsonFileFormat
    − CSVFileFormat
    − TextFileFormat
    − LibSVMFileFormat (o.a.s.ml.source.libsvm)
  − ParquetFileFormat
  − OrcFileFormat (o.a.s.sql.execution.datasources): `native` OrcFileFormat with ORC 1.4+
  − OrcFileFormat (o.a.s.sql.hive.orc): `hive` OrcFileFormat from Hive 1.2.1
  − HiveFileFormat
In Reality – Four cases for ORC Reader/Writer
• `native` Writer + `native` Reader: New Data, New Apps, best performance (Vectorized Reader)
• `native` Writer + `hive` Reader: New Data, Old Apps, improved performance (Non-vectorized Reader)
• `hive` Writer + `native` Reader: Old Data, New Apps, improved performance (Vectorized Reader)
• `hive` Writer + `hive` Reader: Old Data, Old Apps, as-is performance (Non-vectorized Reader)
Performance – Single column scan from wide tables
[Figure: time (ms, axis 0 to 1200) versus number of columns (100, 200, 300) for 1M rows with all BIGINT columns. Series: native writer / native reader, hive writer / native reader, native writer / hive reader, hive writer / hive reader. Annotated speedup: 4x.]
Source: https://github.com/apache/spark/blob/branch-2.3/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala
Switch ORC implementation (SPARK-20728)
• spark.sql.orc.impl=native (default: `hive`)
Read/Write Dataset
spark.read.orc(path)
df.write.orc(path)
Read/Write Dataset
spark.read.format("orc").load(path)
df.write.format("orc").save(path)
Create ORC Table
CREATE TABLE people (name string, age int)
USING ORC OPTIONS (orc.compress 'ZLIB')
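The switch above can be exercised with a small write-then-read round trip. This is a sketch under stated assumptions rather than a reference implementation: `orc_impl_conf` and `orc_round_trip` are hypothetical helper names, `spark` is an already created SparkSession, and `path` is a scratch directory of your choosing.

```python
VALID_ORC_IMPLS = ("native", "hive")


def orc_impl_conf(impl="native"):
    """Return the setting that selects the ORC implementation."""
    if impl not in VALID_ORC_IMPLS:
        raise ValueError("spark.sql.orc.impl must be one of %s" % (VALID_ORC_IMPLS,))
    return {"spark.sql.orc.impl": impl}


def orc_round_trip(spark, path, impl="native"):
    """Write ten rows as ORC, read them back, and return the row count."""
    for key, value in orc_impl_conf(impl).items():
        spark.conf.set(key, value)
    # spark.range(10) yields a single BIGINT column named "id".
    spark.range(10).write.mode("overwrite").orc(path)
    return spark.read.format("orc").load(path).count()
```

Rejecting unknown values early keeps a typo in the setting from surfacing later as a harder-to-read error when the first ORC read or write runs.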
Switch ORC implementation (SPARK-20728) – Cont.
• spark.sql.orc.impl=native (default: `hive`)
Read/Write Structured Stream
spark.readStream.orc(path)
spark.readStream.format("orc").load(path)
df.writeStream
  .option("checkpointLocation", path1)
  .format("orc")
  .option("path", path2)
  .start
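The read and write calls above can be combined into one ORC-to-ORC streaming copy. A minimal PySpark sketch: the function names and the `schema_ddl` parameter are hypothetical, and the explicit schema is included because file-based streaming sources normally need one unless `spark.sql.streaming.schemaInference` is enabled.

```python
def sink_options(out_path, checkpoint_path):
    """Options for a file sink: output directory plus checkpoint directory."""
    return {"path": out_path, "checkpointLocation": checkpoint_path}


def copy_orc_stream(spark, in_path, out_path, checkpoint_path, schema_ddl):
    """Start a streaming query that copies ORC files from in_path to out_path.

    schema_ddl is a DDL string such as "name string, age int".
    Returns the StreamingQuery handle; the caller decides when to stop it.
    """
    source = spark.readStream.schema(schema_ddl).orc(in_path)
    return (
        source.writeStream
        .format("orc")
        .options(**sink_options(out_path, checkpoint_path))
        .start()
    )
```

A caller would typically keep the returned handle and call `awaitTermination()` or `stop()` on it.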
Support vectorized read on Hive ORC Tables
• spark.sql.hive.convertMetastoreOrc=true (default: false)
− `spark.sql.orc.impl=native` is required, too.
CREATE TABLE people (name string, age int)
STORED AS ORC
CREATE TABLE people (name string, age int)
USING HIVE OPTIONS (fileFormat 'ORC', orc.compress 'gzip')
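Because both settings are needed together, it can help to apply them as one unit when the session is built. A sketch in PySpark, assuming a Hive-enabled Spark build; `hive_orc_vectorized_conf` and `build_session` are hypothetical names, and the application name is arbitrary.

```python
def hive_orc_vectorized_conf():
    """Both settings the slide lists for vectorized reads on Hive ORC tables."""
    return {
        "spark.sql.orc.impl": "native",
        "spark.sql.hive.convertMetastoreOrc": "true",
    }


def build_session(app_name="orc-hive-demo"):
    """Create (or reuse) a Hive-enabled SparkSession with both settings applied."""
    # Imported lazily so the pure helper above works without PySpark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name).enableHiveSupport()
    for key, value in hive_orc_vectorized_conf().items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

With such a session, a query like `spark.sql("SELECT name FROM people")` against the `STORED AS ORC` table would be planned through the converted data source path rather than the Hive SerDe path.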
Schema evolution at reading file-based data sources
• Frequently, new files can have wider column types or new columns
− Before SPARK-21929, users had to drop and recreate the ORC table with an updated schema.
• A user-defined schema reduces schema inference cost and handles upcasting
− boolean -> byte -> short -> int -> long
− float -> double
Old Data
spark.read.schema("col1 int").orc(path)
New Data
spark.read.schema("col1 long, col2 long").orc(path)
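The upcasting chains above can be written down as a small check. This sketch models only the two chains listed on the slide; it is not Spark's actual type-coercion logic, and `UPCAST_CHAINS` and `is_safe_upcast` are hypothetical names.

```python
# The two widening chains from the slide, narrowest type first.
UPCAST_CHAINS = (
    ("boolean", "byte", "short", "int", "long"),
    ("float", "double"),
)


def is_safe_upcast(file_type, read_type):
    """True if a column stored as file_type may be read as read_type.

    Within a chain, reading as the same or a wider type is allowed.
    Types outside the chains, or in different chains, must match exactly.
    """
    for chain in UPCAST_CHAINS:
        if file_type in chain and read_type in chain:
            return chain.index(file_type) <= chain.index(read_type)
    return file_type == read_type
```

For the slide's example, old files with `col1 int` can be read with the new schema `col1 long` because `is_safe_upcast("int", "long")` holds, while the reverse direction does not.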
Schema evolution at reading file-based data sources – Cont.
File Format: TEXT | CSV | JSON | ORC `hive` | ORC `native` [1] | PARQUET
Add Column At The End ✔️ ✔️ ✔️ ✔️ ✔️
Hide Trailing Column ✔️ ✔️ ✔️ ✔️ ✔️
Hide Column ✔️ ✔️ ✔️
Change Column Type [2] ✔️ ✔️ [3] ✔️
Change Column Position ✔️ ✔️ ✔️
[1] Native Vectorized ORC Reader
[2] Only safe change via upcasting
[3] JSON is the most flexible for changing types
Performance
Micro Benchmark (Apache Spark 2.3.0)
• Target
− Apache Spark 2.3.0
− Apache ORC 1.4.1
• Machine
− MacBook Pro (2015 Mid)
− Intel® Core™ i7-4770HQ CPU @ 2.20GHz
− Mac OS X 10.13.4
− JDK 1.8.0_161
Performance – Single column scan from wide tables
[Figure: time (ms, axis 0 to 1200) versus number of columns (100, 200, 300) for 1M rows with all BIGINT columns. Series: native writer / native reader, hive writer / hive reader. Annotated speedup: 4x.]
Source: https://github.com/apache/spark/blob/branch-2.3/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala
Performance – Vectorized Read
[Figure: time (ms, axis 0 to 2500), `native` vs. `hive`, for TINYINT, SMALLINT, INT, BIGINT, FLOAT, and DOUBLE; 15M rows in a single-column table. Annotated speedups: 10x, 5x, 11x.]
Source: https://github.com/apache/spark/blob/branch-2.3/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala
Performance – Partitioned table read
[Figure: time (ms, axis 0 to 2500), `native` vs. `hive`, reading the data column, the partition column, and both columns; 15M rows in a partitioned table. Annotated speedups: 21x, 7x.]
Source: https://github.com/apache/spark/blob/branch-2.3/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala
Predicate Pushdown
[Figure: time (ms, axis 0 to 18000), `parquet` vs. `native` ORC, on 15M rows with 5 data columns and 1 sequential id column. Queries: select 10% rows (id < value), select 50% rows (id < value), select 90% rows (id < value), select all rows (id IS NOT NULL).]
Source: https://github.com/apache/spark/blob/branch-2.3/sql/core/src/test/scala/org/apache/spark/sql/FilterPushdownBenchmark.scala
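The benchmark queries select a fraction of rows by comparing a sequential id with a threshold. A rough PySpark sketch of that query shape, assuming the ORC data has an `id` column numbered from 0 and that pushdown is controlled by `spark.sql.orc.filterPushdown` (a key the slide does not show); `threshold_for` and `select_fraction` are hypothetical helpers, not code from the linked benchmark.

```python
def threshold_for(total_rows, percent):
    """Value such that `id < value` keeps about `percent` percent of the rows."""
    return total_rows * percent // 100


def select_fraction(spark, path, total_rows, percent):
    """Read ORC data and keep roughly `percent` percent of rows via `id < value`."""
    # Assumption: this setting enables ORC predicate pushdown for the session.
    spark.conf.set("spark.sql.orc.filterPushdown", "true")
    df = spark.read.orc(path)
    return df.where(df["id"] < threshold_for(total_rows, percent))
```

For the 15M-row table in the figure, the 10% query corresponds to `id < 1500000`.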
Demo
Support Matrix
Future Roadmap
Support Matrix
• Spark 2.3 and ORC 1.4 become GA in HDP 2.6.5.
HDP version | HDP 2.6.3~4 | HDP 2.6.5 | HDP 3.0 EA [1]
Status | TP for ORC on Spark | GA for ORC on Spark | Early Access
Spark | Spark 2.2 | Spark 2.3.0+ | Spark 2.3.1+
ORC | N/A | ORC 1.4.3 | ORC 1.4.3+
Configuration | spark.sql.orc.enabled=true | spark.sql.orc.impl=native | spark.sql.orc.impl=native
Configuration | spark.sql.orc.char.enabled=true | N/A | N/A
[1] https://hortonworks.com/info/early-access-hdp-3-0/
Future Roadmap – Targeting Apache Spark 2.4 (2018 Fall)
Umbrella Issue
• Feature Parity for ORC with Parquet (SPARK-20901)
Sub issues
• Upgrade Apache ORC to 1.5.1 (SPARK-24576)
• Use `native` ORC implementation by default (SPARK-23456)
• Use ORC predicate pushdown by default (SPARK-21783)
• Use `convertMetastoreOrc` by default (SPARK-22279)
• Support table properties with `convertMetastoreOrc/Parquet` (SPARK-23355)
• Test ORC as default data source format (SPARK-23553)
• Test and support Bloom Filters (SPARK-12417)
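Until those defaults change, a Spark 2.3 job can opt in explicitly. The sketch below turns the three "by default" items into `spark-submit` arguments; the key `spark.sql.orc.filterPushdown` is an assumption (the slide names only the feature), and `PLANNED_DEFAULTS` and `to_submit_args` are hypothetical names.

```python
# Settings matching the "by default" roadmap items, set explicitly for Spark 2.3.
PLANNED_DEFAULTS = {
    "spark.sql.orc.impl": "native",                 # SPARK-23456
    "spark.sql.orc.filterPushdown": "true",         # SPARK-21783 (key is an assumption)
    "spark.sql.hive.convertMetastoreOrc": "true",   # SPARK-22279
}


def to_submit_args(conf):
    """Render a settings dict as a flat list of `--conf key=value` arguments."""
    args = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args
```

Joining the result with spaces gives a fragment that can be pasted after `spark-submit`; sorting the keys keeps the output stable across runs.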
Future Roadmap – On-going work
• ORC Column-level encryption (with ORC 1.6)
• Support VectorUDT/MatrixUDT (SPARK-22320)
• Vectorized Writer with DataSource V2
• Support CHAR/VARCHAR Types
• ALTER TABLE … CHANGE column type (SPARK-18727)
Summary
• Like Hive, Apache Spark 2.3 starts to take advantage of Apache ORC
− Improved feature parity between Spark and Hive
• Native vectorized ORC reader
− boosts Spark ORC performance
− provides better schema evolution ability
• Structured streaming starts to work with ORC (both reader/writer)
• Spark is going to become faster and faster with ORC
Reference
• https://www.slideshare.net/DongjoonHyun/orc-improvement-in-apache-spark-23, Dataworks Summit 2018 Berlin
• https://youtu.be/EL-NHiwqCSY, ORC configuration in Apache Spark 2.3
• https://youtu.be/zJZ1gtzu-rs, Apache Spark 2.3 ORC with Apache Arrow
• https://community.hortonworks.com/articles/148917/orc-improvements-for-apache-spark-22.html
• https://www.slideshare.net/Hadoop_Summit/performance-update-when-apache-orc-met-apache-spark-81023199, Dataworks Summit 2017 Sydney
• https://www.slideshare.net/Hadoop_Summit/orc-file-optimizing-your-big-data, Dataworks Summit 2017 San Jose
Questions?
Thank you

ORC Improvement & Roadmap in Apache Spark 2.3 and 2.4

  • 1. 1 © Hortonworks Inc. 2011–2018. All rights reserved ORC Improvement & Roadmap in Apache Spark 2.3 and 2.4 Dongjoon Hyun Principal Software Engineer @ Hortonworks Data Science Team June 2018
  • 2. 2 © Hortonworks Inc. 2011–2018. All rights reserved Dongjoon Hyun • Hortonworks − Principal Software Engineer @ Data Science Team • Apache Project − Apache REEF Project Management Committee(PMC) Member & Committer − Apache Spark Project Contributor • GitHub − https://github.com/dongjoon-hyun
  • 3. 3 © Hortonworks Inc. 2011–2018. All rights reserved HDP 2.6.5 (May 2018) • Apache Spark − 2.3.0 (2018 FEB) • Apache ORC − 1.4.3 (2018 FEB) • Apache KAFKA − 1.0.0 (2017 NOV)
  • 4. 4 © Hortonworks Inc. 2011–2018. All rights reserved • Vectorized ORC Reader • Structured Streaming with ORC • Schema evolution with ORC • PySpark Performance Enhancements with Apache Arrow and ORC • Structured stream-stream joins • Spark History Server V2 • Spark on Kubernetes • Data source API V2 • Streaming API V2 • Continuous Structured Streaming Processing Major Features Experimental Features Apache Spark 2.3.x Spark 2.3.0 (and 2.3.1) has 1409 (and 134) JIRA issues.
  • 5. 5 © Hortonworks Inc. 2011–2018. All rights reserved • Vectorized ORC Reader • Structured Streaming with ORC • Schema evolution with ORC • PySpark Performance Enhancements with Apache Arrow and ORC • Structured stream-stream joins • Spark History Server V2 • Spark on Kubernetes • Data source API V2 • Streaming API V2 • Continuous Structured Streaming Processing Major Features Experimental Features Apache Spark 2.3.x Spark 2.3.0 (and 2.3.1) has 1409 (and 134) JIRA issues.
  • 6. 6 © Hortonworks Inc. 2011–2018. All rights reserved Spark’s built-in file-based data sources • TEXT The simplest one with one string column schema • CSV Popular for data science workloads • JSON The most flexible one for schema changes • PARQUET The only one with vectorized reader • ORC Storage-efficient and popular for shared Hive tables
  • 7. 7 © Hortonworks Inc. 2011–2018. All rights reserved Motivation • TEXT The simplest one with one string column schema • CSV Popular for data science workloads • JSON The most flexible one for schema changes • PARQUET The only one with vectorized reader • ORC Storage-efficient and popular for shared Hive tables Fast Flexible Hive Table Access
  • 8. 8 © Hortonworks Inc. 2011–2018. All rights reserved The story of Spark, ORC, and Hive • Before Apache ORC − Hive 1.2.1 (2015 JUN)  SPARK-2883 Hive 1.2.1 Spark 1.4
  • 9. 9 © Hortonworks Inc. 2011–2018. All rights reserved The story of Spark, ORC, and Hive – Cont. • Before Apache ORC − Hive 1.2.1 (2015 JUN)  SPARK-2883 Hive 1.2.1 Spark 1.4 • After Apache ORC − v1.0.0 (2016 JAN) − v1.3.3 (2017 FEB)  HIVE-15841 Hive 2.3.0 ~ 2.3.3
  • 10. 10 © Hortonworks Inc. 2011–2018. All rights reserved The story of Spark, ORC, and Hive – Cont. • Before Apache ORC − Hive 1.2.1 (2015 JUN)  SPARK-2883 Hive 1.2.1 Spark 1.4 • After Apache ORC − v1.0.0 (2016 JAN) − v1.3.3 (2017 FEB)  HIVE-15841 Hive 2.3.0 ~ 2.3.3 − v1.4.1 (2017 OCT)  SPARK-22300 Spark 2.3.0 (FEB) − v1.4.3 (2018 FEB)  SPARK-23340, HIVE-18674 Hive 3.0 (MAY) − v1.4.4 (2018 MAY)  SPARK-24322 Spark 2.3.1 (JUN)
  • 11. 11 © Hortonworks Inc. 2011–2018. All rights reserved The story of Spark, ORC, and Hive – Cont. • Before Apache ORC − Hive 1.2.1 (2015 JUN)  SPARK-2883 Hive 1.2.1 Spark 1.4 • After Apache ORC − v1.0.0 (2016 JAN) − v1.3.3 (2017 FEB)  HIVE-15841 Hive 2.3.0 ~ 2.3.3 − v1.4.1 (2017 OCT)  SPARK-22300 Spark 2.3.0 (FEB) − v1.4.3 (2018 FEB)  SPARK-23340, HIVE-18674 Hive 3.0 (MAY) − v1.4.4 (2018 MAY)  SPARK-24322 Spark 2.3.1 (JUN) − v1.5.1 (2018 MAY)  SPARK-24576, HIVE-19669 Hive 3.1 Spark 2.4
  • 12. 12 © Hortonworks Inc. 2011–2018. All rights reserved The story of Spark, ORC, and Hive – Cont. • Before Apache ORC − Hive 1.2.1 (2015 JUN)  SPARK-2883 Hive 1.2.1 Spark 1.4 • After Apache ORC − v1.0.0 (2016 JAN) − v1.3.3 (2017 FEB)  HIVE-15841 Hive 2.3.0 ~ 2.3.3 − v1.4.1 (2017 OCT)  SPARK-22300 Spark 2.3.0 (FEB) − v1.4.3 (2018 FEB)  SPARK-23340, HIVE-18674 Hive 3.0 (MAY) − v1.4.4 (2018 MAY)  SPARK-24322 Spark 2.3.1 (JUN) − v1.5.1 (2018 MAY)  SPARK-24576, HIVE-19669 Hive 3.1 Spark 2.4
  • 13. 13 © Hortonworks Inc. 2011–2018. All rights reserved Previous ORC Issues in Spark
  • 14. 14 © Hortonworks Inc. 2011–2018. All rights reserved Six Issue Categories • ORC Writer Versions • Performance • Structured streaming • Column names • Hive tables and schema evolution • Robustness
  • 15. 15 © Hortonworks Inc. 2011–2018. All rights reserved Category 1 – ORC Writer Versions • ORIGINAL • HIVE_8732 (2014) ORC string statistics are not merged correctly • HIVE_4243 (2015) Use real column names from Hive tables • HIVE_12055(2015) Vectorized Writer • HIVE_13083(2016) Decimals write present stream correctly • ORC_101 (2016) Correct the use of the default charset in bloomfilter • ORC_135 (2018) PPD for timestamp is wrong when reader/writer timezones are different
  • 16. 16 © Hortonworks Inc. 2011–2018. All rights reserved Category 2 – Performance • Vectorized ORC Reader (SPARK-16060) • Fast reading partition-columns (SPARK-22712) • Pushing down filters for DateType (SPARK-21787)
  • 17. 17 © Hortonworks Inc. 2011–2018. All rights reserved • `FileNotFoundException` at writing empty partitions as ORC • Create structured steam with ORC files Write (SPARK-15474) Read (SPARK-22781) Category 3 – Structured streaming spark.readStream.orc(path)
  • 18. 18 © Hortonworks Inc. 2011–2018. All rights reserved Category 4 – Column names • Unicode column names (SPARK-23072) • Column names with dot (SPARK-21791) • Should not create invalid column names (SPARK-21912)
  • 19. 19 © Hortonworks Inc. 2011–2018. All rights reserved Category 5 – Hive tables and schema evolution • Support `ALTER TABLE ADD COLUMNS` (SPARK-21929) − Introduced at Spark 2.2, but throws AnalysisException for ORC • Support column positional mismatch (SPARK-22267) − Return wrong result if ORC file schema is different from Hive MetaStore schema order • Support table properties during `convertMetastoreOrc/Parquet` (SPARK-23355, Spark 2.4) − For ORC/Parquet Hive tables, `convertMetastore` ignores table properties
  • 20. 20 © Hortonworks Inc. 2011–2018. All rights reserved Category 6 – Robustness • ORC metadata exceed ProtoBuf message size limit (SPARK-19109) • NullPointerException on zero-size ORC file (SPARK-19809) • Support `ignoreCorruptFiles` (SPARK-23049) • Support `ignoreMissingFiles` (SPARK-23305)
  • 21. 21 © Hortonworks Inc. 2011–2018. All rights reserved Current Approach
  • 22. 22 © Hortonworks Inc. 2011–2018. All rights reserved Supports two ORC file formats • Adding a new OrcFileFormat (SPARK-20682) FileFormat TextBasedFileFormat ParquetFileFormat OrcFileFormat HiveFileFormat JsonFileFormat LibSVMFileFormat CSVFileFormat TextFileFormat o.a.s.sql.execution.datasources o.a.s.ml.source.libsvmo.a.s.sql.hive.orc OrcFileFormat `hive` OrcFileFormat from Hive 1.2.1 `native` OrcFileFormat with ORC 1.4+
  • 23. 23 © Hortonworks Inc. 2011–2018. All rights reserved In Reality – Four cases for ORC Reader/Writer `hive` Reader`native` Reader `hive` Writer `native` Writer • New Data • New Apps • Best performance (Vectorized Reader) • New Data • Old Apps • Improved performance (Non-vectorized Reader) • Old Data • New Apps • Improved performance (Vectorized Reader) • Old Data • Old Apps • As-Is performance (Non-vectorized Reader) 1 2 3 4
  • 24. 24 © Hortonworks Inc. 2011–2018. All rights reserved Performance – Single column scan from wide tables Number of columns Time (ms) 1M rows with all BIGINT columns 0 200 400 600 800 1000 1200 100 200 300 native writer / native reader hive writer / native reader native writer / hive reader hive writer / hive reader 4x 1 2 3 4 https://github.com/apache/spark/blob/branch-2.3/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala
  • 25. 25 © Hortonworks Inc. 2011–2018. All rights reserved Switch ORC implementation (SPARK-20728) • spark.sql.orc.impl=native (default: `hive`) CREATE TABLE people (name string, age int) USING ORC OPTIONS (orc.compress 'ZLIB') spark.read.orc(path) df.write.orc(path) spark.read.format("orc").load (path) df.write.format("orc").save(path) Read/Write Dataset Read/Write Dataset Create ORC Table
  • 26. 26 © Hortonworks Inc. 2011–2018. All rights reserved Switch ORC implementation (SPARK-20728) – Cont. • spark.sql.orc.impl=native (default: `hive`) spark.readStream.orc(path) spark.readStream.format("orc").load(path) df.writeStream .option("checkpointLocation", path1) .format("orc") .option("path", path2) .start Read/Write Structured Stream
  • 27. 27 © Hortonworks Inc. 2011–2018. All rights reserved Support vectorized read on Hive ORC Tables • spark.sql.hive.convertMetastoreOrc=true (default: false) − `spark.sql.orc.impl=native` is required, too. CREATE TABLE people (name string, age int) STORED AS ORC CREATE TABLE people (name string, age int) USING HIVE OPTIONS (fileFormat 'ORC', orc.compress 'gzip')
  • 28. 28 © Hortonworks Inc. 2011–2018. All rights reserved Schema evolution at reading file-based data sources • Frequently, new files can have wider column types or new columns − Before SPARK-21929, users drop and recreate ORC table with an updated schema. • User-defined schema reduces schema inference cost and handles upcasting − boolean -> byte -> short -> int -> long − float -> double spark.read.schema("col1 int").orc(path) spark.read.schema("col1 long, col2 long").orc(path) Old Data New Data
  • 29. 29 © Hortonworks Inc. 2011–2018. All rights reserved Schema evolution at reading file-based data sources – Cont. 1. Native Vectorized ORC Reader 2. Only safe change via upcasting 3. JSON is the most flexible for changing types File Format TEXT CSV JSON ORC `hive` ORC `native`1 PARQUET Add Column At The End ✔️ ✔️ ✔️ ✔️ ✔️ Hide Trailing Column ✔️ ✔️ ✔️ ✔️ ✔️ Hide Column ✔️ ✔️ ✔️ Change Column Type2 ✔️ ✔️3 ✔️ Change Column Position ✔️ ✔️ ✔️
  • 30. 30 © Hortonworks Inc. 2011–2018. All rights reserved Performance
  • 31. 31 © Hortonworks Inc. 2011–2018. All rights reserved Micro Benchmark (Apache Spark 2.3.0) • Target − Apache Spark 2.3.0 − Apache ORC 1.4.1 • Machine − MacBook Pro (2015 Mid) − Intel® Core™ i7-4770JQ CPI @ 2.20GHz − Mac OS X 10.13.4 − JDK 1.8.0_161
  • 32. 32 © Hortonworks Inc. 2011–2018. All rights reserved Performance – Single column scan from wide tables Number of columns Time (ms) 1M rows with all BIGINT columns 0 200 400 600 800 1000 1200 100 200 300 native writer / native reader hive writer / hive reader 4x https://github.com/apache/spark/blob/branch-2.3/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala
  • 33. Performance – Vectorized Read
    [Chart: Time (ms) by column type (TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE); 15M rows in a single-column table; series: native, hive; annotated speedups: 10x, 5x, 11x]
    https://github.com/apache/spark/blob/branch-2.3/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala
  • 34. Performance – Partitioned table read
    [Chart: Time (ms) by scanned columns (Data column, Partition column, Both columns); 15M rows in a partitioned table; series: native, hive; annotated speedups: 21x, 7x]
    https://github.com/apache/spark/blob/branch-2.3/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala
  • 35. Predicate Pushdown
    [Chart: Time (ms) by query (Select 10% rows (id < value), Select 50% rows (id < value), Select 90% rows (id < value), Select all rows (id IS NOT NULL)); 15M rows with 5 data columns and 1 sequential id column; series: parquet, native]
    https://github.com/apache/spark/blob/branch-2.3/sql/core/src/test/scala/org/apache/spark/sql/FilterPushdownBenchmark.scala
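Why a sequential id column suits a predicate such as `id < value` can be illustrated with a toy model: if each group of rows carries its minimum and maximum id, groups whose minimum is already at or above `value` never need to be read. The sketch below is plain Python and is not the ORC or Spark reader; the function names are hypothetical, and the group size of 10,000 rows is an assumption chosen for the example.

```python
def row_group_stats(ids, group_size=10_000):
    """Split ids into consecutive groups and record (min, max) for each group."""
    stats = []
    for start in range(0, len(ids), group_size):
        group = ids[start:start + group_size]
        stats.append((min(group), max(group)))
    return stats


def groups_to_read(stats, value):
    """Count the groups that may contain a row with id < value.

    A group is skipped when its minimum id is already >= value.
    """
    return sum(1 for lo, _hi in stats if lo < value)


if __name__ == "__main__":
    ids = list(range(1_000_000))  # sequential ids, scaled down from the slide's 15M rows
    stats = row_group_stats(ids)
    for pct in (10, 50, 90):
        value = len(ids) * pct // 100
        print(f"select {pct}%: read {groups_to_read(stats, value)} of {len(stats)} groups")
```

In this model the number of groups read grows in proportion to the selected fraction, which is the effect the benchmark's 10%, 50% and 90% queries are designed to expose.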
  • 36. Demo
  • 37. Support Matrix Future Roadmap
  • 38. Support Matrix
    • Spark 2.3 and ORC 1.4 become GA in HDP 2.6.5.
    • HDP 2.6.3~4: TP for ORC on Spark; Spark 2.2; ORC N/A; spark.sql.orc.enabled=true, spark.sql.orc.char.enabled=true
    • HDP 2.6.5: GA for ORC on Spark; Spark 2.3.0+; ORC 1.4.3; spark.sql.orc.impl=native
    • HDP 3.0 EA (1): Early Access; Spark 2.3.1+; ORC 1.4.3+; spark.sql.orc.impl=native
    1. https://hortonworks.com/info/early-access-hdp-3-0/
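The support matrix can also be kept as a small lookup table, so that the ORC settings for a given HDP release are chosen in one place. This is a sketch in plain Python; the dictionary layout and the helper name `orc_conf_for` are hypothetical, and the values are transcribed from the slide.

```python
# Support matrix from the slide, keyed by HDP release. `orc` is None where the slide says N/A.
SUPPORT_MATRIX = {
    "HDP 2.6.3~4": {
        "status": "TP for ORC on Spark",
        "spark": "2.2",
        "orc": None,
        "conf": {
            "spark.sql.orc.enabled": "true",
            "spark.sql.orc.char.enabled": "true",
        },
    },
    "HDP 2.6.5": {
        "status": "GA for ORC on Spark",
        "spark": "2.3.0+",
        "orc": "1.4.3",
        "conf": {"spark.sql.orc.impl": "native"},
    },
    "HDP 3.0 EA": {
        "status": "Early Access",
        "spark": "2.3.1+",
        "orc": "1.4.3+",
        "conf": {"spark.sql.orc.impl": "native"},
    },
}


def orc_conf_for(release):
    """Return a copy of the ORC-related settings the slide lists for a release."""
    return dict(SUPPORT_MATRIX[release]["conf"])


if __name__ == "__main__":
    for release, row in SUPPORT_MATRIX.items():
        print(release, row["status"], orc_conf_for(release))
```

Returning a copy keeps callers from accidentally editing the shared table when they add their own settings.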
  • 39. Future Roadmap – Targeting Apache Spark 2.4 (2018 Fall)
    • Umbrella Issue
      − Feature Parity for ORC with Parquet (SPARK-20901)
    • Sub issues
      − Upgrade Apache ORC to 1.5.1 (SPARK-24576)
      − Use `native` ORC implementation by default (SPARK-23456)
      − Use ORC predicate pushdown by default (SPARK-21783)
      − Use `convertMetastoreOrc` by default (SPARK-22279)
      − Support table properties with `convertMetastoreOrc/Parquet` (SPARK-23355)
      − Test ORC as default data source format (SPARK-23553)
      − Test and support Bloom Filters (SPARK-12417)
  • 40. Future Roadmap – On-going work
    • ORC Column-level encryption (with ORC 1.6)
    • Support VectorUDT/MatrixUDT (SPARK-22320)
    • Vectorized Writer with DataSource V2
    • Support CHAR/VARCHAR Types
    • ALTER TABLE … CHANGE column type (SPARK-18727)
  • 41. Summary
    • Like Hive, Apache Spark 2.3 starts to take advantage of Apache ORC
      − Improved feature parity between Spark and Hive
    • Native vectorized ORC reader
      − boosts Spark ORC performance
      − provides better schema evolution ability
    • Structured streaming starts to work with ORC (both reader/writer)
    • Spark is going to become faster and faster with ORC
  • 42. Reference
    • https://www.slideshare.net/DongjoonHyun/orc-improvement-in-apache-spark-23, Dataworks Summit 2018 Berlin
    • https://youtu.be/EL-NHiwqCSY, ORC configuration in Apache Spark 2.3
    • https://youtu.be/zJZ1gtzu-rs, Apache Spark 2.3 ORC with Apache Arrow
    • https://community.hortonworks.com/articles/148917/orc-improvements-for-apache-spark-22.html
    • https://www.slideshare.net/Hadoop_Summit/performance-update-when-apache-orc-met-apache-spark-81023199, Dataworks Summit 2017 Sydney
    • https://www.slideshare.net/Hadoop_Summit/orc-file-optimizing-your-big-data, Dataworks Summit 2017 San Jose
  • 43. Questions?
  • 44. Thank you