ORC Improvement & Roadmap in Apache Spark 2.3 and 2.4

© Hortonworks Inc. 2011–2018. All rights reserved
Dongjoon Hyun
Principal Software Engineer @ Hortonworks Data Science Team
June 2018

Dongjoon Hyun
• Hortonworks
− Principal Software Engineer @ Data Science Team
• Apache Project
− Apache REEF Project Management Committee(PMC) Member & Committer
− Apache Spark Project Contributor
• GitHub
− https://github.com/dongjoon-hyun

HDP 2.6.5 (May 2018)
• Apache Spark
− 2.3.0 (2018 FEB)
• Apache ORC
− 1.4.3 (2018 FEB)
• Apache Kafka
− 1.0.0 (2017 NOV)

Apache Spark 2.3.x
Spark 2.3.0 has 1409 JIRA issues, and 2.3.1 has 134.
Major Features
• Vectorized ORC Reader
• Structured Streaming with ORC
• Schema evolution with ORC
• PySpark Performance Enhancements with Apache Arrow and ORC
• Structured stream-stream joins
• Spark History Server V2
Experimental Features
• Spark on Kubernetes
• Data source API V2
• Streaming API V2
• Continuous Structured Streaming Processing

Spark’s built-in file-based data sources
• TEXT The simplest one with one string column schema
• CSV Popular for data science workloads
• JSON The most flexible one for schema changes
• PARQUET The only one with vectorized reader
• ORC Storage-efficient and popular for shared Hive tables

Motivation
Fast
Flexible
Hive Table Access

The story of Spark, ORC, and Hive
• Before Apache ORC
− Hive 1.2.1 (2015 JUN)  SPARK-2883 Hive 1.2.1 Spark 1.4

The story of Spark, ORC, and Hive – Cont.
• After Apache ORC
− v1.0.0 (2016 JAN)
− v1.3.3 (2017 FEB)  HIVE-15841 Hive 2.3.0 ~ 2.3.3
− v1.4.1 (2017 OCT)  SPARK-22300 Spark 2.3.0 (FEB)
− v1.4.3 (2018 FEB)  SPARK-23340, HIVE-18674 Hive 3.0 (MAY)
− v1.4.4 (2018 MAY)  SPARK-24322 Spark 2.3.1 (JUN)

− v1.5.1 (2018 MAY)  SPARK-24576, HIVE-19669 Hive 3.1 Spark 2.4

Previous ORC Issues in Spark

Six Issue Categories
• ORC Writer Versions
• Performance
• Structured streaming
• Column names
• Hive tables and schema evolution
• Robustness

Category 1 – ORC Writer Versions
• ORIGINAL
• HIVE_8732 (2014) ORC string statistics are not merged correctly
• HIVE_4243 (2015) Use real column names from Hive tables
• HIVE_12055 (2015) Vectorized Writer
• HIVE_13083 (2016) Decimals write present stream correctly
• ORC_101 (2016) Correct the use of the default charset in bloomfilter
• ORC_135 (2018) PPD for timestamp is wrong when reader/writer timezones are different

Category 2 – Performance
• Vectorized ORC Reader (SPARK-16060)
• Fast reading partition-columns (SPARK-22712)
• Pushing down filters for DateType (SPARK-21787)

Category 3 – Structured streaming
• Write (SPARK-15474): `FileNotFoundException` at writing empty partitions as ORC
• Read (SPARK-22781): Create structured stream with ORC files
    spark.readStream.orc(path)

Category 4 – Column names
• Unicode column names (SPARK-23072)
• Column names with dot (SPARK-21791)
• Should not create invalid column names (SPARK-21912)

Category 5 – Hive tables and schema evolution
• Support `ALTER TABLE ADD COLUMNS` (SPARK-21929)
− Introduced in Spark 2.2, but throws AnalysisException for ORC
• Support column positional mismatch (SPARK-22267)
− Returns wrong results if the ORC file schema order differs from the Hive MetaStore schema order
• Support table properties during `convertMetastoreOrc/Parquet` (SPARK-23355, Spark 2.4)
− For ORC/Parquet Hive tables, `convertMetastore` ignores table properties

Category 6 – Robustness
• ORC metadata exceeds ProtoBuf message size limit (SPARK-19109)
• NullPointerException on zero-size ORC file (SPARK-19809)
• Support `ignoreCorruptFiles` (SPARK-23049)
• Support `ignoreMissingFiles` (SPARK-23305)

Current Approach

Supports two ORC file formats
• Adding a new OrcFileFormat (SPARK-20682)
[Class diagram: FileFormat and its implementations]
− Classes shown: FileFormat, TextBasedFileFormat, ParquetFileFormat, OrcFileFormat, HiveFileFormat, JsonFileFormat, LibSVMFileFormat, CSVFileFormat, TextFileFormat
− Packages shown: o.a.s.sql.execution.datasources, o.a.s.ml.source.libsvm, o.a.s.sql.hive.orc
− `hive` OrcFileFormat (o.a.s.sql.hive.orc), from Hive 1.2.1
− `native` OrcFileFormat (o.a.s.sql.execution.datasources), with ORC 1.4+

In Reality – Four cases for ORC Reader/Writer
Case | Writer   | Reader   | Data | Apps | Performance
1    | `native` | `native` | New  | New  | Best (Vectorized Reader)
2    | `native` | `hive`   | New  | Old  | Improved (Non-vectorized Reader)
3    | `hive`   | `native` | Old  | New  | Improved (Vectorized Reader)
4    | `hive`   | `hive`   | Old  | Old  | As-Is (Non-vectorized Reader)

Performance – Single column scan from wide tables
[Chart: time (ms) vs. number of columns (100, 200, 300); 1M rows with all BIGINT columns. Series: native writer / native reader, hive writer / native reader, native writer / hive reader, hive writer / hive reader. The chart labels the series 1–4 and marks a 4x difference.]
https://github.com/apache/spark/blob/branch-2.3/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala

Switch ORC implementation (SPARK-20728)
• spark.sql.orc.impl=native (default: `hive`)
Read/Write Dataset
    spark.read.orc(path)
    df.write.orc(path)
Read/Write Dataset
    spark.read.format("orc").load(path)
    df.write.format("orc").save(path)
Create ORC Table
    CREATE TABLE people (name string, age int)
    USING ORC OPTIONS (orc.compress 'ZLIB')

Switch ORC implementation (SPARK-20728) – Cont.
• spark.sql.orc.impl=native (default: `hive`)
Read/Write Structured Stream
    spark.readStream.orc(path)
    spark.readStream.format("orc").load(path)
    df.writeStream
      .option("checkpointLocation", path1)
      .format("orc")
      .option("path", path2)
      .start

Support vectorized read on Hive ORC Tables
• spark.sql.hive.convertMetastoreOrc=true (default: false)
− `spark.sql.orc.impl=native` is required, too.
STORED AS ORC
USING HIVE OPTIONS (fileFormat 'ORC', orc.compress 'gzip')
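
The rule on this slide (both settings are needed before a Hive ORC table is read through the native, vectorized path) can be sketched as a small check. This is only a model of the slide's statement in plain Python, not Spark's planner logic; the function name `uses_native_orc_for_hive_table` is hypothetical, and the defaults are the Spark 2.3 defaults quoted in the slides. Other conditions (for example the column types involved) may also apply in a real job.

```python
# Hypothetical helper: models the rule stated on the slide, not Spark internals.
def uses_native_orc_for_hive_table(conf):
    """Return True if both settings named on the slide are enabled.

    `conf` is a plain dict of Spark SQL configuration strings.
    Missing keys fall back to the Spark 2.3 defaults quoted in the slides.
    """
    convert = conf.get("spark.sql.hive.convertMetastoreOrc", "false").lower() == "true"
    native = conf.get("spark.sql.orc.impl", "hive").lower() == "native"
    return convert and native


if __name__ == "__main__":
    print(uses_native_orc_for_hive_table({}))
    print(uses_native_orc_for_hive_table({
        "spark.sql.hive.convertMetastoreOrc": "true",
        "spark.sql.orc.impl": "native",
    }))
```

Setting only one of the two keys leaves the check false, which mirrors the "is required, too" note above.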

Schema evolution at reading file-based data sources
• Frequently, new files can have wider column types or new columns
− Before SPARK-21929, users had to drop and recreate the ORC table with an updated schema.
• User-defined schema reduces schema inference cost and handles upcasting
− boolean -> byte -> short -> int -> long
− float -> double
Old Data
    spark.read.schema("col1 int").orc(path)
New Data
    spark.read.schema("col1 long, col2 long").orc(path)
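
As a rough illustration of the upcasting rule above, the two chains can be encoded as ordered lists, and a type change is treated as safe only when the read type sits at or after the file type in the same chain. This is a sketch under that assumption; `UPCAST_CHAINS` and `is_safe_upcast` are hypothetical names, and the actual reader may accept or reject other conversions.

```python
# The two upcasting chains listed on the slide, in widening order.
UPCAST_CHAINS = [
    ["boolean", "byte", "short", "int", "long"],
    ["float", "double"],
]


def is_safe_upcast(file_type, read_type):
    """Return True if reading `file_type` data as `read_type` only widens the type.

    Both types must belong to the same chain, and the read type must not be
    narrower than the type stored in the file.
    """
    for chain in UPCAST_CHAINS:
        if file_type in chain and read_type in chain:
            return chain.index(file_type) <= chain.index(read_type)
    return False


if __name__ == "__main__":
    # Old files store col1 as int; the new user-defined schema reads it as long.
    print(is_safe_upcast("int", "long"))
    # Narrowing is not covered by the chains.
    print(is_safe_upcast("long", "int"))
```

In the example on the slide, old files written with `col1 int` stay readable under the schema `col1 long, col2 long` because int to long is a widening step.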

Schema evolution at reading file-based data sources – Cont.
File Format: TEXT, CSV, JSON, ORC `hive`, ORC `native` (1), PARQUET
Add Column At The End ✔️ ✔️ ✔️ ✔️ ✔️
Hide Trailing Column ✔️ ✔️ ✔️ ✔️ ✔️
Hide Column ✔️ ✔️ ✔️
Change Column Type (2) ✔️ ✔️ (3) ✔️
Change Column Position ✔️ ✔️ ✔️
(1) Native Vectorized ORC Reader
(2) Only safe change via upcasting
(3) JSON is the most flexible for changing types

Performance

Micro Benchmark (Apache Spark 2.3.0)
• Target
− Apache Spark 2.3.0
− Apache ORC 1.4.1
• Machine
− MacBook Pro (2015 Mid)
− Intel® Core™ i7-4770JQ CPU @ 2.20GHz
− Mac OS X 10.13.4
− JDK 1.8.0_161

Performance – Single column scan from wide tables
[Chart: time (ms) vs. number of columns (100, 200, 300); 1M rows with all BIGINT columns. Series: native writer / native reader, hive writer / hive reader. Marked difference: 4x.]

Performance – Vectorized Read
[Chart: time (ms) by column type (TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE); 15M rows in a single-column table. Series: native, hive. Marked differences: 10x, 5x, 11x.]

Performance – Partitioned table read
[Chart: time (ms) for reading the data column, the partition column, and both columns; 15M rows in a partitioned table. Series: native, hive. Marked differences: 21x, 7x.]

Predicate Pushdown
[Chart: time (ms), axis 0 to 18000, for "Select 10% rows (id < value)" and "Select all rows (id IS NOT NULL)"; 15M rows with 5 data columns and 1 sequential id column. Series: parquet, native.]
https://github.com/apache/spark/blob/branch-2.3/sql/core/src/test/scala/org/apache/spark/sql/FilterPushdownBenchmark.scala

Demo

Support Matrix
Future Roadmap

Support Matrix
• Spark 2.3 and ORC 1.4 become GA in HDP 2.6.5.
HDP 2.6.3~4                     | HDP 2.6.5                 | HDP 3.0 EA (1)
TP for ORC on Spark             | GA for ORC on Spark       | Early Access
Spark 2.2                       | Spark 2.3.0+              | Spark 2.3.1+
N/A                             | ORC 1.4.3                 | ORC 1.4.3+
spark.sql.orc.enabled=true      | spark.sql.orc.impl=native | spark.sql.orc.impl=native
spark.sql.orc.char.enabled=true | N/A                       | N/A
(1) https://hortonworks.com/info/early-access-hdp-3-0/

Future Roadmap – Targeting Apache Spark 2.4 (2018 Fall)
Umbrella Issue
• Feature Parity for ORC with Parquet SPARK-20901
Sub issues
• Upgrade Apache ORC to 1.5.1 SPARK-24576
• Use `native` ORC implementation by default SPARK-23456
• Use ORC predicate pushdown by default SPARK-21783
• Use `convertMetastoreOrc` by default SPARK-22279
• Support table properties with `convertMetastoreOrc/Parquet` SPARK-23355
• Test ORC as default data source format SPARK-23553
• Test and support Bloom Filters SPARK-12417
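
Until those defaults change, a Spark 2.3 job can opt in explicitly. The sketch below only assembles `--conf` arguments for `spark-submit` from a dictionary; it does not call Spark. `ROADMAP_DEFAULTS` and `to_submit_args` are hypothetical names, and the key `spark.sql.orc.filterPushdown` is not named in the slides (it is assumed here to be the setting behind "ORC predicate pushdown").

```python
# Settings corresponding to three roadmap items above.
# Assumption: spark.sql.orc.filterPushdown is the predicate pushdown switch.
ROADMAP_DEFAULTS = {
    "spark.sql.orc.impl": "native",
    "spark.sql.orc.filterPushdown": "true",
    "spark.sql.hive.convertMetastoreOrc": "true",
}


def to_submit_args(conf):
    """Turn a config dict into a flat list of spark-submit arguments.

    Keys are sorted so the resulting command line is deterministic.
    """
    args = []
    for key in sorted(conf):
        args += ["--conf", "{}={}".format(key, conf[key])]
    return args


if __name__ == "__main__":
    # Prints the arguments as a single shell-friendly string.
    print(" ".join(to_submit_args(ROADMAP_DEFAULTS)))
```

The same dictionary could instead be applied at runtime through the session configuration; the command-line form is shown only because it needs no running Spark session.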

Future Roadmap – On-going work
• ORC Column-level encryption (with ORC 1.6)
• Support VectorUDT/MatrixUDT (SPARK-22320)
• Vectorized Writer with DataSource V2
• Support CHAR/VARCHAR Types
• ALTER TABLE … CHANGE column type (SPARK-18727)

Summary
• Like Hive, Apache Spark 2.3 starts to take advantage of Apache ORC
− Improved feature parity between Spark and Hive
• Native vectorized ORC reader
− boosts Spark ORC performance
− provides better schema evolution ability
• Structured streaming starts to work with ORC (both reader/writer)
• Spark is going to become faster and faster with ORC

Reference
• https://www.slideshare.net/DongjoonHyun/orc-improvement-in-apache-spark-23, Dataworks Summit 2018 Berlin
• https://youtu.be/EL-NHiwqCSY, ORC configuration in Apache Spark 2.3
• https://youtu.be/zJZ1gtzu-rs, Apache Spark 2.3 ORC with Apache Arrow
• https://community.hortonworks.com/articles/148917/orc-improvements-for-apache-spark-22.html
• https://www.slideshare.net/Hadoop_Summit/performance-update-when-apache-orc-met-apache-spark-81023199, Dataworks Summit 2017 Sydney
• https://www.slideshare.net/Hadoop_Summit/orc-file-optimizing-your-big-data, Dataworks Summit 2017 San Jose

Questions?

Thank you
