Why you should care about
data layout in the file system
Cheng Lian, @liancheng
Vida Ha, @femineer
Spark Summit 2017
1
TEAM
About Databricks
Started Spark project (now Apache Spark) at UC Berkeley in 2009
2
PRODUCT
Unified Analytics Platform
MISSION
Making Big Data Simple
Apache Spark is a
powerful framework
with some temper
3
4
Just like
super mario
5
Serve him the
right ingredients
6
Powers up and
gets more efficient
7
Keep serving
8
He even knows
how to Spark!
9
However,
once served
a wrong dish...
10
Meh...
11
And sometimes...
12
It can be messy...
13
Secret sauces
we feed Spark
13
File Formats
14
Choosing a compression scheme
15
The obvious
• Compression ratio: the higher the better
• De/compression speed: the faster the better
Choosing a compression scheme
16
Splittable vs. non-splittable
• Affects parallelism, crucial for big data
• Common splittable compression schemes
• LZ4, Snappy, BZip2, LZO, etc.
• GZip is non-splittable
• Still common if file sizes are << 1 GB
• Still applicable for Parquet (compression is applied per column chunk, so files stay splittable)
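For reference, a minimal sketch of picking the codec at write time (assuming a spark-shell or notebook session where spark is the active SparkSession; the output path is hypothetical):

// Parquet compresses per column chunk, so even gzip-compressed Parquet files stay splittable.
val df = spark.range(0, 1000000).toDF("id")
df.write
  .option("compression", "snappy")   // or "gzip", "uncompressed"
  .parquet("/tmp/ids_snappy")        // hypothetical output path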
Columnar formats
Smart, analytics friendly, optimized for big data
• Support for nested data types
• Efficient data skipping
• Column pruning
• Min/max statistics based predicate push-down
• Nice interoperability
• Examples:
• Spark SQL built-in support: Apache Parquet and Apache ORC
• Newly emerging: Apache CarbonData and Spinach
17
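To see column pruning and predicate push-down at work, a sketch against a hypothetical Parquet dataset:

import org.apache.spark.sql.functions.col

val ratings = spark.read.parquet("/data/ratings")   // hypothetical path

ratings
  .select("artist", "rating")   // column pruning: only these columns are decoded
  .filter(col("rating") > 3)    // pushed down to Parquet; min/max stats can skip row groups
  .explain()                    // look for ReadSchema and PushedFilters in the plan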
Columnar formats
Parquet
• Apache Spark default output format
• Usually the best practice for Spark SQL
• Relatively heavy write path
• Worth the time to encode for repeated analytics scenarios
• Does not support fine grained appending
• Not ideal for, e.g., collecting logs
• Check out Parquet presentations for more details
18
Semi-structured text formats
Sort of structured but not self-describing
• Excellent write path performance but slow on the read path
• Good candidates for collecting raw data (e.g., logs)
• Subject to inconsistent and/or malformed records
• Schema inference provided by Spark (for JSON and CSV)
• Sampling-based
• Handy for exploratory scenarios but can be inaccurate
• Always specify an accurate schema in production
19
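A sketch of what specifying an accurate schema looks like (field names and path are hypothetical):

import org.apache.spark.sql.types._

// Declaring the schema up front skips the sampling pass and pins down the types.
val logSchema = StructType(Seq(
  StructField("timestamp", TimestampType),
  StructField("level", StringType),
  StructField("message", StringType)
))

val logs = spark.read.schema(logSchema).json("/data/raw/logs")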
Semi-structured text formats
JSON
• Supported by Apache Spark out of the box
• One JSON object per line for fast file splitting
• JSON object: map or struct?
• Spark schema inference always treats JSON objects as structs
• Watch out for arbitrary number of keys (may OOM executors)
• Specify an accurate schema if you decide to stick with maps
20
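If the objects really are maps with arbitrary keys, a sketch of declaring that explicitly (fields and path are hypothetical):

import org.apache.spark.sql.types._

// Inference would produce a struct with one field per distinct key ever seen;
// a MapType keeps the schema bounded no matter how many keys show up.
val eventSchema = StructType(Seq(
  StructField("id", StringType),
  StructField("attributes", MapType(StringType, StringType))
))

val events = spark.read.schema(eventSchema).json("/data/raw/events")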
Semi-structured text formats
JSON
• Malformed records
• Bad records are collected into the column _corrupt_record
• All other columns are set to null
21
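A sketch of keeping those bad lines around for inspection (schema and path are hypothetical; _corrupt_record is the default, configurable column name):

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

// The corrupt-record column must be part of the schema to get populated.
val schemaWithCorrupt = new StructType()
  .add("timestamp", TimestampType)
  .add("message", StringType)
  .add("_corrupt_record", StringType)

val parsed = spark.read
  .option("mode", "PERMISSIVE")                             // default: keep bad rows
  .option("columnNameOfCorruptRecord", "_corrupt_record")   // default name, made explicit
  .schema(schemaWithCorrupt)
  .json("/data/raw/logs")

val badLines = parsed.filter(col("_corrupt_record").isNotNull)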
Semi-structured text formats
CSV
• Supported by Spark 2.x out of the box
• Check out the spark-csv package for Spark 1.x
• Often used for handling legacy data providers & consumers
• Lacks a standard file specification
– Separator, escaping, quoting, etc.
• Lacks support for nested data types
22
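A sketch of pinning down the CSV dialect and schema explicitly (the file, separator, and columns are hypothetical):

import org.apache.spark.sql.types._

val userSchema = StructType(Seq(
  StructField("name", StringType),
  StructField("age", IntegerType),
  StructField("city", StringType)
))

val users = spark.read
  .option("header", "true")
  .option("sep", "|")        // separator used by this (hypothetical) provider
  .option("quote", "\"")
  .option("escape", "\\")
  .schema(userSchema)
  .csv("/data/legacy/users.csv")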
Raw text files
23
Arbitrary line-based text files
• Splitting files into lines using spark.read.text()
• Keep your lines a reasonable size
• Keep file size < 1GB if compressed with a non-splittable
compression scheme (e.g., GZip)
• Handling inevitable malformed data
• Use a filter() transformation to drop bad lines, or
• Use a map() transformation to fix bad lines
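A sketch of the filter()/map() clean-up (the path and the notion of a bad line are hypothetical):

import org.apache.spark.sql.functions.{col, length, trim}
import spark.implicits._

val lines = spark.read.text("/data/raw/app.log")   // single string column named "value"

// Drop lines we cannot use...
val kept = lines
  .filter(length(trim(col("value"))) > 0)   // blank lines
  .filter(!col("value").startsWith("#"))    // comment lines (example rule)

// ...or repair them instead, e.g. by normalizing stray tabs.
val fixed = lines.map(_.getString(0).replace("\t", " "))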
Directory layout
24
Partitioning
(diagram: a partitioned directory tree, year=2017/ containing genre=classic/ and genre=folk/ subdirectories of album files)
25
Overview
• Coarse-grained data skipping
• Available for both persisted
tables and raw directories
• Automatically discovers Hive
style partitioned directories
Partitioning
SQL
CREATE TABLE ratings
USING PARQUET
PARTITIONED BY (year, genre)
AS SELECT artist, rating, year, genre
FROM music
DataFrame API
spark
  .table("music")
  .select('artist, 'rating, 'year, 'genre)
  .write
  .format("parquet")
  .partitionBy("year", "genre")
  .saveAsTable("ratings")
26
Partitioning
(diagram: the same year=2017/ directory tree with genre=classic/ and genre=folk/ subdirectories of album files)
27
Filter predicates
Use simple filter predicates
containing partition columns to
leverage partition pruning
Partitioning
Filter predicates
• year = 2000 AND genre = 'folk'
• year > 2000 AND rating > 3
• year > 2000 OR genre <> 'rock'
28
Partitioning
29
Filter predicates that cannot leverage partition pruning
• year > 2000 OR rating = 5
• year > rating
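A quick way to check whether a predicate is actually pruned, sketched against the ratings table defined earlier:

// year and genre are partition columns, so this filter is evaluated against directory
// names only; explain() should list it under PartitionFilters (exact plan text varies
// by Spark version).
spark.table("ratings")
  .where("year = 2000 AND genre = 'folk'")
  .explain()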
Partitioning
Avoid excessive partitions
• Stress metastore for persisted tables
• Stress file system when reading directly from the file system
• Suggestions
• Avoid using too many partition columns
• Avoid using partition columns with too many distinct values
– Try hashing the values
– E.g., partition by first letter of first name rather than first name
30
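A sketch of the coarser-derived-value idea (table and columns are hypothetical):

import org.apache.spark.sql.functions.{col, lower, substring}

spark.table("users")
  .withColumn("name_prefix", lower(substring(col("first_name"), 1, 1)))
  .write.format("parquet")
  .partitionBy("name_prefix")   // at most a few dozen partitions instead of one per name
  .saveAsTable("users_by_prefix")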
Partitioning
Scalable partition handling
Using persisted partitioned tables with Spark 2.1+
• Per-partition metadata gets persisted into the metastore
• Avoids unnecessary partition discovery (esp. valuable for S3)
Check our blog post for more details
31
Bucketing
Overview
• Pre-shuffles and optionally pre-sorts the data while writing
• Layout information gets persisted in the metastore
• Avoids shuffling and sorting when joining large datasets
• Only available for persisted tables
32
Bucketing
SQL
CREATE TABLE ratings
USING PARQUET
PARTITIONED BY (year, genre)
CLUSTERED BY (rating) SORTED BY (rating) INTO 5 BUCKETS
AS SELECT artist, rating, year, genre
FROM music
DataFrame
ratings
  .select('artist, 'rating, 'year, 'genre)
  .write
  .format("parquet")
  .partitionBy("year", "genre")
  .bucketBy(5, "rating")
  .sortBy("rating")
  .saveAsTable("ratings")
33
Bucketing
In combo with columnar formats
• Bucketing
• Per-bucket sorting
• Columnar formats
• Efficient data skipping based on min/max statistics
• Works best when the searched columns are sorted
34
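To illustrate the "avoids shuffling and sorting when joining" point from the overview, a sketch with two hypothetical tables bucketed the same way on the join key:

// Disable auto-broadcast only so this small demo actually plans a sort-merge join.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

spark.range(0, 1000000)
  .selectExpr("id", "id % 100 AS artist_id")
  .write.format("parquet")
  .bucketBy(8, "artist_id").sortBy("artist_id")
  .saveAsTable("plays_bucketed")

spark.range(0, 100)
  .selectExpr("id AS artist_id", "concat('artist_', id) AS name")
  .write.format("parquet")
  .bucketBy(8, "artist_id").sortBy("artist_id")
  .saveAsTable("artists_bucketed")

// Both sides are bucketed by artist_id into the same number of buckets, so the join
// can read matching buckets directly; no Exchange should appear in the plan.
spark.table("plays_bucketed")
  .join(spark.table("artists_bucketed"), "artist_id")
  .explain()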
Bucketing
35
Bucketing
36
(diagram: sorted bucket files annotated with min/max statistics, e.g., min=0, max=99; min=100, max=199; min=200, max=249)
Bucketing
In combo with columnar formats
Perfect combination, makes your Spark jobs FLY!
37
More tips
38
File size and compaction
Avoid small files
• Cause excessive parallelism
• Spark 2.x improves this by packing small files
• Cause extra file metadata operations
• Particularly bad when hosted on S3
39
File size and compaction
40
How to control output file sizes
• In general, one task in the output stage writes one file
• Tune the parallelism of the output stage
• coalesce(N), for
• Reducing parallelism for small jobs
• repartition(N), for
• Increasing parallelism for all jobs, or
• Reducing parallelism of the final output stage for large jobs
• Still preserves high parallelism for previous stages
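A sketch of both knobs (the N values and paths are hypothetical):

val events = spark.read.parquet("/data/events")

// Small job: fewer, larger output files without adding a shuffle.
events.coalesce(8).write.parquet("/data/events_few_files")

// Large job: shuffle once before the write, so earlier stages keep their parallelism
// while the output stage writes roughly 64 files (one per task).
events.repartition(64).write.parquet("/data/events_64_files")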
True story
Customer
• Spark ORC Read Performance is much slower than Parquet
• The same query took
• 3 seconds on a Parquet dataset
• 4 minutes on an equivalent ORC dataset
41
True story
Me
• Ran a simple count(*), which took
• Seconds on the Parquet dataset with a handful of IO requests
• 35 minutes on the ORC dataset with 10,000s of IO requests
• Most task execution threads are reading ORC stripe footers
42
True story
43
True story
44
import org.apache.hadoop.hive.ql.io.orc._
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

val conf = new Configuration

// Count the ORC stripes in a file by reading its metadata section.
def countStripes(file: String): Int = {
  val path = new Path(file)
  val reader = OrcFile.createReader(path, OrcFile.readerOptions(conf))
  val metadata = reader.getMetadata
  metadata.getStripeStatistics.size   // one StripeStatistics entry per stripe
}
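Hypothetical usage (the file paths are placeholders):

Seq("s3://bucket/dataset/part-00000.orc", "s3://bucket/dataset/part-00001.orc")
  .map(f => f -> countStripes(f))
  .foreach { case (file, stripes) => println(s"$file: $stripes stripes") }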
True story
45
Maximum file size: ~15 MB
Maximum ORC stripe count per file: ~1,400
True story
46
Root cause
Malformed (but not corrupted) ORC dataset
• ORC readers read the footer first before consuming a stripe
• ~1,400 stripes within a single file as small as 15 MB
• ~1,400 x 2 read requests issued to S3 for merely 15 MB of data
True story
47
Root cause
Malformed (but not corrupted) ORC dataset
• ORC readers read the footer first before consuming a stripe
• ~1,400 stripes within a single file as small as 15 MB
• ~1,400 x 2 read requests issued to S3 for merely 15 MB of data
Much worse than even CSV, not to mention Parquet
True story
48
Why?
• Tiny ORC files (~10 KB) generated by Streaming jobs
• Resulting in one tiny ORC stripe inside each ORC file
• The footers might take even more space than the actual data!
True story
49
Why?
Tiny files got compacted into larger ones using
ALTER TABLE ... PARTITION (...) CONCATENATE;
The CONCATENATE command just, well, concatenated those tiny
stripes and produced larger (~15 MB) files with a huge number of
tiny stripes.
True story
50
Lessons learned
Again, avoid writing small files in columnar formats
• Output files using CSV or JSON for Streaming jobs
• For better write path performance
• Compact small files into large chunks of columnar files later
• For better read path performance
True story
51
The cure
Simply read the ORC dataset and write it back using
spark.read.orc(input).write.orc(output)
so that stripes are adjusted to more reasonable sizes.
Schema evolution
Columns come and go
• Never ever change the data type of a published column
• Columns with the same name should have the same data type
• If you really dislike the data type of some column
• Add a new column with a new name and the right data type
• Deprecate the old one
• Optionally, drop it after updating all downstream consumers
52
Schema evolution
Columns come and go
Spark built-in data sources that support schema evolution
• JSON
• Parquet
• ORC
53
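For Parquet in particular, schema merging across files has to be requested explicitly; a sketch (hypothetical path):

// mergeSchema is off by default because reconciling footers across many files is costly.
val merged = spark.read
  .option("mergeSchema", "true")
  .parquet("/data/ratings_evolving")

merged.printSchema()   // the union of the columns seen across all files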
Schema evolution
Common columnar formats are less tolerant of data type
mismatch. E.g.:
• INT cannot be promoted to LONG
• FLOAT cannot be promoted to DOUBLE
JSON is more tolerant, though
• LONG → DOUBLE → STRING
54
True story
Customer
Parquet dataset corrupted!!! HALP!!!
55
True story
What happened?
Original schema
• {col1: DECIMAL(19, 4), col2: INT}
Accidentally appended data with schema
• {col1: DOUBLE, col2: DOUBLE}
All files written into the same directory
56
True story
What happened?
Common columnar formats are less tolerant of data type
mismatch. E.g.:
• INT cannot be promoted to LONG
• FLOAT cannot be promoted to DOUBLE
Parquet considered these schemas incompatible and refused to
merge them.
57
True story
BTW
JSON schema inference is more tolerant
• LONG → DOUBLE → STRING
However
• JSON is NOT suitable for analytics scenarios
• Schema inference is unreliable, not suitable for production
58
True story
The cure
Correct the schema
• Filter out all the files with the wrong schema
• Rewrite those files using the correct schema
Exhausting because all files were appended into a single directory
59
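A sketch of that clean-up using the schemas from this story (the paths and the rule for spotting a wrong file are hypothetical):

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DecimalType

// List the data files sitting in the single, mixed directory.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val files = fs.listStatus(new Path("/data/mixed"))
  .map(_.getPath.toString)
  .filter(_.endsWith(".parquet"))

// A file is "wrong" if col1 lost its decimal type.
val badFiles = files.filterNot { f =>
  spark.read.parquet(f).schema("col1").dataType.isInstanceOf[DecimalType]
}

// Rewrite just those files with the intended types, into a separate location.
spark.read.parquet(badFiles: _*)
  .withColumn("col1", col("col1").cast("decimal(19,4)"))
  .withColumn("col2", col("col2").cast("int"))
  .write.parquet("/data/fixed")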
True story
Lessons learned
• Be very careful on the write path
• Consider partitioning when possible
• Better read path performance
• Easier to fix the data when something goes wrong
60
Recap
File formats
• Compression schemes
• Columnar (Parquet, ORC)
• Semi-structured (JSON, CSV)
• Raw text format
Directory layout
• Partitioning
• Bucketing
Other tips
• File sizes and compaction
• Schema evolution
61
UNIFIED ANALYTICS PLATFORM
Try Apache Spark in Databricks!
• Collaborative cloud environment
• Free version (community edition)
62
DATABRICKS RUNTIME 3.0
• Apache Spark - optimized for the cloud
• Caching and optimization layer - DBIO
• Enterprise security - DBES
Try for free today.
databricks.com
Early draft available
for free today!
go.databricks.com/book
63
Thank you
Q & A
64