Optimizing Delta/Parquet Data Lakes for Apache Spark


Matthew Powers, Prognos Health

Agenda
• Why Delta?
• Delta basics and transaction log
• Compacting Delta lakes
• Vacuuming old files
• Partitioning Delta lakes
• Deleting rows
• Persisting transformations in columns

About
MungingData
• Time travel
• Compacting
• Vacuuming
• Update columns

Contact me
• GitHub: MrPowers
• Email: matthewkevinpowers@gmail.com
• Delta Slack channel
• Open source hacking

What is Delta Lake?
• Parquet + transaction log
• Provides awesome features for free!

Delta Lake =!= Databricks Delta
https://github.com/delta-io/delta/issues/49

TL;DR
• 1 GB files
• No nested directories

Delta Lake Slack says 1GB files

Databricks Delta autoOptimize

Why does compaction speed up lakes?
• Parquet: files need to be listed before they are read. Listing is expensive in object stores.
• Delta: data is read via the transaction log.
• Easier for Spark to read partitioned lakes into memory partitions.

_delta_log/00000000000000000000.json

_delta_log/00000000000000000001.json
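
A minimal sketch of how a reader could replay commit files like the two above to find the files that currently make up the table. Each line of a commit file is a JSON action. The `add` and `remove` actions and their `path`, `size`, and `deletionTimestamp` fields come from the Delta transaction log format. The file names, sizes, and timestamp below are made up for illustration, and the second commit is assumed to be a compaction.

```python
import json


def live_files(commits):
    """Replay commits in order and return {path: size_in_bytes} for files still in the table."""
    files = {}
    for commit in commits:
        for line in commit.splitlines():
            if not line.strip():
                continue
            action = json.loads(line)
            if "add" in action:
                files[action["add"]["path"]] = action["add"]["size"]
            elif "remove" in action:
                files.pop(action["remove"]["path"], None)
            # Other actions (commitInfo, metaData, protocol) do not change the file set.
    return files


# Hypothetical commit 0: the initial write adds two small files.
commit_0 = "\n".join([
    json.dumps({"add": {"path": "part-00000.parquet", "size": 400, "dataChange": True}}),
    json.dumps({"add": {"path": "part-00001.parquet", "size": 600, "dataChange": True}}),
])

# Hypothetical commit 1: a compaction removes both files and adds one larger file.
commit_1 = "\n".join([
    json.dumps({"remove": {"path": "part-00000.parquet", "deletionTimestamp": 1571000000000, "dataChange": False}}),
    json.dumps({"remove": {"path": "part-00001.parquet", "deletionTimestamp": 1571000000000, "dataChange": False}}),
    json.dumps({"add": {"path": "part-00002.parquet", "size": 1000, "dataChange": False}}),
])

table = live_files([commit_0, commit_1])
```

The removed files stay on disk after the second commit. They are only unreferenced by the latest version of the table, which is what makes time travel possible and what vacuuming later cleans up.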

Compacting Delta lakes without breaking downstream apps
https://github.com/delta-io/delta/issues/146

Delta Lake Vacuum
• Removes files that were marked for removal longer ago than the retention period
• Default retention period is 7 days
• Does not improve query performance
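
The retention rule above can be sketched in plain Python. This is an illustration of the rule only, not the Delta Lake implementation. The function name, file names, and dates are hypothetical; the 7-day default is the one stated on the slide.

```python
from datetime import datetime, timedelta, timezone

DEFAULT_RETENTION = timedelta(days=7)  # default retention period from the slide


def files_to_vacuum(removed, now, retention=DEFAULT_RETENTION):
    """Return paths whose removal time is older than the retention cutoff.

    `removed` maps a file path to the time it was marked for removal.
    """
    cutoff = now - retention
    return sorted(path for path, removed_at in removed.items() if removed_at < cutoff)


# Hypothetical example: one file removed 19 days ago, one removed 2 days ago.
now = datetime(2019, 10, 20, tzinfo=timezone.utc)
removed = {
    "part-00000.parquet": datetime(2019, 10, 1, tzinfo=timezone.utc),
    "part-00001.parquet": datetime(2019, 10, 18, tzinfo=timezone.utc),
}
eligible = files_to_vacuum(removed, now)
```

Only files that the current table version no longer references are candidates, which is consistent with the point that vacuuming frees storage but does not speed up queries.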

Optimal number of partitions (delta)
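
A hedged sketch of the arithmetic behind the 1 GB file target: divide the total size of the live files by 1 GB and round up. For a Delta lake the sizes can be taken from the `size` field of the `add` actions in the transaction log. The function name and the example sizes are hypothetical.

```python
import math

ONE_GB = 1024 ** 3


def num_1gb_partitions(file_sizes):
    """Number of ~1 GB files needed to hold the given file sizes (in bytes), at least 1."""
    total_bytes = sum(file_sizes)
    return max(1, math.ceil(total_bytes / ONE_GB))


# Hypothetical lake: five files of 0.5 GB each, 2.5 GB in total.
sizes = [ONE_GB // 2] * 5
partitions = num_1gb_partitions(sizes)
```

The result is the number to pass to a repartition step before rewriting the data.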

Optimal number of partitions (parquet)
https://github.com/MrPowers/spark-daria/blob/master/src/main/scala/com/github/mrpowers/spark/daria/utils/DirHelpers.scala
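
A plain Parquet lake has no transaction log, so the total size has to come from listing the directory. The sketch below shows that idea on a local filesystem with Python's standard library. It is not the spark-daria `DirHelpers` code linked above, and the function names are hypothetical.

```python
import math
import os


def parquet_bytes(root):
    """Total size in bytes of all .parquet files under root, including nested partition directories."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".parquet"):
                total += os.path.getsize(os.path.join(dirpath, name))
    return total


def num_partitions(root, target_bytes=1024 ** 3):
    """Number of files of roughly target_bytes (1 GB by default) needed for the lake, at least 1."""
    return max(1, math.ceil(parquet_bytes(root) / target_bytes))
```

On an object store the listing would go through the store's own API instead of `os.walk`, and that listing is the expensive step.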

Why partition data lakes?
• Data skipping
• Massively improves query performance
• I’ve seen queries run 50-100 times faster on partitioned lakes
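
A minimal sketch of why partitioning enables data skipping. With Hive-style directory names such as `country=Russia`, the partition value is part of each file path, so files for other partitions can be excluded without opening them. The helper names and paths are hypothetical.

```python
def partition_values(path):
    """Parse Hive-style key=value directory segments from a relative file path."""
    values = {}
    for segment in path.split("/")[:-1]:  # the last segment is the file name
        if "=" in segment:
            key, value = segment.split("=", 1)
            values[key] = value
    return values


def prune(paths, **wanted):
    """Keep only the paths whose partition values match every requested key=value."""
    return [
        p for p in paths
        if all(partition_values(p).get(k) == v for k, v in wanted.items())
    ]


# Hypothetical file listing for a lake partitioned by country.
paths = [
    "country=Russia/part-0.parquet",
    "country=France/part-0.parquet",
    "country=Russia/part-1.parquet",
]
russia = prune(paths, country="Russia")
```

Only the two Russia files would need to be read. The France file is skipped on the basis of its path alone.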

Filtering unpartitioned lake
== Physical Plan ==
Project [first_name#12, last_name#13, country#14]
+- Filter (((isnotnull(country#14) && isnotnull(first_name#12)) && (country#14 = Russia)) && StartsWith(first_name#12, M))
   +- FileScan csv [first_name#12,last_name#13,country#14]
        Batched: false,
        Format: CSV,
        Location: InMemoryFileIndex[file:/Users/powers/Documents/tmp/blog_data/people.csv],
        PartitionFilters: [],
        PushedFilters: [IsNotNull(country), IsNotNull(first_name), EqualTo(country,Russia), StringStartsWith(first_name,M)],
        ReadSchema: struct

_delta_log/00000000000000000000.json

Filtering partitioned lake
== Physical Plan ==
*(1) Project [first_name#662, last_name#663, country#664]
+- *(1) Filter (isnotnull(first_name#662) && StartsWith(first_name#662, M))
   +- *(1) FileScan parquet [first_name#662,last_name#663,country#664]
        Batched: true,
        Format: Parquet,
        Location: TahoeLogFileIndex[file:/…/tmp/europe_partitioned1],
        PartitionCount: 1,
        PartitionFilters: [isnotnull(country#664), (country#664 = Russia)],
        PushedFilters: [IsNotNull(first_name), StringStartsWith(first_name,M)],
        ReadSchema: struct<first_name:string,last_name:string>

Comparing physical plans
Unpartitioned
+- Filter (((isnotnull(country#14) && isnotnull(first_name#12)) && (country#14 = Russia)) && StartsWith(first_name#12, M))
   +- FileScan csv [first_name#12,last_name#13,country#14]
        Batched: false,
        Format: CSV,
        Location: InMemoryFileIndex[….],
        PartitionFilters: [],
        PushedFilters: [IsNotNull(country), IsNotNull(first_name), EqualTo(country,Russia), StringStartsWith(first_name,M)],
        ReadSchema: struct

Partitioned
+- Filter (isnotnull(first_name#662) && StartsWith(first_name#662, M))
   +- FileScan parquet [first_name#662,last_name#663,country#664]
        Batched: true,
        Format: Parquet,
        Location: TahoeLogFileIndex[file:/…/tmp/europe_partitioned1],
        PartitionCount: 1,
        PartitionFilters: [isnotnull(country#664), (country#664 = Russia)],
        PushedFilters: [IsNotNull(first_name), StringStartsWith(first_name,M)],
        ReadSchema: struct<first_name:string,last_name:string>

Directly grabbing the partitions is faster for Parquet lakes…
Directly grabbing partitions was 83 times faster than relying on partition filters for a simple query.
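
One likely reason for the gap is file listing: with a plain Parquet lake, reading from the root means listing every partition directory before the filter can discard any of them, while pointing the reader at a single partition directory lists only that directory. The sketch below counts listed entries on a small local lake to show the difference. It does not reproduce the 83x measurement, and the helper names, countries, and file counts are hypothetical.

```python
import os
import tempfile


def count_listed(root):
    """Number of directory entries that a full recursive listing of root touches."""
    return sum(len(dirnames) + len(filenames) for _, dirnames, filenames in os.walk(root))


def build_lake(root, countries, files_per_partition):
    """Create empty placeholder files laid out as country=<name>/part-<i>.parquet."""
    for country in countries:
        partition_dir = os.path.join(root, f"country={country}")
        os.makedirs(partition_dir)
        for i in range(files_per_partition):
            open(os.path.join(partition_dir, f"part-{i}.parquet"), "w").close()


with tempfile.TemporaryDirectory() as lake:
    build_lake(lake, ["Russia", "France", "Spain", "Italy"], files_per_partition=5)
    # Reading from the root: every partition directory and file is listed.
    whole_lake = count_listed(lake)
    # Reading the partition path directly: only that directory is listed.
    one_partition = count_listed(os.path.join(lake, "country=Russia"))
```

With four partitions the saving is small, but it grows with the number of partitions and files, and each listing call is slow on an object store.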

Real partitioned data lake
• Updates every 3 hours
• Has 5 million files
• 15,000 files are being added every day
• Still great for a lot of queries

Creating partitioned lake (2/3)

Partitioned lake on disk (2/3)

Creating partitioned lake (3/3)

Incrementally updating partitioned lakes
• Small file problem grows quickly
• Compaction is hard
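
To make the small file problem concrete, here is one possible way to plan a compaction within a single partition: walk the files in a stable order and start a new output group whenever adding the next file would exceed the target size. This is a greedy sketch under stated assumptions, not the algorithm used by Delta Lake or Databricks. The function name, file names, and sizes are hypothetical, and a small target stands in for 1 GB.

```python
def plan_compaction(file_sizes, target_bytes):
    """Group files so that each group's total size stays at or under target_bytes where possible.

    `file_sizes` maps a file path to its size in bytes. A single file larger
    than the target ends up in a group of its own.
    """
    groups = []
    current, current_bytes = [], 0
    for path, size in sorted(file_sizes.items()):
        if current and current_bytes + size > target_bytes:
            groups.append(current)
            current, current_bytes = [], 0
        current.append(path)
        current_bytes += size
    if current:
        groups.append(current)
    return groups


# Hypothetical partition with four small files and a target of 1000 bytes.
small_files = {"a.parquet": 400, "b.parquet": 400, "c.parquet": 300, "d.parquet": 900}
plan = plan_compaction(small_files, target_bytes=1000)
```

Each group would be rewritten as one output file. The hard part on an incrementally updated lake is doing this rewrite safely while new files keep arriving.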

We can delete rows in Delta lakes

Delta lake downsides… not many

