Operating and Supporting Delta Lake in Production

Harikrishnan Kunhumveettil & Mathan Pillai

Who are we?
Mathan Pillai, Harikrishnan Kunhumveettil
Currently
Sr. TSE, Databricks
Areas: Spark SQL, Delta, Structured Streaming (SS)
Previously
Sr. TSE, MapR
Hadoop Tech Lead, Nielsen
Currently
Sr. TSE, Databricks
Areas: Spark SQL, Delta, Structured Streaming (SS)
Previously
Tech Lead, Intersys Consulting
Sr. Big Data Consultant, Saama Technologies

Agenda
■ Delta Lake in Production - Data
○ Optimize and Auto-Optimize - Overview
○ Choosing the right strategy - The What
○ Choosing the right strategy - The When
○ Choosing the right strategy - The Where
■ Delta Lake in Production - Metadata
○ Sneak Peek Into Delta Log
○ Delta Log Configs
○ Delta Log Misconception
○ Delta Log Exceptions
○ Tips & Tricks

Delta Lake in Production - Data

Optimize and Auto-Optimize - In a nutshell
OPTIMIZE
▪ Bin-packing/Compaction
▪ Handles the small file problem
▪ Idempotent
▪ Incremental
▪ Creates files of 1 GB or 10M records
▪ Controlled by optimize.maxFileSize
OPTIMIZE + ZORDER
▪ Helps in Data Skipping
▪ Uses Range Partitioning
▪ Hilbert Curve in Preview
▪ Partially incremental
▪ Supports all/new/minCubeSize
▪ Controlled by optimize.zorder.mergeStrategy.minCubeSize.threshold
Optimize Write
▪ Unintentionally referred to as Auto-Optimize
▪ Introduces an extra shuffle phase
▪ Creates row-compressed data of 512 MB (binSize)
▪ Output file ~128 MB
▪ Controlled by optimizeWrite.binSize
Auto-Compaction
▪ Mini-Optimize
▪ Creates files as big as 128 MB
▪ Post-commit action
▪ Triggered when there are more than 50 files per directory
▪ Controlled by autoCompact.minNumFiles and autoCompact.maxFileSize
Note: All configurations carry the prefix "spark.databricks.delta", e.g. spark.databricks.delta.optimizeWrite.binSize
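
As a small illustration of the prefix note, the sketch below expands the short config names from this overview into full keys. The object name DeltaConfKeys and the helper fullKey are hypothetical, introduced only for this example; no values are set because the overview does not state the units each config expects.

```scala
object DeltaConfKeys {
  // Prefix stated in the note: every short name in the overview hangs off it.
  val Prefix = "spark.databricks.delta"

  // Short names exactly as they appear in the overview.
  val shortNames: Seq[String] = Seq(
    "optimize.maxFileSize",
    "optimize.zorder.mergeStrategy.minCubeSize.threshold",
    "optimizeWrite.binSize",
    "autoCompact.minNumFiles",
    "autoCompact.maxFileSize"
  )

  // Hypothetical helper: turns a short name into the full Spark conf key.
  def fullKey(shortName: String): String = s"$Prefix.$shortName"

  def main(args: Array[String]): Unit =
    shortNames.map(fullKey).foreach(println)
}
```

In a notebook, a full key produced this way could be passed to spark.conf.set together with a value chosen for the workload.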

Choosing the right strategy - The What
● Optimize writes:
○ Misconception: that it does not work with Streaming workloads
○ Makes life easy for OPTIMIZE and VACUUM
○ In terms of number of files, map-only writes can be very expensive. Optimize writes can do magic!
○ Example: 3.2 PB, ~700 TB input data, ~400 TB new writes
▪ Before Optimize Write: OPTIMIZE takes ~6-8 hours; the OPTIMIZE job runs 3 times/day
▪ After Optimize Write: the OPTIMIZE job takes 2-3 hours; OPTIMIZE runs 4 times/day
▪ More than 40% of resources saved on OPTIMIZE

● Z-Order vs Partition By
○ Z-order is better than creating a large number of small files.
○ More effective use of the DBIO cache, since less metadata has to be handled
○ Example:
▪ 326 TB, 3 partitions, 25 million files
▪ 326 TB, 2 partitions, 650k files

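// DeltaLog as packaged on Databricks runtimes
import com.databricks.sql.transaction.tahoe.DeltaLog
import org.apache.hadoop.fs.Path
val deltaPath = "<table_path>"
val deltaLog = DeltaLog(spark, new Path(deltaPath + "/_delta_log"))
// All data files referenced by the current snapshot of the table
val currentFiles = deltaLog.snapshot.allFiles
// Count files per partition value; replace "col" with the actual partition column name
display(currentFiles.groupBy("partitionValues.col").count().orderBy($"count".desc))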

Choosing the right strategy - The When
● Auto-Optimize runs on the same cluster during/after a write.
● Optimize - Trade-off between read performance and cost
● Delay Z-Ordering if you are continuously adding data to the active partition.
○ Applies if active reads are not on the latest partition
○ optimize.zorder.mergeStrategy.minCubeSize.threshold is 100 GB by default
○ Reducing the value makes Z-ordering run faster, but degrades read performance
● Should I always run OPTIMIZE + VACUUM?
○ VACUUM happens on the Spark Driver.
○ Deletes roughly 200k files/hour in ADLS
○ Deletes roughly 300k files/hour in AWS S3
○ DRY RUN gives an estimate of the files to be deleted
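
To make the VACUUM figures concrete, here is a back-of-the-envelope sketch that turns a file count into an estimated duration. It assumes the per-hour numbers above are deletion rates and that the rate stays constant; the object name VacuumEstimate and the file count used in main are hypothetical.

```scala
object VacuumEstimate {
  // Rough rates from the bullets above, in files per hour.
  val AdlsFilesPerHour = 200000L
  val S3FilesPerHour = 300000L

  /** Estimated hours to delete `files` files at a constant `filesPerHour` rate. */
  def hours(files: Long, filesPerHour: Long): Double = {
    require(filesPerHour > 0, "rate must be positive")
    files.toDouble / filesPerHour
  }

  def main(args: Array[String]): Unit = {
    val files = 3000000L // hypothetical count, e.g. taken from a DRY RUN
    println(f"ADLS: ${hours(files, AdlsFilesPerHour)}%.1f h")
    println(f"S3:   ${hours(files, S3FilesPerHour)}%.1f h")
  }
}
```

An estimate like this helps decide whether VACUUM can share a schedule with OPTIMIZE or needs its own window.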

Choosing the right strategy - The Where
● Auto-optimize runs on the same cluster during/after a write.
● Z-ordering is CPU intensive.
○ Involves Parquet decoding and encoding
○ General purpose instances vs compute optimized clusters.
● Always have a WHERE clause for OPTIMIZE queries
● Auto-scaling clusters for VACUUM-only workloads
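
The advice to always restrict OPTIMIZE with a WHERE clause can be sketched as a small statement builder. This is only an illustration: the object name OptimizeSql, the table name events, and the columns date and eventType are hypothetical, and the predicate is assumed to be on a partition column.

```scala
object OptimizeSql {
  /** Builds an OPTIMIZE statement limited by a predicate,
    * optionally followed by a ZORDER BY clause. */
  def optimize(table: String, where: String, zorderBy: Seq[String] = Nil): String = {
    val zorder =
      if (zorderBy.isEmpty) ""
      else zorderBy.mkString(" ZORDER BY (", ", ", ")")
    s"OPTIMIZE $table WHERE $where$zorder"
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical table and columns, for illustration only.
    println(optimize("events", "date >= '2021-01-01'"))
    println(optimize("events", "date >= '2021-01-01'", Seq("eventType")))
  }
}
```

In a notebook the resulting string could be handed to spark.sql, so that each run only touches the partitions that actually received new data.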

Delta Lake in Production - Metadata

Delta Lake Transaction Log
■ Sneak Peek Into Delta Log
■ Delta Log Configs
■ Delta Exceptions
■ Tips & Tricks

Sneak Peek Into Delta Log
Version N: Who, What, When, Where
Version N-1: Who, What, When, Where
Version N-2: Who, What, When, Where
Files in the log: .json files, .crc files, .checkpoint files, _last_checkpoint
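
The file types listed above follow a fixed naming scheme in _delta_log: the table version, zero-padded to 20 digits, followed by the extension. The sketch below shows that scheme for single-file checkpoints; the object name DeltaLogFiles and its helpers are hypothetical names for this example.

```scala
object DeltaLogFiles {
  // Commit file for a version: 20-digit zero-padded version + ".json".
  def commitFile(version: Long): String = f"$version%020d.json"

  // Checksum file written alongside a commit.
  def crcFile(version: Long): String = f"$version%020d.crc"

  // Single-file checkpoint for a version (stored as Parquet).
  def checkpointFile(version: Long): String = f"$version%020d.checkpoint.parquet"

  // Pointer to the most recent checkpoint; the name does not depend on the version.
  val LastCheckpoint = "_last_checkpoint"

  def main(args: Array[String]): Unit = {
    val version = 12L // hypothetical version number
    Seq(commitFile(version), crcFile(version), checkpointFile(version), LastCheckpoint)
      .foreach(println)
  }
}
```

Knowing the scheme makes it easy to open the commit file for a given version directly, as the Tips & Tricks snippets later in the deck do.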

Delta Log Configs
LogRetentionDuration (JSON log files)
How long are log files kept?
▪ %sql
ALTER TABLE delta-table-name
SET TBLPROPERTIES
('delta.logRetentionDuration' = '7 days')
CheckpointRetentionDuration (Parquet checkpoint files)
How long are checkpoint files kept?
▪ %sql
ALTER TABLE delta-table-name
SET TBLPROPERTIES
('delta.checkpointRetentionDuration' = '7 days')

Delta Exceptions
ConcurrentModificationException Analogy
You can drive in parallel on a freeway, but not in a tunnel

Delta Exceptions
ConcurrentModificationException
Verify whether concurrent updates happened to the same partition
ConcurrentAppendException
A concurrent operation adds files to the same partition that your operation reads
ConcurrentDeleteReadException
A concurrent operation deleted a file that your operation read
ConcurrentDeleteDeleteException
A concurrent operation deleted a file that your operation also deletes
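
The freeway-versus-tunnel idea can be expressed as a toy model: two operations are safe when they touch different partitions and conflict when they share one. The sketch below is a deliberately simplified illustration under that assumption, not Delta's actual conflict detection; the object name ConflictModel and the partition strings are hypothetical.

```scala
object ConflictModel {
  /** Toy rule: a concurrent commit conflicts with our operation if it
    * changed at least one partition that our operation read. */
  def conflicts(partitionsWeRead: Set[String], partitionsTheyChanged: Set[String]): Boolean =
    partitionsWeRead.intersect(partitionsTheyChanged).nonEmpty

  def main(args: Array[String]): Unit = {
    // Hypothetical partition values, for illustration only.
    val ours = Set("date=2021-01-01")
    println(conflicts(ours, Set("date=2021-01-02"))) // different lanes
    println(conflicts(ours, Set("date=2021-01-01"))) // same tunnel
  }
}
```

Under this model, making each writer's partition predicate explicit is what keeps concurrent jobs in separate lanes.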

Tips & Tricks
How to find what records were added between 2 versions of a Delta table?
%sql
SELECT * FROM delta_table_name@v2 EXCEPT ALL SELECT * FROM delta_table_name@v0

Tips & Tricks
How to find what files were added in a specific version of a Delta table?
%scala
// Each line of a commit JSON is one action; keep only the "add" actions
display(spark.read.json("//path-to-delta-table/_delta_log/0000000000000000000x.json")
.where("add is not null")
.select("add.path"))

Tips & Tricks
How to find which Delta commit removed a specific file?
%scala
import org.apache.spark.sql.functions.col
// Fill in the version range and the paths before running
val oldestVersionAvailable =
val newestVersionAvailable =
val pathToDeltaTable = ""
val pathToFileName = ""
(oldestVersionAvailable to newestVersionAvailable).foreach { version =>
  val commit = spark.read.json(f"$pathToDeltaTable/_delta_log/$version%020d.json")
  // Commits without any remove action have no "remove" column at all
  if (commit.columns.contains("remove")) {
    val removed = commit.where("remove is not null").select("remove.path")
    if (removed.filter(col("path").contains(pathToFileName)).count > 0)
      println(s"Commit Version $version removed the file $pathToFileName")
  }
}

