Harikrishnan Kunhumveettil & Mathan Pillai
Operating and Supporting Delta Lake in Production
Who are we?
Mathan Pillai, Harikrishnan Kunhumveettil
Currently
Sr. TSE, Databricks
Areas: Spark SQL, Delta, SS
Previously
Sr. TSE, MapR
Hadoop Tech Lead, Nielsen
Currently
Sr. TSE, Databricks
Areas: Spark SQL, Delta, SS
Previously
Tech Lead, Intersys Consulting
Sr. Big Data Consultant, Saama Technologies
Agenda
■ Delta Lake in Production - Data
○ Optimize and Auto-Optimize - Overview
○ Choosing the right strategy - The What
○ Choosing the right strategy - The When
○ Choosing the right strategy - The Where
■ Delta Lake in Production - Metadata
○ Sneak Peek Into Delta Log
○ Delta Log Configs
○ Delta Log Misconception
○ Delta Log Exceptions
○ Tips & Tricks
Delta Lake in Production - Data
Optimize and Auto-Optimize - In a nutshell
OPTIMIZE
▪ Bin-packing/Compaction
▪ Handles the small file problem
▪ Idempotent
▪ Incremental
▪ Creates files of 1 GB or 10M records
▪ Controlled by optimize.maxFileSize
OPTIMIZE + ZORDER
▪ Helps in Data Skipping
▪ Uses Range Partitioning
▪ Hilbert Curve in Preview
▪ Partially incremental
▪ Supports all/new/minCubeSize
▪ Controlled by optimize.zorder.mergeStrategy.minCubeSize.threshold
Optimize Write
▪ Unintentionally referred to as Auto-optimize
▪ Introduces an extra shuffle phase
▪ Creates row-compressed data of 512 MB (binSize)
▪ Output file ~128 MB
▪ Controlled by optimizeWrite.binSize
Auto-Compaction
▪ Mini-Optimize
▪ Creates files as big as 128 MB
▪ Post-commit action
▪ Triggered when there are more than 50 files/directory
▪ Controlled by: autoCompact.minNumFiles, autoCompact.maxFileSize
Note: All configurations carry the prefix "spark.databricks.delta", e.g. spark.databricks.delta.optimizeWrite.binSize
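As a small illustration of the prefix rule in the note, the short names from this slide can be expanded into fully qualified keys. This is only a sketch: the helper name deltaConfKey is made up for the example, and the list contains just the settings named on the slide.

```scala
// Prefix shared by all the settings on this slide.
val deltaConfPrefix = "spark.databricks.delta"

// Hypothetical helper: expand a short name such as "optimizeWrite.binSize"
// into its fully qualified configuration key.
def deltaConfKey(shortName: String): String = s"$deltaConfPrefix.$shortName"

// Short names exactly as they appear on the slide.
val shortNames = Seq(
  "optimize.maxFileSize",
  "optimize.zorder.mergeStrategy.minCubeSize.threshold",
  "optimizeWrite.binSize",
  "autoCompact.minNumFiles",
  "autoCompact.maxFileSize"
)

shortNames.map(deltaConfKey).foreach(println)

// In a notebook with a live session, a key could then be inspected with
// spark.conf.getOption(deltaConfKey("optimizeWrite.binSize")).
```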
Choosing the right strategy - The What?
● Optimize writes:
○ Misconception: that it does not work with streaming workloads
○ Makes life easy for OPTIMIZE and VACUUM
○ In terms of number of files, map-only writes can be very expensive. Optimize writes can do magic!
3.2 PB, ~700 TB input data, ~400 TB new writes
Before: OPTIMIZE takes ~6-8 hours; OPTIMIZE job runs 3 times/day
With OPTIMIZE WRITE: OPTIMIZE job takes 2-3 hours; runs 4 times/day
More than 40% of resources saved on OPTIMIZE
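The savings figure can be sanity-checked with simple arithmetic on the run counts and durations quoted above. The sketch below assumes that resource use is proportional to total OPTIMIZE hours per day on a cluster of fixed size, which the slide does not state; the function names are illustrative only.

```scala
// Daily OPTIMIZE hours = runs per day x hours per run (fixed cluster size assumed).
def dailyHours(runsPerDay: Int, hoursPerRun: Double): Double = runsPerDay * hoursPerRun

// Fraction of daily hours saved when going from `before` to `after`.
def saving(before: Double, after: Double): Double = 1.0 - after / before

val beforeLow  = dailyHours(3, 6.0) // 3 x 6 = 18 hours/day
val beforeHigh = dailyHours(3, 8.0) // 3 x 8 = 24 hours/day
val afterLow   = dailyHours(4, 2.0) // 4 x 2 = 8 hours/day
val afterHigh  = dailyHours(4, 3.0) // 4 x 3 = 12 hours/day

// Compare like with like: low end against low end, high end against high end.
val savingLow  = saving(beforeLow, afterLow)   // 1 - 8/18, about 0.56
val savingHigh = saving(beforeHigh, afterHigh) // 1 - 12/24 = 0.50

println(f"before: $beforeLow%.0f-$beforeHigh%.0f h/day, after: $afterLow%.0f-$afterHigh%.0f h/day")
```

Under that assumption both like-for-like comparisons land at or above 50%, which is consistent with the "more than 40%" claim.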
Choosing the right strategy - The What?
● Z-Order vs Partition By
○ Z-order is better than creating a large number of small files.
○ More effective use of DBIO cache through the handling of less metadata
326 TB, 3 partitions, 25 million files
326 TB, 2 partitions, 650k files
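The difference between the two layouts is easier to see as an average file size. A rough calculation, assuming decimal units (1 TB = 1,000,000 MB) and that the 326 TB is spread evenly over the files:

```scala
// Average file size in MB, using decimal units (1 TB = 1,000,000 MB).
def avgFileSizeMb(tableTb: Double, numFiles: Long): Double = tableTb * 1e6 / numFiles

val manySmallFiles = avgFileSizeMb(326, 25000000L) // about 13 MB per file
val fewLargeFiles  = avgFileSizeMb(326, 650000L)   // about 502 MB per file

println(f"25 million files: $manySmallFiles%.1f MB on average")
println(f"650k files: $fewLargeFiles%.1f MB on average")
```

The same data held in roughly 40 times fewer files means far fewer file entries for the transaction log and the cache to track.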
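Choosing the right strategy - The What?
// Count the current data files per partition value to spot partitions with too many files.
import com.databricks.sql.transaction.tahoe.DeltaLog
import org.apache.hadoop.fs.Path
val deltaPath = "<table_path>"
// DeltaLog is given the path of the table's _delta_log directory.
val deltaLog = DeltaLog(spark, new Path(deltaPath + "/_delta_log"))
// allFiles holds one row per data file in the latest snapshot.
val currentFiles = deltaLog.snapshot.allFiles
// Replace "col" with the name of the partition column of interest.
display(currentFiles.groupBy("partitionValues.col").count().orderBy($"count".desc))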
Choosing the right strategy - The When?
● Auto-Optimize runs on the same cluster during/after a write.
● Optimize: a trade-off between read performance and cost
● Delay Z-Ordering if you are continuously adding data to the active partition.
○ If active reads are not on the latest partition
○ optimize.zorder.mergeStrategy.minCubeSize.threshold is 100 GB by default
○ Reducing the value makes Z-order run faster but degrades read performance
● Should I always run OPTIMIZE + VACUUM?
○ VACUUM happens on the Spark Driver.
○ Roughly 200k files/hour in ADLS
○ Roughly 300k files/hour in AWS S3
○ DRY RUN gives the estimate
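The throughput figures above allow a back-of-the-envelope estimate of how long a VACUUM will take once the number of files to delete is known. The sketch below simply divides the file count by the quoted rates; the 1.2 million file count is a made-up example value, and real rates will vary with the environment.

```scala
// Approximate deletion rates quoted on the slide, in files per hour.
val filesPerHour: Map[String, Double] = Map("ADLS" -> 200000.0, "S3" -> 300000.0)

// Estimated VACUUM duration in hours for a given number of files to delete.
def estimateVacuumHours(filesToDelete: Long, store: String): Double =
  filesToDelete / filesPerHour(store)

// Hypothetical example: 1.2 million files are eligible for deletion.
val onAdls = estimateVacuumHours(1200000L, "ADLS") // 1,200,000 / 200,000 = 6 hours
val onS3   = estimateVacuumHours(1200000L, "S3")   // 1,200,000 / 300,000 = 4 hours

println(s"ADLS: $onAdls h, S3: $onS3 h")
```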
Choosing the right strategy - The Where?
● Auto-optimize runs on the same cluster during/after a write.
● Z-ordering is CPU intensive.
○ Involves Parquet decoding and encoding
○ General purpose instances vs compute optimized clusters.
● Always use a WHERE clause in OPTIMIZE queries
● Use auto-scaling clusters for VACUUM-only workloads
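To make the WHERE-clause advice concrete, here is a minimal sketch that builds an OPTIMIZE statement limited to a slice of the table. The table name events, the column names date and eventType, and the helper optimizeStatement are all hypothetical; date is assumed to be a partition column.

```scala
// Hypothetical helper: build an OPTIMIZE statement restricted by a predicate,
// optionally followed by a ZORDER BY clause.
def optimizeStatement(table: String, predicate: String, zOrderCols: Seq[String] = Nil): String = {
  val zOrder = if (zOrderCols.isEmpty) "" else zOrderCols.mkString(" ZORDER BY (", ", ", ")")
  s"OPTIMIZE $table WHERE $predicate$zOrder"
}

// Only recent partitions are rewritten instead of the whole table.
val stmt = optimizeStatement("events", "date >= '2020-11-01'", Seq("eventType"))
println(stmt)

// In a Databricks notebook the statement could then be run with spark.sql(stmt).
```

Restricting the predicate to the partitions that actually changed keeps each run short, which matters given how CPU-intensive Z-ordering is.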
Delta Lake in Production - Metadata
Delta Lake Transaction Log
■ Sneak Peek Into Delta Log
■ Delta Log Configs
■ Delta Exceptions
■ Tips & Tricks
Sneak Peek Into Delta Log
Who What When Where
Version N Who What When Where
Version N-1 Who What When Where
Version N-2 Who What When Where
Sneak Peek Into Delta Log
Who ?
Sneak Peek Into Delta Log
What ?
Sneak Peek Into Delta Log
When ?
Sneak Peek Into Delta Log
Where ?
Sneak Peek Into Delta Log
.json files .crc files .checkpoint _last_checkpoint
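The four kinds of entries in the _delta_log directory can be told apart by name alone. The sketch below classifies file names using the usual naming scheme (a version zero-padded to 20 digits followed by .json, .crc or .checkpoint.parquet, plus the single _last_checkpoint file). The type and function names are invented for illustration, and multi-part checkpoint names are deliberately left unhandled.

```scala
// Illustrative model of what can be found in a _delta_log directory.
sealed trait LogFile
case class Commit(version: Long) extends LogFile     // <version>.json
case class Checksum(version: Long) extends LogFile   // <version>.crc
case class Checkpoint(version: Long) extends LogFile // <version>.checkpoint.parquet
case object LastCheckpoint extends LogFile           // _last_checkpoint
case class Other(name: String) extends LogFile       // anything else

// Commit files are named with the version zero-padded to 20 digits.
def commitFileName(version: Long): String = f"$version%020d.json"

// 20 digits, a dot, then one of the known suffixes.
val Versioned = """(\d{20})\.(json|crc|checkpoint\.parquet)""".r

def classify(name: String): LogFile = name match {
  case "_last_checkpoint"   => LastCheckpoint
  case Versioned(v, "json") => Commit(v.toLong)
  case Versioned(v, "crc")  => Checksum(v.toLong)
  case Versioned(v, _)      => Checkpoint(v.toLong)
  case other                => Other(other)
}

println(classify(commitFileName(10L)))
```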
Delta Log Configs
LogRetentionDuration
How long are log files (JSON) kept?
▪ %sql
ALTER TABLE delta_table_name SET TBLPROPERTIES ('delta.logRetentionDuration' = '7 days')
CheckpointRetentionDuration
How long are checkpoint files (Parquet) kept?
▪ %sql
ALTER TABLE delta_table_name SET TBLPROPERTIES ('delta.checkpointRetentionDuration' = '7 days')
Delta Exceptions
ConcurrentModificationException Analogy
You can drive in parallel on a freeway, but not in a tunnel
Delta Exceptions
ConcurrentModificationException
Verify whether concurrent updates happened to the same partition
Delta Exceptions
ConcurrentAppendException
A concurrent operation added files to the same partition that your operation read from
ConcurrentDeleteReadException
A concurrent operation deleted a file that your operation read
ConcurrentDeleteDeleteException
A concurrent operation deleted a file that your operation also deletes
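The three descriptions above can be read as three overlap checks between your transaction and one that committed first. The following is a toy model of that reading, not Delta Lake's actual conflict-detection code: the Txn class, its fields and the conflict function are invented for illustration, and the file and partition names are made up.

```scala
// Toy model of a transaction: what it read, where it added files, what it removed.
case class Txn(
  readPartitions: Set[String] = Set.empty,
  readFiles: Set[String] = Set.empty,
  addedToPartitions: Set[String] = Set.empty,
  removedFiles: Set[String] = Set.empty
)

// Name of the exception `mine` would hit if `winner` committed first, if any.
def conflict(mine: Txn, winner: Txn): Option[String] =
  if ((mine.readPartitions & winner.addedToPartitions).nonEmpty) Some("ConcurrentAppendException")
  else if ((mine.readFiles & winner.removedFiles).nonEmpty) Some("ConcurrentDeleteReadException")
  else if ((mine.removedFiles & winner.removedFiles).nonEmpty) Some("ConcurrentDeleteDeleteException")
  else None

// An update that reads and rewrites one file in a single partition.
val update = Txn(
  readPartitions = Set("date=2020-11-01"),
  readFiles = Set("part-0001.parquet"),
  removedFiles = Set("part-0001.parquet")
)
val appendSamePartition  = Txn(addedToPartitions = Set("date=2020-11-01"))
val appendOtherPartition = Txn(addedToPartitions = Set("date=2020-11-02"))
val compaction           = Txn(removedFiles = Set("part-0001.parquet"))

println(conflict(update, appendSamePartition))  // same "tunnel": conflict
println(conflict(update, appendOtherPartition)) // different "lane": no conflict
println(conflict(update, compaction))           // the file that was read is gone
```

In this model, writers that touch disjoint partitions never collide, which is the freeway half of the analogy.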
Tips & Tricks
How to find what records were added between 2 versions of a Delta table?
SELECT * FROM delta_table_name@v2 EXCEPT ALL SELECT * FROM delta_table_name@v0
Tips & Tricks
How to find what files were added in a specific version of a Delta table?
%scala
// Read the commit file for that version and list the paths of its "add" actions.
display(spark.read.json("//path-to-delta-table/_delta_log/0000000000000000000x.json")
.where("add is not null")
.select("add.path"))
Tips & Tricks
How to find which Delta commit removed a specific file?
import org.apache.spark.sql.functions.col
val oldestVersionAvailable: Long = ??? // oldest commit version still present in _delta_log
val newestVersionAvailable: Long = ??? // newest commit version
val pathToDeltaTable = ""
val pathToFileName = ""
(oldestVersionAvailable to newestVersionAvailable).foreach { version =>
  // Commit files are named with the version zero-padded to 20 digits.
  val commit = spark.read.json(f"$pathToDeltaTable/_delta_log/$version%020d.json")
  // Only commits that contain remove actions have a "remove" column.
  if (commit.columns.contains("remove")) {
    val removed = commit.where("remove is not null").select("remove.path")
    if (removed.filter(col("path").contains(pathToFileName)).count > 0)
      println(s"Commit version $version removed the file $pathToFileName")
  }
}
Feedback
Your feedback is important to us.
Don’t forget to rate and review the sessions.
