Matthew Powers, Prognos Health
Optimizing Delta / Parquet Data Lakes
#UnifiedDataAnalytics #SparkAISummit
Agenda
• Why Delta?
• Delta basics and transaction log
• Compacting Delta lake
• Vacuuming old files
• Partitioning Delta lakes
• Deleting rows
• Persisting transformations in columns
About
MungingData
• Time travel
• Compacting
• Vacuuming
• Update columns
Contact me
• GitHub: MrPowers
• Email: matthewkevinpowers@gmail.com
• Delta Slack channel
• Open source hacking
What is Delta lake?
• Parquet + transaction log
• Provides awesome features for free!
Delta Lake =!= Databricks Delta
https://github.com/delta-io/delta/issues/49
TL;DR
• 1 GB files
• No nested directories
Delta Lake Slack says 1GB files
Databricks Delta autoOptimize
Why does compaction speed up lakes?
• Parquet: files need to be listed before they are read. Listing is expensive in object stores.
• Delta: Data is read via the transaction log.
• Easier for Spark to read partitioned lakes into memory partitions.
Sample Data
Create Delta Data Lake
Delta Lake on Disk
_delta_log/00000000000000000000.json
Code examples
Compact Delta Data Lake
Files post-compaction
_delta_log/00000000000000000001.json
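The two log files above can be read as an ordered list of actions. Each commit file in `_delta_log` is newline-delimited JSON, and the files that make up the current table are whatever remains after replaying every `add` and `remove` action in order. The following is a minimal sketch of that replay using only the Python standard library. The file names, sizes, and commit contents are hypothetical stand-ins for what the slides showed, and the sketch ignores checkpoints and every action type other than `add` and `remove`.

```python
import json


def live_files(commits):
    """Replay commit files in version order; return {path: size} of files still in the table."""
    files = {}
    for commit in commits:  # each item is the text of one NNN.json commit file
        for line in commit.splitlines():
            if not line.strip():
                continue
            action = json.loads(line)
            if "add" in action:
                files[action["add"]["path"]] = action["add"]["size"]
            elif "remove" in action:
                # A remove only drops the file from the table; the file stays on disk.
                files.pop(action["remove"]["path"], None)
    return files


def to_commit(actions):
    """Serialize a list of actions the way a commit file stores them (one JSON object per line)."""
    return "\n".join(json.dumps(a) for a in actions)


# Hypothetical version 0: the initial write produced three small files.
commit_0 = to_commit([
    {"commitInfo": {"operation": "WRITE"}},
    {"add": {"path": "part-00000.snappy.parquet", "size": 1000, "dataChange": True}},
    {"add": {"path": "part-00001.snappy.parquet", "size": 1000, "dataChange": True}},
    {"add": {"path": "part-00002.snappy.parquet", "size": 1000, "dataChange": True}},
])

# Hypothetical version 1: compaction removes the small files and adds one larger file.
commit_1 = to_commit([
    {"remove": {"path": "part-00000.snappy.parquet", "dataChange": False}},
    {"remove": {"path": "part-00001.snappy.parquet", "dataChange": False}},
    {"remove": {"path": "part-00002.snappy.parquet", "dataChange": False}},
    {"add": {"path": "part-compacted.snappy.parquet", "size": 3000, "dataChange": False}},
])

print(live_files([commit_0]))
print(live_files([commit_0, commit_1]))
```

Replaying only `commit_0` gives the table as of version 0, which is the idea behind time travel: older versions stay readable for as long as the removed files are still on disk.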
Compacting Delta lakes without breaking downstream apps
https://github.com/delta-io/delta/issues/146
Delta Lake Vacuum
• Deletes files that are marked for removal and older than the retention period
• Default retention period is 7 days
• Saves storage, but is not going to improve performance
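The retention rule in the bullets above can be sketched as a simple filter: a file qualifies only if it has been marked for removal and that mark is older than the retention period. This is an illustrative model, not Delta Lake's implementation. The function name, the input shape (a mapping of path to removal time), and the sample paths are all assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone

DEFAULT_RETENTION = timedelta(days=7)  # the default retention period from the slide


def vacuum_candidates(removed, now, retention=DEFAULT_RETENTION):
    """Return paths whose removal mark is older than the retention period.

    removed: {path: datetime when the file was marked for removal} (hypothetical input shape)
    """
    cutoff = now - retention
    return sorted(path for path, removed_at in removed.items() if removed_at < cutoff)


now = datetime(2019, 10, 20, tzinfo=timezone.utc)
removed = {
    "part-00000.snappy.parquet": now - timedelta(days=10),  # old enough to delete
    "part-00001.snappy.parquet": now - timedelta(days=8),   # old enough to delete
    "part-00002.snappy.parquet": now - timedelta(days=2),   # still inside the retention window
}

print(vacuum_candidates(removed, now))
```

Files that are still part of the current table never appear in `removed`, which is why vacuuming changes storage cost and time-travel range rather than query speed.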
Vacuum Delta Data Lake
Files post-vacuum
Optimal number of partitions (Delta)
spark-daria helps!
spark-daria on GitHub
Optimal number of partitions (parquet)
https://github.com/MrPowers/spark-daria/blob/master/src/main/scala/com/github/mrpowers/spark/daria/utils/DirHelpers.scala
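The arithmetic behind a partition count is short: measure how many bytes the lake holds and divide by the target file size, which the deck puts at 1 GB. The sketch below shows that calculation in plain Python. It is not the spark-daria helper linked above; the function names and the `target_gb` parameter are assumptions chosen for illustration.

```python
import math
import os

BYTES_PER_GB = 1024 ** 3


def directory_size_bytes(root):
    """Sum the sizes of all files under root (a missing directory counts as 0 bytes)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            total += os.path.getsize(os.path.join(dirpath, name))
    return total


def num_partitions(total_bytes, target_gb=1.0):
    """Number of output files needed so that each file is at most about target_gb."""
    return max(1, math.ceil(total_bytes / (target_gb * BYTES_PER_GB)))


# A lake holding 10.5 GB of data needs 11 files of up to 1 GB each.
print(num_partitions(int(10.5 * BYTES_PER_GB)))
```

For a plain Parquet lake the directory size is a reasonable input. For a Delta lake the directory can still contain files that were removed but not yet vacuumed, so summing the sizes of the files in the current table version is likely the safer input.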
Why partition data lakes?
• Data skipping
• Massively improve query performance
• I’ve seen queries run 50-100 times faster on partitioned lakes
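Data skipping on a partitioned lake works because the partition value is encoded in the directory name, so whole directories can be ruled out without opening a single file. The sketch below models that with Hive-style `key=value` path segments. The `country` column and the `Russia` value come from the deck's sample data; the file names and both helper functions are hypothetical.

```python
def partition_values(path):
    """Parse Hive-style key=value directory segments from a relative file path."""
    values = {}
    for segment in path.split("/")[:-1]:  # the last segment is the file name
        key, sep, value = segment.partition("=")
        if sep:
            values[key] = value
    return values


def prune(paths, **wanted):
    """Keep only the files whose partition values match every requested key."""
    return [
        p for p in paths
        if all(partition_values(p).get(k) == v for k, v in wanted.items())
    ]


paths = [
    "country=Russia/part-00000.snappy.parquet",
    "country=France/part-00001.snappy.parquet",
    "country=Russia/part-00002.snappy.parquet",
    "country=Germany/part-00003.snappy.parquet",
]

# Only the files under country=Russia need to be read for a filter on country = Russia.
print(prune(paths, country="Russia"))
```

A filter on a column that is not a partition key, such as `first_name`, cannot be answered from the paths alone, so it still has to be applied to the rows inside the surviving files.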
Sample data
Filtering unpartitioned lake
== Physical Plan ==
Project [first_name#12, last_name#13, country#14]
+- Filter (((isnotnull(country#14) && isnotnull(first_name#12)) && (country#14 = Russia)) && StartsWith(first_name#12, M))
   +- FileScan csv [first_name#12,last_name#13,country#14]
        Batched: false,
        Format: CSV,
        Location: InMemoryFileIndex[file:/Users/powers/Documents/tmp/blog_data/people.csv],
        PartitionFilters: [],
        PushedFilters: [IsNotNull(country), IsNotNull(first_name), EqualTo(country,Russia), StringStartsWith(first_name,M)],
        ReadSchema: struct
Partitioning the data lake
Partitioned lake on disk
_delta_log/00000000000000000000.json
Filtering partitioned lake
== Physical Plan ==
*(1) Project [first_name#662, last_name#663, country#664]
+- *(1) Filter (isnotnull(first_name#662) && StartsWith(first_name#662, M))
   +- *(1) FileScan parquet [first_name#662,last_name#663,country#664]
        Batched: true,
        Format: Parquet,
        Location: TahoeLogFileIndex[file:/…/tmp/europe_partitioned1],
        PartitionCount: 1,
        PartitionFilters: [isnotnull(country#664), (country#664 = Russia)],
        PushedFilters: [IsNotNull(first_name), StringStartsWith(first_name,M)],
        ReadSchema: struct<first_name:string,last_name:string>
Comparing physical plans
Unpartitioned
Project [first_name#12, last_name#13, country#14]
+- Filter (((isnotnull(country#14) && isnotnull(first_name#12)) && (country#14 = Russia)) && StartsWith(first_name#12, M))
   +- FileScan csv [first_name#12,last_name#13,country#14]
        Batched: false,
        Format: CSV,
        Location: InMemoryFileIndex[….],
        PartitionFilters: [],
        PushedFilters: [IsNotNull(country), IsNotNull(first_name), EqualTo(country,Russia), StringStartsWith(first_name,M)],
        ReadSchema: struct
Partitioned
Project [first_name#662, last_name#663, country#664]
+- Filter (isnotnull(first_name#662) && StartsWith(first_name#662, M))
   +- FileScan parquet [first_name#662,last_name#663,country#664]
        Batched: true,
        Format: Parquet,
        Location: TahoeLogFileIndex[file:/…/tmp/europe_partitioned1],
        PartitionCount: 1,
        PartitionFilters: [isnotnull(country#664), (country#664 = Russia)],
        PushedFilters: [IsNotNull(first_name), StringStartsWith(first_name,M)],
        ReadSchema: struct<first_name:string,last_name:string>
Directly grabbing the partitions is faster for Parquet lakes…
Directly grabbing partitions was 83 times faster than relying on partition filters for a simple query
Real partitioned data lake
• Updates every 3 hours
• Has 5 million files
• 15,000 files are being added every day
• Still great for a lot of queries
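The figures in the bullets above imply how quickly small files pile up. The arithmetic below is a rough sketch that assumes the 15,000 files per day arrive at a constant rate spread evenly over the 3-hour updates, which the slide does not state.

```python
TOTAL_FILES = 5_000_000
FILES_ADDED_PER_DAY = 15_000
HOURS_BETWEEN_UPDATES = 3

# One update every 3 hours means 8 updates per day.
updates_per_day = 24 // HOURS_BETWEEN_UPDATES

# Average number of new files written by each update.
files_per_update = FILES_ADDED_PER_DAY / updates_per_day

# Days of writes at the current rate needed to accumulate the whole lake.
days_to_reach_total = TOTAL_FILES / FILES_ADDED_PER_DAY

print(updates_per_day, files_per_update, round(days_to_reach_total))
```

Under those assumptions each update writes close to 1,900 files, and the lake's current file count corresponds to a little under a year of writes with no compaction.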
Creating partitioned lake (2/3)
Partitioned lake on disk (2/3)
Creating partitioned lake (3/3)
Incrementally updating partitioned lakes
• Small file problem grows quickly
• Compaction is hard
Filtering data from a lake
We can delete rows in Delta lakes
Deleting under the hood
Append a column on the fly
Resulting DataFrame
Append a column in Delta
Delta lake downsides… not many