Presented By:
Kundan Kumar
Software Consultant
Spark With Delta Lake
KnolX Etiquettes
Lack of etiquette and manners is a huge turn off.
Punctuality
Respect KnolX session timings; you are requested not to join sessions more than 5 minutes after the session start time.
Feedback
Make sure to submit constructive feedback for all sessions, as it is very helpful for the presenter.
Silent Mode
Keep your mobile devices in silent mode; feel free to step out of the session if you need to attend an urgent call.
Avoid Disturbance
Avoid unwanted chit-chat during the session.
Agenda
01 What & Why Delta Lake
02 Features Of Delta Lake
03 Delta Lake Transaction Log
04 Demo
Brings Data Reliability and Performance to Data Lakes
What is Data Lake?
A Data Lake is a centralized repository that can store large amounts of structured,
semi-structured, and unstructured data. It is a place to store every type of data in its
native format, with no fixed limits on account or file size. It offers large data
quantities to improve analytic performance and native integration.
Why Delta Lake?
1. Data reliability challenges with data lakes (timeline slide, 2013-2018): failed
production jobs, orphaned data, and no schema enforcement.
2. ACID transactions: a critical feature missing from Spark.
What is Delta Lake?
Delta Lake is an open-source storage layer that sits on top of your existing data lake
and is fully compatible with Apache Spark APIs. It brings ACID transactions to Apache
Spark and big data workloads.
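As a quick illustration, here is a minimal sketch of writing and reading a Delta table from Spark in Scala. It assumes a local SparkSession with the Delta Lake library on the classpath; the table path under /tmp is purely illustrative.

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .appName("delta-intro")
    .master("local[*]")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
      "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()

  import spark.implicits._

  // Write a small DataFrame as a Delta table.
  Seq((1, "alice"), (2, "bob")).toDF("id", "name")
    .write.format("delta").mode("overwrite").save("/tmp/delta/people")

  // Read it back like any other Spark data source.
  spark.read.format("delta").load("/tmp/delta/people").show()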
Features of Delta Lake
1. ACID Transaction
2. Scalable metadata handling
3. Open Format
4. Time Travel
5. Schema Enforcement & Evolution
6. Updates and Deletes
7. Unified Batch and Streaming
ACID Transactions
Delta Lake brings Atomicity, Consistency, Isolation and Durability (ACID)
transactions to your data lakes. It provides serializability, the strongest isolation
level, and ensures that readers never see inconsistent data.
Transaction Log
The Delta Lake transaction log (also known as the DeltaLog) is an ordered
record of every transaction that has ever been performed on a Delta Lake
table since its inception, and it acts as the single source of truth.
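To make this concrete, the history recorded in the transaction log can be inspected with a one-liner; this sketch assumes the illustrative /tmp/delta/people table from the earlier example.

  // Each commit lives as a JSON file under <table path>/_delta_log/,
  // e.g. 00000000000000000000.json, 00000000000000000001.json, ...
  // DESCRIBE HISTORY surfaces the same information through SQL.
  spark.sql("DESCRIBE HISTORY delta.`/tmp/delta/people`").show(truncate = false)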
Optimistic Concurrency Control
Optimistic concurrency control is a method of dealing with concurrent
transactions that assumes that transactions (changes) made to a table by
different users can complete without conflicting with one another.
Time Travel (Data Versioning)
Delta Lake time travel allows us to query an older snapshot of a Delta Lake table. This
time-traveling can be achieved using 2 approaches:
1. Using a version number
2. Using a timestamp
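A hedged sketch of both approaches, assuming the illustrative /tmp/delta/people table and that the version and timestamp actually exist in its history:

  // 1. Read the table as of a specific version.
  val byVersion = spark.read
    .format("delta")
    .option("versionAsOf", "0")
    .load("/tmp/delta/people")

  // 2. Read the table as of a timestamp (value shown is illustrative).
  val byTimestamp = spark.read
    .format("delta")
    .option("timestampAsOf", "2024-01-01")
    .load("/tmp/delta/people")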
Time travel has many use cases, including:
● Time travel makes it easy to do rollbacks in case of bad writes, playing an important role in fixing
mistakes in our data.
● It helps in re-creating analysis, reports, or outputs (for example, the output of a machine learning
model). This could be useful for debugging or auditing, especially in regulated industries.
● It also simplifies time-series analytics. For instance, in finding out how many new customers were
added over the last week.
Scalable Metadata Handling: In big data, even the metadata itself can be "big
data". Delta Lake treats metadata just like data, leveraging Spark's distributed
processing power to handle all of its metadata. As a result, Delta Lake can handle
petabyte-scale tables with billions of partitions and files with ease.
Open Format: All data in Delta Lake is stored in the Apache Parquet format, enabling
Delta Lake to leverage the efficient compression and encoding schemes that are
native to Parquet. Apache Parquet is column-oriented and designed to provide more
efficient columnar storage of data than row-based formats like CSV.
Example table:
id  | name | age
123 | xyz  | 21
321 | abc  | 20

Row-oriented layout:    123 xyz 21 321 abc 20
Column-oriented layout: 123 321 xyz abc 21 20
Schema Enforcement: Schema enforcement, also known as schema validation,
is a safeguard in Delta Lake that ensures data quality by rejecting writes to a
table that do not match the table’s schema. To determine whether a write to a
table is compatible, Delta Lake applies the following rules to the incoming data:
1. It cannot contain any additional columns that are not present in the target table's schema.
2. It cannot have column data types that differ from the column data types in the target table.
3. It cannot contain column names that differ only by case.
Schema Evolution: Delta Lake also lets you change a table's schema, applying
compatible changes (such as newly added columns) automatically when you opt in.
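A small sketch of enforcement versus evolution, again using the illustrative /tmp/delta/people table (columns id, name) and assuming spark.implicits._ is in scope:

  // A DataFrame with an extra column that the target table does not have.
  val withCity = Seq((3, "carol", "delhi")).toDF("id", "name", "city")

  // Schema enforcement: this append would fail with an AnalysisException,
  // because `city` is not part of the target table's schema.
  // withCity.write.format("delta").mode("append").save("/tmp/delta/people")

  // Schema evolution: opting in with mergeSchema lets Delta Lake add the
  // new column automatically.
  withCity.write
    .format("delta")
    .option("mergeSchema", "true")
    .mode("append")
    .save("/tmp/delta/people")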
Updates, Deletes and Merges
Delta Lake supports Scala / Java APIs to merge, update and delete datasets.
This allows you to easily comply with GDPR and CCPA.
Updates: We can update data that matches a predicate in a Delta Lake table.
Deletes: We can remove data that matches a predicate from a Delta Lake
table.
Merges: We can upsert data from a Spark DataFrame into a Delta Lake table
using the merge operation.
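A hedged sketch of all three operations through the Scala API, assuming the io.delta.tables package is available and reusing the illustrative /tmp/delta/people table:

  import io.delta.tables.DeltaTable
  import org.apache.spark.sql.functions.{expr, lit}

  val people = DeltaTable.forPath(spark, "/tmp/delta/people")

  // Update rows matching a predicate.
  people.update(expr("id = 1"), Map("name" -> lit("alice_updated")))

  // Delete rows matching a predicate.
  people.delete("id = 2")

  // Merge (upsert) a DataFrame of changes into the table.
  val updates = Seq((1, "alice"), (4, "dave")).toDF("id", "name")
  people.as("t")
    .merge(updates.as("u"), "t.id = u.id")
    .whenMatched().updateAll()
    .whenNotMatched().insertAll()
    .execute()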
Unified Batch and Streaming Source and Sink
A table in Delta Lake is both a batch table, as well as a streaming source and
sink. Streaming data ingest, batch historic backfill, and interactive queries all
just work out of the box.
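For example, the same Delta path can serve as both a streaming source and a streaming sink; this sketch assumes a running SparkSession and that the paths below are illustrative.

  // Delta table as a streaming source.
  val stream = spark.readStream
    .format("delta")
    .load("/tmp/delta/people")

  // Delta table as a streaming sink (checkpoint location is illustrative).
  stream.writeStream
    .format("delta")
    .option("checkpointLocation", "/tmp/delta/_checkpoints/people_copy")
    .outputMode("append")
    .start("/tmp/delta/people_copy")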
DEMO
Q/A
References
1. Welcome to the Delta Lake documentation — Delta Lake Documentation
2. Spark: ACID compliant or not
3. Spark: ACID Transaction with Delta Lake
4. Time Travel: Data versioning in Delta Lake
Thank You !