Building Reliable Delta Lakes at Scale
Steps to run this tutorial
Instructions: https://dbricks.co/saiseu19-delta
1. Create an account and sign in to Databricks Community Edition: https://databricks.com/try
2. Create a cluster with Databricks Runtime 6.1.
3. Import the Python notebook and attach it to the cluster. You can also use the Scala notebook if you prefer.
The Promise of the Data Lake
1. Collect Everything
2. Store it all in the Data Lake
3. Data Science & Machine Learning
• Recommendation Engines
• Risk, Fraud Detection
• IoT & Predictive Maintenance
• Genomics & DNA Sequencing
Garbage In, Garbage Stored, Garbage Out
Tutorial instructions - https://dbricks.co/saiseu19-delta
What does a typical data lake project look like?
Evolution of a Cutting-Edge Data Lake
[Diagram, two builds: Events, Streaming Analytics, Data Lake, AI & Reporting. The first build also shows a "?" placeholder.]
Challenge #1: Historical Queries?
[Diagram: Events, Data Lake, Streaming Analytics, AI & Reporting, λ-arch (callout 1).]
Challenge #2: Messy Data?
[Diagram: Events, Data Lake, Streaming Analytics, AI & Reporting, λ-arch (callout 1), Validation (callout 2).]
Challenge #3: Mistakes and Failures?
[Diagram: Events, Data Lake (Partitioned), Streaming Analytics, AI & Reporting, λ-arch (callout 1), Validation (callout 2), Reprocessing (callout 3).]
Challenge #4: Updates?
[Diagram: Events, Data Lake (Partitioned), Streaming Analytics, AI & Reporting, λ-arch (callout 1), Validation (callout 2), Reprocessing (callout 3), Updates: DELETE, UPDATE & MERGE, Scheduled to Avoid Modifications (callout 4).]
Wasting Time & Money Solving Systems Problems Instead of Extracting Value From Data
Data Lake Distractions
• No atomicity means failed production jobs leave data in a corrupt state, requiring tedious recovery.
• No quality enforcement creates inconsistent and unusable data.
• No consistency / isolation makes it almost impossible to mix appends and reads, or batch and streaming.
Let’s try it instead with Delta Lake
Challenges of the Data Lake
[Diagram: Events, Data Lake (Partitioned), Streaming Analytics, AI & Reporting, λ-arch (callout 1), Validation (callout 2), Reprocessing (callout 3), Updates: UPDATE & MERGE, Scheduled to Avoid Modifications (callout 4).]
The Architecture
[Diagram: sources (CSV, JSON, TXT…, Kinesis), Data Lake, three tables of increasing Quality, Streaming Analytics, AI & Reporting.]
• Bronze: Raw Ingestion
• Silver: Filtered, Cleaned, Augmented
• Gold: Business-level Aggregates
Data Quality Levels: Delta Lake allows you to incrementally improve the quality of your data until it is ready for consumption.
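The Bronze, Silver, Gold progression can be illustrated with a minimal sketch in plain Python. This is only a toy model of the idea of incremental refinement: the event fields (`device`, `temp`) and the functions `to_silver` and `to_gold` are hypothetical names chosen for illustration, and nothing here uses the Delta Lake API.

```python
from collections import defaultdict

# Hypothetical raw events as they might land in a Bronze table.
bronze = [
    {"device": "a", "temp": "21.5"},
    {"device": "b", "temp": "not-a-number"},  # malformed reading
    {"device": "a", "temp": "22.5"},
    {"device": None, "temp": "19.0"},         # missing device id
]


def to_silver(rows):
    """Filter and clean: drop rows without a device or a numeric temp."""
    silver = []
    for row in rows:
        if row["device"] is None:
            continue
        try:
            temp = float(row["temp"])
        except ValueError:
            continue
        silver.append({"device": row["device"], "temp": temp})
    return silver


def to_gold(rows):
    """Aggregate: average temperature per device."""
    temps_by_device = defaultdict(list)
    for row in rows:
        temps_by_device[row["device"]].append(row["temp"])
    return {d: sum(t) / len(t) for d, t in temps_by_device.items()}


silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)  # {'a': 22.0}
```

Keeping the raw rows untouched in Bronze means the Silver and Gold steps can be re-run if the cleaning rules change.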
Full ACID Transactions
Focus on your data flow, instead of worrying about failures.
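To make the atomicity claim concrete, here is a toy in-memory model, assuming a design in which readers only trust files listed in a commit log. `ToyTable` and its `fail_after` parameter are invented for this sketch; it shows the general idea of an all-or-nothing commit and is not a description of Delta Lake's actual implementation.

```python
class ToyTable:
    """In-memory stand-in for a table whose readers only trust a commit log."""

    def __init__(self):
        self.files = {}  # file name -> rows (may include uncommitted files)
        self.log = []    # ordered commits; each commit is a list of file names

    def write(self, batches, fail_after=None):
        """Stage one file per batch, then commit them all in one log entry."""
        staged = []
        for i, rows in enumerate(batches):
            if fail_after is not None and i == fail_after:
                raise RuntimeError("simulated job failure")
            name = f"part-{len(self.files)}"
            self.files[name] = rows
            staged.append(name)
        self.log.append(staged)  # the single step that makes the write visible

    def read(self):
        """Return rows only from files referenced by the log."""
        return [
            row
            for commit in self.log
            for name in commit
            for row in self.files[name]
        ]


table = ToyTable()
table.write([[1, 2], [3]])

try:
    # This job dies after staging its first file.
    table.write([[4], [5]], fail_after=1)
except RuntimeError:
    pass

print(table.read())  # [1, 2, 3]
```

The failed job leaves an orphaned file behind, but because it never reached the log, readers never see a half-written result.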
Open Standards, Open Source
Store petabytes of data without worries of lock-in. Growing community including Spark, Presto, Hive and more.
Powered by Apache Spark
Unifies Streaming / Batch. Convert existing jobs with minimal modifications.
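One way to picture the unification of streaming and batch is a single transformation function reused by both execution modes. The sketch below is plain Python with hypothetical names (`clean`, `run_batch`, `run_streaming`) and made-up records; it illustrates the principle only and does not use Spark or Delta Lake.

```python
def clean(records):
    """Shared logic: keep positive amounts and normalise the currency code."""
    return [
        {"amount": r["amount"], "currency": r["currency"].upper()}
        for r in records
        if r["amount"] > 0
    ]


def run_batch(records):
    """Batch mode: process the whole dataset in one call."""
    return clean(records)


def run_streaming(micro_batches):
    """Streaming mode: apply the same logic to each micro-batch as it arrives."""
    out = []
    for batch in micro_batches:
        out.extend(clean(batch))
    return out


data = [
    {"amount": 10, "currency": "eur"},
    {"amount": -5, "currency": "usd"},
    {"amount": 7, "currency": "usd"},
]

batch_result = run_batch(data)
stream_result = run_streaming([data[:1], data[1:]])
print(batch_result == stream_result)  # True
```

Because the business logic lives in one function, moving a job from batch to streaming changes only how the input is fed in.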
Support for DML
[Operations shown on the diagram: UPDATE, DELETE, MERGE, OVERWRITE, INSERT]
Use Delete/Update/Merge operations for data corrections, GDPR, Change Data Capture, etc.
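The semantics of a merge (upsert plus delete) can be sketched in a few lines of plain Python. The `merge` function, the `_deleted` flag, and the sample rows are all assumptions made for this illustration; real Delta Lake merges are expressed through its own SQL and DataFrame APIs, which are not shown here.

```python
def merge(target, updates, key):
    """Upsert `updates` into `target`, matching rows on `key`.

    Rows flagged with "_deleted": True are removed instead of upserted.
    Returns a new list; the input `target` is left untouched.
    """
    merged = {row[key]: dict(row) for row in target}
    for row in updates:
        if row.get("_deleted"):
            merged.pop(row[key], None)      # matched (or absent): delete
        elif row[key] in merged:
            merged[row[key]].update(row)    # matched: update in place
        else:
            merged[row[key]] = dict(row)    # not matched: insert
    return list(merged.values())


users = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": "b@example.com"},
]
changes = [
    {"id": 2, "email": "b@new.example.com"},  # correction to an existing row
    {"id": 3, "email": "c@example.com"},      # new row from a change feed
    {"id": 1, "_deleted": True},              # erasure request
]

result = merge(users, changes, key="id")
print(result)
```

The three change rows correspond to the use cases named above: a data correction, a Change Data Capture insert, and a GDPR-style deletion.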
• Open source and open formats
• Unified Batch and Streaming sources
• ACID Transactions
• Schema Enforcement and Evolution
• Delete, Update, Merge
• Audit History
• Versioning and Time Travel
• Scalable metadata management
• Support from Spark, Presto, Hive
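Two of the features listed above, schema enforcement and time travel, can be illustrated with a small toy class. `VersionedTable` is a hypothetical name, and storing a full copy of the table per version is a simplification chosen for brevity; it is not how Delta Lake stores versions.

```python
class VersionedTable:
    """Toy table that rejects bad rows and keeps every past version."""

    def __init__(self, schema):
        self.schema = schema   # column name -> expected Python type
        self.versions = [[]]   # versions[n] holds the full table at version n

    def append(self, rows):
        """Validate all rows first, then record a new version."""
        for row in rows:
            if set(row) != set(self.schema):
                raise ValueError(f"schema mismatch: {sorted(row)}")
            for column, expected in self.schema.items():
                if not isinstance(row[column], expected):
                    raise ValueError(f"bad type for {column!r}")
        self.versions.append(self.versions[-1] + list(rows))
        return len(self.versions) - 1

    def read(self, version=None):
        """Read the latest version, or an earlier one by number."""
        index = -1 if version is None else version
        return list(self.versions[index])


events = VersionedTable({"id": int, "action": str})
v1 = events.append([{"id": 1, "action": "open"}])
v2 = events.append([{"id": 2, "action": "close"}])

try:
    events.append([{"id": "3", "action": "open"}])  # id has the wrong type
    rejected = False
except ValueError:
    rejected = True

print(events.read(version=v1))  # [{'id': 1, 'action': 'open'}]
print(len(events.read()))       # 2
```

Validation runs before anything is recorded, so a rejected batch leaves the version history unchanged.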
Used by thousands of organizations worldwide
> 2 exabytes processed last month alone
Let’s begin the tutorial!
Build your own Delta Lake at https://delta.io
