ED OLDHAM
DATA ENGINEER AT ADVANCING ANALYTICS
agenda
• WHAT IS A DATA LAKEHOUSE?
• DATA LAKES/DATA WAREHOUSES
• APACHE SPARK & DELTA LAKE
• MEDALLION ARCHITECTURE
• BRONZE LAYER
• SILVER LAYER
• GOLD LAYER
• DEMOS
WHAT IS A DATA LAKEHOUSE?
DATA LAKE PROS
• Volume and Variety
• Speed of Ingest
• Lower Costs relative to Data Warehouse
• Greater Accessibility
• Advanced Algorithms
DATA LAKE CONS
• Complex On-Premises Deployment
• Learning Curve
• Migration
• Handling Queries
• Governance, semantic consistency, and access controls
• Often need a data warehouse as well
DATA WAREHOUSE PROS
• Maturity
• Maintenance
• Performance
• Usability
• Flexibility
• Governance
• Structure
DATA WAREHOUSE CONS
• Storage Costs
• Time
• Limited Source Data
• The Big Data Challenge
• Complicated Changeover
• Potential for Data Distortion
The Data Lakehouse
• Structured
• Governed
• Familiar
• Fast*
• Flexible
• Cheap
• Scalable
WHY USE DATABRICKS OVER A CLOUD DATA WAREHOUSE?
• Unified analytics platform
• Apache Spark
• Delta Lake
• AI and Machine Learning
• No vendor lock in
WHY NOT USE DATABRICKS?
• Migration can be complex
• A lot to learn if you haven’t used Spark before
• Designed for big data, so for small workloads it can feel like taking a train instead of a bicycle
What is APACHE SPARK?
Apache Spark is an open-source distributed processing analytics engine
What is Delta Lake?
Delta Lake is an optimised, managed format for organising & working with Parquet files
“It’s Parquet, but better”
PARQUET
Delta Lake
MEDALLION ARCHITECTURE (AKA MULTI-HOP)
• System for organising data in a Lakehouse
• Standard layers are Bronze, Silver, Gold
• More precious = more structure and validation
• Might be similar to what you have seen in a Data Warehouse architecture
Bronze layer
• First ingestion into the system
• Complete history
• Data is unvalidated and raw
• Add file name and ingestion date
• Store as Delta
• Use a combination of batch and streaming
Silver layer
• Filter, clean and augment data
• Handle missing data
• Deduplicate
• Apply validation rules
• Logical checks
• Don’t join data from different systems
• Enrich with reference data
• Append or use merge to load new data/changes
Gold layer
• Create business level aggregates
• Apply business rules
• Join tables to form the basis for queries/visualisations
• If using a star schema apply that here
OTHER LAYERS
• Sometimes there are use cases for other layers, e.g. Landing
• A separate presentation layer may be required, or different presentation layers for different audiences
OTHER NAMING CONVENTIONS
• RAW > VALIDATED > ENRICHED
• RAW > BASE > CURATED
• RAW > STAGE > CURATED
demos
• Databricks Community Edition
ed@advancinganalytics.co.uk
https://www.linkedin.com/in/edward-oldham-99b101175/
https://github.com/edoldhamaa
https://github.com/AdvancingAnalytics/
AdvancingAnalytics.co.uk/blog
Turning Raw Data Into Gold With A Data Lakehouse.pptx
Editor's Notes

  • #3: Note that for the purposes of this talk, while I discuss the Data Lakehouse and Medallion architecture as concepts, everything (especially the demos) is rooted in Databricks
  • #4: Data lakehouses are more popular than ever. Microsoft Fabric is about to be released into general availability, so Microsoft has shown a lot of confidence in this technology. Today I'm going to be talking mostly from a Databricks perspective, as they are one of the leading providers, but it's important to point out that a lot of what I'm talking about, especially medallion architecture, can be applied to any data lakehouse. So what is a data lakehouse? Data Lake + Data Warehouse. The idea behind the data lakehouse is to take the positives of both while addressing at least some of the negatives
  • #5: Volume and Variety: A data lake can accommodate the large amount of data that Big Data, artificial intelligence, and machine learning requires. Data lakes can handle the volume, variety, and velocity of data from various sources being ingested in any format. Speed of Ingest: Format is irrelevant during ingest. Rather than schema-on-write, it uses schema-on-read, which means data doesn’t need to be processed for use until it is needed. Data can be written quickly. Lower Costs: Relative to a data warehouse, a data lake can be significantly less expensive when it comes to storage costs. This allows companies to collect a wider variety of data, including unstructured data such as rich media, sensor data from the Internet of Things (IoT), email, or social media. Greater Accessibility: Data stored in a data lake makes it easy to open copies or subsets of data to be accessed by different users or user groups. Data access can be controlled while companies can provide broader accessibility. Advanced Algorithms: Data lakes let companies conduct complex queries and deep learning algorithms to recognize patterns.
  • #6: Complex On-Premises Deployment: Spinning up a data lake in the cloud is a simple process. Deploying a data lake on-premises can be significantly more complex. On-prem solutions such as Hadoop or Splunk are available, but data lakes are built for the cloud. Learning Curve: There is a bit of a learning curve, new tools, and new services with a data lake. This requires either outside help, training, or recruiting team members with data lake skill sets. Migration: If you’re already working with a data warehouse, transitioning to a data lake takes some careful planning of your data strategy to manage your data sets. Depending on your infrastructure, this can be a challenge. Handling Queries: It’s fast and easy to ingest data, but a data lake is not optimized for queries in the way structured and semi-structured data is in a data warehouse. Using best practices for database queries can help, but data retrieval is not as straightforward as it is with a data warehouse. In a data lake, data is processed after it’s been loaded using the Extract, Load, Transform (ELT) process. Governance, semantic consistency, and access controls are required; otherwise, the data lake may turn into a “data swamp” of unusable, raw data.
  • #7: Maturity: Data warehouses have been around for over 30 years. These are tried and tested tools/methods for storing data. Maintenance: Data warehouses typically perform efficiently with little maintenance. If you do need to work on your warehouses, there are plenty of IT teams with the skill sets needed. Performance: Since data warehouses use a schema-on-write process, the common underlying data structure makes it easy for the query engine to sift through data and generate results quickly. Usability: Rather than sort through raw data or complex data structures, users find it easy to find what they need. Flexibility: Companies can store data in the cloud using one of the many cloud providers available, or on premise, or a hybrid of the two. Data Marts: Data marts can provide structure and access to specific environments within a single functional data set. For example, you can create data marts for specific products or lines of business.
  • #8: Storage Costs: One of the downsides of data warehouses is the cost of storage for large volumes of data. Time: Each business process component must be built to extract value from the data. It can be time-consuming to get data into the data warehouse using an ETL process. For example, if you want to do data analytics on current and historical data, but the source data is not currently in the database, it will have to be processed and added before you can examine it. Limited Source Data: Because of storage costs and the need to process data, organizations often limit what data is captured and stored and what data sources are ingested. In most cases, this leads to storing data for known reporting requirements rather than raw data or data you may need in the future. The Big Data Challenge: A data warehouse is not built for Big Data analysis. Data warehouses are not designed or optimized for the massive amounts of data now being collected by some companies. Complicated Changeover: If you want to add new fields it can be tricky. Data must either be reprocessed or historical data will include blank fields. Potential for Data Distortion: Since data quality relies on pre-load data cleansing, errors in processing mean data can be distorted permanently.
  • #9: Structured - Can still have that table structure that a data warehouse provides Governed – Who has access to what Familiar – We have SQL, Tables, Views etc Fast – We can deal with massive amounts of data very quickly *as long as it’s set up efficiently Flexible – Can deal with structured and unstructured data
  • #10: A lot of the negatives of a traditional data warehouse are solved by the modern cloud data warehouses available today. So why would you want to use a lakehouse? Unified platform – Data Engineering, Data Science, Data Analytics. Apache Spark – designed for huge amounts of data. Delta Lake – storage format – more on this in a minute. AI & Machine Learning – massive at the moment – Dolly LLM – MLflow is integrated into Databricks – great support for deploying and running production models
  • #12: Important thing to take away from this slide is that its distributed. Jobs will be split into tasks and then executed in parallel across different worker nodes and then the results combined.
  • #14: Versioned Parquet data files + transaction log. This gives ACID transactions (Atomic, Consistent, Isolated and Durable) – what we are used to from data warehouses and what is missing from data lakes
  • #17: Validation rules - No nulls - Data is unique - Data is correct type Logical checks - Does an order have an order date - For a country field is that a country and is it spelled correctly Reference data - Swapping country, state codes for full names
  • #18: - Monthly, yearly sales tables - A business might want a query to show which customers have spent the most in a certain timeframe - You could aggregate the revenue for different LOB weekly
  • #20: At AA we use RAW > BASE > CURATED unless a customer wants to have their own convention. This is to hopefully give a better description of the data at each layer