ED OLDHAM
DATA ENGINEER AT ADVANCING ANALYTICS
agenda
• WHAT IS A DATA LAKEHOUSE?
• DATA LAKES/DATA WAREHOUSES
• APACHE SPARK & DELTA LAKE
• MEDALLION ARCHITECTURE
• BRONZE LAYER
• SILVER LAYER
• GOLD LAYER
• DEMOS
WHAT IS A DATA LAKEHOUSE?
DATA LAKE PROS
• Volume and Variety
• Speed of Ingest
• Lower Costs relative to Data Warehouse
• Greater Accessibility
• Advanced Algorithms
DATA LAKE CONS
• Complex On-Premises Deployment
• Learning Curve
• Migration
• Handling Queries
• Governance, semantic consistency, and access controls
• Often need a data warehouse as well
DATA WAREHOUSE PROS
• Maturity
• Maintenance
• Performance
• Usability
• Flexibility
• Governance
• Structure
DATA WAREHOUSE CONS
• Storage Costs
• Time
• Limited Source Data
• The Big Data Challenge
• Complicated Changeover
• Potential for Data Distortion
The Data Lakehouse
• Structured
• Governed
• Familiar
• Fast*
• Flexible
• Cheap
• Scalable
WHY USE DATABRICKS OVER A CLOUD DATA WAREHOUSE?
• Unified analytics platform
• Apache Spark
• Delta Lake
• AI and Machine Learning
• No vendor lock in
WHY NOT USE DATABRICKS?
• Migration can be complex
• A lot to learn if you haven’t used Spark before
• Designed for big data, so for small workloads it can feel like taking a train instead of a bicycle
What is APACHE SPARK?
Apache Spark is an open-source distributed processing analytics engine
What is Delta Lake?
Delta Lake is an optimised, managed format for organising & working with Parquet files
“It’s Parquet, but better”
PARQUET
Delta Lake
MEDALLION ARCHITECTURE (AKA MULTI-HOP)
• System for organising data in a Lakehouse
• Standard layers are Bronze, Silver, Gold
• More precious = more structure and validation
• Might be similar to what you have seen in a Data Warehouse architecture
Bronze layer
• First ingestion into the system
• Complete history
• Data is unvalidated and raw
• Add file name and ingestion date
• Store as Delta
• Use a combination of batch and streaming
Silver layer
• Filter, clean and augment data
• Handle missing data
• Deduplicate
• Apply validation rules
• Logical checks
• Don’t join data from different systems
• Enrich with reference data
• Append or use merge to load new data/changes
Gold layer
• Create business level aggregates
• Apply business rules
• Join tables to form the basis for queries/visualisations
• If using a star schema apply that here
OTHER LAYERS
• Sometimes there are use cases for other layers, e.g. Landing
• A separate presentation layer may be required, or different presentation layers for different audiences
OTHER NAMING CONVENTIONS
• RAW > VALIDATED > ENRICHED
• RAW > BASE > CURATED
• RAW > STAGE > CURATED
demos
• Databricks Community Edition
ed@advancinganalytics.co.uk
https://www.linkedin.com/in/edward-oldham-99b101175/
https://github.com/edoldhamaa
https://github.com/AdvancingAnalytics/
AdvancingAnalytics.co.uk/blog
Turning Raw Data Into Gold With A Data Lakehouse.pptx
Editor's Notes

  • #3: Note that for the purposes of this talk, while I discuss the Data Lakehouse and Medallion architecture as concepts, everything (especially the demos) is rooted in Databricks
  • #4: Data lakehouses are more popular than ever. Microsoft Fabric is about to be released into general availability, so Microsoft has shown a lot of confidence in this technology. Today I'm going to be talking mostly from a Databricks perspective, as they are one of the leading providers, but it's important to point out that a lot of what I'm talking about, especially medallion architecture, can be applied to any data lakehouse. So what is a data lakehouse? Data Lake + Data Warehouse. The idea behind the data lakehouse is to take the positives of both while addressing at least some of the negatives
  • #5: Volume and Variety: A data lake can accommodate the large amount of data that Big Data, artificial intelligence, and machine learning requires. Data lakes can handle the volume, variety, and velocity of data from various sources being ingested in any format. Speed of Ingest: Format is irrelevant during ingest. Rather than schema-on-write, it uses schema-on-read, which means data doesn’t need to be processed for use until it is needed. Data can be written quickly. Lower Costs: Relative to a data warehouse, a data lake can be significantly less expensive when it comes to storage costs. This allows companies to collect a wider variety of data, including unstructured data such as rich media, sensor data from the Internet of Things (IoT), email, or social media. Greater Accessibility: Data stored in a data lake makes it easy to open copies or subsets of data to be accessed by different users or user groups. Data access can be controlled while companies can provide broader accessibility. Advanced Algorithms: Data lakes let companies conduct complex queries and deep learning algorithms to recognize patterns.
  • #6: Complex On-Premises Deployment: Spinning up a data lake in the cloud is a simple process. Deploying a data lake on-premises can be significantly more complex. On-prem solutions such as Hadoop or Splunk are available, but data lakes are built for the cloud. Learning Curve: There is a bit of a learning curve, new tools, and new services with a data lake. This requires either outside help, training, or recruiting team members with data lake skill sets. Migration: If you’re already working with a data warehouse, transitioning to a data lake takes some careful planning of your data strategy to manage your data sets. Depending on your infrastructure, this can be a challenge. Handling Queries: It’s fast and easy to ingest data, but a data lake is not optimized for queries in the way structured and semi-structured data is in a data warehouse. Using best practices for database queries can help, but data retrieval is not as straightforward as it is with a data warehouse. In a data lake, data is processed after it’s been loaded using the Extract, Load, Transform (ELT) process. Governance, semantic consistency, and access controls are required; otherwise, the data lake may turn into a “data swamp” of unusable, raw data.
  • #7: Maturity: Data warehouses have been around for over 30 years. These are tried and tested tools/methods for storing data. Maintenance: Data warehouses typically perform efficiently with little maintenance. If you do need to work on your warehouses, there are plenty of IT teams with the skill sets needed. Performance: Since data warehouses use a schema-on-write process, the common underlying data structure makes it easy for the query engine to sift through data and generate results quickly. Usability: Rather than sort through raw data or complex data structures, users find it easy to find what they need. Flexibility: Companies can store data in the cloud using one of the many cloud providers available, or on premise, or a hybrid of the two. Data Marts: Data marts can provide structure and access to specific environments within a single functional data set. For example, you can create data marts for specific products or lines of business.
  • #8: Storage Costs: One of the downsides of data warehouses is the cost of storage for large volumes of data. Time: Each business process component must be built to extract value from the data. It can be time-consuming to get data into the data warehouse using an ETL process. For example, if you want to do data analytics on current and historical data, but the source data is not currently in the database, it will have to be processed and added before you can examine it. Limited Source Data: Because of storage costs and the need to process data, organizations often limit what data is captured and stored and what data sources are ingested. In most cases, this leads to storing data for known reporting requirements rather than raw data or data you may need in the future. The Big Data Challenge: A data warehouse is not built for Big Data analysis. Data warehouses are not designed or optimized for the massive amounts of data now being collected by some companies. Complicated Changeover: If you want to add new fields it can be tricky. Data must either be reprocessed or historical data will include blank fields. Potential for Data Distortion: Since data quality relies on pre-load data cleansing, errors in processing mean data can be distorted permanently.
  • #9: Structured - Can still have that table structure that a data warehouse provides Governed – Who has access to what Familiar – We have SQL, Tables, Views etc Fast – We can deal with massive amounts of data very quickly *as long as it’s set up efficiently Flexible – Can deal with structured and unstructured data
  • #10: A lot of the negatives of a traditional data warehouse are solved by the modern cloud data warehouses available today. So why would you want to use a lakehouse? Unified platform – Data Engineering, Data Science, Data Analytics. Apache Spark – designed for huge amounts of data. Delta Lake – storage format – more on this in a minute. AI & Machine Learning – massive at the moment – Dolly LLM – MLflow is integrated into Databricks – great support for deploying and running production models
  • #12: Important thing to take away from this slide is that its distributed. Jobs will be split into tasks and then executed in parallel across different worker nodes and then the results combined.
  • #14: Versioned Parquet data files + transaction log. This gives ACID transactions (Atomic, Consistent, Isolated and Durable) – what we are used to from data warehouses and what is missing from data lakes
  • #17: Validation rules - No nulls - Data is unique - Data is correct type Logical checks - Does an order have an order date - For a country field is that a country and is it spelled correctly Reference data - Swapping country, state codes for full names
  • #18: - Monthly, yearly sales tables - A business might want a query to show which customers have spent the most in a certain timeframe - You could aggregate the revenue for different LOB weekly
  • #20: At AA we use RAW > BASE > CURATED unless a customer wants to have their own convention. This is to hopefully give a better description of the data at each layer