Data Engineering:
A Deep Dive into
Databricks
Presenter:
Mohika Rastogi
Sant Singh
KnolX Etiquettes
Lack of etiquette and manners is a huge turn-off.
 Punctuality
Join the session 5 minutes prior to the session start time. We start on
time and conclude on time!
 Feedback
Make sure to submit constructive feedback for all sessions, as it is very
helpful for the presenter.
 Silent Mode
Keep your mobile devices in silent mode; feel free to step out of the session
if you need to take an urgent call.
 Avoid Disturbance
Avoid unwanted chit chat during the session.
1. Introduction
o What is Data Engineering
o Data Engineer vs Analyst vs Scientist
2. Central Repository
o Data Warehouse
o Data Lake
o Data Lakehouse
3. Databricks
o What is Databricks?
o Use cases
o Managed Integration
o Delta Lake
o Delta Sharing
4. Apache Spark
5. Databricks Workspace
o Workspace Terminologies
6. Demo
Data Engineering: A Deep Dive into Databricks
Data Engineering
 Data engineering is the practice of designing and building systems for collecting, storing, and
analyzing data at scale.
 Data engineering is the complex task of making raw data usable to data scientists and groups
within an organization.
Data Engineer vs Data Analyst vs Data Scientist
Data Scientist
A data scientist is someone who
uses their knowledge of statistics,
machine learning, and
programming to extract meaning
from data. They use their skills to
solve complex problems, identify
trends, and make predictions.
Data Analyst
A data analyst is someone who
collects, cleans, and analyzes
data to help businesses make
better decisions. They use
their skills to identify patterns
in data, and to create reports and
visualizations that help others
understand the data.
Data Engineer
A data engineer is someone who
builds and maintains the systems
that data scientists and data
analysts use to collect, store, and
analyze data. They use their
skills to design and build
data pipelines, and to ensure
that data is stored in a secure
and efficient way.
Central Repositories
Data Warehouse
A data warehouse is a central repository of business
data stored in structured format to help organizations
gain insights. The schema must be known before writing
data into a warehouse (schema-on-write).
Data Lake
A data lake is a large store that can hold structured,
semi-structured, and raw data. The schema does not need
to be known up front; it is applied on read (schema-on-read).
Data Lakehouse
 A data lakehouse is a relatively new architecture that combines the best of both worlds:
data warehouses and data lakes.
 It serves as a single platform for data warehousing and data lakes. It offers data management
features such as ACID transactions from the warehouse side and low-cost storage
like a data lake.
Databricks
A unified, open analytics platform for
building, deploying, sharing, and
maintaining enterprise-grade data,
analytics, and AI solutions at scale.
Databricks
 An interactive analytics platform that enables data engineers, data scientists, and businesses to
collaborate and work closely on notebooks, experiments, models, data, libraries, and jobs.
 Databricks was founded in 2013 by the creators of Apache Spark.
 A one-stop product for all data requirements, such as storage and analysis.
 Databricks is integrated with Microsoft Azure, Amazon Web Services, and Google Cloud Platform.
What is Databricks used for?
The Databricks workspace provides a unified interface and tools for most data tasks, including:
 Data processing workflow scheduling and management
 Working in SQL
 Generating dashboards and visualizations
 Data ingestion
 Managing security, governance, and HA/DR
 Data discovery, annotation, and exploration
 Compute management
 Machine learning (ML) modeling and tracking
 ML model serving
 Source control with Git
Databricks for Data Engineering
Databricks excels in data engineering with its unified platform, leveraging Apache Spark for
efficient processing and scalability.
 Simplified data ingestion
 Automated ETL processing
 Reliable workflow orchestration
 End-to-end observability and monitoring
 Next-generation data processing engine
 Foundation of governance, reliability, and performance
Managed integration with open source
The following technologies are open source projects founded by Databricks employees:
 Delta Lake
− Delta Lake is the optimized storage layer that provides the foundation for storing data and tables in the Databricks
Lakehouse Platform.
 Delta Sharing
− An open standard for secure data sharing.
 Apache Spark
− Apache Spark™ is a multi-language engine for executing data engineering, data science, and machine learning on
single-node machines or clusters.
 MLflow
− MLflow is an open source platform to manage the ML lifecycle, including experimentation, reproducibility, deployment,
and a central model registry.
Delta Lake
 Delta Lake is the default storage format for all operations on Databricks.
 Delta Lake is open source software that extends Parquet data files with a file-based transaction
log for ACID transactions and scalable metadata handling.
 Delta Lake is fully compatible with Apache Spark APIs, and was developed for tight integration
with Structured Streaming, allowing you to easily use a single copy of data for both batch and
streaming operations and providing incremental processing at scale.
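The transaction-log idea above can be illustrated with a toy sketch. This is not Delta Lake's actual log format (which lives under `_delta_log/` and has a richer action schema); it is only a minimal, hypothetical illustration of how replaying ordered commit files yields the current table state:

```python
import json
import os
import tempfile

# Toy illustration of a file-based transaction log (NOT Delta Lake's real
# format): each committed operation is a numbered JSON file, and the current
# table state is recovered by replaying the log in order.
log_dir = tempfile.mkdtemp()

def commit(version, action):
    # One action per numbered commit file, e.g. 00000000.json
    path = os.path.join(log_dir, f"{version:08d}.json")
    with open(path, "w") as f:
        json.dump(action, f)

def current_files():
    # Replay the log in version order to compute which data files are live
    live = set()
    for name in sorted(os.listdir(log_dir)):
        with open(os.path.join(log_dir, name)) as f:
            action = json.load(f)
        if action["op"] == "add":
            live.add(action["file"])
        elif action["op"] == "remove":
            live.discard(action["file"])
    return live

commit(0, {"op": "add", "file": "part-0001.parquet"})
commit(1, {"op": "add", "file": "part-0002.parquet"})
commit(2, {"op": "remove", "file": "part-0001.parquet"})
print(current_files())  # only part-0002.parquet remains live
```

Because every change is an append-only commit, readers always see a consistent snapshot: they replay the log up to a version and ignore anything newer, which is the core of the ACID guarantee described above.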
Delta Sharing
 Delta Sharing is the industry’s first open protocol for secure data sharing, making it simple to
share data with other organizations regardless of which computing platforms they use.
 Databricks and the Linux Foundation developed Delta Sharing to provide the first open source
approach to data sharing across data, analytics and AI. Customers can share live data across
platforms, clouds and regions with strong security and governance.
Apache Spark
 Apache Spark is a lightning-fast cluster computing technology designed for fast computation. It extends the Hadoop
MapReduce model to efficiently support more types of computation, including
interactive queries and stream processing.
 The main feature of Spark is its in-memory cluster computing that increases the processing speed of an application.
 PySpark: PySpark is an interface for Apache Spark in Python. With PySpark, you can write
Python and SQL-like commands to manipulate and analyze data in a distributed processing
environment.
Databricks Workspace
The Databricks workspace is an environment for accessing all of your Databricks assets.
The workspace organizes objects such as notebooks, libraries, and experiments
into folders, and provides access to data and computational resources such as
clusters and jobs.
The Databricks workspace can be managed using:
1. Workspace UI
2. Databricks CLI
3. Databricks REST API
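The REST API option can be sketched with only the Python standard library. The workspace URL and token below are placeholders (real calls need your own workspace host and a personal access token); the request is only constructed here, not sent:

```python
import urllib.request

# Placeholder workspace host and token -- substitute your own values.
host = "https://example-workspace.cloud.databricks.com"
token = "dapiXXXXXXXXXXXX"  # a Databricks personal access token (fake here)

# Build an authenticated GET request against a Databricks REST endpoint.
req = urllib.request.Request(
    url=f"{host}/api/2.0/clusters/list",
    headers={"Authorization": f"Bearer {token}"},
    method="GET",
)

# urllib.request.urlopen(req) would perform the call; we only build it here.
print(req.full_url, req.get_method())
```

The same bearer-token pattern underlies the Databricks CLI, which is essentially a wrapper over these REST endpoints.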
Databricks Workspace Terminology
Cluster
A cluster is a set of computational resources and configurations on
which an organization's data engineering workloads are run.
Jobs
A job is a way of running a notebook or a JAR, either immediately or on a
scheduled basis.
Hive Metastore
Every Databricks deployment has a central Hive metastore, accessible
by all clusters, to persist table metadata.
Notebooks
A notebook is a web-based interface composed of a group of cells
that let you execute code commands.
DBFS
DBFS is a distributed file system mounted into each Databricks
workspace. DBFS contains directories, which in turn contain data files,
libraries, and other directories.
Delta Table
By default, all tables created in Databricks are Delta tables. Delta
tables are based on the Delta Lake open source project.
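To make the Jobs entry above concrete, here is the rough shape of a scheduled-notebook job definition as a plain JSON payload. Field names follow the style of the Databricks Jobs API, but treat the exact schema, the notebook path, and the cron expression as illustrative assumptions, not an authoritative spec:

```python
import json

# Illustrative job definition: run a notebook on a schedule.
# The notebook path and schedule values are made up for this example.
job_spec = {
    "name": "nightly-etl",
    "notebook_task": {"notebook_path": "/Workspace/etl/ingest"},
    "schedule": {
        "quartz_cron_expression": "0 0 2 * * ?",  # daily at 02:00
        "timezone_id": "UTC",
    },
}

# Serialized, this is the kind of payload you would POST to the Jobs API.
payload = json.dumps(job_spec)
```

A spec like this is what the Workspace UI builds for you behind the scenes when you attach a schedule to a notebook.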