Azure Databricks for Data Engineering
Eugene Polonichko
Senior Software Developer at Eleks,
Data Platform MVP
2018 Ukraine
https://www.linkedin.com/in/eugenepolonichko/
About me
Eugene Polonichko has over 7 years of experience
with SQL Server. He has mainly focused on BI projects
(SSAS, SSIS, Power BI, Cognos, Informatica
PowerCenter, Pentaho, Tableau). Eugene is a
passionate speaker and SQL community volunteer
who presents regularly at PASS SQL Saturday events
and local user groups around Ukraine and Europe.
Eugene is a PASS Chapter Leader and holds
Data Platform MVP status.
https://www.linkedin.com/in/eugenepolonichko/
https://twitter.com/EvgenPolonichko
Agenda
1. What is Azure Databricks?
• Azure Databricks
• Apache Spark
• Components of Apache Spark
• Architecture of Azure Databricks
• Azure integration
2. Azure Databricks
• Cluster
• Workspace
• Notebooks
• Visualizations
• Jobs and Alerts
• Databricks File System
• Business Intelligence Tools
3. For data engineers
• Scenario
• Prices
What is Azure Databricks?
Azure Databricks
Azure Databricks is an Apache Spark-
based analytics platform optimized for
the Microsoft Azure cloud services
platform. Designed with the founders of
Apache Spark, Databricks is integrated
with Azure to provide one-click setup,
streamlined workflows, and an interactive
workspace that enables collaboration
between data scientists, data engineers,
and business analysts.
Apache Spark-based analytics platform
Azure Databricks comprises the complete open-source Apache Spark cluster technologies and capabilities.
Spark in Azure Databricks includes the following components:
• Spark SQL and DataFrames: Spark SQL is the Spark module for working with
structured data
• Streaming: Real-time data processing and analysis for analytical and
interactive applications. Integrates with HDFS, Flume, and Kafka.
• MLlib: Machine Learning library consisting of common learning algorithms
and utilities, including classification, regression, clustering, collaborative
filtering, dimensionality reduction, as well as underlying optimization
primitives.
• GraphX: Graphs and graph computation for a broad scope of use cases
from cognitive analytics to data exploration.
• Spark Core API: Includes support for R, SQL, Python, Scala, and Java.
Architecture of Azure Databricks
Total Azure integration
• Diversity of VM types
• Security and Privacy
• Flexibility in network topology
• Azure Storage and Azure Data Lake integration
• Azure Power BI
• Azure Active Directory
• Azure SQL Data Warehouse, Azure SQL DB, and
Azure Cosmos DB
Azure Databricks
Clusters
Azure Databricks clusters provide a unified platform for various use cases such as running production ETL
pipelines, streaming analytics, ad-hoc analytics, and machine learning.
Cluster types: Job, Interactive
Workspace
The Workspace is the special root folder for all of
your organization’s Azure Databricks assets.
The Workspace stores:
• notebooks
• libraries
• dashboards
• folders
Notebooks
A notebook is a web-based interface to a document that
contains runnable code, visualizations, and narrative text.
• Create a notebook
• Delete a notebook
• Control access to a notebook
• Notebook external formats
• Notebooks and clusters
• Schedule a notebook
• Distributing notebooks
Visualizations
Databricks supports a
number of visualizations out
of the box.
All notebooks, regardless of
their language, support
Databricks visualization
using the display function.
display(<dataframe-name>)
Jobs and Alerts
A job is a way of running a notebook or JAR either immediately
or on a scheduled basis.
The number of jobs is limited to 1000.
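As a rough sketch, a scheduled notebook job could be described with a request body like the one below, shaped after the `jobs/create` call of the Jobs REST API 2.0. The helper name, job name, notebook path, cluster ID, and cron expression are made-up placeholders. The code only builds and prints the JSON and does not call the service.

```python
import json


def build_notebook_job(name, notebook_path, cluster_id, cron=None, timezone="UTC"):
    """Build a jobs/create request body for a notebook job (illustrative only)."""
    spec = {
        "name": name,
        # Placeholder ID of an already running interactive cluster.
        "existing_cluster_id": cluster_id,
        "notebook_task": {"notebook_path": notebook_path},
    }
    # Without a schedule the job only runs when triggered manually.
    if cron is not None:
        spec["schedule"] = {
            "quartz_cron_expression": cron,
            "timezone_id": timezone,
        }
    return spec


# Hypothetical job: run an ETL notebook every day at 02:00 UTC.
job = build_notebook_job(
    name="nightly-etl",
    notebook_path="/Users/someone@example.com/etl",
    cluster_id="0000-000000-example0",
    cron="0 0 2 * * ?",
)
print(json.dumps(job, indent=2))
```

Leaving out `cron` gives the "run immediately" variant: the same body without a `schedule` block.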
Alerts
You can set up email
alerts for job runs. You
can send alerts upon job
start, job success, and job
failure (including skipped
jobs), providing multiple
comma-separated email
addresses for each alert
type. You can also opt out
of alerts for skipped job
runs.
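These alert settings correspond to the `email_notifications` block of a job definition in the Jobs REST API 2.0. Below is a minimal sketch, assuming the comma-separated address lists arrive as plain strings. The helper names and the addresses are hypothetical.

```python
def parse_addresses(text):
    """Split a comma-separated address string into a clean list."""
    return [part.strip() for part in text.split(",") if part.strip()]


def build_email_notifications(on_start="", on_success="", on_failure="",
                              skip_alerts_for_skipped=False):
    """Build an email_notifications block, omitting alert types with no addresses."""
    notifications = {}
    for key, raw in (("on_start", on_start),
                     ("on_success", on_success),
                     ("on_failure", on_failure)):
        addresses = parse_addresses(raw)
        if addresses:
            notifications[key] = addresses
    # Opt out of alerts for skipped runs only when explicitly requested.
    if skip_alerts_for_skipped:
        notifications["no_alert_for_skipped_runs"] = True
    return notifications


# Hypothetical setup: alert two addresses on failure, stay quiet on skipped runs.
alerts = build_email_notifications(
    on_failure="oncall@example.com, data-team@example.com",
    skip_alerts_for_skipped=True,
)
print(alerts)
```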
Databricks File System
Databricks File System (DBFS) is a
distributed file system installed on
Databricks Runtime clusters. Files in
DBFS persist to Azure Blob storage.
You can access files in DBFS
using the Databricks CLI,
DBFS API, Databricks
Utilities, Spark APIs, and local
file APIs.
# List files in DBFS
dbfs ls
# Put local file ./apple.txt to dbfs:/apple.txt
dbfs cp ./apple.txt dbfs:/apple.txt
# Get dbfs:/apple.txt and save to local file ./apple.txt
dbfs cp dbfs:/apple.txt ./apple.txt
# Recursively put local dir ./banana to dbfs:/banana
dbfs cp -r ./banana dbfs:/banana
Python
# Write a file to DBFS using Python I/O APIs
with open("/dbfs/tmp/test_dbfs.txt", "w") as f:
    f.write("Apache Spark is awesome!\n")
    f.write("End of example!")

# Read the file back line by line
with open("/dbfs/tmp/test_dbfs.txt", "r") as f_read:
    for line in f_read:
        print(line)
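The CLI examples address files with the `dbfs:/` scheme, while the local file API example goes through the `/dbfs` mount point on the cluster. The sketch below converts between the two forms. Both helper names are hypothetical, and it assumes the default `/dbfs` mount.

```python
def to_local_path(dbfs_uri):
    """Convert a dbfs:/ URI into the path seen by local file APIs on the cluster."""
    prefix = "dbfs:/"
    if not dbfs_uri.startswith(prefix):
        raise ValueError("expected a path starting with 'dbfs:/': " + dbfs_uri)
    return "/dbfs/" + dbfs_uri[len(prefix):].lstrip("/")


def to_dbfs_uri(local_path):
    """Convert a /dbfs/... local path back into a dbfs:/ URI."""
    prefix = "/dbfs/"
    if not local_path.startswith(prefix):
        raise ValueError("expected a path starting with '/dbfs/': " + local_path)
    return "dbfs:/" + local_path[len(prefix):]


print(to_local_path("dbfs:/tmp/test_dbfs.txt"))
print(to_dbfs_uri("/dbfs/apple.txt"))
```

On a cluster, the result of `to_local_path` can be passed straight to `open()`, and the result of `to_dbfs_uri` is the form the `dbfs` CLI expects.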
Business Intelligence Tools
Business Intelligence (BI) tools can
connect to Azure Databricks clusters
to query data in tables. Every Azure
Databricks cluster runs a
JDBC/ODBC server on the driver
node. This section provides general
instructions for connecting BI tools
to Azure Databricks clusters, along
with specific instructions for
popular BI tools.
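To make the JDBC connection concrete, here is a small sketch that assembles a connection string in the layout commonly used with the Spark JDBC driver for Databricks clusters. The parameter names (`transportMode`, `ssl`, `httpPath`, `AuthMech`, `UID`, `PWD`) are given from memory and should be treated as an assumption to check against the driver documentation. The hostname, HTTP path, and token are placeholders, and the helper name is hypothetical.

```python
def build_jdbc_url(hostname, http_path, token, port=443):
    """Assemble a JDBC URL for a cluster's JDBC/ODBC server (assumed layout)."""
    parts = [
        f"jdbc:spark://{hostname}:{port}/default",
        "transportMode=http",
        "ssl=1",
        f"httpPath={http_path}",
        # Token-based authentication: the user name is the literal string "token".
        "AuthMech=3",
        "UID=token",
        f"PWD={token}",
    ]
    return ";".join(parts)


# Placeholder values only; real ones come from the cluster's JDBC/ODBC settings.
url = build_jdbc_url(
    hostname="example-region.azuredatabricks.net",
    http_path="sql/protocolv1/o/0/0000-000000-example0",
    token="<personal-access-token>",
)
print(url)
```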
For Data Engineers
Scenario
Scenario
Thank you
