Presented By:
Swantika Gupta
Software Consultant
Databricks and
Logging in
Notebooks
Lack of etiquette and manners is a huge turn off.
KnolX Etiquettes
Punctuality
Respect KnolX session timings; please
do not join a session more than
5 minutes after its start time.
Feedback
Make sure to submit constructive
feedback for all sessions, as it is
very helpful for the presenter.
Silent Mode
Keep your mobile devices in silent
mode, and feel free to step out of
the session if you need to attend
an urgent call.
Avoid Disturbance
Avoid unwanted chit-chat during
the session.
Agenda
What is Databricks?
Reasons to use Azure Databricks
Azure Databricks Core Artifacts
Logging in Scala Notebooks
Workspace
Clusters
Notebooks
Libraries
Jobs
Data
What is Databricks?
Industry-leading, zero-management cloud platform built around Apache Spark.
Delivers
- fully managed Spark clusters
- an interactive workspace for exploration and visualization
- a production pipeline scheduler
- a platform for powering your favorite Spark-based applications
So instead of tackling headaches like setting up infrastructure, creating data backups, and
scaling your nodes according to load, you can focus on finding answers that make an
immediate impact on your business.
It is a product offered by a third party, but it is delivered as a first-class service tightly
integrated with AWS and Azure.
Reasons to use Azure Databricks
Familiar Languages and Environment
Higher Productivity and Collaboration
Easy integration with Microsoft Stack
Extensive List of Data Sources
Suitable for Small Jobs too
Extensive Documentation and Support Available
Azure Databricks Core Artifacts
Azure Databricks Artifact - Workspace
An environment inside the Databricks
service, with access to all your
Databricks resources
Organizes various objects, like
Notebooks and JARs, into folders
Provides easy one-click access to
computational resources, like clusters,
and to stored data
Azure Databricks Artifact - Clusters
Core Component of Databricks
A set of computational resources and
configurations
Runs Data Engineering, Data
Science, and Data Analytics
workloads
Types of Clusters:
- Interactive Clusters
- Automated Clusters
Azure Databricks Artifact - Data
● Create tables directly from imported data.
The table schema is stored in the internal
Databricks metastore
● Use Apache Spark commands to read data
from supported data sources
● Import data into DBFS and use the DBFS
CLI, DBFS API, DBFS utilities, Spark APIs,
and local file APIs to access the data.
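For illustration, reading imported data with Spark commands from a notebook might look like this; the file path is a hypothetical example, and `spark` and `display` are provided by the Databricks notebook runtime:

```scala
// Read a CSV file previously imported into DBFS (hypothetical path)
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("dbfs:/FileStore/tables/sales.csv")

// Render the DataFrame as an interactive table in the notebook
display(df)
```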
Azure Databricks Artifact - Notebooks
Web-Based interface to a Document
containing
- Runnable Code
- Visualizations
- Narrations
Support for multiple languages in the
same notebook
Real-time collaboration on the same
notebook
Revision History of the notebook
Azure Databricks Artifact - Jobs
As an alternative to running notebooks
interactively, you can configure a notebook or
JAR to run immediately or on a schedule.
Three types of tasks can be run as jobs:
- Notebooks
- JARs
- Spark Submit
Notebook and JAR jobs can be configured
by passing parameters
Spark Submit jobs can be configured as well
Azure Databricks Artifact - Libraries
A library can be installed on a cluster to
make third-party or custom code
available to a running notebook or JAR
Libraries can be installed in 3 modes:
- Workspace Libraries
- Cluster Libraries
- Notebook scoped Libraries
Importance of Logs
Spark jobs work with large amounts of data, and their tasks involve time-consuming
computations.
They also run on remote machines, so without logs it becomes difficult to track where an
execution has reached.
Logs help track how far the execution has progressed, and they help the developer find
where the job spends most of its time.
Logs often contain useful metadata, including timestamps, the logger name (which can be
set to the name of the logging class), and source information such as the cluster name. This
data helps in the debugging process.
Using real-time logs and the messages logged in them, alerting techniques can be used to
raise a notification whenever a log entry containing a particular message appears.
Logging in Databricks Scala Notebooks
Databricks’ Log Delivery System
Delivers Spark driver, executor, and event
logs to a location specified while configuring
the cluster
Logs are delivered every 5 minutes
In case a cluster terminates, Databricks makes
sure to deliver all logs generated up to the
termination to the delivery location.
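The delivery location is part of the cluster configuration. A minimal sketch of the relevant fragment of a cluster specification, assuming the `cluster_log_conf` field of the Databricks Clusters API (the destination path is a placeholder):

```json
{
  "cluster_log_conf": {
    "dbfs": {
      "destination": "dbfs:/cluster-logs"
    }
  }
}
```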
Logging in Databricks Scala Notebooks
Default Log4j Properties in Databricks
Two default log4j.properties files ship with each cluster:
- one for the driver
- one for the executors
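For illustration, a minimal log4j 1.x properties file of the kind found there might look like this; this is an illustrative sketch, not the exact Databricks defaults:

```properties
# Root logger: INFO and above, sent to a rolling file appender
log4j.rootCategory=INFO, file
log4j.appender.file=org.apache.log4j.RollingFileAppender
log4j.appender.file.File=logs/log4j-active.log
log4j.appender.file.MaxFileSize=10MB
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
```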
Logging in Databricks Scala Notebooks
Overwriting the default Log4j properties with init scripts
Drawback: every time the configuration changes, a cluster restart is required
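A cluster-scoped init script for this might be sketched as follows; the driver config path is an assumption based on the Databricks KB article referenced at the end, so verify it for your runtime version:

```shell
#!/bin/bash
# Sketch of a cluster init script that appends a custom logger to the
# driver's log4j.properties (assumed path; check your runtime version).
LOG4J_DIR="${LOG4J_DIR:-/home/ubuntu/databricks/spark/dbconf/log4j/driver}"
# Fall back to a temp dir when run outside a cluster (e.g. local testing)
mkdir -p "$LOG4J_DIR" 2>/dev/null || LOG4J_DIR="$(mktemp -d)"
cat >> "$LOG4J_DIR/log4j.properties" <<'EOF'
# Route our application's package (hypothetical name) to DEBUG
log4j.logger.com.example.myjob=DEBUG
EOF
```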
Logging in Databricks Scala Notebooks
Using an external Log4j configuration for
your job
- Create a Log4j properties file for your
custom loggers and appenders
- Upload the file to your required DBFS
location
- Use Log4j's PropertyConfigurator object
to configure logging from your custom
log4j properties
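The steps above can be sketched in a notebook cell as follows; the DBFS path and logger name are hypothetical, and the sketch relies on DBFS paths being exposed to local file APIs under /dbfs:

```scala
import org.apache.log4j.{Logger, PropertyConfigurator}

// Point log4j at the custom properties file uploaded to DBFS
// (hypothetical path; upload your file there first)
PropertyConfigurator.configure("/dbfs/FileStore/configs/custom-log4j.properties")

// Obtain a logger by name and use it from the notebook
val logger = Logger.getLogger("com.example.MyNotebookJob")
logger.info("Custom logging configured")
```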
References
https://kb.databricks.com/clusters/overwrite-log4j-logs.html
https://www.youtube.com/watch?v=cxyUy1bZ9mk
https://forums.databricks.com/questions/17625/how-can-i-customize-log4j.html
https://docs.databricks.com/getting-started/concepts.html
https://blog.knoldus.com/databricks-make-log4j-configurable
Thank You !
Get in touch with us:
Lorem Studio, Lord Building
D4456, LA, USA