Presented By:
Swantika Gupta
Software Consultant
Databricks and
Logging in
Notebooks
Lack of etiquette and manners is a huge turn off.
KnolX Etiquettes
Punctuality
Respect KnolX session timings; please
do not join a session more than
5 minutes after its start time.
Feedback
Make sure to submit constructive
feedback for all sessions, as it is
very helpful for the presenter.
Silent Mode
Keep your mobile devices in silent
mode, and feel free to step out of
the session if you need to attend
an urgent call.
Avoid Disturbance
Avoid unwanted chit-chat during
the session.
Agenda
What is Databricks?
Reasons to use Azure Databricks
Azure Databricks Core Artifacts
Logging in Scala Notebooks
Workspace
Clusters
Notebooks
Libraries
Jobs
Data
What is Databricks?
Industry-leading, zero-management cloud platform built around Apache Spark.
Delivers
- fully managed Spark clusters
- an interactive workspace for exploration and visualization
- a production pipeline scheduler
- a platform for powering your favorite Spark-based applications
So instead of tackling headaches like setting up infrastructure, creating data backups, and
scaling your nodes according to load, you can focus on finding answers that make an
immediate impact on your business.
It is a product offered by a third party, but it is delivered as a first-class service tightly
integrated with AWS and Azure.
Reasons to use Azure Databricks
Familiar Languages and Environment
Higher Productivity and Collaboration
Easy integration with Microsoft Stack
Extensive List of Data Sources
Suitable for Small Jobs too
Extensive Documentation and Support Available
Azure Databricks Core Artifacts
Azure Databricks Artifact - Workspace
An environment inside the Databricks
service, with access to all your
Databricks resources
Organizes various objects, like
Notebooks and JARs, into folders
Provides easy one-click access to
computational resources, like clusters,
and to stored data
Azure Databricks Artifact - Clusters
Core Component of Databricks
A set of computational resources and
configurations
Runs Data Engineering, Data
Science, and Data Analytics
workloads
Types of Clusters:
- Interactive Clusters
- Automated Clusters
Azure Databricks Artifact - Data
● Create tables directly from imported data.
The table schema is stored in the internal
Databricks metastore
● Use Apache Spark commands to read data
from supported data sources
● Import data into DBFS and use the DBFS
CLI, DBFS API, DBFS utilities, Spark APIs,
and local file APIs to access the data.
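For illustration, reading imported data with Spark commands from a notebook might look like this; the file path is a hypothetical example, and `spark` and `display` are provided by the Databricks notebook runtime:

```scala
// Read a CSV file previously imported into DBFS (hypothetical path)
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("dbfs:/FileStore/tables/sales.csv")

// Render the DataFrame as an interactive table in the notebook
display(df)
```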
Azure Databricks Artifact - Notebooks
Web-Based interface to a Document
containing
- Runnable Code
- Visualizations
- Narrations
Support for multiple languages in the
same notebook
Real-time collaboration on the same
notebook
Revision History of the notebook
Azure Databricks Artifact - Jobs
As an alternative to running notebooks
interactively, you can configure a notebook or
JAR to run immediately or on a schedule.
Three types of tasks can be run as jobs:
- Notebooks
- JARs
- Spark Submit
Notebook and JAR jobs can be configured
by passing parameters
Spark Submit jobs can be configured as well
Azure Databricks Artifact - Libraries
A library can be installed on a cluster to
make third-party or custom code
available to a running notebook or JAR
Libraries can be installed in 3 modes:
- Workspace Libraries
- Cluster Libraries
- Notebook scoped Libraries
Importance of Logs
Spark jobs work with large amounts of data, and their tasks involve time-consuming
computations.
They also run on remote machines, so without logs it becomes difficult to track where an
execution has reached.
Logs help track how far the execution has progressed, and they help the developer find
where the job spends most of its time.
Logs often contain useful metadata, including timestamps, the logger name (which can be
set to the name of the logging class), and source information such as the cluster name. This
data helps in the debugging process.
Using real-time logs and the messages logged in them, alerting techniques can be used to
raise a notification whenever a log entry containing a particular message appears.
Logging in Databricks Scala Notebooks
Databricks’ Log Delivery System
Delivers Spark driver, executor, and event
logs to a location specified while configuring
the cluster
Logs are delivered every 5 minutes
In case a cluster terminates, Databricks makes
sure to deliver all logs generated up to the
termination to the delivery location.
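The delivery location is part of the cluster configuration. A minimal sketch of the relevant fragment of a cluster specification, assuming the `cluster_log_conf` field of the Databricks Clusters API (the destination path is a placeholder):

```json
{
  "cluster_log_conf": {
    "dbfs": {
      "destination": "dbfs:/cluster-logs"
    }
  }
}
```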
Logging in Databricks Scala Notebooks
Default Log4j Properties in Databricks
Two default log4j.properties files ship with each cluster:
- one for the driver
- one for the executors
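For illustration, a minimal log4j 1.x properties file of the kind found there might look like this; this is an illustrative sketch, not the exact Databricks defaults:

```properties
# Root logger: INFO and above, sent to a rolling file appender
log4j.rootCategory=INFO, file
log4j.appender.file=org.apache.log4j.RollingFileAppender
log4j.appender.file.File=logs/log4j-active.log
log4j.appender.file.MaxFileSize=10MB
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
```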
Logging in Databricks Scala Notebooks
Overwriting the default Log4j properties with init scripts
Drawback: every time the configuration changes, a cluster restart is required
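A cluster-scoped init script for this might be sketched as follows; the driver config path is an assumption based on the Databricks KB article referenced at the end, so verify it for your runtime version:

```shell
#!/bin/bash
# Sketch of a cluster init script that appends a custom logger to the
# driver's log4j.properties (assumed path; check your runtime version).
LOG4J_DIR="${LOG4J_DIR:-/home/ubuntu/databricks/spark/dbconf/log4j/driver}"
# Fall back to a temp dir when run outside a cluster (e.g. local testing)
mkdir -p "$LOG4J_DIR" 2>/dev/null || LOG4J_DIR="$(mktemp -d)"
cat >> "$LOG4J_DIR/log4j.properties" <<'EOF'
# Route our application's package (hypothetical name) to DEBUG
log4j.logger.com.example.myjob=DEBUG
EOF
```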
Logging in Databricks Scala Notebooks
Using an external Log4j configuration for
your job
- Create a Log4j properties file for your
custom loggers and appenders
- Upload the file to your required DBFS
location
- Use Log4j's PropertyConfigurator object
to configure logging from your custom
log4j properties
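The steps above can be sketched in a notebook cell as follows; the DBFS path and logger name are hypothetical, and the sketch relies on DBFS paths being exposed to local file APIs under /dbfs:

```scala
import org.apache.log4j.{Logger, PropertyConfigurator}

// Point log4j at the custom properties file uploaded to DBFS
// (hypothetical path; upload your file there first)
PropertyConfigurator.configure("/dbfs/FileStore/configs/custom-log4j.properties")

// Obtain a logger by name and use it from the notebook
val logger = Logger.getLogger("com.example.MyNotebookJob")
logger.info("Custom logging configured")
```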
References
https://kb.databricks.com/clusters/overwrite-log4j-logs.html
https://www.youtube.com/watch?v=cxyUy1bZ9mk
https://forums.databricks.com/questions/17625/how-can-i-customize-log4j.html
https://docs.databricks.com/getting-started/concepts.html
https://blog.knoldus.com/databricks-make-log4j-configurable
Thank You !
Get in touch with us:
Lorem Studio, Lord Building
D4456, LA, USA