Data Discovery at Databricks with Amundsen

Tao Feng
Tianru Zhou

Who
Tao Feng
▪ Engineer at Databricks
▪ Co-creator of Amundsen
▪ Apache Airflow PMC
▪ Previously worked at Lyft, LinkedIn, Oracle
Tianru Zhou
▪ Engineer at Databricks
▪ Previously worked at AWS Elasticsearch

Data-Driven Decisions
Analysts, Data Scientists, General Managers, Engineers, Experimenters, Product Managers
● Axiom: Good decisions are based on data
● Who needs Data? Anyone who wants to make good decisions
○ HR wants to ensure salaries are competitive with market
○ Politician wants to optimize campaign strategy

Data-Driven Decisions
1. Data is Collected
2. Analyst Finds the Data
3. Analyst Understands the Data
4. Analyst Creates Report
5. Analyst Shares the Results
6. Someone Makes a Decision

Data Discovery Not Productive
● Data Scientists spend up to 30% of their time in Data Discovery
● Data Discovery in itself provides little to no intrinsic value. Impactful work happens in Analysis.
● The answer to these problems is Metadata / Data Catalog

Data Catalog to the rescue
• Ease of documentation and discoverability
‒ Single searchable portal
‒ Display dependencies / lineage between data entities (tables, dashboards)
• Help to answer questions like:
‒ Where can I find data about ___?
‒ What is the context about the data?
‒ Who are the owners that I can ask for access?
‒ How is the data created? Is the data trustworthy?
‒ How should I use the data? Are there sample queries or column statistics?
‒ How frequently does the data refresh?
‒ ...

What is Amundsen
• In a nutshell, Amundsen is an open-source data discovery and metadata platform for improving the productivity of data analysts, data scientists, and engineers when interacting with data.
• Amundsen is currently hosted at Linux Foundation Data & AI (formerly LFAI) as an incubation project with open governance and an RFC process (e.g., blog post).

Lineage between dashboards and dataset

Search for existing dashboards/reports

Announcement page
• Pluggable client for announcing new features or new datasets

Central data quality issue portal
• Central portal for users to report data issues.
• Users can see all past issues as well.
• Users can request further context / descriptions from owners through the portal.

Data Preview
• Supports data preview for datasets.
• Pluggable client for different BI Viz tools (e.g., Apache Superset, BigQuery).

Databricks: The Data and AI Company
● 5000+ customers across the globe
● Lakehouse: one simple platform to unify all of your data, analytics, and AI workloads
● Original creators (project logos not preserved)

Databricks Lakehouse
● BI Reports & Dashboards
● Data Science Workspace
● Machine Learning Lifecycle
● Structured, Semi-Structured and Unstructured Data
● DELTA ENGINE
● Structured transaction layer
● High performance query engine

Internal dataset discovery at Databricks
● Statically maintained wiki page for golden tables of the central workspace
● Metadata easily becomes stale
● Amundsen to the rescue!

Deployment (detailed)
● Access: vpn
● Control plane (amundsen ns): Load balancer, amundsen-frontend, amundsen-search, amundsen-metadata, neo4j
● LB
● Data plane: Databricks notebook, Databricks job service
● Amazon RDS to store connections

Development
● Base layer: open source Amundsen (git submodule)
● Layer m: private changes
● Layer n: private changes

Notebook version control
● Databricks private repo
● Databricks notebook
● Generate & grant access token
● Syncing changes

Metadata surfaced in Amundsen
Lineage information
• Downstream/Upstream tables
• Downstream jobs
• Downstream users of the table
• Job that writes the table
• Writer of the table
Statistics
• Column stats
• Dataset frequent users
Extended information
• Delta table extended metadata
• Redash Dashboards
• Sample data

Lineage information
Jobs that write the table
Writer of table
Main lineage info

How is the lineage table generated?
Raw lineage pipeline
1. Usage_logs feeds ReadEventTable (reads) and WriteEventTable (writes)
2. Insights_table: cleaning + workload aggregation
3. Graph: Read <-> Workload <-> Write
4. Raw Lineage table, with raw table paths
Raw -> processed lineage
● dbfs:/user/hive/… → db.table: string processing
● Paths → View conversion: get Delta metadata (Describe Extended) + string processing + heuristics
● Mount point → Blob path: get mount points (dbutils.fs.mounts()) + string processing
● Result: Processed Lineage table, with table names
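The Read <-> Workload <-> Write step can be illustrated with a small join: every table a workload reads becomes an upstream of every table the same workload writes. This is only a sketch under stated assumptions. The `Event` type, the `workload_id` field, and the table names are hypothetical, and the real pipeline runs on the usage-log tables rather than on in-memory lists.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple, Set, Tuple


class Event(NamedTuple):
    """One read or write event (hypothetical shape, not the real log schema)."""
    workload_id: str  # assumed key identifying a single job or notebook run
    table: str        # raw path or table name as it appears in the logs


def build_lineage(reads: Iterable[Event], writes: Iterable[Event]) -> Set[Tuple[str, str]]:
    """Return (upstream, downstream) pairs linked through a shared workload."""
    read_by_workload = defaultdict(set)
    for event in reads:
        read_by_workload[event.workload_id].add(event.table)

    edges = set()
    for event in writes:
        for source in read_by_workload.get(event.workload_id, ()):
            if source != event.table:  # ignore self-loops from read-modify-write jobs
                edges.add((source, event.table))
    return edges


reads = [
    Event("job-1", "db.raw_events"),
    Event("job-1", "db.users"),
    Event("job-2", "db.sessions"),
]
writes = [
    Event("job-1", "db.sessions"),
    Event("job-2", "db.daily_report"),
]
lineage = build_lineage(reads, writes)
for upstream, downstream in sorted(lineage):
    print(f"{upstream} -> {downstream}")
```

Keeping the result as a set of pairs makes it straightforward to derive both the upstream and the downstream view of a table from the same data.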

Statistics information
● Column statistics for numeric data type
● Frequent users: raw usage data also comes from the usage_logs table
Get column stats
analyze table {table} compute statistics for columns col1, col2
describe extended {db}.{table} `{column name}`
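As a rough illustration of the two statements above, the sketch below builds the Spark SQL strings for one table. The function name and the example database, table, and column names are hypothetical. In a Databricks notebook each string would presumably be passed to `spark.sql(...)`, which is left out here so the snippet runs without a cluster.

```python
from typing import List, Sequence


def column_stats_statements(db: str, table: str, columns: Sequence[str]) -> List[str]:
    """Build one ANALYZE statement plus one DESCRIBE EXTENDED per column."""
    if not columns:
        raise ValueError("at least one column is required")
    full_name = f"{db}.{table}"
    statements = [
        f"ANALYZE TABLE {full_name} COMPUTE STATISTICS FOR COLUMNS {', '.join(columns)}"
    ]
    # After ANALYZE has run, DESCRIBE EXTENDED on a column reports its statistics.
    statements += [f"DESCRIBE EXTENDED {full_name} `{col}`" for col in columns]
    return statements


statements = column_stats_statements("sales", "orders", ["amount", "quantity"])
for statement in statements:
    print(statement)
```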

Delta table extended metadata
For a Delta table, we can run: describe detail table_name
For a Delta table view, we can run: describe detail table_name
Extract extended metadata
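A minimal sketch of what could happen to the `describe detail` output once it has been collected: keep a subset of its columns as extended metadata. The column names follow the Delta Lake `DESCRIBE DETAIL` output, but the choice of which ones to keep, the function name, and the sample values are assumptions for illustration, not what the talk's pipeline necessarily does.

```python
from typing import Any, Dict, Mapping

# Subset of DESCRIBE DETAIL columns kept by this sketch (the selection is an assumption).
DETAIL_FIELDS = (
    "format",
    "location",
    "createdAt",
    "lastModified",
    "partitionColumns",
    "numFiles",
    "sizeInBytes",
)


def extract_extended_metadata(detail_row: Mapping[str, Any]) -> Dict[str, Any]:
    """Keep the selected, non-null fields from one DESCRIBE DETAIL row."""
    return {
        field: detail_row[field]
        for field in DETAIL_FIELDS
        if detail_row.get(field) is not None
    }


# Hypothetical row, shaped like spark.sql("describe detail ...").first().asDict()
sample_row = {
    "format": "delta",
    "id": "hypothetical-id",
    "name": "sales.orders",
    "description": None,
    "location": "dbfs:/hypothetical/path/orders",
    "createdAt": "2021-05-01 00:00:00",
    "lastModified": "2021-05-20 12:30:00",
    "partitionColumns": ["order_date"],
    "numFiles": 42,
    "sizeInBytes": 1048576,
}
metadata = extract_extended_metadata(sample_row)
print(metadata)
```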

Notebook structure
The open source delta_lake_metadata_extractor can be extended easily.
Step 1. Extract delta lake metadata + Publish to Neo4j
Step 2. Publish to Elasticsearch
Step 3. Cleanup stale data
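One common way to implement a cleanup step like Step 3 is to stamp every record with the tag of the run that published it, then delete whatever still carries an older tag. The sketch below shows that idea on an in-memory dictionary. It is not the databuilder implementation, and the `published_tag` field, the function names, and the table keys are assumptions made for illustration.

```python
from typing import Dict, List


def publish(store: Dict[str, dict], extracted: Dict[str, dict], tag: str) -> None:
    """Upsert extracted metadata and stamp each record with the current run tag."""
    for key, metadata in extracted.items():
        store[key] = {**metadata, "published_tag": tag}


def cleanup_stale(store: Dict[str, dict], tag: str) -> List[str]:
    """Delete records that the current run did not publish; return their keys."""
    stale = [key for key, record in store.items() if record["published_tag"] != tag]
    for key in stale:
        del store[key]
    return stale


store: Dict[str, dict] = {}

# First run sees two tables.
publish(store, {"db.orders": {"owner": "a"}, "db.old_table": {"owner": "b"}}, tag="run-1")

# Second run: db.old_table was dropped upstream, so it is not extracted again.
publish(store, {"db.orders": {"owner": "a"}}, tag="run-2")
removed = cleanup_stale(store, tag="run-2")
print(removed, sorted(store))
```

Running the cleanup only after a successful publish keeps a failed extraction from wiping out metadata that is still valid.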

Redash dashboards
● All Redash dashboards that use this table
● View in Redash
● Copy button

Sample data
● Sample data tab (example)

Amundsen Open Source
● 1500+ community members
● 2k+ stars for the repo
● 30+ companies using in production
● Also part of the top 20 most popular OSS data projects in 2021, based on the Data Council survey

Notable RFCs / PRs
● AWS Neptune metadata datastore (RFC#13)
● MySQL metadata datastore (RFC#019, RFC#021, RFC#023)
● Lineage frontend and backend (RFC#025, RFC#032)
● ETL push model paradigm (PR)
● Other RFCs can be found here

Summary
● Solve data discovery challenges with Amundsen
● Integrate Amundsen with Databricks infrastructure
● Amundsen OSS adoption is growing significantly
