Data Discovery At
Databricks With
Amundsen
Tao Feng
Tianru Zhou
Who
Tao Feng
▪ Engineer at Databricks
▪ Co-creator of Amundsen
▪ Apache Airflow PMC
▪ Previously worked at Lyft, LinkedIn, Oracle
Tianru Zhou
▪ Engineer at Databricks
▪ Previously worked at AWS Elasticsearch
Data Discovery & Challenges
Data-Driven Decisions
Analysts, Data Scientists, General Managers, Engineers, Experimenters, Product Managers
● Axiom: Good decisions are based on data
● Who needs Data? Anyone who wants to make good decisions
○ HR wants to ensure salaries are competitive with market
○ Politician wants to optimize campaign strategy
Data-Driven Decisions
1. Data is Collected
2. Analyst Finds the Data
3. Analyst Understands the Data
4. Analyst Creates Report
5. Analyst Shares the Results
6. Someone Makes a Decision
Data Discovery Not Productive
● Data Scientists spend up to 30% of their
time in Data Discovery
● Data Discovery in itself provides little to
no intrinsic value. Impactful work
happens in Analysis.
● The answer to these problems is
Metadata / Data Catalog
Data Catalog to the rescue
• Ease of documentation and discoverability
‒ Single searchable portal
‒ Display dependencies / lineage between data entities (tables, dashboards)
• Help to answer questions like:
‒ Where can I find data about ___?
‒ What is the context about the data?
‒ Who are the owners that I can ask for access?
‒ How is the data created? Is the data trustable?
‒ How should I use the data? Are there sample queries or column statistics?
‒ How frequently does the data refresh?
‒ ...
Introducing Amundsen
What is Amundsen
• In a nutshell, Amundsen is an open-source data discovery and metadata
platform for improving the productivity of data analysts, data scientists,
and engineers when interacting with data.
• Amundsen is currently hosted at Linux Foundation Data & AI (formerly
LFAI) as an incubation project with open governance and an RFC process
(e.g., blog post).
Amundsen homepage
Dataset detail page
Lineage between dashboards and dataset
Search for existing dashboards/reports
Dashboard detail page
Search for co-workers
User Profile page
Announcement page
• Plugin client to support new features or new datasets
Central data quality issue portal
• Central portal for users to report data issues.
• Users can also see all past issues.
• Users can request further context / descriptions from owners through the portal.
Data Preview
• Supports data preview for datasets.
• Plugin client for different BI visualization tools (e.g., Apache Superset, BigQuery).
Amundsen @ Databricks
The Data and AI Company
5000+ customers across the globe
Lakehouse: one simple platform to unify all of your data, analytics, and AI workloads
Original creators
Databricks Lakehouse
BI Reports &
Dashboards
Data
Science
Workspace
Machine
Learning
Lifecycle
Structured, Semi-Structured and Unstructured Data
DELTA ENGINE
Structured
transaction layer
High performance
query engine
Internal dataset discovery at Databricks
● Statically maintained wiki page for golden tables of the central workspace
● Metadata easily becomes stale
● Amundsen to the rescue!
Databricks Deployment
Deployment (detailed)
vpn
Control plane
amundsen ns
Load balancer
amundsen-frontend
amundsen-search amundsen-metadata
neo4j
LB
Data plane
Databricks notebook
Databricks job service
Amazon RDS to store
connections
Development
Open source amundsen (git
submodule)
Private changes
Private changes
Base layer
Layer m
Layer n
Notebook version control
Databricks private repo
Databricks notebook
Generate & grant
access token
Syncing changes
Metadata surfaced in Amundsen
• Downstream/Upstream tables
• Downstream jobs
• Downstream users of the table
• Job that writes the table
• Writer of the table
• Column stats
• Dataset frequent users
• Delta table extended metadata
• Redash Dashboards
• Sample data
Lineage information
Statistics
Extended information
Lineage information
Jobs that write the table
Writer of table
Main lineage info
What is table lineage
How is the lineage table generated?
Raw lineage pipeline Raw -> processed lineage
Usage_logs
ReadEventTable (reads)
WriteEventTable (writes)
Insights_table
Cleaning + workload aggregation
Graph
Read <-> Workload <-> Write
Raw Lineage table
With raw table paths
dbfs:/user/hive/… → db.table
String processing
Paths → View conversion
Get Delta metadata (Describe Extended) +
String processing + heuristics
Mount point → Blob path
Get mount points (dbutils.fs.mounts()) +
String processing
Processed Lineage table
With table Names
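The lineage slide above can be sketched in plain Python. This is a minimal illustration, not the pipeline used at Databricks: the function names `resolve_mount` and `build_lineage`, the event tuple shapes, and all sample paths are assumptions made for the example. It shows the two ideas named on the slide: joining reads and writes through a shared workload (Read <-> Workload <-> Write), and replacing a mount-point prefix with the blob path it points to.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def resolve_mount(path: str, mounts: Iterable[Tuple[str, str]]) -> str:
    """Replace a mount-point prefix with the blob path it points to.

    `mounts` holds (mount_point, source) pairs, the two fields one would
    read from each entry of dbutils.fs.mounts() inside a notebook.
    """
    # Try longer mount points first so nested mounts resolve correctly.
    for mount_point, source in sorted(mounts, key=lambda m: len(m[0]), reverse=True):
        prefix = mount_point.rstrip("/")
        if path == prefix or path.startswith(prefix + "/"):
            return source.rstrip("/") + path[len(prefix):]
    return path  # not under any known mount point: leave unchanged


def build_lineage(
    reads: Iterable[Tuple[str, str]],
    writes: Iterable[Tuple[str, str]],
    path_to_table: Dict[str, str],
) -> Set[Tuple[str, str]]:
    """Join (workload_id, path) read and write events on the workload id.

    Returns (upstream, downstream) edges. Paths with a known table name are
    reported by name; unknown paths are kept as raw paths.
    """
    inputs_by_workload = defaultdict(set)
    for workload_id, path in reads:
        inputs_by_workload[workload_id].add(path_to_table.get(path, path))

    edges = set()
    for workload_id, path in writes:
        target = path_to_table.get(path, path)
        for source in inputs_by_workload.get(workload_id, ()):
            if source != target:  # skip self-loops from read-modify-write jobs
                edges.add((source, target))
    return edges
```

In this sketch the raw lineage table corresponds to the edges built from raw paths, and the processed lineage table corresponds to the same edges after `path_to_table` and `resolve_mount` have been applied. The heuristics for views mentioned on the slide are not modeled here.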
Statistics information
Column statistics for numeric data
type
Frequent users
Raw usage data also
comes from usage_logs
table
analyze table {table} compute statistics for columns col1, col2
describe extended {db}.{table} `{column name}`
Get column stats
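As a rough sketch of how the two statements above could be generated for a given table, the helper below builds the SQL strings. The function name `column_stats_statements` and the backtick-quoting helper are assumptions for illustration and do not come from the talk.

```python
from typing import List, Sequence


def _quote(identifier: str) -> str:
    """Backtick-quote an identifier, doubling any embedded backticks."""
    return "`" + identifier.replace("`", "``") + "`"


def column_stats_statements(db: str, table: str, columns: Sequence[str]) -> List[str]:
    """Build one ANALYZE statement plus one DESCRIBE EXTENDED per column."""
    if not columns:
        raise ValueError("at least one column is required")
    full_name = f"{_quote(db)}.{_quote(table)}"
    column_list = ", ".join(_quote(c) for c in columns)
    statements = [
        f"ANALYZE TABLE {full_name} COMPUTE STATISTICS FOR COLUMNS {column_list}"
    ]
    # Each DESCRIBE EXTENDED call returns the stored statistics for one column.
    statements += [f"DESCRIBE EXTENDED {full_name} {_quote(c)}" for c in columns]
    return statements
```

In a notebook, each returned string would typically be passed to `spark.sql(...)`. The ANALYZE statement has to run before the DESCRIBE statements for the statistics to be populated.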
Delta table extended metadata
For delta table, we can run:
describe detail table_name
For delta table view, we can
run:
describe detail table_name
Extract extended metadata
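One possible way to turn a `describe detail` result into displayable key/value pairs is sketched below. The choice of fields in `DETAIL_FIELDS` and the function name `extended_metadata` are assumptions for this example; the talk does not say which fields are surfaced in Amundsen.

```python
from typing import Any, Dict, Mapping

# Assumed subset of the columns returned by DESCRIBE DETAIL.
# The command returns more columns than the ones listed here.
DETAIL_FIELDS = (
    "format",
    "location",
    "createdAt",
    "lastModified",
    "partitionColumns",
    "numFiles",
    "sizeInBytes",
)


def extended_metadata(detail_row: Mapping[str, Any]) -> Dict[str, str]:
    """Keep the selected fields and render every value as a string."""
    metadata = {}
    for field_name in DETAIL_FIELDS:
        value = detail_row.get(field_name)
        if value is None:
            continue  # field missing or null for this table
        if isinstance(value, (list, tuple)):
            value = ", ".join(str(v) for v in value)
        metadata[field_name] = str(value)
    return metadata
```

In a notebook, the input mapping could come from something like `spark.sql("describe detail table_name").first().asDict()`.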
Notebook structure
Open source delta_lake_metadata_extractor
can be extended easily.
Notebook structure
Step 1. Extract delta lake metadata + Publish to Neo4j
Step 2. Publish to Elasticsearch
Notebook structure
Step 3. Cleanup stale data
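The three notebook steps can be illustrated with a small, self-contained sketch. It does not use the Amundsen databuilder API: `Store` is a hypothetical in-memory stand-in for Neo4j or Elasticsearch, and tagging every record with the current run so that older records can be removed is an assumption about how stale data is identified, not something stated in the slides.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List


@dataclass
class Store:
    """In-memory stand-in for a metadata store (hypothetical)."""
    records: Dict[str, dict] = field(default_factory=dict)

    def publish(self, rows: Iterable[dict], run_tag: str) -> None:
        # Upsert each row and stamp it with the tag of the current run.
        for row in rows:
            self.records[row["key"]] = {**row, "published_tag": run_tag}

    def remove_stale(self, run_tag: str) -> List[str]:
        # Anything not touched by the current run is considered stale.
        stale = sorted(
            key for key, rec in self.records.items()
            if rec["published_tag"] != run_tag
        )
        for key in stale:
            del self.records[key]
        return stale


def run_sync(extracted: Iterable[dict], graph: Store, search: Store,
             run_tag: str) -> Dict[str, List[str]]:
    """Run the three steps in order and report what was cleaned up."""
    rows = list(extracted)
    graph.publish(rows, run_tag)    # Step 1: extracted metadata -> graph store
    search.publish(rows, run_tag)   # Step 2: publish to the search index
    return {                        # Step 3: clean up stale data
        "graph": graph.remove_stale(run_tag),
        "search": search.remove_stale(run_tag),
    }
```

Running the cleanup only after both publish steps succeed means a failed run leaves the previous metadata in place rather than an empty catalog.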
Redash dashboards
All redash dashboards that use this
table
Redash dashboards
View in redash
Copy button
Sample data
Sample data tab
Example:
WAU
Amundsen Open Source
Amundsen Open Source
1500+ community members
2k+ stars for the repo
30+ companies using it in production
Also among the top 20 most popular OSS data projects in 2021, based on the Data Council survey
Notable RFCs / PRs
● AWS Neptune metadata datastore (RFC#13)
● MySQL metadata datastore (RFC#019, RFC#021, RFC#023)
● Lineage frontend and backend (RFC#025, RFC#032)
● ETL push model paradigm (PR)
● Other RFCs can be found here
Summary
Summary
● Solve data discovery challenges with Amundsen
● Integrate Amundsen with Databricks infrastructure
● Amundsen OSS adoption is growing significantly

