[Live] Incremental data processing with Hudi & Spark + dbt
December 06, 2023
Shiyan Xu
Apache Hudi PMC
Speaker Bio
Shiyan Xu
❏ PMC member @ Apache Hudi
❏ Open Source Engineer @ Onehouse
❏ ex Tech Lead Manager @ Zendesk
in/xushiyan
@rshiyanxu
blog.datumagic.com
The medallion architecture
Medallion Architecture Overview
So, what does it take to build a medallion architecture?
Challenges in the Medallion Architecture
But … what if you can simplify the medallion architecture?
Simplified architecture with Apache Hudi
Apache Hudi Overview
[Diagram: Apache Hudi as a lakehouse platform, with Apache Kafka and Raw, Cleaned, and Derived tables]
Labels in the diagram:
● Open formats
● CDC incremental change feed
● Transactions + concurrency
● Managed perf tuning
● Auto catalog sync
● Merge-On-Read, stream writers
● +++ more
● Catalogs: AWS Glue Data Catalog, Metastore, BigQuery, + many more
Incremental processing with Spark + dbt
dbt overview
[Diagram: Apache Kafka, then Raw, Cleaned, and Derived tables on lakehouse storage; stages labeled Extract & Load and Transform]
dbt (data build tool)
● handles the T in ELT
● compiles and runs SQL with engines like Spark
Read more: What, exactly, is dbt?
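To make the "compiles and runs SQL" point concrete, here is a rough Python sketch of how a templated model can compile to different SQL depending on whether the run is incremental. It is not dbt's implementation; dbt uses Jinja templates for this. The function name `compile_model` and the table names `raw.events` and `analytics.events` are hypothetical and exist only for illustration.

```python
def compile_model(source: str, target: str, incremental: bool) -> str:
    """Return the SQL text a templated model might compile to.

    Mimics a template with a conditional block: the filter on
    updated_at is emitted only when the run is incremental, i.e. the
    target table already exists and is being added to.
    """
    sql = f"select id, updated_at from {source}"
    if incremental:
        # Only pick up rows newer than what the target already holds.
        sql += f"\nwhere updated_at > (select max(updated_at) from {target})"
    return sql


# First run: no target table yet, so the full source is selected.
full_sql = compile_model("raw.events", "analytics.events", incremental=False)

# Later runs: the same model compiles with the extra filter.
incremental_sql = compile_model("raw.events", "analytics.events", incremental=True)

print(full_sql)
print(incremental_sql)
```

The same model source therefore yields a full load on the first run and a filtered load afterwards, and the engine (Spark in this deck) only ever sees the compiled SQL.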
dbt project structure
● tells dbt the project context
● lets dbt know how to build a specific data set
● defines transformations between data sets
● defines data set schemas
● contains compiled/runtime SQL
dbt case study: update user profiles
[Diagram: Profile update events → Raw updates → Profiles → Profile changes → Downstream jobs]
dbt case study: update user profiles
-- raw_updates.sql
{{
    config(
        materialized='incremental',
        file_format='hudi',
        incremental_strategy='insert_overwrite'
    )
}}

with source_data as (
    select '101' as user_id, 'A' as city, unix_timestamp() as updated_at
    union all
    select '102' as user_id, 'B' as city, unix_timestamp() as updated_at
    union all
    select '103' as user_id, 'C' as city, unix_timestamp() as updated_at
)

select *
from source_data

select user_id, city, updated_at from raw_updates
+-------+----+----------+
|user_id|city|updated_at|
+-------+----+----------+
|    101|   A|1701083620|
|    103|   C|1701083620|
|    102|   B|1701083620|
|    101|   D|1701084137|
|    102|   E|1701084365|
|    103|   F|1701084369|
+-------+----+----------+
dbt case study: update user profiles
-- profiles.sql
{{
    config(
        materialized='incremental',
        incremental_strategy='merge',
        merge_update_columns=['city', 'updated_at'],
        unique_key='user_id',
        file_format='hudi',
        options={
            'type': 'cow',
            'primaryKey': 'user_id',
            'preCombineField': 'updated_at',
            'hoodie.table.cdc.enabled': 'true'
        }
    )
}}

with new_updates as (
    select user_id, city, updated_at from {{ ref('raw_updates') }}
    {% if is_incremental() %}
    where updated_at > (select max(updated_at) from {{ this }})
    {% endif %}
)

select user_id, city, updated_at from new_updates

select user_id, city, updated_at from profiles
+-------+----+----------+
|user_id|city|updated_at|
+-------+----+----------+
|    101|   D|1701084137|
|    102|   E|1701084365|
|    103|   F|1701084369|
+-------+----+----------+
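
The incremental filter and the merge on `user_id` in profiles.sql can be sketched in plain Python. This is a simplified simulation under stated assumptions, not how dbt or Hudi execute the merge: the function name `incremental_merge` is hypothetical, the target table is modeled as a dict keyed by `user_id`, and "keep the row with the larger `updated_at`" only loosely mirrors the role of `preCombineField`. The sample rows are the ones shown in the raw_updates output.

```python
from typing import Dict, List, Tuple

Row = Tuple[str, str, int]  # (user_id, city, updated_at)


def incremental_merge(target: Dict[str, Row], source: List[Row]) -> Dict[str, Row]:
    """Merge source rows into target, keyed by user_id.

    If the target already has rows (the incremental case), only source
    rows newer than the target's max updated_at are considered.
    """
    merged = dict(target)
    if merged:
        # Analogous to: where updated_at > (select max(updated_at) from this)
        high_water_mark = max(row[2] for row in merged.values())
        source = [row for row in source if row[2] > high_water_mark]
    for row in source:
        current = merged.get(row[0])
        # On a key collision keep the row with the larger updated_at.
        if current is None or row[2] >= current[2]:
            merged[row[0]] = row
    return merged


raw_updates: List[Row] = [
    ("101", "A", 1701083620),
    ("103", "C", 1701083620),
    ("102", "B", 1701083620),
    ("101", "D", 1701084137),
    ("102", "E", 1701084365),
    ("103", "F", 1701084369),
]

# First run: only the first three rows exist, target is empty.
profiles = incremental_merge({}, raw_updates[:3])
# Later run: raw_updates has grown; only rows past the high-water mark apply.
profiles = incremental_merge(profiles, raw_updates)

for row in sorted(profiles.values()):
    print(row)
```

Running the two steps leaves one row per user with cities D, E, and F, which matches the profiles output above.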
dbt case study: update user profiles
-- profile_changes.sql
{{
    config(
        materialized='incremental',
        file_format='hudi'
    )
}}

with new_changes as (
    select
        GET_JSON_OBJECT(after, '$.user_id') AS user_id,
        GET_JSON_OBJECT(after, '$.city') AS new_city,
        ts_ms as process_ts
    from hudi_table_changes(
        'dbt_example_cdc.profiles',
        'cdc',
        from_unixtime(unix_timestamp() - 3600 * 24, 'yyyyMMddHHmmss')
    )
    {% if is_incremental() %}
    where ts_ms > (select max(process_ts) from {{ this }})
    {% endif %}
)

select user_id, new_city, process_ts
from new_changes

select user_id, new_city
from profile_changes
+-------+--------+
|user_id|new_city|
+-------+--------+
|    102|       E|
|    103|       F|
|    101|       D|
+-------+--------+
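
As a rough illustration of what profile_changes.sql consumes, the Python sketch below builds change records from two table snapshots and then projects `user_id` and `city` out of the JSON `after` image, as the model does with `GET_JSON_OBJECT`. The record shape is an assumption: `after` and `ts_ms` come from the query above, while the `op` codes, the `before` field, and the function names `diff_snapshots` and `project_changes` are illustrative and not taken from the slides. Unlike the SQL model, the sketch skips deletes instead of emitting nulls.

```python
import json
from typing import Dict, List, Optional, Tuple

Snapshot = Dict[str, Dict[str, str]]  # user_id -> row


def diff_snapshots(before: Snapshot, after: Snapshot, ts_ms: int) -> List[dict]:
    """Emit one change record per key whose row differs between snapshots."""
    changes = []
    for key in sorted(set(before) | set(after)):
        old: Optional[dict] = before.get(key)
        new: Optional[dict] = after.get(key)
        if old == new:
            continue  # unchanged rows produce no change record
        op = "i" if old is None else "d" if new is None else "u"
        changes.append({
            "op": op,
            "ts_ms": ts_ms,
            # Row images are stored as JSON strings (None when absent).
            "before": json.dumps(old) if old is not None else None,
            "after": json.dumps(new) if new is not None else None,
        })
    return changes


def project_changes(changes: List[dict]) -> List[Tuple[str, str, int]]:
    """Extract (user_id, new_city, process_ts) from each after image."""
    rows = []
    for change in changes:
        if change["after"] is None:
            continue  # deletes have no after image
        image = json.loads(change["after"])
        rows.append((image["user_id"], image["city"], change["ts_ms"]))
    return rows


old_profiles = {
    "101": {"user_id": "101", "city": "A"},
    "102": {"user_id": "102", "city": "B"},
}
new_profiles = {
    "101": {"user_id": "101", "city": "D"},
    "102": {"user_id": "102", "city": "B"},
    "103": {"user_id": "103", "city": "F"},
}

change_feed = diff_snapshots(old_profiles, new_profiles, ts_ms=1701084137000)
print(project_changes(change_feed))
```

Downstream jobs then read only the projected change rows rather than rescanning the full profiles table.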
dbt docs UI
dbt x Hudi recap
● dbt supports incremental & merge semantics
● Hudi CDC feature supports rich data capabilities and fits the incremental model
● Efficiency & cost-saving
● Sample code @ https://github.com/apache/hudi/tree/master/hudi-examples/hudi-examples-dbt
Come Build With The Community!
Check out Hudi docs 🔖
Give us a star on GitHub ⭐
Join Hudi Slack 👥
Follow us on LinkedIn!
Join our Twitter Community!
Subscribe to our mailing list (send an empty email to subscribe) 📩
Subscribe to the Apache Hudi YouTube channel
Thanks!
Questions?
