Incremental data processing with Hudi & Spark + dbt.pdf

[Live] Incremental data processing with Hudi & Spark + dbt
December 06, 2023
Shiyan Xu
Apache Hudi PMC

Speaker Bio
Shiyan Xu
❏ PMC member @ Apache Hudi
❏ Open Source Engineer @ Onehouse
❏ ex Tech Lead Manager @ Zendesk
in/xushiyan
@rshiyanxu
blog.datumagic.com

Medallion Architecture Overview

So, what does it take to build a medallion architecture?

Challenges in the Medallion Architecture

But ... what if you could simplify the medallion architecture?

Simplified architecture with Apache Hudi

Apache Hudi Overview
[Diagram: Apache Kafka; Raw, Cleaned, and Derived tables on the Lakehouse Platform. Feature labels: Open Formats, CDC Incremental Change Feed, Transactions + Concurrency, Managed Perf Tuning, Auto Catalog Sync, Merge-On-Read, Stream Writers, and more. Catalogs: AWS Glue Data Catalog, Metastore, BigQuery, and many more.]

Incremental
processing with
Spark + dbt

dbt overview
[Diagram: Apache Kafka; Raw, Cleaned, and Derived tables on lakehouse storage; stages labeled "Extract & Load" and "Transform"]
dbt (data build tool)
● handles the T in ELT
● compiles and runs SQL with engines like Spark
Read more: What, exactly, is dbt?
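
To illustrate "compiles and runs SQL": a dbt model is a template that compiles to different SQL depending on run context. The sketch below is a plain-Python stand-in, not dbt's actual API; compile_model, source_table, and target_table are hypothetical names, and the branch only approximates when dbt's is_incremental() is true.

```python
# Hypothetical stand-in for dbt's template compilation (not dbt's real API).
def compile_model(target_exists: bool, full_refresh: bool = False) -> str:
    """Return the SQL a simplified incremental model would compile to."""
    sql = "select user_id, city, updated_at from source_table"
    # Approximates is_incremental(): the target table already exists
    # and a full refresh was not requested.
    if target_exists and not full_refresh:
        sql += "\nwhere updated_at > (select max(updated_at) from target_table)"
    return sql


first_run_sql = compile_model(target_exists=False)
later_run_sql = compile_model(target_exists=True)
print(first_run_sql)
print(later_run_sql)
```

On the first run the whole source is selected; on later runs only rows newer than what the target already holds.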

dbt project structure
● tells dbt the project context
● lets dbt know how to build a specific data set
● defines transformations between data sets
● defines data set schemas
● contains compiled/runtime SQL

dbt case study: update user profiles
Profile update events → Raw updates → Profiles → Profile changes → Downstream jobs

-- raw_updates.sql
{{
    config(
        materialized='incremental',
        file_format='hudi',
        incremental_strategy='insert_overwrite'
    )
}}

with source_data as (
    select '101' as user_id, 'A' as city, unix_timestamp() as updated_at
    union all
    select '102' as user_id, 'B' as city, unix_timestamp() as updated_at
    union all
    select '103' as user_id, 'C' as city, unix_timestamp() as updated_at
)
select *
from source_data

-- querying raw_updates
select user_id, city, updated_at from raw_updates

+-------+----+----------+
|user_id|city|updated_at|
+-------+----+----------+
|    101|   A|1701083620|
|    103|   C|1701083620|
|    102|   B|1701083620|
|    101|   D|1701084137|
|    102|   E|1701084365|
|    103|   F|1701084369|
+-------+----+----------+

-- profiles.sql
-- Note: materialized='incremental' is not set in this config block;
-- is_incremental() below only takes effect if the model is configured
-- as incremental (e.g. at the project level).
{{
    config(
        incremental_strategy='merge',
        merge_update_columns = ['city', 'updated_at'],
        unique_key='user_id',
        file_format='hudi',
        options={
            'type': 'cow',
            'primaryKey': 'user_id',
            'preCombineField': 'updated_at',
            'hoodie.table.cdc.enabled': 'true'
        }
    )
}}

with new_updates as (
    select user_id, city, updated_at from {{ ref('raw_updates') }}
    {% if is_incremental() %}
    -- only pick up rows newer than what this table already holds
    where updated_at > (select max(updated_at) from {{ this }})
    {% endif %}
)
select user_id, city, updated_at from new_updates

-- querying profiles
select user_id, city, updated_at from profiles

+-------+----+----------+
|user_id|city|updated_at|
+-------+----+----------+
|    101|   D|1701084137|
|    102|   E|1701084365|
|    103|   F|1701084369|
+-------+----+----------+
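
The step from six raw_updates rows to three profiles rows can be sketched in plain Python. This is a simplified simulation, not Hudi's or dbt's implementation: incremental_merge is a hypothetical helper, the split into two runs is an assumption read off the timestamps, and the tie-break only loosely mimics what preCombineField is for.

```python
# Rows from the raw_updates output above: (user_id, city, updated_at).
raw_updates = [
    ("101", "A", 1701083620), ("103", "C", 1701083620), ("102", "B", 1701083620),
    ("101", "D", 1701084137), ("102", "E", 1701084365), ("103", "F", 1701084369),
]


def incremental_merge(target: dict, source: list) -> dict:
    """Upsert rows newer than the target's max updated_at, keyed by user_id."""
    # Incremental filter: with an empty target (first run) every row qualifies.
    watermark = max((row[2] for row in target.values()), default=None)
    new_rows = [r for r in source if watermark is None or r[2] > watermark]
    merged = dict(target)
    for user_id, city, updated_at in new_rows:
        current = merged.get(user_id)
        # Assumed tie-break: keep the row with the larger updated_at.
        if current is None or updated_at >= current[2]:
            merged[user_id] = (user_id, city, updated_at)
    return merged


# Assumption: the first run saw only the initial three rows.
profiles = incremental_merge({}, raw_updates[:3])
profiles = incremental_merge(profiles, raw_updates)
for row in sorted(profiles.values()):
    print(row)
```

The second run touches only the three newer rows, and each replaces the existing row for its user_id, which matches the profiles output above.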

-- profile_changes.sql
{{
    config(
        file_format='hudi'
    )
}}

with new_changes as (
    select
        -- 'after' holds the changed row as JSON
        GET_JSON_OBJECT(after, '$.user_id') AS user_id,
        GET_JSON_OBJECT(after, '$.city') AS new_city,
        ts_ms as process_ts
    -- read CDC records of the profiles table from the last 24 hours
    from hudi_table_changes('dbt_example_cdc.profiles', 'cdc',
        from_unixtime(unix_timestamp() - 3600 * 24, 'yyyyMMddHHmmss'))
    {% if is_incremental() %}
    -- skip changes that were already processed
    where ts_ms > (select max(process_ts) from {{ this }})
    {% endif %}
)
select user_id, new_city, process_ts
from new_changes

-- querying profile_changes
select user_id, new_city
from profile_changes

+-------+--------+
|user_id|new_city|
+-------+--------+
|    102|       E|
|    103|       F|
|    101|       D|
+-------+--------+
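
What profile_changes does with each CDC record can be sketched as follows. This is a simulation under stated assumptions, not Hudi's reader: extract_new_changes is a hypothetical helper, the ts_ms values are made up for illustration, and each record carries only the two fields the model uses.

```python
import json

# Hypothetical CDC records; ts_ms values are illustrative only.
cdc_rows = [
    {"ts_ms": 1000, "after": json.dumps({"user_id": "102", "city": "E"})},
    {"ts_ms": 1001, "after": json.dumps({"user_id": "103", "city": "F"})},
    {"ts_ms": 1002, "after": json.dumps({"user_id": "101", "city": "D"})},
]


def extract_new_changes(rows: list, processed: list) -> list:
    """Parse the 'after' image of changes newer than the processed watermark."""
    # Same idea as: where ts_ms > (select max(process_ts) from this table)
    watermark = max((r["process_ts"] for r in processed), default=None)
    out = []
    for row in rows:
        if watermark is not None and row["ts_ms"] <= watermark:
            continue  # already processed in an earlier run
        after = json.loads(row["after"])
        out.append({
            "user_id": after["user_id"],
            "new_city": after["city"],
            "process_ts": row["ts_ms"],
        })
    return out


first_run = extract_new_changes(cdc_rows, [])
# A later change arrives; only it passes the watermark on the next run.
late_change = {"ts_ms": 2000, "after": json.dumps({"user_id": "101", "city": "G"})}
second_run = extract_new_changes(cdc_rows + [late_change], first_run)
print(first_run)
print(second_run)
```

Each run emits only changes it has not seen before, so downstream jobs can consume profile_changes without rescanning the profiles table.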

dbt x Hudi recap
● dbt supports incremental & merge semantics
● Hudi's CDC feature exposes row-level changes to downstream models and fits the incremental model
● Efficiency & cost-saving
● Sample code @ https://github.com/apache/hudi/tree/master/hudi-examples/hudi-examples-dbt

Come Build With The Community!
Check out the Hudi docs 🔖
Give us a star on GitHub ⭐
Join Hudi Slack 👥
Follow us on LinkedIn!
Join our Twitter Community!
Subscribe to our mailing list (send an empty email to subscribe) 📩
Subscribe to the Apache Hudi YouTube channel

Thanks!
Questions?
Join Hudi Slack
in/xushiyan
@rshiyanxu
blog.datumagic.com
