A Glide, Skip or a Jump:
Efficiently Stream Data into Your
Medallion Architecture with Apache Hudi
Nadine Farah Ethan Guo
{nadine, ethan}@onehouse.ai
September 27, 2023
Speaker Bio

Nadine Farah
❏ Dev Rel @ Onehouse
❏ Contributor @ Apache Hudi
❏ Former @ Rockset, Bose
in/nadinefarah/ | @nfarah86

Ethan Guo
❏ Software Engineer @ Onehouse
❏ PMC @ Apache Hudi
❏ Data, Networking @ Uber
in/yihua-ethan-guo/
Session Highlights: Share to Win Hudi Hoodies

Share your highlight from this session to win one of 10 Hudi Hoodies:
- Tag and follow OnehouseHQ on LinkedIn with a post about this session, OR
- Live tweet this session and tag and follow @apachehudi

Hudi Slack Community

Collect your hoodie at the Onehouse booth, 414 expo hall (by the latte/coffee bar area).
A Glide, Skip or a Jump: Efficiently Stream Data into Your Medallion Architecture with Apache Hudi
The Medallion Architecture
Overview
Medallion Architecture Overview
So, what does it take to build a medallion architecture?
Challenges in the Medallion Architecture
Bottlenecks that Cause the Challenges
But … what if you could simplify the medallion architecture?
Simplify the Medallion Architecture with
Apache Hudi
Apache Hudi Overview
Compute-efficient Architecture with Hudi
Open & Interoperable Lakehouse Platform

(Diagram: Apache Kafka streams into the lakehouse platform on S3, flowing through raw, cleaned, and derived tables, with catalog sync to the AWS Glue Data Catalog, a metastore, BigQuery, and many more catalogs.)
Hudi Table Deep-Dive
The Missing State Store

A Hudi table behaves like the missing state store of a streaming pipeline:
● upsert(records) at time t applies changes to the table
● incremental_query(t-1, t) pulls the changes from the table between t-1 and t
● A query at time t returns the latest committed records
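As a minimal Spark (Scala) sketch of this contract, assuming a local SparkSession with the Hudi Spark bundle on the classpath (the table path, schema, and instant values are illustrative):

import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder()
  .appName("hudi-state-store-sketch")
  .master("local[*]")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .getOrCreate()
import spark.implicits._

val basePath = "/tmp/hudi_state_store" // illustrative path
val changes = Seq((1, "Ethan", 1000L, 100.0)).toDF("uuid", "name", "ts", "balance")

// upsert(records) at time t: apply the changes to the table.
changes.write.format("hudi")
  .option("hoodie.table.name", "state_store")
  .option("hoodie.datasource.write.recordkey.field", "uuid")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .option("hoodie.datasource.write.operation", "upsert")
  .mode(SaveMode.Append)
  .save(basePath)

// incremental_query(t-1, t): changes committed after the begin instant
// ("0" means from the beginning of the timeline).
val changesFromTable = spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "incremental")
  .option("hoodie.datasource.read.begin.instanttime", "0")
  .load(basePath)

// query at time t: the latest committed records.
val latest = spark.read.format("hudi").load(basePath)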
Proven @ Massive Scale

https://www.youtube.com/watch?v=ZamXiT9aqs8
https://chowdera.com/2022/184/202207030146453436.html
https://hudi.apache.org/blog/2021/09/01/building-eb-level-data-lake-using-hudi-at-bytedance/
● 100 GB/s throughput
● > 1 exabyte (even just 1 table)
● Daily -> minutes analytics latency
● 70% CPU savings (write + read)
● 300 GB/day throughput
● 25+ TB datasets
● Hourly analytics latency

https://aws.amazon.com/blogs/big-data/how-ge-aviation-built-cloud-native-data-pipelines-at-enterprise-scale-using-the-aws-platform/
● 10,000+ tables
● 150+ source systems
● CDC, ETL use cases

https://www.uber.com/blog/apache-hudi-graduation/
● 4,000+ tables
● 250+ PB raw + derived
● Daily -> minutes analytics latency
● 800B records/day
Incremental Data Processing
in the Medallion Architecture
Incremental Processing with Apache Hudi

(Diagram: Hudi Streamer instances move data incrementally between the medallion layers.)
Hudi Streamer: E2E Incremental Processing

(Diagram: end-to-end pipeline across the Bronze, Silver, and Gold layers.)
● Bronze: PostgreSQL changes are captured by Debezium into Apache Kafka (with a schema registry); Hudi Streamer writes them to the raw table (fact) via bulk insert and upsert.
● Silver: Hudi Streamer applies a user-defined transformation, a projection that joins the raw fact table with two dimension datasets (Dataset 1 and Dataset 2), and writes the clean table (fact):

SELECT
  a.loc.lon as loc_lon,
  a.loc.lat as loc_lat,
  a.name
FROM <SRC> a

● Gold: Hudi Streamer upserts the result into the summary table.
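As a sketch of the user-defined transformation step: Hudi Streamer loads a Transformer implementation at runtime (a SQL-based transformer also ships with the utilities bundle). A minimal Scala version of the projection above, with an illustrative class name, could look like this:

import org.apache.hudi.common.config.TypedProperties
import org.apache.hudi.utilities.transform.Transformer
import org.apache.spark.api.java.JavaSparkContext
import org.apache.spark.sql.{Dataset, Row, SparkSession}

// Hedged sketch: mirrors the SELECT above by projecting the nested
// location fields out of each incoming batch.
class LocationProjectionTransformer extends Transformer {
  override def apply(jsc: JavaSparkContext,
                     sparkSession: SparkSession,
                     rowDataset: Dataset[Row],
                     properties: TypedProperties): Dataset[Row] =
    rowDataset.selectExpr("loc.lon as loc_lon", "loc.lat as loc_lat", "name")
}

The class name is then passed to Hudi Streamer through its transformer-class option.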
Hudi Incr. Processing: Under the hood

● Record-level changes with primary keys -> index lookup, record payload and merging
● Fast metadata changes with consistency between index and data -> metadata management
● Optimized data layout on storage -> small-file handling, table services
● Fundamentally different concurrency control needs -> OCC and MVCC

Write pipeline: Incremental/CDC changes from source -> pre-process -> locate records -> optimize file layout -> perform upsert -> write new files -> update index/metadata -> commit -> schedule/run table services -> incremental/CDC changes from Hudi
Deep Dive on Record-Level Mutation

Hudi Table (primary key: uuid) at t1 on the Hudi timeline:

uuid  name    ts    balance
1     Ethan   1000  100
2     XYZ     1000  200

Incoming Data 1, upserted at t2 (insert of Nadine; update of Ethan):

uuid  name    ts    balance  is_delete
3     Nadine  4000  100      false
1     Ethan   5000  60       false

Incoming Data 2, upserted at t3 (delete of XYZ; late-arriving update of Ethan):

uuid  name    ts    balance  is_delete
2     XYZ     6000  null     true
1     Ethan   2000  80       false

Hudi Table after t2, merged by the ordering field ts:

uuid  name    ts    balance
1     Ethan   5000  60
2     XYZ     1000  200
3     Nadine  4000  100

● Payload and merge API for customized upserts; built-in support for event-time ordering
● Auto primary key generation for log ingestion (upcoming 0.14.0 release)
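A hedged Spark (Scala) sketch of the event-time ordering above, assuming a running SparkSession with the Hudi bundle (table name and path are illustrative). With a payload that compares the ordering field ts against the record already on storage, the late-arriving Ethan update (ts=2000) loses to the committed ts=5000 version; the slide's is_delete flag would be interpreted by a custom payload (Hudi's built-in delete marker field is _hoodie_is_deleted).

import spark.implicits._

val incomingData2 = Seq(
  ("2", "XYZ", 6000L, None: Option[Double], true),
  ("1", "Ethan", 2000L, Some(80.0), false) // late-arriving update
).toDF("uuid", "name", "ts", "balance", "is_delete")

incomingData2.write.format("hudi")
  .option("hoodie.table.name", "accounts")
  .option("hoodie.datasource.write.recordkey.field", "uuid")
  .option("hoodie.datasource.write.precombine.field", "ts") // ordering field
  .option("hoodie.datasource.write.operation", "upsert")
  // Event-time ordering: keep the record with the larger "ts" on merge.
  .option("hoodie.datasource.write.payload.class",
    "org.apache.hudi.common.model.DefaultHoodieRecordPayload")
  .mode("append")
  .save("/path/to/hudi")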
Incremental Processing with CDC Feature

Hudi Table at t1:

uuid  name   ts    balance
1     Ethan  1000  100
2     XYZ    1000  200

Hudi Table after t2:

uuid  name    ts    balance
1     Ethan   5000  60
2     XYZ     1000  200
3     Nadine  4000  100

Debezium-like change logs with before and after images, enabled with "hoodie.table.cdc.enabled=true":

op=i, ts=t2: before=null, after={"uuid":"3","name":"Nadine","ts":"4000","balance":"100"}
op=u, ts=t2: before={"uuid":"1","name":"Ethan","ts":"1000","balance":"100"}, after={"uuid":"1","name":"Ethan","ts":"5000","balance":"60"}
op=d, ts=t3: before={"uuid":"2","name":"XYZ","ts":"1000","balance":"200"}, after=null

spark.read.format("hudi").
  option("hoodie.datasource.query.type", "incremental").
  option("hoodie.datasource.query.incremental.format", "cdc").
  option("hoodie.datasource.read.begin.instanttime", t1).
  option("hoodie.datasource.read.end.instanttime", t3).
  load("/path/to/hudi")

(New in 0.13.0 release)
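Each row of the returned DataFrame is one change-log entry carrying the fields shown above (operation type, commit time, and the before/after images), so a downstream job can replay inserts, updates, and deletes committed between the two instants.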
Speed up UPSERT Operations
with Record-Level Index
Indexes: Locating Records Efficiently

● Widely employed in DB systems
  ○ Locate information quickly
  ○ Reduce I/O cost
  ○ Improve query efficiency
● Indexing provides fast upserts in Hudi
  ○ Locates records for incoming writes
  ○ Bloom filter based, Simple, HBase, etc.

https://hudi.apache.org/blog/2020/11/11/hudi-indexing-mechanisms/
Existing Indexes in Hudi
● Simple Index
○ Simply read keys and location from table
○ Best for random updates and deletes
● Bloom Index
○ Prune data files by bloom filters and key ranges
○ Best for late arriving updates and dedup
● HBase Index
○ Look up key-to-location mapping in an external
HBase table
○ Best for large-scale datasets (10s of TB to PB)
Challenges for Large Datasets

● Simple Index
  ○ Reads keys from all files
● Bloom Index
  ○ Reads all bloom filters
  ○ Reads keys after file pruning to avoid false positives
● HBase Index
  ○ Maintains a key-to-location mapping for every record

Reading data and metadata per file is expensive, particularly on cloud storage, which enforces rate limiting on I/O. The HBase index additionally requires maintaining an HBase cluster, which is operationally difficult.
A new Index to address both challenges?
Record-Level Index (RLI) Design

● Key-to-location mapping in table-level metadata
  ○ A new partition, "record_index", in the metadata table (MDT)
  ○ Stored in a few file groups instead of all data files
● Efficient key-to-location entry as payload
  ○ Random UUID key and datestr partition: 50-55 B per record in the MDT
● Fast index update and lookup
  ○ The MDT, an internal Hudi MOR table, enables uniform, fast updates
  ○ HFile format enables fast point lookups
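A hedged sketch of enabling RLI on the write path, using the 0.14.0 configuration names (table name, path, and fields are illustrative):

df.write.format("hudi")
  .option("hoodie.table.name", "events")
  .option("hoodie.datasource.write.recordkey.field", "uuid")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .option("hoodie.datasource.write.operation", "upsert")
  // Build the "record_index" partition in the metadata table ...
  .option("hoodie.metadata.record.index.enable", "true")
  // ... and use it to locate records during upsert index lookup.
  .option("hoodie.index.type", "RECORD_INDEX")
  .mode("append")
  .save("/path/to/hudi")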
Record-Level Index on Storage

(Diagram: layout of the "record_index" partition.)
● The "record_index" partition of the metadata table holds a fixed set of file groups (File Group 0 … FG N-1); each record key is hashed to a file group.
● Each file group contains file slices (File Slice t0, FS t1, …), each an HFile base file plus log files (Log File 1, Log File 2, …); compaction merges the log files into a new HFile base file.
● HFile data blocks (between header and footer) store sorted key-to-location entries, e.g.:

record_key 0 -> partition 1, file 1
record_key 1 -> partition 1, file 1
record_key 2 -> partition 2, file 3
record_key 3 -> partition 1, file 2
...
Performance Benefit from RLI

● Improves index lookup and write latency
  ○ 1 TB dataset, 200 MB batch, random updates
  ○ 17x speedup on index lookup, 2x on write
● Reduces SQL latency with point lookups such as:
  SELECT * FROM table WHERE key = 'val'
  DELETE FROM table WHERE key = 'val'
  ○ TPC-DS 10 TB dataset, store_sales table
  ○ 2-3x improvement compared to no RLI

Record-Level Index will be available in the upcoming Hudi 0.14.0 release.
Case Walkthrough
Customer-360 Walkthrough
Customer-360 Architecture
Customer-360: Bronze layer
Customer-360: “Clickstream” Schema

Field        Description
click_id     Unique identifier for each click
customer_id  Reference to the customer table
session_id   User session ID
url          URL the user clicked on
timestamp    Timestamp of the click
Customer-360: “Purchase” Schema

Field           Description
purchase_id     Unique identifier for the purchase
customer_id     Unique identifier for the customer
product_id      Unique identifier for the product
quantity        Number of products purchased
purchase_price  Total price of the products
purchase_date   Purchase timestamp
payment_method  Customer’s payment method
order_status    Delivered, in-route, etc.
Customer-360: “Cart Activity” Schema

Field          Description
activity_id    Unique identifier for the activity
customer_id    Unique identifier for the customer
product_id     Unique identifier for the product
timestamp      Activity timestamp
activity-type  Type of activity (items added, removed, etc.)
quantity       How many items the customer added/removed
cart-status    Active/abandoned cart
Customer-360: “Customer” Schema

Field        Description
customer_id  Unique identifier for the customer
first_name   Customer’s first name
last_name    Customer’s last name
email        Customer’s email
signup_date  Account creation date
last_login   Most recent login date
Customer-360: Silver layer
Customer-360: Gold layer
Correlate User’s Activity with Purchases
SELECT
c.first_name,
c.last_name,
cs.url AS clicked_url,
cs.timestamp AS click_timestamp,
p.product_id AS purchased_product,
p.purchase_date
FROM customers c
-- Joining clickstream data
LEFT JOIN clickstream cs ON c.customer_id = cs.customer_id
-- Joining purchase data
LEFT JOIN purchases p ON c.customer_id = p.customer_id
WHERE cs.timestamp > '2023-01-01' AND p.purchase_date > '2023-01-01'
ORDER BY c.last_name, cs.timestamp DESC, p.purchase_date DESC;
Customer-360: Analytics
What’s Next in Apache Hudi
● Hudi 0.14.0 release will be out soon
○ Record-Level Index to speed up index lookup and upsert performance
○ Auto-generated keys for use cases without user-provided primary key field
○ New MOR reader in Spark to boost query performance
● Hudi 1.x (RFC-69)
  ○ A re-imagination of Hudi as the transactional database for the lake
  ○ Storage format changes to unlock long retention of the timeline and non-blocking concurrency control
  ○ Enhancements to indexing and performance, plus better abstractions and APIs for engine integration
Come Build With The Community!
Docs : https://hudi.apache.org
Blogs : https://hudi.apache.org/blog
Slack : Apache Hudi Slack Group
Twitter : https://twitter.com/apachehudi
GitHub : https://github.com/apache/hudi/ Give us a star ⭐!
Mailing list(s) :
dev-subscribe@hudi.apache.org (send an empty email to subscribe)
Join Hudi Slack
Thanks!
Questions?
A Glide, Skip or a Jump:
Efficiently Stream Data into Your
Medallion Architecture with Apache Hudi
Join Hudi Slack
Challenges with Lakehouse Technologies
Context
❏ Append-only; no support for
upserts & deletes
Problems
❏ No indexing -> Full table scans
❏ Inconsistent view of the data lake
❏ No record modifications
Challenges in the Medallion Architecture
Open & Interoperable Lakehouse Platform

User Interface
● Query Engines (Spark, Flink, Hive, Presto, Trino, Impala, Redshift, BigQuery, Snowflake, …)
● Platform Services (streaming/batch ingest, various sources, catalog sync, admin CLI, data quality, …)

Programming API
● Writers (inserts, updates, deletes, smart layout management, etc.)
● Readers (snapshot, time travel, incremental, etc.)

Transactional Database Layer
● Table Format (schema, file listings, stats, evolution, …)
● Indexes (bloom filter, HBase, bucket index, hash based, Lucene, …)
● Concurrency Control (OCC, MVCC, non-blocking, lock providers, scheduling, …)
● Table Services (cleaning, compaction, clustering, indexing, file sizing, …)
● Lake Cache* (columnar, transactional, mutable, WIP, …)
● Metaserver* (stats, table service coordination, …)

Open File/Data Formats (Parquet, HFile, Avro, ORC, …)
Lake Storage (Cloud Object Stores, HDFS, …)