A Glide, Skip or a Jump:
Efficiently Stream Data into Your
Medallion Architecture with Apache Hudi
Nadine Farah Ethan Guo
{nadine, ethan}@onehouse.ai
September 27, 2023
Speaker Bio

Nadine Farah
❏ Dev Rel @ Onehouse
❏ Contributor @ Apache Hudi
❏ Former @ Rockset, Bose
in/nadinefarah/ | @nfarah86

Ethan Guo
❏ Software Engineer @ Onehouse
❏ PMC @ Apache Hudi
❏ Data, Networking @ Uber
in/yihua-ethan-guo/
Session Highlights: Share to Win Hudi Hoodies

Share your highlight from this session to win one of 10 Hudi Hoodies:
- Tag and follow OnehouseHQ on LinkedIn with a post about this session, OR
- Live tweet this session and tag and follow @apachehudi

Hudi Slack Community

Collect your hoodie at the Onehouse booth, 414 expo hall (by the latte/coffee bar area).
A Glide, Skip or a Jump: Efficiently Stream Data into Your Medallion Architecture with Apache Hudi
The Medallion Architecture
Overview
Medallion Architecture Overview
So, what does it take to build a medallion architecture?
Challenges in the Medallion Architecture
Bottlenecks that Cause the Challenges
But … what if you could simplify the medallion architecture?
Simplify the Medallion Architecture with
Apache Hudi
Apache Hudi Overview
Compute-efficient Architecture with Hudi
Open & Interoperable Lakehouse Platform

(Diagram: Apache Kafka streams into the lakehouse platform on S3, flowing through raw, cleaned, and derived tables, with catalog sync to the AWS Glue Data Catalog, a metastore, BigQuery, and many more catalogs.)
Hudi Table Deep-Dive
The Missing State Store

A Hudi table behaves like the missing state store of a streaming pipeline:
● upsert(records) at time t applies changes to the table
● incremental_query(t-1, t) pulls the changes from the table between t-1 and t
● A query at time t returns the latest committed records
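As a minimal Spark (Scala) sketch of this contract, assuming a local SparkSession with the Hudi Spark bundle on the classpath (the table path, schema, and instant values are illustrative):

import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder()
  .appName("hudi-state-store-sketch")
  .master("local[*]")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .getOrCreate()
import spark.implicits._

val basePath = "/tmp/hudi_state_store" // illustrative path
val changes = Seq((1, "Ethan", 1000L, 100.0)).toDF("uuid", "name", "ts", "balance")

// upsert(records) at time t: apply the changes to the table.
changes.write.format("hudi")
  .option("hoodie.table.name", "state_store")
  .option("hoodie.datasource.write.recordkey.field", "uuid")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .option("hoodie.datasource.write.operation", "upsert")
  .mode(SaveMode.Append)
  .save(basePath)

// incremental_query(t-1, t): changes committed after the begin instant
// ("0" means from the beginning of the timeline).
val changesFromTable = spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "incremental")
  .option("hoodie.datasource.read.begin.instanttime", "0")
  .load(basePath)

// query at time t: the latest committed records.
val latest = spark.read.format("hudi").load(basePath)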
Proven @ Massive Scale

https://www.youtube.com/watch?v=ZamXiT9aqs8
https://chowdera.com/2022/184/202207030146453436.html
https://hudi.apache.org/blog/2021/09/01/building-eb-level-data-lake-using-hudi-at-bytedance/
● 100 GB/s throughput
● > 1 exabyte (even just 1 table)
● Daily -> minutes analytics latency
● 70% CPU savings (write + read)
● 300 GB/day throughput
● 25+ TB datasets
● Hourly analytics latency

https://aws.amazon.com/blogs/big-data/how-ge-aviation-built-cloud-native-data-pipelines-at-enterprise-scale-using-the-aws-platform/
● 10,000+ tables
● 150+ source systems
● CDC, ETL use cases

https://www.uber.com/blog/apache-hudi-graduation/
● 4,000+ tables
● 250+ PB raw + derived
● Daily -> minutes analytics latency
● 800B records/day
Incremental Data Processing
in the Medallion Architecture
Incremental Processing with Apache Hudi

(Diagram: Hudi Streamer instances move data incrementally between the medallion layers.)
Hudi Streamer: E2E Incremental Processing

(Diagram: end-to-end pipeline across the Bronze, Silver, and Gold layers.)
● Bronze: PostgreSQL changes are captured by Debezium into Apache Kafka (with a schema registry); Hudi Streamer writes them to the raw table (fact) via bulk insert and upsert.
● Silver: Hudi Streamer applies a user-defined transformation, a projection that joins the raw fact table with two dimension datasets (Dataset 1 and Dataset 2), and writes the clean table (fact):

SELECT
  a.loc.lon as loc_lon,
  a.loc.lat as loc_lat,
  a.name
FROM <SRC> a

● Gold: Hudi Streamer upserts the result into the summary table.
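As a sketch of the user-defined transformation step: Hudi Streamer loads a Transformer implementation at runtime (a SQL-based transformer also ships with the utilities bundle). A minimal Scala version of the projection above, with an illustrative class name, could look like this:

import org.apache.hudi.common.config.TypedProperties
import org.apache.hudi.utilities.transform.Transformer
import org.apache.spark.api.java.JavaSparkContext
import org.apache.spark.sql.{Dataset, Row, SparkSession}

// Hedged sketch: mirrors the SELECT above by projecting the nested
// location fields out of each incoming batch.
class LocationProjectionTransformer extends Transformer {
  override def apply(jsc: JavaSparkContext,
                     sparkSession: SparkSession,
                     rowDataset: Dataset[Row],
                     properties: TypedProperties): Dataset[Row] =
    rowDataset.selectExpr("loc.lon as loc_lon", "loc.lat as loc_lat", "name")
}

The class name is then passed to Hudi Streamer through its transformer-class option.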
Hudi Incr. Processing: Under the hood

● Record-level changes with primary keys -> index lookup, record payload and merging
● Fast metadata changes with consistency between index and data -> metadata management
● Optimized data layout on storage -> small-file handling, table services
● Fundamentally different concurrency control needs -> OCC and MVCC

Write pipeline: Incremental/CDC changes from source -> pre-process -> locate records -> optimize file layout -> perform upsert -> write new files -> update index/metadata -> commit -> schedule/run table services -> incremental/CDC changes from Hudi
Deep Dive on Record-Level Mutation

Hudi Table (primary key: uuid) at t1 on the Hudi timeline:

uuid  name    ts    balance
1     Ethan   1000  100
2     XYZ     1000  200

Incoming Data 1, upserted at t2 (insert of Nadine; update of Ethan):

uuid  name    ts    balance  is_delete
3     Nadine  4000  100      false
1     Ethan   5000  60       false

Incoming Data 2, upserted at t3 (delete of XYZ; late-arriving update of Ethan):

uuid  name    ts    balance  is_delete
2     XYZ     6000  null     true
1     Ethan   2000  80       false

Hudi Table after t2, merged by the ordering field ts:

uuid  name    ts    balance
1     Ethan   5000  60
2     XYZ     1000  200
3     Nadine  4000  100

● Payload and merge API for customized upserts; built-in support for event-time ordering
● Auto primary key generation for log ingestion (upcoming 0.14.0 release)
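A hedged Spark (Scala) sketch of the event-time ordering above, assuming a running SparkSession with the Hudi bundle (table name and path are illustrative). With a payload that compares the ordering field ts against the record already on storage, the late-arriving Ethan update (ts=2000) loses to the committed ts=5000 version; the slide's is_delete flag would be interpreted by a custom payload (Hudi's built-in delete marker field is _hoodie_is_deleted).

import spark.implicits._

val incomingData2 = Seq(
  ("2", "XYZ", 6000L, None: Option[Double], true),
  ("1", "Ethan", 2000L, Some(80.0), false) // late-arriving update
).toDF("uuid", "name", "ts", "balance", "is_delete")

incomingData2.write.format("hudi")
  .option("hoodie.table.name", "accounts")
  .option("hoodie.datasource.write.recordkey.field", "uuid")
  .option("hoodie.datasource.write.precombine.field", "ts") // ordering field
  .option("hoodie.datasource.write.operation", "upsert")
  // Event-time ordering: keep the record with the larger "ts" on merge.
  .option("hoodie.datasource.write.payload.class",
    "org.apache.hudi.common.model.DefaultHoodieRecordPayload")
  .mode("append")
  .save("/path/to/hudi")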
Incremental Processing with CDC Feature

Hudi Table at t1:

uuid  name   ts    balance
1     Ethan  1000  100
2     XYZ    1000  200

Hudi Table after t2:

uuid  name    ts    balance
1     Ethan   5000  60
2     XYZ     1000  200
3     Nadine  4000  100

Debezium-like change logs with before and after images, enabled with "hoodie.table.cdc.enabled=true":

op=i, ts=t2: before=null, after={"uuid":"3","name":"Nadine","ts":"4000","balance":"100"}
op=u, ts=t2: before={"uuid":"1","name":"Ethan","ts":"1000","balance":"100"}, after={"uuid":"1","name":"Ethan","ts":"5000","balance":"60"}
op=d, ts=t3: before={"uuid":"2","name":"XYZ","ts":"1000","balance":"200"}, after=null

spark.read.format("hudi").
  option("hoodie.datasource.query.type", "incremental").
  option("hoodie.datasource.query.incremental.format", "cdc").
  option("hoodie.datasource.read.begin.instanttime", t1).
  option("hoodie.datasource.read.end.instanttime", t3).
  load("/path/to/hudi")

(New in 0.13.0 release)
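Each row of the returned DataFrame is one change-log entry carrying the fields shown above (operation type, commit time, and the before/after images), so a downstream job can replay inserts, updates, and deletes committed between the two instants.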
Speed up UPSERT Operations
with Record-Level Index
Indexes: Locating Records Efficiently

● Widely employed in DB systems
  ○ Locate information quickly
  ○ Reduce I/O cost
  ○ Improve query efficiency
● Indexing provides fast upserts in Hudi
  ○ Locates records for incoming writes
  ○ Bloom filter based, Simple, HBase, etc.

https://hudi.apache.org/blog/2020/11/11/hudi-indexing-mechanisms/
Existing Indexes in Hudi
● Simple Index
○ Simply read keys and location from table
○ Best for random updates and deletes
● Bloom Index
○ Prune data files by bloom filters and key ranges
○ Best for late arriving updates and dedup
● HBase Index
○ Look up key-to-location mapping in an external
HBase table
○ Best for large-scale datasets (10s of TB to PB)
Challenges for Large Datasets

● Simple Index
  ○ Reads keys from all files
● Bloom Index
  ○ Reads all bloom filters
  ○ Reads keys after file pruning to avoid false positives
● HBase Index
  ○ Maintains a key-to-location mapping for every record

Reading data and metadata per file is expensive, particularly on cloud storage, which enforces rate limiting on I/O. The HBase index additionally requires maintaining an HBase cluster, which is operationally difficult.
A new Index to address both challenges?
Record-Level Index (RLI) Design

● Key-to-location mapping in table-level metadata
  ○ A new partition, "record_index", in the metadata table (MDT)
  ○ Stored in a few file groups instead of all data files
● Efficient key-to-location entry as payload
  ○ Random UUID key and datestr partition: 50-55 B per record in the MDT
● Fast index update and lookup
  ○ The MDT, an internal Hudi MOR table, enables uniform, fast updates
  ○ HFile format enables fast point lookups
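A hedged sketch of enabling RLI on the write path, using the 0.14.0 configuration names (table name, path, and fields are illustrative):

df.write.format("hudi")
  .option("hoodie.table.name", "events")
  .option("hoodie.datasource.write.recordkey.field", "uuid")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .option("hoodie.datasource.write.operation", "upsert")
  // Build the "record_index" partition in the metadata table ...
  .option("hoodie.metadata.record.index.enable", "true")
  // ... and use it to locate records during upsert index lookup.
  .option("hoodie.index.type", "RECORD_INDEX")
  .mode("append")
  .save("/path/to/hudi")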
Record-Level Index on Storage

(Diagram: layout of the "record_index" partition.)
● The "record_index" partition of the metadata table holds a fixed set of file groups (File Group 0 … FG N-1); each record key is hashed to a file group.
● Each file group contains file slices (File Slice t0, FS t1, …), each an HFile base file plus log files (Log File 1, Log File 2, …); compaction merges the log files into a new HFile base file.
● HFile data blocks (between header and footer) store sorted key-to-location entries, e.g.:

record_key 0 -> partition 1, file 1
record_key 1 -> partition 1, file 1
record_key 2 -> partition 2, file 3
record_key 3 -> partition 1, file 2
...
Performance Benefit from RLI

● Improves index lookup and write latency
  ○ 1 TB dataset, 200 MB batch, random updates
  ○ 17x speedup on index lookup, 2x on write
● Reduces SQL latency with point lookups such as:
  SELECT * FROM table WHERE key = 'val'
  DELETE FROM table WHERE key = 'val'
  ○ TPC-DS 10 TB dataset, store_sales table
  ○ 2-3x improvement compared to no RLI

Record-Level Index will be available in the upcoming Hudi 0.14.0 release.
Case Walkthrough
Customer-360 Walkthrough
Customer-360 Architecture
Customer-360: Bronze layer
Customer-360: “Clickstream” Schema

Field        Description
click_id     Unique identifier for each click
customer_id  Reference to the customer table
session_id   User session ID
url          URL the user clicked on
timestamp    Timestamp of the click
Customer-360: “Purchase” Schema

Field           Description
purchase_id     Unique identifier for the purchase
customer_id     Unique identifier for the customer
product_id      Unique identifier for the product
quantity        Number of products purchased
purchase_price  Total price of the products
purchase_date   Purchase timestamp
payment_method  Customer’s payment method
order_status    Delivered, in-route, etc.
Customer-360: “Cart Activity” Schema

Field          Description
activity_id    Unique identifier for the activity
customer_id    Unique identifier for the customer
product_id     Unique identifier for the product
timestamp      Activity timestamp
activity-type  Type of activity (items added, removed, etc.)
quantity       How many items the customer added/removed
cart-status    Active/abandoned cart
Customer-360: “Customer” Schema

Field        Description
customer_id  Unique identifier for the customer
first_name   Customer’s first name
last_name    Customer’s last name
email        Customer’s email
signup_date  Account creation date
last_login   Most recent login date
Customer-360: Silver layer
Customer-360: Gold layer
Correlate User’s Activity with Purchases
SELECT
c.first_name,
c.last_name,
cs.url AS clicked_url,
cs.timestamp AS click_timestamp,
p.product_id AS purchased_product,
p.purchase_date
FROM customers c
-- Joining clickstream data
LEFT JOIN clickstream cs ON c.customer_id = cs.customer_id
-- Joining purchase data
LEFT JOIN purchases p ON c.customer_id = p.customer_id
WHERE cs.timestamp > '2023-01-01' AND p.purchase_date > '2023-01-01'
ORDER BY c.last_name, cs.timestamp DESC, p.purchase_date DESC;
Customer-360: Analytics
What’s Next in Apache Hudi
● Hudi 0.14.0 release will be out soon
○ Record-Level Index to speed up index lookup and upsert performance
○ Auto-generated keys for use cases without user-provided primary key field
○ New MOR reader in Spark to boost query performance
● Hudi 1.x (RFC-69)
  ○ A re-imagination of Hudi as the transactional database for the lake
  ○ Storage format changes to unlock long retention of the timeline and non-blocking concurrency control
  ○ Enhancements to indexing and performance, plus better abstractions and APIs for engine integration
Come Build With The Community!
Docs : https://hudi.apache.org
Blogs : https://hudi.apache.org/blog
Slack : Apache Hudi Slack Group
Twitter : https://twitter.com/apachehudi
GitHub : https://github.com/apache/hudi/ Give us a star ⭐!
Mailing list(s) :
dev-subscribe@hudi.apache.org (send an empty email to subscribe)
Join Hudi Slack
Thanks!
Questions?
A Glide, Skip or a Jump:
Efficiently Stream Data into Your
Medallion Architecture with Apache Hudi
Join Hudi Slack
Challenges with Lakehouse Technologies
Context
❏ Append-only; no support for
upserts & deletes
Problems
❏ No indexing -> Full table scans
❏ Inconsistent view of the data lake
❏ No record modifications
Challenges in the Medallion Architecture
Open & Interoperable Lakehouse Platform

User Interface
● Query Engines (Spark, Flink, Hive, Presto, Trino, Impala, Redshift, BigQuery, Snowflake, …)
● Platform Services (streaming/batch ingest, various sources, catalog sync, admin CLI, data quality, …)

Programming API
● Writers (inserts, updates, deletes, smart layout management, etc.)
● Readers (snapshot, time travel, incremental, etc.)

Transactional Database Layer
● Table Format (schema, file listings, stats, evolution, …)
● Indexes (bloom filter, HBase, bucket index, hash based, Lucene, …)
● Concurrency Control (OCC, MVCC, non-blocking, lock providers, scheduling, …)
● Table Services (cleaning, compaction, clustering, indexing, file sizing, …)
● Lake Cache* (columnar, transactional, mutable, WIP, …)
● Metaserver* (stats, table service coordination, …)

Open File/Data Formats (Parquet, HFile, Avro, ORC, …)
Lake Storage (Cloud Object Stores, HDFS, …)