Containerized Stream Engine to build
Modern Delta Lake
Sandeep Reddy Bheemi Reddy, Senior Data Engineer
Karthikeyan Siva Baskaran, Senior Data Engineer
Who We Are
Sandeep Reddy Bheemi Reddy
Senior Data Engineer
Karthikeyan Siva Baskaran
Senior Data Engineer
TIGER DATA FOUNDATION
Containerized Stream Engine to
build Modern Delta Lake
Contact us
+1-408-508-4430
info@tigeranalytics.com
https://www.tigeranalytics.com/
Agenda
Objective
Design Considerations
Infrastructure Provisioning
Solution Deep Dive
Application Monitoring
Points to be Noted
Questions
Objective
To build a Single Source of Truth for the enterprise via CDC
The most compelling operational analytics demand
real-time data rather than historical data.
Data Agility
The speed of business is rapidly accelerating,
driving the need to deliver intelligent, fast
solutions.
Ingest large amounts of data from multiple sources by
tracking changes made to the source data, then combine
them to build a Single Source of Truth for
data-driven decisions.
Build SSOT from Siloed Data
Demand for real-time Data
Design Considerations
A Few Common Ways to Capture Data for Insights
Change Data Capture
App
DB
LOG
Analytics Data Lake
Dual Writes
App
DB
Pub Sub
System
Analytics Data Lake
Direct JDBC
App
DB
Analytics Data Lake
Inconsistent Data
A job failure in overwrite mode
leads to inconsistent data
Schema Enforcement & Evolution
DDLs are not supported, which breaks
the flow when upstream applications
change the schema
Roll Back not possible
In case of failure, it is not possible to
roll back to the previous state of data
No Metadata layer
As there is no metadata layer, there is no clear
isolation between reads and writes, so the data
lake is not consistent, durable, or atomic
Schema & Versioning
Data Corruption
Not ACID Compliant
Problem with Today’s Data Lake
Provides clear isolation
between different writes by
maintaining a log file for each
transaction
Even a job failure in
overwrite mode will not
corrupt the data
Provides serializable isolation
levels to ensure the data is
consistent across multiple users
Changes to the table are
maintained as ordered, atomic
commits
ACID
Compliant
Atomicity Consistency
Isolation Durability
mergeSchema - Any column that is present in
the DataFrame but not in the target table is
automatically added to the end of the
schema.
overwriteSchema – allows datatype changes and
dropping or renaming columns
Time Travel to older version
All metadata and lineage of your data are
stored. To travel back to previous versions of
your Delta table, provide a timestamp or a
specific version
Expectations for data quality that prevent
invalid data from entering your enterprise data
lake
Data Check Constraints
Schema Enforcement & Evolution
Delta Lake to Rescue
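A minimal sketch of how schema evolution and time travel surface in code, assuming a Delta table already exists at a hypothetical path and that spark is the active SparkSession:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("delta-features").getOrCreate()
import spark.implicits._

val delta_tbl_loc = "/mnt/delta/sample_emp"   // hypothetical table location

// mergeSchema: a column present in the DataFrame but not in the table is appended to the schema
val incomingDf = Seq((1, "Marion", "CHN")).toDF("emp_no", "first_name", "city")
incomingDf.write
  .format("delta")
  .mode("append")
  .option("mergeSchema", "true")
  .save(delta_tbl_loc)

// Time travel: read an older snapshot by version number or by timestamp
val byVersion = spark.read.format("delta").option("versionAsOf", 5).load(delta_tbl_loc)
val byTimestamp = spark.read.format("delta").option("timestampAsOf", "2020-06-01").load(delta_tbl_loc)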
Infrastructure Provisioning
On Premise
Code Repo
To maintain
versions of
Terraform files
Open Source
Agent
Security &
Compliance Checks
Terraform
To deploy TF files
and maintain state
of TF files.
Deploy TF Files
TF State Files
DevOps
Professional
CD Pipeline
TF Files
▪ Cloud Agnostic – Create & manage infrastructure
across various platforms.
▪ Minimize human errors and configuration
differences in various environments.
▪ Maintain the state of infrastructure.
▪ Perform policy checks on Infrastructure
IaC – Workflow
Infra Provisioning in Selected Environment
Kubernetes Cluster (With
Scalable worker nodes)
Pods (Deployment,
Replica Sets)
Launch the Deployment
Services (Node Port &
Load Balancer)
Volumes (PV & PVC)
Solution Deep Dive
Kafka
Schema Registry
Kubernetes
Source Database
Structured Streaming
Kafka Connect
Change Data | Streaming Queue | Processing Layer | Storage Layer
DB Logs
ADLS / S3
Kafka Connect uses the
Debezium connector
to parse the database
logs
[Diagram: each Avro message carries a schema id plus the data; schemas (Schema-1 … Schema-n) are registered in the Schema Registry under ids 1001, 1002, … n]
Provides flexibility by creating
a VIEW over the different schema
versions for different teams based on
their needs. This lets
downstream apps keep running
without any interruption when the
schema changes
Persistent
Volume Claim
PVC
PVC
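A sketch of the processing-layer entry point, assuming spark is the active SparkSession, the broker is reachable as kafka:9092, and the topic follows the connector naming convention shown on the next slide; deserializing the Confluent Avro payload against the Schema Registry is assumed and not shown:

// Subscribe to the per-table change topic produced by Kafka Connect / Debezium
val rawDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "kafka:9092")     // assumed broker service name
  .option("subscribe", "mssql.employees.sample_emp")   // hypothetical ${Source}.${DBName}.<table> topic
  .option("startingOffsets", "earliest")
  .load()

// `value` carries the Avro-encoded Debezium envelope; resolving it against the
// Schema Registry (for example, stripping the wire-format header and applying
// from_avro, or using a registry-aware deserializer) is omitted here.
val cdcDf = rawDf.select("key", "value", "topic", "timestamp")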
{
"name": "mssql-${DBName}-connector-${conn_string}",
"config":
{
"connector.class": "io.debezium.connector.sqlserver.SqlServerConnector",
"tasks.max": "1",
"database.hostname": "${Hostname}",
"database.port": "${Port}",
"database.user": "${UserName}",
"database.password": "${Password}",
"database.server.id": "${conn_string}",
"database.server.name": "${Source}.${DBName}.${conn_string}",
"database.whitelist": "${DBName}",
"database.dbname": "${DBName}",
"database.history.kafka.bootstrap.servers": "${KAFKA}:9092",
"database.history.kafka.topic": "${Source}.${DBName}.dbhistory",
"key.converter":"io.confluent.connect.avro.AvroConverter",
"key.converter.schema.registry.url":"http://${SCHEMA_REGISTRY}:8081",
"value.converter":"io.confluent.connect.avro.AvroConverter",
"value.converter.schema.registry.url": "http://${SCHEMA_REGISTRY}:8081",
}
}
Kafka Connector Properties
{
"payload": {
"before": {
"emp_no": 1,
"birth_date": 18044,
"first_name": “Marion",
"last_name": “Colbrun"
},
"after": {
"emp_no": 1,
"birth_date": 18044,
"first_name": “Marion",
"last_name": “Brickell"
}
}
}
{
"payload": {
"before": {
"emp_no": 1,
"birth_date": 18044,
"first_name": "Marion",
"last_name": "Colbrun"
},
"after": null
}
}
{
"payload": {
"before": null,
"after": {
"emp_no": 1,
"birth_date": 18044,
"first_name": "Marion",
"last_name": "Colbrun"
}
}
}
insert into sample_emp values
(1, current_date, 'Marion',
'Colbrun');
update sample_emp set
last_name='Brickell'
where emp_no=1;
delete from sample_emp
where emp_no=1;
INSERT UPDATE DELETE
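A sketch of flattening these envelopes once they are deserialized into op, before, and after columns; the field names mirror the payloads above, and the op codes are Debezium's c/u/d:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Deletes ('d') carry only a usable before image; creates ('c') and updates ('u')
// carry the after image, as described on the next slide.
def flattenEnvelope(cdcDf: DataFrame): DataFrame = {
  val deletes = cdcDf.where("op = 'd'").select(col("op"), col("before.*"))
  val upserts = cdcDf.where("op != 'd'").select(col("op"), col("after.*"))
  deletes.unionByName(upserts)
}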
CDC Code Logic Flow
Read data from Kafka, create the Delta
table, and insert the most recent data per
primary key, excluding any deletes.
Read the data from Kafka and split
deletes from inserts/updates.
Get the latest data using a rank
window.
Enable the schema autoMerge
property to detect any schema
changes and merge them into
the Delta table.
Use the MERGE command to handle
inserts/updates/deletes based on the
operation (op) column, which Debezium
creates while parsing the logs.
Data Preprocess
Initial Load
DML Scenario
DDL Scenario
Incremental Load: Data Preprocess, DDL & DML
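A sketch of the per-micro-batch handler implied by this flow; preprocess stands for the flatten and rank-per-primary-key logic on the following slides, and the table name, path, and MERGE statement are placeholders:

import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().appName("cdc-to-delta").getOrCreate()

val db_tbl = "cdc_db.sample_emp"                 // placeholder target table
val delta_tbl_loc = "/mnt/delta/sample_emp"      // placeholder table location
val mergeStatement = "MERGE INTO ..."            // full statement shown on a later slide

def preprocess(df: DataFrame): DataFrame = df    // stands in for the split/rank logic below

def upsertToDelta(batchDf: DataFrame, batchId: Long): Unit = {
  val staged = preprocess(batchDf)
  if (!spark.catalog.tableExists(db_tbl)) {
    // Initial load: exclude deletes and materialize the Delta table
    staged.where("op != 'd'")
      .write.mode("overwrite")
      .option("path", delta_tbl_loc)
      .format("delta")
      .saveAsTable(db_tbl)
  } else {
    // Incremental load: MERGE inserts/updates/deletes by op
    staged.createOrReplaceTempView("staging_tbl")
    spark.sql(mergeStatement)
  }
}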
Flag ID Value CDCTimeStamp
I 1 1 2018-01-01 16:02:00
U 1 11 2018-01-01 16:02:01
I 2 2 2018-01-01 16:02:03
I 3 33 2018-01-01 16:02:04
I 4 40 2018-01-01 16:02:04
U 4 44 2018-01-01 16:02:05
Get Latest Record in order to maintain
SCD Type I
Inserts/Updates
Flag ID Value CDCTimeStamp
D 2 2 2018-01-01 16:02:04
Deletes
Deletes arrive in Kafka from Debezium with a different
shape. For deletes, take the before-image data to know
which primary-key records were deleted, whereas for
inserts and updates, pull data from the after image.
Flag ID Value CDCTimeStamp
U 1 11 2018-01-01 16:02:01
D 2 2 2018-01-01 16:02:04
I 3 33 2018-01-01 16:02:04
U 4 44 2018-01-01 16:02:05
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, rank}

// partitionBy_lst holds the primary-key column(s), e.g. List("ID");
// dmlDf is the flattened change feed
val orderBy_lst = List("CDCTimeStamp")

val byPrimaryKey = Window
  .partitionBy(partitionBy_lst.map(col): _*)
  .orderBy(orderBy_lst.map(x => col(x).desc): _*)

val rankDf = dmlDf
  .withColumn("rank", rank().over(byPrimaryKey))
  .filter("rank = 1")
  .drop("rank")
Data Pre-processing
Get Latest Record and exclude Deletes
for Initial Load
Consolidated data for Initial Load
As the requirement is to maintain SCD Type I, there is
no need to load the deletes into Delta Lake
during the initial load.
Flag ID Value CDCTimeStamp
U 1 11 2018-01-01 16:02:01
I 3 33 2018-01-01 16:02:04
U 4 44 2018-01-01 16:02:05
df.where("op != 'd'")
.write
.mode("overwrite")
.option("path", delta_tbl_loc)
.format("delta")
.saveAsTable(db_tbl)
Flag ID Value CDCTimeStamp
U 1 11 2018-01-01 16:02:01
D 2 2 2018-01-01 16:02:04
I 3 33 2018-01-01 16:02:04
U 4 44 2018-01-01 16:02:05
Initial Load
Flag ID Value City CDCTimeStamp
I 11 100 MDU 2018-01-01 16:02:20
U 11 1000 CHN 2018-01-01 16:02:21
U 3 300 MDU 2018-01-01 16:02:22
I 14 400 MDU 2018-01-01 16:02:21
D 4 44 2018-01-01 16:02:22
Incremental Staged Data
Incremental Load: Data Pre-process & Get Latest Record
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, rank}

// partitionBy_lst holds the primary-key column(s), e.g. List("ID");
// dmlDf is the flattened change feed
val orderBy_lst = List("CDCTimeStamp")

val byPrimaryKey = Window
  .partitionBy(partitionBy_lst.map(col): _*)
  .orderBy(orderBy_lst.map(x => col(x).desc): _*)

val rankDf = dmlDf
  .withColumn("rank", rank().over(byPrimaryKey))
  .filter("rank = 1")
  .drop("rank")
Flag ID Value City CDCTimeStamp
U 11 1000 CHN 2018-01-01 16:02:21
U 3 300 MDU 2018-01-01 16:02:22
I 14 400 MDU 2018-01-01 16:02:21
D 4 44 2018-01-01 16:02:22
Latest Record from incremental load
Pre-process the data by splitting deletes from
inserts/updates and getting the latest record per
primary key.
Finally, union both DataFrames before performing the
MERGE (see the sketch below).
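A sketch of that split/rank/union step, reusing the byPrimaryKey window and the dmlDf name from the code above:

// Rank deletes and inserts/updates separately with the window above, then union before the MERGE
val upserts = dmlDf.where("op != 'd'")
  .withColumn("rank", rank().over(byPrimaryKey)).filter("rank = 1").drop("rank")
val deletes = dmlDf.where("op = 'd'")
  .withColumn("rank", rank().over(byPrimaryKey)).filter("rank = 1").drop("rank")

val stagedDf = upserts.unionByName(deletes)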
Flag ID Value City CDCTimeStamp
U 11 1000 CHN 2018-01-01 16:02:21
U 3 300 MDU 2018-01-01 16:02:22
I 14 400 MDU 2018-01-01 16:02:21
D 4 44 2018-01-01 16:02:22
Latest Incremental Staged Data
Incremental Load – DDL & DML
MERGE INTO ${db_tbl} AS target
USING staging_tbl AS source
ON ${pri_key_const}
WHEN MATCHED AND source.op = 'u'
THEN
UPDATE SET *
WHEN MATCHED AND source.op = 'd'
THEN
DELETE
WHEN NOT MATCHED AND source.op = 'c'
THEN
INSERT *
Flag ID Value City CDCTimeStamp
U 1 11 Null 2018-01-01 16:02:01
U 3 300 MDU 2018-01-01 16:02:22
U 11 1000 CHN 2018-01-01 16:02:21
I 14 400 MDU 2018-01-01 16:02:21
Enable this property to add new columns on the fly
when the MERGE happens:
spark.databricks.delta.schema.autoMerge.enabled
Available only in Delta Lake 0.6.0 and higher
versions
Flag ID Value CDCTimeStamp
U 1 11 2018-01-01 16:02:01
I 3 33 2018-01-01 16:02:04
U 4 44 2018-01-01 16:02:05
Before
After
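A sketch of driving this MERGE from Scala with the autoMerge property set first; db_tbl, pri_key_const, and stagedDf are placeholders carried over from the earlier slides:

// Allow MERGE to add new columns (e.g. City) on the fly - Delta Lake 0.6.0+
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")

val db_tbl = "cdc_db.sample_emp"              // placeholder target table
val pri_key_const = "target.ID = source.ID"   // placeholder primary-key join condition

stagedDf.createOrReplaceTempView("staging_tbl")
spark.sql(s"""
  MERGE INTO $db_tbl AS target
  USING staging_tbl AS source
  ON $pri_key_const
  WHEN MATCHED AND source.op = 'u' THEN UPDATE SET *
  WHEN MATCHED AND source.op = 'd' THEN DELETE
  WHEN NOT MATCHED AND source.op = 'c' THEN INSERT *
""")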
Spark Streaming with Kubernetes
API
Server
Scheduler
Kubernetes Master
Spark Driver Pod
Spark Executor
Pod
Spark Executor
Pod
Spark Executor
Pod
1) Spark Submit
2) Start Driver Pod
3) Request Executor Pods
4) Schedule Executor Pod
5) Notify of New Executor
6) Schedule tasks on executors
Kubernetes Cluster
Key Benefits
▪ Containerization – Applications are more portable and dependencies are easy to package.
▪ Cloud Agnostic – Able to launch the Spark job on any platform without any code changes.
▪ Efficient Resource Sharing – Resources can be utilized by other applications when Spark jobs are idle.
File Share
Checkpointing
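A sketch of how the streaming query ties the pieces together on Kubernetes, checkpointing to the shared file-share volume mounted into the pods; the mount path is an assumed PVC mount point and upsertToDelta is the per-batch handler sketched earlier:

// Offsets and state are checkpointed to the shared PersistentVolume so the
// query can recover after driver/executor pod restarts.
val checkpoint_loc = "/mnt/fileshare/checkpoints/sample_emp"

val query = cdcDf.writeStream
  .foreachBatch { (batchDf: org.apache.spark.sql.DataFrame, batchId: Long) =>
    upsertToDelta(batchDf, batchId)
  }
  .option("checkpointLocation", checkpoint_loc)
  .start()

query.awaitTermination()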
Application Monitoring
Fluentd is a popular data collector that runs
as a DaemonSet on Kubernetes worker
nodes to collect container logs from the
local filesystem and ingest them into the
Elasticsearch engine
Metricbeat is a lightweight shipper that
collects and ships various system and
service metrics such as CPU, memory, and disk
usage to the Elasticsearch engine
Elasticsearch is a real-time, distributed,
scalable search engine used to index
and search through large volumes of log
data
Kibana is a powerful data visualization tool
that lets you explore log data stored in
Elasticsearch and gain quick insights into
Kubernetes applications
[Diagram: Fluentd and Metricbeat daemons run on every worker node, shipping container logs and CPU/memory/network metrics from the node's pods and storage to Elasticsearch]
Centralized Log Monitoring
Monitoring Dashboard
Points to be noted!
DEBEZIUM
Points to be Noted
A primary key is mandatory; without one it is
not possible to track the changes
and apply them to the target
Primary Key
By default, Kafka Connect creates topics
with only one partition, so the Spark
job will not be parallelized. To achieve
parallelism, create the topic with more
partitions (a topic-creation sketch follows these points)
Partitions
One topic is created for each table in the
database, plus one common topic per
database to track DDLs
Topic/Table
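A sketch of pre-creating a change topic with more partitions so the Spark consumer can parallelize, using Kafka's AdminClient; the topic name, partition count, and broker address are illustrative:

import java.util.{Collections, Properties}
import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, NewTopic}

val props = new Properties()
props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092")   // assumed broker

val admin = AdminClient.create(props)
// 6 partitions with replication factor 3 are illustrative values
val topic = new NewTopic("mssql.employees.sample_emp", 6, 3.toShort)
admin.createTopics(Collections.singleton(topic)).all().get()
admin.close()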
SPARK
If small files are handled, the merge during the
incremental load does not need to rewrite most
of the files, which improves performance
between micro-batches.
To control compaction, either run OPTIMIZE,
rewrite with the Delta dataChange property set to
false, or enable adaptive execution mode
(see the sketch below)
Small Files
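A sketch of the dataChange-based compaction mentioned above: rewrite the table (or a partition of it) into fewer files while telling Delta the data itself did not change, so streaming readers are not re-triggered; the path and file count are illustrative and spark is assumed to be the active SparkSession:

val compact_tbl_loc = "/mnt/delta/sample_emp"   // placeholder table path
val numFiles = 8                                // illustrative target file count

spark.read
  .format("delta")
  .load(compact_tbl_loc)
  .repartition(numFiles)
  .write
  .format("delta")
  .mode("overwrite")
  .option("dataChange", "false")                // compaction only; no logical data change
  .save(compact_tbl_loc)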
Time travel will not read the consolidated Delta log
checkpoint file: because a specific version is
needed, it reads the specific JSON commit file,
whereas the checkpoint Parquet file is a
consolidation of all previously committed JSON
files.
Time Travel
Any Questions?
Feedback
Your feedback is important to us.
Don’t forget to rate
and review the session.
THANKS!