Containerized Stream Engine to build
Modern Delta Lake
Sandeep Reddy Bheemi Reddy, Senior Data Engineer
Karthikeyan Siva Baskaran, Senior Data Engineer
Who We Are
Sandeep Reddy Bheemi Reddy
Senior Data Engineer
Karthikeyan Siva Baskaran
Senior Data Engineer
TIGER DATA FOUNDATION
Containerized Stream Engine to
build Modern Delta Lake
Contact us
+1-408-508-4430
info@tigeranalytics.com
https://www.tigeranalytics.com/
Agenda
Objective
Design Considerations
Infrastructure Provisioning
Solution Deep Dive
Application Monitoring
Points to be Noted
Questions
Objective
To build a Single Source of Truth for the enterprise via CDC
The most compelling operational analytics demand
real-time data rather than historical data.
Data Agility
The speed of business is rapidly accelerating,
driving the need to deliver intelligent, fast
solutions.
Ingest large amounts of data from multiple sources by
tracking changes made to the source data, then combine
them to build a Single Source of Truth for
data-driven decisions.
Build SSOT from Siloed Data
Demand for real-time Data
Design Considerations
A Few Common Ways to Capture Data for Insights
Change Data Capture
App
DB
LOG
Analytics Data Lake
Dual Writes
App
DB
Pub Sub
System
Analytics Data Lake
Direct JDBC
App
DB
Analytics Data Lake
Inconsistent Data
A job failure in overwrite mode
leads to inconsistent data
Schema Enforcement & Evolution
DDLs are not supported, which breaks
the flow when upstream applications
change the schema
Roll Back not possible
In case of failure, it is not possible to
roll back to the previous state of data
No Metadata layer
As there is no metadata layer, there is no clear
isolation between reads and writes, so the data
lake is not consistent, durable, or atomic
Schema & Versioning
Data Corruption
Not ACID Compliant
Problem with Today’s Data Lake
Provides clear isolation
between different writes by
maintaining a log file for each
transaction
Even a job failure in
overwrite mode will not
corrupt the data
Provides serializable isolation
levels to ensure the data is
consistent across multiple users
Changes to the table are
maintained as ordered, atomic
commits
ACID
Compliant
Atomicity Consistency
Isolation Durability
mergeSchema - Any column that is present in
the DataFrame but not in the target table is
automatically added to the end of the
schema.
overwriteSchema – allows datatype changes and
dropping or renaming columns
Time Travel to older version
All metadata and lineage of your data are
stored. To travel back to previous versions of
your Delta table, provide a timestamp or a
specific version
Expectations for data quality that prevent
invalid data from entering your enterprise data
lake
Data Check Constraints
Schema Enforcement & Evolution
Delta Lake to Rescue
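A minimal sketch of how schema evolution and time travel surface in code, assuming a Delta table already exists at a hypothetical path and that spark is the active SparkSession:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("delta-features").getOrCreate()
import spark.implicits._

val delta_tbl_loc = "/mnt/delta/sample_emp"   // hypothetical table location

// mergeSchema: a column present in the DataFrame but not in the table is appended to the schema
val incomingDf = Seq((1, "Marion", "CHN")).toDF("emp_no", "first_name", "city")
incomingDf.write
  .format("delta")
  .mode("append")
  .option("mergeSchema", "true")
  .save(delta_tbl_loc)

// Time travel: read an older snapshot by version number or by timestamp
val byVersion = spark.read.format("delta").option("versionAsOf", 5).load(delta_tbl_loc)
val byTimestamp = spark.read.format("delta").option("timestampAsOf", "2020-06-01").load(delta_tbl_loc)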
Infrastructure Provisioning
On Premise
Code Repo
To maintain
versions of
Terraform files
Open Source
Agent
Security &
Compliance Checks
Terraform
To deploy TF files
and maintain state
of TF files.
Deploy TF Files
TF State Files
DevOps
Professional
CD Pipeline
TF Files
▪ Cloud Agnostic – Create & manage infrastructure
across various platforms.
▪ Minimize human errors and configuration
differences in various environments.
▪ Maintain the state of infrastructure.
▪ Perform policy checks on Infrastructure
IaC – Workflow
Infra Provisioning in Selected Environment
Kubernetes Cluster (With
Scalable worker nodes)
Pods (Deployment,
Replica Sets)
Launch the Deployment
Services (Node Port &
Load Balancer)
Volumes (PV & PVC)
Solution Deep Dive
Kafka
Schema Registry
Kubernetes
Source Database
Structured Streaming
Kafka Connect
Change Data | Streaming Queue | Processing Layer | Storage Layer
DB Logs
ADLS / S3
Kafka Connect uses the
Debezium connector
to parse the database
logs
[Diagram: each Avro message carries a schema id plus the data; schemas (Schema-1 … Schema-n) are registered in the Schema Registry under ids 1001, 1002, … n]
Provides flexibility by creating
a VIEW over the different schema
versions for different teams based on
their needs. This lets
downstream apps keep running
without any interruption when the
schema changes
Persistent
Volume Claim
PVC
PVC
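A sketch of the processing-layer entry point, assuming spark is the active SparkSession, the broker is reachable as kafka:9092, and the topic follows the connector naming convention shown on the next slide; deserializing the Confluent Avro payload against the Schema Registry is assumed and not shown:

// Subscribe to the per-table change topic produced by Kafka Connect / Debezium
val rawDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "kafka:9092")     // assumed broker service name
  .option("subscribe", "mssql.employees.sample_emp")   // hypothetical ${Source}.${DBName}.<table> topic
  .option("startingOffsets", "earliest")
  .load()

// `value` carries the Avro-encoded Debezium envelope; resolving it against the
// Schema Registry (for example, stripping the wire-format header and applying
// from_avro, or using a registry-aware deserializer) is omitted here.
val cdcDf = rawDf.select("key", "value", "topic", "timestamp")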
{
"name": "mssql-${DBName}-connector-${conn_string}",
"config":
{
"connector.class": "io.debezium.connector.sqlserver.SqlServerConnector",
"tasks.max": "1",
"database.hostname": "${Hostname}",
"database.port": "${Port}",
"database.user": "${UserName}",
"database.password": "${Password}",
"database.server.id": "${conn_string}",
"database.server.name": "${Source}.${DBName}.${conn_string}",
"database.whitelist": "${DBName}",
"database.dbname": "${DBName}",
"database.history.kafka.bootstrap.servers": "${KAFKA}:9092",
"database.history.kafka.topic": "${Source}.${DBName}.dbhistory",
"key.converter":"io.confluent.connect.avro.AvroConverter",
"key.converter.schema.registry.url":"http://${SCHEMA_REGISTRY}:8081",
"value.converter":"io.confluent.connect.avro.AvroConverter",
"value.converter.schema.registry.url": "http://${SCHEMA_REGISTRY}:8081",
}
}
Kafka Connector Properties
{
"payload": {
"before": {
"emp_no": 1,
"birth_date": 18044,
"first_name": “Marion",
"last_name": “Colbrun"
},
"after": {
"emp_no": 1,
"birth_date": 18044,
"first_name": “Marion",
"last_name": “Brickell"
}
}
}
{
"payload": {
"before": {
"emp_no": 1,
"birth_date": 18044,
"first_name": "Marion",
"last_name": "Colbrun"
},
"after": null
}
}
{
"payload": {
"before": null,
"after": {
"emp_no": 1,
"birth_date": 18044,
"first_name": "Marion",
"last_name": "Colbrun"
}
}
}
insert into sample_emp values
(1, current_date, 'Marion',
'Colbrun');
update sample_emp set
last_name='Brickell'
where emp_no=1;
delete from sample_emp
where emp_no=1;
INSERT UPDATE DELETE
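A sketch of flattening these envelopes once they are deserialized into op, before, and after columns; the field names mirror the payloads above, and the op codes are Debezium's c/u/d:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Deletes ('d') carry only a usable before image; creates ('c') and updates ('u')
// carry the after image, as described on the next slide.
def flattenEnvelope(cdcDf: DataFrame): DataFrame = {
  val deletes = cdcDf.where("op = 'd'").select(col("op"), col("before.*"))
  val upserts = cdcDf.where("op != 'd'").select(col("op"), col("after.*"))
  deletes.unionByName(upserts)
}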
CDC Code Logic Flow
Read data from Kafka, create the Delta
table, and insert the most recent data per
primary key, excluding any deletes.
Read the data from Kafka and split
deletes from inserts/updates.
Get the latest data using a rank
window.
Enable the schema autoMerge
property to detect any schema
changes and merge them into
the Delta table.
Use the MERGE command to handle
inserts/updates/deletes based on the
operation (op) column, which Debezium
creates while parsing the logs.
Data Preprocess
Initial Load
DML Scenario
DDL Scenario
Incremental Load: Data Preprocess, DDL & DML
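A sketch of the per-micro-batch handler implied by this flow; preprocess stands for the flatten and rank-per-primary-key logic on the following slides, and the table name, path, and MERGE statement are placeholders:

import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().appName("cdc-to-delta").getOrCreate()

val db_tbl = "cdc_db.sample_emp"                 // placeholder target table
val delta_tbl_loc = "/mnt/delta/sample_emp"      // placeholder table location
val mergeStatement = "MERGE INTO ..."            // full statement shown on a later slide

def preprocess(df: DataFrame): DataFrame = df    // stands in for the split/rank logic below

def upsertToDelta(batchDf: DataFrame, batchId: Long): Unit = {
  val staged = preprocess(batchDf)
  if (!spark.catalog.tableExists(db_tbl)) {
    // Initial load: exclude deletes and materialize the Delta table
    staged.where("op != 'd'")
      .write.mode("overwrite")
      .option("path", delta_tbl_loc)
      .format("delta")
      .saveAsTable(db_tbl)
  } else {
    // Incremental load: MERGE inserts/updates/deletes by op
    staged.createOrReplaceTempView("staging_tbl")
    spark.sql(mergeStatement)
  }
}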
Flag ID Value CDCTimeStamp
I 1 1 2018-01-01 16:02:00
U 1 11 2018-01-01 16:02:01
I 2 2 2018-01-01 16:02:03
I 3 33 2018-01-01 16:02:04
I 4 40 2018-01-01 16:02:04
U 4 44 2018-01-01 16:02:05
Get Latest Record in order to maintain
SCD Type I
Inserts/Updates
Flag ID Value CDCTimeStamp
D 2 2 2018-01-01 16:02:04
Deletes
Deletes arrive in Kafka from Debezium with a different
shape. For deletes, take the before-image data to know
which primary-key records were deleted, whereas for
inserts and updates, pull data from the after image.
Flag ID Value CDCTimeStamp
U 1 11 2018-01-01 16:02:01
D 2 2 2018-01-01 16:02:04
I 3 33 2018-01-01 16:02:04
U 4 44 2018-01-01 16:02:05
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, rank}

// partitionBy_lst holds the primary-key column(s), e.g. List("ID");
// dmlDf is the flattened change feed
val orderBy_lst = List("CDCTimeStamp")

val byPrimaryKey = Window
  .partitionBy(partitionBy_lst.map(col): _*)
  .orderBy(orderBy_lst.map(x => col(x).desc): _*)

val rankDf = dmlDf
  .withColumn("rank", rank().over(byPrimaryKey))
  .filter("rank = 1")
  .drop("rank")
Data Pre-processing
Get Latest Record and exclude Deletes
for Initial Load
Consolidated data for Initial Load
As the requirement is to maintain SCD Type I, there is
no need to load the deletes into Delta Lake
during the initial load.
Flag ID Value CDCTimeStamp
U 1 11 2018-01-01 16:02:01
I 3 33 2018-01-01 16:02:04
U 4 44 2018-01-01 16:02:05
df.where("op != 'd'")
.write
.mode("overwrite")
.option("path", delta_tbl_loc)
.format("delta")
.saveAsTable(db_tbl)
Flag ID Value CDCTimeStamp
U 1 11 2018-01-01 16:02:01
D 2 2 2018-01-01 16:02:04
I 3 33 2018-01-01 16:02:04
U 4 44 2018-01-01 16:02:05
Initial Load
Flag ID Value City CDCTimeStamp
I 11 100 MDU 2018-01-01 16:02:20
U 11 1000 CHN 2018-01-01 16:02:21
U 3 300 MDU 2018-01-01 16:02:22
I 14 400 MDU 2018-01-01 16:02:21
D 4 44 2018-01-01 16:02:22
Incremental Staged Data
Incremental Load: Data Pre-process & Get Latest Record
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, rank}

// partitionBy_lst holds the primary-key column(s), e.g. List("ID");
// dmlDf is the flattened change feed
val orderBy_lst = List("CDCTimeStamp")

val byPrimaryKey = Window
  .partitionBy(partitionBy_lst.map(col): _*)
  .orderBy(orderBy_lst.map(x => col(x).desc): _*)

val rankDf = dmlDf
  .withColumn("rank", rank().over(byPrimaryKey))
  .filter("rank = 1")
  .drop("rank")
Flag ID Value City CDCTimeStamp
U 11 1000 CHN 2018-01-01 16:02:21
U 3 300 MDU 2018-01-01 16:02:22
I 14 400 MDU 2018-01-01 16:02:21
D 4 44 2018-01-01 16:02:22
Latest Record from incremental load
Pre-process the data by splitting deletes from
inserts/updates and getting the latest record per
primary key.
Finally, union both DataFrames before performing the
MERGE (see the sketch below).
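A sketch of that split/rank/union step, reusing the byPrimaryKey window and the dmlDf name from the code above:

// Rank deletes and inserts/updates separately with the window above, then union before the MERGE
val upserts = dmlDf.where("op != 'd'")
  .withColumn("rank", rank().over(byPrimaryKey)).filter("rank = 1").drop("rank")
val deletes = dmlDf.where("op = 'd'")
  .withColumn("rank", rank().over(byPrimaryKey)).filter("rank = 1").drop("rank")

val stagedDf = upserts.unionByName(deletes)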
Flag ID Value City CDCTimeStamp
U 11 1000 CHN 2018-01-01 16:02:21
U 3 300 MDU 2018-01-01 16:02:22
I 14 400 MDU 2018-01-01 16:02:21
D 4 44 2018-01-01 16:02:22
Latest Incremental Staged Data
Incremental Load – DDL & DML
MERGE INTO ${db_tbl} AS target
USING staging_tbl AS source
ON ${pri_key_const}
WHEN MATCHED AND source.op = 'u'
THEN
UPDATE SET *
WHEN MATCHED AND source.op = 'd'
THEN
DELETE
WHEN NOT MATCHED AND source.op = 'c'
THEN
INSERT *
Flag ID Value City CDCTimeStamp
U 1 11 Null 2018-01-01 16:02:01
U 3 300 MDU 2018-01-01 16:02:22
U 11 1000 CHN 2018-01-01 16:02:21
I 14 400 MDU 2018-01-01 16:02:21
Enable this property to add new columns on the fly
when the MERGE happens:
spark.databricks.delta.schema.autoMerge.enabled
Available only in Delta Lake 0.6.0 and higher
versions
Flag ID Value CDCTimeStamp
U 1 11 2018-01-01 16:02:01
I 3 33 2018-01-01 16:02:04
U 4 44 2018-01-01 16:02:05
Before
After
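A sketch of driving this MERGE from Scala with the autoMerge property set first; db_tbl, pri_key_const, and stagedDf are placeholders carried over from the earlier slides:

// Allow MERGE to add new columns (e.g. City) on the fly - Delta Lake 0.6.0+
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")

val db_tbl = "cdc_db.sample_emp"              // placeholder target table
val pri_key_const = "target.ID = source.ID"   // placeholder primary-key join condition

stagedDf.createOrReplaceTempView("staging_tbl")
spark.sql(s"""
  MERGE INTO $db_tbl AS target
  USING staging_tbl AS source
  ON $pri_key_const
  WHEN MATCHED AND source.op = 'u' THEN UPDATE SET *
  WHEN MATCHED AND source.op = 'd' THEN DELETE
  WHEN NOT MATCHED AND source.op = 'c' THEN INSERT *
""")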
Spark Streaming with Kubernetes
API
Server
Scheduler
Kubernetes Master
Spark Driver Pod
Spark Executor
Pod
Spark Executor
Pod
Spark Executor
Pod
1) Spark Submit
2) Start Driver Pod
3) Request Executor Pods
4) Schedule Executor Pod
5) Notify of New Executor
6) Schedule tasks on executors
Kubernetes Cluster
Key Benefits
▪ Containerization – Applications are more portable and dependencies are easy to package.
▪ Cloud Agnostic – Able to launch the Spark job on any platform without any code changes.
▪ Efficient Resource Sharing – Resources can be utilized by other applications when Spark jobs are idle.
File Share
Checkpointing
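A sketch of how the streaming query ties the pieces together on Kubernetes, checkpointing to the shared file-share volume mounted into the pods; the mount path is an assumed PVC mount point and upsertToDelta is the per-batch handler sketched earlier:

// Offsets and state are checkpointed to the shared PersistentVolume so the
// query can recover after driver/executor pod restarts.
val checkpoint_loc = "/mnt/fileshare/checkpoints/sample_emp"

val query = cdcDf.writeStream
  .foreachBatch { (batchDf: org.apache.spark.sql.DataFrame, batchId: Long) =>
    upsertToDelta(batchDf, batchId)
  }
  .option("checkpointLocation", checkpoint_loc)
  .start()

query.awaitTermination()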
Application Monitoring
Fluentd is a popular data collector that runs
as a DaemonSet on Kubernetes worker
nodes to collect container logs from the
local filesystem and ingest them into the
Elasticsearch engine
Metricbeat is a lightweight shipper that
collects and ships various system and
service metrics such as CPU, memory, and disk
usage to the Elasticsearch engine
Elasticsearch is a real-time, distributed,
scalable search engine used to index
and search through large volumes of log
data
Kibana is a powerful data visualization tool
that lets you explore log data stored in
Elasticsearch and gain quick insights into
Kubernetes applications
[Diagram: Fluentd and Metricbeat daemons run on every worker node, shipping container logs and CPU/memory/network metrics from the node's pods and storage to Elasticsearch]
Centralized Log Monitoring
Monitoring Dashboard
Points to be noted!
DEBEZIUM
Points to be Noted
A primary key is mandatory; without one it is
not possible to track the changes
and apply them to the target
Primary Key
By default, Kafka Connect creates topics
with only one partition, so the Spark
job will not be parallelized. To achieve
parallelism, create the topic with more
partitions (a topic-creation sketch follows these points)
Partitions
One topic is created for each table in the
database, plus one common topic per
database to track DDLs
Topic/Table
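A sketch of pre-creating a change topic with more partitions so the Spark consumer can parallelize, using Kafka's AdminClient; the topic name, partition count, and broker address are illustrative:

import java.util.{Collections, Properties}
import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, NewTopic}

val props = new Properties()
props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092")   // assumed broker

val admin = AdminClient.create(props)
// 6 partitions with replication factor 3 are illustrative values
val topic = new NewTopic("mssql.employees.sample_emp", 6, 3.toShort)
admin.createTopics(Collections.singleton(topic)).all().get()
admin.close()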
SPARK
If small files are handled, the merge during the
incremental load does not need to rewrite most
of the files, which improves performance
between micro-batches.
To control compaction, either run OPTIMIZE,
rewrite with the Delta dataChange property set to
false, or enable adaptive execution mode
(see the sketch below)
Small Files
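A sketch of the dataChange-based compaction mentioned above: rewrite the table (or a partition of it) into fewer files while telling Delta the data itself did not change, so streaming readers are not re-triggered; the path and file count are illustrative and spark is assumed to be the active SparkSession:

val compact_tbl_loc = "/mnt/delta/sample_emp"   // placeholder table path
val numFiles = 8                                // illustrative target file count

spark.read
  .format("delta")
  .load(compact_tbl_loc)
  .repartition(numFiles)
  .write
  .format("delta")
  .mode("overwrite")
  .option("dataChange", "false")                // compaction only; no logical data change
  .save(compact_tbl_loc)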
Time travel will not read the consolidated Delta log
checkpoint file: because a specific version is
needed, it reads the specific JSON commit file,
whereas the checkpoint Parquet file is a
consolidation of all previously committed JSON
files.
Time Travel
Any Questions?
Feedback
Your feedback is important to us.
Don’t forget to rate
and review the session.
THANKS!