A noETL
Parallel Streaming Transformation Loader
using Kafka, Spark & Vertica
Jack Gudenkauf
CEO & Founder of
BigDataInfra
https://www.linkedin.com/in/jackglinkedin
JackGudenkauf@gmail.com
WARNING!
Slides that follow
violate PowerPoint best practices
in favor of providing densely
packed information for later review
Agenda
1. Background
2. PSTL overview
Parallelized Streaming Transformation Loader
3. Parallelism in
Kafka, Spark, Vertica
4. PSTL drill down
Parallelized Streaming Transformation Loader
5. Vertica Performance!
Agenda
1. Background
My Background
Playtika, VP of Big Data
Flume, Kafka, Spark, ZK, Yarn, Vertica. [Jay Kreps (Kafka), Michael Armbrust (Spark SQL),
Chris Bowden (Dev Demigod)]
MIS Director of several start-up companies
DataFlex, a 4GL RDBMS. [E.F. Codd]
Self-employed Consultant
Intercept DataFlex db calls to store and retrieve data to/from Btrieve and IBM DB2 Mainframe
FoxPro, Sybase, MSSQL Server beta
Design Patterns: Elements of Reusable Object-Oriented Software [The Gang of Four]
Microsoft; Dev Manager, Architect CLR/.Net Framework,
Product Unit Manager Technical Strategy Group
Inventor of “Shuttle”, a Microsoft product in use since 1999
A distributed ETL based on MSMQ which influenced MSSQL DTS (SQL SSIS)
[Joe Celko, Ralph Kimball, Steve Butler (Microsoft Architect)]
Twitter, Manager of Analytics Data Warehouse
Core Storage; Hadoop, HBase, Cassandra, Blob Store
Analytics Infra; MySQL, PIG, Vertica (n-Petabyte capacity with Multi-DC DR)
[Prof. Michael Stonebraker, Ron Cormier (Vertica Internals)]
A Quest
With attributes of
• Operational Robustness
  - High Availability
  - Stronger durability guarantees
  - Idempotent (an operation that is safe to repeat)
• Productivity
  - Analytics: Streaming, Machine Learning, BI, BA, Data Science
  - Rich Development env.: strongly typed, OO, Functional, with support for set based logic and aggregations (SQL)
• Performance
  - Scalable in every tier
  - MPP for Transformations, Reads & Writes
A Unified Data Pipeline with Parallelism
from Streaming Data
through Data Transformations
to Data Storage (Semi-Structured, Structured, and Relational Data)
https://en.wikipedia.org/wiki/Extract,_transform,_load
ELT
“Extract, Load, Transform is an alternative to Extract,
transform, load (ETL) used with data lake implementations.
In ELT models the data is not processed on entry to the data
lake which enables faster loading times.
But does require sufficient processing within the data
processing engine to carry out the transform on demand and
return the results to the consumer in a timely manner.
Since the data is not processed on entry to the data lake the
query and schema do not need to be defined a-priori (often
the schema will be available during load since many data
sources are extracts from databases or similar structured data
systems and hence have an associated schema).”
https://en.wikipedia.org/wiki/Extract,_load,_transform
Lambda Architecture
“Essentially, the speed layer (streaming) is responsible for filling the "gap" caused by the batch
layer's lag in providing views based on the most recent data” -
https://en.wikipedia.org/wiki/Lambda_architecture
Questioning the Lambda Architecture
by Jay Kreps
The Lambda Architecture has its merits, but alternatives
are worth exploring.
“As someone who designs infrastructure, I think the
glaring question is this: why can’t the stream processing
system just be improved to handle the full problem set in
its target domain? Why do you need to glue on another
system? Why can’t you do both real-time processing and
also handle the reprocessing when code changes? Stream
processing systems already have a notion of parallelism;
why not just handle reprocessing by increasing the
parallelism and replaying history very, very fast? The
answer is that you can do this, and I think this is
actually a reasonable alternative architecture if you are
building this type of system today.”
Original Architecture (ETL)
Playtika Santa Monica original ETL Architecture (Extract Transform Load)
Single Source of Truths to Global SOT
[Diagram, stages numbered 1 to 5. Components shown: Game Applications (GameX, GameY, GameZ) with Local Data Warehouses; REST API carrying Unified Schema JSON; Apache Flume™; ETL JAVA™ Parser & Loader; COPY into the HP Vertica™ Cluster (MPP Columnar DW)]
HP Vertica™ Cluster: UserId <-> UserGId; Analytics of Relational Data; Structured Relational and Aggregated Data
ID types differ per game application:
UserId: INT, SessionId: UUId (36)
UserId: INT, SessionId: UUId (32)
UserId: varchar(32), SessionId: varchar(255)
Agenda
1. Background
2. PSTL overview
Parallelized Streaming Transformation Loader
New PSTL Architecture
[Diagram, stages numbered 1 to 5, labeled "Parallelized Streaming Transformation Loader". Components shown: Game Applications with Local Data Warehouses; REST API or Local Kafka carrying Unified Schema JSON; Apache Kafka™ (Real-Time Messaging); Apache Spark™ (Resilient Distributed Datasets) with Hadoop™ and Parquet™; HP Vertica™ (MPP Columnar DW)]
Analytics of [semi]Structured [non]Relational Data Stores:
• Real-Time Streaming
• Machine Learning
• Semi-Structured Raw JSON Data
• Structured (non)relational Parquet Data
• Structured Relational and Aggregated Data
ID types differ per game application:
Bingo Blitz: UserId: INT, SessionId: UUId (36)
Slotomania: UserId: INT, SessionId: UUId (32)
WSOP: UserId: varchar(32), SessionId: varchar(255)
Agenda
1. Background
2. PSTL overview
Parallelized Streaming Transformation Loader
3. Parallelism in
Kafka, Spark, Vertica
Apache Kafka ™
is a distributed, partitioned, replicated
commit log service
[Diagram: three Producers publish to the Kafka Cluster (Broker); three Consumers read from it]
A topic is a category or feed name to which messages are published.
For each topic, the Kafka cluster maintains a partitioned log.
Each partition is an ordered, immutable sequence of messages that is continually appended to—a commit log.
The messages in the partitions are each assigned a sequential id number called the offset that uniquely identifies each
message within the partition.
The Kafka cluster retains all published messages—whether or not they have been consumed
—for a configurable period of time.
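The partition and offset model described above can be sketched in a few lines. This is a toy in-memory model, not the Kafka client API: the `Partition` and `Topic` classes and the key-modulo placement rule are hypothetical names chosen for illustration, and time-based retention is not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class Partition:
    """An ordered, append-only sequence of messages."""
    messages: list = field(default_factory=list)

    def append(self, message):
        # The offset is the message's position in this partition's log.
        offset = len(self.messages)
        self.messages.append(message)
        return offset

    def read(self, start, end):
        # Reading never removes messages; any offset range can be re-read.
        return self.messages[start:end]


class Topic:
    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [Partition() for _ in range(num_partitions)]

    def publish(self, key, message):
        # Hypothetical placement rule: integer key modulo partition count.
        index = key % len(self.partitions)
        return index, self.partitions[index].append(message)


topic = Topic("appId_1", num_partitions=2)
first = topic.publish(42, {"appId": 1, "sessionId": "1", "userId": "JG"})
second = topic.publish(7, {"appId": 1, "sessionId": "2", "userId": "CB"})
third = topic.publish(42, {"appId": 1, "sessionId": "3", "userId": "JG"})
```

Because a read is addressed by (partition, offset range) and does not consume anything, the same range can be replayed after a failure, which is what the idempotent steps later in the deck rely on.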
Spark RDD
A Resilient Distributed Dataset [in Memory]
Represents an immutable, partitioned collection of elements that can be operated on in parallel
[Diagram: partitions of three RDDs spread across Node 1, Node 2, Node 3, Node…: RDD 1 has Partitions 1 to 3; RDD 2 has Partitions 1 to 64, 65 to 128, 129 to 192, and 193 to 256; RDD 3 has Partitions 1 to 3]
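To make "an immutable, partitioned collection of elements that can be operated on in parallel" concrete, here is a minimal sketch in plain Python. It does not use Spark: `split_into_partitions` and `map_partitions` are hypothetical helpers, and a thread pool stands in for the cluster's executors.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(elements, num_partitions):
    """Split a sequence into num_partitions contiguous, immutable slices."""
    size, extra = divmod(len(elements), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        partitions.append(tuple(elements[start:end]))
        start = end
    return partitions


def map_partitions(partitions, func):
    """Apply func to every partition in parallel, returning new partitions."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        return list(pool.map(func, partitions))


partitions = split_into_partitions(list(range(1, 11)), 3)
doubled = map_partitions(partitions, lambda part: tuple(x * 2 for x in part))
total = sum(sum(part) for part in doubled)
```

The input partitions are never modified; each transformation yields a new set of partitions, which is the property that lets a lost partition be recomputed from its inputs.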
Vertica Hashing & Partitioning
An Initiator Node shuffles data to storage nodes
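The shuffle performed by an initiator node can be illustrated with a small routing sketch. The hash below is CRC32, used only as a stand-in; it is an assumption for illustration and not Vertica's actual segmentation hash. `node_for` and `shuffle` are hypothetical names.

```python
import zlib
from collections import defaultdict


def node_for(key, num_nodes):
    # Stand-in hash: CRC32 of the key's text form. Vertica's real
    # segmentation hash is different; this only illustrates the idea.
    return zlib.crc32(str(key).encode("utf-8")) % num_nodes


def shuffle(rows, key_field, num_nodes):
    """Group rows by the storage node that owns each row's key."""
    by_node = defaultdict(list)
    for row in rows:
        by_node[node_for(row[key_field], num_nodes)].append(row)
    return dict(by_node)


rows = [{"userGId": gid, "appId": 1} for gid in range(1, 9)]
placement = shuffle(rows, "userGId", num_nodes=3)
```

If the loader already knows which node owns each key, it can send rows straight to that node and the initiator has nothing left to shuffle. That is the motivation for hashing in Spark before the COPY.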
Agenda
1. Background
2. PSTL overview
Parallelized Streaming Transformation Loader
3. Parallelism in
Kafka, Spark, Vertica
4. PSTL drill down
Parallelized Streaming Transformation Loader
[Diagram: Apache Kafka Cluster, Apache Hadoop™ Spark™ Cluster, and HPE Vertica™ Cluster, with numbered steps 1 to 10]

Apache Kafka Cluster (one topic per application, log ordered old to new)
Topic: “appId_1”
{"appId": 1, "sessionId": "1", "userId": "JG"}
{"appId": 1, "sessionId": "2", "userId": "CB"}
Topic: “appId_2”
{"appId": 2, "sessionId": "3", "userId": "KY"}
{"appId": 2, "sessionId": "4", "userId": "KA"}
Topic: “appId_3”
{"appId": 3, "sessionId": "6", "userId": "42"}
{"appId": 3, "sessionId": "7", "userId": "42"}

Bookkeeping tables
Kafka Table: appId, TopicOffsetRange, Batch_Id
SessionMax Table: sessionGIdMax Int
UserMax Table: userGIdMax Int

Spark RDDs
appSessionMap_RDD: appId: Int, sessionId: String, sessionGId: Int
appUserMap_RDD: appId: Int, userId: String, userGId: Int

Mapping tables
appSession: appId: Int, sessionId: varchar(255), sessionGId: Int
appUser: appId: Int, userId: varchar(255), userGId: Int

Steps
1 Start a Spark Driver per APP
2 Import Users
3 Import recent Sessions
4 Spark Kafka [non]Streaming job per APP (read partition/offset range)
  appId {Game events, Users, Sessions,…} Partition 1..n RDDs
5 select for update; update max GId
  Assign userGIds to userId, sessionGIds to sessionId
  appId Users & Sessions Partition 1..n RDDs
  appUserMap_RDD.union(assignedID_RDD)
6 Hash(userGId) to RDD partitions with affinity to Vertica Node(s)
  appId Users & Sessions Partition 1..n RDDs
7 userGIdRDD.foreachPartition {…stream.writeTo(socket)...}
  copy jackg.DIM_USER with source SPARK(port='12345', nodes='node0001:4, node0002:4, node0003:4') direct;
8 Idempotent: Write Raw JSON to hdfs
9 Idempotent: Write Parsed JSON to .ORC hdfs
10 Update MySQL Kafka Offsets
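Steps 5 and 6 (assign global IDs to unseen users, then place rows by userGId) can be sketched as follows. This is a simplified single-process illustration under stated assumptions: the plain integer `current_max` stands in for the UserMax table that the slide updates with "select for update", modulo stands in for the real hash, and the function names are hypothetical.

```python
def assign_global_ids(known, new_keys, current_max):
    """Return (mapping, new_max) after giving each unseen key the next GId.

    known:       dict of (appId, userId) -> userGId already assigned
    new_keys:    iterable of (appId, userId) pairs seen in this batch
    current_max: highest userGId handed out so far (the UserMax value)
    """
    mapping = dict(known)
    next_id = current_max
    for key in new_keys:
        if key not in mapping:
            next_id += 1
            mapping[key] = next_id
    return mapping, next_id


def partition_by_gid(mapping, num_partitions):
    """Place each (key, gid) pair in partition gid % num_partitions."""
    partitions = [[] for _ in range(num_partitions)]
    for key, gid in sorted(mapping.items(), key=lambda item: item[1]):
        partitions[gid % num_partitions].append((key, gid))
    return partitions


known_users = {(1, "JG"): 1, (1, "CB"): 2}
batch = [(1, "JG"), (2, "KY"), (2, "KA"), (3, "42"), (3, "42")]
user_map, user_max = assign_global_ids(known_users, batch, current_max=2)
partitions = partition_by_gid(user_map, num_partitions=3)
```

Keys that already have a GId keep it, so replaying the same batch against the updated map assigns nothing new. Each resulting partition would then be streamed to the Vertica node that owns those GIds, as in step 7.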
Agenda
1. Background
2. PSTL overview
Parallelized Streaming Transformation Loader
3. Parallelism in
Kafka, Spark, Vertica
4. PSTL drill down
Parallelized Streaming Transformation Loader
5. Vertica Performance!
Impressive Parallel COPY Performance
Loaded 2.42 Billion Rows (451 GB)
in 7min 35sec on an 8 Node Cluster
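The load figures above translate into per-second and per-hour rates with a little arithmetic. The sketch below assumes decimal units (1 TB = 1000 GB) and, for the last line, perfectly linear scaling with node count. Both are assumptions made here, not statements from the slide, although under them the result lands close to the 36 TB/hour quoted for an 81-node cluster in this deck.

```python
rows = 2.42e9
gigabytes = 451
seconds = 7 * 60 + 35        # 7 min 35 sec = 455 s
nodes = 8

rows_per_second = rows / seconds
gb_per_second = gigabytes / seconds
tb_per_hour = gb_per_second * 3600 / 1000   # decimal TB (1 TB = 1000 GB)
tb_per_hour_per_node = tb_per_hour / nodes

# Assumption: throughput scales linearly with the number of nodes.
tb_per_hour_81_nodes = tb_per_hour_per_node * 81
```

That works out to roughly 5.3 million rows per second and about 3.6 TB per hour on the 8-node cluster.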
Key Takeaways
Parallel Kafka reads to Spark RDDs (in memory) with parallel
writes to Vertica via a TCP server: ROCKS!
COPY 36 TB/Hour with 81 Node cluster
No ephemeral nodes needed for ingest
Kafka read parallelism to Spark RDD partitions
A priori hash() in Spark RDD Partitions (in Memory)
TCP Server as a Vertica User-Defined COPY Source
Single COPY does not preallocate Memory across nodes
http://www.vertica.com/2014/09/17/how-vertica-met-facebooks-35-tbhour-ingest-sla/
* 270 Nodes ( 215 Data Nodes )
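The takeaway "Kafka read parallelism to Spark RDD partitions", together with the Kafka Table (appId, TopicOffsetRange, Batch_Id) from the drill down, suggests a batch planner along these lines. This is a hedged sketch: `plan_batch` and `commit_batch` are hypothetical helpers and plain dicts stand in for the offsets table. It is not the deck's actual implementation.

```python
def plan_batch(committed, latest):
    """Build one offset range per Kafka partition for the next batch.

    committed: dict partition -> next offset to read (loaded up to here)
    latest:    dict partition -> end offset currently available in Kafka
    Returns (partition, from_offset, until_offset) tuples, one per partition
    with new data; each tuple would feed exactly one RDD partition.
    """
    ranges = []
    for partition in sorted(latest):
        start = committed.get(partition, 0)
        end = latest[partition]
        if end > start:
            ranges.append((partition, start, end))
    return ranges


def commit_batch(committed, ranges):
    """Record the batch as loaded; re-committing the same ranges is a no-op."""
    updated = dict(committed)
    for partition, _start, end in ranges:
        updated[partition] = max(updated.get(partition, 0), end)
    return updated


committed = {0: 100, 1: 250}
latest = {0: 180, 1: 250, 2: 40}
ranges = plan_batch(committed, latest)
after = commit_batch(committed, ranges)
```

Because offsets are only advanced after the load succeeds, a failed batch can be re-planned and replayed with the same ranges, and committing twice leaves the bookkeeping unchanged.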
THANK YOU


Editor's Notes

  • #2: BigDataInfra: I was compelled to start a company to help companies with the same issues I have seen throughout my career. BigDataInfra specializes in the system architectures we will be discussing. IMHO, noETL is the latest discussion of noE, as T&L will always need to happen eventually, even with “schema-less” schema on read, Data Lakes, etc. noSQL should really be noRel, as is being proven out by everyone putting SQL consumption on non-RDBMS stores.
  • #6: My experience and influencers framed my architectural decisions
  • #16: http://kafka.apache.org/documentation.html Authored at LinkedIn, used by Twitter, …
  • #17: http://kafka.apache.org/documentation.html Authored at LinkedIn, used by Twitter, …
  • #18: We use Spark RDD partitioned data to parallelize operations to/from affinitized Vertica nodes, e.g., 3 Kafka partitions would be read in parallel into 3 Spark RDD partitions. Typically you want 2-4 partitions for each CPU in your cluster. Normally, Spark tries to set the number of partitions automatically based on your cluster. However, you can also set it manually by passing it as a second parameter to parallelize (e.g. sc.parallelize(data, 10)).
  • #19: SHUFFLE!