Using Spark Streaming and NiFi for the next
generation of ETL in the enterprise
Darryl Dutton, Principal Consultant, T4G
Kenneth Poon, Director of Data Engineering, RBC
The Journey
Agenda
What is the Event Standardization
Service (ESS) Use Case
The drivers to modernize ESS
The solution for ESS and benefits
The project challenges
The Good, the Bad and the Ugly
Questions
What is the
ESS Use Case
Event Standardization Service (ESS) captures
customer activity across all channels, such as
Online Banking, Mobile Apps, Bank Branch,
Advice Center, etc…
ESS facilitates customer journey reporting to
turn raw event data into actionable insights
ESS provides APIs to customer-facing systems
to get insights on recent customer activity,
journeys, and life events.
ESS – Business Value
Understand customer activity across all
channels
Identify customer journey from
interactions
Identify life events from journeys to
optimize customer experience
Life Events
Customer Journey
Events / Interactions
What is the ESS Use Case - Legacy
Event Standardization Service – Legacy Architecture (diagram: Data Source → Event Hub / Ingest → Processing → Data Storage and Batch Processing → Reporting / Analytics)
Real-time business events flow through IBM DataPower and IBM MQ into a Teradata Stage 0 area via TPump; a 60-minute mini-batch (SQL) then populates the Teradata Core EDW, the Teradata extended model, and Teradata report views via batch SQL. Batch source events are loaded through IBM DataStage and batch SQL. Reporting/analytics is served by batch extracts to an Oracle BI tool and to apps/ad hoc access.
The drivers to
modernize
ESS
Provide real-time access to customer event and
journey data
Reduce cost to enhance, support, and maintain
Simplify onboarding process for new systems
Support exponential growth of event data
Provide users with self-serve validation tools
Key Solution
Components
Extract & Load
Transformation
Integration
Event Standardization Service – High Level Design (diagram: Data Source → Event Hub / Ingest → Processing → Data Storage and Logic Processing → Reporting / Analytics)
Real-time business events arrive as XML via DataPower and IBM MQ; batch source events arrive as text from IBM DataStage. NiFi read-and-route processors (for near-real-time events, for batch events, and for persistence/OPS) convert events to JSON and publish them to Kafka topics. Spark Streaming jobs on YARN consume the JSON from Kafka, keeping WAL/offsets in Kafka, join in lookup/reference data, and write text and Parquet to HDFS plus feeds to the Teradata Core EDW, other data stores, and downstream systems. Elasticsearch with Kibana (OPS) and an email server support monitoring.
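In this design the NiFi layer's main job is read-and-route: inspect each incoming event and send it to the right Kafka topic for near-real-time, batch, or persistence/OPS processing. The routing decision itself is simple enough to sketch in a few lines of plain Python (the topic names and event shape here are illustrative assumptions, not taken from the talk; in NiFi this would typically be a RouteOnAttribute-style processor):

```python
def route(event):
    """Pick a destination topic the way a NiFi routing
    processor would, based on how the event arrived."""
    if event.get("source") == "batch":
        return "ess.batch.events"        # DataStage batch feeds
    if event.get("format") == "xml":
        return "ess.raw.xml"             # still needs conversion to JSON
    return "ess.realtime.events"         # near-real-time path

# Example routing decisions:
assert route({"source": "mq", "format": "json"}) == "ess.realtime.events"
assert route({"source": "batch", "format": "text"}) == "ess.batch.events"
assert route({"source": "mq", "format": "xml"}) == "ess.raw.xml"
```

Keeping this logic as configuration in NiFi rather than code is exactly the "configuration over code" benefit the talk highlights.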
The solution for ESS
NiFi Implementation (diagram): NiFi cluster with external HDFS.
Spark Implementation (diagram): Hadoop cluster with an edge node, a YARN Resource Manager, Node Managers 1–7, and HDFS data nodes, alongside a two-server Kafka cluster holding partitions P0–P5. The Spark application runs on YARN with the Spark driver as Application Master and six Spark executors spread across Node Managers 1–4.
Benefits
• Event data available for further
analytics in near real-time
• Scalability solved
• Handle longer outage windows
• Fast development and iterations
• Better data flow visibility
• Integration to legacy infrastructure
• Reinvestment of IT budget to
newer open source technologies
Project
Challenges
• Too many new things at once
• Lack of knowledge and
documentation of legacy systems
• Infrastructure readiness
• Implementing security requirements
• Versioning of different open source
Apache projects
• Getting to simple
The Good
The Bad &
The Ugly
NiFi Canvas – rapid build through configuration
NiFi Monitoring and Retry
NiFi – Integration & Load Testing
NiFi – Access Control (Groups, Users, LDAP integration)
NiFi – Supporting Different Environments
DEV
UAT
PROD
NiFi – Version Upgrade
1.3.0 → 1.5.0
Spark Streaming input source and output sink
“The streaming sinks are designed to be
idempotent for handling reprocessing.”
You need to handle deduplication of replayed/reprocessed records when
writing output if exactly-once processing is needed.
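Since the streaming engine delivers at-least-once to most sinks, the usual way to get effectively-exactly-once results is to make the sink idempotent: key every write on an event ID so that a replayed batch cannot produce duplicates. A minimal, Spark-free sketch of that idea (the `IdempotentSink` class and event shape are illustrative, not from the talk):

```python
class IdempotentSink:
    """Toy sink: replayed events whose ID was already committed are
    ignored, so reprocessing a batch cannot write duplicate rows."""

    def __init__(self):
        self.rows = []          # what actually got "written"
        self.seen_ids = set()   # IDs already committed

    def write_batch(self, events):
        for event in events:
            if event["id"] in self.seen_ids:
                continue        # duplicate from replay: drop it
            self.seen_ids.add(event["id"])
            self.rows.append(event)

sink = IdempotentSink()
batch = [{"id": 1, "type": "login"}, {"id": 2, "type": "transfer"}]
sink.write_batch(batch)
sink.write_batch(batch)        # simulated replay after a failure
assert len(sink.rows) == 2     # no duplicates despite reprocessing
```

In a real pipeline the committed-ID state would live in the target store (or the write itself would be an upsert keyed on the event ID), not in memory.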
Spark Structured Streaming….focus on logic code, not plumbing code
Spark Session
Read Stream
Transforms/Filters
Transforms/Filters
Transforms/Filters
Write Stream
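The point of the slide is that the read/write "plumbing" is declared once, and the body of the job is just a chain of transformations and filters. That shape can be mimicked without Spark in plain Python (the function names and record format are illustrative assumptions):

```python
# Each stage is a plain function over records; the pipeline composes them.
def parse(record):
    """'read stream' output -> structured record."""
    channel, event = record.split(":")
    return {"channel": channel, "event": event}

def keep_mobile(records):
    """A filter stage: pass only mobile-channel events."""
    return (r for r in records if r["channel"] == "mobile")

def standardize(records):
    """A transform stage: tag each event as standardized."""
    return ({**r, "standardized": True} for r in records)

raw = ["mobile:login", "branch:deposit", "mobile:transfer"]
# read stream -> transforms/filters -> write stream
out = list(standardize(keep_mobile(map(parse, raw))))
```

The logic code (`parse`, `keep_mobile`, `standardize`) carries the business meaning; everything else is framework plumbing that Structured Streaming supplies for you.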
Spark Structured Streaming….lazy design of sources and sinks
Spark Session
Read Stream
Transforms/Filters
Transforms/Filters
Transforms/Filters
Write Stream (one or more)
A single Spark Session can host multiple pipelines: a second Read Stream with its own Transforms/Filters and Write Stream runs alongside the first.
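The "lazy design" point can be demonstrated outside Spark with generators: the read and transform stages only build a plan, and no record is actually processed until a sink pulls on the pipeline (an illustrative sketch of the concept, not the Spark API):

```python
processed = []

def read_stream():
    """Pretend source: records when work really happens."""
    for event in ["login", "transfer", "logout"]:
        processed.append(event)
        yield event

def transform(stream):
    return (e.upper() for e in stream)

pipeline = transform(read_stream())  # plan built: nothing processed yet
assert processed == []               # lazy: no work until a sink pulls
result = list(pipeline)              # the "write stream" drives execution
assert processed == ["login", "transfer", "logout"]
assert result == ["LOGIN", "TRANSFER", "LOGOUT"]
```

This is why in Structured Streaming nothing runs until a write stream is started: sources and transforms are declarations, and the sink is what triggers execution.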
Spark Streaming Hosting on YARN….deploy, control and logging (diagram)
The Spark Streaming application runs on YARN data nodes, reading from Kafka topics fed by NiFi processors and writing to Kafka and HDFS.
Deploy: spark-submit (package your own Spark version), with temp files on HDFS.
Control: stop via text commands, or kill ("Stop?...Kill").
Logging: logs and metrics are 'tailed' and saved via Log4J2, with email notification.
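Because a YARN kill is abrupt, a common pattern behind the "Stop?...Kill / Text Commands" control path is a graceful-shutdown flag: the driver polls for a marker between micro-batches and stops cleanly when it appears. A local-file sketch of that idea (paths and names are illustrative assumptions; a real job would poll HDFS and call the streaming query's stop method):

```python
import os
import tempfile

def should_stop(marker_path):
    """The streaming driver polls this between micro-batches."""
    return os.path.exists(marker_path)

marker = os.path.join(tempfile.gettempdir(), "ess_stop_marker_demo")
if os.path.exists(marker):
    os.remove(marker)                   # clean slate for the demo

assert not should_stop(marker)          # job keeps running
open(marker, "w").close()               # operator requests a stop
assert should_stop(marker)              # driver exits cleanly next batch
os.remove(marker)
```

The advantage over `yarn application -kill` is that in-flight batches finish and offsets/checkpoints are committed before the process exits.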
/app/spark/spark-2.2.0/bin/spark-submit \
  --jars spark-sql-kafka-0-10_2.11-2.2.0.jar \
  --class <com.MainClassName> \
  --master yarn \
  --deploy-mode cluster \
  --queue <your queue name> \
  --num-executors 18 \
  --executor-cores 1 \
  --executor-memory 4G \
  --driver-memory 4G \
  --driver-java-options="-XX:+UseConcMarkSweepGC -Dhdp.version=current -Dlog4j.configuration=./log4j.properties -Dconfig.file=./application.conf -Djava.security.auth.login.config=./kafka_client_jaas.conf -Djava.security.krb5.conf=./krb5.conf" \
  --conf "spark.executor.extraJavaOptions=-XX:+UseConcMarkSweepGC -Dhdp.version=current -Dlog4j.configuration=./log4j.properties -Dconfig.file=./application.conf -Djava.security.auth.login.config=./kafka_client_jaas.conf -Djava.security.krb5.conf=./krb5.conf" \
  --conf "spark.yarn.maxAppAttempts=4" \
  --conf "spark.yarn.am.attemptFailuresValidityInterval=1h" \
  --conf "spark.yarn.max.executor.failures=16" \
  --conf "spark.speculation=false" \
  --conf "spark.task.maxFailures=1" \
  --conf "spark.hadoop.fs.hdfs.impl.disable.cache=true" \
  --conf "spark.ui.showConsoleProgress=false" \
  --conf "spark.shuffle.consolidateFiles=true" \
  --conf "spark.locality.wait=1s" \
  --conf "spark.sql.tungsten.enabled=false" \
  --conf "spark.sql.codegen=false" \
  --conf "spark.sql.unsafe.enabled=false" \
  --conf "spark.streaming.backpressure.enabled=true" \
  --conf "spark.streaming.kafka.consumer.cache.enabled=false" \
  --conf "spark.ui.view.acls=*" \
  --principal <your principal name> \
  --keytab <keytab file path> \
  --files ./log4j.properties#log4j.properties,./log4j2.xml#log4j2.xml,./application.conf#application.conf,./metrics.properties#metrics.properties,./kafka_client_jaas.conf#kafka_client_jaas.conf,/app/pbrtappk/YYYYY#YYYYYYY,./krb5.conf,./client.truststore.jks $1
Summary
• NiFi has been great on load/extract
• Use NiFi to handle routes & format
• Spark good for transforms
• Operationalizing Spark Streaming
is a challenge
• Deploying changes with NiFi a
challenge
• Keep it simple
Questions?
Darryl Dutton, T4G
darryl.dutton@T4G.com
Ready to Build Brilliant?
We’re always looking for new challenges
and teammates.
Connect with us!
800.399.5370
hello@t4g.com
www.t4g.com
Kenneth Poon, RBC
kenneth.t.poon@rbc.com
Helping clients thrive and
communities prosper.
Always hiring!
Simplify. Agile. Innovate.
jobs.rbc.com


Editor's Notes

  • #2: Darryl
  • #3: Darryl
  • #4: Darryl
  • #5: ESS (Event Standardization Service) is a new service built by RBC’s Data & Analytics group to collect customer interaction data across various channels (such as Online Banking, Mobile apps, Branch, ATMs, Advice Center) into a central repository, apply analytics to it, and then make the data available through APIs. The idea originated around 8 years ago with a system called ECS (Event Capture Service). ECS started collecting events from various channels and loading them into the data warehouse. Over time, upstream systems stopped sending new events because onboarding was difficult and expensive. The dataset became incomplete (new events were missing), making it unusable for customer journey reporting. Last year, RBC partnered with T4G to build a new event service (ESS) that would address all the pain points of the old system and be designed to capture ALL events across ALL channels, right now and into the future. One of the top priorities of 2018 is to be able to link online and offline activities to get a holistic view of the customer journey.
  • #6: One of the common questions we get asked is what we are doing with all these events. The goal of ESS is to turn the raw events into actionable insights that can improve the customer experience and the bank’s bottom line. At a high level, we want to construct customer journeys from the interaction data, which can help predict life events. Through path analysis and prediction, knowing a customer’s current and next stage in life allows us to target them with more relevant offers in a timely manner, and even geo-targeted offers since we also track location. Here are some of the other use cases we are currently working on: Advisor Support – enable advisors to view real-time interaction data to assist with problem resolution; Digital-to-Offline Efficiencies – identify opportunities to reduce Advice Centre call volumes; Sales Attribution – identify the right digital marketing mix to drive sales, and link digital activity (research) to offline conversions (mortgages).
  • #7: Before we started building the new event service, we wanted to understand how the old service was designed and implemented, and find out the reasons why it became unusable over time. The first thing we found was that the technology stack was a bit out-dated (but very mature and reliable). (click) Source systems would send XML events to a SOAP endpoint on DataPower, which gets routed to MQ queue, and feeds into Teradata warehouse through TPUMP utility and BTEQ mini-batch process, running every 60 minutes. (click) All the processing runs on Mainframe, triggered by JCLs and scheduled through Zeke. The data is then used for internal reporting on OBI and Tableau. (click) The batch feeds were copied from z/OS to Datastage, and then loaded to Teradata as well through same mini-batch process. (click) (click) As you can tell, a lot of vendor products were used, making it difficult to find resources in the market who have all of this expertise. Also, the folks who worked on it were either retired, switched teams, or no longer with the bank. But technology was only half the problem. Having a rigid XML schema and process-heavy development and deployment cycle resulted in months to deploy a simple change. These reasons made it very expensive to continue to use this system.
  • #8: Since we were going to re-architect the event service to make it easier for systems to use, we figured we would also modernize the tech stack to make it less costly to enhance, support, and maintain. As customers go digital and do more of their banking online and on apps, the number of interaction events generated is exponentially outgrowing what the old service was able to handle. RBC was falling behind in the channel analytics space, which is a huge lost opportunity for the bank if we can’t capitalize on all that customer data to analyze banking behavior and tendencies. Over the last 6 months, several new features have been rolled out across the different channels (especially digital and mobile apps), and we are happy to say that the new ESS service has been able to keep up with the demand. We were also able to go back and capture critical business events that were not onboarded before (such as branch and call center activity). I’ll hand it over to Darryl now to talk about the key components of the new solution.
  • #9: Darryl
  • #10: Darryl
  • #11: Darryl. NiFi: rapidly build data pipelines; required integrations supported; configuration over code; many available processors/services; easy ingestion, routing and splits; simple transforms and format changes; flow modification at runtime; built-in queuing and backpressure.
  • #12: Darryl
  • #13: Darryl
  • #14: Darryl. Spark: provides processing in near real time; micro-batching good enough; complex transformation/enrichment; Structured Streaming is elegant; automatic retries; out-of-box integration with Kafka; use SQL on streaming data; the API allows a future path to ML.
  • #15: Darryl
  • #16: Darryl. Kafka: move data across boundaries; de-couple systems; reliable messaging; high performance, high volume; hold events for long outages; supported integrations.
  • #17: Darryl
  • #18: Darryl. We recently upgraded from 1.3.0 to 1.5.0 in Production. We had both instances up in parallel and migrated one template at a time so that we had an easy rollback if it didn’t work in 1.5.0. We use an external ZooKeeper for the NiFi cluster because of an excessive logging issue: https://issues.apache.org/jira/browse/NIFI-3731
  • #19: Darryl
  • #20: From what Darryl described, there are quite a few benefits in the new architecture: Instead of events being made available for processing after 60 minutes, Spark Streaming enables them to be consumed in near real time, within seconds. Building a distributed system allows us to scale horizontally: we can add more nodes to the NiFi cluster, increase the number of Kafka partitions, or increase the number of Spark executors as the volume of events grows over time. We moved away from vendor products and embraced open source, although we use Confluent for Kafka and Hortonworks for Hadoop. Using open source frees us from vendor lock-in and assures long-term viability. It's also easier to find developers who are interested in working on new tech, which allows for succession planning. Not everything was new, though: we continued to use proven enterprise infrastructure (DataPower and MQ) as our REST API layer to receive events, ensuring high availability and fault tolerance. This was in lieu of having a REST proxy available for Kafka. Using NiFi sped up development (it allowed for quick prototyping and testing) and also made it operationally easier to manage data in motion through visual controls.
  • #21: As with any new project, we encountered some challenges along the way. When you're building a new system on all new tech, the last thing you need is to also introduce a new language: Scala. My developers (all from a Java background) didn't know Scala, but wanted to use it for Spark Streaming. If you have zero experience in Scala, prototyping something that works is very different from writing a production-grade app. Understanding what the old service did was also a challenge: we didn't have the right skill set to understand the legacy mainframe implementation (and there was also a lack of documentation). Security requirements (communication over SSL, Kerberos authentication) meant lots of certs: between NiFi nodes in the cluster, connecting to HDFS, connecting to Kafka, connecting to Elasticsearch. Open-source projects continuously release new versions: for NiFi we started with 1.1.0, then 1.3.0, and now 1.5.0. When we started, we only had Kafka 0.9, but needed Kafka 0.10 or higher to support SSL for Spark Streaming's integration with Kafka; we had to wait four to five months for the Kafka 0.10 cluster to be ready. To prevent ESS from becoming obsolete over time, we'll need to continue to optimize and simplify the tech stack so the next generation of folks doesn't retire this system in five years.
  • #22: After a year of working with Hadoop, NiFi, Spark, Kafka, and Elasticsearch, several teams at RBC are becoming proficient with them. That wasn't the case last year, when both the development and platform teams were learning at the same time: dev teams were learning how to build production-grade apps on Hadoop, and the platform team was learning how to manage and operate an enterprise Hadoop and Kafka environment that supports multitenancy. I'll go over our experience with NiFi, and Darryl will go over our experience with Spark Streaming.
  • #23: As a manager, one thing I love about NiFi is how quickly developers can whip up new data flows. NiFi is perfect for moving data from one or more sources to one or more sinks. Code reviews are much easier because the result is less subject to individual coding styles; NiFi is more like configuration as code. The new thing to gripe about now is how straight the connector lines are and the spacing between processors. I was never a fan of the drag-and-drop ETL dev tools at RBC (such as BusinessWorks and DataStage), but NiFi gives you more control, has an easy-to-use interface, and is more scalable. NiFi does data movement very well. Debugging was extremely easy with the provenance repository: if anything failed, it was easy to find out which message failed to process and why.
  • #24: Monitoring failures and implementing retries is always a pain when you have to code it yourself; NiFi makes it very easy to configure. For retry, just add a self-loop. (click) We configure the processor's penalty duration if we want to introduce a backoff and wait a certain amount of time before retrying. The typical use case for retry-on-failure is a connection failure while writing to a sink (i.e. to HDFS, a Kafka topic, or Elasticsearch). (click, click) Whenever there is a failure, we use the MonitorActivity processor (click) to send a consolidated email every 5 minutes alerting our support team. (click) Once the issue has been resolved, we send a recovery alert email. (click, click)
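The retry-with-penalty behavior described above can be sketched in a few lines of Python. This is a hypothetical illustration of the pattern, not NiFi's implementation: on failure the attempt is "penalized" (we wait a backoff interval) and looped back, like a self-loop with a penalty duration on the processor.

```python
# Sketch of NiFi's self-loop retry with a penalty duration: retry a sink
# write after waiting out the penalty; give up after max_attempts and let
# the caller alert (the MonitorActivity role in the real flow).
import time

def write_with_retry(write, payload, max_attempts=3, penalty_seconds=0.01):
    """Try a sink write; on connection failure wait the penalty and retry."""
    for attempt in range(1, max_attempts + 1):
        try:
            return write(payload)
        except ConnectionError:
            if attempt == max_attempts:
                raise  # exhausted retries: surface the failure for alerting
            time.sleep(penalty_seconds)  # the "penalty" before re-queueing

attempts = []
def flaky_sink(payload):
    """Simulated sink that fails twice before succeeding."""
    attempts.append(payload)
    if len(attempts) < 3:
        raise ConnectionError("sink unavailable")
    return "ok"

result = write_with_retry(flaky_sink, "event-1")
```

In NiFi none of this is code; the penalty duration and the self-loop relationship are just processor configuration, which is exactly why it is so easy to set up.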
  • #25: For testing, we created a bunch of simple flows that read data from disk and publish to one of our ingestion points (either MQ or a Kafka topic). We used these for load testing when we needed to pump hundreds of thousands of events into the system in a very short period of time, simulating volume at 10x the peak and measuring our expected throughput. We also had test Kafka consumers to verify that we had indeed published to the topic successfully. The test classes would normally have taken us around 5-10 minutes to code in Java; in NiFi, we can whip them up in a minute or two.
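A minimal sketch of that kind of load-test harness, assuming a pluggable `publish` function standing in for the MQ or Kafka producer (names and payloads here are invented for illustration):

```python
# Hypothetical load-test harness in the spirit of the simple NiFi test flows:
# pump pre-recorded events at a publish function and report the throughput
# achieved, so the result can be compared against the expected rate.
import time

def pump_events(events, publish):
    """Publish each pre-recorded event; return (count, events/second)."""
    start = time.perf_counter()
    count = 0
    for event in events:
        publish(event)
        count += 1
    elapsed = time.perf_counter() - start
    return count, count / elapsed if elapsed > 0 else float("inf")

sent = []  # stand-in sink; a real test would publish to MQ or Kafka
count, rate = pump_events((f"event-{i}" for i in range(1000)), sent.append)
```

The matching test consumer from the talk would then read the topic back and verify the same `count` arrived.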
  • #26: As with any GUI, we had to implement some sort of access control. We configured LDAP authentication against Active Directory for user login. We used SSL certificates for the initial login and for setting up the secure cluster, and then disabled certificate login. We created read-only, read-write, and admin NiFi groups, assigning different policies to each; this was fairly straightforward. In PROD, developers have read-only access, and support folks have write and admin access.
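The group-to-policy mapping is simple enough to sketch. The group names and action names below are hypothetical; the real setup used NiFi's policy model backed by LDAP groups in Active Directory.

```python
# Sketch of the three-tier policy model: each group grants a widening set
# of actions, and a permission check is just set membership.
POLICIES = {
    "nifi-readonly": {"view"},
    "nifi-readwrite": {"view", "modify"},
    "nifi-admin": {"view", "modify", "administer"},
}

def can(group, action):
    """Check whether a group's policy permits an action."""
    return action in POLICIES.get(group, set())

# In PROD: developers get read-only, support gets write/admin.
dev_can_modify = can("nifi-readonly", "modify")
support_can_admin = can("nifi-admin", "administer")
```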
  • #27: Traditionally with our Java applications, we deploy the same JAR file in each environment but read from different config files. Replicating this in NiFi was not straightforward because not all configs could be externalized into variables. Often we had to manually alter certain configuration values after importing a flow into a different environment. (click) To reduce inadvertent changes to existing processors, we decided that for brand-new or completely refactored flows, we would replace the entire flow XML (export from the lower environment and promote it to the next one). For minor changes, we would just manually re-apply them in the subsequent environment. (click) Another concern when working across environments is that the NiFi canvas looks the same (grey background) in every environment. We used a custom JavaScript plugin for Chrome to change the canvas background color per environment: green for DEV, yellow for UAT, and red for PROD. That way, we could be more careful when working in PROD.
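The manual promotion step (rewriting the values that couldn't be externalized into NiFi variables) amounts to a per-environment substitution table. A hedged sketch, with made-up hostnames:

```python
# Sketch of promoting an exported flow definition between environments:
# rewrite the environment-specific values that NiFi variables couldn't
# capture. Hostnames below are invented for the example.
ENV_VALUES = {
    "dev":  {"kafka": "kafka-dev:9092"},
    "uat":  {"kafka": "kafka-uat:9092"},
    "prod": {"kafka": "kafka-prod:9092"},
}

def promote(flow_text, source_env, target_env):
    """Rewrite source-environment values to their target-environment equivalents."""
    src, dst = ENV_VALUES[source_env], ENV_VALUES[target_env]
    for key in src:
        flow_text = flow_text.replace(src[key], dst[key])
    return flow_text

flow = "<property>kafka-dev:9092</property>"
promoted = promote(flow, "dev", "prod")
```

Scripting the substitution this way is safer than hand-editing each processor after import, which is where the inadvertent changes crept in.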
  • #28: A few months ago, in March, we upgraded from NiFi 1.3.0 to 1.5.0, after being on 1.3.0 for a good 8 months. There is no magic button for an in-place upgrade. What we did was set up a parallel NiFi cluster on different ports (with a separate ZooKeeper cluster). (click) We stopped all the processors that ingested new data, let any in-flight messages finish processing (click), and then shut down the old NiFi instance. We then started the processors back up on the new 1.5.0 cluster. (click) There were two reasons we upgraded: to resolve a PROD issue where NiFi couldn't start back up, and to use newer versions of the Kafka producer and consumer. Once in PROD, our data center lost power and all our servers shut down unexpectedly. When the servers came back up, we couldn't restart the NiFi instance due to a "No enum constant CONTENTMISSING" error (NIFI-4093), which was fixed in NiFi 1.4: the JettyServer couldn't start because a bug used the wrong enum to determine how to process an update to the FlowFile repository. At that time, the only way to get NiFi to start back up was to clear out the FlowFile repository, which meant all in-flight messages were lost. The other reason was to keep up to date with the Kafka producer/consumer versions. We migrated to Kafka 0.11 early in the year, but NiFi 1.3.0 only shipped Kafka producers and consumers up to 0.10, so before upgrading we downloaded the newer Kafka producer/consumer NAR files from NiFi 1.5.0 and used them in 1.3.0. Ever since upgrading to NiFi 1.5.0, our cluster has been much more stable, and we haven't faced any cluster issues in the last 3 months. Darryl will now talk about some of the good and bad of the Spark Streaming component.
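The drain-and-cutover upgrade procedure above can be sketched as a small state transition. This is a hypothetical illustration of the ordering of the steps, not a real cluster API:

```python
# Sketch of the parallel-cluster upgrade: stop ingestion on the old cluster,
# drain in-flight messages, shut it down, then enable the new cluster that
# has been running alongside on different ports.
def cutover(old_cluster, new_cluster):
    """Drain the old cluster, then move processing to the new one."""
    old_cluster["ingesting"] = False          # stop the source processors
    while old_cluster["in_flight"] > 0:
        old_cluster["in_flight"] -= 1         # in reality: wait for queues to drain
    old_cluster["running"] = False            # shut down the old instance
    new_cluster["running"] = True             # bring up the 1.5.0 cluster
    new_cluster["ingesting"] = True           # start its processors
    return new_cluster

old = {"version": "1.3.0", "running": True, "ingesting": True, "in_flight": 42}
new = {"version": "1.5.0", "running": False, "ingesting": False, "in_flight": 0}
active = cutover(old, new)
```

The key property of the ordering is that ingestion stops and queues empty before the old instance goes down, so no in-flight message is lost during the switch, and the old cluster stays available as a rollback target until the new one is verified.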
  • #29: Darryl
  • #30: Darryl
  • #31: Darryl
  • #32: Darryl
  • #33: Darryl
  • #34: Darryl
  • #35: Darryl
  • #36: Darryl
  • #37: Darryl
  • #38: >> Darryl to speak first RBC is based in Toronto, Ontario, Canada, but we have offices around the world as well (New York, London, etc…) The Data & Analytics team is in Toronto and we are always looking to hire strong developers. Feel free to email me directly, connect over LinkedIn, or just visit jobs.rbc.com to explore available opportunities. Thanks