Using Spark Streaming and NiFi for the next
generation of ETL in the enterprise
Darryl Dutton, Principal Consultant, T4G
Kenneth Poon, Director of Data Engineering, RBC
The Journey
Agenda
What is the Event Standardization
Service (ESS) Use Case
The drivers to modernize ESS
The solution for ESS and benefits
The project challenges
The Good, the Bad and the Ugly
Questions
What is the
ESS Use Case
Event Standardization Service (ESS) captures
customer activity across all channels, such as
Online Banking, Mobile Apps, Bank Branch,
Advice Center, etc…
ESS facilitates customer journey reporting to
turn raw event data into actionable insights
ESS provides APIs to customer-facing systems
to get insights on recent customer activity,
journeys, and life events.
ESS – Business Value
Understand customer activity across all
channels
Identify customer journey from
interactions
Identify life events from journeys to
optimize customer experience
Life Events
Customer Journey
Events / Interactions
What is the ESS Use Case - Legacy
Event Standardization Service – Legacy Architecture (diagram: Data Source → Event Hub / Ingest → Processing → Data Storage and Batch Processing → Reporting / Analytics)
Real-time business events flow through IBM DataPower and IBM MQ into a Teradata Stage 0 area via TPump; a 60-minute mini-batch (SQL) then populates the Teradata Core EDW, the Teradata extended model, and Teradata report views via batch SQL. Batch source events are loaded through IBM DataStage and batch SQL. Reporting/analytics is served by batch extracts to an Oracle BI tool and to apps/ad hoc access.
The drivers to
modernize
ESS
Provide real-time access to customer event and
journey data
Reduce cost to enhance, support, and maintain
Simplify onboarding process for new systems
Support exponential growth of event data
Provide users with self-serve validation tools
Key Solution
Components
Extract & Load
Transformation
Integration
Event Standardization Service – High Level Design (diagram: Data Source → Event Hub / Ingest → Processing → Data Storage and Logic Processing → Reporting / Analytics)
Real-time business events arrive as XML via DataPower and IBM MQ; batch source events arrive as text from IBM DataStage. NiFi read-and-route processors (for near-real-time events, for batch events, and for persistence/OPS) convert events to JSON and publish them to Kafka topics. Spark Streaming jobs on YARN consume the JSON from Kafka, keeping WAL/offsets in Kafka, join in lookup/reference data, and write text and Parquet to HDFS plus feeds to the Teradata Core EDW, other data stores, and downstream systems. Elasticsearch with Kibana (OPS) and an email server support monitoring.
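In this design the NiFi layer's main job is read-and-route: inspect each incoming event and send it to the right Kafka topic for near-real-time, batch, or persistence/OPS processing. The routing decision itself is simple enough to sketch in a few lines of plain Python (the topic names and event shape here are illustrative assumptions, not taken from the talk; in NiFi this would typically be a RouteOnAttribute-style processor):

```python
def route(event):
    """Pick a destination topic the way a NiFi routing
    processor would, based on how the event arrived."""
    if event.get("source") == "batch":
        return "ess.batch.events"        # DataStage batch feeds
    if event.get("format") == "xml":
        return "ess.raw.xml"             # still needs conversion to JSON
    return "ess.realtime.events"         # near-real-time path

# Example routing decisions:
assert route({"source": "mq", "format": "json"}) == "ess.realtime.events"
assert route({"source": "batch", "format": "text"}) == "ess.batch.events"
assert route({"source": "mq", "format": "xml"}) == "ess.raw.xml"
```

Keeping this logic as configuration in NiFi rather than code is exactly the "configuration over code" benefit the talk highlights.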
The solution for ESS
NiFi Implementation (diagram): NiFi cluster with external HDFS.
Spark Implementation (diagram): Hadoop cluster with an edge node, a YARN Resource Manager, Node Managers 1–7, and HDFS data nodes, alongside a two-server Kafka cluster holding partitions P0–P5. The Spark application runs on YARN with the Spark driver as Application Master and six Spark executors spread across Node Managers 1–4.
Benefits
• Event data available for further
analytics in near real-time
• Scalability solved
• Handle longer outage windows
• Fast development and iterations
• Better data flow visibility
• Integration to legacy infrastructure
• Reinvestment of IT budget to
newer open source technologies
Project
Challenges
• Too many new things at once
• Lack of knowledge and
documentation of legacy systems
• Infrastructure readiness
• Implementing security requirements
• Versioning of different open source
Apache projects
• Getting to simple
The Good
The Bad &
The Ugly
NiFi Canvas – rapid build through configuration
NiFi Monitoring and Retry
NiFi – Integration & Load Testing
NiFi – Access Control (Groups, Users, LDAP integration)
NiFi – Supporting Different Environments
DEV
UAT
PROD
NiFi – Version Upgrade
1.3.0 → 1.5.0
Spark Streaming input source and output sink
“The streaming sinks are designed to be
idempotent for handling reprocessing.”
You need to handle deduplication of replayed/reprocessed records when
writing output if exactly-once processing is needed.
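Since the streaming engine delivers at-least-once to most sinks, the usual way to get effectively-exactly-once results is to make the sink idempotent: key every write on an event ID so that a replayed batch cannot produce duplicates. A minimal, Spark-free sketch of that idea (the `IdempotentSink` class and event shape are illustrative, not from the talk):

```python
class IdempotentSink:
    """Toy sink: replayed events whose ID was already committed are
    ignored, so reprocessing a batch cannot write duplicate rows."""

    def __init__(self):
        self.rows = []          # what actually got "written"
        self.seen_ids = set()   # IDs already committed

    def write_batch(self, events):
        for event in events:
            if event["id"] in self.seen_ids:
                continue        # duplicate from replay: drop it
            self.seen_ids.add(event["id"])
            self.rows.append(event)

sink = IdempotentSink()
batch = [{"id": 1, "type": "login"}, {"id": 2, "type": "transfer"}]
sink.write_batch(batch)
sink.write_batch(batch)        # simulated replay after a failure
assert len(sink.rows) == 2     # no duplicates despite reprocessing
```

In a real pipeline the committed-ID state would live in the target store (or the write itself would be an upsert keyed on the event ID), not in memory.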
Spark Structured Streaming….focus on logic code, not plumbing code
Spark Session
Read Stream
Transforms/Filters
Transforms/Filters
Transforms/Filters
Write Stream
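The point of the slide is that the read/write "plumbing" is declared once, and the body of the job is just a chain of transformations and filters. That shape can be mimicked without Spark in plain Python (the function names and record format are illustrative assumptions):

```python
# Each stage is a plain function over records; the pipeline composes them.
def parse(record):
    """'read stream' output -> structured record."""
    channel, event = record.split(":")
    return {"channel": channel, "event": event}

def keep_mobile(records):
    """A filter stage: pass only mobile-channel events."""
    return (r for r in records if r["channel"] == "mobile")

def standardize(records):
    """A transform stage: tag each event as standardized."""
    return ({**r, "standardized": True} for r in records)

raw = ["mobile:login", "branch:deposit", "mobile:transfer"]
# read stream -> transforms/filters -> write stream
out = list(standardize(keep_mobile(map(parse, raw))))
```

The logic code (`parse`, `keep_mobile`, `standardize`) carries the business meaning; everything else is framework plumbing that Structured Streaming supplies for you.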
Spark Structured Streaming….lazy design of sources and sinks
Spark Session
Read Stream
Transforms/Filters
Transforms/Filters
Transforms/Filters
Write Stream (one or more)
A single Spark Session can host multiple pipelines: a second Read Stream with its own Transforms/Filters and Write Stream runs alongside the first.
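The "lazy design" point can be demonstrated outside Spark with generators: the read and transform stages only build a plan, and no record is actually processed until a sink pulls on the pipeline (an illustrative sketch of the concept, not the Spark API):

```python
processed = []

def read_stream():
    """Pretend source: records when work really happens."""
    for event in ["login", "transfer", "logout"]:
        processed.append(event)
        yield event

def transform(stream):
    return (e.upper() for e in stream)

pipeline = transform(read_stream())  # plan built: nothing processed yet
assert processed == []               # lazy: no work until a sink pulls
result = list(pipeline)              # the "write stream" drives execution
assert processed == ["login", "transfer", "logout"]
assert result == ["LOGIN", "TRANSFER", "LOGOUT"]
```

This is why in Structured Streaming nothing runs until a write stream is started: sources and transforms are declarations, and the sink is what triggers execution.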
Spark Streaming Hosting on YARN….deploy, control and logging (diagram)
The Spark Streaming application runs on YARN data nodes, reading from Kafka topics fed by NiFi processors and writing to Kafka and HDFS.
Deploy: spark-submit (package your own Spark version), with temp files on HDFS.
Control: stop via text commands, or kill ("Stop?...Kill").
Logging: logs and metrics are 'tailed' and saved via Log4J2, with email notification.
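Because a YARN kill is abrupt, a common pattern behind the "Stop?...Kill / Text Commands" control path is a graceful-shutdown flag: the driver polls for a marker between micro-batches and stops cleanly when it appears. A local-file sketch of that idea (paths and names are illustrative assumptions; a real job would poll HDFS and call the streaming query's stop method):

```python
import os
import tempfile

def should_stop(marker_path):
    """The streaming driver polls this between micro-batches."""
    return os.path.exists(marker_path)

marker = os.path.join(tempfile.gettempdir(), "ess_stop_marker_demo")
if os.path.exists(marker):
    os.remove(marker)                   # clean slate for the demo

assert not should_stop(marker)          # job keeps running
open(marker, "w").close()               # operator requests a stop
assert should_stop(marker)              # driver exits cleanly next batch
os.remove(marker)
```

The advantage over `yarn application -kill` is that in-flight batches finish and offsets/checkpoints are committed before the process exits.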
/app/spark/spark-2.2.0/bin/spark-submit \
  --jars spark-sql-kafka-0-10_2.11-2.2.0.jar \
  --class <com.MainClassName> \
  --master yarn \
  --deploy-mode cluster \
  --queue <your queue name> \
  --num-executors 18 \
  --executor-cores 1 \
  --executor-memory 4G \
  --driver-memory 4G \
  --driver-java-options="-XX:+UseConcMarkSweepGC -Dhdp.version=current -Dlog4j.configuration=./log4j.properties -Dconfig.file=./application.conf -Djava.security.auth.login.config=./kafka_client_jaas.conf -Djava.security.krb5.conf=./krb5.conf" \
  --conf "spark.executor.extraJavaOptions=-XX:+UseConcMarkSweepGC -Dhdp.version=current -Dlog4j.configuration=./log4j.properties -Dconfig.file=./application.conf -Djava.security.auth.login.config=./kafka_client_jaas.conf -Djava.security.krb5.conf=./krb5.conf" \
  --conf "spark.yarn.maxAppAttempts=4" \
  --conf "spark.yarn.am.attemptFailuresValidityInterval=1h" \
  --conf "spark.yarn.max.executor.failures=16" \
  --conf "spark.speculation=false" \
  --conf "spark.task.maxFailures=1" \
  --conf "spark.hadoop.fs.hdfs.impl.disable.cache=true" \
  --conf "spark.ui.showConsoleProgress=false" \
  --conf "spark.shuffle.consolidateFiles=true" \
  --conf "spark.locality.wait=1s" \
  --conf "spark.sql.tungsten.enabled=false" \
  --conf "spark.sql.codegen=false" \
  --conf "spark.sql.unsafe.enabled=false" \
  --conf "spark.streaming.backpressure.enabled=true" \
  --conf "spark.streaming.kafka.consumer.cache.enabled=false" \
  --conf "spark.ui.view.acls=*" \
  --principal <your principal name> \
  --keytab <keytab file path> \
  --files ./log4j.properties#log4j.properties,./log4j2.xml#log4j2.xml,./application.conf#application.conf,./metrics.properties#metrics.properties,./kafka_client_jaas.conf#kafka_client_jaas.conf,/app/pbrtappk/YYYYY#YYYYYYY,./krb5.conf,./client.truststore.jks $1
Summary
• NiFi has been great on load/extract
• Use NiFi to handle routes & format
• Spark good for transforms
• Operationalizing Spark Streaming
is a challenge
• Deploying changes with NiFi a
challenge
• Keep it simple
Questions?
Darryl Dutton, T4G
darryl.dutton@T4G.com
Ready to Build Brilliant?
We’re always looking for new challenges
and teammates.
Connect with us!
800.399.5370
hello@t4g.com
www.t4g.com
Kenneth Poon, RBC
kenneth.t.poon@rbc.com
Helping clients thrive and
communities prosper.
Always hiring!
Simplify. Agile. Innovate.
jobs.rbc.com


Editor's Notes

  • #2: Darryl
  • #3: Darryl
  • #4: Darryl
  • #5: ESS (Event Standardization Service) is a new service built by RBC’s Data & Analytics group to collect customer interaction data across various channels (such as Online Banking, Mobile apps, Branch, ATMs, Advice Center) into a central repository, apply analytics to it, and then make the data available through APIs. The idea originated around 8 years ago with a system called ECS (Event Capture Service). ECS started collecting events from various channels and loading them into the data warehouse. Over time, upstream systems stopped sending new events because onboarding was difficult and expensive. The dataset became incomplete (new events were missing), making it unusable for customer journey reporting. Last year, RBC partnered with T4G to build a new event service (ESS) that would address all the pain points of the old system and be designed to capture ALL events across ALL channels, right now and into the future. One of the top priorities of 2018 is to be able to link online and offline activities to get a holistic view of the customer journey.
  • #6: One of the common questions we get asked is what we are doing with all these events. The goal of ESS is to turn the raw events into actionable insights that can improve the customer experience and the bank’s bottom line. At a high level, we want to construct customer journeys from the interaction data, which can help predict life events. Through path analysis and prediction, knowing a customer’s current and next stage in life allows us to target them with more relevant offers in a timely manner, and even geo-targeted offers since we also track location. Here are some of the other use cases we are currently working on: Advisor Support – enable advisors to view real-time interaction data to assist with problem resolution; Digital-to-Offline Efficiencies – identify opportunities to reduce Advice Centre call volumes; Sales Attribution – identify the right digital marketing mix to drive sales, and link digital activity (research) to offline conversions (mortgages).
  • #7: Before we started building the new event service, we wanted to understand how the old service was designed and implemented, and find out the reasons why it became unusable over time. The first thing we found was that the technology stack was a bit out-dated (but very mature and reliable). (click) Source systems would send XML events to a SOAP endpoint on DataPower, which gets routed to MQ queue, and feeds into Teradata warehouse through TPUMP utility and BTEQ mini-batch process, running every 60 minutes. (click) All the processing runs on Mainframe, triggered by JCLs and scheduled through Zeke. The data is then used for internal reporting on OBI and Tableau. (click) The batch feeds were copied from z/OS to Datastage, and then loaded to Teradata as well through same mini-batch process. (click) (click) As you can tell, a lot of vendor products were used, making it difficult to find resources in the market who have all of this expertise. Also, the folks who worked on it were either retired, switched teams, or no longer with the bank. But technology was only half the problem. Having a rigid XML schema and process-heavy development and deployment cycle resulted in months to deploy a simple change. These reasons made it very expensive to continue to use this system.
  • #8: Since we were going to re-architect the event service to make it easier for systems to use, we figured we would also modernize the tech stack to make it less costly to enhance, support, and maintain. As customers go digital and do more of their banking online and on apps, the number of interaction events generated is exponentially outgrowing what the old service was able to handle. RBC was falling behind in the channel analytics space, which is a huge lost opportunity for the bank if we can’t capitalize on all that customer data to analyze banking behavior and tendencies. Over the last 6 months, several new features have been rolled out across the different channels (especially digital and mobile apps), and we are happy to say that the new ESS service has been able to keep up with the demand. We were also able to go back and capture critical business events that were not onboarded before (such as branch and call center activity). I’ll hand it over to Darryl now to talk about the key components of the new solution.
  • #9: Darryl
  • #10: Darryl
  • #11: Darryl. NiFi: rapidly build data pipelines; required integrations supported; configuration over code; many available processors/services; easy ingestion, routing and splits; simple transforms and format changes; flow modification at runtime; built-in queuing and backpressure.
  • #12: Darryl
  • #13: Darryl
  • #14: Darryl. Spark: provides processing in near real time; micro-batching good enough; complex transformation/enrichment; Structured Streaming is elegant; automatic retries; out-of-box integration with Kafka; use SQL on streaming data; the API allows a future path to ML.
  • #15: Darryl
  • #16: Darryl. Kafka: move data across boundaries; de-couple systems; reliable messaging; high performance, high volume; hold events for long outages; supported integrations.
  • #17: Darryl
  • #18: Darryl. We recently upgraded from 1.3.0 to 1.5.0 in Production. We had both instances up in parallel and migrated one template at a time so that we had an easy rollback if it didn’t work in 1.5.0. We use an external ZooKeeper for the NiFi cluster because of an excessive logging issue: https://issues.apache.org/jira/browse/NIFI-3731
  • #19: Darryl
  • #20: From what Darryl described, there are quite a few benefits in the new architecture: Instead of events being made available for processing after 60 minutes, Spark Streaming enables them to be consumed in near real time, within seconds. Building a distributed system allows us to scale horizontally: we can add more nodes to the NiFi cluster, increase the number of Kafka partitions, or increase the number of Spark executors as the volume of events grows over time. We moved away from vendor products and embraced open source, although we use Confluent for Kafka and Hortonworks for Hadoop. Using open source frees us from vendor lock-in and assures long-term viability. It's also easier to find developers who are interested in working on new tech, which allows for succession planning. Not everything was new, though: we continued to use proven enterprise infrastructure (DataPower and MQ) as our REST API layer to receive events, ensuring high availability and fault tolerance. This was in lieu of having a REST proxy available for Kafka. Using NiFi sped up development (it allowed for quick prototyping and testing) and also made it operationally easier to manage data in motion through visual controls.
  • #21: As with any new project, we encountered some challenges along the way. When you're building a new system on all new tech, the last thing you need is to also introduce a new language: Scala. My developers (all from a Java background) didn't know Scala, but wanted to use it for Spark Streaming. If you have zero experience in Scala, prototyping something that works is very different from writing a production-grade app. Understanding what the old service did was also a challenge: we didn't have the right skill set to understand the legacy mainframe implementation (and there was also a lack of documentation). Security requirements (communication over SSL, Kerberos authentication) meant lots of certs: between NiFi nodes in the cluster, connecting to HDFS, connecting to Kafka, connecting to Elasticsearch. Open-source projects continuously release new versions: for NiFi we started with 1.1.0, then 1.3.0, and now 1.5.0. When we started, we only had Kafka 0.9, but needed Kafka 0.10 or higher to support SSL for Spark Streaming's integration with Kafka; we had to wait four to five months for the Kafka 0.10 cluster to be ready. To prevent ESS from becoming obsolete over time, we'll need to continue to optimize and simplify the tech stack so the next generation of folks doesn't retire this system in five years.
  • #22: After a year of working with Hadoop, NiFi, Spark, Kafka, and Elasticsearch, several teams at RBC are becoming proficient with them. That wasn't the case last year, when both the development and platform teams were learning at the same time: dev teams were learning how to build production-grade apps on Hadoop, and the platform team was learning how to manage and operate an enterprise Hadoop and Kafka environment that supports multitenancy. I'll go over our experience with NiFi, and Darryl will go over our experience with Spark Streaming.
  • #23: As a manager, one thing I love about NiFi is how quickly developers can whip up new data flows. NiFi is perfect for moving data from one or more sources to one or more sinks. Code reviews are much easier because the result is less subject to individual coding styles; NiFi is more like configuration as code. The new thing to gripe about now is how straight the connector lines are and the spacing between processors. I was never a fan of the drag-and-drop ETL dev tools at RBC (such as BusinessWorks and DataStage), but NiFi gives you more control, has an easy-to-use interface, and is more scalable. NiFi does data movement very well. Debugging was extremely easy with the provenance repository: if anything failed, it was easy to find out which message failed to process and why.
  • #24: Monitoring failures and implementing retries is always a pain when you have to code it yourself; NiFi makes it very easy to configure. For retry, just add a self-loop. (click) We configure the processor's penalty duration if we want to introduce a backoff and wait a certain amount of time before retrying. The typical use case for retry-on-failure is a connection failure while writing to a sink (i.e. to HDFS, a Kafka topic, or Elasticsearch). (click, click) Whenever there is a failure, we use the MonitorActivity processor (click) to send a consolidated email every 5 minutes alerting our support team. (click) Once the issue has been resolved, we send a recovery alert email. (click, click)
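The retry-with-penalty behavior described above can be sketched in a few lines of Python. This is a hypothetical illustration of the pattern, not NiFi's implementation: on failure the attempt is "penalized" (we wait a backoff interval) and looped back, like a self-loop with a penalty duration on the processor.

```python
# Sketch of NiFi's self-loop retry with a penalty duration: retry a sink
# write after waiting out the penalty; give up after max_attempts and let
# the caller alert (the MonitorActivity role in the real flow).
import time

def write_with_retry(write, payload, max_attempts=3, penalty_seconds=0.01):
    """Try a sink write; on connection failure wait the penalty and retry."""
    for attempt in range(1, max_attempts + 1):
        try:
            return write(payload)
        except ConnectionError:
            if attempt == max_attempts:
                raise  # exhausted retries: surface the failure for alerting
            time.sleep(penalty_seconds)  # the "penalty" before re-queueing

attempts = []
def flaky_sink(payload):
    """Simulated sink that fails twice before succeeding."""
    attempts.append(payload)
    if len(attempts) < 3:
        raise ConnectionError("sink unavailable")
    return "ok"

result = write_with_retry(flaky_sink, "event-1")
```

In NiFi none of this is code; the penalty duration and the self-loop relationship are just processor configuration, which is exactly why it is so easy to set up.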
  • #25: For testing, we created a bunch of simple flows that read data from disk and publish to one of our ingestion points (either MQ or a Kafka topic). We used these for load testing when we needed to pump hundreds of thousands of events into the system in a very short period of time, simulating volume at 10x the peak and measuring our expected throughput. We also had test Kafka consumers to verify that we had indeed published to the topic successfully. The test classes would normally have taken us around 5-10 minutes to code in Java; in NiFi, we can whip them up in a minute or two.
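A minimal sketch of that kind of load-test harness, assuming a pluggable `publish` function standing in for the MQ or Kafka producer (names and payloads here are invented for illustration):

```python
# Hypothetical load-test harness in the spirit of the simple NiFi test flows:
# pump pre-recorded events at a publish function and report the throughput
# achieved, so the result can be compared against the expected rate.
import time

def pump_events(events, publish):
    """Publish each pre-recorded event; return (count, events/second)."""
    start = time.perf_counter()
    count = 0
    for event in events:
        publish(event)
        count += 1
    elapsed = time.perf_counter() - start
    return count, count / elapsed if elapsed > 0 else float("inf")

sent = []  # stand-in sink; a real test would publish to MQ or Kafka
count, rate = pump_events((f"event-{i}" for i in range(1000)), sent.append)
```

The matching test consumer from the talk would then read the topic back and verify the same `count` arrived.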
  • #26: As with any GUI, we had to implement some sort of access control. We configured LDAP authentication against Active Directory for user login. We used SSL certificates for the initial login and for setting up the secure cluster, and then disabled certificate login. We created read-only, read-write, and admin NiFi groups, assigning different policies to each; this was fairly straightforward. In PROD, developers have read-only access, and support folks have write and admin access.
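The group-to-policy mapping is simple enough to sketch. The group names and action names below are hypothetical; the real setup used NiFi's policy model backed by LDAP groups in Active Directory.

```python
# Sketch of the three-tier policy model: each group grants a widening set
# of actions, and a permission check is just set membership.
POLICIES = {
    "nifi-readonly": {"view"},
    "nifi-readwrite": {"view", "modify"},
    "nifi-admin": {"view", "modify", "administer"},
}

def can(group, action):
    """Check whether a group's policy permits an action."""
    return action in POLICIES.get(group, set())

# In PROD: developers get read-only, support gets write/admin.
dev_can_modify = can("nifi-readonly", "modify")
support_can_admin = can("nifi-admin", "administer")
```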
  • #27: Traditionally with our Java applications, we deploy the same JAR file in each environment but read from different config files. Replicating this in NiFi was not straightforward because not all configs could be externalized into variables. Often we had to manually alter certain configuration values after importing a flow into a different environment. (click) To reduce inadvertent changes to existing processors, we decided that for brand-new or completely refactored flows, we would replace the entire flow XML (export from the lower environment and promote it to the next one). For minor changes, we would just manually re-apply them in the subsequent environment. (click) Another concern when working across environments is that the NiFi canvas looks the same (grey background) in every environment. We used a custom JavaScript plugin for Chrome to change the canvas background color per environment: green for DEV, yellow for UAT, and red for PROD. That way, we could be more careful when working in PROD.
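The manual promotion step (rewriting the values that couldn't be externalized into NiFi variables) amounts to a per-environment substitution table. A hedged sketch, with made-up hostnames:

```python
# Sketch of promoting an exported flow definition between environments:
# rewrite the environment-specific values that NiFi variables couldn't
# capture. Hostnames below are invented for the example.
ENV_VALUES = {
    "dev":  {"kafka": "kafka-dev:9092"},
    "uat":  {"kafka": "kafka-uat:9092"},
    "prod": {"kafka": "kafka-prod:9092"},
}

def promote(flow_text, source_env, target_env):
    """Rewrite source-environment values to their target-environment equivalents."""
    src, dst = ENV_VALUES[source_env], ENV_VALUES[target_env]
    for key in src:
        flow_text = flow_text.replace(src[key], dst[key])
    return flow_text

flow = "<property>kafka-dev:9092</property>"
promoted = promote(flow, "dev", "prod")
```

Scripting the substitution this way is safer than hand-editing each processor after import, which is where the inadvertent changes crept in.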
  • #28: A few months ago, in March, we upgraded from NiFi 1.3.0 to 1.5.0, after being on 1.3.0 for a good 8 months. There is no magic button for an in-place upgrade. What we did was set up a parallel NiFi cluster on different ports (with a separate ZooKeeper cluster). (click) We stopped all the processors that ingested new data, let any in-flight messages finish processing (click), and then shut down the old NiFi instance. We then started the processors back up on the new 1.5.0 cluster. (click) There were two reasons we upgraded: to resolve a PROD issue where NiFi couldn't start back up, and to use newer versions of the Kafka producer and consumer. Once in PROD, our data center lost power and all our servers shut down unexpectedly. When the servers came back up, we couldn't restart the NiFi instance due to a "No enum constant CONTENTMISSING" error (NIFI-4093), which was fixed in NiFi 1.4: the JettyServer couldn't start because a bug used the wrong enum to determine how to process an update to the FlowFile repository. At that time, the only way to get NiFi to start back up was to clear out the FlowFile repository, which meant all in-flight messages were lost. The other reason was to keep up to date with the Kafka producer/consumer versions. We migrated to Kafka 0.11 early in the year, but NiFi 1.3.0 only shipped Kafka producers and consumers up to 0.10, so before upgrading we downloaded the newer Kafka producer/consumer NAR files from NiFi 1.5.0 and used them in 1.3.0. Ever since upgrading to NiFi 1.5.0, our cluster has been much more stable, and we haven't faced any cluster issues in the last 3 months. Darryl will now talk about some of the good and bad of the Spark Streaming component.
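The drain-and-cutover upgrade procedure above can be sketched as a small state transition. This is a hypothetical illustration of the ordering of the steps, not a real cluster API:

```python
# Sketch of the parallel-cluster upgrade: stop ingestion on the old cluster,
# drain in-flight messages, shut it down, then enable the new cluster that
# has been running alongside on different ports.
def cutover(old_cluster, new_cluster):
    """Drain the old cluster, then move processing to the new one."""
    old_cluster["ingesting"] = False          # stop the source processors
    while old_cluster["in_flight"] > 0:
        old_cluster["in_flight"] -= 1         # in reality: wait for queues to drain
    old_cluster["running"] = False            # shut down the old instance
    new_cluster["running"] = True             # bring up the 1.5.0 cluster
    new_cluster["ingesting"] = True           # start its processors
    return new_cluster

old = {"version": "1.3.0", "running": True, "ingesting": True, "in_flight": 42}
new = {"version": "1.5.0", "running": False, "ingesting": False, "in_flight": 0}
active = cutover(old, new)
```

The key property of the ordering is that ingestion stops and queues empty before the old instance goes down, so no in-flight message is lost during the switch, and the old cluster stays available as a rollback target until the new one is verified.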
  • #29: Darryl
  • #30: Darryl
  • #31: Darryl
  • #32: Darryl
  • #33: Darryl
  • #34: Darryl
  • #35: Darryl
  • #36: Darryl
  • #37: Darryl
  • #38: >> Darryl to speak first RBC is based in Toronto, Ontario, Canada, but we have offices around the world as well (New York, London, etc…) The Data & Analytics team is in Toronto and we are always looking to hire strong developers. Feel free to email me directly, connect over LinkedIn, or just visit jobs.rbc.com to explore available opportunities. Thanks