MAKING BIG DATA COME ALIVE
Integrating Apache Spark And NiFi
For Data Lakes
Ron Bodkin Founder & President
Scott Reisdorf R&D Architect
Agenda
• Requirements
• Design
• Demo
Goals for a Data Lake
• A central repository with trusted, consistent data
• Reduce costs by offloading analytical systems and archiving cold data
• Derive value quickly with easier discovery and prototyping
• A laboratory for experimenting with new technologies and data
What’s Needed For A Hadoop Data Lake?
• Automation of pipelines with metadata and performance tracking
• Governance with clear distinction of roles and responsibilities
• SLA tracking with alerts on failures or violations
• Interactive data discovery and experimentation
Example Ingestion Project
• 4000+ unique flat files and RDBMS tables, plus a few streaming data feeds
• Mix of incremental and snapshot data
• Ingest into Hadoop (minimally HDFS and Hive tables)
• Cleansing/encryption and data validation
• Metadata capture
Focus shifts over time from data ingestion to transformation, then to analytics
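The difference between snapshot and incremental feeds can be sketched in a few lines. This is a minimal illustration in plain Python, not code from the talk: the `Feed` class, the `apply_load` function, and the column names are hypothetical. It assumes a snapshot batch replaces the target outright, while an incremental batch is merged by key and the row with the newer watermark value wins.

```python
from dataclasses import dataclass
from typing import Dict, List

Row = Dict[str, object]


@dataclass
class Feed:
    """Hypothetical feed definition (names are assumptions for illustration)."""
    name: str
    mode: str            # "snapshot" or "incremental"
    key: str             # column that identifies a row
    watermark: str = ""  # column compared to decide which version is newer


def apply_load(target: List[Row], batch: List[Row], feed: Feed) -> List[Row]:
    """Return the new target contents after loading one batch."""
    if feed.mode == "snapshot":
        # A snapshot is a full copy of the source, so it replaces the target.
        return list(batch)
    if feed.mode == "incremental":
        # Merge by key; a batch row replaces the stored row when its
        # watermark is the same or newer.
        merged = {row[feed.key]: row for row in target}
        for row in batch:
            current = merged.get(row[feed.key])
            if current is None or row[feed.watermark] >= current[feed.watermark]:
                merged[row[feed.key]] = row
        return sorted(merged.values(), key=lambda r: r[feed.key])
    raise ValueError(f"unknown load mode: {feed.mode}")
```

In a Hadoop data lake the same decision would typically be made per feed from captured metadata, with the merge itself running in Hive or Spark rather than in memory.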
Design
Apache Spark Functions
• Cleanse
• Validate
• Profile
• Wrangle
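To make the cleanse, validate, and profile functions concrete, here is a minimal sketch of the per-row logic. It is plain Python standing in for a Spark job, and everything in it is an assumption for illustration: the function names, the `id` and `age` columns, and the specific rules are not taken from the talk.

```python
from typing import Dict, List, Optional, Tuple

Row = Dict[str, Optional[str]]


def cleanse(row: Row) -> Row:
    """Trim whitespace and turn empty strings into None."""
    cleaned: Row = {}
    for column, value in row.items():
        stripped = value.strip() if value is not None else None
        cleaned[column] = stripped or None
    return cleaned


def validate(row: Row) -> List[str]:
    """Return a list of rule violations (empty when the row is valid)."""
    errors = []
    if not row.get("id"):
        errors.append("id is required")
    age = row.get("age")
    if age is not None and not age.isdigit():
        errors.append("age must contain digits only")
    return errors


def profile(rows: List[Row]) -> Dict[str, Dict[str, int]]:
    """Count nulls and distinct non-null values per column."""
    columns = sorted({column for row in rows for column in row})
    stats = {}
    for column in columns:
        values = [row.get(column) for row in rows]
        non_null = [v for v in values if v is not None]
        stats[column] = {
            "nulls": len(values) - len(non_null),
            "distinct": len(set(non_null)),
        }
    return stats


def run(raw: List[Row]) -> Tuple[List[Row], List[Row], Dict[str, Dict[str, int]]]:
    """Cleanse every row, split valid from invalid, and profile the valid rows."""
    valid, invalid = [], []
    for row in raw:
        clean = cleanse(row)
        (invalid if validate(clean) else valid).append(clean)
    return valid, invalid, profile(valid)
```

In Spark the same three steps would be expressed as transformations over a distributed dataset, so the work runs on the cluster instead of in the flow engine.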
Pipeline design with Apache NiFi
• Visual drag-and-drop
• Dozens of data connectors
• 150+ pre-built transforms
• Data lineage
• Batch and Streaming
• Extensible
© 2016 Think Big, a Teradata Company 7/10/2016
Role separation
Apache NiFi
• IT Designers design models in NiFi
• Register with framework
• Integrated development process
Think Big framework
• Users configure new feeds
• Based on common model
• Generated and executed in NiFi
[Diagram labels between the two: register, deploy]
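The register and deploy split can be sketched as a small template registry. This is a hypothetical illustration, not the Think Big framework's API: `FlowTemplate`, `Registry`, and the step strings are invented names. It assumes a designer registers a reusable model once, and each user-configured feed is then generated from that model by filling in parameters.

```python
from dataclasses import dataclass
from string import Template
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class FlowTemplate:
    """A reusable model registered by a designer (hypothetical structure)."""
    name: str
    steps: Tuple[str, ...]     # step definitions with ${placeholders}
    required: Tuple[str, ...]  # parameters every feed must supply


class Registry:
    def __init__(self) -> None:
        self._templates: Dict[str, FlowTemplate] = {}

    def register(self, template: FlowTemplate) -> None:
        """Designer side: make a model available for feed creation."""
        self._templates[template.name] = template

    def deploy(self, template_name: str, feed: str, params: Dict[str, str]) -> List[str]:
        """User side: generate the concrete steps for one feed."""
        template = self._templates[template_name]
        missing = [p for p in template.required if p not in params]
        if missing:
            raise ValueError(f"missing parameters: {missing}")
        # ${feed} is always available; other placeholders come from params.
        return [Template(step).substitute(params, feed=feed) for step in template.steps]
```

For example, a template with steps `("get ${source_dir}", "put /lake/${feed}")` deployed for a feed named `orders` with `source_dir` set to `/in/orders` yields `["get /in/orders", "put /lake/orders"]`.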
Design Approach
• User features around org. roles
• Visual design
• Streaming and Batch
• Fully governed
• Integrated
• Best Practices
• Secure, modern architecture
• Will be open source (Apache license)
Ingest and Prepare
• UI-guided feed creation
• Data protection
• Data cleanse
• Data validation
• Data profiling
• Powered by Apache Spark
Data Ingest Model
• Metadata determines behavior of individual components
• Adds many Hadoop-specific higher-level NiFi processors
[Diagram: sources]
• Extract Table (JDBC)
• Get File(s) (Filesystem)
• Message (JMS/Kafka)
• Other (HTTP/REST, etc.)
[Diagram: processing steps]
• Unpack and/or merge small files
• Put file (HDFS)
• Cleanse/Standardize (Spark)
• Data Profile (Spark)
• Validate (Spark)
• Index Text (Elasticsearch)
• Merge / Dedupe (Hive)
• Compress & Archive Originals (HDFS, S3)
[Other diagram labels: Metadata, Data policies]
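The idea that metadata determines the behavior of individual components can be sketched as a table of generic steps driven by a per-feed description. This is a minimal illustration under stated assumptions: the step names, the metadata layout, and `run_feed` are hypothetical, and the steps run in memory rather than as NiFi processors delegating to Spark or Hive.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]
Step = Callable[[List[Record], Dict[str, object]], List[Record]]


def standardize(records: List[Record], options: Dict[str, object]) -> List[Record]:
    """Upper-case the columns listed under the 'uppercase' option."""
    columns = options.get("uppercase", [])
    return [{k: (v.upper() if k in columns else v) for k, v in r.items()} for r in records]


def dedupe(records: List[Record], options: Dict[str, object]) -> List[Record]:
    """Keep the first record seen for each value of the 'key' option."""
    key = options["key"]
    seen, unique = set(), []
    for record in records:
        if record[key] not in seen:
            seen.add(record[key])
            unique.append(record)
    return unique


# Generic components; the feed metadata picks which ones run and how.
STEPS: Dict[str, Step] = {"standardize": standardize, "dedupe": dedupe}


def run_feed(records: List[Record], metadata: Dict[str, object]) -> List[Record]:
    """Apply the steps named in the feed metadata, in the order given."""
    for step in metadata["steps"]:
        records = STEPS[step["name"]](records, step.get("options", {}))
    return records
```

Because the components are generic, adding a new feed means writing new metadata rather than new code, which is what makes thousands of feeds manageable.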
Data self-service and “wrangle”
• Graphical SQL builder
• 100+ transform functions
• Machine learning
• Publish and schedule
• Powered by Apache Spark
Data Discovery
• Google-like searching
• Extensible metadata
• Data profile
• Data sampling
Operations
• Dashboard
• Health Monitoring
• Data Confidence
• SLA enforcement
• Alerts
• Performance reports
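SLA enforcement and alerts can be sketched as a check over feed runs. This is a hypothetical illustration, not the product's behavior: `FeedRun`, `check_sla`, and the alert wording are assumptions, and the only SLA modeled here is a maximum run duration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class FeedRun:
    """One execution of a feed (hypothetical record layout)."""
    feed: str
    started: datetime
    finished: Optional[datetime]  # None while the run is still in progress
    succeeded: bool


def check_sla(run: FeedRun, max_duration: timedelta, now: datetime) -> List[str]:
    """Return alert messages for a failure or an SLA violation."""
    alerts = []
    if run.finished is None:
        # Still running: alert only once the allowed duration has passed.
        if now - run.started > max_duration:
            alerts.append(f"{run.feed}: still running after {max_duration}")
        return alerts
    if not run.succeeded:
        alerts.append(f"{run.feed}: run failed")
    if run.finished - run.started > max_duration:
        alerts.append(f"{run.feed}: exceeded SLA of {max_duration}")
    return alerts
```

A scheduler would call a check like this periodically and forward any returned messages to the dashboard or an alerting channel.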
ElasticSearch – Full Text Indexing
• Powerful search capabilities for users against data (think Google-like searching)
• NiFi processor extracts source data from Hadoop table for indexing in ElasticSearch
• Incremental updates during ingest
[Diagram labels: Data Lake; select id,user,tweet from twitter_feed; extract JSON]
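The extract-to-JSON step can be sketched as turning selected rows into a request body for the Elasticsearch bulk API, which takes newline-delimited JSON: one action line followed by one document line per row, ending with a newline. This is a minimal sketch, not the NiFi processor from the talk: `to_bulk_ndjson` is a hypothetical helper, and it assumes each row carries an `id` column that is used as the document id.

```python
import json
from typing import Dict, Iterable


def to_bulk_ndjson(rows: Iterable[Dict[str, object]], index: str, id_field: str = "id") -> str:
    """Build a bulk-API body that indexes each row as one JSON document."""
    lines = []
    for row in rows:
        # Action line: "index" creates the document or replaces an existing
        # one with the same _id, which suits incremental updates.
        action = {"index": {"_index": index, "_id": str(row[id_field])}}
        lines.append(json.dumps(action))
        # Source line: the row itself, serialized as JSON.
        lines.append(json.dumps(row))
    if not lines:
        return ""
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"
```

Rows returned by a query such as the slide's `select id,user,tweet from twitter_feed` would be passed in as dictionaries, and re-sending a row with the same id after a later ingest replaces the earlier document.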
Demo



Editor's Notes

  • #13: Notice that we delegate processing to the Spark and Hadoop cluster for much of our work