1 © Hortonworks Inc. 2011 – 2018 All Rights Reserved
Apache NiFi Integration with Apache Spark
Timothy Spann, Solutions Engineer
Disclaimer
• This document may contain product features and technology directions that are under development, may be under development in the future, or may ultimately not be developed.
• Technical feasibility, market demand, user feedback, and the Apache Software Foundation community development process can all affect timing and final delivery.
• This document’s description of these features and technology directions does not represent a contractual commitment, promise, or obligation from Hortonworks to deliver these features in any generally available product.
• Product features and technology directions are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind.
• Since this document contains an outline of general product development plans, customers should not rely upon it when making a purchase decision.
Integration Options
• Apache Spark Integration via Kafka and Spark Streaming (1.6+)
• Apache Spark Integration via Kafka and Spark Structured Streaming (2.2+)
• Apache Spark Integration via Apache Livy
Apache Kafka and Apache NiFi Integration
NiFi and Kafka Are Complementary
NiFi
Provides a dataflow solution
• Centralized management, from edge to core
• Great traceability: event-level data provenance starting when data is born
• Interactive command and control – real-time operational visibility
• Dataflow management, including prioritization, back pressure, and edge intelligence
• Visual representation of global dataflow
Kafka
Provides a durable stream store
• Low latency
• Distributed data durability
• Decentralized management of producers & consumers
Integrated Provisioning and Security
Kafka 1.0 Support
To enhance data governance and lineage, users can
now manage access control policies using resource or
tag-based security in Ranger for Kafka 1.0 clusters.
Users can now install, configure, manage, upgrade,
monitor, and secure Kafka 1.0 clusters with Ambari.
New processors in NiFi and Streaming Analytics
Manager support Kafka 1.0 features including message
headers and transactions.
Apache NiFi and Kafka 1.0 – Use Case for Kafka Message Headers
Apache Spark – Apache Kafka – Apache NiFi Architecture
Architecture Example
[Diagram. Labels: Flow Management, Stream Analysis, Spark Processing; Acquire/Move, Routing & Filtering, Parse, Analyze, Model; Topic 1, Topic 2; JSON Data, AVRO Data; Join, Aggregate, Correlate, Pattern Matching, Windowing, Aggregations]
Stream Processing
[Diagram. Labels: Streaming Analytics Manager, Machine Learning, Distributed queue, Buffering, Process decoupling, Structured Streaming with SQL, Orchestration, Queueing, Simple Event Processing, Data Definition Between Environments, Schema Versioning]
Key Integration Points – NiFi & Kafka
[Diagram. Labels: MiNiFi (x3), NiFi, Kafka, Consumer 1, Consumer 2, Consumer N]
• Producer Processors (Main)
  • PublishKafka_0_11 (0.11 Kafka Client)
  • PublishKafka_1_0 (1.0 Kafka Client)
  • PublishKafkaRecord_0_11 (0.11 Kafka Client)
  • PublishKafkaRecord_1_0 (1.0 Kafka Client)
Key Integration Points – NiFi & Kafka
[Diagram. Labels: Producer 1, Producer 2, Producer N, Kafka, NiFi, Destination 1, Destination 2, Destination 3]
• Consumer Processors (Main)
  • ConsumeKafka_0_11 (0.11 Kafka Client)
  • ConsumeKafka_1_0 (1.0 Kafka Client)
  • ConsumeKafkaRecord_0_11 (0.11 Kafka Client)
  • ConsumeKafkaRecord_1_0 (1.0 Kafka Client)
Better Together
[Diagram. Labels: MiNiFi, NiFi, Kafka, Spark, Incoming Topic, Results Topic, PublishKafka, ConsumeKafka, Destinations]
• MiNiFi – Collection, filtering, and prioritization at the edge
• NiFi – Central data flow management, routing, enriching, and transformation
• Kafka – Central messaging bus for subscription by downstream consumers
• Spark – Streaming analytics focused on complex event processing
NiFi PublishKafkaRecord_1_0
[Diagram. Labels: Apache NiFi – Node 1 (PublishKafka), Apache NiFi – Node 2 (PublishKafka), Apache Kafka, Topic 1 – Partition 1, Topic 1 – Partition 2; legend: Concurrent Task]
• Each NiFi node runs an instance of PublishKafkaRecord_1_0
• Each instance has one or more concurrent tasks (threads)
• Each concurrent task is an independent producer that sends data round-robin to the partitions of a topic
• Records with schemas for performance
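The round-robin behaviour described in the bullets above can be sketched in plain Scala, without any Kafka client. This is only an illustration under stated assumptions: `RoundRobinSketch`, `TaskProducer`, and `assign` are hypothetical names, each concurrent task is modelled as a producer with its own counter, and the actual partition choice in a real cluster depends on the Kafka client version and partitioner configuration.

```scala
// Minimal sketch of round-robin partition assignment (plain Scala, no Kafka client).
// All names here are hypothetical and exist only for illustration.
object RoundRobinSketch {

  /** One concurrent task, modelled as an independent producer with its own counter. */
  final class TaskProducer(val name: String, numPartitions: Int) {
    require(numPartitions > 0, "a topic needs at least one partition")
    private var next = 0

    /** Returns the partition for the next record and advances the counter. */
    def nextPartition(): Int = {
      val partition = next
      next = (next + 1) % numPartitions
      partition
    }
  }

  /** Pairs each record with the partition the task would send it to, in send order. */
  def assign(task: TaskProducer, records: List[String]): List[(String, Int)] =
    records.map(record => (record, task.nextPartition()))

  def main(args: Array[String]): Unit = {
    // Two NiFi nodes, one concurrent task each, writing to a two-partition topic.
    val node1 = new TaskProducer("node1-task1", numPartitions = 2)
    val node2 = new TaskProducer("node2-task1", numPartitions = 2)
    assign(node1, List("a", "b", "c")).foreach(println)
    assign(node2, List("x", "y")).foreach(println)
  }
}
```

Because every task keeps its own counter, adding nodes or concurrent tasks adds independent producers; no coordination between them is assumed in this sketch.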
Apache Spark Streaming – Apache Kafka – Apache NiFi Architecture
Spark Streaming
• Spark Streaming is an extension of the core Spark API that supports scalable, high-throughput, fault-tolerant streaming applications
• Data can be ingested from various data sources such as Kafka, Flume, Twitter, ZeroMQ, or TCP sockets
• Data is processed using the now-familiar API: map, filter, reduce, join, and window
• Processed data can be written to databases, filesystems, or live dashboards
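To make the filter, reduce, and window operations above concrete, here is a rough sketch that uses ordinary Scala collections in place of Spark. It is not the Spark Streaming API: each inner `Seq` stands in for one micro-batch, and `WindowSketch`, `Reading`, `reduceBatch`, and `windowedTotals` are hypothetical names chosen for this illustration.

```scala
// Plain-collections analogy for filter / reduce / window over micro-batches.
// This does not use Spark; every name below is hypothetical.
object WindowSketch {
  type Reading = (String, Int) // (device, watts): an assumed record shape

  /** Per-batch step: drop non-positive readings, then sum watts per device. */
  def reduceBatch(batch: Seq[Reading]): Map[String, Int] =
    batch
      .filter { case (_, watts) => watts > 0 }
      .groupBy { case (device, _) => device }
      .map { case (device, readings) => device -> readings.map(_._2).sum }

  /** Combines the per-batch totals that fall inside one window. */
  private def merge(totals: Seq[Map[String, Int]]): Map[String, Int] =
    totals.flatten
      .groupBy { case (device, _) => device }
      .map { case (device, pairs) => device -> pairs.map(_._2).sum }

  /** Window of `windowLength` batches, sliding forward one batch at a time. */
  def windowedTotals(batches: Seq[Seq[Reading]], windowLength: Int): List[Map[String, Int]] = {
    require(windowLength > 0, "window length must be positive")
    batches.map(reduceBatch).sliding(windowLength).map(merge).toList
  }

  def main(args: Array[String]): Unit = {
    val batches = Seq(
      Seq("plugA" -> 5, "plugB" -> 3),
      Seq("plugA" -> 2, "plugB" -> 0),
      Seq("plugB" -> 4, "plugB" -> 1)
    )
    windowedTotals(batches, windowLength = 2).foreach(println)
  }
}
```

In Spark Streaming the same shape of computation would be expressed on a DStream, with the window length and slide interval given as durations rather than batch counts.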
Apache Spark Streaming Integration via Kafka
https://community.hortonworks.com/content/kbentry/173818/hdp-264-hdf-31-apache-spark-streaming-integration.html
Apache Spark Structured Streaming – Apache Kafka – Apache NiFi Architecture
Apache Spark Structured Streaming Integration via Kafka
https://community.hortonworks.com/articles/91379/spark-structured-streaming-with-nifi-and-kafka-usi.html
https://jaceklaskowski.gitbooks.io/spark-structured-streaming/spark-sql-streaming-KafkaSource.html
https://community.hortonworks.com/content/kbentry/174105/hdp-264-hdf-31-apache-spark-structured-streaming-i.html
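// Subscribe to the Kafka topic "smartPlug2" as a streaming DataFrame.
// Assumes an active SparkSession named `spark`, as in spark-shell.
val records = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "mykafkabroker:6667")
  .option("subscribe", "smartPlug2")
  .load()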
Apache NiFi – Apache Kafka – Apache Spark
Apache Spark – Apache Livy
Introducing Apache Livy
• Apache Livy is an open source REST interface for interacting with Apache Spark from anywhere
• Installed as a Spark2 Ambari service
[Diagram. Labels: Livy Client, HTTP, Livy Server, HTTP (RPC), Spark Interactive Session (SparkContext), Spark Batch Session (SparkContext)]
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.4/bk_spark-component-guide/content/ch_submit-spark-apps-livy.html
Livy Server as a Session Management Service
[Diagram. Labels: Livy Server, Interactive REST API, Batch REST API, Session, Remote Spark Driver, Remote Context, Standard Spark Batch Job, Spark Executor (x4)]
https://livy.incubator.apache.org/docs/latest/rest-api.html
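As a rough illustration of the interactive REST API referenced above, the sketch below builds the three requests a client would typically issue: create a session, submit a statement, and fetch its result. It only constructs request descriptions and sends nothing over the network. The endpoint paths and the `kind` and `code` fields follow the Livy REST API; `LivyRequestSketch`, `Request`, the helper names, and the base URL in `main` are hypothetical.

```scala
// Builds Livy REST request descriptions without performing any HTTP calls.
// The Request type and helper names are hypothetical, for illustration only.
object LivyRequestSketch {

  /** A request description: HTTP method, URL, and JSON body (empty for GET). */
  final case class Request(method: String, url: String, body: String)

  /** Quotes a string as a JSON string literal, escaping special characters. */
  def jsonString(s: String): String = {
    val escaped = s.map {
      case '"'          => "\\\""
      case '\\'         => "\\\\"
      case '\n'         => "\\n"
      case '\r'         => "\\r"
      case '\t'         => "\\t"
      case c if c < ' ' => "\\" + "u%04x".format(c.toInt)
      case c            => c.toString
    }.mkString
    "\"" + escaped + "\""
  }

  /** POST /sessions starts an interactive session of the given kind. */
  def createSession(baseUrl: String, kind: String = "spark"): Request =
    Request("POST", s"$baseUrl/sessions", s"""{"kind": ${jsonString(kind)}}""")

  /** POST /sessions/{id}/statements runs a code snippet in that session. */
  def submitStatement(baseUrl: String, sessionId: Int, code: String): Request =
    Request("POST", s"$baseUrl/sessions/$sessionId/statements", s"""{"code": ${jsonString(code)}}""")

  /** GET /sessions/{id}/statements/{statementId} polls for the statement output. */
  def statementResult(baseUrl: String, sessionId: Int, statementId: Int): Request =
    Request("GET", s"$baseUrl/sessions/$sessionId/statements/$statementId", "")

  def main(args: Array[String]): Unit = {
    val livy = "http://livy-host:8998" // assumed host; 8998 is Livy's default port
    println(createSession(livy))
    println(submitStatement(livy, 0, "sc.parallelize(1 to 10).sum()"))
    println(statementResult(livy, 0, 0))
  }
}
```

A real client would send these with an HTTP library, wait for the session to reach the idle state before submitting code, and poll the statement until its output is available.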
Apache Spark – Apache Livy – Apache NiFi Integration
Architecture Example
[Diagram. Labels: Flow Management, Analytics, Spark Processing; Routing & Filtering, Parse, Analyze; Session 1 (x2); SQL, Aggregate; JSON Data]
NiFi to Spark Processing
[Diagram. Labels: Streaming Analytics Manager, Machine Learning, REST API, Enterprise Tested, Secure, Structured Streaming with SQL, Orchestration, Queueing, Simple Event Processing, Data Definition Between Environments, Schema Versioning]
Key Integration Points – NiFi & Spark
[Diagram. Labels: MiNiFi (x3), NiFi, Livy, Spark, Spark 2, Spark N]
• Processor and Controller Service
  • ExecuteSparkInteractive – sets up the job and submits code through the Livy Session Service
  • LivySessionService – manages the Livy Spark connection pool
Better Together
[Diagram. Labels: MiNiFi, NiFi, ExecuteSparkInteractive, LivySessionService, Livy, Spark, Session, Batch]
• MiNiFi – Collection, filtering, and prioritization at the edge
• NiFi – Central data flow management, routing, enriching, and transformation
• Livy – Secure HTTPS connection to running Spark batch and session jobs, with cached RDD sharing and a live Spark context
• Spark – Streaming analytics focused on complex event processing
Apache Spark – Apache Livy – Apache NiFi Architecture
Apache Spark Integration via Apache Livy
https://community.hortonworks.com/articles/171787/hdf-31-executing-apache-spark-via-executesparkinte.html
https://community.hortonworks.com/articles/171893/hdf-31-executing-apache-spark-via-executesparkinte-1.html
Questions?
Hortonworks Community Connection:
Data Ingestion and Streaming
https://community.hortonworks.com/
Contact
https://community.hortonworks.com/users/9304/tspann.html
https://dzone.com/users/297029/bunkertor.html
https://www.meetup.com/futureofdata-princeton/
https://twitter.com/PaaSDev
https://community.hortonworks.com/articles/174105/hdp-264-hdf-31-apache-spark-structured-streaming-i.html
Hortonworks Community Connection
Read access for everyone, join to participate and be recognized
• Full Q&A Platform (like StackOverflow)
• Knowledge Base Articles
• Code Samples and Repositories
Community Engagement
Participate now at: community.hortonworks.com
• 4,000+ Registered Users
• 10,000+ Answers
• 15,000+ Technical Assets
One Website!
Register at dataworkssummit.com
#DWS18
• Berlin, Germany: April 16-19, 2018 | Estrel Hotel
• San Jose, California: June 17-21, 2018 | McEnery Convention Center

Running Apache NiFi with Apache Spark : Integration Options
