© Hortonworks Inc. 2011–2018. All rights reserved
Apache NiFi Crash Course
Andy LoPresto | @yolopey
Sr. Member of Technical Staff at Hortonworks, Apache NiFi PMC & Committer
11 October 2018 Dataworks Summit Singapore
Gauging Audience Familiarity with NiFi
“What’s a NeeFee?”
No experience with dataflow
No experience with NiFi
“I can pick this up pretty quickly”
Some experience with dataflow
Some experience with NiFi
“I refactored the Ambari integration endpoint to allow for mutual authentication TLS during my coffee break”
Forgotten more about NiFi than most of us will ever know
Agenda
• Introduction
• What is dataflow?
• What is NiFi?
• What’s next?
• All slides provided online, so no need to transcribe
What Is Dataflow?
• Moving some content from A to B
• Content could be any bytes
• Logs
• HTTP
• XML
• CSV
• Images
• Video
• Telemetry
Producers, a.k.a. Things: anything and everything
Internet!
Consumers
• User
• Storage
• System
• …More Things
Connecting Data Points Is Easy
• Simple enough to write a process
• Bash/Ruby/Python
• SQL proc
• etc.
Log files
SQL
Big Data
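A one-off script of the kind this slide has in mind might look like the following. It is only a sketch under assumptions: the `logs` table, the `load_log_lines` helper, and the in-memory SQLite database are illustrative stand-ins, not anything from the talk.

```python
import sqlite3


def load_log_lines(lines, conn):
    """Copy each non-empty log line from the source (A) into a SQL table (B)."""
    conn.execute("CREATE TABLE IF NOT EXISTS logs (line TEXT)")
    rows = [(line.rstrip("\n"),) for line in lines if line.strip()]
    conn.executemany("INSERT INTO logs (line) VALUES (?)", rows)
    conn.commit()
    return len(rows)


# An in-memory database stands in for the real destination.
conn = sqlite3.connect(":memory:")
sample = ["INFO started\n", "\n", "ERROR disk full\n"]
copied = load_log_lines(sample, conn)
```

The script is short, but it has no retry, buffering, security, or audit trail, which is where the following slides pick up.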
Big Data Is About Scale…
• …and this doesn’t scale
• Example use case:
• AOL Data Processing
• AWS -> HDFS
• 20 TB ingested/day
• Lev Brailovskiy, “Data Ingestion and Distribution with Apache NiFi”, Slide 27, 02/2017
• https://www.slideshare.net/LevBrailovskiy/data-ingestion-and-distribution-with-apache-nifi
Moving Data Effectively Is Hard
“Data Pipeline” https://xkcd.com/2054/
Dataflow Challenges in 3 Categories
Data
• Standards
• Formats
• Protocols
• Veracity
• Validity
• Schemas
• Partitioning/Bundling
Infrastructure
• “Exactly Once” Delivery
• Ensuring Security
• Overcoming Security
• Credential Management
• Network
People
• Compliance
• “That [person|team|group]”
• Consumers Change
• Requirements Change
• “Exactly Once” Delivery
Raise your hand if you want to maintain Python scripts for the rest of your life
Let’s Connect Lots of As to Bs to As to Cs to Bs to Δs to Cs to ϕs
What Is Apache NiFi?
NiFi Is Based on Flow-Based Programming (FBP)
FBP Term | NiFi Term | Description
Information Packet | FlowFile | Each object moving through the system.
Black Box | FlowFile Processor | Performs the work, doing some combination of data routing, transformation, or mediation between systems.
Bounded Buffer | Connection | The linkage between processors, acting as queues and allowing various processes to interact at differing rates.
Scheduler | Flow Controller | Maintains the knowledge of how processes are connected, and manages the threads and allocations thereof which all processes use.
Subnet | Process Group | A set of processes and their connections, which can receive and send data via ports. A process group allows creation of an entirely new component simply by composition of its components.
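The FBP terms above can be made concrete with a toy model: two "processors" running at their own pace, linked by a bounded "connection". This is a minimal sketch in plain Python, not NiFi code; `run_flow` and its parameters are hypothetical names, and NiFi's own scheduler applies backpressure differently in detail than a blocking queue does.

```python
import queue
import threading


def run_flow(items, capacity=2):
    """Run a producer and a consumer 'processor' linked by a bounded 'connection'."""
    connection = queue.Queue(maxsize=capacity)  # bounded buffer between processors
    results = []
    done = object()  # sentinel marking the end of the input

    def producer():
        for item in items:
            connection.put(item)  # blocks while the queue is full
        connection.put(done)

    def consumer():
        while True:
            item = connection.get()
            if item is done:
                break
            results.append(item.upper())  # the "transformation" step

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


output = run_flow(["a", "b", "c", "d"])
```

Because the queue is bounded, a slow consumer slows the producer instead of letting the buffer grow without limit.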
Key Features
Apache NiFi
• Guaranteed delivery
• Data buffering
• Backpressure
• Pressure release
• Prioritized queuing
• Flow specific QoS
• Latency vs. throughput
• Loss tolerance
• Data provenance
• Supports push and pull models
• Recovery/recording a rolling log of fine-grained history
• Visual command and control
• Flow templates
• Pluggable, multi-tenant security
• Designed for extension
• Clustering
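To illustrate what "prioritized queuing" means in general, here is a small sketch of a queue that releases the most urgent item first and keeps arrival order within a priority level. `PrioritizedConnection` is a hypothetical class for illustration; it does not reproduce NiFi's prioritizer implementations.

```python
import heapq
import itertools


class PrioritizedConnection:
    """Toy queue: lower priority number is dequeued first."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps FIFO order per priority

    def enqueue(self, priority, item):
        heapq.heappush(self._heap, (priority, next(self._counter), item))

    def dequeue(self):
        return heapq.heappop(self._heap)[2]


connection = PrioritizedConnection()
connection.enqueue(5, "bulk-log")
connection.enqueue(1, "alert")
connection.enqueue(5, "bulk-metric")
order = [connection.dequeue() for _ in range(3)]
```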
FlowFiles Are Like HTTP Data
HTTP Data
Header
HTTP/1.1 200 OK
Date: Sun, 10 Oct 2010 23:26:07 GMT
Server: Apache/2.2.8 (CentOS) OpenSSL/0.9.8g
Last-Modified: Sun, 26 Sep 2010 22:04:35 GMT
ETag: "45b6-834-49130cc1182c0"
Accept-Ranges: bytes
Content-Length: 13
Connection: close
Content-Type: text/html
Content
Hello world!
FlowFile
Header
Standard FlowFile Attributes
Key: 'entryDate' Value: 'Fri Jun 17 17:15:04 EDT 2016'
Key: 'lineageStartDate' Value: 'Fri Jun 17 17:15:04 EDT 2016'
Key: 'fileSize' Value: '23609'
FlowFile Attribute Map Content
Key: 'filename' Value: '15650246997242'
Key: 'path' Value: './'
Content
Binary Content *
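The header/content split can be modeled as a map of string attributes plus an opaque byte payload. The sketch below is an assumption-laden illustration: `FlowFileSketch`, `with_attribute`, and `file_size` are hypothetical names and do not mirror NiFi's Java `FlowFile` interface.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FlowFileSketch:
    """Toy FlowFile: attributes play the role of headers, content is raw bytes."""

    attributes: dict = field(default_factory=dict)
    content: bytes = b""

    @property
    def file_size(self):
        return len(self.content)

    def with_attribute(self, key, value):
        # Return a new object instead of mutating, so earlier versions stay intact.
        return FlowFileSketch({**self.attributes, key: value}, self.content)


ff = FlowFileSketch({"filename": "15650246997242", "path": "./"}, b"Hello world!")
tagged = ff.with_attribute("mime.type", "text/plain")
```

Processors that only route or tag data need to touch the attributes alone; the content bytes are carried along unchanged.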
User Interface
Less of this… … more of this
Deeper Ecosystem Integration: 274+ Processors, 57 Controller Services
Hash
Extract
Merge
Duplicate
Scan
GeoEnrich
Replace
Convert
Split
Translate
Route Content
Route Context
Route Text
Control Rate
Distribute Load
Generate Table Fetch
Jolt Transform JSON
Prioritized Delivery
Encrypt
Tail
Evaluate
Execute
All Apache project logos are trademarks of the ASF and the respective projects.
Fetch
HTTP
Syslog
Email
HTML
Image
HL7
FTP
UDP
XML
SFTP
AMQP
WebSocket
Parse Records Convert Records
Extension / Integration Points
NiFi Term | Description
FlowFile Processor | Push/Pull behavior. Custom UI
Reporting Task | Used to push data from NiFi to some external service (metrics, provenance, etc.)
Controller Service | Used to enable reusable components / shared services throughout the flow
REST API | Allows clients to connect to pull information, change behavior, etc.
Architecture
OS/Host
JVM
Flow Controller
Web Server
Processor 1 Extension N
FlowFile
Repository
Content
Repository
Provenance
Repository
Local Storage
Standalone
Cluster
NiFi Architecture – Repositories – Pass by Reference
Excerpt of demo flow… What’s happening inside the repositories…
BEFORE
FlowFile: F1 → C1
Content: C1
Provenance: P1 → F1
AFTER
FlowFile: F1 → C1, F2 → C1
Content: C1
Provenance: P1 → F1 – Create, P2 → F1 – Route, P3 → F2 – Clone (F1)
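The pass-by-reference behavior on this slide can be sketched with three dictionaries standing in for the three repositories. Everything here is a simplified assumption: the `route` and `clone` helpers and the tuple layout of provenance events are hypothetical and only mirror the F/C/P labels used on the slide.

```python
content_repo = {"C1": b"payload bytes"}   # content claim id -> bytes
flowfile_repo = {"F1": "C1"}              # FlowFile id -> content claim id
provenance = [("P1", "F1", "CREATE")]     # (event id, FlowFile id, event type)


def record(flowfile_id, event_type):
    provenance.append((f"P{len(provenance) + 1}", flowfile_id, event_type))


def route(flowfile_id):
    # Routing changes where the FlowFile goes, not its content.
    record(flowfile_id, "ROUTE")


def clone(source_id, new_id):
    # The clone points at the same content claim; no bytes are copied.
    flowfile_repo[new_id] = flowfile_repo[source_id]
    record(new_id, "CLONE")


route("F1")
clone("F1", "F2")
```

After both operations there are two FlowFiles and three provenance events, but still only one copy of the content.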
NiFi Architecture – Repositories – Copy on Write
Excerpt of demo flow… What’s happening inside the repositories…
BEFORE
FlowFile: F1 → C1
Content: C1
Provenance: P1 → F1 - CREATE
AFTER
FlowFile: F1 → C1, F1.1 → C2
Content: C1 (plaintext), C2 (encrypted)
Provenance: P1 → F1 - CREATE, P2 → F1.1 - MODIFY
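Copy on write can be sketched the same way: a content change writes a new claim and leaves the original bytes untouched. This is an illustrative model only; the `modify` helper is hypothetical, and the byte reversal below is a placeholder for the encryption step on the slide, not real encryption.

```python
content_repo = {"C1": b"plaintext"}       # content claim id -> bytes
flowfile_repo = {"F1": "C1"}              # FlowFile id -> content claim id
provenance = [("P1", "F1", "CREATE")]     # (event id, FlowFile id, event type)


def modify(source_id, new_id, new_claim, transform):
    original = content_repo[flowfile_repo[source_id]]
    content_repo[new_claim] = transform(original)  # new claim; C1 is not overwritten
    flowfile_repo[new_id] = new_claim
    provenance.append((f"P{len(provenance) + 1}", new_id, "MODIFY"))


# Placeholder transformation standing in for encryption (NOT real encryption).
modify("F1", "F1.1", "C2", lambda data: data[::-1])
```

Because C1 survives, the earlier state of the data remains available for the replay and recovery features listed under Data Provenance.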
Data Provenance
Constrained
High-latency
Localized context
Hybrid – cloud/on-premises
Low-latency
Global context
Origin – attribution
Replay – recovery
Evolution of topologies
Long retention
Types of Lineage
• Event
• Configuration
Record Parsing
• Previously, data had to be divided into individual flowfiles to perform work
• CSV output with 50k lines would need to be split, operated on, re-merged
• 1 + 50k + 50k + 1 flowfiles ≈ 100k flowfiles
Record Parsing
• Now flowfile content can contain many “record” elements
• Read and write with *Reader and *Writer Controller Services
• Perform lookups, routing, conversion, SQL queries, validation, and more…
• 1 + 1 flowfiles = 2 flowfiles
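The difference between the two approaches can be sketched in a few lines. The code below uses Python's `csv` module as a stand-in for a reader/writer pair and is not NiFi's record API; `flowfiles_with_split` and `uppercase_names` are hypothetical helpers, and the `name` column is an assumed example field.

```python
import csv
import io


def flowfiles_with_split(line_count):
    """FlowFiles touched by split, per-line work, and merge: 1 + n + n + 1."""
    return 1 + line_count + line_count + 1


def uppercase_names(flowfile_content):
    """Transform every record inside one piece of content, one in and one out."""
    reader = csv.DictReader(io.StringIO(flowfile_content))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for record in reader:
        record["name"] = record["name"].upper()
        writer.writerow(record)
    return out.getvalue()


split_total = flowfiles_with_split(50_000)
result = uppercase_names("id,name\n1,alice\n2,bob\n")
```

With the record-oriented approach the per-line work still happens, but inside a single FlowFile, so the framework tracks two FlowFiles instead of roughly 100k.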
Encrypted Provenance Repository
• Every provenance event record is encrypted with AES-GCM before being persisted to disk
• Decrypted on deserialization for retrieval/query
• Random access via offset seek
• Handles key migration & rotation
What’s Next?
New Announcements
• NiFi 1.8.0 - … Oct 2018 (170+ Jiras)
• Jetty, DB improvements
• Auto load-balancing queues
• TLS Toolkit w/ external CA
• Record processor improvements
• MiNiFi C++ 0.5.0 - 6 June 2018
• MiNiFi Java 0.5.0 - 7 July 2018
• NiFi Registry 0.3.0 - 25 Sept 2018
Introducing Apache NiFi Registry
NiFi Registry for Dataflows
• Previously, flows were exported via XML templates
• Didn’t contain sensitive values
• Couldn’t be updated in-place
• No tracking system
• NiFi Registry brings asset management as a first-class citizen to NiFi
• Flows can be versioned
• Flows can be promoted between environments
Introducing Apache NiFi Registry 0.3.0
Community Health
Learn More and Join Us
Apache NiFi site
https://nifi.apache.org
Subproject MiNiFi site
https://nifi.apache.org/minifi/
Subscribe to and collaborate at
dev@nifi.apache.org
users@nifi.apache.org
Submit Ideas or Issues
https://issues.apache.org/jira/browse/NIFI
Follow us on Twitter
@apachenifi
More NiFi Today…
Title | Time | Room
The First Mile – Edge and IoT Data Collection with Apache NiFi and MiNiFi | 1400 - 1440 | Orchard
Dataflow Management From Edge to Core with Apache NiFi | 1450 - 1330 | Orchard
https://hortonworks.com/tutorial/analyze-transit-patterns-with-apache-nifi/
Thank You
alopresto@hortonworks.com | alopresto@apache.org | @yolopey
github.com/alopresto/slides
