Apache NiFi 101: Introduction
and Best Practices
Timothy Spann
Principal Developer Advocate
AIDevWorldApacheNiFi101
Tim Spann
Twitter: @PaasDev // Blog: datainmotion.dev
Principal Developer Advocate.
Princeton Future of Data Meetup.
ex-Pivotal, ex-Hortonworks, ex-StreamNative, ex-PwC, ex-HPE
https://medium.com/@tspann
https://github.com/tspannhw
© 2021 Cloudera, Inc. All rights reserved.
Future of Data - New York + Princeton + Virtual
@PaasDev
https://www.meetup.com/futureofdata-princeton/
https://www.meetup.com/futureofdata-newyork/
From Big Data to AI to Streaming to Containers to
Cloud to Analytics to Cloud Storage to Fast Data to
Machine Learning to Microservices to ...
FLaNK Stack Weekly
This week in Apache NiFi, Apache Flink, Apache
Kafka, Apache Spark, Apache Iceberg, Python, Java,
AI, ML, LLM and Open Source friends.
https://bit.ly/32dAJft
© 2019 Cloudera, Inc. All rights reserved.
https://www.datainmotion.dev/2020/10/top-25-use-cases-of-cloudera-flow.html
https://www.datainmotion.dev/2020/12/basic-understanding-of-cloudera-flow.html
https://www.datainmotion.dev/2020/06/no-more-spaghetti-flows.html
https://github.com/tspannhw/EverythingApacheNiFi
Apache NiFi Resources
● NiFi Cluster Architecture
● Content Repository
● Provenance Repository
● FlowFile Repository
● FlowFile, Attributes, Process Groups, Connections, Flow
Controllers
● Controller Services
● Common Attributes (uuid, filename, path, file size, ...)
● Expression Language
● Relationships
● Bulletins
● Input Port
● Output Port
● Empty Queues
● Setting Warning Levels
● Funnels
● RecordPath
● Using Record Processors (Readers/Writers)
● NiFi REST API
● Handling Errors
● Parameter Context / Parameters
● Summary / Cluster / Bulletins
● Reporting Tasks
● Back pressure
● Prioritized Queues
● Load Balancing Strategies
● Prioritization
● Using Search
● Using Documentation
● Site-to-Site Communication / Remote Process Groups
● Extensions
● Scheduling
● Tailing Files
● Reading sFTP/FTP Files
● Wait and Notify
● RetryFlowFile Pattern
● NiFi Calcite SQL
● Using Jolt
● Using JsonPath
● Making REST Calls
● Receiving REST Calls
● LookupRecord
● Working with Caches
● Restarting Flows
● Pass by Reference
● Using Regular Expressions
Basic Understanding of Cloudera Flow Management - Apache NiFi
Do Not:
● Do not put 1,000 flows on one workspace.
● If your flow has hundreds of steps, this is a Flow Smell. Investigate
why.
● Do not use ExecuteProcess, ExecuteScript, or a lot of Groovy scripts
as a default; look for existing processors first.
● Do not use random custom processors you find that have no
documentation or are unknown.
● Do not forget to upgrade. If you are running anything before Apache
NiFi 1.10, upgrade now!
● Do not run on the default 512 MB of RAM.
● Do not run one node and think you have a highly available cluster.
● Do not split a file with millions of records into individual records in one
shot without checking available space/memory and back pressure.
● Use Split processors only as an absolute last resort. Many processors
are designed to work on FlowFiles that contain many records or many
lines of text. Keeping the FlowFiles together instead of splitting them
apart can often improve performance by 1-2 orders of magnitude.
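The last point can be illustrated outside of NiFi. This is a minimal Python sketch, not NiFi code: it filters a multi-record CSV payload in one pass and emits a single output payload instead of one tiny payload per record. The sample data, the `temp` column, and the `filter_records` helper are hypothetical; inside NiFi, a record-oriented processor with a Record Reader and Writer would play this role.

```python
import csv
import io


def filter_records(payload: str, min_temp: float) -> str:
    """Filter the rows of a CSV payload in one pass and return one CSV payload."""
    reader = csv.DictReader(io.StringIO(payload))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Every record is handled inside the same payload; nothing is split out.
        if float(row["temp"]) >= min_temp:
            writer.writerow(row)
    return out.getvalue()


# Hypothetical sample payload with three records.
sample = "sensor,temp\na,20.5\nb,31.0\nc,35.2\n"
filtered = filter_records(sample, 30.0)
print(filtered)
```

One input payload produces one output payload, so the number of objects moving through the pipeline stays constant no matter how many records the payload holds.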
Do:
● Reduce, Reuse, Recycle. Use Parameters to reuse common modules.
● Put flows and reusable chunks (write to Slack, Database, Kafka) into separate
Process Groups.
● Write custom processors if you need new or specialized features.
● Use Record Processors everywhere.
● Read the Docs!
● Use the NiFi Registry for version control.
● Use NiFi CLI and DevOps for migrations.
● Run a CDP NiFi Datahub or CFM managed cluster of 3 or more nodes.
● Walk through your flow and make sure you understand every step and it’s easy
to read and follow. Is every processor used? Are there dead ends?
● Do run ZooKeeper on different nodes from Apache NiFi.
● For cloud-hosted Apache NiFi, go with the "high CPU" instances, such as 8
cores, 7 GB RAM.
● Avoid deploying the same 'templatized' flow many, many times with different
params in the same instance.
● Using routing based on content and attributes to let one flow handle multiple
nearly identical cases is better than deploying the same flow many times with
tweaks to parameters in the same cluster.
● Use the correct driver for your database. There are usually several different
JDBC drivers.
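The routing advice above can be sketched in plain Python. This is an illustration of the idea only, assuming FlowFile attributes are modeled as a dictionary. The `source.system` attribute and the `orders_json` and `csv` route names are hypothetical; `mime.type` and the `unmatched` route mirror names commonly seen with NiFi's RouteOnAttribute processor.

```python
def route(attributes: dict) -> str:
    """Return a route name for a FlowFile-like attribute map."""
    source = attributes.get("source.system", "")
    mime = attributes.get("mime.type", "")
    if mime == "application/json" and source == "orders":
        return "orders_json"
    if mime == "text/csv":
        return "csv"
    # Anything that matches no rule falls through to a catch-all route.
    return "unmatched"


# Hypothetical attribute maps standing in for three FlowFiles.
flowfiles = [
    {"filename": "o1.json", "mime.type": "application/json", "source.system": "orders"},
    {"filename": "t.csv", "mime.type": "text/csv", "source.system": "inventory"},
    {"filename": "x.bin", "mime.type": "application/octet-stream"},
]
routes = [route(ff) for ff in flowfiles]
print(routes)  # ['orders_json', 'csv', 'unmatched']
```

A single routing step like this lets one flow serve several near-identical sources, instead of keeping one copy of the flow per source.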
© 2023 Cloudera, Inc. All rights reserved.
CLOUDERA FLOW MANAGEMENT - POWERED BY APACHE NiFi
Ingest and manage data from edge-to-cloud using a no-code interface
● #1 data ingestion/movement engine
● Strong community
● Product maturity over 11 years
● Deploy on-premises or in the cloud
● 400+ pre-built processors
● Built-in data provenance
● Guaranteed delivery
● Throttling and Back pressure
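As a rough illustration of back pressure, the sketch below models a connection as a bounded queue with an object-count threshold. The `Connection` class and its method names are hypothetical and simplified: in NiFi, back pressure stops the upstream component from being scheduled, whereas this toy version simply declines new items.

```python
from collections import deque


class Connection:
    """A toy queue that signals back pressure at an object-count threshold."""

    def __init__(self, object_threshold: int):
        self.object_threshold = object_threshold
        self.queue = deque()

    def back_pressure_applied(self) -> bool:
        return len(self.queue) >= self.object_threshold

    def offer(self, item) -> bool:
        """Enqueue unless back pressure is applied; report whether it was accepted."""
        if self.back_pressure_applied():
            return False
        self.queue.append(item)
        return True

    def poll(self):
        """Remove and return the oldest item, or None if the queue is empty."""
        return self.queue.popleft() if self.queue else None


conn = Connection(object_threshold=3)
accepted = [conn.offer(i) for i in range(5)]
print(accepted)  # [True, True, True, False, False]

conn.poll()  # the downstream consumer drains one item
reaccepted = conn.offer(99)
print(reaccepted)  # True
```

The producer is held back only while the queue is full, and it resumes as soon as the consumer makes room.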
CLOUD
Development & Runtime of DataFlow Functions
Step 1. Develop functions on a local workstation or in CDP Public Cloud
using the no-code UI designer
Step 2. Run functions on serverless compute services in AWS, Azure & GCP
AWS Lambda, Azure Functions, Google Cloud Functions
Flow Catalog
• Central repository for flow
definitions
• Import existing NiFi flows
• Manage flow definitions
• Initiate flow deployments
ReadyFlows
• Cloudera provided flow
definitions
• Cover most common data flow
use cases
• Can be deployed and adjusted
as needed
• Made available through docs
during Tech Preview
Deployment Wizard
• Turns flow definitions into flow
deployments
• Guides users through providing
required configuration
• Pick from pre-defined NiFi
node sizes
• Define KPIs for the deployment
[Figure panels: Start Deployment Wizard; Provide Parameters; Configure Sizing & Scaling; Define KPIs]
Key Performance Indicators
• Visibility into flow deployments
• Track high level flow
performance
• Track in-depth NiFi component
metrics
• Defined in Deployment Wizard
• Monitoring & Alerts in
Deployment Details
[Figure panels: KPI Definition in Deployment Wizard; KPI Monitoring]
Dashboard
• Central Monitoring View
• Monitors flow deployments
across CDP environments
• Monitors flow deployment
health & performance
• Drill into flow deployment to
monitor system metrics and
deployment events
DATA FLOW DESIGN FOR EVERYONE
• Cloud-native data flow
development
• Developers get their own
sandbox
• Start developing flows without
installing NiFi
• Redesigned visual canvas
• Optimized interaction patterns
• Integration into CDF-PC Catalog
for versioning
NEW
Records
New ExcelRecord Reader
AmazonGlueSchemaRegistry
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12316020&version=12353320
Processors New in 2023
GenerateRecord
GetAsanaObject
PutSalesforceObject
QuerySalesforceObject
PutIoTDBRecord
QueryIoTDBRecord
ListGoogleDrive
FetchGoogleDrive
PutGoogleDrive
PutBoxFile
ListBoxFile
FetchBoxFile
PutDropbox
DecryptContent
DecryptContentCompatibility
Processors New in 2023
ExtractRecordSchema
RemoveRecordField
VerifyContentMAC
TriggerHiveMetaStoreEvent
“count” function added to RecordPath
AWS ML Service Processors
https://github.com/tspannhw/FLaNK-AWSML
AWS Translate
2.0!
Thanks to Pierre!
https://medium.com/cloudera-inc/getting-ready-for-apache-nifi-2-0-5a5e6a67f450
NiFi 2.0 Coming
● Python Integration
● Parameters
● JDK 17, maybe JDK 21+
● JSON Flow Serialization
● Rules Engine for Development Assistance
● Run Process Group as Stateless
● flow.json.gz
https://cwiki.apache.org/confluence/display/NIFI/NiFi+2.0+Release+Goals
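Because the flow is serialized as gzipped JSON, it can be inspected with ordinary tooling. The sketch below is a hedged example, not an official schema: it assumes a top-level `rootGroup` object with nested `processors` and `processGroups` lists, so check those key names against a real `flow.json.gz` before relying on them. To stay self-contained, it first writes a tiny hypothetical flow to a temporary file and then reads it back.

```python
import gzip
import json
import os
import tempfile


def list_processors(group: dict, path: str = "") -> list:
    """Recursively collect (group path, processor name, processor type) tuples."""
    here = f"{path}/{group.get('name', '')}"
    found = [(here, p.get("name"), p.get("type")) for p in group.get("processors", [])]
    for child in group.get("processGroups", []):
        found.extend(list_processors(child, here))
    return found


# A tiny hypothetical flow definition; the key names are assumptions.
sample_flow = {
    "rootGroup": {
        "name": "NiFi Flow",
        "processors": [
            {"name": "Tail log", "type": "org.apache.nifi.processors.standard.TailFile"}
        ],
        "processGroups": [
            {
                "name": "Kafka",
                "processors": [{"name": "Publish", "type": "example.PublishProcessor"}],
                "processGroups": [],
            }
        ],
    }
}

# Write the sample as gzipped JSON, then read it back the same way.
tmp_path = os.path.join(tempfile.mkdtemp(), "flow.json.gz")
with gzip.open(tmp_path, "wt", encoding="utf-8") as fh:
    json.dump(sample_flow, fh)

with gzip.open(tmp_path, "rt", encoding="utf-8") as fh:
    flow = json.load(fh)

processors = list_processors(flow["rootGroup"])
for row in processors:
    print(row)
```

The same recursive walk could feed a simple audit, for example flagging processor types that are scheduled for removal.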
Deprecating for Removal
Deprecate Lua and Ruby Script Engines
Deprecate ECMAScript Script Engine
Deprecate the Ambari Reporting Task
Deprecate Kafka 1.x components and 2.0 components
XML Templates
Variables
See:
https://cwiki.apache.org/confluence/display/NIFI/Deprecated+Components+and+Features
Start Using
ExecuteStateless -> run your stateless flows right in a regular NiFi cluster
Parameters
JSON Flow Serialization
Records everywhere
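Parameters are referenced in property values with the `#{name}` syntax, which lets the same flow run against different environments. The sketch below imitates that substitution in plain Python. It is a simplified illustration and ignores escaping and sensitive parameters; the `resolve` helper, the parameter names, and the values are all hypothetical.

```python
import re

# Matches a parameter reference such as #{kafka.brokers}.
PARAM_REF = re.compile(r"#\{([^}]+)\}")


def resolve(value: str, parameters: dict) -> str:
    """Replace each #{name} reference with its value from the parameter map."""

    def lookup(match):
        name = match.group(1)
        if name not in parameters:
            raise KeyError(f"undefined parameter: {name}")
        return parameters[name]

    return PARAM_REF.sub(lookup, value)


# Two hypothetical parameter contexts for the same flow.
dev = {"kafka.brokers": "localhost:9092", "topic": "events-dev"}
prod = {"kafka.brokers": "broker1:9092,broker2:9092", "topic": "events"}

prop = "#{kafka.brokers}/#{topic}"
print(resolve(prop, dev))   # localhost:9092/events-dev
print(resolve(prop, prod))  # broker1:9092,broker2:9092/events
```

Swapping the parameter map changes the environment without touching the flow definition, which is the reuse this slide recommends.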
Python as First Class (NIFI-11241)
Graphical UI with custom Python based extensions
NEW in NiFi 2.0
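A Python-based extension might look roughly like the sketch below. It assumes the `nifiapi` interface described in the NiFi Python developer guide: a `FlowFileTransform` base class, a `FlowFileTransformResult` result type, and `getContentsAsBytes()` on the incoming FlowFile. The `UppercaseContent` processor itself, the fallback stand-in classes, and `FakeFlowFile` are hypothetical and exist only so the sketch can be run outside NiFi.

```python
try:
    # Available when the code runs inside NiFi's Python extension environment.
    from nifiapi.flowfiletransform import FlowFileTransform, FlowFileTransformResult
except ImportError:
    # Hypothetical stand-ins so the sketch can be exercised outside NiFi.
    class FlowFileTransform:
        pass

    class FlowFileTransformResult:
        def __init__(self, relationship, attributes=None, contents=None):
            self.relationship = relationship
            self.attributes = attributes
            self.contents = contents


class UppercaseContent(FlowFileTransform):
    class Java:
        implements = ["org.apache.nifi.python.processor.FlowFileTransform"]

    class ProcessorDetails:
        version = "0.0.1-SNAPSHOT"
        description = "Upper-cases the text content of each FlowFile."

    def __init__(self, **kwargs):
        pass

    def transform(self, context, flowfile):
        # Read the whole content, transform it, and route the result to "success".
        text = flowfile.getContentsAsBytes().decode("utf-8")
        return FlowFileTransformResult(
            relationship="success",
            contents=text.upper(),
            attributes={"uppercased": "true"},
        )


class FakeFlowFile:
    """Hypothetical test double exposing the one method the sketch uses."""

    def __init__(self, data: bytes):
        self._data = data

    def getContentsAsBytes(self):
        return self._data


result = UppercaseContent().transform(None, FakeFlowFile(b"hello nifi"))
print(result.relationship, result.contents)
```

Once a file like this is placed in NiFi's Python extensions directory, the processor should appear on the canvas next to the built-in ones. Verify the details against the developer guide for the NiFi version in use.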
Apache NiFi in a few numbers
A very active project with a dynamic community (comparison with ACEU 2019)
● 2,800+ members on the Slack channel (535+ four years ago)
● 475+ contributors on GitHub across the repositories (260+ four years ago)
● 65 committers in the Apache NiFi community (45 four years ago)
● Apache NiFi 1.23.2 is the latest release, NiFi 2.0 coming soon (NiFi 1.10 four years ago)
● 14M+ docker pulls of the Apache NiFi image (1M+ four years ago)
NiFi Deploy Options from Open Source to Managed
● Cloudera Edge Flow Manager (Command & Control of MiNiFi Agents)
● MiNiFi C++ (small footprint)
● MiNiFi Java (headless version of NiFi)
● NiFi Registry
● Cloudera NiFi for Kafka Connect
● NiFi in Cloudera DataFlow Functions
● Cloudera DataFlow
● Stateless NiFi
NiFi 2.0 is coming… https://medium.com/cloudera-inc/getting-ready-for-apache-nifi-2-0-5a5e6a67f450
- First-class citizen Python API
- Rules Engine
- NiFi Stateless at Process Group level
- Java 21 (virtual threads, perf improvements, etc)
https://medium.com/@george.vetticaden/accelerating-ai-data-pipelines-building-an-evernote-chatbot-with-apache-nifi-2-0-and-generative-ai-9d977466ff4c
Closing the gap between data engineers and data scientists…
- Export documentation (Sharepoint, OCR) to build the knowledge base powering your chatbot
- Scrape the internet (Sitemap) to build the knowledge base powering your chatbot
- Real-time streaming ingest of Slack to build the knowledge base powering your chatbot
WALKTHRU
THANK YOU
