You	Can’t	Search	Without	Data
Bryan	Bende	– Staff	Software	Engineer	@Hortonworks
NYC	Solr/Lucene	Meetup	– December	7th 2017
2 ©	Hortonworks	Inc.	2011	– 2016.	All	Rights	Reserved
Agenda
• The Problem
• Apache NiFi Overview
• Integration between NiFi & Solr
• Recent & Future Work
• Demo Cool Stuff!
• Q&A
About	Me
• Staff Software Engineer @ Hortonworks
• Apache NiFi PMC & Committer
• Contributed Solr processors in March 2015
– https://issues.apache.org/jira/browse/NIFI-461
• bbende@hortonworks.com / Twitter @bbende / bryanbende.com
The	Problem
It starts out so simple…
Team 1 and Team 2:
"Hey! We have some important data to send you!"
"Cool! Your data is really important to us!"
This should be easy, right?...
But what about formats & protocols?
Team 1 and Team 2:
"We can publish Avro records to a Kafka topic, does that work?"
"Oh, well we have a REST service that accepts JSON…"
And what about security & authentication?
Team 1 and Team 2:
"Hmm what about security? We can authenticate via Kerberos"
"Sorry, we only support 2-Way TLS with certificates"
And what about all these devices at the edge?
Team 2: "We also need to grab data from all these devices, how are we going to do that?"
Wouldn’t it be nice if there was a tool that could help these teams?
Enter	Apache	NiFi…
Apache NiFi
• Created to address the challenges of global enterprise dataflow
• Key features:
– Visual	Command	and	Control
– Data	Lineage	(Provenance)
– Data	Prioritization
– Data	Buffering/Back-Pressure
– Control	Latency	vs.	Throughput
– Secure	Control	Plane	/	Data	Plane
– Scale	Out	Clustering
– Extensibility
NiFi Core Concepts
FBP Term | NiFi Term | Description
Information Packet | FlowFile | Each object moving through the system.
Black Box | FlowFile Processor | Performs the work, doing some combination of data routing, transformation, or mediation between systems.
Bounded Buffer | Connection | The linkage between processors, acting as queues and allowing various processes to interact at differing rates.
Scheduler | Flow Controller | Maintains the knowledge of how processes are connected, and manages the threads and allocations thereof which all processes use.
Subnet | Process Group | A set of processes and their connections, which can receive and send data via ports. A process group allows creation of entirely new components simply by composition of its components.
Visual	Command	&	Control
• Drag & drop processors to build a flow
• Start, stop, & configure components in real-time
• View errors & corresponding messages
• View statistics & health of the dataflow
• Create shareable templates of common flows
Provenance/Lineage
• Tracks data at each point as it flows through the system
• Records, indexes, and makes events available for display
• Handles fan-in/fan-out, i.e. merging and splitting data
• View attributes and content at given points in time
Prioritization
• Configure a prioritizer per connection
• Determine what is important for your data – time based, arrival order, importance of a data set
• Funnel many connections down to a single connection to prioritize across data sets
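A rough sketch of what a per-connection prioritizer does, written in Python rather than against NiFi's Java API. The PrioritizedConnection class and the "priority" attribute name are hypothetical and exist only for illustration; the sketch assumes a lower number means more important and uses arrival order as the tie-breaker.

```python
import heapq
import itertools


class PrioritizedConnection:
    """Toy queue that releases the most important item first (not NiFi code)."""

    def __init__(self):
        self._heap = []
        self._arrival = itertools.count()  # tie-breaker: earlier arrivals win

    def enqueue(self, flowfile):
        # Assumption: a lower "priority" attribute means more important.
        priority = int(flowfile.get("priority", 5))
        heapq.heappush(self._heap, (priority, next(self._arrival), flowfile))

    def dequeue(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


conn = PrioritizedConnection()
conn.enqueue({"name": "bulk-export.csv", "priority": "9"})
conn.enqueue({"name": "alert.json", "priority": "1"})
conn.enqueue({"name": "metrics.avro"})  # no attribute: falls back to 5

order = []
while len(conn):
    order.append(conn.dequeue()["name"])
print(order)
```

Feeding several sources into one such queue is the idea behind funnelling many connections into a single prioritized connection.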
Back-Pressure
• Configure back-pressure per connection
• Based on number of FlowFiles or total size of FlowFiles
• Upstream processor is no longer scheduled to run until the queue is back below the threshold
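The back-pressure rule can be illustrated with a small simulation. This is a minimal sketch, not NiFi's implementation: the Connection class, the run_upstream helper, and the threshold values are all made up for the example. The upstream side stops producing as soon as either the count threshold or the size threshold is reached, and resumes once the downstream side has drained the queue.

```python
from collections import deque


class Connection:
    """Toy bounded queue between two processors (not NiFi code)."""

    def __init__(self, max_count, max_bytes):
        self.max_count = max_count
        self.max_bytes = max_bytes
        self.queue = deque()
        self.queued_bytes = 0

    def is_full(self):
        # Back-pressure applies once either threshold is reached.
        return len(self.queue) >= self.max_count or self.queued_bytes >= self.max_bytes

    def put(self, content):
        self.queue.append(content)
        self.queued_bytes += len(content)

    def take(self):
        content = self.queue.popleft()
        self.queued_bytes -= len(content)
        return content


def run_upstream(connection, pending):
    """Move items into the connection until back-pressure stops the scheduling."""
    produced = 0
    while pending and not connection.is_full():
        connection.put(pending.pop(0))
        produced += 1
    return produced


conn = Connection(max_count=3, max_bytes=1024)
pending = [b"a" * 10 for _ in range(5)]

first_run = run_upstream(conn, pending)   # stops at the count threshold
conn.take()                               # downstream consumes one item
second_run = run_upstream(conn, pending)  # upstream may run again
print(first_run, second_run, len(pending))
```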
Latency	vs.	Throughput
• Choose between lower latency, or higher throughput on each processor
• Higher throughput allows framework to batch together all operations for the selected amount of time for improved performance
• Processor developer determines whether to support this by using @SupportsBatching annotation
Architecture	- Standalone
[Diagram: a single OS/Host running one JVM that contains the Web Server, the Flow Controller, and Processor 1 … Extension N, backed by Local Storage holding the FlowFile Repository, Content Repository, and Provenance Repository]
• FlowFile Repository
– Write Ahead Log
– State of every FlowFile
– Pointers to content repository (pass-by-reference)
• Content Repository
– FlowFile content
– Copy-on-write
• Provenance Repository
– Write Ahead Log + Lucene Indexes
– Store & search lineage events
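To make pass-by-reference and copy-on-write concrete, here is a minimal sketch in Python. The ContentRepository class, the transform helper, and the dictionary layout of a FlowFile are hypothetical simplifications, not NiFi's actual repository design: a FlowFile carries only a pointer (claim) to its content, routing copies the pointer, and changing content writes a new claim while leaving the original untouched.

```python
class ContentRepository:
    """Toy append-only content store (not NiFi code)."""

    def __init__(self):
        self._claims = {}
        self._next_id = 0

    def write(self, data):
        claim_id = self._next_id
        self._next_id += 1
        self._claims[claim_id] = bytes(data)
        return claim_id

    def read(self, claim_id):
        return self._claims[claim_id]


def transform(repo, flowfile, fn):
    """Copy-on-write: new content gets a new claim; the old claim is untouched."""
    new_claim = repo.write(fn(repo.read(flowfile["claim"])))
    return {**flowfile, "claim": new_claim}


repo = ContentRepository()
original = {"attributes": {"filename": "users.csv"},
            "claim": repo.write(b"mr,dennis,reyes")}

routed = dict(original)                        # routing copies the pointer, not the bytes
upper = transform(repo, original, bytes.upper)  # modification writes new content

print(routed["claim"], upper["claim"])
print(repo.read(original["claim"]), repo.read(upper["claim"]))
```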
Architecture - Cluster
[Diagram: three nodes, each an OS/Host running its own JVM (Web Server, Flow Controller, Processor 1 … Extension N) with its own FlowFile, Content, and Provenance Repositories on Local Storage, coordinated through ZooKeeper]
• Same dataflow on each node, data partitioned across cluster
• Access the UI from any node
• ZooKeeper for auto-election of Cluster Coordinator & Primary Node
• Cluster Coordinator receives heartbeats from other nodes, manages joining/disconnecting
• Primary Node for scheduling processors on a single node
NiFi &	Solr
NiFi Solr Processors	
• Support Solr Cloud and stand-alone Solr instances
• Leverage SolrJ (CloudSolrClient & HttpSolrClient)
• GetSolr – Extract new documents
• PutSolrContentStream – Stream flow file content to an update handler
PutSolrContentStream
• Choose Solr Type – Cloud or Standard
• Specify ZooKeeper hosts, or the Solr URL with core
• Specify the Solr path for the update handler
• Dynamic Properties sent as key/value pairs on request
• Relationships for success, failure, and connection failure
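As an illustration of how the Solr URL, the update handler path, and the dynamic properties fit together, the sketch below assembles the request URL for a stand-alone Solr. The build_update_url helper, the host, and the "users" core are assumptions made for the example; the real processor sends the request through SolrJ rather than by building a URL string.

```python
from urllib.parse import urlencode


def build_update_url(solr_url, update_path, request_params):
    """Join the core URL, the update handler path, and the extra request parameters."""
    url = solr_url.rstrip("/") + "/" + update_path.lstrip("/")
    if request_params:
        url += "?" + urlencode(request_params)
    return url


# All values below are made up for illustration.
url = build_update_url(
    "http://localhost:8983/solr/users",           # Solr URL with core
    "/update/json/docs",                          # path of the update handler
    {"commitWithin": "1000", "overwrite": "true"},  # dynamic properties as key/value pairs
)
print(url)
```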
GetSolr
• Incrementally extract new documents
• Main query is *:*, Solr Query is an optional filter query
• Date Field used as filter query, from last execution or initial value
• Sorted by date field and unique key
• Cursor mark used behind the scenes
• Specify return fields, or all if blank
• Output Solr XML, or Records
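The query that such an incremental extract issues might look roughly like the following. This is a sketch of the standard Solr request parameters involved (q, fq, sort, cursorMark, fl), not the processor's actual code; the build_incremental_query helper and the field names are hypothetical, and whether the real processor uses an inclusive or exclusive lower bound on the date is not stated in the slides.

```python
from urllib.parse import urlencode


def build_incremental_query(date_field, unique_key, since, filter_query=None,
                            return_fields=None, cursor_mark="*", rows=100):
    """Assemble Solr select parameters for one page of an incremental extract."""
    params = [
        ("q", "*:*"),
        # Only documents newer than the last run; {X TO *] excludes X itself.
        ("fq", f"{date_field}:{{{since} TO *]"),
        # cursorMark requires the sort to include the unique key as a tie-breaker.
        ("sort", f"{date_field} asc, {unique_key} asc"),
        ("cursorMark", cursor_mark),  # "*" starts a new cursor
        ("rows", str(rows)),
    ]
    if filter_query:
        params.append(("fq", filter_query))
    if return_fields:
        params.append(("fl", ",".join(return_fields)))
    return params


# Field names and values are assumptions for illustration.
params = build_incremental_query(
    "registered", "id", "2017-12-01T00:00:00Z",
    filter_query="title:mr", return_fields=["id", "email"],
)
query_string = urlencode(params)
print(query_string)
```

For the next page, the nextCursorMark value returned by Solr would be passed back in as cursor_mark.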
Interacting	With	a	Secure	Solr
• Basic Auth
– Provide username/password
• Kerberos
– Set JAAS system property in bootstrap.conf
– Provide name of JAAS entry for processor to use
• TLS/SSL
– Provide an SSL Context Service
– One-way TLS with Truststore only
– Two-way TLS with Keystore + Truststore
Recent	&	Future	Work
Problem	– Conversion	Between	Data	Formats
• Specialized processors to operate on different data types
• Sometimes missing conversions
• Sometimes missing a specific function for a data type
• Sometimes implemented with different libraries causing inconsistencies
Solution	– Record	Processing
• Introduce the concept of a “record”
– Released in Apache NiFi 1.2.0 (May 2017), improvements in 1.3.0 and 1.4.0
• Centralize the logic for reading/writing records into controller services
– Readers/Writers for CSV, JSON, Avro, etc.
• Provide standard processors that operate on records
– ConvertRecord, QueryRecord, PartitionRecord, UpdateRecord, etc.
• Provide integration with schema registries
– Local Schema Registry, Hortonworks Schema Registry, Confluent Schema Registry
• Can still handle arbitrary data, but process records when appropriate
Problem	– Variable	Handling
• Need to parametrize values in the flow per environment
– Connection strings, URLs, File System paths, etc.
• Can set variables in bootstrap.conf
– -Dmy.var=foo
• Can set a properties file in nifi.properties
– nifi.variable.registry.properties=production.properties
• Both require command line access
• Both require restart to pick up changes
Solution	– First	Class	Variable	Registry	
• Variables associated with a process group, released in 1.4.0
• Right-click on canvas to view variables for current group
• Hierarchical order of precedence, resolve closest reference to component
• Editing variables automatically restarts any components referencing the variables
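The hierarchical order of precedence can be sketched as a walk from the component's own process group up toward the root, returning the first definition found. The ProcessGroup class, its resolve method, and the variable names below are hypothetical and only illustrate the lookup order; they are not NiFi's API.

```python
class ProcessGroup:
    """Toy process group holding variables and a link to its parent (not NiFi code)."""

    def __init__(self, name, parent=None, variables=None):
        self.name = name
        self.parent = parent
        self.variables = dict(variables or {})

    def resolve(self, key):
        # Walk from the current group up toward the root; the closest definition wins.
        group = self
        while group is not None:
            if key in group.variables:
                return group.variables[key]
            group = group.parent
        return None


# Variable names and values are made up for illustration.
root = ProcessGroup("root", variables={"solr.collection": "users-dev",
                                       "zk.hosts": "localhost:2181"})
ingest = ProcessGroup("ingest", parent=root,
                      variables={"solr.collection": "users-prod"})
enrich = ProcessGroup("enrich", parent=ingest)

print(enrich.resolve("solr.collection"))  # defined on the nearest ancestor, "ingest"
print(enrich.resolve("zk.hosts"))         # only defined on the root group
print(enrich.resolve("missing"))          # not defined anywhere
```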
Problem	– How	do	I	deploy	my	flow?
• Most organizations want the classic development lifecycle (dev -> int -> prod)
• Can copy flow.xml.gz between environments
– Requires copying entire data flow
– Can’t tell what changed, hard to diff if you put in version control
– Requires all environments use the same encryption key for sensitive properties
• Can make templates for portions of the flow
– Script creation of template and deployment to next environment
– Requires stopping flow and removing components, then re-instantiating template
– No easy way to see changes, hard to roll back
Solution	– NiFi Registry
• DISCLAIMER - UNDER DEVELOPMENT & NOT RELEASED YET!
• Complementary application, sub-project of Apache NiFi
– https://github.com/apache/nifi-registry
– https://issues.apache.org/jira/projects/NIFIREG
• Central location for storage/management of shared resources across NiFi instances
• Initial capability to store and retrieve “versioned flows”
• A versioned flow is a snapshot of a process group at a given point in time
• Potentially store extensions, shared data sets, and more in the future
DEMO!!
Example	Scenario
• User data
– https://randomuser.me
• Initially in CSV format
– name.title,name.first,name.last,email,registered
– mr,dennis,reyes,dennis.reyes@example.com,2012-04-10 01:54:19
– miss,carole,gomez,carole.gomez@example.com,2002-12-17 22:15:49
• Requirements
– Convert CSV to JSON
– Add a full_name field with first name + last name
– Add a gender field based on title (e.g. if title == mr then MALE)
– Ingest to different Solr collections depending on environment
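The first three requirements amount to the transformation sketched below in plain Python. In the demo this work would be done by NiFi's record processors, not by a script, so treat this only as a statement of the expected result. The enrich helper is hypothetical, and only the "mr -> MALE" mapping comes from the slide; the other title mappings and the UNKNOWN fallback are assumptions.

```python
import csv
import io
import json

CSV_INPUT = """name.title,name.first,name.last,email,registered
mr,dennis,reyes,dennis.reyes@example.com,2012-04-10 01:54:19
miss,carole,gomez,carole.gomez@example.com,2002-12-17 22:15:49
"""

# "mr -> MALE" comes from the requirements; the other entries are assumptions.
GENDER_BY_TITLE = {"mr": "MALE", "miss": "FEMALE", "ms": "FEMALE", "mrs": "FEMALE"}


def enrich(row):
    """Return a copy of the row with full_name and gender added."""
    record = dict(row)
    record["full_name"] = f"{row['name.first']} {row['name.last']}"
    record["gender"] = GENDER_BY_TITLE.get(row["name.title"], "UNKNOWN")
    return record


# Parse the CSV into dictionaries, enrich each one, and emit JSON.
records = [enrich(row) for row in csv.DictReader(io.StringIO(CSV_INPUT))]
print(json.dumps(records, indent=2))
```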
Questions?
Learn	more	and	join	us!
Apache NiFi site
http://nifi.apache.org
Subscribe to and collaborate at
dev@nifi.apache.org
users@nifi.apache.org
Submit Ideas or Issues
https://issues.apache.org/jira/browse/NIFI
Follow us on Twitter
@apachenifi
Thank	you!
