Real-Time Streaming Pipelines With FLaNK
(Apache Flink, Apache NiFi & Apache Kafka)
Timothy Spann - Principal DataFlow Field Engineer
15-April-2021
https://github.com/tspannhw https://www.datainmotion.dev/
https://www.meetup.com/futureofdata-princeton/
FLaNK Stack for Cloud Data Engineers
Multiple users, frameworks, languages, clouds, data sources & clusters
CLOUD DATA ENGINEER
• Experience in ETL/ELT
• Coding skills in Python or Java
• Knowledge of database query languages such as SQL
• Experience with Streaming
• Knowledge of Cloud Tools
CAT
• Expert in ETL (Eating, Ties and Laziness)
• Edge Camera Interaction
• Typical User
• No Coding Skills
• Can use NiFi
• Questions your cloud spend
AI / Deep Learning / ML / DS
• Can run in Apache NiFi
• Can run in Kafka Streams
• Can run in Apache Flink
• Can run in MiNiFi Agents
I Can Haz Data?
Today’s Data. REST and Websocket JSON “stonks”
{"symbol":"CLDR",
"uuid":"10640832-f139-4b82-8780-e3ad37b3d0ce",
"ts":1618529574078,
"dt":1612098900000,
"datetime":"2021/01/31 08:15:00",
"open":"12.24500",
"close":"12.25500",
"high":"12.25500",
"volume":"12353",
"low":"12.24500"}
https://github.com/tspannhw/SmartStocks
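The sample "stonks" record carries its prices and volume as JSON strings, so a consumer usually has to cast them before doing any arithmetic. The following is a minimal Python sketch of that step, not code from the SmartStocks repository. The helper name `parse_stock` is hypothetical, and treating `ts` and `dt` as epoch milliseconds is an assumption based on their magnitude.

```python
import json
from datetime import datetime, timezone

# Sample record copied from the slide; prices and volume arrive as strings.
RAW = (
    '{"symbol":"CLDR",'
    '"uuid":"10640832-f139-4b82-8780-e3ad37b3d0ce",'
    '"ts":1618529574078,'
    '"dt":1612098900000,'
    '"datetime":"2021/01/31 08:15:00",'
    '"open":"12.24500","close":"12.25500","high":"12.25500",'
    '"volume":"12353","low":"12.24500"}'
)


def parse_stock(raw: str) -> dict:
    """Hypothetical helper: decode one record and cast its numeric strings."""
    rec = json.loads(raw)
    for field in ("open", "close", "high", "low"):
        rec[field] = float(rec[field])
    rec["volume"] = int(rec["volume"])
    # Assumption: "ts" and "dt" are epoch milliseconds.
    for field in ("ts", "dt"):
        rec[field] = datetime.fromtimestamp(rec[field] / 1000, tz=timezone.utc)
    return rec


stock = parse_stock(RAW)
print(stock["symbol"], stock["close"], stock["dt"].isoformat())
```

Under that assumption, `dt` decodes to 13:15 UTC on 2021-01-31, which lines up with the record's `datetime` field if that field is US Eastern time.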
End to End Streaming Demo Pipeline
[Diagram labels: Enterprise sources (Clickstream, Market data, Machine logs, Social); Weather; Stocks; ETL; Analytics; Streaming SQL; Errors; Aggregates; Alerts]
https://github.com/tspannhw/CloudDemo2021
Apache NiFi in a Nutshell
Enable easy ingestion, routing, management and delivery of any data anywhere (edge, cloud, data center) to any downstream system with built-in end-to-end security and provenance.
Advanced tooling to industrialize flow development (Flow Development Life Cycle)
• Over 340 Prebuilt Processors
• Easy to build your own
• Parse, Enrich & Apply Schema
• Filter, Split, Merge & Route
• Throttle & Backpressure
• Guaranteed Delivery
• Full data provenance from acquisition to delivery
• Diverse, Non-Traditional Sources
• Eco-system integration
ACQUIRE: FTP, SFTP, HL7, UDP, XML, HTTP, EMAIL, HTML, IMAGE, SYSLOG
PROCESS: HASH, MERGE, EXTRACT, DUPLICATE, SPLIT, ENCRYPT, TAIL, EVALUATE, EXECUTE, GEOENRICH, SCAN, REPLACE, TRANSLATE, CONVERT, ROUTE TEXT, ROUTE CONTENT, ROUTE CONTEXT, ROUTE RATE, DISTRIBUTE LOAD
DELIVER: FTP, SFTP, HL7, UDP, XML, HTTP, EMAIL, HTML, IMAGE, SYSLOG
© 2021 Cloudera, Inc. All rights reserved. 9
ParquetReader / ParquetWriter
Records
• Native Record Processors for Apache Parquet Files!
• CSV <-> Parquet
• XML <-> Parquet
• AVRO <-> Parquet
• JSON <-> Parquet
• More...
https://www.datainmotion.dev/2019/11/exploring-apache-nifi-110-parameters.html
https://www.datainmotion.dev/2019/10/migrating-apache-flume-flows-to-apache_7.html
UpdateRecord
• Use with LookupRecord
• ELT
• Works on CSV, XML, JSON, AVRO, …
• RecordPath or Literals
• Use Schemas and Schema Registry
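The pairing of a lookup with a record update can be pictured in a few lines of Python. This is only an illustration of the enrichment pattern, not NiFi's processor API: the lookup table, the `company` and `source` fields, and the function name are all hypothetical.

```python
# Hypothetical in-memory lookup table standing in for a lookup service.
COMPANY_NAMES = {"CLDR": "Cloudera"}


def lookup_and_update(record: dict, table: dict) -> dict:
    """Return an enriched copy of the record; the input is left untouched."""
    enriched = dict(record)
    # Lookup-style enrichment: None when the key has no match.
    enriched["company"] = table.get(record["symbol"])
    # Literal-value update, as opposed to a value derived from the record.
    enriched["source"] = "rest"
    return enriched


original = {"symbol": "CLDR", "close": 12.255}
enriched = lookup_and_update(original, COMPANY_NAMES)
print(enriched)
```

Returning a copy rather than mutating the input keeps the original record available for provenance-style comparison.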
ValidateRecord
• Works on CSV, XML, JSON, AVRO, …
• RecordPath or Literals
• Use Schemas and Schema Registry
• Checks fields, types, nullable
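As a rough picture of what checking fields, types and nullability means for a single record, here is a small Python sketch. It is not how NiFi implements ValidateRecord. The schema layout (field name mapped to an expected type and a nullable flag) and the function name are assumptions made for illustration.

```python
# Hypothetical schema: field name -> (expected Python type, nullable?)
SCHEMA = {
    "symbol": (str, False),
    "close": (float, False),
    "volume": (int, True),
}


def validate(record: dict, schema: dict) -> list:
    """Return a list of problems; an empty list means the record is valid."""
    problems = []
    for name, (expected, nullable) in schema.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif record[name] is None:
            if not nullable:
                problems.append(f"null not allowed: {name}")
        elif not isinstance(record[name], expected):
            problems.append(f"wrong type: {name}")
    # Fields the schema does not know about are reported as well.
    for name in record:
        if name not in schema:
            problems.append(f"unexpected field: {name}")
    return problems


good = validate({"symbol": "CLDR", "close": 12.255, "volume": None}, SCHEMA)
bad = validate({"symbol": "CLDR", "close": "12.255"}, SCHEMA)
print(good, bad)
```

Records with an empty problem list would go to a "valid" route and the rest to an "invalid" one.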
RestLookupService
• Works on CSV, XML, JSON, AVRO, …
• Use Schemas and Schema Registry
• Can call Cloudera ML Models
• SSL and Proxy enabled
No More Spaghetti Flows
https://www.datainmotion.dev/2020/06/no-more-spaghetti-flows.html
● Reduce, Reuse, Recycle. Use Parameters to reuse common modules.
● Put flows, reusable chunks into separate Process Groups.
● Write custom processors if you need new or specialized features
● Use Cloudera supported NiFi Processors
● Use Record Processors everywhere
New Features
… based on Apache NiFi 1.13.2
https://www.datainmotion.dev/2021/02/new-features-of-apache-nifi-1130.html
● ListenFTP
● Data Drift
● SampleRecord
● Generic Record Sink
● Generic Record Reader
● PutRecord
● WindowsEventLogReader
Yes, Franz, It’s Kafka
Let’s do a metamorphosis on your data. Don’t fear changing data.
You don’t need to be a brilliant writer to stream data.
Franz Kafka was a German-speaking Bohemian novelist and short-story writer, widely regarded as one of the major figures of 20th-century literature. His work fuses elements of realism and the fantastic. (Wikipedia)
Apache Kafka
• Highly reliable distributed messaging system
• Decouples applications and enables many-to-many patterns
• Publish-Subscribe semantics
• Horizontal scalability
• Efficient implementation to operate at speed with big data volumes
• Organized by topic to support several use cases
[Diagram: three source systems publish EVENTS to Kafka; Fraud Detection, Security Systems and Real-Time Monitoring consume them (many-to-many publish-subscribe)]
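The many-to-many, topic-based publish-subscribe idea can be sketched with a toy in-memory broker. This is deliberately not the Kafka client API: the class `TinyBroker`, the topic name `events` and the source names are hypothetical. It also leaves out what makes Kafka reliable and scalable in practice, such as a persisted log, partitions, and consumers that pull at their own pace.

```python
from collections import defaultdict


class TinyBroker:
    """Toy in-memory stand-in for a topic-based broker (not the Kafka API)."""

    def __init__(self):
        self._subscribers = defaultdict(list)  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, event):
        # Every subscriber of the topic receives every event published to it.
        for callback in self._subscribers[topic]:
            callback(event)


broker = TinyBroker()
fraud_seen, monitoring_seen = [], []
broker.subscribe("events", fraud_seen.append)
broker.subscribe("events", monitoring_seen.append)

# Three producers that know nothing about who consumes their events.
for source in ("source-1", "source-2", "source-3"):
    broker.publish("events", {"from": source})

print(len(fraud_seen), len(monitoring_seen))
```

The producers only name a topic, never a consumer, which is the decoupling the slide refers to.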
Flink SQL
… based on Apache Flink 1.12
https://www.datainmotion.dev/2021/04/cloudera-sql-stream-builder-ssb-updated.html
● Streaming Analytics
● Continuous SQL
● Continuous ETL
● Complex Event Processing
● Standard SQL Powered by Apache Calcite
● Deployed Apache Flink Apps on YARN
● Scalable Stream Processing
Flink SQL
-- specify Kafka partition key on output
SELECT foo AS _eventKey FROM sensors;
-- use event time timestamp from kafka
-- exactly once compatible
SELECT eventTimestamp FROM sensors;
-- nested structures access
SELECT foo.'bar' FROM table; -- must quote nested column
-- timestamps
SELECT * FROM payments
WHERE eventTimestamp > CURRENT_TIMESTAMP - INTERVAL '10' SECOND;
-- unnest
SELECT b.*, u.*
FROM bgp_avro b,
UNNEST(b.path) AS u(pathitem);
-- aggregations and windows
SELECT card,
MAX(amount) AS theamount,
TUMBLE_END(eventTimestamp, INTERVAL '5' MINUTE) AS ts
FROM payments
WHERE lat IS NOT NULL
AND lon IS NOT NULL
GROUP BY card,
TUMBLE(eventTimestamp, INTERVAL '5' MINUTE)
HAVING COUNT(*) > 4; -- >4==fraud
-- try to do this ksql!
SELECT us_west.user_score+ap_south.user_score
FROM kafka_in_zone_us_west us_west
FULL OUTER JOIN kafka_in_zone_ap_south ap_south
ON us_west.user_id = ap_south.user_id;
Key Takeaway: Rich SQL grammar with advanced time and aggregation tools
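To make the "aggregations and windows" query concrete, the sketch below reproduces its logic in plain Python over a finite list of payments. It is not Flink and makes simplifying assumptions: events are tuples of `(card, amount, event_ts_ms, lat, lon)`, timestamps are epoch milliseconds, and there is no watermarking or late-data handling. The function name `tumbling_fraud` is hypothetical.

```python
from collections import defaultdict

WINDOW_MS = 5 * 60 * 1000  # 5-minute tumbling window


def tumbling_fraud(payments):
    """payments: iterable of (card, amount, event_ts_ms, lat, lon) tuples."""
    groups = defaultdict(list)
    for card, amount, ts, lat, lon in payments:
        # WHERE lat IS NOT NULL AND lon IS NOT NULL
        if lat is None or lon is None:
            continue
        # Tumbling windows do not overlap: each event lands in exactly one.
        window_start = ts - ts % WINDOW_MS
        groups[(card, window_start)].append(amount)
    return [
        # card, MAX(amount), window end (start + size)
        (card, max(amounts), start + WINDOW_MS)
        for (card, start), amounts in sorted(groups.items())
        if len(amounts) > 4  # HAVING COUNT(*) > 4
    ]


payments = (
    [("a", float(i), i * 1000, 1.0, 2.0) for i in range(1, 6)]  # 5 valid events
    + [("b", 9.0, i * 1000, 1.0, 2.0) for i in range(4)]        # only 4 valid events
    + [("b", 9.0, 5000, None, 2.0)]                             # dropped: lat is null
)
alerts = tumbling_fraud(payments)
print(alerts)
```

Card "b" stays below the threshold because its fifth event is filtered out before grouping, mirroring the order in which WHERE and HAVING apply.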
Flink SQL
SELECT location, station_id, latitude, longitude, observation_time, weather,
       temperature_string, relative_humidity, wind_string, wind_dir, wind_degrees,
       wind_mph, pressure_in, dewpoint_string, dewpoint_f, dewpoint_c
FROM weather2
WHERE location IS NOT NULL
  AND location <> 'null'
  AND trim(location) <> ''
  AND location LIKE '%NJ';

SELECT HOP_END(eventTimestamp, INTERVAL '1' SECOND, INTERVAL '30' SECOND) AS windowEnd,
       count("close") AS closeCount,
       sum(cast("close" AS float)) AS closeSum,
       avg(cast("close" AS float)) AS closeAverage,
       min("close") AS closeMin,
       max("close") AS closeMax,
       sum(CASE WHEN "close" > 14 THEN 1 ELSE 0 END) AS stockGreaterThan14
FROM stocksraw
GROUP BY HOP(eventTimestamp, INTERVAL '1' SECOND, INTERVAL '30' SECOND);
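In the stocks query, HOP takes the slide (1 second) first and the window size (30 seconds) second, so the windows overlap and each event is counted in 30 of them. The Python sketch below shows which windows a single timestamp falls into. It assumes epoch-millisecond timestamps and windows aligned to multiples of the slide; the function name `hop_windows` is hypothetical.

```python
def hop_windows(ts_ms, slide_ms=1_000, size_ms=30_000):
    """Return (start, end) for every hopping window [start, end) containing ts_ms."""
    # Latest window start that is not after the event.
    last_start = ts_ms - ts_ms % slide_ms
    # Walk backwards one slide at a time while the window still covers ts_ms,
    # i.e. while start > ts_ms - size_ms.
    starts = range(last_start, ts_ms - size_ms, -slide_ms)
    return [(start, start + size_ms) for start in sorted(starts)]


windows = hop_windows(45_500)
print(len(windows), windows[0], windows[-1])
```

Each returned end value corresponds to what HOP_END would report for that window, which is why the query emits a fresh 30-second aggregate every second.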
Upcoming Events
April 27 May 19
THANK YOU





