Building Secure RAG Applications with Open Large Language Models
Tim Spann, Senior Solutions Engineer
Tim Spann
paasdev.bsky.social
@PaasDev // Blog: datainmotion.dev
Senior Solutions Engineer, Snowflake
NY/NJ/Philly - Cloud Data + AI Meetups
ex-Zilliz, ex-Pivotal, ex-Cloudera, ex-HPE,
ex-StreamNative, ex-EY, ex-Hortonworks.
https://medium.com/@tspann
https://github.com/tspannhw
This week in Apache NiFi, Apache Polaris,
Apache Flink, Apache Kafka, ML, AI,
Streamlit, Jupyter, Apache Iceberg, Python,
Java, LLM, GenAI, Snowflake, Unstructured
Data and Open Source friends.
https://bit.ly/32dAJft
AI + Streaming Weekly by Tim Spann
AGENDA
Introduction and Overview
Data
Demo
Resources
Building Secure
RAG Apps
Requires a Team
For Data
AWS S3
Bucket
Structured,
Semistructured,
Unstructured
Data
When you think of RAG, you think of unstructured data like documents or giant chunks of text.
Unstructured Data
● Lots of formats
● Text, Documents, PDF
● Images, Videos, Audio
● Email
● Variants
Unstructured
● Open Data like OpenAQ - Air Quality Data
● Location, Time, Sensors
● Apache Avro, Parquet, ORC
● JSON and XML
● Hierarchical Data
● Logs
● Key-Value
Semi-Structured Data
https://docs.snowflake.com/en/sql-reference/data-types-semistructured
Semi-structured
Structured Data
● Snowflake Tables
● Snowflake Hybrid Tables
● Apache Iceberg Tables
● Relational Tables
● PostgreSQL Tables
● CSV, TSV
Structured
Apache Iceberg™ - Append
● NiFi - PutIcebergTable
● Snowpark - df.write.mode("append").save_as_table("atable_iceberg")
https://quickstarts.snowflake.com/guide/getting_started_iceberg_tables/
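The Snowpark append call above can be wrapped in a small helper. This is a minimal sketch, assuming a Snowpark `Session` has already been created elsewhere and that the target table exists; the helper name `append_rows` and the sample column names are hypothetical, while `create_dataframe`, `write.mode("append")` and `save_as_table` are the Snowpark calls the slide refers to.

```python
def append_rows(session, rows, columns, table_name="atable_iceberg"):
    """Append rows to an existing table through Snowpark.

    `session` is assumed to be a snowflake.snowpark.Session built elsewhere.
    The function name and the default table name are illustrative only.
    """
    # Build a DataFrame from in-memory rows; `columns` is a list of column names.
    df = session.create_dataframe(rows, schema=columns)
    # mode("append") adds rows instead of replacing the table contents.
    df.write.mode("append").save_as_table(table_name)
    return len(rows)
```

A call might look like `append_rows(session, [(1, "sensor-a")], ["id", "name"])`, where the column names are placeholders for whatever the Iceberg table actually defines.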
I Can
Haz
Data?
Open Large Language Models
Snowflake Arctic Instruct
https://huggingface.co/Snowflake/snowflake-arctic-instruct
Snowflake's Arctic-embed-m-v2.0
https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v2.0
Llama-3.3-70b, mixtral-8x7b, llama3.1-405b, mistral-7b
Retrieval Augmented Generation (RAG)
Build
Ingest -> Extract -> Split -> Build Indexes
Serve
Orchestration | Observability <-> Retrieval
<-> Inference
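The build and serve stages above can be sketched end to end with the standard library. This is a toy illustration only: word counts stand in for a real embedding model such as Arctic Embed, a Python list stands in for a vector index, and every function name here is hypothetical.

```python
import math
import re
from collections import Counter


def split_text(text, max_words=8):
    """Split: break extracted text into fixed-size word chunks."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text):
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_index(documents, max_words=8):
    """Build: ingest -> split -> index as (chunk, vector) pairs."""
    return [(chunk, embed(chunk)) for doc in documents for chunk in split_text(doc, max_words)]


def retrieve(index, question, k=1):
    """Serve, step 1: return the k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(index, question, k=1):
    """Serve, step 2: assemble the prompt that would be sent to the LLM."""
    context = "\n".join(retrieve(index, question, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```

The inference step itself is left out on purpose; the prompt returned by `build_prompt` is what would be passed to whichever open model is chosen.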
Open Source Option
Apache NiFi
Build
Ingest, Extract, Split, LLM Calls
• Guaranteed delivery
• Data buffering
- Backpressure
- Pressure release
• Prioritized queuing
• Flow specific QoS
- Latency vs. throughput
- Loss tolerance
• Data provenance
• Supports push and pull models
• Hundreds of processors
• Visual command and control
• Hundreds of sources
• Flow templates
• Pluggable/multi-role security
• Designed for extension
• Clustering
• Version Control
Apache NiFi for Data Ingest, Movement and Routing
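Two of the ideas listed above, backpressure and prioritized queuing, can be illustrated with a small bounded queue. This is a conceptual sketch and not NiFi's implementation: in NiFi these behaviors are configured on connections rather than coded, and the class and method names below are hypothetical.

```python
import heapq


class BoundedPriorityQueue:
    """Toy connection queue: a size threshold plus priority ordering."""

    def __init__(self, max_items):
        self.max_items = max_items
        self._heap = []
        self._counter = 0  # tie-breaker keeps insertion order within a priority

    def offer(self, priority, item):
        """Return False (backpressure) when the queue is at its threshold."""
        if len(self._heap) >= self.max_items:
            return False
        heapq.heappush(self._heap, (priority, self._counter, item))
        self._counter += 1
        return True

    def poll(self):
        """Return the item with the lowest priority number, or None if empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]
```

A producer that receives `False` from `offer` would pause and retry later, which is the effect backpressure has on an upstream component.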
• Moving Binary, Unstructured, Image and Tabular Data
• Enrichment
• Universal Visual Processor
• Simple Event Processor
• Routing
• Feeding data to Central Messaging
• Support for modern protocols
• Kafka Protocol Source/Sink
• Pulsar Protocol Source/Sink
The Power of Apache NiFi
APACHE NIFI 2.0 FEATURES
Major Updates:
● Python Integration
● Parameterization
● JDK 21+
● Provenance / Data Lineage
● Rules Engine for Development Assistance
● Additional Azure Processors
● Integration with Zendesk, Slack,
● Database Tables as Schemas
● Amazon Glue Schema Registry
● OpenTelemetry Support
Real-Time Integration and AI
Architecture
https://nifi.apache.org/docs/nifi-docs/html/overview.html
PROVENANCE
UNSTRUCTURED DATA WITH NIFI
• Archives - tar, gzipped, zipped, …
• Images - PNG, JPG, GIF, BMP, …
• Documents - HTML, Markdown, RSS, PDF, Doc, RTF, Plain Text, …
• Videos - MP4, Clips, Mov, YouTube URL, …
• Sound - MP3, …
• Social / Chat - Slack, Discord, Twitter, REST, Email, …
• Identify Mime Types, Chunk Documents, Store to Vector Database
• Parse Documents - HTML, Markdown, PDF, Word, Excel, PowerPoint
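Routing by MIME type, as listed above, can be sketched with Python's `mimetypes` module. This is a simplified illustration: it guesses from the filename extension only, whereas a content-based detector would inspect the bytes, and the route names and the `route_for` helper are hypothetical.

```python
import mimetypes

# Route names are illustrative and mirror the categories on the slide.
MAJOR_TYPE_ROUTES = {
    "image": "images",
    "video": "videos",
    "audio": "sound",
    "text": "documents",
}
FULL_TYPE_ROUTES = {
    "application/pdf": "documents",
    "application/zip": "archives",
}


def route_for(filename):
    """Pick a route for a file based on its guessed MIME type."""
    mime_type, _encoding = mimetypes.guess_type(filename)
    if mime_type is None:
        return "unmatched"
    if mime_type in FULL_TYPE_ROUTES:
        return FULL_TYPE_ROUTES[mime_type]
    major = mime_type.split("/", 1)[0]
    return MAJOR_TYPE_ROUTES.get(major, "unmatched")
```

Files that land on the `documents` route would then go on to parsing and chunking before being stored in a vector database.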
RECORD-ORIENTED DATA WITH NIFI
• Record Readers - Avro, CSV, Grok, IPFIX, JASN1, JSON, Parquet, Scripted, Syslog5424, Syslog, WindowsEvent, XML
• Record Writers - Avro, CSV, FreeFormText, JSON, Parquet, Scripted, XML
• Record Readers and Writers can reference a schema registry to retrieve schemas when necessary.
• Enables processors to accept any data format without having to worry about the parsing and serialization logic.
• Allows us to keep FlowFiles larger, each consisting of multiple records, which results in far better performance.
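The reader/writer split described above can be sketched in plain Python: a reader parses one format into records, a writer serializes records into another, and the transform in between never sees either format. This is an illustration of the idea, not NiFi code; all function names are hypothetical.

```python
import csv
import io
import json


def read_csv_records(text):
    """Reader: parse CSV text into a list of dict records."""
    return list(csv.DictReader(io.StringIO(text)))


def write_json_records(records):
    """Writer: serialize all records together as one JSON array."""
    return json.dumps(records)


def convert(text, transform=lambda record: record):
    """Apply a format-agnostic transform between the reader and the writer."""
    return write_json_records([transform(r) for r in read_csv_records(text)])
```

Because the whole batch travels together, one payload carries many records, which mirrors the point about keeping FlowFiles larger.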
Extract Company Names
● Python 3.10+
● Hugging Face, NLP, spaCy, PyTorch
https://github.com/tspannhw/FLaNK-python-ExtractCompanyName-processor
CaptionImage
● Python 3.10+
● Hugging Face
● Salesforce/blip-image-captioning-large
● Generate Captions for Images
● Adds captions to FlowFile Attributes
● Does not require download or copies of your images
https://github.com/tspannhw/FLaNK-python-processors
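The general shape of a NiFi 2.x Python processor that adds an attribute to a FlowFile is sketched below. This is not the code from the repository above: the class name `CaptionSketch` and the `describe_image` placeholder are hypothetical, no captioning model is called, and the fallback classes exist only so the sketch can be exercised outside NiFi.

```python
try:
    # Available when the file is loaded by NiFi 2.x as a Python extension.
    from nifiapi.flowfiletransform import FlowFileTransform, FlowFileTransformResult
except ImportError:
    # Stand-ins for running outside NiFi; these are not part of NiFi.
    class FlowFileTransform:
        pass

    class FlowFileTransformResult:
        def __init__(self, relationship, attributes=None, contents=None):
            self.relationship = relationship
            self.attributes = attributes
            self.contents = contents


def describe_image(image_bytes):
    """Hypothetical placeholder where a captioning model would be called."""
    return f"image of {len(image_bytes)} bytes"


class CaptionSketch(FlowFileTransform):
    class Java:
        implements = ["org.apache.nifi.python.processor.FlowFileTransform"]

    class ProcessorDetails:
        version = "0.0.1-SNAPSHOT"
        description = "Adds a caption attribute to each FlowFile (illustrative sketch)."

    def __init__(self, **kwargs):
        pass

    def transform(self, context, flowfile):
        caption = describe_image(flowfile.getContentsAsBytes())
        # Only attributes are returned, so the image content is passed through as-is.
        return FlowFileTransformResult(relationship="success", attributes={"caption": caption})
```

Swapping `describe_image` for a real model call is where a Hugging Face pipeline would plug in; that part is intentionally left out here.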
RESNetImageClassification
● Python 3.10+
● Hugging Face
● Transformers
● PyTorch
● Datasets
● microsoft/resnet-50
● Adds classification label to FlowFile Attributes
● Does not require download or copies of your images
https://github.com/tspannhw/FLaNK-python-processors
Address To Lat/Long
● Python 3.10+
● geopy Library
● Nominatim
● OpenStreetMap (OSM)
● openstreetmap.org/copyright
● Returns as attributes and JSON file
● Works with partial addresses
● Categorizes location
● Bounding Box
https://github.com/tspannhw/FLaNKAI-Boston
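The address lookup described above can be sketched as a small function around a geopy geocoder. This is a minimal sketch, assuming the caller passes in a geocoder such as `geopy.geocoders.Nominatim(user_agent="my-app")`; the function name, the output keys, and the user agent string are hypothetical, and the raw response keys are read defensively because they depend on the service.

```python
def address_to_lat_long(geolocator, address):
    """Geocode an address and return a flat dict, or None if nothing matched.

    `geolocator` is assumed to be a geopy geocoder created by the caller.
    """
    location = geolocator.geocode(address)
    if location is None:
        return None
    raw = location.raw or {}
    return {
        "latitude": location.latitude,
        "longitude": location.longitude,
        "display_name": location.address,
        # Nominatim usually includes these keys in its raw response;
        # .get() keeps the sketch safe if they are absent.
        "category": raw.get("class"),
        "bounding_box": raw.get("boundingbox"),
    }
```

The returned dict could be written both as FlowFile attributes and as a JSON document, matching the two outputs listed on the slide.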
DEMO
RESOURCES AND WRAP-UP
https://www.linkedin.com/in/timothyspann/
Open Source Edition
● Apache NiFi in Docker
● Runs in Docker
● Try new features quickly
● Develop applications locally
● Docker NiFi
○ docker run --name nifi -p 8443:8443 -d -e SINGLE_USER_CREDENTIALS_USERNAME=admin -e SINGLE_USER_CREDENTIALS_PASSWORD=ctsBtRBKHRAx69EqUghvvgEvjnaLjFEB apache/nifi:latest
● Licensed under the Apache License 2.0
● Unsupported
https://hub.docker.com/r/apache/nifi
Free Data and AI Event
● King of Prussia
● Princeton
● New York
● Virtual
https://www.snowflake.com/events/data-for-breakfast/
2025-03-03-Philly-AAAI-GoodData-Build Secure RAG Apps With Open LLM
