Text and Data Mining
Searching Vectors
Goals
1. Named Entity Recognition - Deeper Dive
2. Semantic Searching as a Concept
3. Vector Databases
4. Semantic Searching
5. Multi-Modal Data Mining
6. Retrieval-Augmented Generation (RAG)
Named Entity Recognition (NER)
Jim did not like the store.
It did not have chocolate.
He could not find anyone to help him.
Paris is a lovely city.
Paris enjoys flying to Asia.
Paris sails are fun.
(Presume this is a bad transcription of audio)
NER
Overview
● Classify individual spans, or
sequences of tokens, in a text
● Types of Classification
○ Hard Classification
○ Soft Classification
● Types of Methods
○ Machine Learning
○ Rules-Based
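The difference between hard and soft classification can be sketched in a few lines. This is a minimal illustration, not a real NER model: the raw scores for the span "Paris" are invented, and the softmax step is just one common way to turn scores into probabilities.

```python
import math

def softmax(scores):
    """Turn raw label scores into probabilities that sum to 1."""
    peak = max(scores.values())  # subtract the max for numerical stability
    exps = {label: math.exp(s - peak) for label, s in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}

# Hypothetical raw scores a model might assign to the span "Paris".
scores = {"GPE": 2.0, "PERSON": 1.0, "PRODUCT": 0.1}

soft = softmax(scores)          # soft classification: one probability per label
hard = max(soft, key=soft.get)  # hard classification: only the single best label

print(soft)
print(hard)
```

Soft classification keeps the uncertainty visible, which is useful when a span such as "Paris" could plausibly carry more than one label.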
NER
Labels
● Locations
○ LOC - Location
○ GPE - Geopolitical Entity
● PERSON
● NORP - Nationalities, religious,
or political groups
● TIME
● DATE
● EVENT
● PRODUCT
● FAC - Buildings, airports,
highways, bridges, etc.
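A rules-based approach to assigning these labels might look like the sketch below. The cue words and the `tag_paris` helper are hypothetical and chosen only to fit the "Paris" example sentences; a real rule set would need to be far richer. `UNKNOWN` is a fallback used here, not one of the labels above.

```python
import re

# Hypothetical context cues, invented for this sketch.
PERSON_CUES = {"enjoys", "likes", "said"}
GPE_CUES = {"city", "capital", "country"}

def tag_paris(sentence):
    """Label the span 'Paris' from the words that surround it."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    if "paris" not in words:
        return None            # the span does not occur in this sentence
    if words & PERSON_CUES:
        return "PERSON"
    if words & GPE_CUES:
        return "GPE"
    return "UNKNOWN"           # no rule fired

for s in ["Paris is a lovely city.",
          "Paris enjoys flying to Asia.",
          "Paris sails are fun."]:
    print(s, "->", tag_paris(s))
```

The third sentence falls through every rule, which shows the main weakness of hand-written rules: they only cover the cases someone anticipated.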
Brief Recap on Vectors or Embeddings
Representing
Texts
Digitally
● Bag-of-Words
● Embeddings
Representing
Texts
Digitally
Bag-of-Words
● The apple is in the tree.
○ 1-the
○ 2-apple
○ 3-is
○ 4-in
○ 1-the
○ 5-tree
● [1, 2, 3, 4, 1, 5]
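The numbering above can be reproduced with a short sketch. The `encode` function is a hypothetical helper that assigns ids in order of first appearance; the final `Counter` step shows the classic bag-of-words view, which keeps only how often each id occurs and discards word order.

```python
import re
from collections import Counter

def encode(text):
    """Map each token to an integer id, assigned in order of first appearance."""
    vocab = {}
    ids = []
    for token in re.findall(r"\w+", text.lower()):
        if token not in vocab:
            vocab[token] = len(vocab) + 1  # ids start at 1, as in the example
        ids.append(vocab[token])
    return ids, vocab

ids, vocab = encode("The apple is in the tree.")
counts = Counter(ids)  # order is dropped, only frequencies remain

print(ids)     # [1, 2, 3, 4, 1, 5]
print(counts)
```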
Representing
Texts
Digitally
Embeddings
● The apple is in the tree.
○ 1-[0.01234, -0.23456, 0.87654,
0.45678, -0.56123, 0.65432,
0.12345, -0.77123, 0.08456,
0.34567, ...]
○ 2-different vector
○ 3-different vector
○ 4-different vector
○ 1-[0.01234, -0.23456, 0.87654,
0.45678, -0.56123, 0.65432,
0.12345, -0.77123, 0.08456,
0.34567, ...]
○ 5-different vector
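The lookup above can be sketched as a table from token to vector. The vectors here are random stand-ins, not the output of a trained model, and `build_table` is a hypothetical helper. The sketch mirrors static embeddings, where a repeated word always gets the same vector; contextual models can assign different vectors to the same word in different sentences.

```python
import random

DIM = 10  # ten values, like the vector shown above; real models use many more

def build_table(tokens, seed=0):
    """Give each distinct token one fixed random vector (stand-in for a trained model)."""
    rng = random.Random(seed)
    table = {}
    for tok in tokens:
        if tok not in table:
            table[tok] = [rng.uniform(-1, 1) for _ in range(DIM)]
    return table

tokens = "the apple is in the tree".split()
table = build_table(tokens)
vectors = [table[t] for t in tokens]  # one vector per token position

print(vectors[0] == vectors[4])  # both positions hold "the"
print(vectors[0] == vectors[1])  # "the" versus "apple"
```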
Vector Databases
Vector
Database
What is it?
● It holds vectors in a database
as storage.
● Similar vectors sit closer
together in vector space.
Mattingly "Text and Data Mining: Searching Vectors"
Vector
Database
How do we use a vector
database?
● We populate a vector database
by using a machine learning
model to vectorize data and
sending the vectors to the
database.
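The populate step can be sketched with an in-memory stand-in. Everything here is an assumption made for illustration: `vectorize` uses word hashing in place of a machine learning model, and `ToyVectorStore` is a hypothetical class, not the API of any real vector database.

```python
import hashlib

def vectorize(text, dim=8):
    """Stand-in for a machine learning model: hash each word into a fixed-size vector."""
    vec = [0.0] * dim
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    return vec

class ToyVectorStore:
    """Hypothetical in-memory store that keeps each vector next to its original text."""
    def __init__(self):
        self.records = []

    def add(self, text):
        self.records.append((vectorize(text), text))

store = ToyVectorStore()
for doc in ["Paris is a lovely city.", "It did not have chocolate."]:
    store.add(doc)

print(len(store.records))
```

A real deployment would swap `vectorize` for an embedding model and `ToyVectorStore` for a database client, but the two-step shape (vectorize, then send) stays the same.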
Vector
Database
Why use a vector database?
● Vector databases let users
store vector data, query it,
and find matches based on
vector-level similarity rather
than explicit human-defined
similarity.
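Vector-level similarity is often measured with cosine similarity. The sketch below assumes that measure and uses three invented vectors; `stored` and `nearest` are hypothetical names, and a real database would use an index rather than scanning every vector.

```python
import math

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Invented vectors standing in for stored embeddings.
stored = {
    "apple": [0.9, 0.1, 0.0],
    "pear":  [0.8, 0.2, 0.1],
    "truck": [0.0, 0.1, 0.9],
}

def nearest(query, k=2):
    """Return the k stored names whose vectors are most similar to the query."""
    ranked = sorted(stored, key=lambda name: cosine(query, stored[name]), reverse=True)
    return ranked[:k]

print(nearest([1.0, 0.0, 0.0]))  # ['apple', 'pear']
```

Nobody told the store that apples and pears are related; the ranking falls out of the vectors alone.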
Vector
Database
What is it?
● A vector database holds
numerous vectors or
embeddings of data.
Sometimes, the database will
also store the original data
alongside these vectors.
Vector Database Stacks
Vector Database
Stacks
What is available to us?
● Python, Annoy, Streamlit
○ Cheap, easy to deploy, great for
smaller datasets, but requires a
little bit of knowledge to build from
scratch
○ Best for smaller databases (under
10,000 items)
● Python, txtai
○ Cheap and easy to use, more
resource intensive but easy to
deploy
○ Allows for easy interpretability (via
highlighting)
Vector Database
Stacks
What is available to us?
● Python/JavaScript and
Weaviate
○ Open-source
○ Can be run locally, on a server,
or via Weaviate's paid hosting
○ API is easy to use and easy to
set up
Multi-Modal Mining
Multi-Modal
What is it?
● Multi-modal data mining is
when we use one type of data
to find data of a different type.
● We could use text to find
images (which do not have
metadata or descriptions) or
images to find text.
Multi-Modal
How does it work?
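One common way this works is to embed text and images into the same vector space, so that a text vector can be compared directly with an image vector. The sketch below assumes such a shared space already exists; the file names, captions, and two-dimensional vectors are all invented, and no real multi-modal model is called.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Invented embeddings, assumed to live in one shared space.
image_vectors = {
    "img_001.jpg": [0.9, 0.1],  # pretend this is a photo of a tree
    "img_002.jpg": [0.1, 0.9],  # pretend this is a photo of a boat
}
text_vectors = {
    "a tree in a field": [0.8, 0.2],
    "a sailing boat": [0.2, 0.8],
}

def best_match(query_vec, candidates):
    """Return the candidate key whose vector is closest to the query vector."""
    return max(candidates, key=lambda key: cosine(query_vec, candidates[key]))

# Text to image, then image to text.
print(best_match(text_vectors["a sailing boat"], image_vectors))
print(best_match(image_vectors["img_001.jpg"], text_vectors))
```

Because the comparison only looks at vectors, the images need no metadata or descriptions to be found.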
Retrieval-Augmented Generation
How tall is Wookie?
RAG
What is it?
● RAG lets you combine the
strengths of large language
models (LLMs) with vector
databases
● It reduces the chance that an
LLM will hallucinate (generate
fake information)
● It uses a vector database to
find material relevant to a
query
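The retrieval half of RAG can be sketched as follows. This is an illustration under several assumptions: the corpus is made up, `embed` uses plain word counts in place of an embedding model, a list scan stands in for the vector database, and no LLM is called. The sketch stops at building the prompt that would be sent to one.

```python
import math
import re
from collections import Counter

# Made-up corpus, invented purely for this sketch.
corpus = [
    "Wookie the dog is 60 cm tall at the shoulder.",
    "The reading room opens at nine.",
    "Chocolate is sold in the store.",
]

def embed(text):
    """Stand-in for an embedding model: a bag of word counts."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(query, k=1):
    """Return the k corpus entries most similar to the query."""
    q = embed(query)
    return sorted(corpus, key=lambda doc: cosine(q, embed(doc)), reverse=True)[:k]

def build_prompt(query):
    """Put the retrieved text in front of the question before it goes to an LLM."""
    context = "\n".join(retrieve(query))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

prompt = build_prompt("How tall is Wookie?")
print(prompt)
```

Grounding the prompt in retrieved text is what reduces hallucination: the model is asked to answer from supplied material rather than from memory.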