Powered by
Vector Search for
Data Scientists
A Case Study with Twitter Analytics
Common questions in Data Science
#1 - How is my data distributed?
#2 - Are there outliers in my data?
#3 - Are my variables correlated with each other?
Vector Search
#1 - Can we capture the semantics in vector representations?
#2 - What can we learn about our data from semantic clusters?
Social Media Clicks and Twitter Analytics
Twitter Analytics
Twitter Analytics CSV Data
Tweet Text | Time | Impressions | Engagements | Engagement Rate | Retweets | Replies | Likes | User Profile Clicks | Url Clicks
I just published “ANN Benchmarks with Etienne Dilocker -- Weaviate Podcast #16 on Medium.. | May 27th, 1:34pm | 1905 | 50 | 2.6% | 3 | 1 | 15 | 2 | 18
Approximate Nearest Neighbor algorithms allow us to Vector Search in massive datasets! … | May 24th, 1:13pm | 7182 | 252 | 3.5% | 14 | 1 | 50 | 27 | 36
Feature Engineering:
Contains Emoji?
Character Count?
Word Count?
Contains “Weaviate”?
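The symbolic features listed above could be computed from the raw tweet text roughly as follows. This is a minimal sketch: the function name `tweet_features` is hypothetical, and the emoji check is a rough heuristic over a few Unicode blocks, not an exhaustive test.

```python
import re

# Rough emoji heuristic covering a few common Unicode blocks (an assumption, not exhaustive).
EMOJI_PATTERN = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def tweet_features(text):
    """Compute simple symbolic features from the raw tweet text."""
    return {
        "contains_emoji": bool(EMOJI_PATTERN.search(text)),
        "character_count": len(text),
        "word_count": len(text.split()),
        "contains_weaviate": "weaviate" in text.lower(),
    }

# Illustrative input, not a tweet from the dataset.
features = tweet_features("New Weaviate tutorial is live \U0001F680")
print(features)
```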
Key Takeaways:
“Vector Search
for Data
Scientists”
1. Segmentation in Data
Science
2. Vector Representations of
Data
3. Vector Segmentation
4. Weaviate for Twitter
Analytics
5. Research Questions and
Discussion
Slides, Colab Notebook, Video Presentation available on:
github.com/CShorten/Vector-Search-for-Data-Scientists
Key Takeaway #1 -
Segmentation in
Data Science
Visualizing Distributions of Values
Segmentation in Data Science
● What Time was the Tweet sent?
● Is there a URL Link in the Tweet?
● Symbolic vs. Vector Segmentation
What Time was the Tweet sent?
Is there a URL Link in the Tweet?
Can we split Impressions based on
the Semantics of the content?
Weaviate Podcast Weaviate Tutorial AI Weekly Update
How can we segment analytics based on the
semantics of…
● Text
● Images
● Code
● Audio
● Video
● Graph-Structure
● Biological Sequences
● … !
Summary of
Takeaway #1
Segmentation in
Data Science
We visualize the Distribution
of our data to get a sense of it.
For example we see that
Impressions are somewhat
Normally Distributed.
Is that also true for Tweets sent
at 3 AM?
What about Tweets related to
Deep Learning for Robotics?
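Symbolic segmentation of this kind can be sketched as a group-by over a feature of each tweet. The rows below are made-up illustrative values, not the real analytics export, and `segment_mean` is a hypothetical helper name.

```python
from collections import defaultdict
from statistics import mean

# Illustrative rows only: (hour_sent, has_url, impressions). Not real analytics data.
tweets = [
    (13, True, 1800),
    (13, False, 7000),
    (3, False, 400),
    (3, True, 200),
]

def segment_mean(rows, key):
    """Group rows by key(row) and return the mean impressions per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row[2])
    return {k: mean(v) for k, v in groups.items()}

by_hour = segment_mean(tweets, key=lambda r: r[0])  # segment by time sent
by_url = segment_mean(tweets, key=lambda r: r[1])   # segment by URL presence
print(by_hour, by_url)
```

The same pattern answers "Is that also true for Tweets sent at 3 AM?" by comparing the 3 AM group with the rest. Splitting by topic, however, needs a feature that symbols alone do not provide.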
Key Takeaway #2 -
Vector
Representations
of Data
Symbols compared to Vectors
Symbols
Category - [0, 1, 0, 0, 0, 0]
Numeric - 52
Boolean - True
Vectors
[0.1, 0.8, 0.34, 0.8, … 0.2]
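As a small illustration of the symbolic side, a category can be one-hot encoded as below. The vocabulary is a hypothetical example chosen to reproduce the [0, 1, 0, 0, 0, 0] pattern.

```python
def one_hot(category, vocabulary):
    """Return a list with a 1 at the position of `category` and 0 elsewhere."""
    return [1 if category == entry else 0 for entry in vocabulary]

# Hypothetical category vocabulary, for illustration only.
vocabulary = ["podcast", "tutorial", "news", "paper", "demo", "other"]
encoded = one_hot("tutorial", vocabulary)
print(encoded)
```

Every pair of distinct one-hot vectors is equally far apart, so the encoding carries no notion of similarity between categories. Dense vectors are meant to add exactly that.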
Vector Representations of Data
Photo by Shayna Douglas on Unsplash
0.83
0.35
..
0.02
Photo by Bill Stephan on Unsplash
0.74
0.01
..
0.95
Are these puppies similar?
Let’s ask Vector Distance!
L2 Distance (squared): $d(a, b) = \sum_i (a_i - b_i)^2$
$d(\text{Puppy1}, \text{Puppy2}) = (4-2)^2 + (8-9)^2 + (10-11)^2 = 6$
$d(\text{Puppy1}, \text{Airplane}) = (4-1)^2 + (8-20)^2 + (10-20)^2 = 253$
6 << 253, so Puppy1 is much more semantically similar to Puppy2 than to Airplane.
Vector Name Value 1 Value 2 Value 3
Puppy1 4 8 10
Puppy2 2 9 11
Airplane 1 20 20
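The distance computation for the table above can be checked with a few lines of Python. The sum of squared differences used on the slide is the squared L2 distance; taking its square root gives the L2 (Euclidean) distance, and both rank neighbors in the same order.

```python
import math

def squared_l2(a, b):
    """Sum of squared coordinate differences between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Values copied from the table above.
vectors = {
    "Puppy1": [4, 8, 10],
    "Puppy2": [2, 9, 11],
    "Airplane": [1, 20, 20],
}

d_puppies = squared_l2(vectors["Puppy1"], vectors["Puppy2"])
d_airplane = squared_l2(vectors["Puppy1"], vectors["Airplane"])
print(d_puppies, d_airplane)                        # squared distances
print(math.sqrt(d_puppies), math.sqrt(d_airplane))  # Euclidean distances
```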
Capturing Semantics in Vector Representations
How do Vectors represent real-world objects?
0.08 0.53 0.16 … 0.83 0.18
384 dimensional vector
Does this represent how much of a “brand” this is?
We aren’t sure! But there are research fields such as “Multimodal Neurons”
from OpenAI, and the general field of Disentangled Representation Learning
that are making great strides in understanding this.
Can we compress these vectors?
…
384 dimensional vector
Sometimes!
Ideas like Binary Passage Retrieval (shown above) - fp32 to Binary values
Ideas like Product Quantization - 384-d vector mapped to 32-d
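To make the fp32-to-binary idea concrete, here is a minimal sketch that thresholds each value at zero and compares the resulting codes with Hamming distance. This is a simplification: Binary Passage Retrieval learns its binary codes during training, whereas the fixed threshold and the example vectors here are assumptions for illustration.

```python
def binarize(vector, threshold=0.0):
    """Map each float to one bit: 1 if above the threshold, else 0."""
    return [1 if value > threshold else 0 for value in vector]

def hamming(a, b):
    """Number of positions at which two binary codes differ."""
    return sum(x != y for x, y in zip(a, b))

# Hypothetical 4-d embeddings; real models produce hundreds of dimensions.
code_a = binarize([0.3, -0.2, 0.8, -0.5])
code_b = binarize([0.1, -0.4, -0.6, -0.9])
print(code_a, code_b, hamming(code_a, code_b))
```

Each dimension shrinks from 32 bits to 1 bit, at the cost of a coarser notion of distance.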
Semantic Similarity with Vector Representations
Sentence-BERT:
Sentence Embeddings
using Siamese
BERT-Networks
Authored by
Nils Reimers and Iryna
Gurevych
Published 2019
Query Point
Positive and Negative Pair Sampling
Positive (Semantically Similar)
Negative (Semantically Different)
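One common way to train with positive and negative pairs is a triplet margin loss: the query should be closer to the positive than to the negative by at least a margin. The sketch below assumes Euclidean distance and a margin of 1.0, with toy 2-d vectors. It illustrates the idea and is not the exact training code of any particular model.

```python
import math

def euclidean(a, b):
    """Euclidean (L2) distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero when the positive is closer than the negative by at least `margin`."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

query = [0.0, 0.0]
good = triplet_loss(query, positive=[0.0, 1.0], negative=[3.0, 4.0])
bad = triplet_loss(query, positive=[3.0, 4.0], negative=[0.0, 1.0])
print(good, bad)
```

A loss of zero means the embedding already separates the pair well. A positive loss gives the gradient that pulls the positive closer and pushes the negative away.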
Another strategy - Data2Vec, Baevski et al. 2022
Do we need to train our own
models?
No! There are many pre-trained
models that work very well for
a broad range of data!
Great place to get started: Sentence Transformers
Summary of
Takeaway #2
Vector
Representations
of Data
Data such as Images, Text, Code,
… can be represented as Vectors
with Deep Learning models.
These models are trained on
massive collections of data so that
semantically similar inputs map to nearby vectors.
We often do not need to train the
models ourselves for particular
data domains to reach
reasonable performance.
Key Takeaway #3 -
Vector
Segmentation
We can segment analytics based on the
semantics of…
● Text
● Images
● Code
● Audio
● Video
● Graph-Structure
● Biological Sequences
● … !
Can we split Impressions based on
the Semantics of the content?
Weaviate Podcast Weaviate Tutorial AI Weekly Update
More Examples
House Hunting
Symbols: # of bedrooms, # of bathrooms, square feet, city
→ With Vectors we can encode:
● Visual style
● Neighborhood structure
● More flexible interface to define features with text
e-Commerce Products
Symbols: “Shoes”, “T-Shirt”, “Pants” or colors
→ With Vectors we can encode visual styles
Movies
Symbols can differentiate between genres like “Children”, “Action”, or “Sci-Fi”
→ With Vectors we can encode:
● Themes
● Characters
● Storylines
Scientific Papers
Symbols: “Biology”, “Machine Learning”
→ With Vectors we can encode
● Nuance of the ideas
● Writing style
Music
Symbols can differentiate between genres like “Hip Hop”, “Dance”
→ With Vectors we can encode:
● Tone
● Lyrics
● Instruments
“That’s the magic of deep learning:
turning meaning into vectors, then into geometric
spaces, and then incrementally learning complex
geometric transformations that map one space to
another. All you need are spaces of sufficiently high
dimensionality in order to capture the full scope of
the relationships found in the original data.”
- Francois Chollet, Deep Learning with Python, 2nd edition
Summary of
Takeaway #3
Vector
Segmentation
Vector representations, also
known as embeddings,
enable an Interface to split
analytics based on the
Semantics of the content.
This content could be Text,
Images, Code, Audio,
Videos, …
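A minimal sketch of vector segmentation: assign each tweet to the nearest segment vector, then aggregate Impressions per segment. The 2-d vectors, segment centroids, and Impressions values below are hypothetical; in practice the vectors would come from an embedding model.

```python
from collections import defaultdict

def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Hypothetical segment centroids in a toy 2-d embedding space.
centroids = {
    "Weaviate Podcast": [1.0, 0.0],
    "Weaviate Tutorial": [0.0, 1.0],
    "AI Weekly Update": [-1.0, 0.0],
}

# Hypothetical (embedding, impressions) pairs.
tweets = [([0.9, 0.1], 1900), ([0.1, 0.8], 1100), ([-0.8, 0.2], 600), ([0.2, 0.9], 400)]

def segment_impressions(tweets, centroids):
    """Average impressions per segment, using nearest-centroid assignment."""
    groups = defaultdict(list)
    for vector, impressions in tweets:
        label = min(centroids, key=lambda name: squared_l2(vector, centroids[name]))
        groups[label].append(impressions)
    return {label: sum(values) / len(values) for label, values in groups.items()}

segments = segment_impressions(tweets, centroids)
print(segments)
```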
Key Takeaway #4 -
Weaviate for
Twitter Analytics
Twitter Analytics
Tweet Text | Time | Impressions | Engagements | Engagement Rate | Retweets | Replies | Likes | User Profile Clicks | Url Clicks
I just published “ANN Benchmarks with Etienne Dilocker -- Weaviate Podcast #16 on Medium.. | May 27th, 1:34pm | 1905 | 50 | 2.6% | 3 | 1 | 15 | 2 | 18
Approximate Nearest Neighbor algorithms allow us to Vector Search in massive datasets! … | May 24th, 1:13pm | 7182 | 252 | 3.5% | 14 | 1 | 50 | 27 | 36
Cloud Data Upload
There are many other ways to do this as well
Google Colab Weaviate Cloud
Services
GraphQL Live Demo
5 Nearest Neighbors to → “Weaviate Coding Tutorial”
Content | Impressions
“We have 4 Weaviate Podcast Episodes so far [ … ] how to utilize the Weaviate Database as a Document Store in Haystack pipelines … ” | 311
“We have 2 new coding tutorials on Weaviate YouTube…” | 1144
“@weaviate_io Love the integration of this with the GraphQL API!” | 378
“Here are some thoughts on combining Weaviate and Haystack! TLDR: Weaviate is a great Vector Search database…” | 15563
“Weaviate (@weaviate_io) is also announcing a collaboration with Jina AI (@JinaAI_)! …” | 586
What was the Tweet about?
Have I tweeted something like this before?
Have any Weaviate Podcast guests
tweeted something like this recently?
Tweet, Author, Likes
GraphQL Live Demo
GraphQL Wikipedia Demo
Wikipedia Live Demo - Graph Data Model
GraphQL Wikipedia Demo
● Weaviate is a Vector
Search Database, rather
than a Library such as
Facebook’s FAISS or
ANNOY from Spotify
● Weaviate has a
Graph-like Data Model
Expanding Twitter project with Graph Model
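One way the Twitter project might be expanded with a graph model is to link each Tweet to an Author through a cross-reference. The sketch below assumes the JSON class/property layout of Weaviate's schema, where a property whose data type is another class name acts as a reference. All class and property names are hypothetical.

```python
# Hypothetical schema for illustration; class and property names are assumptions.
schema = {
    "classes": [
        {
            "class": "Author",
            "properties": [{"name": "handle", "dataType": ["text"]}],
        },
        {
            "class": "Tweet",
            "properties": [
                {"name": "content", "dataType": ["text"]},
                {"name": "likes", "dataType": ["int"]},
                # Cross-reference: the data type is the name of another class.
                {"name": "writtenBy", "dataType": ["Author"]},
            ],
        },
    ]
}

def reference_properties(schema):
    """List (class, property, target class) for every cross-reference."""
    class_names = {c["class"] for c in schema["classes"]}
    return [
        (c["class"], p["name"], p["dataType"][0])
        for c in schema["classes"]
        for p in c["properties"]
        if p["dataType"][0] in class_names
    ]

edges = reference_properties(schema)
print(edges)
```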
Summary of
Takeaway #4
Weaviate for
Twitter Analytics
We can segment Impressions
on Twitter based on the
content of the tweet without
manual labeling!
Weaviate is a Vector Search
Database that can be used to
store and search through
semantic embeddings of data.
Key Takeaway #5 -
Research Questions
and Discussion
Research Questions and Discussion
● Should I fine-tune my embedding model?
● Large-Scale Vector Search with Approximate
Nearest Neighbor (ANN) Algorithms
● How does Vector Search differ from Classification or
Regression models?
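To give a feel for how ANN algorithms trade exactness for speed, here is a toy sketch of one such technique, random-hyperplane locality-sensitive hashing. Vectors are bucketed by the sign pattern of a few random projections, and only the query's bucket is searched exactly. This is just one illustrative ANN approach with made-up data. Production systems typically use more sophisticated indexes.

```python
import random

def make_hyperplanes(num_planes, dim, seed=0):
    """Draw random hyperplane normals; the seed is fixed for reproducibility."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_planes)]

def lsh_key(vector, hyperplanes):
    """Bucket key: which side of each hyperplane the vector falls on."""
    return tuple(sum(v * h for v, h in zip(vector, plane)) >= 0 for plane in hyperplanes)

def build_index(vectors, hyperplanes):
    index = {}
    for item_id, vector in vectors.items():
        index.setdefault(lsh_key(vector, hyperplanes), []).append(item_id)
    return index

def ann_search(query, vectors, index, hyperplanes, k=1):
    """Rank only the candidates that share the query's bucket."""
    candidates = index.get(lsh_key(query, hyperplanes), [])
    ranked = sorted(
        candidates,
        key=lambda i: sum((q - x) ** 2 for q, x in zip(query, vectors[i])),
    )
    return ranked[:k]

# Hypothetical toy vectors.
vectors = {"a": [1.0, 0.0], "b": [-1.0, 0.0], "c": [0.9, 0.1]}
planes = make_hyperplanes(num_planes=4, dim=2)
index = build_index(vectors, planes)
result = ann_search([1.0, 0.0], vectors, index, planes, k=1)
print(result)
```

Because only one bucket is scanned, a true neighbor that lands in a different bucket can be missed, which is the "approximate" part of the trade-off.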
Vector Search versus Regression on Impressions
8,530 Impressions
Model Prediction
Interpretability of Vector Search
Nearest Neighbors
Interpretability of Vector Search and Prediction
8,530 Impressions
Model Prediction
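The contrast with a regression model can be sketched as a nearest-neighbor prediction: the predicted Impressions are the average over the most similar past tweets, and those tweets can be shown alongside the number as an explanation. The items and vectors below are hypothetical, and `knn_predict` is an illustrative helper, not part of any library.

```python
def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def knn_predict(query, items, k=3):
    """items: list of (text, vector, impressions).
    Returns the mean impressions of the k nearest items and their texts."""
    neighbors = sorted(items, key=lambda item: squared_l2(query, item[1]))[:k]
    prediction = sum(item[2] for item in neighbors) / len(neighbors)
    return prediction, [item[0] for item in neighbors]

# Hypothetical past tweets with toy 2-d embeddings.
items = [
    ("tweet 1", [0.0, 0.0], 100),
    ("tweet 2", [0.0, 1.0], 200),
    ("tweet 3", [5.0, 5.0], 900),
]
prediction, evidence = knn_predict([0.0, 0.4], items, k=2)
print(prediction, evidence)
```

A regression model returns only the number. The neighbor list returned here is what makes the estimate inspectable.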
What do we want to know about our Tweets?
Should I post this?
When might be a better time to post it?
What might be a better phrasing of this tweet?
Expanding from individuals to teams
● Has anyone on my team tweeted something
like this recently?
● Who on our team would be best fit to tell this
story?
● What topics should we be tweeting about?
Summary of
Takeaway #5
Research
Questions and
Discussion
How can we improve these
systems? What looks
promising?
Key Takeaways:
“Vector Search
for Data
Scientists”
1. Segmentation in Data
Science
2. Vector Representations of
Unstructured Data
3. Vector Segmentation
4. Weaviate Example for
Twitter Analytics
5. Research Questions and
Discussion
Slides, Colab Notebook, Video Presentation available on:
github.com/CShorten/Vector-Search-for-Data-Scientists
Connect with us!
Weaviate Slack Channel
YouTube: Weaviate • Vector Search Engine
Weaviate Podcast
Twitter @weaviate_io
Thank you for Watching!
Special thanks to Sebastian Witalec for
advising the development of this presentation
and Svitlana Smolianova for visual styling.
