Applications of Multimodal Learning in media search engines
Dmitry Voitekh
Proxet
1
Media content search. Information retrieval problem
To search media content (gifs, images, videos), one cannot simply use the
original files (sets of pixels, frames, etc.), since raw media cannot be indexed efficiently
2
Media content search. Information retrieval problem
Usually media documents are converted into more compact representations
(textual or vectorized) to which well-known search strategies can be applied.
Search = Content + Candidate Generation + Ranking
3
Media search engines. Textual representations
Media content can be converted into textual data via the following approaches:
1) OCR (Optical Character Recognition)
2) ASR (Automatic Speech Recognition)
3) Tag annotation (either manual or automatic via an ML model)
4) Video summarization models
4
Media search engines. Textual representations
Search can be organized in one of the following ways:
1) Use a full-text search solution to rank the generated text documents for the given
search query
2) Train an LTR (Learning To Rank) model that predicts relevance for each pair
(text document, search query). This requires a training dataset!
5
Media search engines. Textual representations
Issues with the “textual” approach:
1) A visual (or audio) signal cannot be converted into text without information loss
(discretization problem)
2) Representing the content well requires several models/signals, which makes the
system more complicated
6
Media search engines. Vector representations
Images and videos can be converted into meaningful, compact
vector representations via CV models.
We can build a similarity index over all documents, or cluster documents into
categories that can be used for search, etc.
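A similarity index over document vectors can be illustrated with a minimal sketch. It uses an exhaustive cosine-similarity scan over a handful of made-up 3-dimensional embeddings; the gif names, vectors, and function names are hypothetical, and a production system would use an approximate nearest-neighbour library rather than a full scan.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k(query_vec, index, k=2):
    """Return the k document ids most similar to query_vec (exhaustive scan)."""
    scored = sorted(
        index.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in scored[:k]]


# Hypothetical embeddings, as if produced by some CV model.
index = {
    "gif_cat": [0.9, 0.1, 0.0],
    "gif_dog": [0.8, 0.3, 0.1],
    "gif_car": [0.0, 0.2, 0.9],
}
nearest = top_k([1.0, 0.0, 0.0], index, k=2)
print(nearest)
```

The same `top_k` call works whether the query vector comes from another document (recommendations) or from a text model mapped into the same space (search).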
7
Media search engines. Vector representations
NLP models can be used to represent the search query.
To match a search query against documents:
1) LTR - predict relevance for a given pair of vectors
2) Mapper model - map both search query and document vectors into a single
vector space
8
Dataset
Pairs (document, search query) with relevance scores.
1) Manual annotation (e.g. via crowdsourcing job)
a) Takes time to collect
b) Can be expensive, because it should cover a large part of the search space
2) Online. Based on engagement data (logged events)
a) Approximates relevance with some noise
b) With substantial traffic, a large and diverse dataset can be rebuilt on a periodic
basis (capturing trends, seasonality)
9
Case Study. Gifs platform
10
Dataset. Engagement data (training/validation)
Billions of anonymized events per day
are logged to capture:
1) views
2) clicks
3) shares
4) favourites
for each gif and search query.
Events can be grouped into “sessions” using
client-specific details
11
Dataset. Engagement data (training/validation)
“Sessions” can be unfolded into sequences of gifs clicked by each user:
session 1: gif_1, gif_2, gif_3
...
Or we can incorporate both search queries and gifs:
session 1: hello, gif_1, gif_2, good_morning, gif_3
...
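The unfolding of logged events into session sequences might look like the sketch below. The event fields (`client_id`, `ts`, `type`, `query`, `gif_id`) and the 30-minute inactivity gap are assumptions made for illustration; the talk does not specify how sessions are delimited.

```python
from collections import defaultdict


def build_sessions(events, gap_seconds=1800, include_queries=True):
    """Group search/click events into per-client sessions split by inactivity gaps."""
    by_client = defaultdict(list)
    for event in events:
        by_client[event["client_id"]].append(event)

    sessions = []
    for client_events in by_client.values():
        client_events.sort(key=lambda e: e["ts"])
        current, last_ts = [], None
        for event in client_events:
            # A long pause closes the current session and starts a new one.
            if last_ts is not None and event["ts"] - last_ts > gap_seconds:
                sessions.append(current)
                current = []
            if event["type"] == "click":
                current.append(event["gif_id"])
            elif event["type"] == "search" and include_queries:
                current.append(event["query"].replace(" ", "_"))
            last_ts = event["ts"]
        sessions.append(current)
    # Drop sessions that ended up with no usable tokens.
    return [session for session in sessions if session]


# Hypothetical log records for a single client.
events = [
    {"client_id": "c1", "ts": 0, "type": "search", "query": "hello"},
    {"client_id": "c1", "ts": 5, "type": "click", "gif_id": "gif_1"},
    {"client_id": "c1", "ts": 9, "type": "click", "gif_id": "gif_2"},
    {"client_id": "c1", "ts": 20, "type": "search", "query": "good morning"},
    {"client_id": "c1", "ts": 25, "type": "click", "gif_id": "gif_3"},
    {"client_id": "c1", "ts": 7200, "type": "click", "gif_id": "gif_4"},
]
sessions_with_queries = build_sessions(events)
sessions_gifs_only = build_sessions(events, include_queries=False)
print(sessions_with_queries)
print(sessions_gifs_only)
```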
12
Dataset. Engagement data (training/validation)
To address positional bias for different grids:
1) shuffling of search results for a small percentage
of traffic
2) probabilistic modeling based on hierarchical
pooling to estimate the effect of positional bias on CTR
For content safety: both the search query and gif
datasets are filtered via maintained blacklists and
NSFW models
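One simple way to use the shuffled slice of traffic is sketched below. Because results are shown in random order there, differences in CTR between positions can be attributed mostly to position rather than relevance. The sketch estimates a relative propensity per position and turns it into an inverse-propensity click weight. This is an illustrative assumption, not the hierarchical-pooling model mentioned above; the input format, the `floor` parameter, and the function names are hypothetical.

```python
from collections import Counter


def position_propensities(shuffled_impressions):
    """Estimate relative click propensity per position from shuffled traffic.

    shuffled_impressions: iterable of (position, clicked) pairs.
    Returns position -> CTR at that position divided by the best position's CTR.
    """
    shown, clicked = Counter(), Counter()
    for position, was_clicked in shuffled_impressions:
        shown[position] += 1
        clicked[position] += int(was_clicked)
    ctr = {pos: clicked[pos] / shown[pos] for pos in shown}
    best = max(ctr.values()) if ctr else 0.0
    if best == 0.0:
        # No clicks at all: nothing to correct for.
        return {pos: 1.0 for pos in ctr}
    return {pos: value / best for pos, value in ctr.items()}


def debiased_click_weight(position, propensities, floor=0.05):
    """Weight for a click at `position`; clicks in rarely-clicked slots count more."""
    return 1.0 / max(propensities.get(position, floor), floor)


# Hypothetical shuffled-traffic log: 10 impressions per position.
impressions = (
    [(0, True)] * 4 + [(0, False)] * 6
    + [(1, True)] * 2 + [(1, False)] * 8
    + [(2, True)] * 1 + [(2, False)] * 9
)
propensities = position_propensities(impressions)
weight_at_bottom = debiased_click_weight(2, propensities)
print(propensities, weight_at_bottom)
```

The `floor` keeps a single click in a very rarely clicked position from receiving an unbounded weight.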
13
Dataset. Manually labeled (benchmark)
Human judgements obtained via
crowdsourcing tasks that estimate:
1) query-gif relevance
2) gif-gif relevance
● Complex relevance criteria defined by
the business
● Rarely updated and relatively compact
14
Dataset. Manually labeled (benchmark)
Triplets dataset (anchor, positive, negative)
Metric - % of triplets for which
(anchor, positive) relevancy >
(anchor, negative) relevancy
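The triplet metric can be written down in a few lines. In this sketch `relevance` is any scoring function supplied by the model under test; the toy vectors, identifiers, and the dot-product scorer are hypothetical and only serve to make the example runnable.

```python
def triplet_accuracy(triplets, relevance):
    """Share of (anchor, positive, negative) triplets where the positive outranks the negative."""
    if not triplets:
        raise ValueError("triplets must not be empty")
    correct = sum(
        1
        for anchor, positive, negative in triplets
        if relevance(anchor, positive) > relevance(anchor, negative)
    )
    return correct / len(triplets)


# Hypothetical joint embeddings for one query and three gifs.
vectors = {
    "hello": [1.0, 0.0],
    "gif_wave": [0.9, 0.1],
    "gif_hi": [0.2, 0.3],
    "gif_car": [0.1, 0.9],
}


def dot_relevance(a, b):
    """Relevance as a dot product of the two items' vectors."""
    return sum(x * y for x, y in zip(vectors[a], vectors[b]))


triplets = [
    ("hello", "gif_wave", "gif_car"),     # query-gif triplet, ranked correctly
    ("gif_wave", "gif_hi", "gif_car"),    # gif-gif triplet, ranked correctly
    ("hello", "gif_car", "gif_wave"),     # ranked incorrectly by these vectors
]
accuracy = triplet_accuracy(triplets, dot_relevance)
print(f"{accuracy:.1%}")
```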
15
Baseline. Word2Vec model
MVP. Gif embeddings for the Recommender System.
Train a Gensim Skip-Gram model on gifs only:
session 1: gif_1, gif_2, gif_3
where gif_* is the identifier of a gif that was clicked during a session.
For inference: kNN search in the embedding space (nmslib).
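To make the Skip-Gram objective concrete, the sketch below shows which (center, context) pairs a session contributes: each clicked gif is trained to predict the gifs clicked near it in the same session. This is a conceptual illustration in plain Python, not Gensim's implementation; in Gensim itself, Skip-Gram is selected by passing `sg=1` to `Word2Vec`. The `window` value is an assumption.

```python
def skipgram_pairs(session, window=2):
    """Yield (center, context) pairs for each item and its neighbours within `window`."""
    for i, center in enumerate(session):
        start = max(0, i - window)
        stop = min(len(session), i + window + 1)
        for j in range(start, stop):
            if j != i:
                yield center, session[j]


pairs = list(skipgram_pairs(["gif_1", "gif_2", "gif_3"], window=1))
print(pairs)
```

Gifs that are frequently clicked in the same sessions end up sharing many context items, which is what pulls their vectors together in the embedding space.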
16
Baseline. Word2Vec model
V1. Joint embeddings for search queries and gifs:
session 1: query_1, gif_1, gif_2, query_2, query_3
...
where query_* is the identifier of a search query issued by a user,
and gif_* is the identifier of a gif that was clicked during a session
17
18
Baseline. Word2Vec model
Pros: Search queries and gifs in a single space. Also, gifs’ tags can be
incorporated. Applications:
1) Search (query -> relevant gifs)
2) Recommender System (gif -> relevant gifs)
3) Tags Suggestion (query -> relevant tags)
Cons: Identifiers (not gif/query content) are used => cold start problem
The less frequent an identifier is, the less accurately it is positioned in the
embedding space
19
Baseline. Word2Vec model
Search prototype
20
Baseline. Word2Vec model
Tag Suggestion for gifs
21
Baseline. Word2Vec model
Search. Implicit usage. Features for Elasticsearch
1) Query Expansion
love you to the moon and back => love, adore you, couple, happy
2) Tag Suggestion for gifs
gif_1 => love, happy, couple
Results:
+ 10% CTR relative change
22
Baseline. Word2Vec model
Recommender System. kNN index
+ 9% CTR relative change compared to MVP version
23
Baseline. Word2Vec model
Tag Suggestion. kNN index
+ 40% CTR relative change compared to previous version
24
Cold start. Part 1. StarSpace
Extend search query with identifiers of its word n-grams:
how_are_you_id, gif_1, doing_good_id, gif_2
becomes:
how_are_you_id, how_id, are_id, you_id, gif_1, doing_good_id, doing_id, good_id, gif_2
● Model additionally learns to compare word n-grams with document identifiers
● Unseen search query vector = average of available tokens’ vectors
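The expansion and the fallback for unseen queries can be sketched as follows. This is plain Python for illustration, not StarSpace code; the `_id` suffix follows the example above, while `max_n`, the function names, and the toy vectors are assumptions.

```python
def expand_query(query, max_n=1):
    """Return the full-query id followed by the ids of its word n-grams (n <= max_n)."""
    words = query.lower().split()
    tokens = ["_".join(words) + "_id"]
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            ngram_id = "_".join(words[i:i + n]) + "_id"
            if ngram_id not in tokens:
                tokens.append(ngram_id)
    return tokens


def embed_unseen_query(query, vectors, max_n=1):
    """Average the vectors of those query tokens that were seen during training."""
    known = [vectors[t] for t in expand_query(query, max_n) if t in vectors]
    if not known:
        return None  # nothing to back off to
    dim = len(known[0])
    return [sum(vec[d] for vec in known) / len(known) for d in range(dim)]


expanded = expand_query("how are you")
# Hypothetical trained vectors: the full query "how you doing" was never seen.
trained = {"how_id": [1.0, 0.0], "you_id": [0.0, 1.0]}
fallback = embed_unseen_query("how you doing", trained)
print(expanded, fallback)
```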
25
Cold start. Part 2. Word2Vec + BERT
Take a pre-trained BERT model and fine-tune it jointly with Word2Vec
BERT learns a mapping from search query tokens to the Word2Vec gif space
The cold start problem is solved for queries, but is still an issue for gifs ;(
26
Mixture of Embedding Experts
The key point is that we haven’t really
utilized gif data (e.g. visual representation,
tags, etc.) yet.
What if we extend the BERT+Word2Vec
approach to all available signals?
27
https://arxiv.org/pdf/1804.02516.pdf
28
29
30
Mixture of Embedding Experts
We still have the same unified embedding space, but without the cold start
problem
Leverage all available gif metadata:
1) Visual representation
2) Tags representation
3) OCR representation
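The scoring idea can be sketched loosely along the lines of the Mixture of Embedding Experts paper linked in this deck: each modality ("expert") contributes its own query-gif similarity, and query-dependent weights are normalized over only those experts that are available for a given gif. The vectors and gate scores below are hypothetical stand-ins for learned model outputs, and the function names are illustrative, not the speaker's implementation.

```python
import math


def softmax(scores):
    """Normalize a dict of raw scores into weights that sum to 1."""
    peak = max(scores.values())
    exps = {name: math.exp(value - peak) for name, value in scores.items()}
    total = sum(exps.values())
    return {name: value / total for name, value in exps.items()}


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def mee_similarity(query_embeddings, gif_embeddings, gate_scores):
    """Weighted sum of per-expert similarities over the experts this gif actually has.

    query_embeddings: expert name -> query vector projected into that expert's space
    gif_embeddings:   expert name -> gif vector, or None when the modality is missing
    gate_scores:      expert name -> unnormalized, query-dependent expert weight
    """
    available = [name for name, vec in gif_embeddings.items() if vec is not None]
    if not available:
        return 0.0
    weights = softmax({name: gate_scores[name] for name in available})
    return sum(
        weights[name] * dot(query_embeddings[name], gif_embeddings[name])
        for name in available
    )


# Hypothetical model outputs for one query and two gifs.
query_embeddings = {"visual": [1.0, 0.0], "tags": [0.0, 1.0], "ocr": [1.0, 0.0]}
gate_scores = {"visual": 0.0, "tags": 0.0, "ocr": 0.0}  # equal weights
gif_full = {"visual": [0.8, 0.6], "tags": [0.0, 1.0], "ocr": [0.5, 0.0]}
gif_no_ocr = {"visual": [0.8, 0.6], "tags": [0.0, 1.0], "ocr": None}

score_full = mee_similarity(query_embeddings, gif_full, gate_scores)
score_no_ocr = mee_similarity(query_embeddings, gif_no_ocr, gate_scores)
print(score_full, score_no_ocr)
```

Renormalizing over available experts is what lets a new gif with no engagement history, or one with no text in it, still be scored from whatever signals it does have.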
31
Bonus. Expand a search query
32
33
34
Summary
1) Embeddings are great for various IR tasks
2) The ideal application is a candidate generation step
3) Start with a simple baseline with recall as high as possible
4) Careful collection of implicit user feedback is vital for good embeddings
5) Use human-verified datasets for benchmarks
6) The more data sources you have, the better the quality of the representations
35
Links
1) Word2Vec illustration: http://jalammar.github.io/illustrated-word2vec
2) nmslib. Efficient aNN search: https://github.com/nmslib/nmslib
3) StarSpace for space fusion: https://github.com/facebookresearch/StarSpace
4) DSSM: https://www.microsoft.com/en-us/research/project/dssm
5) Pinterest multimodal learning:
https://labs.pinterest.com/user/themes/pin_labs/assets/paper/training-and-evaluating.pdf
6) Mixture of embedding experts: https://arxiv.org/pdf/1804.02516.pdf
36
