Dmitry Voitekh "Applications of Multimodal Learning in media search engines"

Applications of Multimodal Learning in media search engines
Dmitry Voitekh
Proxet

Media content search. Information retrieval problem
To search media content (gifs, images, videos), one cannot simply use the
original files (sets of pixels, frames, etc.), since they cannot be indexed efficiently.
Usually, media documents are converted into more compact representations
(textual or vectorized), to which various known search strategies can be applied.
Search = Content + Candidate Generation + Ranking

Media search engines. Textual representations
Media content can be converted into textual data via the following approaches:
1) OCR (Optical Character Recognition)
2) ASR (Automatic Speech Recognition)
3) Tag annotation (manual, or automatic via an ML model)
4) Video summarization models

Search can be organized in one of the following ways:
1) Use a full-text search solution to rank the generated text documents for the
given search query
2) Train an LTR (Learning To Rank) model that predicts relevance for each
(text document, search query) pair. This requires a training dataset!
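A minimal sketch of option 1, assuming the text documents are built from tags and OCR output. The slides do not name a scoring function; BM25 is swapped in here as a common choice for full-text ranking, and `bm25_rank`, `k1`, `b` and the toy documents are illustrative names and values, not part of the original system.

```python
import math
from collections import Counter

def bm25_rank(query, docs, k1=1.5, b=0.75):
    """Rank documents (dict: id -> text) for a query with a plain BM25 score."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized.values()) / n_docs

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    scores = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical text documents generated from tags / OCR / ASR.
docs = {
    "gif_1": "happy birthday cake candles",
    "gif_2": "good morning coffee",
    "gif_3": "happy dog running",
}
ranking = bm25_rank("happy birthday", docs)
```

Here `ranking` puts gif_1 first because it matches both query terms. An LTR model (option 2) would replace the hand-written score with a learned one.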

Issues with the “textual” approach:
1) A visual (or audio) signal cannot be converted into text without information loss
(discretization problem)
2) Representing the content well requires several models/signals =>
a more complicated system

Media search engines. Vector representations
Images and videos can be converted into meaningful and efficiently compressed
vector representations via CV models.
We can build a similarity index over all documents, cluster documents into
categories that can be used for search, etc.
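A minimal sketch of a similarity index over document vectors, assuming the vectors were already produced by some CV model. The three-dimensional vectors and the names `cosine` and `nearest` are made up for illustration; a real index would use an approximate nearest-neighbour library rather than brute force.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(query_vec, index, k=2):
    """Brute-force k nearest neighbours by cosine similarity."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in index.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]

# Hypothetical document embeddings.
index = {
    "cat_gif": [0.9, 0.1, 0.0],
    "kitten_gif": [0.8, 0.2, 0.1],
    "car_gif": [0.0, 0.1, 0.9],
}
neighbours = nearest(index["cat_gif"], index, k=2)
```

The query document itself comes back first, followed by its closest neighbour (kitten_gif in this toy data).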

NLP models can be used to represent the search query.
To match a search query against documents:
1) LTR: predict relevance for a given pair of vectors
2) Mapper model: map both search query and document vectors into a single
vector space

Dataset
Pairs (document, search query) with relevance scores.
1) Manual annotation (e.g. via a crowdsourcing job)
   a) Takes time to collect
   b) Can be expensive, because it should cover a large part of the search space
2) Online, based on engagement data (logged events)
   a) Approximates relevance with some noise
   b) With substantial traffic, a large and diverse dataset can be rebuilt
      periodically (to follow trends and seasonality)

Dataset. Engagement data (training/validation)
Billions of anonymized events per day are logged to capture, for each gif and
search query:
1) views
2) clicks
3) shares
4) favourites
Events can be grouped into “sessions” using client-specific details.

“Sessions” can be unfolded into sequences of gifs clicked by each user:
session 1: gif_1, gif_2, gif_3
...
Or we can incorporate both search queries and gifs:
session 1: hello, gif_1, gif_2, good_morning, gif_3
...
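A minimal sketch of the unfolding step, assuming each logged event is a (client_id, timestamp, kind, value) tuple and that a session ends after 30 minutes of inactivity. The event schema, the threshold and the name `build_sessions` are assumptions for illustration; the slides only say that client-specific details are used.

```python
from collections import defaultdict

SESSION_GAP_SECONDS = 30 * 60  # assumed inactivity threshold

def build_sessions(events, include_queries=False):
    """Group events per client and split them into sessions by time gaps.

    kind is "search" (value = query) or "click" (value = gif id).
    """
    by_client = defaultdict(list)
    for client_id, ts, kind, value in events:
        by_client[client_id].append((ts, kind, value))

    sessions = []
    for client_events in by_client.values():
        client_events.sort()  # chronological order within one client
        current, last_ts = [], None
        for ts, kind, value in client_events:
            if last_ts is not None and ts - last_ts > SESSION_GAP_SECONDS:
                if current:
                    sessions.append(current)
                current = []
            if kind == "click" or include_queries:
                current.append(value)
            last_ts = ts
        if current:
            sessions.append(current)
    return sessions

events = [
    ("client_a", 0, "search", "hello"),
    ("client_a", 5, "click", "gif_1"),
    ("client_a", 9, "click", "gif_2"),
    ("client_a", 40, "search", "good_morning"),
    ("client_a", 45, "click", "gif_3"),
    ("client_a", 10_000, "click", "gif_4"),  # long pause: starts a new session
]
gif_sessions = build_sessions(events)
joint_sessions = build_sessions(events, include_queries=True)
```

With `include_queries=True` the first session reproduces the second layout above: hello, gif_1, gif_2, good_morning, gif_3.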

To address positional bias for different grids:
1) shuffle search results for a small percentage of traffic
2) use probabilistic modeling based on hierarchical pooling to estimate the
effect of positional bias on CTR
For content safety, both the search query and gif datasets are filtered via
maintained blacklists and NSFW models.
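A rough sketch of how shuffled traffic can be used against positional bias. This is not the hierarchical pooling model mentioned above; it swaps in a simpler estimate (CTR per position on randomized results, then inverse-propensity weights for clicks). The log format and the names `position_propensity` and `click_weight` are hypothetical.

```python
from collections import defaultdict

def position_propensity(shuffled_log):
    """Estimate relative click propensity per grid position.

    shuffled_log: iterable of (position, was_clicked) pairs taken from the
    traffic slice where results were shuffled, so content quality is the same
    on average at every position and CTR differences reflect position only.
    """
    shown = defaultdict(int)
    clicks = defaultdict(int)
    for position, was_clicked in shuffled_log:
        shown[position] += 1
        clicks[position] += int(was_clicked)
    ctr = {pos: clicks[pos] / shown[pos] for pos in shown}
    best = max(ctr.values())
    if best == 0:
        raise ValueError("no clicks in the shuffled log")
    return {pos: ctr[pos] / best for pos in ctr}

def click_weight(position, propensity):
    """Weight of one click: clicks at rarely examined positions count more.

    Positions with zero estimated propensity need separate handling.
    """
    return 1.0 / propensity[position]

# Toy shuffled log: position 0 gets 40% CTR, position 1 gets 20% CTR.
log = [(0, True)] * 40 + [(0, False)] * 60 + [(1, True)] * 20 + [(1, False)] * 80
propensity = position_propensity(log)
weight_at_1 = click_weight(1, propensity)
```

In this toy log a click at position 1 counts twice as much as a click at position 0 when relevance labels are derived from engagement data.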

Dataset. Manually labeled (benchmark)
Human judgements obtained via crowdsourcing tasks that estimate:
1) query-gif relevance
2) gif-gif relevance
● Complex relevance criteria defined by the business
● Rarely updated and relatively compact

Triplets dataset: (anchor, positive, negative)
Metric: % of triplets for which
relevance(anchor, positive) > relevance(anchor, negative)
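The triplet metric can be sketched as follows, assuming relevance is approximated by a dot product between embeddings. The toy vectors, the `similarity` helper and `triplet_accuracy` are illustrative names, not part of the original benchmark.

```python
def triplet_accuracy(triplets, similarity):
    """Percentage of triplets where the positive beats the negative."""
    correct = sum(
        1
        for anchor, positive, negative in triplets
        if similarity(anchor, positive) > similarity(anchor, negative)
    )
    return 100.0 * correct / len(triplets)

# Hypothetical embeddings for queries and gifs in one space.
vectors = {
    "hello": [1.0, 0.0],
    "cat": [0.0, 1.0],
    "gif_wave": [0.9, 0.1],
    "gif_cat": [0.1, 0.9],
    "gif_hi": [0.0, 0.5],  # badly positioned on purpose
}

def similarity(a, b):
    return sum(x * y for x, y in zip(vectors[a], vectors[b]))

triplets = [
    ("hello", "gif_wave", "gif_cat"),    # query-gif, model agrees
    ("cat", "gif_cat", "gif_wave"),      # query-gif, model agrees
    ("gif_wave", "gif_hi", "gif_cat"),   # gif-gif, model disagrees
]
accuracy = triplet_accuracy(triplets, similarity)
```

Two of the three toy triplets are ordered correctly, so the metric is about 66.7%.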

Baseline. Word2Vec model
MVP: gif embeddings for the Recommender System.
Train a Gensim Skip-Gram model on gifs only:
session 1: gif_1, gif_2, gif_3
where gif_* is the identifier of a gif that was clicked during a session.
For inference: kNN search in the embedding space (nmslib).
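To show what a Skip-Gram model learns from sessions, here is a sketch that lists the (center, context) training pairs for one session. It is a simplified illustration, not Gensim's implementation (which, for example, also samples the window size and uses negative sampling or hierarchical softmax); `skipgram_pairs` and the window value are assumptions.

```python
def skipgram_pairs(session, window=2):
    """Return (center, context) pairs for gifs clicked close to each other."""
    pairs = []
    for i, center in enumerate(session):
        start = max(0, i - window)
        stop = min(len(session), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((center, session[j]))
    return pairs

pairs = skipgram_pairs(["gif_1", "gif_2", "gif_3"], window=1)
```

Gifs that often appear as each other's context end up close in the embedding space, which is what the kNN lookup relies on at inference time. In Gensim, the Skip-Gram variant is selected with `sg=1` on `Word2Vec`.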

V1. Joint embeddings for search queries and gifs:
session 1: query_1, gif_1, gif_2, query_2, query_3
...
where query_* is the identifier of a search query issued by a user,
and gif_* is the identifier of a gif that was clicked during a session.

Pros: search queries and gifs live in a single space. Gif tags can also be
incorporated. Applications:
1) Search (query -> relevant gifs)
2) Recommender System (gif -> relevant gifs)
3) Tag Suggestion (query -> relevant tags)
Cons: identifiers (not gif/query content) are used => cold start problem.
The less frequent an identifier is, the less accurately it is positioned in the
embedding space.
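A sketch of how one joint space can serve all three applications: the same nearest-neighbour lookup, restricted to a target item type. The "query:", "gif:" and "tag:" prefixes, the two-dimensional vectors and the name `top_k` are hypothetical conventions chosen for this example.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def top_k(source_id, embeddings, target_prefix, k=2):
    """Nearest items of one type (by id prefix) to the source item."""
    source = embeddings[source_id]
    candidates = [
        (dot(source, vec), item_id)
        for item_id, vec in embeddings.items()
        if item_id.startswith(target_prefix) and item_id != source_id
    ]
    candidates.sort(reverse=True)
    return [item_id for _, item_id in candidates[:k]]

# Hypothetical joint embeddings of queries, gifs and tags.
embeddings = {
    "query:i love you": [0.9, 0.1],
    "gif:hearts": [0.8, 0.2],
    "gif:hug": [0.7, 0.3],
    "gif:facepalm": [0.1, 0.9],
    "tag:love": [0.95, 0.05],
    "tag:fail": [0.0, 1.0],
}
search = top_k("query:i love you", embeddings, "gif:", k=2)      # query -> gifs
recommendations = top_k("gif:hearts", embeddings, "gif:", k=1)   # gif -> gifs
tags = top_k("query:i love you", embeddings, "tag:", k=1)        # query -> tags
```

A query or gif that never appeared in the training sessions has no row in `embeddings`, which is the cold start problem named above.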

Search prototype

Tag Suggestion for gifs

Search. Implicit usage. Features for ElasticSearch
1) Query Expansion
love you to the moon and back => love, adore you, couple, happy
2) Tag Suggestion for gifs
gif_1 => love, happy, couple
Results:
+10% relative change in CTR

Recommender System. kNN index
+9% relative change in CTR compared to the MVP version

Tag Suggestion. kNN index
+40% relative change in CTR compared to the previous version

Cold start. Part 1. StarSpace
Extend each search query with the identifiers of its word n-grams:
how_are_you_id, gif_1, doing_good_id, gif_2
becomes:
how_are_you_id, how_id, are_id, you_id, gif_1, doing_good_id, doing_id, good_id, gif_2
● The model additionally learns to compare word n-grams with document identifiers
● Vector of an unseen search query = average of the vectors of its known tokens
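The expansion and the fallback for unseen queries can be sketched as below. This only mirrors the data preparation and averaging described on this slide, not StarSpace itself; `expand_query`, `embed_unseen_query`, `max_n` and the toy vectors are illustrative assumptions (the slide's example shows single words only, hence the default `max_n=1`).

```python
def expand_query(query, max_n=1):
    """Return the full-query id followed by ids of its word n-grams."""
    words = query.lower().split()
    tokens = ["_".join(words) + "_id"]
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            token = "_".join(words[i:i + n]) + "_id"
            if token not in tokens:
                tokens.append(token)
    return tokens

def embed_unseen_query(query, token_vectors, max_n=1):
    """Average the vectors of the tokens that were seen during training."""
    known = [
        token_vectors[token]
        for token in expand_query(query, max_n)
        if token in token_vectors
    ]
    if not known:
        return None  # nothing to fall back on
    dim = len(known[0])
    return [sum(vec[d] for vec in known) / len(known) for d in range(dim)]

expanded = expand_query("how are you")

# Hypothetical trained vectors; "how about you" was never seen as a whole.
token_vectors = {"how_id": [1.0, 0.0], "you_id": [0.0, 1.0]}
unseen_vec = embed_unseen_query("how about you", token_vectors)
```

`expanded` matches the example above (how_are_you_id, how_id, are_id, you_id), and the unseen query is placed halfway between its two known tokens.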

Cold start. Part 2. Word2Vec + BERT
Take a pre-trained BERT model and fine-tune it jointly with Word2Vec.
BERT learns a mapping from search query tokens to the Word2Vec gif space.
The cold start problem is solved for queries, but is still an issue for gifs ;(

Mixture of Embedding Experts
The key point is that we haven’t really utilized gif data (e.g. visual
representation, tags, etc.) yet.
What if we extend the BERT+Word2Vec approach to all available signals?

https://arxiv.org/pdf/1804.02516.pdf

We still have the same unified embedding space, but without the cold start
problem.
Leverage all available gif metadata:
1) Visual representation
2) Tags representation
3) OCR representation
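A rough sketch of the weighting idea, loosely following the Mixture of Embedding Experts paper linked earlier: each signal (visual, tags, OCR) has its own expert, the query is projected into each expert's space, and the final similarity is a weighted sum over the experts that are actually available for a gif. The gate logits are passed in as plain numbers here, whereas in the paper they are predicted from the text; all names and vectors are illustrative assumptions.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mee_similarity(query_vecs, gif_vecs, gate_logits):
    """Weighted sum of per-expert similarities over available experts.

    query_vecs:  expert name -> query embedding in that expert's space
    gif_vecs:    expert name -> gif embedding, or None if the signal is missing
    gate_logits: expert name -> unnormalized weight for this query
    """
    available = [name for name, vec in gif_vecs.items() if vec is not None]
    if not available:
        return 0.0
    # Softmax restricted to available experts, so weights still sum to 1.
    exp_logits = {name: math.exp(gate_logits[name]) for name in available}
    total = sum(exp_logits.values())
    return sum(
        (exp_logits[name] / total) * dot(query_vecs[name], gif_vecs[name])
        for name in available
    )

query_vecs = {"visual": [1.0, 0.0], "tags": [0.0, 1.0], "ocr": [1.0, 0.0]}
gate_logits = {"visual": 0.0, "tags": 0.0, "ocr": 0.0}

full_gif = {"visual": [1.0, 0.0], "tags": [0.0, 1.0], "ocr": [0.0, 1.0]}
new_gif = {"visual": [1.0, 0.0], "tags": None, "ocr": None}  # no tags or OCR yet

full_score = mee_similarity(query_vecs, full_gif, gate_logits)
new_score = mee_similarity(query_vecs, new_gif, gate_logits)
```

A freshly uploaded gif with only a visual embedding still gets a score, which is why the cold start problem no longer applies to gifs.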

Bonus. Expand a search query

Summary
1) Embeddings are great for various IR tasks
2) The ideal application is a candidate generation step
3) Start with a simple baseline with recall as high as possible
4) Careful collection of implicit user feedback is vital for good embeddings
5) Use human-verified datasets for benchmarks
6) The more data sources you have, the better the quality of the representations

Links
1) Word2Vec illustration: http://jalammar.github.io/illustrated-word2vec
2) nmslib, efficient aNN search: https://github.com/nmslib/nmslib
3) StarSpace for space fusion: https://github.com/facebookresearch/StarSpace
4) DSSM: https://www.microsoft.com/en-us/research/project/dssm
5) Pinterest multimodal learning: https://labs.pinterest.com/user/themes/pin_labs/assets/paper/training-and-evaluating.pdf
6) Mixture of embedding experts: https://arxiv.org/pdf/1804.02516.pdf
