Candidate generation overview
Candidate generation is the first stage of recommendation. Given a query, the
system generates a set of relevant candidates. The following table shows two
common candidate generation approaches:
| Type | Definition | Example |
|------|------------|---------|
| content-based filtering | Uses similarity between items to recommend items similar to what the user likes. | If user A watches two cute cat videos, the system can recommend cute animal videos to that user. |
| collaborative filtering | Uses similarities between queries and items simultaneously to provide recommendations. | If user A is similar to user B, and user B likes video 1, the system can recommend video 1 to user A (even if user A hasn't seen any videos similar to video 1). |
Embedding space
Both content-based and collaborative filtering map each item and each query
(or context) to an embedding vector in a common embedding space
\(E = \mathbb R^d\). Typically, the embedding space is low-dimensional
(that is, \(d\) is much smaller than the size of the corpus) and captures
some latent structure of the item or query set. Similar items, such as YouTube
videos that are usually watched by the same user, end up close together in the
embedding space. The notion of "closeness" is defined by a similarity measure.

**Extra resource:** [projector.tensorflow.org](http://projector.tensorflow.org/) is an interactive tool to visualize embeddings.
Similarity measures
A similarity measure is a function \(s : E \times E \to \mathbb R\) that
takes a pair of embeddings and returns a scalar measuring their similarity.
The embeddings can be used for candidate generation as follows: given a
query embedding \(q \in E\), the system looks for item embeddings
\(x \in E\) that are close to \(q\), that is, embeddings with high
similarity \(s(q, x)\).

To determine the degree of similarity, most recommendation systems rely
on one or more of the following:
Cosine

This is simply the cosine of the angle between the two
vectors, \(s(q, x) = \cos(q, x)\).
Dot product

The dot product of two vectors is
\(s(q, x) = \langle q, x \rangle = \sum_{i = 1}^d q_i x_i\).
It is also given by \(s(q, x) = \|x\| \|q\| \cos(q, x)\) (the cosine of the
angle multiplied by the product of norms). Thus, if the embeddings are
normalized, the dot product and cosine coincide.
Euclidean distance

This is the usual distance in Euclidean
space, \(s(q, x) = \|q - x\| = \left[ \sum_{i = 1}^d (q_i - x_i)^2\right]^{\frac{1}{2}}\).
A smaller distance means higher similarity. Note that when the embeddings
are normalized, the squared Euclidean distance coincides with the dot product
(and cosine) up to a constant, since in that
case \(\frac{1}{2}\|q - x\|^2 = 1 - \langle q, x \rangle\).
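To make the three measures concrete, here is a minimal NumPy sketch (the vectors are made up for illustration) that computes each measure for a query embedding and an item embedding, and checks the identity above for normalized embeddings:

```python
import numpy as np

def cosine(q, x):
    # Cosine of the angle between q and x.
    return np.dot(q, x) / (np.linalg.norm(q) * np.linalg.norm(x))

def dot_product(q, x):
    # Inner product <q, x>.
    return np.dot(q, x)

def euclidean_distance(q, x):
    # Usual distance in Euclidean space; smaller means more similar.
    return np.linalg.norm(q - x)

# Hypothetical query and item embeddings in R^3.
q = np.array([0.5, 1.0, -0.2])
x = np.array([0.4, 0.8,  0.1])

print(cosine(q, x), dot_product(q, x), euclidean_distance(q, x))

# With normalized embeddings, 0.5 * ||q - x||^2 == 1 - <q, x>.
q_n = q / np.linalg.norm(q)
x_n = x / np.linalg.norm(x)
assert np.isclose(0.5 * euclidean_distance(q_n, x_n) ** 2,
                  1 - dot_product(q_n, x_n))
```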

Comparing similarity measures

Consider the example in the figure to the right. The black vector illustrates the
query embedding. The other three embedding vectors (Item A, Item B, Item C)
represent candidate items. Depending on the similarity measure used, the
ranking of the items can be different.

Using the image, try to determine the item ranking using all three of the
similarity measures: cosine, dot product, and Euclidean distance.
Answer key

How did you do?

Item A has the largest norm, and is ranked higher according to the
dot product. Item C has the smallest angle with the query, and is thus
ranked first according to cosine similarity. Item B is physically
closest to the query, so Euclidean distance favors it.
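The figure itself is not reproduced here, but the situation is easy to recreate numerically. The vectors below are hypothetical stand-ins chosen so that Item A has the largest norm, Item C makes the smallest angle with the query, and Item B lies closest to the query; each measure then ranks a different item first.

```python
import numpy as np

q = np.array([2.0, 2.0])          # query embedding
items = {
    "A": np.array([5.0, 1.0]),    # largest norm
    "B": np.array([2.5, 2.0]),    # physically closest to q
    "C": np.array([1.5, 1.6]),    # smallest angle with q
}

def cosine(q, x):
    return np.dot(q, x) / (np.linalg.norm(q) * np.linalg.norm(x))

# Rank the items under each measure (best first).
by_dot = sorted(items, key=lambda k: np.dot(q, items[k]), reverse=True)
by_cos = sorted(items, key=lambda k: cosine(q, items[k]), reverse=True)
by_euc = sorted(items, key=lambda k: np.linalg.norm(q - items[k]))

print("dot product:", by_dot)   # ['A', 'B', 'C']
print("cosine:     ", by_cos)   # ['C', 'B', 'A']
print("euclidean:  ", by_euc)   # ['B', 'C', 'A']
```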

Which similarity measure?

Compared to the cosine, the dot-product similarity is sensitive to
the norm of the embedding. That is, the larger the norm of an
embedding, the higher the similarity (for items with an acute angle)
and the more likely the item is to be recommended. This can affect
recommendations as follows:

- Items that appear very frequently in the training set (for example,
  popular YouTube videos) tend to have embeddings with large norms.
  If capturing popularity information is desirable, then you should
  prefer the dot product. However, if you're not careful, the popular
  items may end up dominating the recommendations. In practice, you
  can use variants of the similarity measure that put less emphasis
  on the norm of the item, for example
  \(s(q, x) = \|q\|^\alpha \|x\|^\alpha \cos(q, x)\) for
  some \(\alpha \in (0, 1)\); see the sketch after this list.

- Items that appear very rarely may not be updated frequently during
  training. Consequently, if they are initialized with a large norm, the
  system may recommend rare items over more relevant items. To avoid this
  problem, be careful about embedding initialization, and use appropriate
  regularization. We will detail this problem in the first exercise.
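As a minimal sketch of the norm-damping variant above (the vectors and the choice \(\alpha = 0.5\) are illustrative assumptions, not values from the text), the function below interpolates between pure cosine (\(\alpha = 0\)) and the full dot product (\(\alpha = 1\)); smaller \(\alpha\) shifts preference away from large-norm, popular items toward items that point closest to the query.

```python
import numpy as np

def scaled_similarity(q, x, alpha=0.5):
    """Norm-damped similarity: ||q||^alpha * ||x||^alpha * cos(q, x).

    alpha = 0 recovers cosine similarity, alpha = 1 recovers the dot
    product; values in (0, 1) put less emphasis on the item norm.
    """
    cos = np.dot(q, x) / (np.linalg.norm(q) * np.linalg.norm(x))
    return (np.linalg.norm(q) ** alpha) * (np.linalg.norm(x) ** alpha) * cos

# A popular item with a large norm vs. a niche item pointing almost
# exactly along the query (hypothetical vectors for illustration).
q       = np.array([1.0, 1.0])
popular = np.array([4.0, 0.5])
niche   = np.array([0.9, 1.0])

for alpha in (0.0, 0.5, 1.0):
    print(alpha,
          scaled_similarity(q, popular, alpha),
          scaled_similarity(q, niche, alpha))
```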