Candidate generation overview
Candidate generation is the first stage of the recommendation pipeline. Given a query, the system generates a set of relevant candidate items. The following table shows two common candidate generation approaches:
| Type | Definition | Example |
|------|------------|---------|
| **Content-based filtering** | Uses *similarity between items* to recommend items similar to what the user likes. | If user A watches two cute cat videos, the system can recommend cute animal videos to that user. |
| **Collaborative filtering** | Uses *similarities between queries and items simultaneously* to provide recommendations. | If user A is similar to user B, and user B likes video 1, the system can recommend video 1 to user A (even if user A hasn't watched any videos similar to video 1). |
Embedding space
Both content-based and collaborative filtering map each item and each query (or context) to an embedding vector in a common embedding space \(E = \mathbb R^d\). Typically, the embedding space is low-dimensional (that is, \(d\) is much smaller than the size of the corpus) and captures some latent structure of the item or query set. Similar items, such as YouTube videos that are usually watched by the same user, end up close together in the embedding space. The notion of "closeness" is defined by a similarity measure.
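As a toy illustration of what such an embedding space looks like in practice, the sketch below builds a random table of item embeddings and a query embedding in the same space. The corpus size, dimensionality, and random values are illustrative assumptions, not part of any real system.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy embedding table: each of the 1,000 items in the corpus gets a
# d-dimensional vector, with d much smaller than the corpus size.
corpus_size, d = 1_000, 8
item_embeddings = rng.normal(size=(corpus_size, d))

# A query (or user/context) lives in the same space E = R^d.
query_embedding = rng.normal(size=d)
```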
Similarity measures
A similarity measure is a function \(s : E \times E \to \mathbb R\) that takes a pair of embeddings and returns a scalar measuring their similarity. The embeddings can be used for candidate generation as follows: given a query embedding \(q \in E\), the system looks for item embeddings \(x \in E\) that are close to \(q\), that is, embeddings with high similarity \(s(q, x)\).
To determine the degree of similarity, most recommendation systems rely on one or more of the following: cosine, dot product, and Euclidean distance.
Cosine
This is simply the cosine of the angle between the two vectors, \(s(q, x) = \cos(q, x)\).
Dot product
The dot product of two vectors is \(s(q, x) = \langle q, x \rangle = \sum_{i = 1}^d q_i x_i\). It is also given by \(s(q, x) = \|x\| \|q\| \cos(q, x)\) (the cosine of the angle multiplied by the product of norms). Thus, if the embeddings are normalized, dot product and cosine coincide.
Euclidean distance
This is the usual distance in Euclidean space, \(s(q, x) = \|q - x\| = \left[ \sum_{i = 1}^d (q_i - x_i)^2\right]^{\frac{1}{2}}\). A smaller distance means higher similarity. Note that when the embeddings are normalized, the squared Euclidean distance coincides with dot product (and cosine) up to a constant, since in that case \(\frac{1}{2}\|q - x\|^2 = 1 - \langle q, x \rangle\).
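All three measures are straightforward to compute directly. Below is a minimal NumPy sketch (the specific vectors are arbitrary examples); it also checks the identity above: for normalized embeddings, the dot product equals the cosine, and \(\frac{1}{2}\|q - x\|^2 = 1 - \langle q, x \rangle\).

```python
import numpy as np

def dot_product(q, x):
    """Dot-product similarity <q, x>."""
    return np.dot(q, x)

def cosine(q, x):
    """Cosine of the angle between q and x."""
    return np.dot(q, x) / (np.linalg.norm(q) * np.linalg.norm(x))

def euclidean_distance(q, x):
    """Euclidean distance; smaller means more similar."""
    return np.linalg.norm(q - x)

q = np.array([1.0, 2.0, 2.0])
x = np.array([2.0, 0.0, 1.0])

print(dot_product(q, x))         # 4.0
print(cosine(q, x))              # ~0.596
print(euclidean_distance(q, x))  # sqrt(6) ~ 2.449

# With normalized embeddings, dot product equals cosine, and
# 0.5 * ||q - x||^2 == 1 - <q, x>.
qn, xn = q / np.linalg.norm(q), x / np.linalg.norm(x)
assert np.isclose(dot_product(qn, xn), cosine(qn, xn))
assert np.isclose(0.5 * euclidean_distance(qn, xn) ** 2, 1 - dot_product(qn, xn))
```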

Comparing similarity measures
Consider the example in the figure to the right. The black vector illustrates the query embedding. The other three embedding vectors (Item A, Item B, Item C) represent candidate items. Depending on the similarity measure used, the ranking of the items can differ.
Using the image, try to determine the item ranking under all three similarity measures: cosine, dot product, and Euclidean distance.
Answer key
How did you do?
Item A has the largest norm, so it ranks highest according to the dot product. Item C has the smallest angle with the query, so it ranks first according to cosine similarity. Item B is physically closest to the query, so Euclidean distance favors it.
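If you don't have the figure handy, the following sketch uses hypothetical 2-D vectors chosen to reproduce the behavior described above (they are not the actual vectors from the figure) and ranks the items under each measure.

```python
import numpy as np

# Hypothetical 2-D embeddings, not the actual vectors from the figure.
query = np.array([1.0, 1.0])
items = {
    "Item A": np.array([5.0, 0.0]),  # largest norm
    "Item B": np.array([1.2, 0.8]),  # physically closest to the query
    "Item C": np.array([2.0, 2.0]),  # smallest angle to the query
}

def dot_product(q, x):
    return np.dot(q, x)

def cosine(q, x):
    return np.dot(q, x) / (np.linalg.norm(q) * np.linalg.norm(x))

def euclidean(q, x):
    return np.linalg.norm(q - x)

# Higher is better for dot product and cosine; lower is better for distance.
print(sorted(items, key=lambda k: dot_product(query, items[k]), reverse=True))
# ['Item A', 'Item C', 'Item B']
print(sorted(items, key=lambda k: cosine(query, items[k]), reverse=True))
# ['Item C', 'Item B', 'Item A']
print(sorted(items, key=lambda k: euclidean(query, items[k])))
# ['Item B', 'Item C', 'Item A']
```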

Which similarity measure?
Compared to the cosine, the dot-product similarity is sensitive to the norm of the embedding. That is, the larger the norm of an embedding, the higher the similarity (for items at an acute angle to the query) and the more likely the item is to be recommended. This can affect recommendations as follows:

- Items that appear very frequently in the training set (for example, popular YouTube videos) tend to have embeddings with large norms. If capturing popularity information is desirable, then you should prefer the dot product. However, if you're not careful, the popular items may end up dominating the recommendations. In practice, you can use other variants of similarity measures that put less emphasis on the norm of the item. For example, define \(s(q, x) = \|q\|^\alpha \|x\|^\alpha \cos(q, x)\) for some \(\alpha \in (0, 1)\); see the sketch after this list.
- Items that appear very rarely may not be updated frequently during training. Consequently, if they are initialized with a large norm, the system may recommend rare items over more relevant items. To avoid this problem, be careful about embedding initialization and use appropriate regularization. We will detail this problem in the first exercise.
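As a minimal sketch of the norm-dampened variant \(s(q, x) = \|q\|^\alpha \|x\|^\alpha \cos(q, x)\) mentioned in the first bullet, the example below compares a hypothetical high-norm "popular" item with a low-norm "niche" item at different values of \(\alpha\); the vectors and the helper name `scaled_similarity` are illustrative assumptions.

```python
import numpy as np

def scaled_similarity(q, x, alpha=0.5):
    """Variant s(q, x) = ||q||^alpha * ||x||^alpha * cos(q, x).

    alpha = 1 recovers the dot product (full norm/popularity effect);
    alpha = 0 recovers pure cosine (norm ignored).
    """
    cos = np.dot(q, x) / (np.linalg.norm(q) * np.linalg.norm(x))
    return (np.linalg.norm(q) ** alpha) * (np.linalg.norm(x) ** alpha) * cos

query = np.array([1.0, 1.0])
popular_item = np.array([4.0, 0.0])  # large norm, larger angle to the query
niche_item = np.array([1.0, 0.9])    # small norm, small angle to the query

# With alpha = 1 (dot product) the popular item scores higher; with
# alpha = 0 (cosine) the niche item wins; intermediate alphas interpolate.
for alpha in (1.0, 0.5, 0.0):
    print(alpha,
          scaled_similarity(query, popular_item, alpha),
          scaled_similarity(query, niche_item, alpha))
```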
Last updated 2024-07-26 (UTC).