OneRec: A Recommendation Model Based on Large Language Models
OneRec differs from traditional recommendation architectures in that it is an end-to-end video recommendation model. It does not rely on the complex multi-stage pipeline of traditional systems, such as recall, coarse ranking, ranking, and re-ranking; instead, a single model covers the entire recommendation pipeline. The idea is mainly inspired by the success of today's large language models (LLMs): the authors argue that once an LLM has enough data and a large enough model size, it can achieve excellent inference results, and the recommendation field is certainly not short of data, so scaling up the model size should likewise yield strong results. A second motivation is that, in scenarios with a huge user base, the recommendation pipeline is very long and maintained by multiple teams with different objectives and optimization goals, which leaves the overall pipeline with a heavy legacy burden and considerable inconsistency. In addition, the GPU utilization of the many online recommendation models is very low, so compute resources are severely underused. For these two reasons, OneRec was developed and successfully deployed on Kuaishou's short-video service.
The most significant difference between OneRec and traditional recommendation models is its adoption of a generative architecture similar to LLMs: items are no longer retrieved and ranked, but generated. Because OneRec generates items directly, it needs no recall or ranking stages, making it truly end-to-end.
The overall workflow of OneRec is quite similar to LLM training. It consists of three core components, namely a tokenizer, an encoder, and a decoder, plus a reward system used for fine-tuning in later training stages. The tokenizer maps hundreds of millions of items to semantic IDs, which is essentially a clustering process. The encoder feeds various user behavior features (which videos were watched, dwell time, likes, comments, and so on) into the model to improve its understanding of the user; this part essentially inherits the feature-stacking approach of traditional recommendation models. The decoder then performs sequential recommendation: based on the user profile and the sequence of watched items, it generates, step by step, a string of semantic IDs that match the user's preferences, which are finally mapped back to concrete items for recommendation.
Tokenizer
In large-scale recommendation scenarios, the number of items is extremely large, ranging from millions to hundreds of millions. Directly modeling item IDs would make the item ID embeddings extremely sparse. OneRec's approach is to feed a video's caption, tags, automatic speech recognition (ASR) text, optical character recognition (OCR) text, cover image, and 5 uniformly sampled frames into a multimodal large model, miniCPM-V-8B, to obtain high-dimensional feature vectors. A lightweight QFormer then compresses these high-dimensional representations, which preserves the information while making subsequent processing easier.
Specifically, the cover image, 5 evenly sampled frames, the caption, tags, ASR text, and OCR text are fed into miniCPM-V-8B to obtain M∈R^{N_{M}×d_{t}} (N_{M}=1280, d_{t}=512). A QFormer is then used to compress M into \tilde{M}∈R^{N_{\tilde{M}}×d_{t}} with N_{\tilde{M}}=4 learnable query tokens, each layer applying self-attention over the queries, cross-attention to M, and a feed-forward network:

Q^{l+1} = FFN(CrossAttn(SelfAttn(Q^{l}), M)),  l = 1, …, N_{c}

where

N_{c} = 4 is the number of QFormer layers, and the output of the final layer is taken as \tilde{M}.
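To make this compression step concrete, here is a minimal PyTorch sketch of query-based compression in the spirit of the QFormer described above. It is not OneRec's implementation: the class name, the omission of layer normalization, and the default hyperparameters are assumptions based only on the numbers quoted above (d_t = 512, 4 queries, 4 layers).

```python
import torch
import torch.nn as nn

class QFormerCompressor(nn.Module):
    """Minimal sketch of query-based compression: N_q learnable queries
    attend to the N_M multimodal tokens and produce a short summary."""
    def __init__(self, d_t=512, n_queries=4, n_layers=4, n_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, d_t))
        self.self_attn = nn.ModuleList(
            nn.MultiheadAttention(d_t, n_heads, batch_first=True) for _ in range(n_layers))
        self.cross_attn = nn.ModuleList(
            nn.MultiheadAttention(d_t, n_heads, batch_first=True) for _ in range(n_layers))
        self.ffn = nn.ModuleList(
            nn.Sequential(nn.Linear(d_t, 4 * d_t), nn.GELU(), nn.Linear(4 * d_t, d_t))
            for _ in range(n_layers))

    def forward(self, m):                      # m: (B, N_M, d_t) from miniCPM-V-8B
        q = self.queries.unsqueeze(0).expand(m.size(0), -1, -1)
        for sa, ca, ff in zip(self.self_attn, self.cross_attn, self.ffn):
            q = q + sa(q, q, q)[0]             # queries exchange information
            q = q + ca(q, m, m)[0]             # queries read from the item tokens
            q = q + ff(q)                      # position-wise feed-forward
        return q                               # (B, N_q, d_t), i.e. the compressed \tilde{M}
```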
Furthermore, to make the item embeddings more discriminative, OneRec performs two additional operations to enhance semantic distinctiveness:
(1). Item Pairs Construction: There are two methods for constructing item pairs:
- One is to form pairs by combining the target item clicked by the user with the most recent positive behavior item.
- The other is to form pairs using items with an I2I (Item-to-Item) score higher than a certain threshold, such as those calculated by the Swing algorithm.
(2). LLaMA3 Caption: LLaMA3 is used as a decoder to predict the next token of the video caption (note that the caption embedding here is shared with the miniCPM caption embedding mentioned above), so as to learn video representations that align with the actual user-item behavior distribution.
Therefore, the final loss combines the two objectives above, an item-pair term and a caption next-token prediction term:

L = L_{pair} + L_{caption},  L_{caption} = −∑_{k} log P(t^{k} | t^{<k}, \tilde{M})

where

t^{k} indicates the k-th caption token;

\tilde{M}∈R^{N_{\tilde{M}}×d_{t}} is the compressed representation of M.
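As a rough illustration of how such a combined objective could be computed, the sketch below pairs an InfoNCE-style item-pair term with a caption cross-entropy term. The function name, the InfoNCE formulation, the temperature, and the equal weighting of the two terms are my assumptions, not details from the paper.

```python
import torch
import torch.nn.functional as F

def tokenizer_loss(pair_emb_a, pair_emb_b, caption_logits, caption_tokens, tau=0.07):
    """Hypothetical sketch of the combined tokenizer objective:
    an InfoNCE-style item-pair term plus a caption next-token term."""
    # Item-pair term: embeddings of paired items (target item / recent positive
    # item, or high-Swing-score items) should be close, other items far apart.
    a = F.normalize(pair_emb_a, dim=-1)          # (B, d)
    b = F.normalize(pair_emb_b, dim=-1)          # (B, d)
    logits = a @ b.t() / tau                     # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    l_pair = F.cross_entropy(logits, targets)

    # Caption term: next-token prediction over caption tokens, conditioned on
    # the compressed representation \tilde{M} (already folded into the logits).
    l_caption = F.cross_entropy(
        caption_logits.reshape(-1, caption_logits.size(-1)),
        caption_tokens.reshape(-1))
    return l_pair + l_caption
```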
After obtaining the compressed item embeddings, how does OneRec turn them into generatable tokens? It uses a multi-level balanced quantization mechanism, the residual K-means quantization algorithm (RQ-Kmeans). Roughly speaking, each embedding goes through three levels of clustering: the first clustering yields the first-level ID and a residual is computed; clustering the residuals yields the second-level ID; and the process repeats once more for the third level. Each video is ultimately represented by a three-tier "coarse-medium-fine" semantic code {s_{m}^{1}, s_{m}^{2}, s_{m}^{3}}, where s_{m}^{1} is the index of the first-level cluster center that the embedding belongs to, and similarly for s_{m}^{2} and s_{m}^{3}. This coding scheme supports both broad categorization and fine-grained distinctions of style and preference, and it allows a single server to efficiently "tokenize" hundreds of millions of videos. During subsequent recommendation, the semantic IDs that best match a user's interests can be generated directly and then mapped back to concrete videos, greatly reducing system complexity. The specific algorithm steps are as follows:
1). At the first level of clustering, the initial residual is defined as r^{1}_{i}=e_{i}, where e_{i} is the compressed item embedding;
2). Each level l has a codebook C^{l}={c^{l}_{1},c^{l}_{2},…,c^{l}_{K}}, where K is the codebook size, i.e., the number of clusters, and C^{l} is the set of cluster centers at that level. To keep the codebook balanced, the Balanced K-Means algorithm is used to partition the item set. The algorithm is simple: it adds to K-Means the constraint that every cluster contains the same number of samples, w=∣V∣/K, where V is the video set;
3). Find the index of the nearest center vector: s_{i}^{l}=argmin_{k} ||r_{i}^{l}−c_{k}^{l}||_{2};
4). The residual for the next level is defined as r_{i}^{l+1}=r_{i}^{l}−c_{s_{i}^{l}}^{l}.
In this way, the video token v_{i}={s_{i}^{1},s_{i}^{2},…,s_{i}^{L}} is produced level by level from the residual cluster centers through the above L-level hierarchical indexing mechanism.
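To make the quantization procedure concrete, below is a small Python sketch of the residual quantization loop. It uses ordinary K-Means from scikit-learn in place of the Balanced K-Means described above (the balanced variant additionally enforces clusters of equal size w = |V|/K), and the codebook size used here is an arbitrary placeholder rather than OneRec's actual value.

```python
import numpy as np
from sklearn.cluster import KMeans  # plain K-Means stands in for Balanced K-Means

def rq_kmeans(embeddings, num_levels=3, codebook_size=8192, seed=0):
    """Sketch of residual quantization: at each level, cluster the current
    residuals, record the nearest-center index as that level's semantic ID,
    and subtract the chosen center to form the next level's residuals."""
    residuals = embeddings.copy()                 # r^1_i = e_i
    codes = np.zeros((len(embeddings), num_levels), dtype=np.int64)
    codebooks = []
    for level in range(num_levels):
        km = KMeans(n_clusters=codebook_size, random_state=seed).fit(residuals)
        codebooks.append(km.cluster_centers_)     # C^l
        codes[:, level] = km.labels_              # s^l_i = argmin_k ||r^l_i - c^l_k||
        residuals = residuals - km.cluster_centers_[km.labels_]  # r^{l+1}_i
    return codes, codebooks

# Each row of `codes` is one video's (s^1, s^2, s^3) semantic ID triple.
```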
Encoder
The encoder of OneRec incorporates four types of user-related features, which mainly include:
- Static User Features: These encompass user ID, age, and gender. Each feature has its own embedding.
- Short-Term Behavior Pathway: This pathway processes the most recent (L_{s} = 20) user interaction records, including video ID, author ID, tags, timestamp, playback duration, total video duration, and interaction labels. Each feature is embedded individually.
- Positive Feedback Behavior Pathway: This pathway handles sequences of user interactions that indicate high engagement (such as likes, follows, etc.), with a maximum length of L_{p} = 256.
- Lifecycle Pathway: This pathway handles extremely long historical behavior sequences (up to 100,000 entries). The per-interaction features are first embedded and concatenated in the same way as in the short-term behavior pathway, and a QFormer with N_{q} = 128 query vectors and N_{l} = 2 layers is then used to produce the final compressed lifecycle features.
Finally, the encoder concatenates the features output by these four pathways, adds positional encoding, and feeds the result into a stack of standard Transformer encoder layers. Each layer lets all positions attend to each other through full self-attention, followed by a feed-forward computation. After this multi-layer processing, OneRec obtains a comprehensive interest representation that captures both short-term hotspots and long-term preferences, laying a solid foundation for subsequent personalized recommendation.
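The skeleton below shows, under assumed shapes and sizes, how the four pathway outputs might be concatenated, given positional encodings, and passed through standard Transformer encoder layers. None of the module names or dimensions come from the paper; it is only meant to make the data flow explicit.

```python
import torch
import torch.nn as nn

class OneRecStyleEncoder(nn.Module):
    """Hypothetical encoder skeleton: concatenate the four pathway outputs,
    add positional encodings, and run standard Transformer encoder layers."""
    def __init__(self, d_model=1024, n_layers=4, n_heads=8, max_len=512):
        super().__init__()
        self.pos = nn.Parameter(torch.zeros(1, max_len, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, static_feat, short_term, positive_fb, lifecycle):
        # Each input: (B, L_x, d_model) token sequence from its own pathway.
        x = torch.cat([static_feat, short_term, positive_fb, lifecycle], dim=1)
        x = x + self.pos[:, : x.size(1)]        # learned positional encoding
        return self.encoder(x)                  # fused user interest representation
```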
Decoder
Before explaining the decoder, it is necessary to first clarify the concept of a semantic ID sequence, which is analogous to a sentence in natural language. Here, the sequence is composed of the tokens generated by the aforementioned RQ-Kmeans; concretely, it represents the user's click sequence, which typically contains 5 to 10 videos. It can be expressed as:

[BOS] (s_{1}^{1}, s_{1}^{2}, s_{1}^{3}) [BOS] (s_{2}^{1}, s_{2}^{2}, s_{2}^{3}) [BOS] … (s_{n}^{1}, s_{n}^{2}, s_{n}^{3})

During training, each item's token group (i.e., the part in parentheses) is separated by BOS.
Returning to the decoder, it is not very different from a standard Transformer decoder, except that the last layer is replaced with an MoE (Mixture of Experts) structure to keep inference fast. At inference time, starting from the beginning-of-sequence symbol (BOS), the user's expected click sequence is generated step by step.
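As an illustration of the step-by-step generation, here is a hedged sketch of greedy decoding over semantic IDs. The real system uses beam search and batched inference; the `decoder_logits` callable, the BOS handling, and the number of generated items are placeholders I introduced for illustration.

```python
import torch

BOS_ID = 0          # placeholder id for the [BOS] separator
LEVELS = 3          # each item = 3 semantic IDs (coarse / medium / fine)

@torch.no_grad()
def generate_items(decoder_logits, user_state, num_items=8):
    """Greedy sketch: repeatedly emit [BOS] followed by 3 semantic IDs per item."""
    seq = [BOS_ID]
    items = []
    for _ in range(num_items):
        sids = []
        for _ in range(LEVELS):
            logits = decoder_logits(user_state, torch.tensor([seq]))  # (1, vocab)
            next_id = int(logits.argmax(dim=-1))                      # greedy choice
            seq.append(next_id)
            sids.append(next_id)
        items.append(tuple(sids))       # one (s1, s2, s3) triple per generated item
        seq.append(BOS_ID)              # separator before the next item
    return items                        # later mapped back to concrete video IDs
```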
A point that is often confusing here is how the correspondence between semantic IDs and actual video IDs is handled, since what is ultimately shown to users is a list of real videos. The specific process is shown in Figure 5:
There is a mapping from semantic tokens to video IDs. The paper does not elaborate on how this mapping is implemented; my guess is that it takes the intersection of the videos in each cluster along the generated sequence. In practice, the semantic ID space (of size N_{t}) is much larger than the number of video IDs. This design guarantees coverage of all items, and the larger vocabulary brings more model parameters and thus better performance, but it also means the model may generate semantic ID sequences during inference that cannot be mapped to any real video ID, which is known as "invalid generation". In most cases, though, semantic IDs and video IDs can be treated as having a one-to-one correspondence. The "invalid generation" problem is handled during training by introducing a format reward, consistent with the ECPO reinforcement learning objective.
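Since the paper leaves the mapping unspecified, the sketch below only shows one straightforward possibility: a dictionary keyed by the (s1, s2, s3) triple built offline from the tokenizer output, which also makes the validity check behind the format reward obvious. This is my own guess, not OneRec's actual implementation.

```python
from collections import defaultdict

# Built offline from the tokenizer output: (s1, s2, s3) -> list of video IDs.
sid_to_videos = defaultdict(list)

def register(video_id, sid_triple):
    sid_to_videos[tuple(sid_triple)].append(video_id)

def lookup(sid_triple):
    """Return candidate video IDs for a generated semantic ID triple.
    An empty result corresponds to an 'invalid generation'."""
    return sid_to_videos.get(tuple(sid_triple), [])

def is_valid(sid_triple):
    return len(lookup(sid_triple)) > 0
```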
Reinforcement Learning
Through the training above, the model can already generate items that match user interests. Real recommendation services, however, are more complex and must account for watch time, clicks, conversions, diversity, and other factors. To handle this, OneRec first uses a small neural network to fuse multiple feedback signals such as clicks, likes, and watch time into a single "P-Score", and then uses an algorithm called ECPO (Early Clipped GRPO) to keep optimizing the model against this score, so that its recommendations better match the overall business objectives.
- Small Fine-Ranking Model
In fact, this small model is a fine-ranking model that adopts a multi-objective Mixture of Experts (MoE) architecture. Here OneRec uses the Pantheon method, recently proposed by Kuaishou, to implement the multi-objective prediction.
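Since the Pantheon details are not covered here, the sketch below is only a generic multi-gate MoE for multi-objective prediction (in the spirit of MMoE), with an assumed linear fusion of the per-objective predictions into a single P-Score. It should not be read as the Pantheon method itself; the expert sizes, objectives, and fusion weights are all invented for illustration.

```python
import torch
import torch.nn as nn

class MultiObjectiveMoE(nn.Module):
    """Generic multi-gate MoE: shared experts, one gate and one head per objective."""
    def __init__(self, d_in, n_experts=8, objectives=("click", "like", "watch_time")):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_in, 256), nn.ReLU(), nn.Linear(256, 128))
            for _ in range(n_experts))
        self.gates = nn.ModuleDict({o: nn.Linear(d_in, n_experts) for o in objectives})
        self.heads = nn.ModuleDict({o: nn.Linear(128, 1) for o in objectives})
        self.objectives = objectives

    def forward(self, x):                                     # x: (B, d_in)
        e = torch.stack([exp(x) for exp in self.experts], 1)  # (B, E, 128)
        preds = {}
        for o in self.objectives:
            w = torch.softmax(self.gates[o](x), dim=-1)       # (B, E) gate weights
            preds[o] = torch.sigmoid(self.heads[o]((w.unsqueeze(-1) * e).sum(1)))
        return preds

def p_score(preds, weights=None):
    # Assumed linear fusion of the objective predictions into a single reward score.
    weights = weights or {"click": 0.3, "like": 0.2, "watch_time": 0.5}
    return sum(weights[o] * preds[o] for o in weights)
```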
- Early Clipped GRPO
The reinforcement learning part of OneRec improves on DeepSeek's GRPO and proposes ECPO (Early Clipped GRPO). The objective keeps the GRPO form but early-clips the policy ratio when the advantage is negative, roughly as follows:

J_{ECPO}(θ) = E[ (1/G) ∑_{i=1}^{G} min( \tilde{r}_{i}(θ) A_{i}, clip(\tilde{r}_{i}(θ), 1−ε, 1+ε) A_{i} ) ],  r_{i}(θ) = π_{θ}(o_{i}|u) / π_{θ_old}(o_{i}|u)

where

\tilde{r}_{i}(θ) = r_{i}(θ) when A_{i} ≥ 0, while for A_{i} < 0 the ratio is early-clipped, e.g. \tilde{r}_{i}(θ) = r_{i}(θ) / sg[max(r_{i}(θ)/(1+δ), 1)], so that its value never exceeds 1+δ;

sg represents the stop-gradient operation;

δ is a hyperparameter greater than 0.
The overall approach of ECPO is similar to GRPO: the generative model performs beam search to produce multiple recommendation paths for a user, and a Reward Model (RM) scores each path to obtain a reward. The difference lies in the handling of negative improvements: in the OneRec scenario, GRPO's unclipped policy ratio for negative advantages leads to gradient explosion, so a parameter δ (set to 0.1 in the paper) is introduced to make training more stable. As Figure 6 shows, when A < 0 the ratio is clipped early and directly; for positive improvements, ECPO remains consistent with GRPO.
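Below is a hedged PyTorch sketch of the early-clipping idea, written to match the description above rather than OneRec's code. The exact form of the early clip (here, rescaling the ratio through a detached denominator so its value never exceeds 1+δ) is my reading of the method, not a verified reproduction.

```python
import torch

def ecpo_loss(logp_new, logp_old, advantages, eps=0.2, delta=0.1):
    """Sketch of ECPO: a GRPO-style clipped objective, but for negative
    advantages the ratio is clipped early so gradients cannot explode."""
    ratio = torch.exp(logp_new - logp_old)                     # r_i(theta)

    # Early clip: for A < 0, cap the ratio's value at (1 + delta) via a
    # detached (stop-gradient) denominator, keeping gradients bounded.
    capped = ratio / torch.clamp(ratio / (1.0 + delta), min=1.0).detach()
    r_tilde = torch.where(advantages >= 0, ratio, capped)

    # Standard GRPO/PPO-style clipped surrogate on top of the adjusted ratio.
    surr1 = r_tilde * advantages
    surr2 = torch.clamp(r_tilde, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(surr1, surr2).mean()
```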
- Format Reward
There is a mapping process from semantic IDs to video IDs. A semantic ID is considered valid if a corresponding video ID can be found; otherwise, it is invalid (i.e., the semantic ID does not exist in the content pool).
OneRec mentions a “squeezing effect” here, which refers to the phenomenon where, after introducing reinforcement learning, changes in the model’s probability distribution cause the generation probability of some originally high-probability valid semantic IDs to be squeezed to a level close to that of invalid semantic IDs. This makes it difficult for the model to distinguish between valid and invalid semantic IDs.
This happens because, in reinforcement learning, and especially when training with negative advantages, the model adjusts its probability distribution to reduce the likelihood of generating low-reward items, and these adjustments can excessively compress the generation probability of some valid semantic IDs. Although ECPO limits the policy gradient of negative-advantage samples through early clipping to avoid gradient explosion, the handling of negative advantages can still cause the probability of valid semantic IDs to be squeezed.
To address the issues caused by the squeezing effect, OneRec introduces a format reward mechanism in reinforcement learning to encourage the model to generate valid semantic IDs, thereby improving the validity of generation results.
- Industrial Rewards
For different business scenarios, OneRec also supports incorporating “industrial rewards” into the reward mechanism. For example, if the platform intends to appropriately reduce the exposure of low-quality content or increase the visibility of new creators, it can assign different positive or negative scores to such content in the reward function. This allows the model to naturally take these business or ecological indicators into account during the unified learning process.
For example, by suppressing a specific type of content in the reward mechanism, OneRec reduced that content's exposure by 9.59% without compromising the watch-time (duration) metric.
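To tie the reward pieces together, here is a hedged sketch of how a total reward could be assembled from the P-Score, the format reward, and business ("industrial") adjustments. It reuses the `lookup` helper from the mapping sketch above; the weights and the specific bonus/penalty values are invented for illustration, and only the three ingredients themselves come from the text.

```python
def total_reward(sid_triples, p_scores, video_meta, w_format=1.0, w_industrial=1.0):
    """Combine P-Score, format reward, and industrial adjustments per generated item."""
    rewards = []
    for sids, p in zip(sid_triples, p_scores):
        videos = lookup(sids)                      # mapping sketch from earlier
        format_r = 1.0 if videos else -1.0         # penalize invalid generations
        industrial_r = 0.0
        if videos:
            meta = video_meta[videos[0]]
            if meta.get("low_quality"):            # e.g. suppress low-quality content
                industrial_r -= 1.0
            if meta.get("new_creator"):            # e.g. boost new creators
                industrial_r += 0.5
        rewards.append(p + w_format * format_r + w_industrial * industrial_r)
    return rewards
```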
Training Process
- Pretraining
The input to OneRec's pretraining phase is the user behavior representation, with the model structure shown in Figure 4. The output is the target item sequence for the user, where each target item corresponds to a 3-level semantic ID, i.e., 3 tokens. In OneRec's business scenario this yields 18 billion samples per day, corresponding to 54 billion decoder tokens (3 tokens per sample). A 0.935B-parameter OneRec model needs roughly 100 billion samples to converge.
- Posttraining
Posttraining includes online training with real-time data, rejection sampling fine-tuning, and reinforcement learning:
(1). Rejection Sampling: Filter out the 50% of samples with the shortest playback duration.
(2). RL (Reinforcement Learning): Randomly select 1% of users from the rejection-sampled data to generate reinforcement learning samples. For each of these users, 512 items are generated, scored by the RM (Reward Model), and then fed to the RL stage for training.
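The sketch below strings the two post-training steps together. The samples are assumed to be dicts with a "user_id" field, and `playback_duration`, `generate_candidates`, `reward_model`, and `ecpo_update` are placeholder callables passed in by the caller; only the 50%, 1%, and 512 figures come from the text above.

```python
import random

def posttrain_step(samples, model, reward_model, generate_candidates, ecpo_update,
                   playback_duration, rl_user_ratio=0.01, num_candidates=512):
    """Sketch of the two post-training steps described above."""
    # (1) Rejection sampling: drop the 50% of samples with the shortest playback.
    samples = sorted(samples, key=playback_duration)
    kept = samples[len(samples) // 2:]

    # (2) RL: pick ~1% of users, generate 512 candidates each, score them with
    #     the RM, and update the generator with ECPO.
    users = list({s["user_id"] for s in kept})
    rl_users = random.sample(users, max(1, int(len(users) * rl_user_ratio)))
    for user in rl_users:
        candidates = generate_candidates(model, user, n=num_candidates)
        rewards = [reward_model(user, c) for c in candidates]
        ecpo_update(model, user, candidates, rewards)
    return kept
```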
The entire training process is shown in the following figure:
Performance
From the reported metrics, the improvement from pure OneRec alone is not that significant; it is only after adding the RM (Reward Model) that the gains become substantial. Since the RM is essentially a reward model similar to a fine-ranking model, this generative approach has not yet fully stepped out of the shadow of traditional fine-ranking, although it must be acknowledged that it outperforms the fine-ranking models themselves. In addition, in Kuaishou's local lifestyle service scenario, OneRec achieved a 21.01% increase in GMV, 17.89% growth in order volume, and an 18.58% rise in purchasing users, with new-customer acquisition efficiency improving by 23.02%.