Text Embedding Benchmark for Recommender Systems

July 5, 2025 · technical · AI/ML · About 4 min

Embedding models encode multimodal information such as images and text into high-dimensional vectors, so that the relationship between two pieces of content can be computed as the distance between their embedding vectors. Online services such as search engines and recommender systems rely on this property, and text embeddings are the most widely used kind. Major AI service providers offer text embedding APIs, and many open-source text embedding models are available for self-hosting. The current mainstream evaluation standard for text embedding models is MTEB. However, MTEB does not assess how well text embedding models perform in recommender systems, which is what this post sets out to evaluate.

Measurement

Embedding vectors in recommender systems are typically used for similar or related item recommendations, with the goal that the distance between embedding vectors accurately reflects the similarity or relevance between items. There is no ground-truth relevance or similarity between items, but in recommender systems the overlap of users between items, denoted $s_{ij}$, can be used as an approximate measure:

$$s_{ij} = \frac{|U_i \cap U_j|}{|U_i \cup U_j|}$$

Here, $U_i$ and $U_j$ represent the user sets of items $i$ and $j$, respectively. A higher $s_{ij}$ indicates greater user overlap between items $i$ and $j$. The distance between embedding vectors can be calculated as the Euclidean distance:

$$d_{ij} = ||\textbf{v}_i - \textbf{v}_j||$$

Here, $\textbf{v}_i$ and $\textbf{v}_j$ represent the embedding vectors of items $i$ and $j$, respectively. To evaluate how well the distance between embedding vectors reflects the similarity or relevance between items, we use Top-K recall ($Recall@K$): the proportion of the $K$ items with the smallest embedding distance to item $i$ that also appear among the $K$ items with the highest user overlap with item $i$. A higher Top-K recall indicates that the embedding vectors are better at recommending similar or relevant items.
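
As a concrete illustration, both quantities can be computed directly from per-item user sets and embedding distances. The sketch below is a minimal, self-contained example with toy data; the function and variable names are illustrative and are not taken from the benchmark script.

```python
import numpy as np

def jaccard_similarity(users_i: set, users_j: set) -> float:
    """User-overlap similarity s_ij between two items."""
    union = users_i | users_j
    return len(users_i & users_j) / len(union) if union else 0.0

def recall_at_k(distances: np.ndarray, similarities: np.ndarray, k: int) -> float:
    """Recall@K for a single item.

    distances[j]    -- embedding distance d_ij from this item to candidate j
    similarities[j] -- user-overlap similarity s_ij to candidate j
    """
    nearest = set(np.argsort(distances)[:k])        # K closest items in embedding space
    relevant = set(np.argsort(-similarities)[:k])   # K items with highest user overlap
    return len(nearest & relevant) / k

# Toy data: users who rated two movies
print(jaccard_similarity({1, 2, 3, 5}, {2, 3, 7}))  # 2 / 5 = 0.4

# Toy data: 6 candidate items for one query item, K = 3
d = np.array([0.2, 0.9, 0.4, 0.8, 0.1, 0.7])   # embedding distances
s = np.array([0.5, 0.1, 0.4, 0.05, 0.6, 0.2])  # user-overlap similarities
print(recall_at_k(d, s, k=3))  # 1.0 -- the 3 nearest items are also the 3 most co-rated
```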

The dataset used in the experiment is MovieLens 1M, which contains one million user ratings of movies. We convert each movie's synopsis into an embedding vector with the text embedding model under test, compute the Top-K recall for each movie, and take the average over all movies as the model's performance measurement for recommender systems.
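
The per-movie user sets needed for $s_{ij}$ can be read directly from the MovieLens 1M ratings file, in which each line has the form `UserID::MovieID::Rating::Timestamp`. The sketch below is a simplified loader under that assumption; embedding the synopses and averaging Recall@K over all movies then follows the logic shown above.

```python
from collections import defaultdict

def load_user_sets(path: str = "ml-1m/ratings.dat") -> dict[int, set[int]]:
    """Map each MovieLens movie ID to the set of users who rated it."""
    user_sets: dict[int, set[int]] = defaultdict(set)
    with open(path, encoding="latin-1") as f:
        for line in f:
            user_id, movie_id, _rating, _timestamp = line.strip().split("::")
            user_sets[int(movie_id)].add(int(user_id))
    return user_sets

# user_sets = load_user_sets()
# For each movie i: compute d_ij to every other movie from the synopsis embeddings,
# compute s_ij from user_sets, take Recall@K, and average over all movies.
```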

Text Embedding Models

For open-source models, we selected the top three models by download count on Ollama:

For commercial models, we chose the strongest text embedding models from Alibaba Cloud and OpenAI: text-embedding-v4 and text-embedding-3-large.

Note

Testing commercial models incurs costs, so to control expenses we did not test models from additional service providers. We have open-sourced the testing script and welcome you to run it with other models.

Results

(Figure: Text Embedding Benchmark for Recommender Systems)

Based on the results, the following conclusions can be drawn:

  1. At the same dimensionality, commercial models such as text-embedding-v4 and text-embedding-3-large generally outperform open-source models, while the differences between commercial models are minimal.
  2. Among open-source models, mxbai-embed-large performs best and is the preferred choice for self-hosting.
  3. Dimensionality significantly impacts model performance. As the dimension increases, Top-K recall generally improves, but with diminishing returns, so recommendation precision must be balanced against computational and storage costs when choosing a dimension.

Tips

Unless computational or storage resources are strictly limited, it is recommended to use embedding dimensions of 512 or higher to achieve better recommendation precision.
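
For commercial APIs that support configurable output dimensions, the target dimension can be requested directly at embedding time. Below is a minimal sketch using the OpenAI Python SDK, which accepts a dimensions parameter for text-embedding-3-large; the synopsis text is a placeholder.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.embeddings.create(
    model="text-embedding-3-large",
    input=["A retired detective becomes obsessed with a woman he is hired to follow."],
    dimensions=512,  # request 512-dimensional vectors directly
)
vector = response.data[0].embedding
print(len(vector))  # 512
```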

In addition to the quantitative comparison, we can also evaluate model performance intuitively through the recommendation results. Below are the similar movies recommended for "Vertigo" using embeddings of two different dimensions from text-embedding-v4:

| Movie | Year | Director | Genre |
| --- | --- | --- | --- |
| Rear Window | 1954 | Alfred Hitchcock | Suspenseful Thriller |
| Spellbound | 1945 | Alfred Hitchcock | Psychological Thriller |
| Psycho | 1960 | Alfred Hitchcock | Psychological Thriller/Horror |
| North by Northwest | 1959 | Alfred Hitchcock | Suspense Thriller |
| Stage Fright | 1950 | Alfred Hitchcock | Mystery Thriller |
| Dial M for Murder | 1954 | Alfred Hitchcock | Mystery Thriller |
| The Spiral Staircase | 1946 | Robert Siodmak | Psychological Thriller/Noir |
| The Hitch-Hiker | 1953 | Ida Lupino | Noir Thriller |
| Shadow of a Doubt | 1943 | Alfred Hitchcock | Suspense Thriller |
| Woman on the Verge of a Nervous Breakdown | 1988 | Pedro Almodóvar | Dark Comedy/Drama |

For users who have just watched Vertigo, the movies recommended by the 2048-dimensional embedding vectors exhibit higher relevance and similarity to Vertigo.

Conclusion

This post evaluated the performance of text embedding models in recommender systems. The results indicate that commercial models generally outperform open-source models, with mxbai-embed-large standing out among the open-source options. Dimensionality significantly affects model performance, and embedding dimensions of 512 or higher are recommended for better recommendation results.