Benchmark Text Embedding Models for RecSys in 2026
In the 2025 post Text Embedding Benchmark for Recommender Systems, we benchmarked the performance of text embedding models in similarity-based recommendations. Within six months of that post's publication, Alibaba Cloud and Google launched a new generation of open-source text embedding models: qwen3-embedding by Alibaba Cloud and embeddinggemma by Google. gorse-cli has also recently gained a benchmarking feature for text embedding models. This post uses gorse-cli and the playground dataset to benchmark popular open-source text embedding models.
Evaluation: 1-shot Similarity-based Recommendation
The 2026 benchmark uses a methodology closer to actual recommendation scenarios. The specific steps are as follows:
- Sample Split: For each user, their feedback is sorted chronologically. The latest feedback is taken as the test set, and the feedback immediately preceding it is taken as the training set. No model is trained; the training set item only serves as the query, and each candidate item is scored for ranking by its similarity to that item.
- Candidate Generation: 99 items that the user has not interacted with are randomly selected and combined with the item from the test set to form a candidate set of 100 items.
- Ranking: The Euclidean distance between the embedding vector of the training set item and the embedding vectors of the 100 items in the candidate set is calculated. Items are ranked in ascending order of distance, with smaller distances indicating higher similarity.
- Evaluation Metric: NDCG@10 is calculated based on the ranking. A higher value indicates better ranking accuracy.
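The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the gorse-cli implementation: the function names, the toy embeddings, and the tie-breaking rule (ties go against the held-out item) are all hypothetical. Because each user has exactly one relevant item, the ideal DCG is 1 and NDCG@10 reduces to 1 / log2(rank + 1) when the held-out item lands in the top 10, and 0 otherwise.

```python
import math
import random


def ndcg_at_k(rank, k=10):
    """NDCG@k when there is exactly one relevant item at 1-based position `rank`."""
    # With a single relevant item the ideal DCG is 1, so NDCG equals DCG.
    return 1.0 / math.log2(rank + 1) if rank <= k else 0.0


def evaluate_user(history, embeddings, all_items, rng, n_negatives=99, k=10):
    """Score one user. `history` holds item IDs sorted oldest to newest."""
    if len(history) < 2:
        return None  # need one item to query with and one held-out item
    train_item, test_item = history[-2], history[-1]

    # Candidate generation: random items the user never interacted with,
    # plus the held-out test item (placed last so ties count against it).
    seen = set(history)
    unseen = [i for i in all_items if i not in seen]
    candidates = rng.sample(unseen, n_negatives) + [test_item]

    # Ranking: ascending Euclidean distance to the training item's embedding.
    query = embeddings[train_item]
    ranked = sorted(candidates, key=lambda c: math.dist(embeddings[c], query))
    rank = ranked.index(test_item) + 1
    return ndcg_at_k(rank, k)


# Toy usage with random 8-dimensional vectors (hypothetical data).
rng = random.Random(0)
all_items = list(range(200))
embeddings = {i: [rng.gauss(0.0, 1.0) for _ in range(8)] for i in all_items}
histories = {"u1": [3, 7, 11], "u2": [5, 9], "u3": [42]}

scores = []
for history in histories.values():
    score = evaluate_user(history, embeddings, all_items, rng)
    if score is not None:  # users with a single feedback are skipped
        scores.append(score)
print("mean NDCG@10:", sum(scores) / len(scores))
```

With random embeddings the mean score stays low, since the held-out item is no more likely to rank near the top than any of the 99 negatives; a useful embedding model should do noticeably better than that baseline.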
Configuration
First, you need to add the API endpoint and API key to the following fields in the configuration file:
[openai]
# Base URL of OpenAI API.
base_url = "https://integrate.api.nvidia.com/v1"
# API key of OpenAI API.
auth_token = "NVIDIA_API_KEY"
These fields can also be overridden via environment variables:
OPENAI_BASE_URL="https://integrate.api.nvidia.com/v1"
OPENAI_AUTH_TOKEN="NVIDIA_API_KEY"
Compile gorse-cli from the Gorse repository and run the following command to evaluate the performance of a text embedding model:
./gorse-cli bench-embedding --config ./config/config.toml \
--text-column item.Comment \
--embedding-model qwen3-embedding:0.6b \
--embedding-dimensions 1024 \
--shot 1
- The --text-column parameter specifies the text field used to generate embeddings.
- The --embedding-model parameter specifies the text embedding model to use.
- The --embedding-dimensions parameter specifies the dimension of the embedding vector.
- The --shot parameter specifies how many training samples to use for calculating similarity. This post uses 1-shot.
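To make the configuration more concrete, here is a hedged sketch of the kind of request an OpenAI-compatible embeddings endpoint accepts. It only builds the request and does not send it. The helper name build_embedding_request and the sample text are hypothetical, and whether gorse-cli issues exactly this request is an assumption. The dimensions field is part of the OpenAI embeddings API, but not every OpenAI-compatible server or model honors it.

```python
import json
import os
import urllib.request


def build_embedding_request(texts, model, base_url, token, dimensions=None):
    """Build (but do not send) a POST request for an OpenAI-compatible /embeddings endpoint."""
    payload = {"model": model, "input": texts}
    if dimensions is not None:
        # Optional: asks the server for vectors of this size, if supported.
        payload["dimensions"] = dimensions
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/embeddings",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Read the same variables the post mentions, falling back to its placeholder values.
request = build_embedding_request(
    texts=["A sci-fi movie about time travel"],  # hypothetical item text
    model="qwen3-embedding:0.6b",
    base_url=os.environ.get("OPENAI_BASE_URL", "https://integrate.api.nvidia.com/v1"),
    token=os.environ.get("OPENAI_AUTH_TOKEN", "NVIDIA_API_KEY"),
    dimensions=1024,
)
print(request.full_url)
print(request.data.decode("utf-8"))
```

Passing the request to urllib.request.urlopen would perform the call; that step is left out here so the sketch runs without network access or a valid key.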
Results
The evaluated open-source models include qwen3-embedding from Alibaba Cloud and embeddinggemma from Google, in addition to the models covered in Comparing Text Embedding Model Performance in Recommendation Scenarios. Alibaba Cloud's text-embedding-v4 is also included as a reference for commercial models:
Based on the results, we can draw the following conclusions:
- Commercial Models Still Lead: Alibaba Cloud's text-embedding-v4 performed best in most dimensions. Particularly at 2048 dimensions, the NDCG@10 reached 0.1727, demonstrating its powerful semantic representation capabilities.
- Impressive Performance of the Qwen3 Embedding:
  - qwen3-embedding:4b performed very robustly, reaching a performance peak at around 512 dimensions, even surpassing the 8b model with more parameters. This indicates that in embedding tasks, larger model size is not always better.
  - qwen3-embedding:0.6b, as a lightweight model, demonstrated high efficiency at extremely low dimensions (32, 64, 128), making it very suitable for resource-constrained edge scenarios.
- Trade-off Between Dimensions and Performance: Most models reach performance saturation between 512 and 1024 dimensions. For most recommendation systems, choosing 512 dimensions can ensure accuracy while significantly reducing storage and indexing costs.
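The storage side of this trade-off is simple arithmetic. The sketch below assumes raw float32 vectors (4 bytes per value) and a hypothetical catalog of one million items, and it ignores any overhead added by a vector index.

```python
def embedding_storage_bytes(n_items, dimensions, bytes_per_value=4):
    """Raw size of n_items embedding vectors, excluding any index overhead."""
    return n_items * dimensions * bytes_per_value


n_items = 1_000_000  # hypothetical catalog size
for dims in (128, 512, 1024, 2048):
    gb = embedding_storage_bytes(n_items, dims) / 1e9
    print(f"{dims:>4} dimensions: {gb:.3f} GB")
```

Raw storage grows linearly with the dimension: under these assumptions 512 dimensions take 2.048 GB, half of what 1024 dimensions need and a quarter of 2048.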
Conclusion
For choosing a text embedding model for recommender systems in 2026, we offer the following advice:
- Pursue Ultimate Performance: Commercial models are the top choice. They have high performance ceilings across various dimensions and excellent multilingual support.
- Cost-Efficiency/Private Deployment: qwen3-embedding:4b is the current king of cost-efficiency. It achieves recommendation accuracy comparable to commercial models with fewer parameters.
- Low Latency/Edge: qwen3-embedding:0.6b at 64 or 128 dimensions is the best lightweight solution.
While this post provides some guidance, it is recommended to use gorse-cli to evaluate on your own dataset to choose the text embedding model that best fits your specific business scenario.
