AI news #5: battle of embedding models
Greetings, AI enthusiasts! Welcome to the latest edition of our newsletter. If you’ve been following along, you’ll remember that our last piece explored semantic search and its reliance on text embeddings. Today, we’re building on that foundation with insights from our recent review of the most prominent embedding model comparisons. So grab a snack and a beverage, and let’s dive into how these models stack up against each other and what that means for your AI projects.
Decoding embedding models
First, let’s recap what an embedding model even is. As we’ve mentioned, it enables machines to understand and process human language by converting words, phrases, and even entire documents into numerical representations, or vectors. The dimensions of these vectors capture the semantic meaning and context of the text: the more dimensions a vector has, the more comprehensive the machine’s “understanding” of that piece of text. Embeddings are therefore essential for tasks like search, clustering, and classification.
As we move along, keep these two definitions in mind:
Embedding generation. The core process of converting text into vectors. You’ve probably come across names like ada-002 (developed by OpenAI), e5 (intfloat), or all-MiniLM-L6-v2 (Sentence Transformers); these are among the best-known embedding models.
Dimensions. The number of values in a vector. More dimensions can capture more nuanced meaning but come at the cost of more computation and storage.
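To make the two definitions above concrete, here is a minimal sketch of how embeddings power semantic search. The tiny 4-dimensional vectors below are made up purely for illustration (real models like all-MiniLM-L6-v2 output 384 dimensions, and ada-002 outputs 1,536); what matters is the cosine-similarity comparison, which is the standard way to measure how close two embeddings are.

```python
import numpy as np

# Toy 4-dimensional "embeddings" for three texts. In practice these vectors
# would come from an embedding model; the values here are invented so the
# example stays self-contained.
embeddings = {
    "cat on a mat":       np.array([0.9, 0.1, 0.0, 0.2]),
    "kitten on a rug":    np.array([0.8, 0.2, 0.1, 0.3]),
    "quarterly earnings": np.array([0.0, 0.1, 0.9, 0.7]),
}

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = embeddings["cat on a mat"]
scores = {text: cosine_similarity(query, vec) for text, vec in embeddings.items()}
# Semantically close texts end up with higher cosine similarity.
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)  # the two "cat" sentences rank above the finance one
```

Swapping the toy dictionary for vectors produced by a real model is all it takes to turn this into a working (if naive) semantic search over a small corpus.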
Now, let’s get to the findings.
In our investigation, we analyzed several comprehensive comparisons, aiming for a thorough understanding of each model’s properties and its performance in the context of semantic search.
A few quick research highlights
There is always nuance in choosing models. The main insight we’d like you to take away from this piece is that there isn’t a clear-cut winner in terms of model performance. Our broad analysis shows that the choice of model will always depend on your project’s requirements, as each has unique strengths and weaknesses that may or may not align with your objectives. There are very active companies worth paying attention to, such as OpenAI, Voyage AI, Mistral AI, and Cohere. However, the model best suited for your tasks may come from any provider, which leads us to the next point.
Proprietary vs. open source. Another important insight, which reiterates the sentiment stated earlier, is that there isn’t always a clear link between a company’s prowess and reputation and its model’s performance. For example, we’ve seen in comparisons that a small open-source model like e5 can, in some instances, deliver comparable performance at a fraction of the cost of ada-002 (OpenAI), which has long been considered one of the leaders. With advancements happening so rapidly, even the most promising models can’t maintain supremacy for long. In the case of ada, besides competition from various large and small companies, OpenAI itself has released models that have already surpassed it, such as the text-embedding-3 family. Therefore, to find the most well-rounded choice, companies should always keep their eyes open and consider both proprietary and open-source options.
The rising demand for multilingual AI. Next, we’d like to emphasize the importance and growing relevance of AI multilingualism. While many prominent companies have traditionally focused on English, there is increasing demand for projects requiring multilingual support. As a result, well-performing “cosmopolitan” models are expected to gain more attention in the near future. One algorithm that caught our eye was e5-mistral-7b-instruct. While it may not be the fastest due to its size, its adaptability and versatility across various domains and languages make it a strong candidate for projects requiring broad applicability. In addition to Mistral, other noteworthy models include voyage-multilingual-2 (Voyage AI) and embed-multilingual-v3.0 (Cohere).
The fast pace of progress in the field. The extremely fast-paced dynamics of the field are a good thing. Case in point: Voyage AI gained attention with its voyage-large-2-instruct model, which briefly claimed the top spot on the MTEB leaderboard. However, as is common in the field, it has since dropped to 11th place. This scenario underscores both the frequent emergence of promising algorithms and the fact that the competitive nature of the industry forces companies to keep the pace up. We thus encourage you to stay informed not only about OpenAI’s models, which have frequently dominated the news, but also about those from Voyage AI, Cohere, Mistral AI, Mixedbread, and other AI organizations that are continually releasing new generations of embedding models and driving progress.
For the curious techies out there, here’s a brief summation of the models’ characteristics from each of the comparisons we studied.
Pinecone:
Ada-002 (OpenAI). Powerful but the slowest, with the largest dimensions.
Embed-english-v3.0 (Cohere). Fairly accurate and performs well in various tasks.
E5 (intfloat). Despite being a smaller, open-source model, it struggled in the tests, delivering the least satisfactory results compared to the other two.
MyScale:
Embed-english-v3.0 (Cohere). Again, outperformed others in accuracy across many applications.
Text-embedding-3-large (OpenAI). Excelled in speed, with the smaller version offering a good balance between speed and quality.
E5-mistral-7b-instruct (Mistral). Stood out for its versatility, making it highly adaptable across different domains and languages.
Voyage AI:
Voyage-large-2-instruct. Initially ranked highly on the MTEB leaderboard, later dropped in ranking but still demonstrates strong performance.
Voyage-multilingual-2. Designed for multilingual retrieval, it outperforms competitors in most languages (though it should be noted that the results were self-reported), especially in retrieval-augmented generation (RAG) tasks.
Medium and Towards Data Science:
OpenAI vs. open-source models. OpenAI models showed consistent performance across languages, particularly in creating universally applicable embeddings. However, open-source models like bge-m3 from the Beijing Academy of Artificial Intelligence (BAAI) and e5-mistral-7b-instruct were noted for their strong performance too, often surpassing the proprietary alternatives in specific use cases.
We should additionally mention the study from SoftwareMill, which gave us the following insights.
Re-ranking. This is an optional step that can significantly boost the performance of a retrieval engine. How? In semantic search, a retrieval system first pulls a large pool of documents that might be relevant to a user’s query. This first-pass stage is fast, but it may not be precise: some retrieved documents might be only vaguely relevant. Re-ranking then refines this first list. It takes the top-k results (a smaller set of the most likely candidates) from the initial retrieval and re-orders them by comparing each document to the user’s query individually. While this process takes more time, it produces more accurate results.
BAAI/bge-reranker-large. This is considered one of the best-in-class re-ranking models. It can be used to improve the performance of custom embedding models, as well as of pre-trained models like text-embedding-ada-002.
Impact. A re-ranking module can significantly improve the accuracy of custom embedding models. However, for pre-trained models like text-embedding-3-small, the impact might be less pronounced. This suggests that custom models may benefit more from re-ranking, while pre-trained models may already achieve good performance without it.
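The two-stage retrieve-then-re-rank flow described above can be sketched as follows. Everything here is a simplified stand-in: the embeddings are invented 2-dimensional vectors, and the re-ranking scorer is a made-up word-overlap function standing in for a real cross-encoder such as BAAI/bge-reranker-large, which reads the query and each candidate document together.

```python
import numpy as np

# A tiny corpus with hypothetical precomputed embeddings (a real system
# would compute these with an embedding model).
docs = {
    "d1": "the cat sat on the mat",
    "d2": "stock markets fell sharply today",
    "d3": "a kitten slept on the rug",
}
doc_vecs = {
    "d1": np.array([0.9, 0.1]),
    "d2": np.array([0.1, 0.9]),
    "d3": np.array([0.8, 0.3]),
}

def first_pass(query_vec: np.ndarray, k: int = 2) -> list[str]:
    """Stage 1: fast approximate retrieval by cosine similarity."""
    sims = {
        d: float(v @ query_vec / (np.linalg.norm(v) * np.linalg.norm(query_vec)))
        for d, v in doc_vecs.items()
    }
    return sorted(sims, key=sims.get, reverse=True)[:k]

def rerank(query: str, candidates: list[str]) -> list[str]:
    """Stage 2: slower, more precise scoring of only the top-k candidates.
    Word overlap is a toy stand-in for a cross-encoder's relevance score."""
    def score(d: str) -> int:
        return len(set(query.split()) & set(docs[d].split()))
    return sorted(candidates, key=score, reverse=True)

query_vec = np.array([0.85, 0.2])                  # hypothetical query embedding
candidates = first_pass(query_vec, k=2)            # cheap, approximate
final = rerank("cat sat on the mat", candidates)   # precise, only over top-k
print(final)
```

The design point is that the expensive per-document comparison runs only over the small candidate set, which is why re-ranking adds accuracy without making the whole search slow.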
What does all of this mean?
Moving forward, let's consider some of the practical implications in the context of semantic search.
Speed vs. accuracy. There’s always this battle. OpenAI’s text-embedding-3-small model struck an excellent balance between speed and result quality, making it ideal for applications where quick processing is crucial. However, for tasks requiring the highest accuracy, models like Cohere’s embed-english-v3.0 took the lead, albeit with longer processing times.
Re-ranking is powerful. Introducing a re-ranking step can significantly boost the performance of custom embedding models. For instance, using the BAAI/bge-reranker-large model improved a retrieval engine's accuracy from 77% to 84%. This additional layer helps refine results, especially in complex search tasks.
Edge cases and limitations. All models have limitations. For instance, the e5 model struggled with certain types of queries, often requiring additional preprocessing or re-ranking to deliver satisfactory results. Similarly, Cohere’s model, while generally accurate, showed inconsistencies in multilingual settings, particularly for less common languages.
What’s ahead?
As the field continues to evolve, we will likely see further improvements in both proprietary and open-source models. The trends we expect to solidify include:
Continued optimization of models for speed and accuracy.
Increased focus on multilingual capabilities.
Emergence of new players in the competitive landscape.
Wrapping up
While the full impact of semantic search remains to be seen, it's a rapidly growing field that shows no signs of slowing down. Embedding models, a key component of semantic search, will also continue to be a focus for us, and not just in this area. For anyone trying to develop a multilingual chatbot or explore other NLP applications, understanding how to choose the right embedding model is crucial.
Stay tuned for our next edition, where we'll continue to bring you the latest insights, trends, and breakthroughs in the ever-evolving world of AI!
Check out our blog posts:
Why choosing the right cloud for AI is such a challenge for businesses
The future of insurance: integrating AI for smarter risk assessment
Revolutionizing claims management: the power of insurance analytics
Avenga,
your competitive advantage 🚀