
Why embedding model selection matters

Embedding models are AI models that capture the meaning of objects such as text, images, and audio by converting them into sequences of numbers (vectors).

Figure: Neural network basic diagram

Recent developments have greatly improved their capabilities, but the large and still-growing set of options also makes model selection challenging.

The performance gap

Let's look at a concrete example comparing two models from different eras:

  • FastText (2016): An early embedding model
  • Snowflake Arctic Embed (2024): A modern embedding model

When searching for documents matching "How do I make chocolate chip cookies from scratch" in a dataset of 20 documents, here's what we see:

Candidate Documents

FastText results (figure: search results from FastText):

The FastText model finds somewhat relevant results, but includes off-topic recipes and misses the ideal step-by-step recipe.

Arctic Embed results (figure: search results from Snowflake Arctic Embed):

The Arctic model correctly identifies the ideal result as the top match and includes highly relevant results in the top positions.
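Both models are used the same way at query time: the query and every document are embedded, and documents are ranked by how close their vectors are to the query vector. The following is a minimal sketch of that ranking step, not the code behind the comparison above. It assumes cosine similarity as the distance measure (the source does not say which one was used), and the three-dimensional vectors and document names are invented for illustration; real models emit hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_documents(query_vec, doc_vecs):
    """Return (doc_id, score) pairs sorted from most to least similar."""
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical toy embeddings, made up for this example only.
query = [0.9, 0.1, 0.0]
docs = {
    "cookie_recipe": [0.8, 0.2, 0.1],
    "brownie_recipe": [0.5, 0.5, 0.2],
    "car_manual": [0.0, 0.1, 0.9],
}

ranking = rank_documents(query, docs)
for doc_id, score in ranking:
    print(f"{doc_id}: {score:.3f}")
```

The ranking logic is identical for any model; what changes between a weaker and a stronger model is how well the vectors themselves place relevant documents near the query.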

Quantitative comparison

Using the nDCG@10 metric (which rewards relevant results at the top of the list):

Model             nDCG@10
FastText          0.595
Snowflake Arctic  0.908

This gap of about 0.31 in nDCG@10 (a relative improvement of roughly 53%) shows why model selection matters for retrieval quality.
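For readers unfamiliar with the metric, here is a minimal sketch of how nDCG@k can be computed. It assumes the common linear-gain formulation with a log2 rank discount; some implementations use a gain of 2^rel - 1 instead, and the source does not state which variant produced the scores above. The input is a list of graded relevance labels for every candidate document, in the order the model ranked them.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k results (linear gain)."""
    return sum(
        rel / math.log2(rank + 2)  # rank is 0-based, so the top result is divided by log2(2) = 1
        for rank, rel in enumerate(relevances[:k])
    )

def ndcg_at_k(relevances, k=10):
    """DCG of the given ranking divided by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    if ideal == 0:
        return 0.0  # no relevant documents at all
    return dcg_at_k(relevances, k) / ideal

# Hypothetical relevance grades (3 = ideal match, 0 = off-topic), invented for illustration.
well_ranked = [3, 2, 2, 1, 0, 0]
poorly_ranked = [0, 2, 0, 1, 3, 2]

print(ndcg_at_k(well_ranked))
print(ndcg_at_k(poorly_ranked))
```

A ranking that already places documents in descending order of relevance scores 1.0, and pushing the best document down the list lowers the score, which is the behavior the comparison above relies on.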

Resource implications

Beyond retrieval quality, models vary significantly in resource requirements. Estimated memory requirements for a vector database holding 100 million documents:

  • High-dimension model (nv-embed-v2): ~3.3 TB memory
  • Low-dimension model (embed-english-light-v3.0): ~300 GB memory

This roughly 10x difference in memory requirements directly affects infrastructure costs and deployment feasibility.
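The arithmetic behind estimates like these can be sketched as follows. The output sizes used here (4096 dimensions for nv-embed-v2, 384 for embed-english-light-v3.0) are the models' published dimensions. Everything else is an assumption made for this sketch: vectors stored as 32-bit floats, a 2x multiplier as a rough allowance for index and runtime overhead, and a corpus size of 100 million vectors. The function name `estimate_memory_bytes` is hypothetical.

```python
def estimate_memory_bytes(n_vectors, dimensions, bytes_per_value=4, overhead=2.0):
    """Rough memory estimate: raw vector storage times an overhead factor.

    bytes_per_value=4 assumes float32; overhead=2.0 is an assumed rule of thumb.
    """
    return n_vectors * dimensions * bytes_per_value * overhead

N_VECTORS = 100_000_000  # assumed corpus size

high_dim = estimate_memory_bytes(N_VECTORS, 4096)  # nv-embed-v2
low_dim = estimate_memory_bytes(N_VECTORS, 384)    # embed-english-light-v3.0

print(f"High-dimension model: {high_dim / 1e12:.1f} TB")
print(f"Low-dimension model:  {low_dim / 1e9:.0f} GB")
print(f"Ratio: {high_dim / low_dim:.1f}x")
```

Because memory grows linearly with dimensionality, the ratio between two models depends only on their output sizes, while the absolute totals shift with corpus size, numeric precision, and any compression or quantization applied.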

The challenge of choice

The embedding model landscape includes innovations such as word2vec, FastText, GloVe, BERT, CLIP, OpenAI ada, Cohere multilingual, Snowflake Arctic, ColBERT, and ColPali. Each brought improvements in architecture, training data, methodology, modality support, or efficiency.

With hundreds of models available and new ones released regularly, making the right choice requires a systematic approach.

What's next?

Now that you understand why embedding model selection is crucial, let's explore a systematic workflow that will guide you through this complex decision-making process.
