Weaviate Academy

Why embedding model selection matters

Embedding models are AI models that capture "meanings" of objects by turning text, images, audio and more into sequences of numbers.
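
To make "sequences of numbers" concrete, here is a minimal sketch of how embeddings are typically compared once a model has produced them. The three-dimensional vectors below are hand-made toy values, not the output of any real model, and the names `toy_embeddings` and `query_vector` are hypothetical. Cosine similarity is used because it is a common choice for comparing embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made toy vectors (assumption): real embeddings usually have
# hundreds or thousands of dimensions and come from a trained model.
toy_embeddings = {
    "chocolate chip cookie recipe": [0.9, 0.1, 0.0],
    "oatmeal cookie recipe": [0.8, 0.3, 0.1],
    "how to change a tire": [0.0, 0.1, 0.9],
}
query_vector = [1.0, 0.0, 0.0]  # stands in for an embedded search query

# Rank documents from most to least similar to the query.
ranked = sorted(
    toy_embeddings,
    key=lambda doc: cosine_similarity(query_vector, toy_embeddings[doc]),
    reverse=True,
)

for doc in ranked:
    print(doc, round(cosine_similarity(query_vector, toy_embeddings[doc]), 3))
```

A better embedding model is one whose vectors place truly relevant documents closer to the query, so that this kind of ranking surfaces them first.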

Recent developments have greatly improved their capabilities, but the vast and ever-expanding set of options also makes model selection challenging.

The performance gap

Let's look at a concrete example comparing two models from different eras:

FastText (2015): An early embedding model
Snowflake Arctic Embed (2024): A modern embedding model

When searching for documents matching "How do I make chocolate chip cookies from scratch" in a dataset of 20 documents, here's what we see:

[Figure: Candidate documents]

FastText results: [Figure: Search results from FastText]

The FastText model finds somewhat relevant results, but includes off-topic recipes and misses the ideal step-by-step recipe.

Arctic Embed results: [Figure: Search results from Snowflake Arctic]

The Arctic model correctly identifies the ideal result as the top match and includes highly relevant results in the top positions.

Quantitative comparison

Measured with nDCG@10, a ranking metric between 0 and 1 that rewards placing relevant results near the top of the list, the two models score as follows:

Model	nDCG@10
FastText	0.595
Snowflake Arctic	0.908

This improvement of about 0.31 in nDCG@10 (from 0.595 to 0.908) shows why model selection matters for retrieval quality.
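
For readers unfamiliar with nDCG, the following is a minimal sketch of how nDCG@k can be computed from graded relevance judgments. It uses the linear-gain form of DCG (some implementations use an exponential gain instead), and the relevance grades in the usage example are hypothetical, not the judgments behind the scores above.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k results.

    relevances[i] is the graded relevance of the result at rank i + 1.
    Each gain is discounted by log2(rank + 1), so hits lower in the
    list contribute less.
    """
    return sum(
        rel / math.log2(position + 2)
        for position, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal (best possible) ordering.

    Assumption: the ideal ordering is built from the same judged
    results, sorted by relevance.
    """
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(relevances, k) / ideal_dcg


# Hypothetical relevance grades (3 = ideal match, 0 = off-topic).
good_ranking = [3, 2, 2, 1, 0]
poor_ranking = [0, 1, 2, 2, 3]

print(ndcg_at_k(good_ranking, 10))
print(ndcg_at_k(poor_ranking, 10))
```

The same set of results scores lower when the most relevant items appear further down the list, which is the behavior the metric is designed to capture.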

Resource implications

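Beyond retrieval performance, models vary significantly in resource requirements, largely because the size of each stored vector grows with the model's output dimensionality. For a vector database with 100 million documents:

Estimated memory requirements for 100 million documents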

High-dimension model (nv-embed-v2): ~3.3 TB memory
Low-dimension model (embed-english-light-v3.0): ~300 GB memory

This roughly 10x difference in memory requirements directly affects infrastructure costs and deployment feasibility.
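
Estimates like these can be approximated with back-of-envelope arithmetic. The sketch below assumes float32 vectors (4 bytes per dimension), output dimensions of 4096 for nv-embed-v2 and 384 for embed-english-light-v3.0, and a 2x overhead factor for the vector index; none of these values is stated in the text above, and the function names are hypothetical. Under these assumptions, the quoted figures of ~3.3 TB and ~300 GB are reproduced at about 100 million vectors.

```python
BYTES_PER_FLOAT32 = 4


def estimate_memory_bytes(num_vectors, dimensions, overhead_factor=2.0):
    """Rough memory estimate for storing and indexing float32 vectors.

    overhead_factor is an assumed rule-of-thumb multiplier for index
    structures; actual overhead depends on the database and settings.
    """
    return num_vectors * dimensions * BYTES_PER_FLOAT32 * overhead_factor


def format_bytes(num_bytes):
    """Format a byte count using decimal units (1 GB = 10**9 bytes)."""
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if num_bytes < 1000 or unit == "TB":
            return f"{num_bytes:.1f} {unit}"
        num_bytes /= 1000


num_vectors = 100_000_000  # assumption: 100 million stored vectors

# Assumed output dimensions for each model.
for model, dimensions in [
    ("nv-embed-v2", 4096),
    ("embed-english-light-v3.0", 384),
]:
    estimate = estimate_memory_bytes(num_vectors, dimensions)
    print(f"{model}: {format_bytes(estimate)}")
```

Because memory scales linearly with dimensionality, the ratio between the two estimates is simply 4096 / 384, or about 10.7.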

The challenge of choice

The embedding model landscape includes innovations like word2vec, FastText, GloVe, BERT, CLIP, OpenAI ada, Cohere multi-lingual, Snowflake Arctic, ColBERT, and ColPali. Each brings improvements in architecture, training data, methodology, modality support, or efficiency.

With hundreds of models available and new ones released regularly, making the right choice requires a systematic approach.

What's next?

Now that you understand why embedding model selection is crucial, let's explore a systematic workflow that will guide you through this complex decision-making process.
