
Detailed evaluation methods

With your candidate models identified, it's time for thorough evaluation. This stage collects concrete evidence to support your embedding model selection through systematic testing of your specific use case.

Types of evaluation

Understanding two distinct evaluation approaches helps frame your testing strategy:

Model evaluation: Assesses the embedding model itself through direct metrics, typically using benchmarks or custom tasks matching your use case.

Downstream evaluation: Examines how the embedding model performs within your complete system (RAG pipeline, recommendation engine, etc.).

This lesson focuses on model evaluation to help you make initial selections while building AI-powered applications. We'll cover downstream evaluation in future courses.

Standard benchmarks

Standard benchmarks provide valuable reference points created by experts, and their results are often already published, so using them requires minimal effort.

Understanding MTEB in detail

The Massive Text Embedding Benchmark (MTEB) serves as an industry-standard measure, composed of over 100 individual tasks grouped by type, including:

  • Retrieval: Finding relevant documents based on queries
  • Classification: Categorizing text into predefined classes
  • Clustering: Grouping similar texts together
  • Reranking: Sorting results by relevance
  • Semantic textual similarity (STS): Measuring semantic similarity between texts
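
You can go beyond the published leaderboard numbers and run individual MTEB tasks against your own candidate models. The sketch below assumes the open-source mteb and sentence-transformers packages; the model and task names are just examples, and the exact API can differ between mteb versions.

```python
# Minimal sketch: run a couple of MTEB tasks against one candidate model.
# Assumes `pip install mteb sentence-transformers`; task/model names are examples.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model_name = "all-MiniLM-L6-v2"          # any candidate embedding model
model = SentenceTransformer(model_name)

# Pick tasks that resemble your use case (one classification, one retrieval task here).
evaluation = MTEB(tasks=["Banking77Classification", "SciFact"])
results = evaluation.run(model, output_folder=f"results/{model_name}")

for task_result in results:
    print(task_result)
```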

Interpreting detailed MTEB scores

Rather than relying on overall scores, examine task-specific performance. MTEB's retrieval score, for example, combines results from 17 different benchmarks, and looking at those benchmarks individually reveals important nuances.

General domain tasks (chart): MTEB benchmarks by task - general

Specialized domain tasks (chart): MTEB benchmarks by task - specialized

These charts reveal that while some models like gemini-embedding-exp-03-07 perform consistently well, others excel in specific areas:

  • Snowflake Arctic models: Outstanding performance on LEMBPasskeyRetrieval (finding a specific piece of text hidden in very long documents)
  • Cohere multilingual models: Excel on MIRACL tasks (highly multilingual retrieval)

Creating custom benchmark weights

Standard MTEB scores may not reflect your priorities. Consider creating weighted scores based on:

  1. Task relevance: How closely does each task match your use case?
  2. Data distribution: Does the benchmark data represent your domain?
  3. Metric alignment: Do reported metrics align with your requirements?
  4. Language distribution: Does the language mix match your data?

For example, if your application requires strong multilingual support, weight multilingual tasks more heavily than English-only tasks.
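
To make this concrete, here is a small sketch of how per-task scores can be folded into one weighted number. The task names, scores, and weights below are invented placeholders; only the weighting logic matters.

```python
# Sketch: combine task-level benchmark scores into a custom weighted score.
# Scores and weights are illustrative placeholders, not real benchmark results.

def weighted_score(task_scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-task scores; weights encode how much each task matters to you."""
    total_weight = sum(weights.get(task, 0.0) for task in task_scores)
    if total_weight == 0:
        raise ValueError("No overlap between scored tasks and weighted tasks.")
    weighted_sum = sum(score * weights.get(task, 0.0) for task, score in task_scores.items())
    return weighted_sum / total_weight

# Example: weight multilingual retrieval 3x more heavily than English-only tasks.
scores = {"MIRACLRetrieval": 0.61, "MSMARCO": 0.44, "STSBenchmark": 0.82}
weights = {"MIRACLRetrieval": 3.0, "MSMARCO": 1.0, "STSBenchmark": 1.0}

print(f"Custom weighted score: {weighted_score(scores, weights):.3f}")
```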

Benchmark limitations

Data leakage risk: Public benchmarks may appear in model training data, inflating scores and creating unfair advantages.

Limited correlation: Standard benchmarks rarely perfectly align with your specific use case, data distribution, and success metrics.

Mitigation strategy: Use standard benchmarks as reference points while conducting custom evaluations for definitive model selection.

Custom benchmarks

Creating your own evaluation addresses standard benchmark limitations and provides use case-specific insights.

Setting benchmark objectives

Define clear goals addressing knowledge gaps, such as:

  • "Which model best retrieves customer reviews about coffee in English, French, and Korean?"
  • "How do models perform across Python/Golang code retrieval and related documentation?"

Determining appropriate metrics

For retrieval tasks, common metrics include:

  • Precision: Percentage of retrieved items that are relevant
  • Recall: Percentage of relevant items that were retrieved
  • MAP (Mean Average Precision): Average precision across multiple queries
  • MRR (Mean Reciprocal Rank): Average of reciprocal ranks of first relevant results
  • NDCG (Normalized Discounted Cumulative Gain): Rewards high-relevance results at top positions

NDCG is often ideal as it rewards systems for correctly prioritizing highly relevant results.
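
For a small custom benchmark, these metrics are simple enough to compute directly. The sketch below assumes binary relevance judgments and uses made-up document IDs purely for illustration.

```python
import math

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved items that are relevant."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant items that appear in the top-k."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / len(relevant)

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result (0 if none is found)."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """NDCG with binary relevance: rewards relevant results at earlier positions."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc in enumerate(retrieved[:k], start=1) if doc in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

# Illustrative example: document IDs are placeholders.
retrieved = ["doc3", "doc7", "doc1", "doc9", "doc2"]
relevant = {"doc1", "doc2"}
print(f"P@5    = {precision_at_k(retrieved, relevant, 5):.2f}")
print(f"R@5    = {recall_at_k(retrieved, relevant, 5):.2f}")
print(f"MRR    = {reciprocal_rank(retrieved, relevant):.2f}")
print(f"NDCG@5 = {ndcg_at_k(retrieved, relevant, 5):.2f}")
```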

Curating benchmark datasets

Effective datasets should:

  • Reflect your retrieval task: Match query types and document characteristics
  • Capture task difficulty: Include challenging cases that differentiate models
  • Represent data distribution: Mirror your actual data (domain, language, length)
  • Provide sufficient volume: Even 20 objects with multiple queries can yield meaningful results

Pragmatic approach: Start small but representative. Quality trumps quantity in custom benchmarks.
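
One lightweight way to store such a dataset is a plain dictionary of documents plus a list of queries with their human-judged relevant IDs. The content below is invented for illustration, echoing the multilingual coffee-review objective from earlier.

```python
# One possible layout for a small custom benchmark (illustrative content only).
benchmark = {
    "documents": {
        "rev-001": "The espresso was rich and balanced, with notes of dark chocolate.",   # English
        "rev-002": "Le café était trop amer pour moi, mais le service était excellent.",  # French
        "rev-003": "원두가 신선해서 향이 정말 좋았어요.",                                      # Korean
    },
    "queries": [
        {"query": "reviews praising coffee flavor", "relevant_ids": ["rev-001", "rev-003"]},
        {"query": "complaints about bitterness", "relevant_ids": ["rev-002"]},
    ],
}
```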

Running benchmarks systematically

Ensure reproducibility and consistency:

  1. Modularize components: Separate embedding creation, dataset loading, metric evaluation, and result presentation (see the sketch after this list)
  2. Document methodology: Record model versions, parameters, and evaluation procedures
  3. Control variables: Test models under identical conditions
  4. Plan for iteration: Structure code to accommodate new models and updated requirements
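
Putting these pieces together, a benchmark runner stays small when each step is its own function. The sketch below reuses the ndcg_at_k helper and the benchmark dictionary from the earlier sketches, and wraps sentence-transformers as the embedding client; swap in whichever models and retrieval setup you are actually evaluating.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def embed_texts(model: SentenceTransformer, texts: list[str]) -> np.ndarray:
    """Embedding step, isolated so a different provider can be swapped in."""
    return model.encode(texts, convert_to_numpy=True)

def retrieve(query_vec: np.ndarray, doc_vecs: np.ndarray, doc_ids: list[str], k: int) -> list[str]:
    """Rank documents by cosine similarity and return the top-k document IDs."""
    scores = doc_vecs @ query_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    return [doc_ids[i] for i in np.argsort(scores)[::-1][:k]]

def run_benchmark(model: SentenceTransformer, benchmark: dict, k: int) -> float:
    """Embed documents and queries with one model and return the mean NDCG@k."""
    doc_ids = list(benchmark["documents"])
    doc_vecs = embed_texts(model, [benchmark["documents"][d] for d in doc_ids])
    ndcg_scores = []
    for item in benchmark["queries"]:
        query_vec = embed_texts(model, [item["query"]])[0]
        retrieved = retrieve(query_vec, doc_vecs, doc_ids, k)
        ndcg_scores.append(ndcg_at_k(retrieved, set(item["relevant_ids"]), k))
    return float(np.mean(ndcg_scores))

# Identical conditions for every candidate: same dataset, same k, same metric.
for name in ["all-MiniLM-L6-v2", "intfloat/multilingual-e5-base"]:  # example models
    print(name, round(run_benchmark(SentenceTransformer(name), benchmark, k=2), 3))
```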

Comprehensive result evaluation

Quantitative analysis

Calculate quantitative metrics such as NDCG@k to rank models objectively. Remember, however, that these rankings depend on the dataset composition and the metrics you chose.

Qualitative analysis

Qualitative review often provides more actionable insights by revealing recurring patterns in how models behave:

Common failure patterns:

  • Favoring longer documents over shorter, highly relevant ones
  • Struggling with negation in sentences
  • Poor handling of domain-specific terminology
  • Language-specific performance variations (strong in English/Mandarin, weak in Hungarian)

Domain expertise advantage: Qualitative assessment often requires domain knowledge and system context, making it critically important for final decisions.
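
One practical way to surface these patterns is to print the top few results for every query and review them with someone who knows the domain. A small sketch, reusing the embed_texts and retrieve helpers from the runner above (output formatting is illustrative):

```python
def inspect_results(model: SentenceTransformer, benchmark: dict, k: int = 3) -> None:
    """Print the top-k retrieved documents for every query, for manual review."""
    doc_ids = list(benchmark["documents"])
    doc_vecs = embed_texts(model, [benchmark["documents"][d] for d in doc_ids])
    for item in benchmark["queries"]:
        query_vec = embed_texts(model, [item["query"]])[0]
        print(f"\nQuery: {item['query']}")
        for rank, doc_id in enumerate(retrieve(query_vec, doc_vecs, doc_ids, k), start=1):
            label = "RELEVANT" if doc_id in item["relevant_ids"] else "not relevant"
            print(f"  {rank}. [{label}] {benchmark['documents'][doc_id][:80]}")
```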

The combination of quantitative rankings and qualitative insights provides the complete picture needed for informed model selection.

What's next?

Now let's see this evaluation process in action with a practical example that demonstrates how to design and execute custom benchmarks for technical documentation retrieval.
