Practical benchmark example
Let's walk through a complete custom benchmark example to demonstrate the evaluation process in practice. This example shows how to design, implement, and analyze a benchmark for technical documentation retrieval.
Scenario setup
Goal: Implement a RAG system for company technical documentation including product docs, code examples, and support forum logs.
Challenge: You've shortlisted two embedding models (Model A and Model B) based on MTEB scores and practical considerations, but need to determine which performs better on your specific, diverse content.
Step 1: Set benchmark objectives
Your documentation comes from varied sources with different characteristics:
- Writing styles: Informal forum posts vs. formal documentation
- Text lengths: Comprehensive guides vs. short answers
- Content types: Code snippets vs. English explanations
Objective: Test how each model handles style, length, and content type variability in technical documentation retrieval.
Step 2: Determine evaluation metrics
For this retrieval problem, where some results are more relevant than others, we'll use NDCG@k: it works with graded (rather than binary) relevance and rewards rankings that place the most relevant documents highest.
NDCG implementation:
import numpy as np
from typing import Optional

def calculate_dcg(relevance_scores: list[int], k: Optional[int] = None) -> float:
    """Compute Discounted Cumulative Gain (DCG).

    Args:
        relevance_scores: List of graded relevance scores (e.g. 0-3)
        k: Number of results to consider. If None, uses all results.
    """
    if k is not None:
        relevance_scores = relevance_scores[:k]
    gains = [2**score - 1 for score in relevance_scores]
    dcg = 0.0
    for i, gain in enumerate(gains):
        # Positions are 1-indexed, so the discount is log2(position + 1) = log2(i + 2);
        # at position 1 the discount is log2(2) = 1, leaving the gain unchanged.
        dcg += gain / np.log2(i + 2)
    return dcg
def calculate_ndcg(
    actual_scores: list[int], ideal_scores: list[int], k: Optional[int] = None
) -> float:
    """Compute Normalized DCG (NDCG) by dividing DCG by the ideal DCG.

    Args:
        actual_scores: List of relevance scores in predicted order
        ideal_scores: List of relevance scores in ideal order
        k: Number of results to consider
    """
    dcg = calculate_dcg(actual_scores, k)
    idcg = calculate_dcg(ideal_scores, k)
    return dcg / idcg if idcg > 0 else 0.0
Note: Libraries like scikit-learn provide built-in NDCG implementations for production use.
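As a quick cross-check, here is how that might look with scikit-learn's ndcg_score (assuming scikit-learn is installed; its gain function may differ from the exponential form above, so absolute values can differ slightly):
from sklearn.metrics import ndcg_score
import numpy as np

# Graded relevance of five candidate documents for one query (scale 0-3)
true_relevance = np.asarray([[3, 3, 3, 1, 0]])
# Similarity scores a model assigned to the same five documents
model_scores = np.asarray([[0.90, 0.70, 0.80, 0.95, 0.20]])

print(ndcg_score(true_relevance, model_scores, k=5))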
Step 3: Curate benchmark dataset
Design a dataset that captures the diversity you want to evaluate:
dataset = {
# Search query
"query": "How to set up a vector index with binary quantization",
# Candidate document set, with scores on a scale of 0-3
"documents": [
{
"id": "doc001",
# Highly relevant documentation text
"text": "Each collection can be configured to use BQ compression. BQ can be enabled at collection creation time, before data is added to it. This can be done by setting the vector_index_config of the collection to enable BQ compression.",
"score": 3
},
{
"id": "doc002",
# Highly relevant, long code example
"text": "from weaviate.classes.config import Configure, Property, DataType, VectorDistances, VectorFilterStrategy\n\nclient.collections.create(\n 'Article',\n # Additional configuration not shown\n vector_index_config=Configure.VectorIndex.hnsw(\n quantizer=Configure.VectorIndex.Quantizer.bq(\n cache=True,\n rescore_limit=1000\n ),\n ef_construction=300,\n distance_metric=VectorDistances.COSINE,\n filter_strategy=VectorFilterStrategy.SWEEPING\n ),)",
"score": 3
},
{
"id": "doc003",
# Highly relevant, short code example
"text": "client.collections.create(\nname='Movie',\nvector_index_config=wc.Configure.VectorIndex.flat(\nquantizer=wc.Configure.VectorIndex.Quantizer.bq()\n))",
"score": 3
},
{
"id": "doc004",
# Less relevant forum post, even though the right keywords appear
"text": "No change in vector size after I set up Binary Quantization\nHello! I was curious to try out how binary quantization works. To embed data I use gtr-t5-large model, which creates 768-dimensional vectors. My database stores around 2k of vectors. My python code to turn PQ on is following: client.schema.update_config(\n 'Document',\n {\n 'vectorIndexConfig': {\n 'bq': {\n 'enabled': True, \n }\n }\n },\n)",
"score": 1
},
# ... more documents ...
{
"id": "doc030",
# Irrelevant documentation text
"text": "Weaviate stores data objects in collections. Data objects are represented as JSON-documents. Objects normally include a vector that is derived from a machine learning model. The vector is also called an embedding or a vector embedding.",
"score": 0
},
]
}
This dataset strategically includes:
- Varying relevance scores (0-3) to test ranking ability (tallied in the quick check after this list)
- Different document types: formal docs, code examples, forum posts
- Different lengths: from short code snippets to comprehensive explanations
- Mixed languages: code and English explanations
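Before running the benchmark, a quick sanity check (a minimal sketch, assuming the dataset dict above) confirms the relevance labels actually span the scale:
from collections import Counter

score_distribution = Counter(doc["score"] for doc in dataset["documents"])
print(score_distribution)  # For the excerpt above: Counter({3: 3, 1: 1, 0: 1})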
Step 4: Implement benchmark runner
Create a systematic, reproducible evaluation framework:
import numpy as np

class Document:
    """Document with text and relevance score"""
    def __init__(self, id, text, relevance_score):
        self.id = id
        self.text = text
        self.relevance_score = relevance_score

class EmbeddingModel:
    """Abstract embedding model interface"""
    def __init__(self, name):
        self.name = name

    def embed(self, text):
        """Generate embedding for text (implemented by concrete subclasses)"""
        raise NotImplementedError
class BenchmarkRunner:
    """Runs embedding model benchmarks"""
    def __init__(self, queries, documents, models):
        self.queries = queries
        self.documents = documents
        self.models = models

    def run(self, k=10):
        """Run benchmark for all models

        Returns: Dict mapping model names to metrics
        """
        results = {}
        for model in self.models:
            # Get embeddings for all texts
            query_embeddings = {q: model.embed(q) for q in self.queries}
            doc_embeddings = {doc.id: model.embed(doc.text) for doc in self.documents}
            # Calculate metrics for each query
            ndcg_scores = []
            for query, query_emb in query_embeddings.items():
                # Get top k documents by similarity
                top_docs = self._retrieve_top_k(query_emb, doc_embeddings, k)
                # Calculate NDCG
                ndcg = self._calculate_ndcg(top_docs, query, k)
                ndcg_scores.append(ndcg)
            # Store results
            results[model.name] = {
                'avg_ndcg': np.mean(ndcg_scores),
                'all_scores': ndcg_scores
            }
        return results

    def _retrieve_top_k(self, query_emb, doc_embeddings, k):
        """Retrieve top k docs by cosine similarity to the query embedding"""
        def cosine(a, b):
            return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
        ranked = sorted(
            self.documents,
            key=lambda doc: cosine(query_emb, doc_embeddings[doc.id]),
            reverse=True,
        )
        return ranked[:k]

    def _calculate_ndcg(self, retrieved_docs, query, k):
        """Calculate NDCG@k using the functions from Step 2"""
        actual_scores = [doc.relevance_score for doc in retrieved_docs]
        ideal_scores = sorted((doc.relevance_score for doc in self.documents), reverse=True)
        return calculate_ndcg(actual_scores, ideal_scores, k)
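To tie the pieces together, here is a minimal usage sketch. It assumes the sentence-transformers library as the embedding backend, with two publicly available models standing in for Model A and Model B; substitute your own shortlisted models or API clients as needed:
from sentence_transformers import SentenceTransformer

class SentenceTransformerModel(EmbeddingModel):
    """Thin adapter around a sentence-transformers model (illustrative only)"""
    def __init__(self, name):
        super().__init__(name)
        self._model = SentenceTransformer(name)

    def embed(self, text):
        return self._model.encode(text)

documents = [Document(d["id"], d["text"], d["score"]) for d in dataset["documents"]]
runner = BenchmarkRunner(
    queries=[dataset["query"]],
    documents=documents,
    models=[
        SentenceTransformerModel("all-MiniLM-L6-v2"),   # stand-in for Model A
        SentenceTransformerModel("all-mpnet-base-v2"),  # stand-in for Model B
    ],
)
print(runner.run(k=10))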
Step 5: Analyze results comprehensively
Quantitative analysis
Start with overall performance comparison:
# Example benchmark results
results = {
    'Model A': {'avg_ndcg': 0.87, 'all_scores': [0.92, 0.85, 0.84]},
    'Model B': {'avg_ndcg': 0.79, 'all_scores': [0.95, 0.72, 0.70]}
}

# Print summary
for model_name, metrics in results.items():
    print(f"{model_name}: NDCG@10 = {metrics['avg_ndcg']:.4f}")
Look beyond averages:
- Consistency: Model A shows more consistent performance (smaller score variance; quantified in the snippet after this list)
- Peak performance: Model B achieves higher peak scores but with greater variability
- Query-specific patterns: Group scores by query characteristics
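A minimal way to measure that spread, using the example results above:
import numpy as np

for model_name, metrics in results.items():
    scores = metrics['all_scores']
    # Model A's scores cluster tightly; Model B's vary more across queries
    print(f"{model_name}: mean={np.mean(scores):.2f}, std={np.std(scores):.2f}, "
          f"min={min(scores):.2f}, max={max(scores):.2f}")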
Qualitative analysis
Examine actual retrieval results for actionable insights:
Pattern identification:
- Document type preferences: Does a model consistently favor code over text explanations?
- Length bias: Are shorter, highly relevant snippets overlooked for longer documents?
- Language handling: How well are technical terms and domain jargon processed?
- Context understanding: Do models grasp the relationship between questions and code solutions?
Model comparison (a per-query gap check is sketched after this list):
- Where do models disagree most significantly?
- Do models prioritize different relevance aspects?
- Which model better handles your specific domain challenges?
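Assuming both models' per-query scores are stored in the same query order, a simple gap check surfaces the queries worth inspecting by hand:
# Rank queries by the gap between the two models' NDCG scores
scores_a = results['Model A']['all_scores']
scores_b = results['Model B']['all_scores']
gaps = sorted(
    enumerate(zip(scores_a, scores_b)),
    key=lambda item: abs(item[1][0] - item[1][1]),
    reverse=True,
)
for query_idx, (a, b) in gaps:
    print(f"Query {query_idx}: Model A={a:.2f}, Model B={b:.2f}, gap={abs(a - b):.2f}")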
Decision framework
The ideal model balances multiple factors:
Performance metrics: Raw NDCG scores provide an objective ranking
Specific strengths: Performance on your most critical query types
Practical considerations: Cost, latency, and deployment requirements from your initial requirements analysis
Key insight: The best model isn't necessarily the highest-scoring one overall—it's the one that excels at your most important tasks while meeting operational constraints.
Beyond model selection
This evaluation process provides valuable insights beyond just choosing a model:
- Understanding limitations: Know where your chosen model struggles
- System design implications: Design around model strengths and weaknesses
- Performance expectations: Set realistic expectations for production performance
- Future evaluation: Establish baselines for comparing future models
Your custom benchmark becomes a reusable asset for ongoing model evaluation and system improvement.
With your model selected and deployed, the evaluation process doesn't end. Let's explore how to maintain optimal performance through periodic re-evaluation as your needs and the model landscape evolve.