Practical benchmark example
Let's walk through a complete custom benchmark example to demonstrate the evaluation process in practice. This example shows how to design, implement, and analyze a benchmark for technical documentation retrieval.
Scenario setup
Goal: Implement a RAG system for company technical documentation including product docs, code examples, and support forum logs.
Challenge: You've shortlisted two embedding models (Model A and Model B) based on MTEB scores and practical considerations, but need to determine which performs better on your specific, diverse content.
Step 1: Set benchmark objectives
Your documentation comes from varied sources with different characteristics:
- Writing styles: Informal forum posts vs. formal documentation
- Text lengths: Comprehensive guides vs. short answers
- Content types: Code snippets vs. English explanations
Objective: Test how each model handles style, length, and content type variability in technical documentation retrieval.
Step 2: Determine evaluation metrics
For this retrieval problem, where some results are more relevant than others, we'll use NDCG@k: it works with graded (rather than binary) relevance and rewards rankings that place the most relevant documents highest.
NDCG implementation:
import numpy as np
from typing import Optional

def calculate_dcg(relevance_scores: list[int], k: Optional[int] = None) -> float:
    """Compute Discounted Cumulative Gain (DCG).

    Args:
        relevance_scores: List of graded relevance scores (e.g. 0-3)
        k: Number of results to consider. If None, uses all results.
    """
    if k is not None:
        relevance_scores = relevance_scores[:k]
    gains = [2**score - 1 for score in relevance_scores]
    dcg = 0.0
    for i, gain in enumerate(gains):
        # Positions are 1-indexed, so the discount is log2(position + 1) = log2(i + 2);
        # at position 1 the discount is log2(2) = 1, leaving the gain unchanged.
        dcg += gain / np.log2(i + 2)
    return dcg
def calculate_ndcg(
    actual_scores: list[int], ideal_scores: list[int], k: Optional[int] = None
) -> float:
    """Compute Normalized DCG (NDCG) by dividing DCG by the ideal DCG.

    Args:
        actual_scores: List of relevance scores in predicted order
        ideal_scores: List of relevance scores in ideal order
        k: Number of results to consider
    """
    dcg = calculate_dcg(actual_scores, k)
    idcg = calculate_dcg(ideal_scores, k)
    return dcg / idcg if idcg > 0 else 0.0
Note: Libraries like scikit-learn provide built-in NDCG implementations for production use.
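As a quick cross-check, here is how that might look with scikit-learn's ndcg_score (assuming scikit-learn is installed; its gain function may differ from the exponential form above, so absolute values can differ slightly):
from sklearn.metrics import ndcg_score
import numpy as np

# Graded relevance of five candidate documents for one query (scale 0-3)
true_relevance = np.asarray([[3, 3, 3, 1, 0]])
# Similarity scores a model assigned to the same five documents
model_scores = np.asarray([[0.90, 0.70, 0.80, 0.95, 0.20]])

print(ndcg_score(true_relevance, model_scores, k=5))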
Step 3: Curate benchmark dataset
Design a dataset that captures the diversity you want to evaluate:
dataset = {
# Search query
"query": "How to set up a vector index with binary quantization",
# Candidate document set, with scores on a scale of 0-3
"documents": [
{
"id": "doc001",
# Highly relevant documentation text
"text": "Each collection can be configured to use BQ compression. BQ can be enabled at collection creation time, before data is added to it. This can be done by setting the vector_index_config of the collection to enable BQ compression.",
"score": 3
},
{
"id": "doc002",
# Highly relevant, long code example
"text": "from weaviate.classes.config import Configure, Property, DataType, VectorDistances, VectorFilterStrategy\n\nclient.collections.create(\n 'Article',\n # Additional configuration not shown\n vector_index_config=Configure.VectorIndex.hnsw(\n quantizer=Configure.VectorIndex.Quantizer.bq(\n cache=True,\n rescore_limit=1000\n ),\n ef_construction=300,\n distance_metric=VectorDistances.COSINE,\n filter_strategy=VectorFilterStrategy.SWEEPING\n ),)",
"score": 3
},
{
"id": "doc003",
# Highly relevant, short code example
"text": "client.collections.create(\nname='Movie',\nvector_index_config=wc.Configure.VectorIndex.flat(\nquantizer=wc.Configure.VectorIndex.Quantizer.bq()\n))",
"score": 3
},
{
"id": "doc004",
# Less relevant forum post, even though the right keywords appear
"text": "No change in vector size after I set up Binary Quantization\nHello! I was curious to try out how binary quantization works. To embed data I use gtr-t5-large model, which creates 768-dimensional vectors. My database stores around 2k of vectors. My python code to turn PQ on is following: client.schema.update_config(\n 'Document',\n {\n 'vectorIndexConfig': {\n 'bq': {\n 'enabled': True, \n }\n }\n },\n)",
"score": 1
},
# ... more documents ...
{
"id": "doc030",
# Irrelevant documentation text
"text": "Weaviate stores data objects in collections. Data objects are represented as JSON-documents. Objects normally include a vector that is derived from a machine learning model. The vector is also called an embedding or a vector embedding.",
"score": 0
},
]
}
This dataset strategically includes:
- Varying relevance scores (0-3) to test ranking ability (tallied in the quick check after this list)
- Different document types: formal docs, code examples, forum posts
- Different lengths: from short code snippets to comprehensive explanations
- Mixed languages: code and English explanations
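Before running the benchmark, a quick sanity check (a minimal sketch, assuming the dataset dict above) confirms the relevance labels actually span the scale:
from collections import Counter

score_distribution = Counter(doc["score"] for doc in dataset["documents"])
print(score_distribution)  # For the excerpt above: Counter({3: 3, 1: 1, 0: 1})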
Step 4: Implement benchmark runner
Create a systematic, reproducible evaluation framework:
import numpy as np

class Document:
    """Document with text and relevance score"""
    def __init__(self, id, text, relevance_score):
        self.id = id
        self.text = text
        self.relevance_score = relevance_score

class EmbeddingModel:
    """Abstract embedding model interface"""
    def __init__(self, name):
        self.name = name

    def embed(self, text):
        """Generate embedding for text (implemented by concrete subclasses)"""
        raise NotImplementedError
class BenchmarkRunner:
    """Runs embedding model benchmarks"""
    def __init__(self, queries, documents, models):
        self.queries = queries
        self.documents = documents
        self.models = models

    def run(self, k=10):
        """Run benchmark for all models

        Returns: Dict mapping model names to metrics
        """
        results = {}
        for model in self.models:
            # Get embeddings for all texts
            query_embeddings = {q: model.embed(q) for q in self.queries}
            doc_embeddings = {doc.id: model.embed(doc.text) for doc in self.documents}
            # Calculate metrics for each query
            ndcg_scores = []
            for query, query_emb in query_embeddings.items():
                # Get top k documents by similarity
                top_docs = self._retrieve_top_k(query_emb, doc_embeddings, k)
                # Calculate NDCG
                ndcg = self._calculate_ndcg(top_docs, query, k)
                ndcg_scores.append(ndcg)
            # Store results
            results[model.name] = {
                'avg_ndcg': np.mean(ndcg_scores),
                'all_scores': ndcg_scores
            }
        return results

    def _retrieve_top_k(self, query_emb, doc_embeddings, k):
        """Retrieve top k docs by cosine similarity to the query embedding"""
        def cosine(a, b):
            return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
        ranked = sorted(
            self.documents,
            key=lambda doc: cosine(query_emb, doc_embeddings[doc.id]),
            reverse=True,
        )
        return ranked[:k]

    def _calculate_ndcg(self, retrieved_docs, query, k):
        """Calculate NDCG@k using the functions from Step 2"""
        actual_scores = [doc.relevance_score for doc in retrieved_docs]
        ideal_scores = sorted((doc.relevance_score for doc in self.documents), reverse=True)
        return calculate_ndcg(actual_scores, ideal_scores, k)
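To tie the pieces together, here is a minimal usage sketch. It assumes the sentence-transformers library as the embedding backend, with two publicly available models standing in for Model A and Model B; substitute your own shortlisted models or API clients as needed:
from sentence_transformers import SentenceTransformer

class SentenceTransformerModel(EmbeddingModel):
    """Thin adapter around a sentence-transformers model (illustrative only)"""
    def __init__(self, name):
        super().__init__(name)
        self._model = SentenceTransformer(name)

    def embed(self, text):
        return self._model.encode(text)

documents = [Document(d["id"], d["text"], d["score"]) for d in dataset["documents"]]
runner = BenchmarkRunner(
    queries=[dataset["query"]],
    documents=documents,
    models=[
        SentenceTransformerModel("all-MiniLM-L6-v2"),   # stand-in for Model A
        SentenceTransformerModel("all-mpnet-base-v2"),  # stand-in for Model B
    ],
)
print(runner.run(k=10))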
Step 5: Analyze results comprehensively
Quantitative analysis
Start with overall performance comparison:
# Example benchmark results
results = {
    'Model A': {'avg_ndcg': 0.87, 'all_scores': [0.92, 0.85, 0.84]},
    'Model B': {'avg_ndcg': 0.79, 'all_scores': [0.95, 0.72, 0.70]}
}

# Print summary
for model_name, metrics in results.items():
    print(f"{model_name}: NDCG@10 = {metrics['avg_ndcg']:.4f}")
Look beyond averages:
- Consistency: Model A shows more consistent performance (smaller score variance; quantified in the snippet after this list)
- Peak performance: Model B achieves higher peak scores but with greater variability
- Query-specific patterns: Group scores by query characteristics
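A minimal way to measure that spread, using the example results above:
import numpy as np

for model_name, metrics in results.items():
    scores = metrics['all_scores']
    # Model A's scores cluster tightly; Model B's vary more across queries
    print(f"{model_name}: mean={np.mean(scores):.2f}, std={np.std(scores):.2f}, "
          f"min={min(scores):.2f}, max={max(scores):.2f}")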
Qualitative analysis
Examine actual retrieval results for actionable insights:
Pattern identification:
- Document type preferences: Does a model consistently favor code over text explanations?
- Length bias: Are shorter, highly relevant snippets overlooked for longer documents?
- Language handling: How well are technical terms and domain jargon processed?
- Context understanding: Do models grasp the relationship between questions and code solutions?
Model comparison (a per-query gap check is sketched after this list):
- Where do models disagree most significantly?
- Do models prioritize different relevance aspects?
- Which model better handles your specific domain challenges?
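Assuming both models' per-query scores are stored in the same query order, a simple gap check surfaces the queries worth inspecting by hand:
# Rank queries by the gap between the two models' NDCG scores
scores_a = results['Model A']['all_scores']
scores_b = results['Model B']['all_scores']
gaps = sorted(
    enumerate(zip(scores_a, scores_b)),
    key=lambda item: abs(item[1][0] - item[1][1]),
    reverse=True,
)
for query_idx, (a, b) in gaps:
    print(f"Query {query_idx}: Model A={a:.2f}, Model B={b:.2f}, gap={abs(a - b):.2f}")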
Decision framework
The ideal model balances multiple factors:
Performance metrics: Raw NDCG scores provide an objective ranking
Specific strengths: Performance on your most critical query types
Practical considerations: Cost, latency, and deployment requirements from your initial requirements analysis
Key insight: The best model isn't necessarily the highest-scoring one overall—it's the one that excels at your most important tasks while meeting operational constraints.
Beyond model selection
This evaluation process provides valuable insights beyond just choosing a model:
- Understanding limitations: Know where your chosen model struggles
- System design implications: Design around model strengths and weaknesses
- Performance expectations: Set realistic expectations for production performance
- Future evaluation: Establish baselines for comparing future models
Your custom benchmark becomes a reusable asset for ongoing model evaluation and system improvement.
With your model selected and deployed, the evaluation process doesn't end. Let's explore how to maintain optimal performance through periodic re-evaluation as your needs and the model landscape evolve.