Impact on BM25 & selection guide
Now that you've seen how tokenization affects filtering, let's understand how to select the right method and how it applies to keyword (BM25) search.
Impact on keyword (BM25) search
Tokenization affects keyword search the same way it affects filtering - it determines what tokens are created and compared. The same principles apply:
Query: "clark"
- `word`: matches "Clark:", "clark", "CLARK" (tokens are lowercased and stripped of punctuation)
- `lowercase`: matches "clark", "Clark", "CLARK" (case-insensitive, but symbols are preserved, so "clark:" remains a distinct token)
- `whitespace`: matches only the exact case "clark" (case-sensitive)
- `field`: only for exact full-field matching
Query: "variable_name" in code search
- `word`: matches too broadly - the query splits into "variable" and "name", so it also finds documents containing only "variable" or "name"
- `lowercase`: matches "variable_name", "Variable_Name" (case-insensitive, preserves the underscore)
- `whitespace`: matches only "variable_name" (case-sensitive)
- `field`: matches only the exact full string
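To make the differences concrete, here is a minimal pure-Python sketch of the four tokenization behaviors. These functions are simplified stand-ins for illustration, not Weaviate's actual implementation:

```python
import re

def word(text):
    # Split on any non-alphanumeric character, then lowercase
    return [t.lower() for t in re.split(r"[^0-9A-Za-z]+", text) if t]

def lowercase(text):
    # Split on whitespace only, lowercase, keep symbols like _ and .
    return [t.lower() for t in text.split()]

def whitespace(text):
    # Split on whitespace only, preserve case
    return text.split()

def field(text):
    # Treat the entire (trimmed) value as a single token
    return [text.strip()]

print(word("variable_name"))       # splits into two tokens
print(lowercase("Variable_Name"))  # one token, underscore kept
print(whitespace("Clark"))         # case preserved, so "clark" won't match
```

Running this shows why `word` matches identifiers too broadly while `lowercase` keeps them intact.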
BM25 scoring differences
Different tokenization produces different scores:
- More tokens (e.g., `word` splitting "variable_name") → more potential matches, potentially lower precision
- Fewer tokens (e.g., `lowercase` preserving "variable_name") → fewer but more precise matches
- Case sensitivity (`whitespace`) → even more precise, but may miss valid matches
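The precision trade-off can be sketched with a toy candidate-matching example. This is an illustrative simulation (the tokenizer functions and corpus are invented for this sketch), not BM25 itself, but the effect on the candidate set is the same one BM25 scoring sees:

```python
import re

docs = [
    "set the variable_name field",   # contains the exact identifier
    "a name for the variable",       # contains only the split tokens
    "unrelated text",
]

def word_tokens(text):
    # word-style: split on non-alphanumerics, lowercase
    return [t.lower() for t in re.split(r"[^0-9A-Za-z]+", text) if t]

def lowercase_tokens(text):
    # lowercase-style: split on whitespace, lowercase, keep symbols
    return [t.lower() for t in text.split()]

def candidates(query, tokenize):
    # A document is a candidate if it shares any token with the query
    q = set(tokenize(query))
    return [d for d in docs if q & set(tokenize(d))]

# word splits "variable_name" into common tokens -> broader, less precise recall
print(len(candidates("variable_name", word_tokens)))       # 2 candidates
# lowercase keeps the identifier whole -> only the exact-identifier doc
print(len(candidates("variable_name", lowercase_tokens)))  # 1 candidate
```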
Selection guide
Choose tokenization based on whether symbols and case have meaning in your data.
Decision flow:
1. Do symbols have meaning in your data?
   - No → use `word` (default)
   - Yes → continue to #2
2. Does case matter?
   - No → use `lowercase`
   - Yes → use `whitespace`
3. Need exact full-string matching?
   - Yes → use `field`
Examples of use cases
| Data Type | Recommended | Reason |
|---|---|---|
| General text/prose | word | Case-insensitive, ignores punctuation |
| Email addresses | lowercase | Preserves @ and . symbols |
| Code snippets | lowercase or whitespace | Preserves _ and . |
| Product SKUs | whitespace or field | Case may matter |
| URLs | lowercase | Preserves structure |
| Proper names | whitespace | Case distinguishes meaning |
| Category IDs | field | Exact matching only |
Example implementations
General content search
For most general text, `word` tokenization is the default and the right choice.
```python
from weaviate.classes.config import Property, DataType, Tokenization

# Assumes a connected client, e.g. client = weaviate.connect_to_local()
client.collections.create(
    name="Articles",
    properties=[
        Property(
            name="title",
            data_type=DataType.TEXT,
            tokenization=Tokenization.WORD  # Default for prose
        ),
        Property(
            name="content",
            data_type=DataType.TEXT,
            tokenization=Tokenization.WORD
        )
    ]
)
```
Code search
For code, you may wish to preserve symbols. Any accompanying description, however, should still be tokenized as natural language.
```python
client.collections.create(
    name="CodeSnippets",
    properties=[
        Property(
            name="function_name",
            data_type=DataType.TEXT,
            tokenization=Tokenization.LOWERCASE  # Preserve underscores
        ),
        Property(
            name="file_path",
            data_type=DataType.TEXT,
            tokenization=Tokenization.LOWERCASE  # Preserve dots and slashes
        ),
        Property(
            name="description",
            data_type=DataType.TEXT,
            tokenization=Tokenization.WORD  # Natural language
        )
    ]
)
```
E-commerce products
For e-commerce products, the type of text being stored may determine the tokenization method.
```python
client.collections.create(
    name="Products",
    properties=[
        Property(
            name="name",
            data_type=DataType.TEXT,
            tokenization=Tokenization.WORD  # Natural language
        ),
        Property(
            name="sku",
            data_type=DataType.TEXT,
            tokenization=Tokenization.WHITESPACE  # Case may matter
        ),
        Property(
            name="category",
            data_type=DataType.TEXT,
            tokenization=Tokenization.FIELD  # Exact matching
        )
    ]
)
```
Key takeaways
- Tokenization affects both filtering and keyword search - choose based on your data semantics
- `word` is the default and right for most general text
- `lowercase` and `whitespace` preserve symbols - better for technical content and identifiers
- `field` is very strict - use sparingly for exact matching only (IDs, categories)
- Different properties can use different tokenization - mix and match based on each property's meaning
Use word tokenization for most text properties. Only switch to lowercase, whitespace, or field when you have specific requirements around symbols or case sensitivity.
You now understand search type selection and tokenization configuration. In the next module, we'll explore optimization techniques including property boosting, re-rankers, and hybrid search tuning.