
Impact on BM25 & selection guide

Now that you've seen how tokenization affects filtering, let's look at how it applies to keyword (BM25) search and how to select the right method.

Tokenization affects keyword search the same way it affects filtering: it determines which tokens are created and compared. The same principles apply:

Query: "clark"

  • word: Matches "Clark:", "clark", "CLARK" (normalized)
  • lowercase: Matches "clark", "Clark", "CLARK" (case-insensitive; preserves symbols, so "clark:" remains a distinct token and does not match)
  • whitespace: Only matches exact case "clark" (case-sensitive)
  • field: Only for exact full-field matching

Query: "variable_name" in code search

  • word: Matches too broadly - finds "variable", "name", "variable_name"
  • lowercase: Matches "variable_name", "Variable_Name" (case-insensitive, preserves underscore)
  • whitespace: Matches only "variable_name" (case-sensitive)
  • field: Matches only exact full string
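The matching behavior above can be sketched with simplified, illustrative re-implementations of the four schemes. These are approximations for demonstration only - Weaviate's actual tokenizers run server-side - but they follow the documented rules (word: split on non-alphanumeric characters and lowercase; lowercase: split on whitespace and lowercase; whitespace: split on whitespace, keep case; field: keep the whole value as one token):

```python
import re

def tokenize(text: str, method: str) -> list[str]:
    """Illustrative approximations of Weaviate's tokenization schemes."""
    if method == "word":        # split on non-alphanumeric, lowercase
        return [t.lower() for t in re.split(r"[^0-9A-Za-z]+", text) if t]
    if method == "lowercase":   # split on whitespace only, lowercase
        return [t.lower() for t in text.split()]
    if method == "whitespace":  # split on whitespace, keep case
        return text.split()
    if method == "field":       # whole field as a single token
        return [text]
    raise ValueError(f"unknown method: {method}")

def matches(query: str, doc: str, method: str) -> bool:
    # A keyword match occurs when any query token appears among the doc tokens
    doc_tokens = set(tokenize(doc, method))
    return any(tok in doc_tokens for tok in tokenize(query, method))

# Query "clark":
assert matches("clark", "Clark:", "word")           # word normalizes "Clark:" to "clark"
assert not matches("clark", "Clark:", "lowercase")  # colon preserved: token is "clark:"
assert matches("clark", "CLARK", "lowercase")       # case-insensitive
assert not matches("clark", "Clark", "whitespace")  # case-sensitive

# Query "variable_name":
assert matches("variable_name", "the variable x", "word")       # split on "_" matches broadly
assert matches("variable_name", "Variable_Name", "lowercase")   # underscore preserved
assert not matches("variable_name", "variable name", "lowercase")
```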

BM25 scoring differences

Different tokenization produces different scores:

  • More tokens (e.g., word splitting "variable_name") → more potential matches, potentially lower precision
  • Fewer tokens (e.g., lowercase preserving "variable_name") → fewer but more precise matches
  • Case sensitivity (whitespace) → even more precise, but may miss valid matches
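A toy Okapi BM25 scorer makes the trade-off concrete. This is not Weaviate's actual implementation - it uses the textbook formula with standard k1/b defaults and the same illustrative tokenizers as above - but it shows how word splitting the query "variable_name" matches more documents than lowercase does:

```python
import math
import re

def tokenize(text: str, method: str) -> list[str]:
    if method == "word":  # split on non-alphanumeric, lowercase
        return [t.lower() for t in re.split(r"[^0-9A-Za-z]+", text) if t]
    return [t.lower() for t in text.split()]  # "lowercase": split on whitespace

def bm25_scores(query: str, docs: list[str], method: str,
                k1: float = 1.2, b: float = 0.75) -> list[float]:
    """Textbook Okapi BM25 over a tiny in-memory corpus (illustration only)."""
    corpus = [tokenize(d, method) for d in docs]
    n = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n
    scores = []
    for doc in corpus:
        score = 0.0
        for term in tokenize(query, method):
            df = sum(term in d for d in corpus)  # document frequency
            if df == 0:
                continue
            idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
            tf = doc.count(term)
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(score)
    return scores

docs = ["def variable_name", "the name of the variable", "unrelated text here"]
word_scores = bm25_scores("variable_name", docs, "word")
lower_scores = bm25_scores("variable_name", docs, "lowercase")
# word: query splits into "variable" + "name", so the first two docs both score > 0
# lowercase: the underscore is preserved, so only the first doc scores > 0
```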

Selection guide

Choose tokenization based on whether symbols and case have meaning in your data.

Decision flow:

  1. Do symbols have meaning in your data?

    • No → Use word (default)
    • Yes → Continue to #2
  2. Does case matter?

    • No → Use lowercase
    • Yes → Use whitespace
  3. Need exact full-string matching?

    • Yes → Use field
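The decision flow above can be captured in a small helper. The function name and flags here are illustrative, not part of any Weaviate API:

```python
def recommend_tokenization(symbols_matter: bool,
                           case_matters: bool = False,
                           exact_field_match: bool = False) -> str:
    """Encode the tokenization decision flow (illustrative helper)."""
    if exact_field_match:          # step 3: exact full-string matching
        return "field"
    if not symbols_matter:         # step 1: symbols carry no meaning
        return "word"
    # step 2: symbols matter, so choose by case sensitivity
    return "whitespace" if case_matters else "lowercase"

# Examples mirroring the use-case table below:
assert recommend_tokenization(symbols_matter=False) == "word"            # prose
assert recommend_tokenization(symbols_matter=True) == "lowercase"        # emails, URLs
assert recommend_tokenization(symbols_matter=True,
                              case_matters=True) == "whitespace"         # SKUs
assert recommend_tokenization(symbols_matter=True,
                              exact_field_match=True) == "field"         # category IDs
```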

Examples of use cases

  Data Type            Recommended               Reason
  General text/prose   word                      Case-insensitive, ignores punctuation
  Email addresses      lowercase                 Preserves @ and . symbols
  Code snippets        lowercase or whitespace   Preserves _ and .
  Product SKUs         whitespace or field       Case may matter
  URLs                 lowercase                 Preserves structure
  Proper names         whitespace                Case distinguishes meaning
  Category IDs         field                     Exact matching only

Example implementations

For most general text, word tokenization is both the default and the right choice.

from weaviate.classes.config import Property, DataType, Tokenization

client.collections.create(
    name="Articles",
    properties=[
        Property(
            name="title",
            data_type=DataType.TEXT,
            tokenization=Tokenization.WORD,  # Default for prose
        ),
        Property(
            name="content",
            data_type=DataType.TEXT,
            tokenization=Tokenization.WORD,
        ),
    ],
)

For code, you may wish to preserve the symbols. However, any accompanying description should be tokenized as natural language.

client.collections.create(
    name="CodeSnippets",
    properties=[
        Property(
            name="function_name",
            data_type=DataType.TEXT,
            tokenization=Tokenization.LOWERCASE,  # Preserve underscores
        ),
        Property(
            name="file_path",
            data_type=DataType.TEXT,
            tokenization=Tokenization.LOWERCASE,  # Preserve dots and slashes
        ),
        Property(
            name="description",
            data_type=DataType.TEXT,
            tokenization=Tokenization.WORD,  # Natural language
        ),
    ],
)

E-commerce products

For e-commerce products, the type of text being stored may determine the tokenization method.

client.collections.create(
    name="Products",
    properties=[
        Property(
            name="name",
            data_type=DataType.TEXT,
            tokenization=Tokenization.WORD,  # Natural language
        ),
        Property(
            name="sku",
            data_type=DataType.TEXT,
            tokenization=Tokenization.WHITESPACE,  # Case may matter
        ),
        Property(
            name="category",
            data_type=DataType.TEXT,
            tokenization=Tokenization.FIELD,  # Exact matching
        ),
    ],
)

Key takeaways

  1. Tokenization affects both filtering and keyword search - choose based on your data semantics
  2. word is the default and right for most general text
  3. lowercase and whitespace preserve symbols - better for technical content and identifiers
  4. field is very strict - use sparingly for exact matching only (IDs, categories)
  5. Different properties can use different tokenization - mix and match based on each property's meaning

Start with defaults

Use word tokenization for most text properties. Only switch to lowercase, whitespace, or field when you have specific requirements around symbols or case sensitivity.

What's next?

You now understand search type selection and tokenization configuration. In the next module, we'll explore optimization techniques including property boosting, re-rankers, and hybrid search tuning.
