
Impact on BM25 & selection guide

Now that you've seen how tokenization affects filtering, let's look at how it applies to keyword (BM25) search and how to select the right method.

Tokenization affects keyword search the same way it affects filtering: it determines which tokens are created and compared. The same principles apply:

Query: "clark"

  • word: Matches "Clark:", "clark", "CLARK" (normalized)
  • lowercase: Matches "clark", "Clark", "CLARK" (case-insensitive; preserves symbols, so "clark:" remains a distinct token and does not match)
  • whitespace: Only matches exact case "clark" (case-sensitive)
  • field: Only for exact full-field matching

Query: "variable_name" in code search

  • word: Matches too broadly - finds "variable", "name", "variable_name"
  • lowercase: Matches "variable_name", "Variable_Name" (case-insensitive, preserves underscore)
  • whitespace: Matches only "variable_name" (case-sensitive)
  • field: Matches only exact full string
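The matching behavior above can be sketched with simplified, illustrative re-implementations of the four schemes. These are approximations for demonstration only - Weaviate's actual tokenizers run server-side - but they follow the documented rules (word: split on non-alphanumeric characters and lowercase; lowercase: split on whitespace and lowercase; whitespace: split on whitespace, keep case; field: keep the whole value as one token):

```python
import re

def tokenize(text: str, method: str) -> list[str]:
    """Illustrative approximations of Weaviate's tokenization schemes."""
    if method == "word":        # split on non-alphanumeric, lowercase
        return [t.lower() for t in re.split(r"[^0-9A-Za-z]+", text) if t]
    if method == "lowercase":   # split on whitespace only, lowercase
        return [t.lower() for t in text.split()]
    if method == "whitespace":  # split on whitespace, keep case
        return text.split()
    if method == "field":       # whole field as a single token
        return [text]
    raise ValueError(f"unknown method: {method}")

def matches(query: str, doc: str, method: str) -> bool:
    # A keyword match occurs when any query token appears among the doc tokens
    doc_tokens = set(tokenize(doc, method))
    return any(tok in doc_tokens for tok in tokenize(query, method))

# Query "clark":
assert matches("clark", "Clark:", "word")           # word normalizes "Clark:" to "clark"
assert not matches("clark", "Clark:", "lowercase")  # colon preserved: token is "clark:"
assert matches("clark", "CLARK", "lowercase")       # case-insensitive
assert not matches("clark", "Clark", "whitespace")  # case-sensitive

# Query "variable_name":
assert matches("variable_name", "the variable x", "word")       # split on "_" matches broadly
assert matches("variable_name", "Variable_Name", "lowercase")   # underscore preserved
assert not matches("variable_name", "variable name", "lowercase")
```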

BM25 scoring differences

Different tokenization produces different scores:

  • More tokens (e.g., word splitting "variable_name") → more potential matches, potentially lower precision
  • Fewer tokens (e.g., lowercase preserving "variable_name") → fewer but more precise matches
  • Case sensitivity (whitespace) → even more precise, but may miss valid matches
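A toy Okapi BM25 scorer makes the trade-off concrete. This is not Weaviate's actual implementation - it uses the textbook formula with standard k1/b defaults and the same illustrative tokenizers as above - but it shows how word splitting the query "variable_name" matches more documents than lowercase does:

```python
import math
import re

def tokenize(text: str, method: str) -> list[str]:
    if method == "word":  # split on non-alphanumeric, lowercase
        return [t.lower() for t in re.split(r"[^0-9A-Za-z]+", text) if t]
    return [t.lower() for t in text.split()]  # "lowercase": split on whitespace

def bm25_scores(query: str, docs: list[str], method: str,
                k1: float = 1.2, b: float = 0.75) -> list[float]:
    """Textbook Okapi BM25 over a tiny in-memory corpus (illustration only)."""
    corpus = [tokenize(d, method) for d in docs]
    n = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n
    scores = []
    for doc in corpus:
        score = 0.0
        for term in tokenize(query, method):
            df = sum(term in d for d in corpus)  # document frequency
            if df == 0:
                continue
            idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
            tf = doc.count(term)
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(score)
    return scores

docs = ["def variable_name", "the name of the variable", "unrelated text here"]
word_scores = bm25_scores("variable_name", docs, "word")
lower_scores = bm25_scores("variable_name", docs, "lowercase")
# word: query splits into "variable" + "name", so the first two docs both score > 0
# lowercase: the underscore is preserved, so only the first doc scores > 0
```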

Selection guide

Choose tokenization based on whether symbols and case have meaning in your data.

Decision flow:

  1. Do symbols have meaning in your data?

    • No → Use word (default)
    • Yes → Continue to #2
  2. Does case matter?

    • No → Use lowercase
    • Yes → Use whitespace
  3. Need exact full-string matching?

    • Yes → Use field
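The decision flow above can be captured in a small helper. The function name and flags here are illustrative, not part of any Weaviate API:

```python
def recommend_tokenization(symbols_matter: bool,
                           case_matters: bool = False,
                           exact_field_match: bool = False) -> str:
    """Encode the tokenization decision flow (illustrative helper)."""
    if exact_field_match:          # step 3: exact full-string matching
        return "field"
    if not symbols_matter:         # step 1: symbols carry no meaning
        return "word"
    # step 2: symbols matter, so choose by case sensitivity
    return "whitespace" if case_matters else "lowercase"

# Examples mirroring the use-case table below:
assert recommend_tokenization(symbols_matter=False) == "word"            # prose
assert recommend_tokenization(symbols_matter=True) == "lowercase"        # emails, URLs
assert recommend_tokenization(symbols_matter=True,
                              case_matters=True) == "whitespace"         # SKUs
assert recommend_tokenization(symbols_matter=True,
                              exact_field_match=True) == "field"         # category IDs
```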

Examples of use cases

  Data Type            Recommended               Reason
  General text/prose   word                      Case-insensitive, ignores punctuation
  Email addresses      lowercase                 Preserves @ and . symbols
  Code snippets        lowercase or whitespace   Preserves _ and .
  Product SKUs         whitespace or field       Case may matter
  URLs                 lowercase                 Preserves structure
  Proper names         whitespace                Case distinguishes meaning
  Category IDs         field                     Exact matching only

Example implementations

For most general text, word tokenization is both the default and the right choice.

from weaviate.classes.config import Property, DataType, Tokenization

client.collections.create(
    name="Articles",
    properties=[
        Property(
            name="title",
            data_type=DataType.TEXT,
            tokenization=Tokenization.WORD,  # Default for prose
        ),
        Property(
            name="content",
            data_type=DataType.TEXT,
            tokenization=Tokenization.WORD,
        ),
    ],
)

For code, you may wish to preserve the symbols. However, any accompanying description should be tokenized as natural language.

client.collections.create(
    name="CodeSnippets",
    properties=[
        Property(
            name="function_name",
            data_type=DataType.TEXT,
            tokenization=Tokenization.LOWERCASE,  # Preserve underscores
        ),
        Property(
            name="file_path",
            data_type=DataType.TEXT,
            tokenization=Tokenization.LOWERCASE,  # Preserve dots and slashes
        ),
        Property(
            name="description",
            data_type=DataType.TEXT,
            tokenization=Tokenization.WORD,  # Natural language
        ),
    ],
)

E-commerce products

For e-commerce products, the type of text being stored may determine the tokenization method.

client.collections.create(
    name="Products",
    properties=[
        Property(
            name="name",
            data_type=DataType.TEXT,
            tokenization=Tokenization.WORD,  # Natural language
        ),
        Property(
            name="sku",
            data_type=DataType.TEXT,
            tokenization=Tokenization.WHITESPACE,  # Case may matter
        ),
        Property(
            name="category",
            data_type=DataType.TEXT,
            tokenization=Tokenization.FIELD,  # Exact matching
        ),
    ],
)

Key takeaways

  1. Tokenization affects both filtering and keyword search - choose based on your data semantics
  2. word is the default and right for most general text
  3. lowercase and whitespace preserve symbols - better for technical content and identifiers
  4. field is very strict - use sparingly for exact matching only (IDs, categories)
  5. Different properties can use different tokenization - mix and match based on each property's meaning

Start with defaults

Use word tokenization for most text properties. Only switch to lowercase, whitespace, or field when you have specific requirements around symbols or case sensitivity.

What's next?

You now understand search type selection and tokenization configuration. In the next module, we'll explore optimization techniques including property boosting, re-rankers, and hybrid search tuning.
