Tokenization behavior

Tokenization is the process of breaking text into searchable units. It happens during BOTH indexing (when data is added) and querying (when you search). As a result, the choice of tokenization method has a significant impact on the outcome of keyword search and filtering.

Scope

Tokenization here refers to inverted index tokenization used for keyword search and filtering. This is completely separate from tokenization for embedding or generative AI models, which happens in input processing.

The tokenization problem

Consider this TV show title: "Lois & Clark: The New Adventures of Superman"

Question: Should a filter for "clark" match this title?

  • Split on punctuation and lowercase: "Clark:" is indexed as "clark" ✅ Matches
  • Preserve punctuation: "Clark:" is indexed as "clark:" ❌ Doesn't match
  • Preserve case and punctuation: "Clark:" is indexed as "Clark:" ❌ Doesn't match

Tokenization lets you control this behavior.

Tokenization methods

Method       Splits on          Case         Symbols      Example: "Lois & Clark:"
word         Non-alphanumeric   Lowercased   Removed      ["lois", "clark"]
lowercase    Whitespace         Lowercased   Preserved    ["lois", "&", "clark:"]
whitespace   Whitespace         Preserved    Preserved    ["Lois", "&", "Clark:"]
field        None               Preserved    Preserved    ["Lois & Clark:"]
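
To build intuition for the table above, here is a rough, illustrative approximation of the four methods in plain Python. This is not Weaviate's actual tokenizer (which runs server-side and handles more edge cases, such as stop words), but it reproduces the example column:

import re

def tokenize(text: str, method: str) -> list[str]:
    """Rough approximation of Weaviate's tokenization methods (illustration only)."""
    if method == "word":
        # Split on any non-alphanumeric character, lowercase, drop symbols
        return [t.lower() for t in re.split(r"[^a-zA-Z0-9]+", text) if t]
    if method == "lowercase":
        # Split on whitespace and lowercase, keeping symbols
        return [t.lower() for t in text.split()]
    if method == "whitespace":
        # Split on whitespace only, keeping case and symbols
        return text.split()
    if method == "field":
        # No splitting: the whole value is a single token
        return [text]
    raise ValueError(f"Unknown method: {method}")

for method in ["word", "lowercase", "whitespace", "field"]:
    print(f"{method:12} {tokenize('Lois & Clark:', method)}")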

Testing tokenization behavior

Let's create a test collection with the same text stored using different tokenization methods, then see how filtering behaves.

Setup test collection

from weaviate.classes.config import Configure, Property, DataType, Tokenization

# Create collection with multiple tokenization methods for comparison
demo = client.collections.create(
    name="TokenizationDemo",
    vector_config=Configure.Vectors.self_provided(),  # No vectors needed
    properties=[
        Property(
            name="text_word",
            data_type=DataType.TEXT,
            tokenization=Tokenization.WORD  # Default
        ),
        Property(
            name="text_lowercase",
            data_type=DataType.TEXT,
            tokenization=Tokenization.LOWERCASE
        ),
        Property(
            name="text_whitespace",
            data_type=DataType.TEXT,
            tokenization=Tokenization.WHITESPACE
        ),
        Property(
            name="text_field",
            data_type=DataType.TEXT,
            tokenization=Tokenization.FIELD
        ),
    ]
)

Add test data

# Test strings designed to reveal tokenization differences
test_strings = [
    "Lois & Clark: The New Adventures of Superman",
    "computer mouse",
    "variable_name",
]

# Insert the same text into all properties
demo = client.collections.use("TokenizationDemo")

for text in test_strings:
    test_obj = {
        "text_word": text,
        "text_lowercase": text,
        "text_whitespace": text,
        "text_field": text
    }
    demo.data.insert(properties=test_obj)
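
If you want to confirm the inserts succeeded, you can run a quick aggregation. This sketch assumes the collection created above:

# Quick sanity check: count the inserted objects
response = demo.aggregate.over_all(total_count=True)
print(response.total_count)  # Expected: 3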

Test function

Here's a function to test how different tokenizations handle the same query:

from weaviate.classes.query import Filter

def test_tokenization(query_string: str):
    """
    Filter each property and see which tokenization methods match
    """
    demo = client.collections.use("TokenizationDemo")

    for prop in ["text_word", "text_lowercase", "text_whitespace", "text_field"]:
        response = demo.query.fetch_objects(
            filters=Filter.by_property(prop).equal(query_string),
        )
        match_status = "✅ MATCH" if len(response.objects) > 0 else "❌ NO MATCH"
        print(f"{prop:20} {match_status}")
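
For example, you can run the function against each of the "Clark" variations tested below:

# Compare how each tokenization method handles each query variation
for query in ["clark", "Clark", "clark:", "Clark:"]:
    print(f'\nQuery: "{query}"')
    test_tokenization(query)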

Results: "Clark" variations

Testing with data containing "Lois & Clark: The New Adventures of Superman":

Query: "clark" (lowercase, no punctuation)

text_word            ✅ MATCH
text_lowercase       ❌ NO MATCH
text_whitespace      ❌ NO MATCH
text_field           ❌ NO MATCH

Analysis:

  • word: Lowercases and strips punctuation → token is "clark" → matches
  • lowercase: Token is "clark:" (preserves colon) → doesn't match
  • whitespace: Token is "Clark:" (preserves case and colon) → doesn't match
  • field: Requires entire string match

Query: "Clark" (capitalized, no punctuation)

text_word            ✅ MATCH
text_lowercase       ❌ NO MATCH
text_whitespace      ❌ NO MATCH
text_field           ❌ NO MATCH

Analysis:

  • word: Lowercases query too → still matches "clark"
  • Others still don't match due to punctuation or case

Query: "clark:" (lowercase with colon)

text_word            ✅ MATCH
text_lowercase       ✅ MATCH
text_whitespace      ❌ NO MATCH
text_field           ❌ NO MATCH

Analysis:

  • word: Strips punctuation from both → still matches
  • lowercase: Now matches! Both are "clark:"
  • whitespace: Case still doesn't match

Query: "Clark:" (capitalized with colon)

text_word            ✅ MATCH
text_lowercase       ✅ MATCH
text_whitespace      ✅ MATCH
text_field           ❌ NO MATCH

Analysis:

  • Three methods now match!
  • Only field doesn't match (requires full string)

Results: Symbols - "variable_name"

Testing with data containing "variable_name":

Query: "variable"

text_word            ✅ MATCH
text_lowercase       ❌ NO MATCH
text_whitespace      ❌ NO MATCH
text_field           ❌ NO MATCH

Analysis:

  • word: Treats the underscore as a separator → tokens are ["variable", "name"] → "variable" matches
  • Others: Preserve the underscore → the only token is "variable_name" → "variable" alone doesn't match

Query: "variable_name"

text_word            ✅ MATCH
text_lowercase       ✅ MATCH
text_whitespace      ✅ MATCH
text_field           ✅ MATCH

All match for exact query.

Query: "Variable_Name"

text_word            ✅ MATCH
text_lowercase       ✅ MATCH
text_whitespace      ❌ NO MATCH
text_field           ❌ NO MATCH

Analysis:

  • word and lowercase: Case-insensitive → match
  • whitespace and field: Case-sensitive → don't match

Implication for code search:

  • word is too flexible: matches "variable" alone when data contains "variable_name"
  • lowercase or whitespace preserve identifier semantics better (see the sketch below)
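
For example, a property that stores code identifiers could use whitespace tokenization so that "variable_name" is only matched by the full, case-sensitive identifier. This is a minimal sketch; the property name "code_snippet" is an illustrative choice, not part of the demo collection above:

from weaviate.classes.config import Property, DataType, Tokenization

# Hypothetical property for code identifiers: whitespace tokenization
# preserves case and symbols, so "variable_name" only matches the full identifier
code_property = Property(
    name="code_snippet",  # illustrative property name
    data_type=DataType.TEXT,
    tokenization=Tokenization.WHITESPACE,  # or Tokenization.FIELD for whole-value matching
)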

Results: Stop words

Stop words

Stop words are words that are ignored during keyword indexing and filtering because they contribute little to the meaning of the text. They are typically common words like "the", "a", "an", and so on. By default, Weaviate uses an English stop word list, which can be customized per collection in the inverted index configuration.
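
As a sketch of how the stop word list might be customized when creating a collection (assuming the v4 Python client's inverted index options; the collection name and the additions/removals shown are purely illustrative):

from weaviate.classes.config import Configure, StopwordsPreset

# Illustrative sketch: customize the stop word list for a collection
client.collections.create(
    name="StopwordDemo",  # hypothetical collection name
    vector_config=Configure.Vectors.self_provided(),
    inverted_index_config=Configure.inverted_index(
        stopwords_preset=StopwordsPreset.EN,   # default English preset
        stopwords_additions=["mighty"],        # also treat "mighty" as a stop word
        stopwords_removals=["a"],              # keep "a" as a searchable token
    ),
)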

Testing with data containing "computer mouse":

Query: "computer mouse"

text_word            ✅ MATCH
text_lowercase       ✅ MATCH
text_whitespace      ✅ MATCH
text_field           ✅ MATCH

All match for exact query.

Query: "a computer mouse"

text_word            ✅ MATCH
text_lowercase       ✅ MATCH
text_whitespace      ✅ MATCH
text_field           ❌ NO MATCH

Analysis:

  • First three match because .equal() checks if query tokens are present
  • Data has tokens ["computer", "mouse"]
  • Query has tokens ["a", "computer", "mouse"]
  • "a" is a stop word (ignored), "computer" and "mouse" are found → match
  • field requires exact full string

Query: "the mighty computer mouse"

text_word            ❌ NO MATCH
text_lowercase       ❌ NO MATCH
text_whitespace      ❌ NO MATCH
text_field           ❌ NO MATCH

Analysis:

  • "mighty" is not in the data → no match
  • Stop words are ignored, but other non-matching words still cause failure

What's next?

Now that you've seen how tokenization affects filtering, let's look at how to select the right method for your data and how tokenization applies to keyword (BM25) search.
