Tokenization behavior
Tokenization is the process of breaking text into searchable units. It happens during BOTH indexing (when data is added) and querying (when you search). As a result, the choice of tokenization method has a significant impact on the outcome of keyword search and filtering.
Tokenization here refers to the inverted index tokenization used for keyword search and filtering. It is completely separate from tokenization for embedding or generative AI models, which happens during input processing.
The tokenization problem
Consider this TV show title: "Lois & Clark: The New Adventures of Superman"
Question: Should a filter for "clark" match this title?
- Split by punctuation and lowercase: `"Clark:"` → `"clark"` → ✅ Matches
- Preserve punctuation: `"Clark:"` → `"clark:"` → ❌ Doesn't match
- Preserve case and punctuation: `"Clark:"` → `"Clark:"` → ❌ Doesn't match
Tokenization lets you control this behavior.
Tokenization methods
| Method | Splits on | Case | Symbols | Example: `"Lois & Clark:"` |
|---|---|---|---|---|
| `word` | Non-alphanumeric | Lowercased | Removed | `["lois", "clark"]` |
| `lowercase` | Whitespace | Lowercased | Preserved | `["lois", "&", "clark:"]` |
| `whitespace` | Whitespace | Preserved | Preserved | `["Lois", "&", "Clark:"]` |
| `field` | None | Preserved | Preserved | `["Lois & Clark:"]` |
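To build intuition for the table, here is a rough Python sketch that approximates each method's behavior. This is an illustration only, not Weaviate's actual tokenizer implementation:

```python
import re

def approx_tokenize(text: str, method: str) -> list[str]:
    """Rough approximation of each tokenization method, for illustration only."""
    if method == "word":
        # Split on any non-alphanumeric character, lowercase the tokens
        return [t.lower() for t in re.split(r"[^a-zA-Z0-9]+", text) if t]
    if method == "lowercase":
        # Split on whitespace only, lowercase the tokens
        return text.lower().split()
    if method == "whitespace":
        # Split on whitespace only, preserve case
        return text.split()
    if method == "field":
        # No splitting at all: the whole string is one token
        return [text]
    raise ValueError(f"Unknown method: {method}")

title = "Lois & Clark: The New Adventures of Superman"
for method in ["word", "lowercase", "whitespace", "field"]:
    print(f"{method:12} {approx_tokenize(title, method)}")
```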
Testing tokenization behavior
Let's create a test collection with the same text stored using different tokenization methods, then see how filtering behaves.
Setup test collection
```python
from weaviate.classes.config import Configure, Property, DataType, Tokenization

# Create collection with multiple tokenization methods for comparison
demo = client.collections.create(
    name="TokenizationDemo",
    vector_config=Configure.Vectors.self_provided(),  # No vectors needed
    properties=[
        Property(
            name="text_word",
            data_type=DataType.TEXT,
            tokenization=Tokenization.WORD,  # Default
        ),
        Property(
            name="text_lowercase",
            data_type=DataType.TEXT,
            tokenization=Tokenization.LOWERCASE,
        ),
        Property(
            name="text_whitespace",
            data_type=DataType.TEXT,
            tokenization=Tokenization.WHITESPACE,
        ),
        Property(
            name="text_field",
            data_type=DataType.TEXT,
            tokenization=Tokenization.FIELD,
        ),
    ],
)
```
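If you want to confirm that each property was created with the intended tokenization, you can read the configuration back. A quick sanity check, with attribute names as exposed by the v4 Python client:

```python
# Read the collection configuration back and print each property's tokenization
config = demo.config.get()
for prop in config.properties:
    print(f"{prop.name:20} {prop.tokenization}")
```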
Add test data
```python
# Test strings designed to reveal tokenization differences
test_strings = [
    "Lois & Clark: The New Adventures of Superman",
    "computer mouse",
    "variable_name",
]

# Insert the same text into all four properties
demo = client.collections.use("TokenizationDemo")
for text in test_strings:
    test_obj = {
        "text_word": text,
        "text_lowercase": text,
        "text_whitespace": text,
        "text_field": text,
    }
    demo.data.insert(properties=test_obj)
```
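As a quick check that the inserts succeeded, you can count the objects in the collection:

```python
# Expect one object per test string
result = demo.aggregate.over_all(total_count=True)
print(result.total_count)  # 3
```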
Test function
Here's a function to test how different tokenizations handle the same query:
```python
from weaviate.classes.query import Filter

def test_tokenization(query_string: str):
    """Filter on each property and report which tokenization methods match."""
    demo = client.collections.use("TokenizationDemo")
    for prop in ["text_word", "text_lowercase", "text_whitespace", "text_field"]:
        response = demo.query.fetch_objects(
            filters=Filter.by_property(prop).equal(query_string),
        )
        match_status = "✅ MATCH" if len(response.objects) > 0 else "❌ NO MATCH"
        print(f"{prop:20} {match_status}")
```
Results: "Clark" variations
Testing with data containing "Lois & Clark: The New Adventures of Superman":
Query: "clark" (lowercase, no punctuation)
text_word ✅ MATCH
text_lowercase ❌ NO MATCH
text_whitespace ❌ NO MATCH
text_field ❌ NO MATCH
Analysis:
word: Lowercases and strips punctuation → token is"clark"→ matcheslowercase: Token is"clark:"(preserves colon) → doesn't matchwhitespace: Token is"Clark:"(preserves case and colon) → doesn't matchfield: Requires entire string match
Query: "Clark" (capitalized, no punctuation)
text_word ✅ MATCH
text_lowercase ❌ NO MATCH
text_whitespace ❌ NO MATCH
text_field ❌ NO MATCH
Analysis:
word: Lowercases query too → still matches"clark"- Others still don't match due to punctuation or case
Query: "clark:" (lowercase with colon)
text_word ✅ MATCH
text_lowercase ✅ MATCH
text_whitespace ❌ NO MATCH
text_field ❌ NO MATCH
Analysis:
word: Strips punctuation from both → still matcheslowercase: Now matches! Both are"clark:"whitespace: Case still doesn't match
Query: "Clark:" (capitalized with colon)
text_word ✅ MATCH
text_lowercase ✅ MATCH
text_whitespace ✅ MATCH
text_field ❌ NO MATCH
Analysis:
- Three methods now match!
- Only
fielddoesn't match (requires full string)
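To reproduce all four result sets above in one pass, you can loop over the query variations:

```python
# Compare all four "Clark" variations side by side
for query in ["clark", "Clark", "clark:", "Clark:"]:
    print(f'\nQuery: "{query}"')
    test_tokenization(query)
```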
Results: Symbols - "variable_name"
Testing with data containing "variable_name":
Query: "variable"
text_word ✅ MATCH
text_lowercase ❌ NO MATCH
text_whitespace ❌ NO MATCH
text_field ❌ NO MATCH
Analysis:
word: Treats underscore as separator → tokens:["variable", "name"]→ matches "variable"- Others: Preserve underscore → token is
["variable_name"]→ "variable" alone doesn't match
Query: "variable_name"
text_word ✅ MATCH
text_lowercase ✅ MATCH
text_whitespace ✅ MATCH
text_field ✅ MATCH
All match for exact query.
Query: "Variable_Name"
text_word ✅ MATCH
text_lowercase ✅ MATCH
text_whitespace ❌ NO MATCH
text_field ❌ NO MATCH
Analysis:
wordandlowercase: Case-insensitive → matchwhitespaceandfield: Case-sensitive → don't match
Implications for code search:

- `word` is too flexible: it matches `"variable"` alone even when the data contains `"variable_name"`
- `lowercase` or `whitespace` better preserve identifier semantics (see the sketch below)
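For example, a field holding code identifiers could be declared like this (a sketch; the property name `identifier` is hypothetical):

```python
# Sketch: "variable" should NOT match "variable_name",
# but matching should be case-insensitive
Property(
    name="identifier",  # hypothetical property name
    data_type=DataType.TEXT,
    tokenization=Tokenization.LOWERCASE,
)
```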
Results: Stop words
Stop words are common words, such as "the", "a", and "an", that are ignored during matching because they contribute little to the meaning of the text. Weaviate applies a default English stop word list, which you can customize in the collection's inverted index settings.
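As a sketch of how you might customize the stop word list at collection creation time (using the v4 Python client's `Configure.inverted_index` helper; the collection name is hypothetical):

```python
from weaviate.classes.config import Configure

# Sketch: customize the stop word list when creating a collection
client.collections.create(
    name="StopwordDemo",  # hypothetical collection
    inverted_index_config=Configure.inverted_index(
        stopwords_additions=["mighty"],  # additionally treat "mighty" as a stop word
        stopwords_removals=["a"],        # stop treating "a" as a stop word
    ),
)
```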
Testing with data containing "computer mouse":
Query: "computer mouse"
text_word ✅ MATCH
text_lowercase ✅ MATCH
text_whitespace ✅ MATCH
text_field ✅ MATCH
All match for exact query.
Query: "a computer mouse"
text_word ✅ MATCH
text_lowercase ✅ MATCH
text_whitespace ✅ MATCH
text_field ❌ NO MATCH
Analysis:
- First three match because
.equal()checks if query tokens are present - Data has tokens
["computer", "mouse"] - Query has tokens
["a", "computer", "mouse"] "a"is a stop word (ignored),"computer"and"mouse"are found → matchfieldrequires exact full string
Query: "the mighty computer mouse"
text_word ❌ NO MATCH
text_lowercase ❌ NO MATCH
text_whitespace ❌ NO MATCH
text_field ❌ NO MATCH
Analysis:
"mighty"is not in the data → no match- Stop words are ignored, but other non-matching words still cause failure
Now that you've seen how tokenization affects filtering, let's look at how to select the right method for your data and how tokenization applies to keyword (BM25) search.