Tokenization behavior

Tokenization is the process of breaking text into searchable units. It happens during BOTH indexing (when data is added) and querying (when you search). As a result, the choice of tokenization method has a significant impact on the outcome of keyword search and filtering.

Scope

Tokenization here refers to inverted index tokenization used for keyword search and filtering. This is completely separate from tokenization for embedding or generative AI models, which happens in input processing.

The tokenization problem

Consider this TV show title: "Lois & Clark: The New Adventures of Superman"

Question: Should a filter for "clark" match this title?

  • Split on punctuation and lowercase: "Clark:" is indexed as "clark" ✅ Matches
  • Preserve punctuation: "Clark:" is indexed as "clark:" ❌ Doesn't match
  • Preserve case and punctuation: "Clark:" is indexed as "Clark:" ❌ Doesn't match

Tokenization lets you control this behavior.

Tokenization methods

Method       Splits on          Case         Symbols      Example: "Lois & Clark:"
word         Non-alphanumeric   Lowercased   Removed      ["lois", "clark"]
lowercase    Whitespace         Lowercased   Preserved    ["lois", "&", "clark:"]
whitespace   Whitespace         Preserved    Preserved    ["Lois", "&", "Clark:"]
field        None               Preserved    Preserved    ["Lois & Clark:"]
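
To build intuition for the table above, here is a rough, illustrative approximation of the four methods in plain Python. This is not Weaviate's actual tokenizer (which runs server-side and handles more edge cases, such as stop words), but it reproduces the example column:

import re

def tokenize(text: str, method: str) -> list[str]:
    """Rough approximation of Weaviate's tokenization methods (illustration only)."""
    if method == "word":
        # Split on any non-alphanumeric character, lowercase, drop symbols
        return [t.lower() for t in re.split(r"[^a-zA-Z0-9]+", text) if t]
    if method == "lowercase":
        # Split on whitespace and lowercase, keeping symbols
        return [t.lower() for t in text.split()]
    if method == "whitespace":
        # Split on whitespace only, keeping case and symbols
        return text.split()
    if method == "field":
        # No splitting: the whole value is a single token
        return [text]
    raise ValueError(f"Unknown method: {method}")

for method in ["word", "lowercase", "whitespace", "field"]:
    print(f"{method:12} {tokenize('Lois & Clark:', method)}")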

Testing tokenization behavior

Let's create a test collection with the same text stored using different tokenization methods, then see how filtering behaves.

Setup test collection

from weaviate.classes.config import Configure, Property, DataType, Tokenization

# Create collection with multiple tokenization methods for comparison
demo = client.collections.create(
    name="TokenizationDemo",
    vector_config=Configure.Vectors.self_provided(),  # No vectors needed
    properties=[
        Property(
            name="text_word",
            data_type=DataType.TEXT,
            tokenization=Tokenization.WORD  # Default
        ),
        Property(
            name="text_lowercase",
            data_type=DataType.TEXT,
            tokenization=Tokenization.LOWERCASE
        ),
        Property(
            name="text_whitespace",
            data_type=DataType.TEXT,
            tokenization=Tokenization.WHITESPACE
        ),
        Property(
            name="text_field",
            data_type=DataType.TEXT,
            tokenization=Tokenization.FIELD
        ),
    ]
)

Add test data

# Test strings designed to reveal tokenization differences
test_strings = [
    "Lois & Clark: The New Adventures of Superman",
    "computer mouse",
    "variable_name",
]

# Insert the same text into all properties
demo = client.collections.use("TokenizationDemo")

for text in test_strings:
    test_obj = {
        "text_word": text,
        "text_lowercase": text,
        "text_whitespace": text,
        "text_field": text
    }
    demo.data.insert(properties=test_obj)
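
If you want to confirm the inserts succeeded, you can run a quick aggregation. This sketch assumes the collection created above:

# Quick sanity check: count the inserted objects
response = demo.aggregate.over_all(total_count=True)
print(response.total_count)  # Expected: 3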

Test function

Here's a function to test how different tokenizations handle the same query:

from weaviate.classes.query import Filter

def test_tokenization(query_string: str):
    """
    Filter each property and see which tokenization methods match
    """
    demo = client.collections.use("TokenizationDemo")

    for prop in ["text_word", "text_lowercase", "text_whitespace", "text_field"]:
        response = demo.query.fetch_objects(
            filters=Filter.by_property(prop).equal(query_string),
        )
        match_status = "✅ MATCH" if len(response.objects) > 0 else "❌ NO MATCH"
        print(f"{prop:20} {match_status}")
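
For example, you can run the function against each of the "Clark" variations tested below:

# Compare how each tokenization method handles each query variation
for query in ["clark", "Clark", "clark:", "Clark:"]:
    print(f'\nQuery: "{query}"')
    test_tokenization(query)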

Results: "Clark" variations

Testing with data containing "Lois & Clark: The New Adventures of Superman":

Query: "clark" (lowercase, no punctuation)

text_word            ✅ MATCH
text_lowercase       ❌ NO MATCH
text_whitespace      ❌ NO MATCH
text_field           ❌ NO MATCH

Analysis:

  • word: Lowercases and strips punctuation → token is "clark" → matches
  • lowercase: Token is "clark:" (preserves colon) → doesn't match
  • whitespace: Token is "Clark:" (preserves case and colon) → doesn't match
  • field: Requires entire string match

Query: "Clark" (capitalized, no punctuation)

text_word            ✅ MATCH
text_lowercase       ❌ NO MATCH
text_whitespace      ❌ NO MATCH
text_field           ❌ NO MATCH

Analysis:

  • word: Lowercases query too → still matches "clark"
  • Others still don't match due to punctuation or case

Query: "clark:" (lowercase with colon)

text_word            ✅ MATCH
text_lowercase       ✅ MATCH
text_whitespace      ❌ NO MATCH
text_field           ❌ NO MATCH

Analysis:

  • word: Strips punctuation from both → still matches
  • lowercase: Now matches! Both are "clark:"
  • whitespace: Case still doesn't match

Query: "Clark:" (capitalized with colon)

text_word            ✅ MATCH
text_lowercase       ✅ MATCH
text_whitespace      ✅ MATCH
text_field           ❌ NO MATCH

Analysis:

  • Three methods now match!
  • Only field doesn't match (requires full string)

Results: Symbols - "variable_name"

Testing with data containing "variable_name":

Query: "variable"

text_word            ✅ MATCH
text_lowercase       ❌ NO MATCH
text_whitespace      ❌ NO MATCH
text_field           ❌ NO MATCH

Analysis:

  • word: Treats the underscore as a separator → tokens are ["variable", "name"] → "variable" matches
  • Others: Preserve the underscore → the only token is "variable_name" → "variable" alone doesn't match

Query: "variable_name"

text_word            ✅ MATCH
text_lowercase       ✅ MATCH
text_whitespace      ✅ MATCH
text_field           ✅ MATCH

All match for exact query.

Query: "Variable_Name"

text_word            ✅ MATCH
text_lowercase       ✅ MATCH
text_whitespace      ❌ NO MATCH
text_field           ❌ NO MATCH

Analysis:

  • word and lowercase: Case-insensitive → match
  • whitespace and field: Case-sensitive → don't match

Implication for code search:

  • word is too flexible: matches "variable" alone when data contains "variable_name"
  • lowercase or whitespace preserve identifier semantics better (see the sketch below)
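
For example, a property that stores code identifiers could use whitespace tokenization so that "variable_name" is only matched by the full, case-sensitive identifier. This is a minimal sketch; the property name "code_snippet" is an illustrative choice, not part of the demo collection above:

from weaviate.classes.config import Property, DataType, Tokenization

# Hypothetical property for code identifiers: whitespace tokenization
# preserves case and symbols, so "variable_name" only matches the full identifier
code_property = Property(
    name="code_snippet",  # illustrative property name
    data_type=DataType.TEXT,
    tokenization=Tokenization.WHITESPACE,  # or Tokenization.FIELD for whole-value matching
)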

Results: Stop words

Stop words

Stop words are words that are ignored during keyword indexing and filtering because they contribute little to the meaning of the text. They are typically common words like "the", "a", "an", and so on. By default, Weaviate uses an English stop word list, which can be customized per collection in the inverted index configuration.
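
As a sketch of how the stop word list might be customized when creating a collection (assuming the v4 Python client's inverted index options; the collection name and the additions/removals shown are purely illustrative):

from weaviate.classes.config import Configure, StopwordsPreset

# Illustrative sketch: customize the stop word list for a collection
client.collections.create(
    name="StopwordDemo",  # hypothetical collection name
    vector_config=Configure.Vectors.self_provided(),
    inverted_index_config=Configure.inverted_index(
        stopwords_preset=StopwordsPreset.EN,   # default English preset
        stopwords_additions=["mighty"],        # also treat "mighty" as a stop word
        stopwords_removals=["a"],              # keep "a" as a searchable token
    ),
)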

Testing with data containing "computer mouse":

Query: "computer mouse"

text_word            ✅ MATCH
text_lowercase       ✅ MATCH
text_whitespace      ✅ MATCH
text_field           ✅ MATCH

All match for exact query.

Query: "a computer mouse"

text_word            ✅ MATCH
text_lowercase       ✅ MATCH
text_whitespace      ✅ MATCH
text_field           ❌ NO MATCH

Analysis:

  • First three match because .equal() checks if query tokens are present
  • Data has tokens ["computer", "mouse"]
  • Query has tokens ["a", "computer", "mouse"]
  • "a" is a stop word (ignored), "computer" and "mouse" are found → match
  • field requires exact full string

Query: "the mighty computer mouse"

text_word            ❌ NO MATCH
text_lowercase       ❌ NO MATCH
text_whitespace      ❌ NO MATCH
text_field           ❌ NO MATCH

Analysis:

  • "mighty" is not in the data → no match
  • Stop words are ignored, but other non-matching words still cause failure

What's next?

Now that you've seen how tokenization affects filtering, let's look at how to select the right method for your data and how tokenization applies to keyword (BM25) search.
