Skip to main content

UUID strategies for production

Each data object in Weaviate has a unique identifier in UUID (Universally Unique Identifier) format. A strategic use of UUIDs can provide a great deal of value, on top of identifying objects.

Let's explore how to put these strategies into practice.

UUID generation strategies

Auto-generated UUIDs (default)

If a UUID is not provided, Weaviate generates a new random UUID for each object:

from weaviate.classes.query import Filter

products = client.collections.use("Products")

# Each import creates new objects (different UUIDs)
product = {"sku": "SKU123", "name": "Mouse", "price": 29.99}

products.data.insert(product) # Creates object with UUID A
products.data.insert(product) # Creates object with UUID B (duplicate!)

response = products.query.fetch_objects(
filters=Filter.by_property("sku").equal("SKU123")
)
print(f"Number of products with SKU123: {len(response.objects)}") # Outputs 2

# This should show identical properties but different UUIDs
for obj in response.objects:
print(f"Object UUID: {obj.uuid}, Name: {obj.properties['name']}, Price: {obj.properties['price']}")

Result: Every import creates new objects, even if data is identical.

Use when:

  • Objects are truly unique each time
  • No need to re-import same data
  • Example: Time-series events, user interactions, logs

Deterministic UUIDs

You can generate deterministic UUIDs using standards like UUIDv5. This allows the same input data to always produce the same UUID.

In the Weaviate Python client, use generate_uuid5() to create such UUIDs, using a unique identifier from your data, such as a SKU, email address, URL or file path.

from weaviate.util import generate_uuid5

product = {"sku": "SKU-123", "name": "Mouse", "price": 29.99}
uuid = generate_uuid5(product["sku"])

Then, use the generated UUID when adding objects.

Note that single inserts and batch inserts treat this differently. Single inserts don't allow this, but batch inserts do:

Inserting an object with the same deterministic UUID results in an error.

from weaviate.util import generate_uuid5
from weaviate.exceptions import UnexpectedStatusCodeError

products = client.collections.use("Products")

product = {"sku": "SKU-123", "name": "Mouse", "price": 29.99}

# Same input = same UUID
uuid = generate_uuid5(product["sku"])

products.data.insert(product, uuid=uuid) # Creates object

# Attempting to insert again with same UUID raises error
product = {"sku": "SKU-123", "name": "Mouse", "price": 39.99} # Different price
try:
products.data.insert(product, uuid=uuid) # Throws error as object exists
except UnexpectedStatusCodeError as e:
print(f"Error on duplicate insert: {e}")

Weaviate is primarily designed to use UUIDv4 (random) or UUIDv5 (namespace name-based, deterministic) identifiers.

When generating deterministic UUIDs, you can use the weaviate.util.generate_uuid5() function with different input strategies:

Generally, the seed for UUIDv5 should be a natural unique identifier in your data that doesn't change. Take these examples:

Object TypeGood IdentifierBad Identifier
E-commerce productProduct SKUProduct name or details
User accountUser's email addressUser's address, name, description
Document or fileURL or file pathFile contents or metadata

From unique identifier

If a natural unique identifier exists, you can use it directly to generate a UUID.

from weaviate.util import generate_uuid5

product = {"sku": "SKU-123", "name": "Mouse", "price": 29.99}
uuid = generate_uuid5(product["sku"])

From multiple properties

In some cases, a combination of properties creates a unique identifier. Concatenate these properties to form the seed for UUID generation.

One example may be in a chunked document, where the combination of document_path and chunk_index uniquely identifies each chunk.

from weaviate.util import generate_uuid5

chunks = client.collections.use("DocChunks")

# Combine multiple properties for uniqueness
with chunks.batch.fixed_size(batch_size=100) as batch:
for chunk in chunk_objs:
# Unique key from document path + chunk index
unique_key = f"{chunk['document_path']}-{chunk['chunk_index']}"
uuid = generate_uuid5(unique_key)

batch.add_object(
properties=chunk,
uuid=uuid
)

With namespace

UUIDv5 supports namespaces to avoid collisions. If you import similar data into multiple collections, use a different namespace for each collection to ensure uniqueness.

from weaviate.util import generate_uuid5

products = client.collections.use("Products")
manuals = client.collections.use("Manuals")

with products.batch.fixed_size(batch_size=100) as batch:
for product in product_objs:
batch.add_object(
properties=product,
# Use different namespaces to avoid collisions
uuid=generate_uuid5(identifier=product["sku"], namespace="products")
)

with manuals.batch.fixed_size(batch_size=100) as batch:
for manual in manual_objs:
batch.add_object(
properties=manual,
# Use different namespaces to avoid collisions
uuid=generate_uuid5(identifier=manual["sku"], namespace="manuals")
)

Use cases for deterministic UUIDs

Once you have a strategy for generating deterministic UUIDs, you can leverage them for several use cases, such as Deduplication, Idempotency, Updates, and Deletions.

See below for examples of each.

When importing data that may contain duplicates, using deterministic UUIDs ensures that identical objects are not added multiple times.


products = client.collections.use("Products")

product_objs = [
{"sku": "SKU001"},
{"sku": "SKU001"}, # Duplicate
]

uuid = generate_uuid5("SKU001")
with products.batch.fixed_size(batch_size=100) as batch:
for product in product_objs:
batch.add_object(
properties=product,
uuid=uuid # Deterministic UUID ensures no duplicates
)

response = products.query.fetch_objects(
filters=Filter.by_property("sku").equal("SKU001")
)
print(len(response.objects)) # Outputs 1, deduplicated

When NOT to use deterministic UUIDs

In any scenario where each object is inherently unique, avoid using deterministic UUIDs. Examples include:

  • Time-series data (logs, events)
  • User interactions (clicks, views)
  • Versioned content (documents with multiple versions)
UUIDv7 (time-ordered UUIDs)

UUIDv7 is designed to be time-ordered, which means a timestamp of some sort can be encoded into the UUID itself, and the resulting UUIDs will sort in chronological order.


However, Weaviate is not optimized for UUIDv7 at this time. UUIDv7 will lead to performance degradation in async replication scenarios. Use it with caution and test thoroughly if you choose to do so.

UUID strategy decision tree

Generally, using a deterministic UUID with a robust seed is recommended. In summary, here is a decision tree to help choose the right UUID strategy:

Is each object inherently unique?
├─ No → Use auto-generated UUIDs
└─ Yes → Continue

Does your data have a unique identifier?
├─ Yes → Use generate_uuid5(unique_id)
└─ No → Continue

Is there a combination of fields that's unique?
├─ Yes → Use generate_uuid5(field1 + field2)
└─ No → Use generate_uuid5(entire_object)

Do you import the same data to multiple collections?
└─ Yes → Add namespace parameter

Best practices

  1. Choose the right identifier: Use natural unique IDs when possible
  2. Be consistent: Same strategy across all imports for a collection
  3. Use namespaces: Prevent collisions across collections
  4. Document your strategy: Team needs to understand the approach
  5. Test deduplication: Verify re-imports update instead of duplicate
What's next?

Now that you have a UUID strategy, let's learn how to optimize batch import performance.

Login to track your progress