UUID strategies for production
Each data object in Weaviate has a unique identifier in UUID (Universally Unique Identifier) format. A strategic use of UUIDs can provide a great deal of value, on top of identifying objects.
Let's explore how to put these strategies into practice.
UUID generation strategies
Auto-generated UUIDs (default)
If a UUID is not provided, Weaviate generates a new random UUID for each object:
from weaviate.classes.query import Filter
products = client.collections.use("Products")
# Each import creates new objects (different UUIDs)
product = {"sku": "SKU123", "name": "Mouse", "price": 29.99}
products.data.insert(product) # Creates object with UUID A
products.data.insert(product) # Creates object with UUID B (duplicate!)
response = products.query.fetch_objects(
filters=Filter.by_property("sku").equal("SKU123")
)
print(f"Number of products with SKU123: {len(response.objects)}") # Outputs 2
# This should show identical properties but different UUIDs
for obj in response.objects:
print(f"Object UUID: {obj.uuid}, Name: {obj.properties['name']}, Price: {obj.properties['price']}")
Result: Every import creates new objects, even if data is identical.
Use when:
- Objects are truly unique each time
- No need to re-import same data
- Example: Time-series events, user interactions, logs
Deterministic UUIDs
You can generate deterministic UUIDs using standards like UUIDv5. This allows the same input data to always produce the same UUID.
In the Weaviate Python client, use generate_uuid5() to create such UUIDs, using a unique identifier from your data, such as a SKU, email address, URL or file path.
from weaviate.util import generate_uuid5
product = {"sku": "SKU-123", "name": "Mouse", "price": 29.99}
uuid = generate_uuid5(product["sku"])
Then, use the generated UUID when adding objects.
Note that single inserts and batch inserts treat this differently. Single inserts don't allow this, but batch inserts do:
- Single inserts
- Batch inserts
Inserting an object with the same deterministic UUID results in an error.
from weaviate.util import generate_uuid5
from weaviate.exceptions import UnexpectedStatusCodeError
products = client.collections.use("Products")
product = {"sku": "SKU-123", "name": "Mouse", "price": 29.99}
# Same input = same UUID
uuid = generate_uuid5(product["sku"])
products.data.insert(product, uuid=uuid) # Creates object
# Attempting to insert again with same UUID raises error
product = {"sku": "SKU-123", "name": "Mouse", "price": 39.99} # Different price
try:
products.data.insert(product, uuid=uuid) # Throws error as object exists
except UnexpectedStatusCodeError as e:
print(f"Error on duplicate insert: {e}")
Batch inserting an object with the same deterministic UUID updates the existing object.
from weaviate.util import generate_uuid5
products = client.collections.use("Products")
product = {"sku": "SKU-123", "name": "Mouse", "price": 29.99}
# Same input = same UUID
uuid = generate_uuid5(product["sku"])
products.data.insert(product, uuid=uuid) # Creates object
# Re-inserting with same UUID updates existing object
product = {"sku": "SKU-123", "name": "Mouse", "price": 39.99} # Different price
with products.batch.fixed_size(batch_size=100) as batch:
batch.add_object(properties=product, uuid=uuid) # Runs without error, updates object
print(products.query.fetch_object_by_id(uuid).properties["price"]) # Outputs 39.99 (updated price)
# END DeterministicUUIDSingle
client.collections.delete("DocChunks") # Clean up if exists
client.collections.create(
name="DocChunks",
properties=[
Property(name="document_path", data_type=DataType.TEXT),
Property(name="chunk_index", data_type=DataType.INT),
],
vector_config=Configure.Vectors.text2vec_openai()
)
chunk_objs = [
{"document_path": "/docs/report.pdf", "chunk_index": 0},
{"document_path": "/docs/report.pdf", "chunk_index": 1},
]
# START UUIDFromMultipleProps
from weaviate.util import generate_uuid5
chunks = client.collections.use("DocChunks")
# Combine multiple properties for uniqueness
with chunks.batch.fixed_size(batch_size=100) as batch:
for chunk in chunk_objs:
# Unique key from document path + chunk index
unique_key = f"{chunk['document_path']}-{chunk['chunk_index']}"
uuid = generate_uuid5(unique_key)
batch.add_object(
properties=chunk,
uuid=uuid
)
# END UUIDFromMultipleProps
for cname in ["Products", "Manuals"]:
client.collections.delete(cname) # Clean up if exists
client.collections.create(
name=cname,
properties=[
Property(name="sku", data_type=DataType.TEXT),
],
vector_config=Configure.Vectors.text2vec_openai()
)
product_objs = [
{"sku": "SKU001"},
{"sku": "SKU002"},
]
manual_objs = product_objs
# START UUIDWithNamespace
from weaviate.util import generate_uuid5
products = client.collections.use("Products")
manuals = client.collections.use("Manuals")
with products.batch.fixed_size(batch_size=100) as batch:
for product in product_objs:
batch.add_object(
properties=product,
# Use different namespaces to avoid collisions
uuid=generate_uuid5(identifier=product["sku"], namespace="products")
)
with manuals.batch.fixed_size(batch_size=100) as batch:
for manual in manual_objs:
batch.add_object(
properties=manual,
# Use different namespaces to avoid collisions
uuid=generate_uuid5(identifier=manual["sku"], namespace="manuals")
)
# END UUIDWithNamespace
products.data.delete_by_id(uuid=generate_uuid5("SKU001"))
# START ExampleDeduplication
products = client.collections.use("Products")
product_objs = [
{"sku": "SKU001"},
{"sku": "SKU001"}, # Duplicate
]
uuid = generate_uuid5("SKU001")
with products.batch.fixed_size(batch_size=100) as batch:
for product in product_objs:
batch.add_object(
properties=product,
uuid=uuid # Deterministic UUID ensures no duplicates
)
response = products.query.fetch_objects(
filters=Filter.by_property("sku").equal("SKU001")
)
print(len(response.objects)) # Outputs 1, deduplicated
# END ExampleDeduplication
products.data.delete_by_id(uuid=generate_uuid5("SKU001"))
# START ExampleIdempotency
products = client.collections.use("Products")
with products.batch.fixed_size(batch_size=100) as batch:
for product in product_objs:
batch.add_object(
properties=product,
uuid=generate_uuid5("SKU001")
)
# Re-running the same batch import does not create duplicates
with products.batch.fixed_size(batch_size=100) as batch:
for product in product_objs:
batch.add_object(
properties=product,
uuid=generate_uuid5("SKU001")
)
response = products.query.fetch_objects(
filters=Filter.by_property("sku").equal("SKU001")
)
print(len(response.objects)) # Outputs 1, idempotent
# END ExampleIdempotency
products.data.delete_by_id(uuid=generate_uuid5("SKU001"))
# START ExampleUpdates
products = client.collections.use("Products")
products.data.insert(
properties={"sku": "SKU001", "price": 39.99},
uuid=generate_uuid5("SKU001")
)
# Having a deterministic UUID allows us to easily update the object
products.data.update(
properties={"sku": "SKU001", "price": 30.05},
uuid=generate_uuid5("SKU001")
)
response = products.query.fetch_objects(
filters=Filter.by_property("sku").equal("SKU001")
)
print(response.objects[0].properties["price"]) # Outputs 30.05, updated price
# END ExampleUpdates
products.data.delete_by_id(uuid=generate_uuid5("SKU001"))
# START ExampleDeletions
products = client.collections.use("Products")
products.data.insert(
properties={"sku": "SKU001", "price": 39.99},
uuid=generate_uuid5("SKU001")
)
# Having a deterministic UUID allows us to easily delete the object
products.data.delete_by_id(
uuid=generate_uuid5("SKU001")
)
response = products.query.fetch_objects(
filters=Filter.by_property("sku").equal("SKU001")
)
print(len(response.objects)) # Outputs 0, object deleted
# END ExampleDeletions
client.close()
Recommended UUID strategies
Weaviate is primarily designed to use UUIDv4 (random) or UUIDv5 (namespace name-based, deterministic) identifiers.
When generating deterministic UUIDs, you can use the weaviate.util.generate_uuid5() function with different input strategies:
Generally, the seed for UUIDv5 should be a natural unique identifier in your data that doesn't change. Take these examples:
| Object Type | Good Identifier | Bad Identifier |
|---|---|---|
| E-commerce product | Product SKU | Product name or details |
| User account | User's email address | User's address, name, description |
| Document or file | URL or file path | File contents or metadata |
From unique identifier
If a natural unique identifier exists, you can use it directly to generate a UUID.
from weaviate.util import generate_uuid5
product = {"sku": "SKU-123", "name": "Mouse", "price": 29.99}
uuid = generate_uuid5(product["sku"])
From multiple properties
In some cases, a combination of properties creates a unique identifier. Concatenate these properties to form the seed for UUID generation.
One example may be in a chunked document, where the combination of document_path and chunk_index uniquely identifies each chunk.
from weaviate.util import generate_uuid5
chunks = client.collections.use("DocChunks")
# Combine multiple properties for uniqueness
with chunks.batch.fixed_size(batch_size=100) as batch:
for chunk in chunk_objs:
# Unique key from document path + chunk index
unique_key = f"{chunk['document_path']}-{chunk['chunk_index']}"
uuid = generate_uuid5(unique_key)
batch.add_object(
properties=chunk,
uuid=uuid
)
With namespace
UUIDv5 supports namespaces to avoid collisions. If you import similar data into multiple collections, use a different namespace for each collection to ensure uniqueness.
from weaviate.util import generate_uuid5
products = client.collections.use("Products")
manuals = client.collections.use("Manuals")
with products.batch.fixed_size(batch_size=100) as batch:
for product in product_objs:
batch.add_object(
properties=product,
# Use different namespaces to avoid collisions
uuid=generate_uuid5(identifier=product["sku"], namespace="products")
)
with manuals.batch.fixed_size(batch_size=100) as batch:
for manual in manual_objs:
batch.add_object(
properties=manual,
# Use different namespaces to avoid collisions
uuid=generate_uuid5(identifier=manual["sku"], namespace="manuals")
)
Use cases for deterministic UUIDs
Once you have a strategy for generating deterministic UUIDs, you can leverage them for several use cases, such as Deduplication, Idempotency, Updates, and Deletions.
See below for examples of each.
- Deduplication
- Idempotency
- Updates
- Deletions
When importing data that may contain duplicates, using deterministic UUIDs ensures that identical objects are not added multiple times.
products = client.collections.use("Products")
product_objs = [
{"sku": "SKU001"},
{"sku": "SKU001"}, # Duplicate
]
uuid = generate_uuid5("SKU001")
with products.batch.fixed_size(batch_size=100) as batch:
for product in product_objs:
batch.add_object(
properties=product,
uuid=uuid # Deterministic UUID ensures no duplicates
)
response = products.query.fetch_objects(
filters=Filter.by_property("sku").equal("SKU001")
)
print(len(response.objects)) # Outputs 1, deduplicated
Idempotent imports allow you to safely re-run the same import multiple times without creating duplicates. Using deterministic UUIDs ensures that re-imported objects update existing ones rather than creating new entries.
products = client.collections.use("Products")
with products.batch.fixed_size(batch_size=100) as batch:
for product in product_objs:
batch.add_object(
properties=product,
uuid=generate_uuid5("SKU001")
)
# Re-running the same batch import does not create duplicates
with products.batch.fixed_size(batch_size=100) as batch:
for product in product_objs:
batch.add_object(
properties=product,
uuid=generate_uuid5("SKU001")
)
response = products.query.fetch_objects(
filters=Filter.by_property("sku").equal("SKU001")
)
print(len(response.objects)) # Outputs 1, idempotent
You can update existing objects by re-importing them with the same deterministic UUID but modified properties.
products = client.collections.use("Products")
products.data.insert(
properties={"sku": "SKU001", "price": 39.99},
uuid=generate_uuid5("SKU001")
)
# Having a deterministic UUID allows us to easily update the object
products.data.update(
properties={"sku": "SKU001", "price": 30.05},
uuid=generate_uuid5("SKU001")
)
response = products.query.fetch_objects(
filters=Filter.by_property("sku").equal("SKU001")
)
print(response.objects[0].properties["price"]) # Outputs 30.05, updated price
You can remove specific objects by their deterministic UUIDs.
products = client.collections.use("Products")
products.data.insert(
properties={"sku": "SKU001", "price": 39.99},
uuid=generate_uuid5("SKU001")
)
# Having a deterministic UUID allows us to easily delete the object
products.data.delete_by_id(
uuid=generate_uuid5("SKU001")
)
response = products.query.fetch_objects(
filters=Filter.by_property("sku").equal("SKU001")
)
print(len(response.objects)) # Outputs 0, object deleted
When NOT to use deterministic UUIDs
In any scenario where each object is inherently unique, avoid using deterministic UUIDs. Examples include:
- Time-series data (logs, events)
- User interactions (clicks, views)
- Versioned content (documents with multiple versions)
UUIDv7 is designed to be time-ordered, which means a timestamp of some sort can be encoded into the UUID itself, and the resulting UUIDs will sort in chronological order.
However, Weaviate is not optimized for UUIDv7 at this time. UUIDv7 will lead to performance degradation in async replication scenarios. Use it with caution and test thoroughly if you choose to do so.
UUID strategy decision tree
Generally, using a deterministic UUID with a robust seed is recommended. In summary, here is a decision tree to help choose the right UUID strategy:
Is each object inherently unique?
├─ No → Use auto-generated UUIDs
└─ Yes → Continue
Does your data have a unique identifier?
├─ Yes → Use generate_uuid5(unique_id)
└─ No → Continue
Is there a combination of fields that's unique?
├─ Yes → Use generate_uuid5(field1 + field2)
└─ No → Use generate_uuid5(entire_object)
Do you import the same data to multiple collections?
└─ Yes → Add namespace parameter
Best practices
- Choose the right identifier: Use natural unique IDs when possible
- Be consistent: Same strategy across all imports for a collection
- Use namespaces: Prevent collisions across collections
- Document your strategy: Team needs to understand the approach
- Test deduplication: Verify re-imports update instead of duplicate
Now that you have a UUID strategy, let's learn how to optimize batch import performance.