
Batch import fundamentals

Batch imports are the most efficient way to add large amounts of data to Weaviate. They optimize performance by grouping multiple objects into batches, managing concurrent requests, and providing flexible options to control send rates and handle errors.

Batch imports: usage pattern

Batch imports in Python use a context manager pattern.

You can select the batch configuration method as you enter the context manager. While inside the context, the batcher handles sending objects as you add them. When you exit the context, any remaining objects are automatically flushed.

The batcher keeps track of any errors that occur during batching. The error count is available within the context manager, and the errors themselves can be accessed after exiting the context.

collection = client.collections.use("Products")

error_threshold = 10  # Max errors before aborting

# Select batch strategy when entering context manager
with collection.batch.fixed_size(batch_size=100) as batch:
    for obj in objects:
        batch.add_object(properties=obj)  # Remaining objects are flushed automatically on exit

        # Error count is available during batching
        if batch.number_errors > error_threshold:
            print("Too many errors, aborting batch import")
            break

# Error details available after exiting context manager
if collection.batch.failed_objects:
    for failed in collection.batch.failed_objects[:3]:  # Show first 3 errors
        print(f"Error: {failed.message} for object: {failed.object_}")

Batch configuration methods

There are three main batch configuration methods: .fixed_size(), .dynamic(), and .rate_limit(). Each has different use cases and trade-offs.

.fixed_size() - Fixed batch sizes and concurrency

fixed_size() sends batches with a set number of objects per batch, allowing predictable control and performance tuning. It also lets you adjust concurrency for parallel processing.

with collection.batch.fixed_size(
    batch_size=200,         # Objects per batch
    concurrent_requests=4   # Parallel batches
) as batch:
    for obj in objects:
        batch.add_object(properties=obj)

The fixed_size() method is a great starting point for most use cases. You can fine-tune performance by adjusting two key parameters.

Parameters:

  • batch_size: Objects per batch (default: 100)
    • Start with 100-200
    • Increase for smaller objects
    • Decrease for very large objects (e.g. images)
  • concurrent_requests: Parallel batches (default: 2)
    • Start with 2-4
    • Increase if Weaviate isn't saturated
    • Decrease if seeing timeouts

Tune this during development to find the best balance for your data and Weaviate setup.
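The guidance above (increase batch_size for small objects, decrease it for large ones) can be turned into a rough starting heuristic. The helper below is hypothetical, not part of the Weaviate client: it targets a fixed payload size per batch, so smaller objects get larger batches and vice versa.

```python
def suggest_batch_size(avg_object_bytes: int, target_batch_bytes: int = 1_000_000) -> int:
    """Rough heuristic: target ~1 MB of payload per batch, clamped to a sane range."""
    size = target_batch_bytes // max(avg_object_bytes, 1)
    return max(10, min(size, 1000))

print(suggest_batch_size(5_000))    # 5 KB objects -> 200 objects per batch
print(suggest_batch_size(500_000))  # large objects (e.g. images) -> small batches
```

Treat the result only as a starting value for batch_size; measure actual import throughput against your Weaviate instance and adjust from there.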

.rate_limit() - Throttling for API limits

.rate_limit() is designed to accommodate external vectorizers with strict rate limits, such as those imposed by API-based services.

The only parameter here is requests_per_minute, which defines the maximum number of requests the batcher can send in one minute.

# Rate limiting for external API vectorizers
with collection.batch.rate_limit(requests_per_minute=1500) as batch:
    for obj in objects:
        batch.add_object(properties=obj)

When to use:

  • External vectorizer with rate limits

.dynamic() - Adaptive client-side batching

.dynamic() is an adaptive batching method, where the Weaviate client automatically adjusts batch sizes based on observed performance and known configuration.

The client monitors server-side statistics such as the queue length, batch processing rate, and whether an external vectorizer is used. As the batch import proceeds, the client dynamically increases or decreases the batch size to optimize throughput while avoiding timeouts and overload.

caution

Dynamic batching is by nature less predictable than fixed-size batching. While it can optimize performance in some scenarios, it may also lead to unexpected behavior, such as increased error rates or bursty server loads. In many production scenarios, fixed-size batching is preferred for its predictability and control.

# Dynamic batching - the Weaviate client adjusts size automatically
with collection.batch.dynamic() as batch:
    for obj in objects:
        batch.add_object(properties=obj)

When to use:

  • Variable object sizes (some small, some large)
  • Unpredictable load on Weaviate
  • You want automatic optimization

When NOT to use:

  • Fixed-size works well (simpler and more predictable)
  • Errors must be minimized (dynamic resizing can increase error rates)
  • Server loads can trigger auto-scaling (may create spikes and trigger scaling events)

UUID and vector strategies

When adding objects in batch imports, you can choose how to handle UUIDs and vectors.

Vectors

You can optionally provide vectors manually when adding objects in batch imports. A provided vector will always be used for the object, regardless of the collection's vectorizer configuration.

However, be aware of the interplay between provided vectors and the collection's vectorizer at both import and query time.

If a built-in vectorizer (e.g., text2vec-openai, text2vec-cohere) is specified in the collection configuration, make sure that the provided vectors are generated using the same model as the integrated vectorizer.


This is to achieve compatibility with any vectors generated using the integrated vectorizer, such as at query time, or for objects added without provided vectors.


client.collections.create(
    name="Products",
    # Use a user-provided vector, or generate one with OpenAI's "text-embedding-3-small" model
    vector_config=Configure.Vectors.text2vec_openai(model="text-embedding-3-small")
)

collection = client.collections.use("Products")

# Batch import with custom vectors
with collection.batch.fixed_size(batch_size=50) as batch:
    for obj in objects:
        # Insert object with custom vector
        batch.add_object(
            properties=obj,
            # vector=vector  # If provided, Weaviate will use this instead of generating one
        )

UUIDs

When adding objects in batch imports, you can choose how to handle UUIDs.

If no UUID is provided, Weaviate will generate a new random UUID for each object. This is the default behavior.

If a UUID is provided, however, Weaviate will either create or overwrite the object with that UUID. This enables deterministic imports: re-importing the same object updates it instead of creating a duplicate.
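Deterministic imports work by deriving the same UUID from the same source data every time. The sketch below uses Python's standard uuid.uuid5 to illustrate the idea (the Weaviate client ships a similar helper, generate_uuid5, in weaviate.util); the namespace value and the "sku" field are illustrative assumptions.

```python
import uuid

# Hypothetical fixed namespace for the "Products" collection (any stable UUID works)
NAMESPACE = uuid.UUID("8af0a284-67e0-4c35-b230-a36b9d1b1b2f")

def deterministic_uuid(obj: dict) -> uuid.UUID:
    """Derive a stable UUID from a unique business key so re-imports update rather than duplicate."""
    return uuid.uuid5(NAMESPACE, obj["sku"])

first = deterministic_uuid({"sku": "A-100", "name": "Widget"})
second = deterministic_uuid({"sku": "A-100", "name": "Widget v2"})
print(first == second)  # True: same key -> same UUID, so the re-import overwrites
```

In a batch import you would pass the derived ID alongside the properties, e.g. batch.add_object(properties=obj, uuid=deterministic_uuid(obj)).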

Quick reference: Choosing batch configuration

Do you have rate limits? (e.g. OpenAI, Cohere API)
├─ Yes: Use .rate_limit(requests_per_minute=X)
└─ No: Continue

Is predictable behavior very important?
├─ Yes: Use .fixed_size() and tune batch_size & concurrent_requests
└─ No: Continue

Is object size highly variable?
├─ Yes: Consider .dynamic()
└─ No: Use .fixed_size()

Recommended starting configuration:
.fixed_size()

Too slow?
├─ Increase concurrent_requests (if CPU not saturated)
└─ Increase batch_size (if objects are small)

Seeing timeouts?
├─ Decrease batch_size
└─ Decrease concurrent_requests

What's next?

Now that you know how to configure batch imports, let's learn about critical error handling patterns to ensure data integrity.
