Available ingestion methods
You can add data to Weaviate using different methods: single-object inserts (insert()), multi-object inserts (insert_many()), and batch imports (batch.add_object()). Choosing the right method impacts performance, resource usage, and error handling.
Single operations
Both insert() and insert_many() are designed for simplicity: each call sends exactly one API request, which makes them straightforward to use.
The downside of single operations is performance. Each API request has overhead, and sending many requests can be slow and inefficient.
Single operation: insert()
Each insert() call is an individual API request to add one object.
import weaviate
import os

client = weaviate.connect_to_weaviate_cloud(
    cluster_url=os.getenv("WEAVIATE_URL"),
    auth_credentials=os.getenv("WEAVIATE_API_KEY"),
    headers={
        "X-OpenAI-Api-Key": os.getenv("OPENAI_API_KEY")
    }
)

products = client.collections.use("Products")

# Insert a single product
product = {
    "sku": "SKU-123",
    "name": "Wireless Mouse",
    "description": "Ergonomic wireless mouse",
    "price": 29.99,
    "in_stock": True
}

uuid = products.data.insert(properties=product)
print(f"Created product with UUID: {uuid}")

client.close()
Small batches: insert_many()
Each insert_many() call is a single API request to add the provided objects at once.
import weaviate
import os

client = weaviate.connect_to_weaviate_cloud(
    cluster_url=os.getenv("WEAVIATE_URL"),
    auth_credentials=os.getenv("WEAVIATE_API_KEY")
)

products = client.collections.use("Products")

# For small batches (up to ~100 objects)
product_list = [
    {"sku": "SKU-001", "name": "Mouse", "price": 29.99},
    {"sku": "SKU-002", "name": "Keyboard", "price": 79.99},
    {"sku": "SKU-003", "name": "Monitor", "price": 299.99},
    # ... up to ~100 objects
]

response = products.data.insert_many(product_list)
print(f"Inserted {len(response.uuids)} products")

if response.has_errors:
    print(f"Errors: {response.errors}")

client.close()
When to use single operations
Single operations are appropriate in these scenarios:
- Real-time user-generated content
  - User performs an action triggering an insert
- Immediate usage required
  - Each new object needs to be available immediately (see the sketch after this list)
- Small volumes and low frequency
  - Impact of any overhead is negligible
  - Prioritize simplicity
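For example, a real-time flow can insert an object as soon as the user action happens and read it back immediately by its UUID. The following is a minimal sketch that reuses the "Products" collection from above; the property values are illustrative.

import weaviate
import os

client = weaviate.connect_to_weaviate_cloud(
    cluster_url=os.getenv("WEAVIATE_URL"),
    auth_credentials=os.getenv("WEAVIATE_API_KEY")
)
products = client.collections.use("Products")

# A user action triggers a single insert
uuid = products.data.insert(properties={
    "sku": "SKU-777",
    "name": "Webcam",
    "price": 49.99
})

# The new object is available right away and can be fetched by its UUID
created = products.query.fetch_object_by_id(uuid)
print(created.properties["name"])

client.close()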
Batch imports
Batch imports are the best choice for any kind of bulk data ingestion. They optimize performance by grouping objects into batches and sending concurrent requests, and they provide multiple options for controlling send rates and handling errors.
Example comparison
Let's compare importing 2000 product objects:
Approach 1: Single inserts (DON'T DO THIS)
When inserting many objects, avoid using a loop with insert(). This approach adds significant overhead due to the large number of API requests.
import weaviate
import time
import os

client = weaviate.connect_to_weaviate_cloud(
    cluster_url=os.getenv("WEAVIATE_URL"),
    auth_credentials=os.getenv("WEAVIATE_API_KEY")
)

products = client.collections.use("Products")

# DON'T DO THIS for large datasets
product_data = [
    {"sku": f"SKU-{i:04d}", "name": f"Product {i}", "price": 10.0 + i}
    for i in range(2000)
]

start = time.time()
for product in product_data:
    products.data.insert(product)  # 2000 individual API calls!
elapsed = time.time() - start

print(f"Time taken: {elapsed:.1f} seconds")  # ~50-60 seconds

client.close()
Result: ~50-60 seconds, 2000 API requests
Approach 2: insert_many()
Using insert_many() reduces the number of API requests by sending multiple objects in each call. It can be a good approach for a small set of objects. However, each insert_many() call is independent, so for larger datasets, you need to implement your own batching, concurrency, and error handling logic.
import weaviate
import time
import os

client = weaviate.connect_to_weaviate_cloud(
    cluster_url=os.getenv("WEAVIATE_URL"),
    auth_credentials=os.getenv("WEAVIATE_API_KEY")
)

products = client.collections.use("Products")

product_data = [
    {"sku": f"SKU-{i:04d}", "name": f"Product {i}", "price": 10.0 + i}
    for i in range(2000)
]

start = time.time()
for i in range(0, len(product_data), 200):
    batch = product_data[i : i + 200]  # Get the next 200 products
    response = products.data.insert_many(batch)  # Single API call for 200 objects; you must check or collect each response yourself
elapsed = time.time() - start

print(f"Time taken: {elapsed:.1f} seconds")

client.close()
Result: ~1 second, 10 API requests, custom logic required for batching & error handling
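One way to add the missing error handling is to collect the errors reported by each insert_many() response as you go. The following is a minimal sketch of that pattern, reusing the 200-object chunks from the example above; the failed dictionary is just an illustrative way to accumulate per-object errors.

import weaviate
import os

client = weaviate.connect_to_weaviate_cloud(
    cluster_url=os.getenv("WEAVIATE_URL"),
    auth_credentials=os.getenv("WEAVIATE_API_KEY")
)
products = client.collections.use("Products")

product_data = [
    {"sku": f"SKU-{i:04d}", "name": f"Product {i}", "price": 10.0 + i}
    for i in range(2000)
]

failed = {}  # illustrative: collect errors across all insert_many() calls
for i in range(0, len(product_data), 200):
    chunk = product_data[i : i + 200]
    response = products.data.insert_many(chunk)
    if response.has_errors:
        # response.errors maps an object's index within this chunk to its error
        for idx, err in response.errors.items():
            failed[i + idx] = err

print(f"{len(failed)} objects failed to import")

client.close()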
Approach 3: Batch import (RECOMMENDED)
For large-scale imports, use the batch import API. It handles batching, concurrency, and error management for you.
import weaviate
import time
import os

client = weaviate.connect_to_weaviate_cloud(
    cluster_url=os.getenv("WEAVIATE_URL"),
    auth_credentials=os.getenv("WEAVIATE_API_KEY")
)

products = client.collections.use("Products")

product_data = [
    {"sku": f"SKU-{i:04d}", "name": f"Product {i}", "price": 10.0 + i}
    for i in range(2000)
]

start = time.time()
with products.batch.fixed_size(batch_size=200) as batch:
    for product in product_data:
        batch.add_object(properties=product)
elapsed = time.time() - start

print(f"Time taken: {elapsed:.1f} seconds")  # ~0.5 seconds
print(f"Errors: {len(products.batch.failed_objects)}")  # Weaviate collects failed objects for the whole batch

client.close()
Result: ~0.5 seconds, 10 API requests, managed concurrency & error handling for the dataset
Why use the batch import API?
The batch context manager provides:
- Automatic batching: Groups objects efficiently
- Concurrent requests: Parallel processing
- Error handling: Grouped error reporting
- Rate limiting: Built-in throttling options (see the sketch after this list)
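As an illustration of the throttling and error-reporting options above, here is a minimal sketch that uses the client's rate-limited batch mode together with batch.number_errors and failed_objects. The requests_per_minute value and the error threshold of 10 are arbitrary illustrative choices, not recommendations.

import weaviate
import os

client = weaviate.connect_to_weaviate_cloud(
    cluster_url=os.getenv("WEAVIATE_URL"),
    auth_credentials=os.getenv("WEAVIATE_API_KEY")
)
products = client.collections.use("Products")

product_data = [
    {"sku": f"SKU-{i:04d}", "name": f"Product {i}", "price": 10.0 + i}
    for i in range(2000)
]

# Throttle the import, e.g. to respect an embedding provider's rate limit
with products.batch.rate_limit(requests_per_minute=600) as batch:
    for product in product_data:
        batch.add_object(properties=product)
        if batch.number_errors > 10:  # illustrative threshold: stop early if too many objects fail
            print("Too many failed objects, aborting import")
            break

failed = products.batch.failed_objects  # grouped error report, available after the batch closes
if failed:
    print(f"{len(failed)} objects failed; first failure: {failed[0]}")

client.close()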
Use the batch import API for:
- Initial data load
  - Loading hundreds to millions of objects
  - Setting up a new instance
- Periodic bulk updates
  - Daily/hourly data refreshes
  - Syncing from external systems
- Data migrations
  - Moving between Weaviate instances
  - Restructuring collections
- Any production ingestion pipeline
  - ETL processes
  - Continuous data ingestion
  - Background jobs
Now that you understand the difference between single and batch operations, let's dive deeper into implementing batch imports, starting with different batching options.