Available ingestion methods

You can add data to Weaviate using different methods: single-object inserts (insert()), multi-object inserts (insert_many()), and batch imports (batch.add_object()). Your choice of method affects performance, resource usage, and error handling.

Single operations

Both insert() and insert_many() are designed for simplicity and ease of use: each call sends a single API request, which keeps the code straightforward.

The downside of single operations is performance. Each API request has overhead, and sending many requests can be slow and inefficient.
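A rough back-of-envelope model makes the cost concrete. The ~25 ms per-request latency below is an assumed, illustrative figure, not a measured one:

```python
# Rough cost model: total ingestion time grows linearly with request count
per_request_latency_s = 0.025  # assumed ~25 ms round trip per API call
n_objects = 2000

single_insert_s = n_objects * per_request_latency_s    # one request per object
batched_s = (n_objects / 200) * per_request_latency_s  # one request per 200 objects

print(f"single inserts: ~{single_insert_s:.0f} s, batches of 200: ~{batched_s:.2f} s")
```

Even with optimistic latency, per-object requests dominate the total time; grouping objects amortizes that overhead across the whole batch.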

Single operation: insert()

Each insert() call is an individual API request to add one object.

import weaviate
import os

client = weaviate.connect_to_weaviate_cloud(
    cluster_url=os.getenv("WEAVIATE_URL"),
    auth_credentials=os.getenv("WEAVIATE_API_KEY"),
    headers={
        "X-OpenAI-Api-Key": os.getenv("OPENAI_API_KEY")
    }
)

products = client.collections.use("Products")

# Insert a single product
product = {
    "sku": "SKU-123",
    "name": "Wireless Mouse",
    "description": "Ergonomic wireless mouse",
    "price": 29.99,
    "in_stock": True
}

uuid = products.data.insert(properties=product)
print(f"Created product with UUID: {uuid}")

client.close()

Small batches: insert_many()

Each insert_many() call is a single API request to add the provided objects at once.

import weaviate
import os

client = weaviate.connect_to_weaviate_cloud(
    cluster_url=os.getenv("WEAVIATE_URL"),
    auth_credentials=os.getenv("WEAVIATE_API_KEY")
)

products = client.collections.use("Products")

# For small batches (up to ~100 objects)
product_list = [
    {"sku": "SKU-001", "name": "Mouse", "price": 29.99},
    {"sku": "SKU-002", "name": "Keyboard", "price": 79.99},
    {"sku": "SKU-003", "name": "Monitor", "price": 299.99},
    # ... up to ~100 objects
]

response = products.data.insert_many(product_list)

print(f"Inserted {len(response.uuids)} products")
if response.has_errors:
    print(f"Errors: {response.errors}")

client.close()

When to use single operations

Single operations are appropriate in these scenarios:

  1. Real-time user-generated content

    • User performs an action triggering an insert
  2. Immediate usage required

    • Each new object needs to be available immediately
  3. Small volumes and low frequency

    • Impact of any overhead is negligible
    • Prioritize simplicity

Batch imports

Batch imports are the best choice for any kind of bulk data ingestion. They optimize performance by grouping objects into batches, managing concurrent requests, and providing multiple options to manage the sending rates and error handling.

Example comparison

Let's compare importing 2,000 product objects:

Approach 1: Single inserts (DON'T DO THIS)

When inserting many objects, avoid using a loop with insert(). This approach adds significant overhead due to the large number of API requests.

import weaviate
import time
import os

client = weaviate.connect_to_weaviate_cloud(
    cluster_url=os.getenv("WEAVIATE_URL"),
    auth_credentials=os.getenv("WEAVIATE_API_KEY")
)

products = client.collections.use("Products")

# DON'T DO THIS for large datasets
product_data = [
    {"sku": f"SKU-{i:04d}", "name": f"Product {i}", "price": 10.0 + i}
    for i in range(2000)
]

start = time.time()

for product in product_data:
    products.data.insert(product)  # 2000 individual API calls!

elapsed = time.time() - start
print(f"Time taken: {elapsed:.1f} seconds")  # ~50-60 seconds

client.close()

Result: ~50-60 seconds, 2000 API requests

Approach 2: insert_many()

Using insert_many() reduces the number of API requests by sending multiple objects in each call. It can be a good approach for a small set of objects. However, each insert_many() call is independent, so for larger datasets, you need to implement your own batching, concurrency, and error handling logic.

import weaviate
import time
import os

client = weaviate.connect_to_weaviate_cloud(
    cluster_url=os.getenv("WEAVIATE_URL"),
    auth_credentials=os.getenv("WEAVIATE_API_KEY")
)

products = client.collections.use("Products")

product_data = [
    {"sku": f"SKU-{i:04d}", "name": f"Product {i}", "price": 10.0 + i}
    for i in range(2000)
]

start = time.time()

for i in range(0, len(product_data), 200):
    batch = product_data[i : i + 200]  # Get the next 200 products
    response = products.data.insert_many(batch)  # One API call for 200 objects; you must check each response for errors yourself

elapsed = time.time() - start
print(f"Time taken: {elapsed:.1f} seconds")

client.close()

Result: ~1 second, 10 API requests, custom logic required for batching & error handling
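The custom batching logic can be factored into a small helper. This is an illustrative sketch; `chunked` is not part of the Weaviate client and is defined here only for the example:

```python
from typing import Iterator, TypeVar

T = TypeVar("T")

def chunked(items: list[T], size: int) -> Iterator[list[T]]:
    """Yield successive slices of `items`, each with at most `size` elements."""
    for i in range(0, len(items), size):
        yield items[i : i + size]

# 2000 objects in chunks of 200 -> 10 chunks
chunks = list(chunked(list(range(2000)), 200))
print(len(chunks), len(chunks[-1]))  # 10 200
```

Each chunk would then be passed to insert_many(), with every returned response checked for errors; any retry, backoff, or concurrency logic remains your responsibility.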

Approach 3: Batch import API

For large-scale imports, use the batch import API. It handles batching, concurrency, and error management for you.

import weaviate
import time
import os

client = weaviate.connect_to_weaviate_cloud(
    cluster_url=os.getenv("WEAVIATE_URL"),
    auth_credentials=os.getenv("WEAVIATE_API_KEY")
)

products = client.collections.use("Products")

product_data = [
    {"sku": f"SKU-{i:04d}", "name": f"Product {i}", "price": 10.0 + i}
    for i in range(2000)
]

start = time.time()

with products.batch.fixed_size(batch_size=200) as batch:
    for product in product_data:
        batch.add_object(properties=product)

elapsed = time.time() - start
print(f"Time taken: {elapsed:.1f} seconds")  # ~0.5 seconds
print(f"Errors: {len(products.batch.failed_objects)}")  # Failed objects are collected for the whole batch

client.close()

Result: ~0.5 seconds, 10 API requests, managed concurrency & error handling for the dataset
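The request counts quoted in these results follow directly from the dataset size and batch size; a quick sanity check:

```python
import math

n_objects = 2000
batch_size = 200

requests_single = n_objects                           # insert(): one request per object
requests_batched = math.ceil(n_objects / batch_size)  # insert_many()/batch: one per full batch

print(requests_single, requests_batched)  # 2000 10
```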

Why use the batch import API?

The batch context manager provides:

  1. Automatic batching: Groups objects efficiently
  2. Concurrent requests: Parallel processing
  3. Error handling: Grouped error reporting
  4. Rate limiting: Built-in throttling options

Use the batch import API for:

  1. Initial data load

    • Loading hundreds to millions of objects
    • Setting up a new instance
  2. Periodic bulk updates

    • Daily/hourly data refreshes
    • Syncing from external systems
  3. Data migrations

    • Moving between Weaviate instances
    • Restructuring collections
  4. Any production ingestion pipeline

    • ETL processes
    • Continuous data ingestion
    • Background jobs

What's next?

Now that you understand the difference between single and batch operations, let's dive deeper into implementing batch imports, starting with different batching options.
