Available ingestion methods

You can add data to Weaviate using different methods: single-object inserts (insert()), multi-object inserts (insert_many()), and batch imports (batch.add_object()). Your choice of method affects performance, resource usage, and error handling.

Single operations

Both insert() and insert_many() are designed for simplicity and ease of use: each call sends a single API request, which keeps the code straightforward.

The downside of single operations is performance. Each API request has overhead, and sending many requests can be slow and inefficient.
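A rough back-of-envelope model makes the cost concrete. The ~25 ms per-request latency below is an assumed, illustrative figure, not a measured one:

```python
# Rough cost model: total ingestion time grows linearly with request count
per_request_latency_s = 0.025  # assumed ~25 ms round trip per API call
n_objects = 2000

single_insert_s = n_objects * per_request_latency_s    # one request per object
batched_s = (n_objects / 200) * per_request_latency_s  # one request per 200 objects

print(f"single inserts: ~{single_insert_s:.0f} s, batches of 200: ~{batched_s:.2f} s")
```

Even with optimistic latency, per-object requests dominate the total time; grouping objects amortizes that overhead across the whole batch.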

Single operation: insert()

Each insert() call is an individual API request to add one object.

import weaviate
import os

client = weaviate.connect_to_weaviate_cloud(
    cluster_url=os.getenv("WEAVIATE_URL"),
    auth_credentials=os.getenv("WEAVIATE_API_KEY"),
    headers={
        "X-OpenAI-Api-Key": os.getenv("OPENAI_API_KEY")
    }
)

products = client.collections.use("Products")

# Insert a single product
product = {
    "sku": "SKU-123",
    "name": "Wireless Mouse",
    "description": "Ergonomic wireless mouse",
    "price": 29.99,
    "in_stock": True
}

uuid = products.data.insert(properties=product)
print(f"Created product with UUID: {uuid}")

client.close()

Small batches: insert_many()

Each insert_many() call is a single API request to add the provided objects at once.

import weaviate
import os

client = weaviate.connect_to_weaviate_cloud(
    cluster_url=os.getenv("WEAVIATE_URL"),
    auth_credentials=os.getenv("WEAVIATE_API_KEY")
)

products = client.collections.use("Products")

# For small batches (up to ~100 objects)
product_list = [
    {"sku": "SKU-001", "name": "Mouse", "price": 29.99},
    {"sku": "SKU-002", "name": "Keyboard", "price": 79.99},
    {"sku": "SKU-003", "name": "Monitor", "price": 299.99},
    # ... up to ~100 objects
]

response = products.data.insert_many(product_list)

print(f"Inserted {len(response.uuids)} products")
if response.has_errors:
    print(f"Errors: {response.errors}")

client.close()

When to use single operations

Single operations are appropriate in these scenarios:

  1. Real-time user-generated content

    • User performs an action triggering an insert
  2. Immediate usage required

    • Each new object needs to be available immediately
  3. Small volumes and low frequency

    • Impact of any overhead is negligible
    • Prioritize simplicity

Batch imports

Batch imports are the best choice for any kind of bulk data ingestion. They optimize performance by grouping objects into batches, managing concurrent requests, and providing multiple options to manage the sending rates and error handling.

Example comparison

Let's compare importing 2,000 product objects:

Approach 1: Single inserts (DON'T DO THIS)

When inserting many objects, avoid using a loop with insert(). This approach adds significant overhead due to the large number of API requests.

import weaviate
import time
import os

client = weaviate.connect_to_weaviate_cloud(
    cluster_url=os.getenv("WEAVIATE_URL"),
    auth_credentials=os.getenv("WEAVIATE_API_KEY")
)

products = client.collections.use("Products")

# DON'T DO THIS for large datasets
product_data = [
    {"sku": f"SKU-{i:04d}", "name": f"Product {i}", "price": 10.0 + i}
    for i in range(2000)
]

start = time.time()

for product in product_data:
    products.data.insert(product)  # 2000 individual API calls!

elapsed = time.time() - start
print(f"Time taken: {elapsed:.1f} seconds")  # ~50-60 seconds

client.close()

Result: ~50-60 seconds, 2000 API requests

Approach 2: insert_many()

Using insert_many() reduces the number of API requests by sending multiple objects in each call. It can be a good approach for a small set of objects. However, each insert_many() call is independent, so for larger datasets, you need to implement your own batching, concurrency, and error handling logic.

import weaviate
import time
import os

client = weaviate.connect_to_weaviate_cloud(
    cluster_url=os.getenv("WEAVIATE_URL"),
    auth_credentials=os.getenv("WEAVIATE_API_KEY")
)

products = client.collections.use("Products")

product_data = [
    {"sku": f"SKU-{i:04d}", "name": f"Product {i}", "price": 10.0 + i}
    for i in range(2000)
]

start = time.time()

for i in range(0, len(product_data), 200):
    batch = product_data[i : i + 200]  # Get the next 200 products
    response = products.data.insert_many(batch)  # One API call for 200 objects; you must check each response for errors yourself

elapsed = time.time() - start
print(f"Time taken: {elapsed:.1f} seconds")

client.close()

Result: ~1 second, 10 API requests, custom logic required for batching & error handling
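The custom batching logic can be factored into a small helper. This is an illustrative sketch; `chunked` is not part of the Weaviate client and is defined here only for the example:

```python
from typing import Iterator, TypeVar

T = TypeVar("T")

def chunked(items: list[T], size: int) -> Iterator[list[T]]:
    """Yield successive slices of `items`, each with at most `size` elements."""
    for i in range(0, len(items), size):
        yield items[i : i + size]

# 2000 objects in chunks of 200 -> 10 chunks
chunks = list(chunked(list(range(2000)), 200))
print(len(chunks), len(chunks[-1]))  # 10 200
```

Each chunk would then be passed to insert_many(), with every returned response checked for errors; any retry, backoff, or concurrency logic remains your responsibility.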

Approach 3: Batch import API

For large-scale imports, use the batch import API. It handles batching, concurrency, and error management for you.

import weaviate
import time
import os

client = weaviate.connect_to_weaviate_cloud(
    cluster_url=os.getenv("WEAVIATE_URL"),
    auth_credentials=os.getenv("WEAVIATE_API_KEY")
)

products = client.collections.use("Products")

product_data = [
    {"sku": f"SKU-{i:04d}", "name": f"Product {i}", "price": 10.0 + i}
    for i in range(2000)
]

start = time.time()

with products.batch.fixed_size(batch_size=200) as batch:
    for product in product_data:
        batch.add_object(properties=product)

elapsed = time.time() - start
print(f"Time taken: {elapsed:.1f} seconds")  # ~0.5 seconds
print(f"Errors: {len(products.batch.failed_objects)}")  # Failed objects are collected for the whole batch

client.close()

Result: ~0.5 seconds, 10 API requests, managed concurrency & error handling for the dataset
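The request counts quoted in these results follow directly from the dataset size and batch size; a quick sanity check:

```python
import math

n_objects = 2000
batch_size = 200

requests_single = n_objects                           # insert(): one request per object
requests_batched = math.ceil(n_objects / batch_size)  # insert_many()/batch: one per full batch

print(requests_single, requests_batched)  # 2000 10
```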

Why use the batch import API?

The batch context manager provides:

  1. Automatic batching: Groups objects efficiently
  2. Concurrent requests: Parallel processing
  3. Error handling: Grouped error reporting
  4. Rate limiting: Built-in throttling options

Use the batch import API for:

  1. Initial data load

    • Loading hundreds to millions of objects
    • Setting up a new instance
  2. Periodic bulk updates

    • Daily/hourly data refreshes
    • Syncing from external systems
  3. Data migrations

    • Moving between Weaviate instances
    • Restructuring collections
  4. Any production ingestion pipeline

    • ETL processes
    • Continuous data ingestion
    • Background jobs

What's next?

Now that you understand the difference between single and batch operations, let's dive deeper into implementing batch imports, starting with different batching options.
