Preparing for ingestion
Appropriate preparation and validation of data before ingestion can save time and prevent errors. Let's consider key aspects of data preparation for Weaviate ingestion.
Data schema considerations
To Autoschema or not to Autoschema?
The "Autoschema" feature, if enabled, allows Weaviate to create any missing property definitions on-the-fly during data ingestion. While convenient, it has trade-offs against defining an explicit schema upfront.
Pros:
- Faster initial setup
- Flexible for evolving data
Cons:
- Potential for incorrect type inference
- Harder to catch errors early
- Malformed data can go unnoticed until search time
Using Autoschema can be useful for prototyping or exploratory phases. However, we recommend defining an explicit schema for production use cases.
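For context, here is a minimal sketch of what relying on Autoschema looks like during prototyping. Weaviate infers property definitions from the first objects you insert; the "ScratchProducts" collection and its fields are hypothetical examples, not part of the course dataset.
import weaviate
import os

client = weaviate.connect_to_weaviate_cloud(
    cluster_url=os.getenv("WEAVIATE_URL"),
    auth_credentials=os.getenv("WEAVIATE_API_KEY")
)

# Create a collection without defining any properties
client.collections.create(name="ScratchProducts")

scratch = client.collections.use("ScratchProducts")

# Autoschema infers property names and data types from the inserted object
scratch.data.insert({
    "sku": "SKU-123",
    "price": 29.99,
})

# Inspect what Weaviate inferred
for prop in scratch.config.get().properties:
    print(prop.name, prop.data_type)

client.close()
For production, define the schema explicitly upfront: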
import weaviate
from weaviate.classes.config import Configure, Property, DataType
import os

client = weaviate.connect_to_weaviate_cloud(
    cluster_url=os.getenv("WEAVIATE_URL"),
    auth_credentials=os.getenv("WEAVIATE_API_KEY"),
    headers={
        "X-OpenAI-Api-Key": os.getenv("OPENAI_API_KEY")
    }
)

# Explicit schema definition
client.collections.create(
    name="Products",
    properties=[
        Property(name="sku", data_type=DataType.TEXT),
        Property(name="name", data_type=DataType.TEXT),
        Property(name="description", data_type=DataType.TEXT),
        Property(name="price", data_type=DataType.NUMBER),
        Property(name="in_stock", data_type=DataType.BOOL),
        Property(name="categories", data_type=DataType.TEXT_ARRAY),
        Property(name="created_date", data_type=DataType.DATE),
    ],
    vector_config=Configure.Vectors.text2vec_openai()
)

client.close()
Handling nulls and missing values
Weaviate handles missing properties gracefully, but be intentional about how you treat them.
Decide whether you need to filter objects by null values; if so, the collection's inverted index must be configured to index the null state. Also consider whether null values should be replaced with sensible defaults.
import weaviate
from weaviate.classes.config import Configure, Property, DataType
from datetime import datetime, timezone
import os

client = weaviate.connect_to_weaviate_cloud(
    cluster_url=os.getenv("WEAVIATE_URL"),
    auth_credentials=os.getenv("WEAVIATE_API_KEY"),
    headers={
        "X-OpenAI-Api-Key": os.getenv("OPENAI_API_KEY")  # Needed to vectorize inserted objects
    }
)

# Explicit schema definition
client.collections.create(
    name="Products",
    properties=[
        Property(name="sku", data_type=DataType.TEXT),
        Property(name="name", data_type=DataType.TEXT),
        Property(name="description", data_type=DataType.TEXT),
        Property(name="price", data_type=DataType.NUMBER),
        Property(name="in_stock", data_type=DataType.BOOL),
        Property(name="categories", data_type=DataType.TEXT_ARRAY),
        Property(name="created_date", data_type=DataType.DATE),
    ],
    vector_config=Configure.Vectors.text2vec_openai(),
    # Consider whether to index null values for filtering
    inverted_index_config=Configure.inverted_index(
        index_null_state=True
    )
)
products = client.collections.use("Products")

# NOT SO GREAT:
bad_product = {
    "sku": "SKU-123",
    "name": "Product Name",
    "description": None,  # Consider an empty string instead
    "price": 29.99,
    "in_stock": None,  # Avoid null for boolean
    "categories": None,  # Consider empty array instead
    "created_date": datetime.now(timezone.utc).isoformat()
}

# GOOD: Use defaults when appropriate
product = {
    "sku": "SKU-123",
    "name": "Product Name",
    "description": "",  # Empty string instead of None
    "price": 29.99,
    "in_stock": True,  # Sensible default
    "categories": [],  # Empty array instead of None
    "created_date": datetime.now(timezone.utc).isoformat()
}

products.data.insert(product)

client.close()
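If your source data arrives with None values scattered across many records, a small helper can apply defaults consistently before insertion. This is a minimal sketch; the apply_defaults name and the chosen defaults are illustrative, not part of the Weaviate API.
# Hypothetical helper: replace None values with sensible defaults before insertion
DEFAULTS = {
    "description": "",   # Empty string instead of None
    "in_stock": False,   # Explicit boolean default
    "categories": [],    # Empty array instead of None
}

def apply_defaults(obj, defaults=DEFAULTS):
    """Return a copy of obj with None values replaced by defaults, where defined."""
    cleaned = dict(obj)
    for key, default in defaults.items():
        if cleaned.get(key) is None:
            cleaned[key] = default
    return cleaned

# Usage
raw = {"sku": "SKU-456", "name": "Another Product", "description": None,
       "price": 9.99, "in_stock": None, "categories": None}
print(apply_defaults(raw))
# {'sku': 'SKU-456', 'name': 'Another Product', 'description': '',
#  'price': 9.99, 'in_stock': False, 'categories': []}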
Pre-ingestion validation patterns
Before bulk ingestion, validate your data to catch issues early. Some common checks include:
- Required properties: Ensure critical fields are present
- Data types: Verify numbers are numbers, dates are valid, etc.
- Value ranges: Check for reasonable values (e.g., ratings 0-10)
- Text length: Very long strings might cause issues (consider chunking, summarization or truncation)
- Array elements: Validate array item types and sizes
Practical validation example
One approach is to validate a sample of the data before ingestion. Here's an indicative example that flags potential errors in the objects to be ingested.
from datetime import datetime

def validate_product(product, index):
    """Validate a single product object."""
    errors = []

    # Required fields
    required = ["sku", "name", "price"]
    for field in required:
        if field not in product or product[field] is None:
            errors.append(f"Missing required field: {field}")

    # SKU must be a non-empty string
    if "sku" in product:
        if not isinstance(product["sku"], str) or len(product["sku"]) == 0:
            errors.append("SKU must be non-empty string")

    # Price validation
    if "price" in product:
        if not isinstance(product["price"], (int, float)):
            errors.append("Price must be a number")
        elif product["price"] < 0:
            errors.append("Price cannot be negative")
        elif product["price"] > 1000000:
            errors.append("Price seems unreasonably high")

    # Categories validation
    if "categories" in product:
        if not isinstance(product["categories"], list):
            errors.append("Categories must be a list")
        elif len(product["categories"]) == 0:
            errors.append("At least one category required")

    # Date validation
    if "created_date" in product:
        try:
            datetime.fromisoformat(product["created_date"].replace('Z', '+00:00'))
        except (ValueError, AttributeError):
            errors.append("Invalid date format")

    return errors

def validate_product_batch(products, sample_size=10):
    """Validate a sample from a batch of products."""
    sample = products[:sample_size]
    all_errors = {}

    for i, product in enumerate(sample):
        errors = validate_product(product, i)
        if errors:
            all_errors[i] = errors

    if all_errors:
        print(f"Validation failed for {len(all_errors)} objects:")
        for idx, errors in all_errors.items():
            print(f"\nObject {idx}:")
            for error in errors:
                print(f"  - {error}")
        return False
    else:
        print(f"Validation passed for {len(sample)} sample objects")
        return True

# Usage
products = [
    # Your product data
]

if validate_product_batch(products, sample_size=20):
    print("Safe to proceed with bulk import")
else:
    print("Fix validation errors before importing")
Going through these checks before ingestion can help catch issues early and ensure data quality.
Search strategy and chunking considerations
While this course focuses on ingestion, two important topics affect how you prepare data:
Search strategy
How you plan to search affects how you should structure your data:
- Keyword search (BM25): Consider property tokenization, BM25 parameters (such as k1 and b), and stop words (see the sketch after this list)
- Vector search: How will you embed the data, and how many vectors per object?
- Hybrid search: What balance of keyword and vector search results do you need?
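As an illustration, keyword-search behaviour is largely set at collection-creation time. The following is a minimal sketch, assuming an OpenAI vectorizer as in the earlier examples; the "Articles" collection, the tokenization choices, and the BM25 values are illustrative, not recommendations.
import weaviate
from weaviate.classes.config import Configure, Property, DataType, Tokenization
import os

client = weaviate.connect_to_weaviate_cloud(
    cluster_url=os.getenv("WEAVIATE_URL"),
    auth_credentials=os.getenv("WEAVIATE_API_KEY"),
    headers={"X-OpenAI-Api-Key": os.getenv("OPENAI_API_KEY")}
)

client.collections.create(
    name="Articles",  # Hypothetical collection for illustration
    properties=[
        # FIELD tokenization treats the whole value as a single token, useful for exact codes
        Property(name="reference_code", data_type=DataType.TEXT, tokenization=Tokenization.FIELD),
        # WORD tokenization (the default) splits text into words for keyword matching
        Property(name="body", data_type=DataType.TEXT, tokenization=Tokenization.WORD),
    ],
    vector_config=Configure.Vectors.text2vec_openai(),
    # BM25 parameters influence keyword-search ranking
    inverted_index_config=Configure.inverted_index(
        bm25_b=0.75,
        bm25_k1=1.2,
    ),
)

client.close()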
For details, see the Search Strategies: In Depth course.
Chunking
Texts that are too long can lead to:
- Errors during vectorization
- Increased latency (which can grow non-linearly with text length)
- Poor search relevance
- Wasted context window in generative AI (RAG or agentic) use cases
Chunking is one possible solution for this problem. For texts exceeding a particular length, consider breaking them into smaller pieces before ingestion.
How to chunk is a complex topic, and some considerations include:
- What type of data is it? (e.g., natural language, code, tabular)
- What embedding model will be used, and what are its token limits?
- What kinds of queries will be run against the data?
- How will the chunks be used downstream?
A good starting point is to use length-based chunking with some overlap, then iterate based on search relevance testing. This topic is covered in more detail in another course.
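To make the idea concrete, here is a minimal sketch of length-based chunking with overlap, using character counts for simplicity. The chunk size and overlap values are illustrative, and token-based splitting may suit your embedding model better.
def chunk_text(text, chunk_size=1000, overlap=100):
    """Split text into fixed-length chunks with some overlap between neighbours."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be greater than overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

# Usage
long_description = "..." * 2000  # Placeholder for a long product description
for i, chunk in enumerate(chunk_text(long_description)):
    print(i, len(chunk))
Each chunk can then be ingested as its own object, typically with a property that points back to its source record.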
Now that your data is prepared and validated, let's learn about batch import operations and why they're essential for production.