Preparing for ingestion
Appropriate preparation and validation of data before ingestion can save time and prevent errors. Let's consider key aspects of data preparation for Weaviate ingestion.
Data schema considerations
To Autoschema or not to Autoschema?
The "Autoschema" feature, if enabled, allows Weaviate to create any missing property definitions on-the-fly during data ingestion. While convenient, it has trade-offs against defining an explicit schema upfront.
Pros:
- Faster initial setup
- Flexible for evolving data
Cons:
- Potential for incorrect type inference
- Harder to catch errors early
- Malformed data can go unnoticed until search time
Using Autoschema can be useful for prototyping or exploratory phases. However, we recommend defining an explicit schema for production use cases.
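For context, here is a minimal sketch of what relying on Autoschema looks like during prototyping. Weaviate infers property definitions from the first objects you insert; the "ScratchProducts" collection and its fields are hypothetical examples, not part of the course dataset.
import weaviate
import os

client = weaviate.connect_to_weaviate_cloud(
    cluster_url=os.getenv("WEAVIATE_URL"),
    auth_credentials=os.getenv("WEAVIATE_API_KEY")
)

# Create a collection without defining any properties
client.collections.create(name="ScratchProducts")

scratch = client.collections.use("ScratchProducts")

# Autoschema infers property names and data types from the inserted object
scratch.data.insert({
    "sku": "SKU-123",
    "price": 29.99,
})

# Inspect what Weaviate inferred
for prop in scratch.config.get().properties:
    print(prop.name, prop.data_type)

client.close()
For production, define the schema explicitly upfront: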
import weaviate
from weaviate.classes.config import Configure, Property, DataType
import os

client = weaviate.connect_to_weaviate_cloud(
    cluster_url=os.getenv("WEAVIATE_URL"),
    auth_credentials=os.getenv("WEAVIATE_API_KEY"),
    headers={
        "X-OpenAI-Api-Key": os.getenv("OPENAI_API_KEY")
    }
)

# Explicit schema definition
client.collections.create(
    name="Products",
    properties=[
        Property(name="sku", data_type=DataType.TEXT),
        Property(name="name", data_type=DataType.TEXT),
        Property(name="description", data_type=DataType.TEXT),
        Property(name="price", data_type=DataType.NUMBER),
        Property(name="in_stock", data_type=DataType.BOOL),
        Property(name="categories", data_type=DataType.TEXT_ARRAY),
        Property(name="created_date", data_type=DataType.DATE),
    ],
    vector_config=Configure.Vectors.text2vec_openai()
)

client.close()
Handling nulls and missing values
Weaviate handles missing properties gracefully, but be intentional about how you treat them.
Decide whether you need to filter objects by null values; if so, the collection's inverted index must be configured to index the null state. Also consider whether null values should be replaced with sensible defaults.
import weaviate
from weaviate.classes.config import Configure, Property, DataType
from datetime import datetime, timezone
import os

client = weaviate.connect_to_weaviate_cloud(
    cluster_url=os.getenv("WEAVIATE_URL"),
    auth_credentials=os.getenv("WEAVIATE_API_KEY"),
    headers={
        "X-OpenAI-Api-Key": os.getenv("OPENAI_API_KEY")  # Needed to vectorize inserted objects
    }
)

# Explicit schema definition
client.collections.create(
    name="Products",
    properties=[
        Property(name="sku", data_type=DataType.TEXT),
        Property(name="name", data_type=DataType.TEXT),
        Property(name="description", data_type=DataType.TEXT),
        Property(name="price", data_type=DataType.NUMBER),
        Property(name="in_stock", data_type=DataType.BOOL),
        Property(name="categories", data_type=DataType.TEXT_ARRAY),
        Property(name="created_date", data_type=DataType.DATE),
    ],
    vector_config=Configure.Vectors.text2vec_openai(),
    # Consider whether to index null values for filtering
    inverted_index_config=Configure.inverted_index(
        index_null_state=True
    )
)
products = client.collections.use("Products")

# NOT SO GREAT:
bad_product = {
    "sku": "SKU-123",
    "name": "Product Name",
    "description": None,  # Consider an empty string instead
    "price": 29.99,
    "in_stock": None,  # Avoid null for boolean
    "categories": None,  # Consider empty array instead
    "created_date": datetime.now(timezone.utc).isoformat()
}

# GOOD: Use defaults when appropriate
product = {
    "sku": "SKU-123",
    "name": "Product Name",
    "description": "",  # Empty string instead of None
    "price": 29.99,
    "in_stock": True,  # Sensible default
    "categories": [],  # Empty array instead of None
    "created_date": datetime.now(timezone.utc).isoformat()
}

products.data.insert(product)

client.close()
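If your source data arrives with None values scattered across many records, a small helper can apply defaults consistently before insertion. This is a minimal sketch; the apply_defaults name and the chosen defaults are illustrative, not part of the Weaviate API.
# Hypothetical helper: replace None values with sensible defaults before insertion
DEFAULTS = {
    "description": "",   # Empty string instead of None
    "in_stock": False,   # Explicit boolean default
    "categories": [],    # Empty array instead of None
}

def apply_defaults(obj, defaults=DEFAULTS):
    """Return a copy of obj with None values replaced by defaults, where defined."""
    cleaned = dict(obj)
    for key, default in defaults.items():
        if cleaned.get(key) is None:
            cleaned[key] = default
    return cleaned

# Usage
raw = {"sku": "SKU-456", "name": "Another Product", "description": None,
       "price": 9.99, "in_stock": None, "categories": None}
print(apply_defaults(raw))
# {'sku': 'SKU-456', 'name': 'Another Product', 'description': '',
#  'price': 9.99, 'in_stock': False, 'categories': []}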
Pre-ingestion validation patterns
Before bulk ingestion, validate your data to catch issues early. Some common checks include:
- Required properties: Ensure critical fields are present
- Data types: Verify numbers are numbers, dates are valid, etc.
- Value ranges: Check for reasonable values (e.g., ratings 0-10)
- Text length: Very long strings might cause issues (consider chunking, summarization or truncation)
- Array elements: Validate array item types and sizes
Practical validation example
One approach is to validate a sample of the data before ingestion. Here's an indicative example that flags potential errors in the objects to be ingested.
from datetime import datetime

def validate_product(product, index):
    """Validate a single product object."""
    errors = []

    # Required fields
    required = ["sku", "name", "price"]
    for field in required:
        if field not in product or product[field] is None:
            errors.append(f"Missing required field: {field}")

    # SKU must be a non-empty string
    if "sku" in product:
        if not isinstance(product["sku"], str) or len(product["sku"]) == 0:
            errors.append("SKU must be non-empty string")

    # Price validation
    if "price" in product:
        if not isinstance(product["price"], (int, float)):
            errors.append("Price must be a number")
        elif product["price"] < 0:
            errors.append("Price cannot be negative")
        elif product["price"] > 1000000:
            errors.append("Price seems unreasonably high")

    # Categories validation
    if "categories" in product:
        if not isinstance(product["categories"], list):
            errors.append("Categories must be a list")
        elif len(product["categories"]) == 0:
            errors.append("At least one category required")

    # Date validation
    if "created_date" in product:
        try:
            datetime.fromisoformat(product["created_date"].replace('Z', '+00:00'))
        except (ValueError, AttributeError):
            errors.append("Invalid date format")

    return errors

def validate_product_batch(products, sample_size=10):
    """Validate a sample from a batch of products."""
    sample = products[:sample_size]
    all_errors = {}

    for i, product in enumerate(sample):
        errors = validate_product(product, i)
        if errors:
            all_errors[i] = errors

    if all_errors:
        print(f"Validation failed for {len(all_errors)} objects:")
        for idx, errors in all_errors.items():
            print(f"\nObject {idx}:")
            for error in errors:
                print(f"  - {error}")
        return False
    else:
        print(f"Validation passed for {len(sample)} sample objects")
        return True

# Usage
products = [
    # Your product data
]

if validate_product_batch(products, sample_size=20):
    print("Safe to proceed with bulk import")
else:
    print("Fix validation errors before importing")
Going through these checks before ingestion can help catch issues early and ensure data quality.
Search strategy and chunking considerations
While this course focuses on ingestion, two important topics affect how you prepare data:
Search strategy
How you plan to search affects how you should structure your data:
- Keyword search (BM25): Consider property tokenization, BM25 parameters (such as k1 and b), and stop words (see the sketch after this list)
- Vector search: How will you embed the data, and how many vectors per object?
- Hybrid search: What balance of keyword and vector search results do you need?
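As an illustration, keyword-search behaviour is largely set at collection-creation time. The following is a minimal sketch, assuming an OpenAI vectorizer as in the earlier examples; the "Articles" collection, the tokenization choices, and the BM25 values are illustrative, not recommendations.
import weaviate
from weaviate.classes.config import Configure, Property, DataType, Tokenization
import os

client = weaviate.connect_to_weaviate_cloud(
    cluster_url=os.getenv("WEAVIATE_URL"),
    auth_credentials=os.getenv("WEAVIATE_API_KEY"),
    headers={"X-OpenAI-Api-Key": os.getenv("OPENAI_API_KEY")}
)

client.collections.create(
    name="Articles",  # Hypothetical collection for illustration
    properties=[
        # FIELD tokenization treats the whole value as a single token, useful for exact codes
        Property(name="reference_code", data_type=DataType.TEXT, tokenization=Tokenization.FIELD),
        # WORD tokenization (the default) splits text into words for keyword matching
        Property(name="body", data_type=DataType.TEXT, tokenization=Tokenization.WORD),
    ],
    vector_config=Configure.Vectors.text2vec_openai(),
    # BM25 parameters influence keyword-search ranking
    inverted_index_config=Configure.inverted_index(
        bm25_b=0.75,
        bm25_k1=1.2,
    ),
)

client.close()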
For details, see the Search Strategies: In Depth course.
Chunking
Texts that are too long can lead to:
- Errors during vectorization
- Increased latency (which can grow non-linearly with text length)
- Poor search relevance
- Wasted context window in generative AI (RAG or agentic) use cases
Chunking is one possible solution for this problem. For texts exceeding a particular length, consider breaking them into smaller pieces before ingestion.
How to chunk is a complex topic, and some considerations include:
- What type of data is it? (e.g., natural language, code, tabular)
- What embedding model will be used, and what are its token limits?
- What kinds of queries will be run against the data?
- How will the chunks be used downstream?
A good starting point is to use length-based chunking with some overlap, then iterate based on search relevance testing. This topic is covered in more detail in another course.
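To make the idea concrete, here is a minimal sketch of length-based chunking with overlap, using character counts for simplicity. The chunk size and overlap values are illustrative, and token-based splitting may suit your embedding model better.
def chunk_text(text, chunk_size=1000, overlap=100):
    """Split text into fixed-length chunks with some overlap between neighbours."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be greater than overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

# Usage
long_description = "..." * 2000  # Placeholder for a long product description
for i, chunk in enumerate(chunk_text(long_description)):
    print(i, len(chunk))
Each chunk can then be ingested as its own object, typically with a property that points back to its source record.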
Now that your data is prepared and validated, let's learn about batch import operations and why they're essential for production.