Example solution: Ingest data into Weaviate
Here is an example solution for the data ingestion task, with explanations of key concepts and implementation decisions.
Collection creation
A sample solution is included in populate_complete.py.
Does the collection exist?
You can check this with client.collections.exists(). If it does exist, you can choose to skip the creation.
Some scripts delete the existing collection before creating a new one. This can be useful for development or evaluation, but it is dangerous: it deletes every object in the collection, with no way to recover them.
Instead, create a separate script for deleting collections, and only use it when you are sure you want to remove all data.
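The existence check can be wrapped in a small helper. The following is a minimal sketch, not the code from populate_complete.py: ensure_collection and its create callback are hypothetical names, and client is assumed to be a connected Weaviate client (any object exposing collections.exists(name) will work).

```python
from typing import Callable


def ensure_collection(client, name: str, create: Callable[[], None]) -> bool:
    """Create the collection only if it is missing.

    Returns True if the collection was created, False if it already existed.
    """
    if client.collections.exists(name):
        # Leave the existing collection (and its objects) untouched.
        return False
    create()
    return True
```

The create callback would hold the actual client.collections.create(...) call, so re-running the script never touches a collection that already holds data.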
Collection deletion
A sample solution is shown in delete_collection_complete.py. Note two things:
- We use client.collections.delete() to remove the collection.
- We ask for explicit confirmation before proceeding, as this is a destructive action. This is best practice, as it helps prevent accidental data loss.
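The two points above could be combined roughly as follows. This is an illustrative sketch rather than the contents of delete_collection_complete.py: the delete_with_confirmation name, the prompt wording, and the injectable ask parameter are assumptions made so the logic is easy to test.

```python
def delete_with_confirmation(client, name: str, ask=input) -> bool:
    """Delete a collection only after the user types its exact name.

    Returns True if the collection was deleted, False otherwise.
    """
    if not client.collections.exists(name):
        print(f"Collection '{name}' does not exist; nothing to delete.")
        return False

    answer = ask(f"Type '{name}' to permanently delete it and all its objects: ")
    if answer.strip() != name:
        print("Confirmation did not match; aborting.")
        return False

    # Destructive and irreversible: removes the collection and every object in it.
    client.collections.delete(name)
    return True
```

Requiring the collection name, rather than a plain "y", makes it harder to confirm a deletion by reflex.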
Collection creation
Collections in Weaviate are created with the client.collections.create() method.
- Each property is defined with a name (string) and a data type (DataType enum).
- This particular collection has two vectors per object, where the Configure.Vectors.text2vec_weaviate() integration is used to manage vectorization of objects and inputs as needed.
  - Each vector is configured with different source properties, which allows for searches by similarity to the description (title and overview), or by genre (genres).
client.collections.create(
    name=CollectionName.MOVIES,
    properties=[
        Property(name="movie_id", data_type=DataType.INT),
        Property(name="title", data_type=DataType.TEXT),
        Property(name="overview", data_type=DataType.TEXT),
        Property(name="genres", data_type=DataType.TEXT_ARRAY),
        Property(name="year", data_type=DataType.INT),
        Property(name="popularity", data_type=DataType.NUMBER),
    ],
    vector_config=[
        Configure.Vectors.text2vec_weaviate(
            name="default",
            source_properties=["title", "overview"],
            model="Snowflake/snowflake-arctic-embed-l-v2.0",
        ),
        Configure.Vectors.text2vec_weaviate(
            name="genres",
            source_properties=["genres"],
            model="Snowflake/snowflake-arctic-embed-l-v2.0",
        ),
    ],
)
Import data
The solution uses batch imports to efficiently ingest data into Weaviate.
Here, we enter a context manager with fixed-size batching and iterate over the dataset, parsing each row with the parse_data_object() helper function.
Each object is then added to the batch, which handles the batching logic automatically. Pre-generated vectors are included in the dataset for convenience and speed, so we add those as well. (If they were not included, the vector integration would generate them during import.)
We also add a UUID, which uniquely identifies each object and helps prevent duplicate entries.
with movies.batch.fixed_size(batch_size=200) as batch:
    # Process each movie object
    for data_row in tqdm(dataset):
        obj = parse_data_object(data_row)
        uuid = generate_uuid5(obj)
        batch.add_object(
            properties=obj["properties"],
            uuid=uuid,
            vector=obj["vectors"]
        )
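To see why a deterministic UUID helps with duplicates: the same input always maps to the same ID, so re-importing an object targets that existing ID instead of producing a second copy. The sketch below uses Python's standard uuid.uuid5 as a stand-in for generate_uuid5; the stable_id helper, the JSON serialization, and the namespace choice are assumptions for illustration, and the IDs it produces are not guaranteed to match those of the Weaviate helper.

```python
import json
import uuid


def stable_id(obj: dict) -> str:
    """Derive a deterministic UUID from an object's content."""
    # Serialize with sorted keys so key order does not change the ID.
    canonical = json.dumps(obj, sort_keys=True)
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, canonical))


movie = {"movie_id": 1, "title": "Example"}
same_movie = {"title": "Example", "movie_id": 1}
other_movie = {"movie_id": 2, "title": "Example"}

print(stable_id(movie) == stable_id(same_movie))   # True
print(stable_id(movie) == stable_id(other_movie))  # False
```

Because the ID depends only on the content, running the import twice yields the same IDs for the same movies.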
Error handling
Good batching practice includes checking for errors during ingestion. At a minimum, inspect the number of failed objects at the end of each import.
if len(movies.batch.failed_objects) > 0:
    print(f"Failed to add {len(movies.batch.failed_objects)} objects")
    for failed_obj in movies.batch.failed_objects[:3]:
        print(failed_obj)
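Errors can also be watched while the import is still running, so that a badly broken run stops early instead of processing the whole dataset. This is a hedged sketch: import_with_error_limit and max_errors are hypothetical names, the objects are assumed to be dicts with a "properties" key and an optional "uuid" key, and the batch object is assumed to expose a running number_errors counter, as the batch context in the Weaviate Python client does.

```python
def import_with_error_limit(batch, objects, max_errors: int = 10) -> int:
    """Add objects to a batch, stopping once too many errors accumulate.

    Returns the number of objects submitted to the batch.
    """
    submitted = 0
    for obj in objects:
        batch.add_object(properties=obj["properties"], uuid=obj.get("uuid"))
        submitted += 1
        # number_errors is assumed to be a running count kept by the batch.
        if batch.number_errors > max_errors:
            print(f"Stopping early: {batch.number_errors} errors so far.")
            break
    return submitted
```

It would be called inside the same context manager as above, for example with movies.batch.fixed_size(batch_size=200) as batch, followed by the failed_objects check once the context exits.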
With your movie data successfully ingested into Weaviate, you're ready to start building the API endpoints that will showcase different search and retrieval capabilities.