
Example solution: Ingest data into Weaviate

Here is an example solution for the data ingestion task, with explanations of key concepts and implementation decisions.

Collection creation

A sample solution is included in populate_complete.py.

Does the collection exist?

You can check this with client.collections.exists(). If the collection already exists, you can choose to skip creation.
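
A minimal sketch of this check, assuming a connected client and the CollectionName.MOVIES constant used elsewhere in this solution:

    # Skip creation if the collection is already present
    if client.collections.exists(CollectionName.MOVIES):
        print(f"Collection '{CollectionName.MOVIES}' already exists - skipping creation.")
    else:
        # ... create the collection here (see the creation snippet below)
        pass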

Deleting the collection if it exists

Some scripts are set up to delete the existing collection before creating a new one. This can be convenient during development or evaluation, but it is dangerous: deleting a collection also deletes all of its objects, with no way to recover them.

Instead, create a separate script for deleting collections, and only use it when you are sure you want to remove all data.

Collection deletion

A sample solution is shown in delete_collection_complete.py. Note two things:

  • We use client.collections.delete() to remove the collection.
  • We ask for explicit confirmation before proceeding, as this is a destructive action. This is best practice, as it helps prevent accidental data loss.
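
A minimal sketch of this pattern is shown below; the exact prompt in delete_collection_complete.py may differ, and client and CollectionName.MOVIES are assumed from the rest of the solution.

    # Deleting a collection is destructive - require explicit confirmation first
    user_input = input(f"Delete '{CollectionName.MOVIES}' and ALL of its objects? (y/N): ")
    if user_input.strip().lower() == "y":
        client.collections.delete(CollectionName.MOVIES)
        print("Collection deleted.")
    else:
        print("Aborted - no changes were made.")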

Collection creation

A collection is created in Weaviate using the client.collections.create() method.

  • Each property is defined with a name (a string) and a data type (a DataType enum value).
  • This particular collection defines two named vectors per object, using the Configure.Vectors.text2vec_weaviate() integration to vectorize objects (and query inputs) as needed.
    • Each vector is configured with different source properties, allowing searches by similarity to the description (title and overview) or by genre (genres).

    client.collections.create(
        name=CollectionName.MOVIES,
        properties=[
            Property(name="movie_id", data_type=DataType.INT),
            Property(name="title", data_type=DataType.TEXT),
            Property(name="overview", data_type=DataType.TEXT),
            Property(name="genres", data_type=DataType.TEXT_ARRAY),
            Property(name="year", data_type=DataType.INT),
            Property(name="popularity", data_type=DataType.NUMBER),
        ],
        vector_config=[
            Configure.Vectors.text2vec_weaviate(
                name="default",
                source_properties=["title", "overview"],
                model="Snowflake/snowflake-arctic-embed-l-v2.0",
            ),
            Configure.Vectors.text2vec_weaviate(
                name="genres",
                source_properties=["genres"],
                model="Snowflake/snowflake-arctic-embed-l-v2.0",
            ),
        ],
    )

Import data

The solution uses batch imports to efficiently ingest data into Weaviate.

Here, we enter a context manager with fixed-size batching and iterate over the data rows, each of which is parsed by the parse_data_object() helper function.

Each object is then added to the batch, which automatically handles the batching logic. Pre-generated vectors are included in the dataset for convenience and speed, so we add those as well. (If they were not included, the vectorizer integration would generate them during import.)

We also supply a deterministic UUID (generated from the object with generate_uuid5()), which uniquely identifies each object and prevents duplicate entries if the import is re-run.

    with movies.batch.fixed_size(batch_size=200) as batch:
        # Process each movie object
        for data_row in tqdm(dataset):
            obj = parse_data_object(data_row)

            uuid = generate_uuid5(obj)
            batch.add_object(
                properties=obj["properties"],
                uuid=uuid,
                vector=obj["vectors"],
            )

Error handling

Best-practice batching includes inspecting any errors that occur during ingestion. At a minimum, check the number of failed objects at the end of the import.

    if len(movies.batch.failed_objects) > 0:
        print(f"Failed to add {len(movies.batch.failed_objects)} objects")
        for failed_obj in movies.batch.failed_objects[:3]:
            print(failed_obj)

What's next?

With your movie data successfully ingested into Weaviate, you're ready to start building the API endpoints that will showcase different search and retrieval capabilities.
