Weaviate Academy

In this section, we import data into the collection we just created.

Code

Run this code to import the movie data into our collection; it will:

Load the source data and get the collection
Enter a context manager with a batcher (batch) object
Loop through the data and add objects to the batcher
Print the number of objects that failed to import, if any

import weaviate
import pandas as pd
import requests
from datetime import datetime, timezone
import json
from weaviate.util import generate_uuid5
from tqdm import tqdm
import os

# Instantiate your client (not shown). e.g.:
# client = weaviate.connect_to_weaviate_cloud(...) or
# client = weaviate.connect_to_local(...)

data_url = "https://raw.githubusercontent.com/weaviate-tutorials/edu-datasets/main/movies_data_1990_2024.json"
resp = requests.get(data_url)
df = pd.DataFrame(resp.json())

# Get the collection object
movies = client.collections.use("Movies")

# Enter context manager
with movies.batch.fixed_size(batch_size=200) as batch:
    # Loop through the data
    for i, movie in tqdm(df.iterrows()):
        # Convert data types
        # Convert a JSON date to `datetime` and add time zone information
        release_date = datetime.fromisoformat(movie["release_date"]).replace(tzinfo=timezone.utc)
        # Convert a JSON array to a list of integers
        genre_ids = json.loads(movie["genre_ids"])

        # Build the object payload
        movie_obj = {
            "title": movie["title"],
            "overview": movie["overview"],
            "vote_average": movie["vote_average"],
            "genre_ids": genre_ids,
            "release_date": release_date,
            "tmdb_id": movie["id"],
        }

        # Add object to batch queue
        batch.add_object(
            properties=movie_obj,
            uuid=generate_uuid5(movie["id"])
        )
        # Batcher automatically sends batches

# Check for failed objects
if len(movies.batch.failed_objects) > 0:
    print(f"Failed to import {len(movies.batch.failed_objects)} objects")

client.close()

Explain the code

Preparation

We use the requests library to load the data from the source, in this case a JSON file. The data is then converted to a Pandas DataFrame for easier manipulation.

Then, we create a collection object (with client.collections.use) so we can interact with the collection.

Batch context manager

The batch object is a context manager that queues the objects you add and sends them to Weaviate in batches. This is useful when you have a large amount of data to import, as it abstracts away the complexity of managing the batch size and deciding when to send each batch.

with movies.batch.fixed_size(batch_size=200) as batch:

This example uses the .fixed_size() method to create a batcher which sets the number of objects per batch. There are also other batcher types, like .rate_limit() for specifying the number of objects per minute and .dynamic() to create a dynamic batcher, where the client determines and updates the batch size during the import process.
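
To make the idea of a fixed-size batcher concrete, here is a minimal pure-Python sketch. It is not the Weaviate client's implementation: the FixedSizeBatcher class and its send callback are hypothetical names used only for illustration. It shows a context manager that buffers objects, flushes whenever the buffer reaches batch_size, and flushes the remainder on exit.

```python
from typing import Callable, List


class FixedSizeBatcher:
    """Toy batcher (hypothetical): flushes every `batch_size` objects."""

    def __init__(self, batch_size: int, send: Callable[[List[dict]], None]):
        self.batch_size = batch_size
        self.send = send  # called once per batch with the list of queued objects
        self._queue: List[dict] = []

    def __enter__(self) -> "FixedSizeBatcher":
        return self

    def add_object(self, properties: dict) -> None:
        # Queue the object and flush as soon as the batch is full
        self._queue.append(properties)
        if len(self._queue) >= self.batch_size:
            self._flush()

    def _flush(self) -> None:
        if self._queue:
            self.send(self._queue)
            self._queue = []  # start a fresh list; the sent one is left untouched

    def __exit__(self, exc_type, exc, tb) -> None:
        # Send whatever is still queued when the `with` block ends
        self._flush()


sent_batches: List[List[dict]] = []
with FixedSizeBatcher(batch_size=2, send=sent_batches.append) as batch:
    for i in range(5):
        batch.add_object({"n": i})

print([len(b) for b in sent_batches])  # [2, 2, 1]
```

With five objects and a batch size of two, the sketch sends two full batches and one partial batch at exit, which is why the calling code never needs an explicit "send" step.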

Add data to the batcher

Convert data types

Where appropriate, data is converted from a string to the correct data type for Weaviate. For example, the release_date is converted to a datetime object, and the genre_ids are converted to a list of integers.

        # Convert a JSON date to `datetime` and add time zone information
        release_date = datetime.fromisoformat(movie["release_date"]).replace(tzinfo=timezone.utc)
        # Convert a JSON array to a list of integers
        genre_ids = json.loads(movie["genre_ids"])
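
As a small worked example of the two conversions above, the sketch below applies them to a single made-up record. The values in raw are hypothetical and only mimic the string format assumed for the source data.

```python
import json
from datetime import datetime, timezone

# Hypothetical record: both fields arrive as strings
raw = {"release_date": "1995-11-22", "genre_ids": "[16, 35, 10751]"}

# Parse the date string, then attach UTC so the value is time zone aware
release_date = datetime.fromisoformat(raw["release_date"]).replace(tzinfo=timezone.utc)

# Parse the JSON array string into a Python list of integers
genre_ids = json.loads(raw["genre_ids"])

print(release_date.isoformat())  # 1995-11-22T00:00:00+00:00
print(genre_ids)                 # [16, 35, 10751]
```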

Matching each Python data type to its Weaviate counterpart is crucial for successful ingestion. Here are the mappings for the common types:

str : DataType.TEXT
int : DataType.INT
float : DataType.NUMBER
bool : DataType.BOOL
datetime : DataType.DATE
list[str] : DataType.TEXT_ARRAY
list[int] : DataType.INT_ARRAY
list[float] : DataType.NUMBER_ARRAY
list[bool] : DataType.BOOL_ARRAY
list[datetime] : DataType.DATE_ARRAY

See the documentation for the full list of data types and their mappings.
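
One way to use the mappings above is to check a payload before importing it. The following is a sketch only: expected_data_type is a hypothetical helper, not part of the Weaviate client, and it returns the DataType member name as a plain string (for example "INT_ARRAY" for DataType.INT_ARRAY). It assumes that lists are non-empty and homogeneous.

```python
from datetime import datetime, timezone

# Scalar mappings from the list above. `bool` is checked before `int`
# because `bool` is a subclass of `int` in Python.
_SCALAR_TYPES = [
    (bool, "BOOL"),
    (int, "INT"),
    (float, "NUMBER"),
    (str, "TEXT"),
    (datetime, "DATE"),
]


def expected_data_type(value) -> str:
    """Return the DataType member name that matches a Python value."""
    if isinstance(value, list):
        if not value:
            raise ValueError("cannot infer the element type of an empty list")
        if isinstance(value[0], list):
            raise TypeError("nested lists are not covered by this sketch")
        # Assumes every element has the same type as the first one
        return expected_data_type(value[0]) + "_ARRAY"
    for py_type, name in _SCALAR_TYPES:
        if isinstance(value, py_type):
            return name
    raise TypeError(f"no mapping for {type(value).__name__}")


# Hypothetical payload in the shape used in this lesson
movie_obj = {
    "title": "Example",
    "vote_average": 7.9,
    "genre_ids": [16, 35],
    "release_date": datetime(1995, 11, 22, tzinfo=timezone.utc),
    "tmdb_id": 862,
}
for key, value in movie_obj.items():
    print(key, "->", expected_data_type(value))
```

Comparing these names against the collection definition before the import can surface a mismatch, such as a date that is still a string, earlier than a failed batch would.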

Add objects to the batcher

Then we loop through the data and add each object to the batcher with batch.add_object. The batcher sends the queued objects to Weaviate for ingestion automatically, so the loop needs no explicit send call.

        movie_obj = {
            "title": movie["title"],
            "overview": movie["overview"],
            "vote_average": movie["vote_average"],
            "genre_ids": genre_ids,
            "release_date": release_date,
            "tmdb_id": movie["id"],
        }

        # Add object to batch queue
        batch.add_object(
            properties=movie_obj,
            uuid=generate_uuid5(movie["id"])
        )
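
The uuid argument above derives the object ID from the movie ID rather than generating a random one. The sketch below illustrates that general idea with the standard library's uuid.uuid5. The deterministic_uuid function is a hypothetical stand-in, and this sketch does not claim to produce the same values as generate_uuid5.

```python
import uuid


def deterministic_uuid(identifier) -> str:
    """Hypothetical stand-in: the same identifier always yields the same UUID."""
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, str(identifier)))


first = deterministic_uuid(862)
second = deterministic_uuid(862)
other = deterministic_uuid(863)

print(first == second)  # True
print(first == other)   # False
```

The usual motivation for deterministic IDs is that re-running an import targets the same object IDs instead of creating a second copy of each object.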

Error handling

In a batch import, it's possible for some objects to fail for various reasons, such as validation errors. The batcher saves these errors.

You can inspect these errors, or otherwise handle them as needed. This example simply prints the number of objects that failed to import.

if len(movies.batch.failed_objects) > 0:
    print(f"Failed to import {len(movies.batch.failed_objects)} objects")

client.close()

Note that the list of errors is cleared when a new context manager is entered, so you must handle the errors before initializing a new batcher.
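
If a count alone is not enough, the failures can be grouped by message. The sketch below rests on an assumption: that each entry in the failed-objects list exposes a message attribute. The summarize_failures helper is hypothetical, and the example runs against stand-in objects rather than a live Weaviate instance.

```python
from collections import Counter
from types import SimpleNamespace


def summarize_failures(failed_objects, max_examples=3):
    """Group failures by message; falls back to str(entry) if there is no `message`."""
    counts = Counter(str(getattr(err, "message", err)) for err in failed_objects)
    lines = [f"{len(failed_objects)} object(s) failed to import"]
    for message, n in counts.most_common(max_examples):
        lines.append(f"  {n} x {message}")
    return lines


# Stand-in entries for illustration only
fake_failures = [
    SimpleNamespace(message="invalid date"),
    SimpleNamespace(message="invalid date"),
    SimpleNamespace(message="missing title"),
]
for line in summarize_failures(fake_failures):
    print(line)
```

In the lesson's code, such a helper would be called with movies.batch.failed_objects, before a new batch context manager is entered.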

Where do the vectors come from?

When the batcher sends the queue to Weaviate, the objects are added to the collection, in our case the Movies collection.

Recall that the collection is configured with a vector using an embedding model integration. As we do not specify vectors here, Weaviate uses the specified model integration to generate vector embeddings from the data.

What's next?

Now that you have imported data into the collection, go to the next lesson to see how to query the data.
