Import data
Here we will import data into the collection which we just created.
Code
Run this code to import the movie data into our collection; it will:
- Load the source data & get the collection
- Enter a context manager with a batcher (batch) object
- Loop through the data and add objects to the batcher
- Report how many objects failed to import, if any
import weaviate
import pandas as pd
import requests
from datetime import datetime, timezone
import json
from weaviate.util import generate_uuid5
from tqdm import tqdm
import os

# Instantiate your client (not shown). e.g.:
# client = weaviate.connect_to_weaviate_cloud(...) or
# client = weaviate.connect_to_local(...)

data_url = "https://raw.githubusercontent.com/weaviate-tutorials/edu-datasets/main/movies_data_1990_2024.json"
resp = requests.get(data_url)
df = pd.DataFrame(resp.json())

# Configure collection object
movies = client.collections.use("Movies")

# Enter context manager
with movies.batch.fixed_size(batch_size=200) as batch:
    # Loop through the data
    for i, movie in tqdm(df.iterrows()):
        # Convert data types
        # Convert a JSON date to `datetime` and add time zone information
        release_date = datetime.fromisoformat(movie["release_date"]).replace(tzinfo=timezone.utc)
        # Convert a JSON array to a list of integers
        genre_ids = json.loads(movie["genre_ids"])

        # Build the object payload
        movie_obj = {
            "title": movie["title"],
            "overview": movie["overview"],
            "vote_average": movie["vote_average"],
            "genre_ids": genre_ids,
            "release_date": release_date,
            "tmdb_id": movie["id"],
        }

        # Add object to batch queue
        batch.add_object(
            properties=movie_obj,
            uuid=generate_uuid5(movie["id"])
        )
        # Batcher automatically sends batches

# Check for failed objects
if len(movies.batch.failed_objects) > 0:
    print(f"Failed to import {len(movies.batch.failed_objects)} objects")

client.close()
Explain the code
Preparation
We use the requests library to load the data from the source, in this case a JSON file. The data is then converted to a Pandas DataFrame for easier manipulation.
Then, we create a collection object (with client.collections.use) so we can interact with the collection.
Batch context manager
The batch object is a context manager. Inside the with block, you add objects to the batcher, and it sends them to Weaviate in batches. This is useful when you have a large amount of data to import, as it abstracts away the complexity of managing the batch size and deciding when to send each batch.
with movies.batch.fixed_size(batch_size=200) as batch:
This example uses the .fixed_size() method to create a batcher which sets the number of objects per batch. There are also other batcher types, like .rate_limit() for specifying the number of objects per minute and .dynamic() to create a dynamic batcher, where the client determines and updates the batch size during the import process.
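To make the batching behavior concrete, here is a toy, standard-library-only sketch of what a fixed-size batcher does conceptually: it queues objects, sends a batch whenever the queue reaches batch_size, and flushes the remainder when the with block exits. ToyFixedSizeBatcher and its send callback are hypothetical names made up for illustration; this is not how the Weaviate client is implemented.

```python
from typing import Any, Callable, List


class ToyFixedSizeBatcher:
    """Illustrative only: queue objects and send them in fixed-size batches."""

    def __init__(self, batch_size: int, send: Callable[[List[Any]], None]):
        self.batch_size = batch_size
        self._send = send  # hypothetical callback standing in for a network request
        self._queue: List[Any] = []

    def __enter__(self) -> "ToyFixedSizeBatcher":
        return self

    def add_object(self, obj: Any) -> None:
        self._queue.append(obj)
        # Send as soon as a full batch has accumulated
        if len(self._queue) >= self.batch_size:
            self._flush()

    def _flush(self) -> None:
        if self._queue:
            self._send(self._queue)
            self._queue = []

    def __exit__(self, exc_type, exc, tb) -> None:
        # Send whatever is left over when the `with` block ends
        self._flush()


sent_batches: List[List[int]] = []

with ToyFixedSizeBatcher(batch_size=200, send=sent_batches.append) as batch:
    for n in range(450):
        batch.add_object(n)

print([len(b) for b in sent_batches])  # [200, 200, 50]
```

The calling code only ever calls add_object; deciding when to send is left entirely to the batcher, which is the convenience the context manager provides.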
Add data to the batcher
Convert data types
Where appropriate, data is converted from a string to the correct data type for Weaviate. For example, the release_date is converted to a datetime object, and the genre_ids are converted to a list of integers.
# Convert a JSON date to `datetime` and add time zone information
release_date = datetime.fromisoformat(movie["release_date"]).replace(tzinfo=timezone.utc)
# Convert a JSON array to a list of integers
genre_ids = json.loads(movie["genre_ids"])
Matching each Python data type to its Weaviate counterpart is crucial for successful ingestion. Here are the mappings for common types:
- str: DataType.TEXT
- int: DataType.INT
- float: DataType.NUMBER
- bool: DataType.BOOL
- datetime: DataType.DATE
- list[str]: DataType.TEXT_ARRAY
- list[int]: DataType.INT_ARRAY
- list[float]: DataType.NUMBER_ARRAY
- list[bool]: DataType.BOOL_ARRAY
- list[datetime]: DataType.DATE_ARRAY
See the documentation for the full list of data types and their mappings.
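The two conversions used in this lesson can be tried in isolation with only the standard library. The sample values below are made up for illustration; they are assumed to resemble the string-encoded fields in the source data.

```python
import json
from datetime import datetime, timezone

# Hypothetical raw values, as they might appear in one row of the source data
raw_release_date = "1999-03-31"
raw_genre_ids = "[18, 35, 10749]"

# Parse the ISO 8601 string, then attach UTC so the datetime is timezone-aware
release_date = datetime.fromisoformat(raw_release_date).replace(tzinfo=timezone.utc)

# Decode the JSON array string into a Python list of integers
genre_ids = json.loads(raw_genre_ids)

print(release_date.isoformat())  # 1999-03-31T00:00:00+00:00
print(genre_ids)                 # [18, 35, 10749]
```

Note that .replace(tzinfo=timezone.utc) labels the value as UTC without shifting the clock time, which suits a source string that carries no offset of its own.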
Add objects to the batcher
Then we loop through the data and add each object to the batcher. The batch.add_object method is used to add the object to the batcher, and the batcher will periodically send the batch to Weaviate for ingestion.
movie_obj = {
    "title": movie["title"],
    "overview": movie["overview"],
    "vote_average": movie["vote_average"],
    "genre_ids": genre_ids,
    "release_date": release_date,
    "tmdb_id": movie["id"],
}

# Add object to batch queue
batch.add_object(
    properties=movie_obj,
    uuid=generate_uuid5(movie["id"])
)
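Passing a UUID derived from the source ID means the same movie maps to the same UUID every time the script runs. The sketch below shows that idea with Python's standard uuid.uuid5. It illustrates deterministic UUIDs in general, and the values it produces are not claimed to match those from generate_uuid5. The helper name toy_uuid_from_id and the sample IDs are hypothetical.

```python
import uuid


def toy_uuid_from_id(identifier) -> str:
    """Hypothetical helper: derive a stable UUID string from an identifier."""
    # uuid5 hashes a namespace plus a name, so equal inputs give equal UUIDs
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, str(identifier)))


first = toy_uuid_from_id(603)
second = toy_uuid_from_id(603)
other = toy_uuid_from_id(604)

print(first == second)  # True
print(first == other)   # False
```

A stable ID makes it possible to recognize the same source record across repeated runs of the import.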
Error handling
In a batch import, it's possible for some objects to fail for various reasons, such as validation errors. The batcher saves these errors.
You can inspect these errors, or otherwise handle them as needed. This example simply prints the number of objects that failed.
if len(movies.batch.failed_objects) > 0:
    print(f"Failed to import {len(movies.batch.failed_objects)} objects")

client.close()
Note that the list of errors is cleared when a new context manager is entered, so you must handle the errors before initializing a new batcher.
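Because the list of errors is cleared when a new context manager is entered, one approach is to copy it into your own structure right after the with block. The sketch below summarizes a list of failed-object records. It assumes each entry exposes a message attribute, which is our understanding of the Python client's error objects, and falls back to repr() otherwise. The summarize_failures helper and the stand-in records are hypothetical, used so the snippet runs without a Weaviate instance.

```python
from collections import Counter
from types import SimpleNamespace


def summarize_failures(failed_objects, max_shown=3):
    """Hypothetical helper: count failures by message and print the most common."""
    # Copy first, so the summary survives even if the source list is cleared later
    failures = list(failed_objects)
    messages = [str(getattr(f, "message", repr(f))) for f in failures]
    counts = Counter(messages)
    print(f"Failed to import {len(failures)} objects")
    for message, n in counts.most_common(max_shown):
        print(f"  {n} x {message}")
    return counts


# Stand-in records for illustration; in the lesson's script you would pass
# movies.batch.failed_objects instead.
fake_failures = [
    SimpleNamespace(message="invalid date"),
    SimpleNamespace(message="invalid date"),
    SimpleNamespace(message="vectorizer error"),
]

counts = summarize_failures(fake_failures)
```

Grouping by message gives a quick view of whether failures share one cause, such as a single malformed field, or are spread across several.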
Where do the vectors come from?
When the batcher sends the queue to Weaviate, the objects are added to the collection, in our case the Movies collection.
Recall that the collection is configured with a vector using an embedding model integration. As we do not specify vectors here, Weaviate uses the specified model integration to generate vector embeddings from the data.
Now that you have imported data into the collection, go to the next lesson to see how to query the data.