Setup

You'll need:

  • A Weaviate instance
    • Note: The example code uses Weaviate Cloud with Weaviate Embeddings, but any setup with a working vectorizer will do.
  • Python environment with weaviate-client installed
  • Sample movie dataset (we'll set this up below)
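
The connection code in this lesson reads WEAVIATE_URL and WEAVIATE_API_KEY from the environment. As a minimal sketch, assuming you want to confirm both are set before connecting, a small check like the following could help. The helper name missing_env_vars is hypothetical and is not part of weaviate-client.

```python
import os
from typing import Iterable, List, Mapping, Optional

# The two variable names used by the connection code in this lesson
REQUIRED_VARS = ("WEAVIATE_URL", "WEAVIATE_API_KEY")


def missing_env_vars(
    names: Iterable[str] = REQUIRED_VARS,
    env: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """Return the names that are unset or empty (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Set these environment variables first: " + ", ".join(missing))
    else:
        print("Connection settings found.")
```

Running this before the steps below surfaces a missing URL or API key early, rather than as a connection error.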

The steps below should be familiar. If not, refer to an earlier course such as Hands-on Weaviate with Python for a refresher.

Create a Movies collection

We'll use a dataset of movies for the rest of the lesson. Let's create a collection to store the movies.

import weaviate
from weaviate.classes.config import Configure, Property, DataType
import os

client = weaviate.connect_to_weaviate_cloud(
    cluster_url=os.getenv("WEAVIATE_URL"),  # Replace with your WCD URL
    auth_credentials=os.getenv("WEAVIATE_API_KEY")
)

client.collections.create(
    name="Movies",
    properties=[
        Property(name="title", data_type=DataType.TEXT),
        Property(name="overview", data_type=DataType.TEXT),
        Property(name="vote_average", data_type=DataType.NUMBER),
        Property(name="genre_ids", data_type=DataType.INT_ARRAY),
        Property(name="release_date", data_type=DataType.DATE),
        Property(name="tmdb_id", data_type=DataType.INT),
    ],
    # Define the vectorizer module
    vector_config=[
        Configure.Vectors.text2vec_weaviate(
            name="title",
            source_properties=["title"]
        ),
        Configure.Vectors.text2vec_weaviate(
            name="title_overview",
            source_properties=["title", "overview"]
        ),
    ]
)

client.close()

Ingest data

Ingest the data into the collection.

import weaviate
import pandas as pd
import requests
from datetime import datetime, timezone
import json
from weaviate.util import generate_uuid5
from tqdm import tqdm
import os

client = weaviate.connect_to_weaviate_cloud(
    cluster_url=os.getenv("WEAVIATE_URL"),  # Replace with your WCD URL
    auth_credentials=os.getenv("WEAVIATE_API_KEY")
)

data_url = "https://raw.githubusercontent.com/weaviate-tutorials/edu-datasets/main/movies_data_1990_2024.json"
resp = requests.get(data_url)
df = pd.DataFrame(resp.json())

# Get a handle to the Movies collection
movies = client.collections.use("Movies")

# Enter context manager; objects are sent in batches of 200
with movies.batch.fixed_size(batch_size=200) as batch:
    for i, movie in tqdm(df.iterrows(), total=len(df)):
        # Parse the date string and mark it as UTC
        release_date = datetime.fromisoformat(movie["release_date"]).replace(tzinfo=timezone.utc)
        # genre_ids is stored as a JSON-encoded list
        genre_ids = json.loads(movie["genre_ids"])
        movie_obj = {
            "title": movie["title"],
            "overview": movie["overview"],
            "vote_average": movie["vote_average"],
            "genre_ids": genre_ids,
            "release_date": release_date,
            "tmdb_id": movie["id"],
        }

        # Derive the UUID from the movie ID so each movie always gets the same UUID
        batch.add_object(
            properties=movie_obj,
            uuid=generate_uuid5(movie["id"])
        )

# Check for failed objects
if len(movies.batch.failed_objects) > 0:
    print(f"Failed to import {len(movies.batch.failed_objects)} objects")

client.close()
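
The per-row conversion in the loop above can be checked on its own, without a Weaviate connection. The following is a minimal sketch under stated assumptions: to_movie_properties is a hypothetical helper name, and sample_row is a made-up record shaped like the dataset rows, not real data.

```python
import json
from datetime import datetime, timezone


def to_movie_properties(row):
    """Convert one raw dataset row into the properties dict used for import.

    Hypothetical helper mirroring the loop body in the ingest script.
    """
    # Parse the ISO date string and attach UTC as the timezone
    release_date = datetime.fromisoformat(row["release_date"]).replace(tzinfo=timezone.utc)

    # Decode genre_ids when it arrives as a JSON-encoded string
    genre_ids = row["genre_ids"]
    if isinstance(genre_ids, str):
        genre_ids = json.loads(genre_ids)

    return {
        "title": row["title"],
        "overview": row["overview"],
        "vote_average": row["vote_average"],
        "genre_ids": genre_ids,
        "release_date": release_date,
        "tmdb_id": row["id"],
    }


# Made-up row for illustration only (not taken from the dataset)
sample_row = {
    "id": 1,
    "title": "Example Movie",
    "overview": "A placeholder overview.",
    "vote_average": 7.5,
    "genre_ids": "[28, 878]",
    "release_date": "1999-03-31",
}

props = to_movie_properties(sample_row)
print(props["genre_ids"])                 # [28, 878]
print(props["release_date"].isoformat())  # 1999-03-31T00:00:00+00:00
```

Keeping the conversion in a separate function makes it easier to spot malformed dates or genre lists before any objects are sent to the batch.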

Verify collection

Briefly verify the import by checking the collection size.

import weaviate
import os

client = weaviate.connect_to_weaviate_cloud(
    cluster_url=os.getenv("WEAVIATE_URL"),  # Replace with your WCD URL
    auth_credentials=os.getenv("WEAVIATE_API_KEY")
)

movies = client.collections.use("Movies")
print(len(movies))
client.close()

You should see 680 movies in the collection.

What's next?

Now that we have our data, let's compare how vector, keyword, and hybrid search handle the same queries to understand their different behaviors.