Weaviate Academy

You'll need:

- A Weaviate instance
  - Note: The example code uses Weaviate Cloud with Weaviate Embeddings, but any setup with a working vectorizer will work.
- A Python environment with weaviate-client installed
- A sample movie dataset (we'll set this up below)

The steps below should be familiar. If not, refer to an earlier course such as Hands-on Weaviate with Python for a refresher.

Create a Movies collection

We'll use a dataset of movies for the rest of the lesson. Let's create a collection to store the movies.

import weaviate
from weaviate.classes.config import Configure, Property, DataType
import os

client = weaviate.connect_to_weaviate_cloud(
    cluster_url=os.getenv("WEAVIATE_URL"),  # Weaviate Cloud cluster URL, read from the environment
    auth_credentials=os.getenv("WEAVIATE_API_KEY")  # Weaviate Cloud API key
)

client.collections.create(
    name="Movies",
    properties=[
        Property(name="title", data_type=DataType.TEXT),
        Property(name="overview", data_type=DataType.TEXT),
        Property(name="vote_average", data_type=DataType.NUMBER),
        Property(name="genre_ids", data_type=DataType.INT_ARRAY),
        Property(name="release_date", data_type=DataType.DATE),
        Property(name="tmdb_id", data_type=DataType.INT),
    ],
    # Define two named vectors, each embedding different source properties
    vector_config=[
        Configure.Vectors.text2vec_weaviate(
            name="title",
            source_properties=["title"]
        ),
        Configure.Vectors.text2vec_weaviate(
            name="title_overview",
            source_properties=["title", "overview"]
        ),
    ]
)

client.close()
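
The connection code above reads WEAVIATE_URL and WEAVIATE_API_KEY with os.getenv, which returns None when a variable is unset, so a missing value only shows up later as a connection error. As a minimal sketch, a small helper could fail early with a clearer message. Note that require_env is a hypothetical name used for illustration and is not part of weaviate-client.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty.

    Hypothetical helper for illustration; not part of weaviate-client.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

The values could then be passed as cluster_url=require_env("WEAVIATE_URL") and auth_credentials=require_env("WEAVIATE_API_KEY") in the connection call.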


Ingest data

Download the movie dataset and batch-import it into the Movies collection.

import weaviate
import pandas as pd
import requests
from datetime import datetime, timezone
import json
from weaviate.util import generate_uuid5
from tqdm import tqdm
import os

client = weaviate.connect_to_weaviate_cloud(
    cluster_url=os.getenv("WEAVIATE_URL"),  # Weaviate Cloud cluster URL, read from the environment
    auth_credentials=os.getenv("WEAVIATE_API_KEY")  # Weaviate Cloud API key
)

data_url = "https://raw.githubusercontent.com/weaviate-tutorials/edu-datasets/main/movies_data_1990_2024.json"
resp = requests.get(data_url)
df = pd.DataFrame(resp.json())

# Get a handle to the Movies collection
movies = client.collections.use("Movies")

# Batch import; the context manager sends any remaining objects on exit
with movies.batch.fixed_size(batch_size=200) as batch:
    for _, movie in tqdm(df.iterrows(), total=len(df)):
        release_date = datetime.fromisoformat(movie["release_date"]).replace(tzinfo=timezone.utc)
        genre_ids = json.loads(movie["genre_ids"])
        movie_obj = {
            "title": movie["title"],
            "overview": movie["overview"],
            "vote_average": movie["vote_average"],
            "genre_ids": genre_ids,
            "release_date": release_date,
            "tmdb_id": movie["id"],
        }

        batch.add_object(
            properties=movie_obj,
            uuid=generate_uuid5(movie["id"])
        )

# Check for failed objects
if len(movies.batch.failed_objects) > 0:
    print(f"Failed to import {len(movies.batch.failed_objects)} objects")

client.close()
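
The per-row conversion inside the import loop can be tried without a Weaviate connection. The sketch below pulls it into a function (to_movie_properties is a hypothetical name) and runs it on a made-up row that assumes the dataset's shape: release_date as an ISO date string and genre_ids as a JSON-encoded list. It also uses the standard library's uuid5 to illustrate why a deterministic ID keeps repeated imports from creating duplicates. This shows the idea only and is not guaranteed to produce the same values as generate_uuid5.

```python
import json
import uuid
from datetime import datetime, timezone


def to_movie_properties(row: dict) -> dict:
    """Convert one raw dataset row into the property dict used for import."""
    return {
        "title": row["title"],
        "overview": row["overview"],
        "vote_average": row["vote_average"],
        # genre_ids is assumed to arrive as a JSON string such as "[16, 35]"
        "genre_ids": json.loads(row["genre_ids"]),
        # Parse the date string and mark it as UTC (no clock shift is applied)
        "release_date": datetime.fromisoformat(row["release_date"]).replace(tzinfo=timezone.utc),
        "tmdb_id": row["id"],
    }


def deterministic_id(tmdb_id: int) -> str:
    """Stdlib illustration of a name-based (version 5) UUID: same input, same ID."""
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, str(tmdb_id)))


# Made-up sample row for illustration; not taken from the dataset
sample_row = {
    "id": 1,
    "title": "Example Movie",
    "overview": "A placeholder overview.",
    "vote_average": 7.5,
    "genre_ids": "[16, 35]",
    "release_date": "1995-10-30",
}

props = to_movie_properties(sample_row)
print(props["release_date"].isoformat())
print(props["genre_ids"])
print(deterministic_id(sample_row["id"]) == deterministic_id(sample_row["id"]))
```

Because the ID depends only on the input, importing the same movie twice should target the same object rather than create a second copy.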


Verify collection

Briefly verify the import by checking the collection size.

import weaviate
import os

client = weaviate.connect_to_weaviate_cloud(
    cluster_url=os.getenv("WEAVIATE_URL"),  # Weaviate Cloud cluster URL, read from the environment
    auth_credentials=os.getenv("WEAVIATE_API_KEY")  # Weaviate Cloud API key
)

movies = client.collections.use("Movies")
print(len(movies))
client.close()


You should see 680 movies in the collection.

What's next?

Now that we have our data, let's compare how vector, keyword, and hybrid search handle the same queries to understand their different behaviors.
