Local Model Inference
In many cases, it may be desirable or even required to perform AI model inference using a local (or on-premise) model.
This may be brought on by a variety of reasons. You may wish (or need) to keep your data local (e.g. for compliance or security reasons), or you may have a custom-trained proprietary model. Or the economics of local inference may simply be preferable to those of commercial inference APIs.
While there are arguably fewer options for local inference than those offered by inference providers, the range of choices is still quite wide. There is a huge number of publicly available models, as well as software libraries that make the process easier. In addition to general deep learning libraries such as PyTorch or TensorFlow, libraries such as Hugging Face Transformers, Ollama, and ONNX Runtime make local inference easier; Ollama and ONNX Runtime in particular can run models at reasonable speeds without any hardware (GPU / TPU) acceleration.
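To give a sense of what local inference looks like with one of these libraries, here is a minimal sketch using Hugging Face Transformers. (The model name distilgpt2 is just an illustrative small model, and the transformers and torch packages are assumed to be installed; this is not the workflow we will follow below.)
from transformers import pipeline
# Download (on first use) and run a small generative model locally, on CPU
generator = pipeline("text-generation", model="distilgpt2")
result = generator("Language models are", max_new_tokens=20)
print(result[0]["generated_text"])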
Let's take a look at examples of performing inference through Ollama.
Just like any other product, AI models often come with a particular license that details what you can and cannot do.
When it comes to publicly available models, keep in mind that not all of them allow commercial usage. Consult each model's license to evaluate for yourself whether it is suitable for your use case.
Preparation
For this section, we will use Ollama. Ollama is an open-source framework for running and deploying AI models locally. It provides an easy way to download, set up, and interact with a variety of open-source models like Llama, Mistral, and Snowflake embedding models. Ollama offers both a command-line interface and a REST API, as well as programming language-specific SDKs.
We will be performing local inference in this section. Even though we will use relatively small models, they still require significant system resources. We recommend using a modern computer with at least 16 GB of RAM.
A GPU is not required.
To use Ollama, go to the Ollama website (https://ollama.com) and follow the download and installation instructions.
Then, pull the required models. We will use the 1 billion parameter Gemma3 generative model, and the 110 million parameter Snowflake Arctic embedding model.
Once you have Ollama installed, pull the models with:
ollama pull gemma3:1b
ollama pull snowflake-arctic-embed:110m
Now, check that the models are loaded by running:
ollama list
The resulting output should include the gemma3:1b and snowflake-arctic-embed:110m models.
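Ollama runs a local server that, by default, exposes its REST API at localhost:11434. As a quick sanity check, you could send a request to it directly; below is a minimal sketch using only the Python standard library (it assumes the default port and that the gemma3:1b model has been pulled as above).
import json
import urllib.request
# Ollama's REST API listens on localhost:11434 by default
payload = json.dumps({
    "model": "gemma3:1b",
    "prompt": "Reply with a single short sentence.",
    "stream": False,  # Return the full response as one JSON object
}).encode("utf-8")
request = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as response:
    print(json.loads(response.read())["response"])
In the rest of this section, though, we will use the Python SDK rather than the raw REST API.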
Install the Ollama Python library in your preferred environment with your preferred package manager. For example:
pip install ollama
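To confirm that the library can reach the local Ollama server, you could print the models it reports as available (this should mirror the output of ollama list; the exact structure of the returned object may vary between library versions):
import ollama
# Should list the locally available models, mirroring `ollama list`
print(ollama.list())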
Embedding model usage
The following snippet will convert a series of text snippets (source_texts) into embeddings:
import ollama
source_texts = [
    "You're a wizard, Harry.",
    "Space, the final frontier.",
    "I'm going to make him an offer he can't refuse.",
]
response = ollama.embed(model='snowflake-arctic-embed:110m', input=source_texts)
source_embeddings = []
for e in response.embeddings:
    print(len(e))  # This will be the length of the embedding vector
    print(e[:5])  # This will print the first 5 elements of the embedding vector
    source_embeddings.append(e)  # Save the embedding for later use
This should output something like this (note the exact numbers may vary):
768
[-0.030614788, 0.01759585, -0.001181114, 0.025152, 0.005875709]
768
[-0.039889574, 0.05197108, 0.036466435, 0.012909834, 0.012069418]
768
[-0.04942698, 0.05466185, -0.007884168, -0.00252788, -0.0025294009]
For each vector, this prints its length (dimensionality) and the first few dimensions. (Note that the number of dimensions here differs from the Cohere example, as the two models produce embeddings of different dimensionality.)
Let's follow the same steps as before. To find the piece of text that best matches a query (let's say: intergalactic voyage), we first embed the query text:
# Get the query embedding:
query_text = "Intergalactic voyage"
response = ollama.embed(model='snowflake-arctic-embed:110m', input=query_text)
query_embedding = response.embeddings[0]
print(len(query_embedding))
print(query_embedding[:5])
Producing a result such as:
768
[-0.043455746, 0.05260946, 0.025877617, -0.017234074, 0.027434561]
Again, our query vector has the same dimensionality as the document vectors, and each dimension has a similar format.
To perform a vector search:
# Find the most similar source text to the query:
import numpy as np
# Calculate the dot product between the query embedding and each source embedding
dot_products = [np.dot(query_embedding, e) for e in source_embeddings]
# Find the index of the maximum dot product
most_similar_index = np.argmax(dot_products)
# Get the most similar source text
most_similar_text = source_texts[most_similar_index]
print(f"The most similar text to '{query_text}' is:")
print(most_similar_text)
Note that the snippet used to compare embeddings is identical to the one used in the Cohere example. This should produce the output:
The most similar text to 'Intergalactic voyage' is:
Space, the final frontier.
Happily for us, the Snowflake model also identified the same space-related passage as the closest one out of the candidates.
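As an aside, ranking by dot product takes both the direction and the magnitude of the vectors into account. If you are unsure whether a model produces unit-length embeddings, cosine similarity, which normalizes the vectors first, is a safe alternative. Here is a minimal sketch reusing the embeddings computed above:
import numpy as np
def cosine_similarity(a, b):
    # Normalize both vectors so that only their direction matters
    a, b = np.asarray(a), np.asarray(b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
similarities = [cosine_similarity(query_embedding, e) for e in source_embeddings]
print(source_texts[np.argmax(similarities)])  # Expected: the space-related passage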
Generative model usage
Now, let's try using a large language model with Ollama, in this case the gemma3:1b model. We will once again ask it to explain how a large language model works:
from ollama import chat
from ollama import ChatResponse
messages = [
    {
        "role": "user",
        "content": "Hi there. Please explain how language models work, in just a sentence or two.",
    }
]
response: ChatResponse = chat(model='gemma3:1b', messages=messages)
print(response.message.content)
The response may look something like this (note the exact output may vary):
Language models, like me, are trained on massive amounts of text data to predict the next word in a sequence, essentially learning patterns and relationships within language to generate text that seems coherent and relevant.
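Before moving on, note that the chat call can also take an options mapping to influence generation, for example a temperature value (lower values make the output more deterministic, higher values more varied). The exact set of supported options depends on your Ollama version; a minimal sketch:
from ollama import chat
# Lower temperature tends to produce more deterministic output
response = chat(
    model='gemma3:1b',
    messages=[{"role": "user", "content": "Explain language models in one sentence."}],
    options={"temperature": 0.2},
)
print(response.message.content)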
As before, we can perform a multi-turn conversation:
from ollama import chat
from ollama import ChatResponse
messages = [
    {
        "role": "user",
        "content": "Hi there. Please explain how language models work, in just a sentence or two.",
    }
]
# Initial response from the model
response: ChatResponse = chat(model='gemma3:1b', messages=messages)
# Append the initial response to the messages
messages.append(
    {
        "role": "assistant",
        "content": response.message.content,
    }
)
# Provide a follow-up prompt
messages.append(
    {
        "role": "user",
        "content": "Ah, I see. Now, can you write that in a Haiku?",
    }
)
response: ChatResponse = chat(model='gemma3:1b', messages=messages)
# This response will take both the initial and follow-up prompts into account
print(response.message.content)
The response looked like this:
Words flow, patterns bloom,
Digital mind learns to speak,
Meaning takes new form.
Although the specific response and syntax were different, the general workflow and principles were the same between using an inference provider and a local model.
You now know how to run AI models both through providers and locally. Next, we'll explore how to choose between these approaches and evaluate different models to find the best fit for your specific use case.