# Ollama Distribution

The `llamastack/distribution-ollama` distribution consists of the following provider configurations.
| API | Provider(s) |
|-----|-------------|
| agents | `inline::meta-reference` |
| datasetio | `remote::huggingface`, `inline::localfs` |
| eval | `inline::meta-reference` |
| files | `inline::localfs` |
| inference | `remote::ollama` |
| post_training | `inline::huggingface` |
| safety | `inline::llama-guard` |
| scoring | `inline::basic`, `inline::llm-as-judge`, `inline::braintrust` |
| telemetry | `inline::meta-reference` |
| tool_runtime | `remote::brave-search`, `remote::tavily-search`, `inline::rag-runtime`, `remote::model-context-protocol`, `remote::wolfram-alpha` |
| vector_io | `inline::faiss`, `remote::chromadb`, `remote::pgvector` |
## Environment Variables

The following environment variables can be configured; a short sketch showing how a client might read them follows the list:

- `LLAMA_STACK_PORT`: Port for the Llama Stack distribution server (default: `8321`)
- `OLLAMA_URL`: URL of the Ollama server (default: `http://127.0.0.1:11434`)
- `INFERENCE_MODEL`: Inference model loaded into the Ollama server (default: `meta-llama/Llama-3.2-3B-Instruct`)
- `SAFETY_MODEL`: Safety model loaded into the Ollama server (default: `meta-llama/Llama-Guard-3-1B`)
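
As a quick orientation, the sketch below shows one way client-side code might consume these variables, using the same defaults as the table above. The snippet and its variable names are illustrative, not part of the distribution itself.

```python
import os

# Defaults mirror the documented defaults for this distribution.
port = os.environ.get("LLAMA_STACK_PORT", "8321")
ollama_url = os.environ.get("OLLAMA_URL", "http://127.0.0.1:11434")
inference_model = os.environ.get("INFERENCE_MODEL", "meta-llama/Llama-3.2-3B-Instruct")

# The Llama Stack server listens on LLAMA_STACK_PORT; clients connect here,
# while the server itself talks to Ollama at OLLAMA_URL.
stack_base_url = f"http://localhost:{port}"
print(stack_base_url, ollama_url, inference_model)
```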
## Prerequisites

### Ollama Server

This distribution requires an external Ollama server to be running. You can install and run Ollama by following these steps:

1. **Install Ollama**: Download and install Ollama from https://ollama.ai/

2. **Start the Ollama server**:

   ```bash
   ollama serve
   ```

   By default, Ollama serves on `http://127.0.0.1:11434`.

3. **Pull the required models** (a quick verification sketch follows these steps):

   ```bash
   # Pull the inference model
   ollama pull meta-llama/Llama-3.2-3B-Instruct

   # Pull the embedding model
   ollama pull all-minilm:latest

   # (Optional) Pull the safety model for run-with-safety.yaml
   ollama pull meta-llama/Llama-Guard-3-1B
   ```
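
Before starting the stack, it can help to confirm that the Ollama server is reachable and the models are present. This is a minimal sketch assuming the standard Ollama REST API, where `GET /api/tags` lists locally pulled models; adjust `OLLAMA_URL` if your server runs elsewhere.

```python
import json
import os
from urllib.request import urlopen

# Ask the Ollama server which models have been pulled locally.
ollama_url = os.environ.get("OLLAMA_URL", "http://127.0.0.1:11434")
with urlopen(f"{ollama_url}/api/tags") as resp:
    tags = json.load(resp)

pulled = [m["name"] for m in tags.get("models", [])]
print("Models available on the Ollama server:", pulled)
```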
## Supported Services

### Inference: Ollama

Uses an external Ollama server for running LLM inference. The server should be accessible at the URL specified in the `OLLAMA_URL` environment variable.
### Vector IO: FAISS

Provides vector storage capabilities using FAISS for embeddings and similarity search operations.
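
As a rough illustration of how the FAISS provider is used from the client, the sketch below registers a vector database. The `vector_db_id` and embedding settings are assumptions for illustration; check the `llama-stack-client` documentation for the exact fields your version expects.

```python
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

# Register a FAISS-backed vector database. The vector_db_id is a
# hypothetical name; the embedding model/dimension are assumed to match
# the all-minilm model pulled in the prerequisites.
client.vector_dbs.register(
    vector_db_id="my-docs",
    embedding_model="all-MiniLM-L6-v2",
    embedding_dimension=384,
    provider_id="faiss",
)
```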
### Safety: Llama Guard (Optional)

When using the `run-with-safety.yaml` configuration, provides safety checks using Llama Guard models running on the Ollama server.
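
Below is a hedged sketch of invoking the shield from the client. The `shield_id` is an assumption (list the registered shields with `client.shields.list()` to confirm), and the call shape follows the `llama-stack-client` safety API.

```python
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

# Run the Llama Guard shield over a user message. The shield_id below is
# an assumption; confirm the actual identifier with client.shields.list().
result = client.safety.run_shield(
    shield_id="meta-llama/Llama-Guard-3-1B",
    messages=[{"role": "user", "content": "Tell me about your safety checks."}],
    params={},
)
print(result.violation)  # None when the message passes the safety check
```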
### Agents: Meta Reference

Provides agent execution capabilities using the meta-reference implementation.
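
For orientation, here is a sketch of a minimal agent loop using the client-side `Agent` helper. The constructor and method names have varied across `llama-stack-client` releases, so treat this as an assumption about a recent version rather than a definitive recipe.

```python
from llama_stack_client import LlamaStackClient
from llama_stack_client.lib.agents.agent import Agent

client = LlamaStackClient(base_url="http://localhost:8321")

# Create an agent bound to the inference model served by Ollama.
agent = Agent(
    client,
    model="meta-llama/Llama-3.2-3B-Instruct",
    instructions="You are a helpful assistant.",
)
session_id = agent.create_session("demo-session")

# A single non-streaming turn; older client versions may differ here.
turn = agent.create_turn(
    messages=[{"role": "user", "content": "What can you help me with?"}],
    session_id=session_id,
    stream=False,
)
print(turn.output_message.content)
```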
### Post-Training: Hugging Face

Supports model fine-tuning using Hugging Face integration.
### Tool Runtime

Supports various external tools (a short discovery sketch follows this list):

- Brave Search
- Tavily Search
- RAG Runtime
- Model Context Protocol
- Wolfram Alpha
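
To see which of these tool groups your running stack actually exposes, you can enumerate them from the client. The `toolgroups.list()` call and the `identifier` attribute are assumptions based on the `llama-stack-client` API; the remote search providers typically also require API keys to be configured.

```python
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

# Enumerate the registered tool groups (e.g. web search, RAG).
for group in client.toolgroups.list():
    print(group.identifier)
```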
## Running Llama Stack with Ollama

You can run the distribution via Conda or venv (building the distribution code locally), or via Docker, which provides a pre-built image.

### Via Docker

This method allows you to get started quickly without having to build the distribution code.

```bash
LLAMA_STACK_PORT=8321
docker run \
  -it \
  --pull always \
  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
  -v ./run.yaml:/root/my-run.yaml \
  llamastack/distribution-ollama \
  --config /root/my-run.yaml \
  --port $LLAMA_STACK_PORT \
  --env OLLAMA_URL=$OLLAMA_URL \
  --env INFERENCE_MODEL=$INFERENCE_MODEL
```

Note that inside the container, `127.0.0.1` refers to the container itself, so set `OLLAMA_URL` to an address the container can reach (for example, `http://host.docker.internal:11434` on macOS and Windows).
### Via Conda

```bash
llama stack build --template ollama --image-type conda
llama stack run ./run.yaml \
  --port 8321 \
  --env OLLAMA_URL=$OLLAMA_URL \
  --env INFERENCE_MODEL=$INFERENCE_MODEL
```
### Via venv

If you've set up your local development environment, you can also build the image using your local virtual environment.

```bash
llama stack build --template ollama --image-type venv
llama stack run ./run.yaml \
  --port 8321 \
  --env OLLAMA_URL=$OLLAMA_URL \
  --env INFERENCE_MODEL=$INFERENCE_MODEL
```
### Running with Safety

To enable safety checks, use the `run-with-safety.yaml` configuration:

```bash
llama stack run ./run-with-safety.yaml \
  --port 8321 \
  --env OLLAMA_URL=$OLLAMA_URL \
  --env INFERENCE_MODEL=$INFERENCE_MODEL \
  --env SAFETY_MODEL=$SAFETY_MODEL
```
## Example Usage

Once your Llama Stack server is running with Ollama, you can interact with it using the Llama Stack client:

```python
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

# Run inference
response = client.inference.chat_completion(
    model_id="meta-llama/Llama-3.2-3B-Instruct",
    messages=[{"role": "user", "content": "Hello, how are you?"}],
)
print(response.completion_message.content)
```
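
As a follow-up check, you can also confirm which models the stack has registered. This short sketch assumes the client's `models.list()` method.

```python
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

# List the models registered with the stack (inference and embedding models).
for model in client.models.list():
    print(model.identifier)
```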
## Troubleshooting

### Common Issues

- **Connection refused errors**: Ensure your Ollama server is running and accessible at the configured URL.
- **Model not found errors**: Make sure you've pulled the required models using `ollama pull <model-name>`.
- **Performance issues**: Consider using more powerful models or adjusting the Ollama server configuration for better performance.

### Logs

Check the Ollama server logs for any issues. Ollama logs are typically available in:

- macOS: `~/Library/Logs/Ollama/`
- Linux: `~/.ollama/logs/`