
# Ollama Distribution

The `llamastack/distribution-ollama` distribution consists of the following provider configurations.

| API | Provider(s) |
|-----|-------------|
| agents | `inline::meta-reference` |
| datasetio | `remote::huggingface`, `inline::localfs` |
| eval | `inline::meta-reference` |
| files | `inline::localfs` |
| inference | `remote::ollama` |
| post_training | `inline::huggingface` |
| safety | `inline::llama-guard` |
| scoring | `inline::basic`, `inline::llm-as-judge`, `inline::braintrust` |
| telemetry | `inline::meta-reference` |
| tool_runtime | `remote::brave-search`, `remote::tavily-search`, `inline::rag-runtime`, `remote::model-context-protocol`, `remote::wolfram-alpha` |
| vector_io | `inline::faiss`, `remote::chromadb`, `remote::pgvector` |

## Environment Variables

The following environment variables can be configured (an example `export` snippet follows the list):

- `LLAMA_STACK_PORT`: Port for the Llama Stack distribution server (default: `8321`)
- `OLLAMA_URL`: URL of the Ollama server (default: `http://127.0.0.1:11434`)
- `INFERENCE_MODEL`: Inference model loaded into the Ollama server (default: `meta-llama/Llama-3.2-3B-Instruct`)
- `SAFETY_MODEL`: Safety model loaded into the Ollama server (default: `meta-llama/Llama-Guard-3-1B`)
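
For a local setup, these can be exported in your shell before starting the server; the values below are simply the documented defaults:

```bash
# Values shown are the documented defaults; adjust for your environment
export LLAMA_STACK_PORT=8321
export OLLAMA_URL=http://127.0.0.1:11434
export INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct
export SAFETY_MODEL=meta-llama/Llama-Guard-3-1B  # only needed with run-with-safety.yaml
```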

## Prerequisites

### Ollama Server

This distribution requires an external Ollama server to be running. You can install and run Ollama by following these steps:

1. **Install Ollama**: Download and install Ollama from [https://ollama.ai/](https://ollama.ai/)

2. **Start the Ollama server**:

   ```bash
   ollama serve
   ```

   By default, Ollama serves on `http://127.0.0.1:11434` (a quick way to verify this is shown after these steps).

3. **Pull the required models**:

   ```bash
   # Pull the inference model
   ollama pull meta-llama/Llama-3.2-3B-Instruct

   # Pull the embedding model
   ollama pull all-minilm:latest

   # (Optional) Pull the safety model for run-with-safety.yaml
   ollama pull meta-llama/Llama-Guard-3-1B
   ```
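Once the server is running and the models have been pulled, you can check that everything is in place; `/api/tags` is Ollama's standard endpoint for listing locally available models:

```bash
# Confirm the Ollama server is reachable
curl http://127.0.0.1:11434/api/tags

# Confirm the expected models have been pulled
ollama list
```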

## Supported Services

### Inference: Ollama

Uses an external Ollama server for running LLM inference. The server should be accessible at the URL specified in the `OLLAMA_URL` environment variable.

### Vector IO: FAISS

Provides vector storage capabilities using FAISS for embeddings and similarity search operations.
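
As an illustrative sketch (not taken verbatim from this distribution's configuration), registering a FAISS-backed vector database through the client might look like the following; the embedding model name and dimension are assumptions matching the bundled `all-minilm` model, so verify them against your own setup:

```python
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

# Register a vector database backed by the FAISS provider.
# The embedding model/dimension below are assumptions matching all-minilm.
client.vector_dbs.register(
    vector_db_id="my_documents",
    embedding_model="all-MiniLM-L6-v2",
    embedding_dimension=384,
    provider_id="faiss",
)
```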

### Safety: Llama Guard (Optional)

When using the `run-with-safety.yaml` configuration, provides safety checks using Llama Guard models running on the Ollama server.

### Agents: Meta Reference

Provides agent execution capabilities using the meta-reference implementation.
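
As a rough sketch of driving an agent through the client (the high-level `Agent` helper, its import path, and its constructor arguments are assumptions about the installed `llama_stack_client` version; consult the client's own documentation for the authoritative API):

```python
from llama_stack_client import LlamaStackClient
from llama_stack_client.lib.agents.agent import Agent  # import path may vary by client version

client = LlamaStackClient(base_url="http://localhost:8321")

# Create a simple agent backed by the Ollama-served inference model
agent = Agent(
    client,
    model="meta-llama/Llama-3.2-3B-Instruct",
    instructions="You are a helpful assistant.",
)

session_id = agent.create_session("demo-session")
turn = agent.create_turn(
    messages=[{"role": "user", "content": "Summarize what Llama Stack does."}],
    session_id=session_id,
    stream=False,
)
print(turn.output_message.content)
```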

### Post-Training: Hugging Face

Supports model fine-tuning using Hugging Face integration.

### Tool Runtime

Supports various external tools, including the following (the remote search providers require API keys; see the note after the list):

- Brave Search
- Tavily Search
- RAG Runtime
- Model Context Protocol
- Wolfram Alpha
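
The remote search providers need API keys from their respective services. One way to supply them is as environment variables when starting the server (see the run commands in the next section); the exact variable names below are assumptions about how the template's `run.yaml` reads them, so check your generated configuration:

```bash
# Hypothetical variable names; verify against your run.yaml
export BRAVE_SEARCH_API_KEY=<your-brave-api-key>
export TAVILY_SEARCH_API_KEY=<your-tavily-api-key>
```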

## Running Llama Stack with Ollama

You can run the distribution via Docker, which provides a pre-built image, or via Conda or venv, which build the distribution code locally.

### Via Docker

This method allows you to get started quickly without having to build the distribution code.

```bash
LLAMA_STACK_PORT=8321
docker run \
  -it \
  --pull always \
  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
  -v ./run.yaml:/root/my-run.yaml \
  llamastack/distribution-ollama \
  --config /root/my-run.yaml \
  --port $LLAMA_STACK_PORT \
  --env OLLAMA_URL=$OLLAMA_URL \
  --env INFERENCE_MODEL=$INFERENCE_MODEL
```
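
Note that when Llama Stack runs inside Docker while Ollama runs directly on the host, `127.0.0.1` inside the container refers to the container itself, not the host. A common workaround (a general Docker pattern, not something specific to this distribution) is to point `OLLAMA_URL` at the host's address:

```bash
# Docker Desktop (macOS/Windows): host.docker.internal resolves to the host machine
export OLLAMA_URL=http://host.docker.internal:11434

# Linux alternative: run the container with host networking so 127.0.0.1 works as-is
# docker run --network=host ... llamastack/distribution-ollama ...
```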

### Via Conda

```bash
llama stack build --template ollama --image-type conda
llama stack run ./run.yaml \
  --port 8321 \
  --env OLLAMA_URL=$OLLAMA_URL \
  --env INFERENCE_MODEL=$INFERENCE_MODEL
```

### Via venv

If you've set up your local development environment, you can also build the image using your local virtual environment.

```bash
llama stack build --template ollama --image-type venv
llama stack run ./run.yaml \
  --port 8321 \
  --env OLLAMA_URL=$OLLAMA_URL \
  --env INFERENCE_MODEL=$INFERENCE_MODEL
```

### Running with Safety

To enable safety checks, use the `run-with-safety.yaml` configuration:

```bash
llama stack run ./run-with-safety.yaml \
  --port 8321 \
  --env OLLAMA_URL=$OLLAMA_URL \
  --env INFERENCE_MODEL=$INFERENCE_MODEL \
  --env SAFETY_MODEL=$SAFETY_MODEL
```
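
Once the server is up with safety enabled, a shield check can be run through the client. The sketch below assumes the shield is registered under the same identifier as the default `SAFETY_MODEL` value; adjust `shield_id` to whatever your run configuration actually registers:

```python
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

# Run the Llama Guard shield against a user message.
# shield_id is an assumption based on the default SAFETY_MODEL value.
result = client.safety.run_shield(
    shield_id="meta-llama/Llama-Guard-3-1B",
    messages=[{"role": "user", "content": "How do I pick a lock?"}],
    params={},
)
print(result.violation)  # None if the message passes the safety check
```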

## Example Usage

Once your Llama Stack server is running with Ollama, you can interact with it using the Llama Stack client:

```python
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

# Run inference
response = client.inference.chat_completion(
    model_id="meta-llama/Llama-3.2-3B-Instruct",
    messages=[{"role": "user", "content": "Hello, how are you?"}],
)
print(response.completion_message.content)
```
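
You can also confirm that the Ollama-served model has been registered with the stack. `client.models.list()` is part of the client API; the `identifier` attribute shown below is an assumption about the returned objects' field names:

```python
# List the models currently registered with the stack
for model in client.models.list():
    print(model.identifier)
```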

## Troubleshooting

### Common Issues

1. **Connection refused errors**: Ensure your Ollama server is running and accessible at the configured `OLLAMA_URL`.

2. **Model not found errors**: Make sure you've pulled the required models using `ollama pull <model-name>`.

3. **Performance issues**: Consider using more powerful models or adjusting the Ollama server configuration for better performance.

### Logs

Check the Ollama server logs for any issues:

- macOS: `~/Library/Logs/Ollama/`
- Linux: `~/.ollama/logs/`
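
On Linux installs where Ollama runs as a systemd service (the default when using the official install script), the server logs are also available through the journal:

```bash
# View recent logs for the Ollama systemd service
journalctl -e -u ollama
```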