# Ollama Distribution

The `llamastack/distribution-ollama` distribution consists of the following provider configurations.

| API | Provider(s) |
|-----|-------------|
| agents | `inline::meta-reference` |
| datasetio | `remote::huggingface`, `inline::localfs` |
| eval | `inline::meta-reference` |
| files | `inline::localfs` |
| inference | `remote::ollama` |
| post_training | `inline::huggingface` |
| safety | `inline::llama-guard` |
| scoring | `inline::basic`, `inline::llm-as-judge`, `inline::braintrust` |
| telemetry | `inline::meta-reference` |
| tool_runtime | `remote::brave-search`, `remote::tavily-search`, `inline::rag-runtime`, `remote::model-context-protocol`, `remote::wolfram-alpha` |
| vector_io | `inline::faiss`, `remote::chromadb`, `remote::pgvector` |

### Environment Variables

The following environment variables can be configured:

- `LLAMA_STACK_PORT`: Port for the Llama Stack distribution server (default: `8321`)
- `OLLAMA_URL`: URL of the Ollama server (default: `http://127.0.0.1:11434`)
- `INFERENCE_MODEL`: Inference model loaded into the Ollama server (default: `meta-llama/Llama-3.2-3B-Instruct`)
- `SAFETY_MODEL`: Safety model loaded into the Ollama server (default: `meta-llama/Llama-Guard-3-1B`)

## Prerequisites

### Ollama Server

This distribution requires an external Ollama server to be running. You can install and run Ollama by following these steps:

1. **Install Ollama**: Download and install Ollama from [https://ollama.ai/](https://ollama.ai/)

2. **Start the Ollama server**:

   ```bash
   ollama serve
   ```

   By default, Ollama serves on `http://127.0.0.1:11434`.

3. **Pull the required models**. Note that Ollama uses its own model tags, while Llama Stack refers to the same models by their Llama model IDs (the `INFERENCE_MODEL` and `SAFETY_MODEL` values above):

   ```bash
   # Pull the inference model (Ollama's tag for Llama-3.2-3B-Instruct)
   ollama pull llama3.2:3b-instruct-fp16

   # Pull the embedding model
   ollama pull all-minilm:latest

   # (Optional) Pull the safety model for run-with-safety.yaml
   # (Ollama's tag for Llama-Guard-3-1B)
   ollama pull llama-guard3:1b
   ```

## Supported Services

### Inference: Ollama

Uses an external Ollama server for LLM inference. The server must be reachable at the URL given by the `OLLAMA_URL` environment variable.

### Vector IO: FAISS

Provides vector storage using FAISS for embedding storage and similarity search.

### Safety: Llama Guard (Optional)

When using the `run-with-safety.yaml` configuration, provides safety checks with Llama Guard models running on the Ollama server.

### Agents: Meta Reference

Provides agent execution capabilities using the meta-reference implementation.

### Post-Training: Hugging Face

Supports model fine-tuning via the Hugging Face integration.

### Tool Runtime

Supports external tools including:

- Brave Search
- Tavily Search
- RAG Runtime
- Model Context Protocol
- Wolfram Alpha

## Running Llama Stack with Ollama

You can run the distribution via Conda or venv (building the distribution code locally), or via Docker, which uses a pre-built image.

### Via Docker

This method lets you get started quickly without having to build the distribution code.
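The commands in the sections below reference `OLLAMA_URL`, `INFERENCE_MODEL`, and `SAFETY_MODEL` without setting them first. Here is a minimal sketch of exporting them, using the defaults from the Environment Variables section above; adjust the values to your setup:

```bash
# Defaults taken from the Environment Variables section; adjust as needed.
export OLLAMA_URL=http://127.0.0.1:11434
export INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct
export SAFETY_MODEL=meta-llama/Llama-Guard-3-1B   # only needed for run-with-safety.yaml
```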
```bash
export LLAMA_STACK_PORT=8321
docker run \
  -it \
  --pull always \
  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
  -v ./run.yaml:/root/my-run.yaml \
  llamastack/distribution-ollama \
  --config /root/my-run.yaml \
  --port $LLAMA_STACK_PORT \
  --env OLLAMA_URL=$OLLAMA_URL \
  --env INFERENCE_MODEL=$INFERENCE_MODEL
```

When the server runs in Docker, `OLLAMA_URL` must point to an address reachable from inside the container: `http://127.0.0.1:11434` refers to the container itself, so use, for example, `http://host.docker.internal:11434` with Docker Desktop (macOS/Windows), or run the container with `--network host` on Linux.

### Via Conda

```bash
llama stack build --template ollama --image-type conda
llama stack run ./run.yaml \
  --port 8321 \
  --env OLLAMA_URL=$OLLAMA_URL \
  --env INFERENCE_MODEL=$INFERENCE_MODEL
```

### Via venv

If you've set up your local development environment, you can also build the image using your local virtual environment.

```bash
llama stack build --template ollama --image-type venv
llama stack run ./run.yaml \
  --port 8321 \
  --env OLLAMA_URL=$OLLAMA_URL \
  --env INFERENCE_MODEL=$INFERENCE_MODEL
```

### Running with Safety

To enable safety checks, use the `run-with-safety.yaml` configuration:

```bash
llama stack run ./run-with-safety.yaml \
  --port 8321 \
  --env OLLAMA_URL=$OLLAMA_URL \
  --env INFERENCE_MODEL=$INFERENCE_MODEL \
  --env SAFETY_MODEL=$SAFETY_MODEL
```

## Example Usage

Once your Llama Stack server is running with Ollama, you can interact with it using the Llama Stack client:

```python
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

# Run inference
response = client.inference.chat_completion(
    model_id="meta-llama/Llama-3.2-3B-Instruct",
    messages=[{"role": "user", "content": "Hello, how are you?"}],
)
print(response.completion_message.content)
```

## Troubleshooting

### Common Issues

1. **Connection refused errors**: Ensure the Ollama server is running and reachable at the configured `OLLAMA_URL`.
2. **Model not found errors**: Make sure you've pulled the required models with `ollama pull <model-name>`.
3. **Performance issues**: If inference is slow, consider using a smaller or more heavily quantized model, or adjust the Ollama server configuration.

### Logs

Check the Ollama server logs for any issues:

```bash
# Ollama logs are typically available in:
# - macOS: ~/.ollama/logs/server.log
# - Linux (systemd install): journalctl -u ollama
```
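As a quick connectivity check (a sketch that assumes the default `OLLAMA_URL`), you can query the Ollama HTTP API directly; if this request fails, Llama Stack will not be able to reach the server either:

```bash
# Lists the models the Ollama server has pulled; a connection error here means
# the server is not running or is not reachable at this address.
curl http://127.0.0.1:11434/api/tags
```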