# Llama Stack with llama.cpp

This template shows you how to run Llama Stack with [llama.cpp](https://github.com/ggerganov/llama.cpp) as the inference provider.

## Prerequisites

1. **Install llama.cpp**: Follow the installation instructions in the [llama.cpp repository](https://github.com/ggerganov/llama.cpp)
2. **Download a model**: Download a GGUF-format model file (e.g., from Hugging Face)

## Starting the llama.cpp Server

Before running Llama Stack, you need to start the llama.cpp server:

```bash
# Example: start the llama.cpp server with a model
./llama-server -m /path/to/your/model.gguf -c 4096 --host 0.0.0.0 --port 8080
```

Common llama.cpp server options:

- `-m`: Path to the GGUF model file
- `-c`: Context size (default: 512)
- `--host`: Host to bind to (default: 127.0.0.1)
- `--port`: Port to bind to (default: 8080)
- `-ngl`: Number of layers to offload to the GPU
- `--chat-template`: Chat template to use

## Environment Variables

Set these environment variables before running Llama Stack:

```bash
export LLAMACPP_URL=http://localhost:8080   # URL of your llama.cpp server (without the /v1 suffix)
export INFERENCE_MODEL=your-model-name      # Name/identifier for your model
export LLAMACPP_API_KEY=""                  # API key (leave empty for local servers)
```

## Running Llama Stack

```bash
llama stack run llamacpp
```

## Configuration

The template uses the following configuration:

- **Inference provider**: `remote::llamacpp` - connects to your llama.cpp server via its OpenAI-compatible API
- **Default URL**: `http://localhost:8080` (configurable via `LLAMACPP_URL`)
- **Vector store**: FAISS for local vector storage
- **Safety**: Llama Guard for content safety
- **Other providers**: Standard Meta reference implementations

## Model Support

This template works with any GGUF-format model supported by llama.cpp, including:

- Llama 2/3 models
- Code Llama models
- Other transformer-based models converted to GGUF format

## Troubleshooting

1. **Connection refused**: Make sure your llama.cpp server is running and accessible (see the verification sketch at the end of this guide)
2. **Model not found**: Verify the model path and that the GGUF file exists
3. **Out of memory**: Reduce the context size (`-c`) or use GPU offloading (`-ngl`)
4. **Slow inference**: Consider using GPU acceleration or quantized models

## Advanced Configuration

You can customize the llama.cpp server by adjusting its startup command. For production use, consider:

- Using GPU acceleration with the `-ngl` parameter
- Adjusting the batch size with the `-b` parameter
- Setting an appropriate context size with the `-c` parameter
- Using multiple threads with the `-t` parameter

For more llama.cpp server options, run:

```bash
./llama-server --help
```
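
## Verifying the llama.cpp Server

Before starting Llama Stack, it can be useful to confirm that the llama.cpp server is reachable and that its OpenAI-compatible endpoint responds. The sketch below is a rough check, assuming the server was started with the example command above and that the `LLAMACPP_URL` and `INFERENCE_MODEL` variables from the Environment Variables section are set; the `/health` and `/v1/chat/completions` paths are exposed by recent `llama-server` builds but may differ in older versions.

```bash
# Confirm the server is up (llama-server exposes a simple /health endpoint)
curl -sf "${LLAMACPP_URL}/health" && echo "llama.cpp server is healthy"

# Send a minimal request to the OpenAI-compatible chat completions endpoint
curl -s "${LLAMACPP_URL}/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"${INFERENCE_MODEL}\",
    \"messages\": [{\"role\": \"user\", \"content\": \"Say hello in one short sentence.\"}],
    \"max_tokens\": 32
  }"
```

If the second command returns a JSON response with generated text, the server is ready and Llama Stack's `remote::llamacpp` provider should be able to connect to it.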
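
## Putting It Together

As a recap, the steps above can be combined into a single startup sequence. This is only a sketch: the model path, model name, and ports are the placeholder values used in the earlier examples, and running the server in the background with `&` is a convenience for local experimentation rather than a production setup.

```bash
# 1. Start the llama.cpp server in the background
./llama-server -m /path/to/your/model.gguf -c 4096 --host 0.0.0.0 --port 8080 &

# 2. Point Llama Stack at the server
export LLAMACPP_URL=http://localhost:8080
export INFERENCE_MODEL=your-model-name
export LLAMACPP_API_KEY=""

# 3. Run the Llama Stack distribution
llama stack run llamacpp
```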