# Llama Stack with llama.cpp
This template demonstrates how to use Llama Stack with llama.cpp as the inference provider. Previously, using quantized models with Llama Stack was restricted; this is now fully supported through llama.cpp. You can use any .gguf model available on Hugging Face with this template.
## Prerequisites
- Install llama.cpp: Follow the installation instructions from the llama.cpp repository
- Download a model: Download a GGUF format model file (e.g., from Hugging Face; see the example below)
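As an example, you can fetch a GGUF file with the Hugging Face CLI. The repository and file names below are placeholders, not part of this template; substitute any GGUF model you want to serve:

```bash
# Install the Hugging Face CLI if it is not already available
pip install -U "huggingface_hub[cli]"

# Download a single GGUF file into ./models
# (repository and file names are examples only)
huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
  Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  --local-dir ./models
```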
## Starting llama.cpp Server
Before running Llama Stack, you need to start the llama.cpp server:
```bash
# Example: Start llama.cpp server with a model
./llama-server -m /path/to/your/YOUR_MODEL.gguf -c 4096 --host 0.0.0.0 --port 8080 --api-key YOUR_API_KEY --jinja -cb --alias llama-model
```
Common llama.cpp server options:
- `-m`: Path to the GGUF model file
- `-c`: Context size (default: 512)
- `--host`: Host to bind to (default: 127.0.0.1)
- `--port`: Port to bind to (default: 8080)
- `-ngl`: Number of layers to offload to GPU
- `--chat-template`: Chat template to use
- `--api-key`: API key to use for authentication
- `--alias`: Alias name for the model
- `--jinja`: Enable Jinja template support for tool calling in Llama Stack
- `-cb`: Enable continuous batching to improve throughput
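Once the server is running, you can confirm it is reachable before wiring up Llama Stack. The endpoints below are part of llama.cpp's built-in HTTP server; exact behavior (for example, whether `/health` requires the API key) can vary between llama.cpp versions:

```bash
# Basic liveness check
curl http://localhost:8080/health

# List the served model via the OpenAI-compatible API;
# pass the API key if the server was started with --api-key
curl -H "Authorization: Bearer YOUR_API_KEY" http://localhost:8080/v1/models
```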
## Environment Variables
Set these environment variables before running Llama Stack:
```bash
export LLAMACPP_URL=http://localhost:8080   # URL of your llama.cpp server (without /v1 suffix)
export INFERENCE_MODEL=llama-model          # Model name/identifier (the alias set above)
export LLAMACPP_API_KEY="YOUR_API_KEY"      # API key (leave empty for local servers without --api-key)
```
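If you prefer not to retype these, one option is to keep them in a small env file and source it before starting the stack (the file name here is arbitrary):

```bash
# Save the settings once...
cat > llamacpp.env <<'EOF'
export LLAMACPP_URL=http://localhost:8080
export INFERENCE_MODEL=llama-model
export LLAMACPP_API_KEY="YOUR_API_KEY"
EOF

# ...and load them in any shell before running Llama Stack
source llamacpp.env
```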
## Running Llama Stack
The model name will be your GGUF file name without the extension, or the alias if you set one with `--alias` when starting the server.
```bash
llama stack build --template llamacpp --image-type conda
llama stack run llamacpp --image-type conda
```
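Once the stack is up, you can exercise it with the `llama-stack-client` CLI. The commands below reflect recent llama-stack-client releases and assume the default stack port of 8321; adjust if your setup differs:

```bash
# Install the client CLI if needed
pip install llama-stack-client

# Point the client at the locally running stack
llama-stack-client configure --endpoint http://localhost:8321

# Verify the llama.cpp-backed model is registered, then send a test prompt
llama-stack-client models list
llama-stack-client inference chat-completion --message "Hello from llama.cpp!"
```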
## Configuration
The template uses the following configuration:
- Inference Provider: `remote::llamacpp` connects to your llama.cpp server via its OpenAI-compatible API
- Default URL: `http://localhost:8080` (configurable via `LLAMACPP_URL`; see the override example below)
- Vector Store: FAISS for local vector storage
- Safety: Llama Guard for content safety
- Other providers: Standard Meta reference implementations
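Because the provider reads its settings from environment variables, you can also override them per run instead of editing the template. A sketch, assuming your llama-stack version supports `--env` overrides on `llama stack run`:

```bash
# Point the stack at a llama.cpp server on another machine for this run only
llama stack run llamacpp --image-type conda \
  --env LLAMACPP_URL=http://192.168.1.50:8080 \
  --env INFERENCE_MODEL=llama-model
```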
## Model Support
This template works with any GGUF format model supported by llama.cpp, including:
- Llama 2/3 models
- Code Llama models
- Other transformer-based models converted to GGUF format (see the conversion sketch below)
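If the model you want is only published as Hugging Face safetensors, you can usually convert and quantize it yourself with the tools that ship with llama.cpp. The script and binary names below match recent llama.cpp checkouts and may differ in older versions:

```bash
# Convert a Hugging Face checkpoint to a full-precision GGUF file...
python convert_hf_to_gguf.py /path/to/hf-model --outfile my-model-f16.gguf

# ...then quantize it to a smaller format such as Q4_K_M
./llama-quantize my-model-f16.gguf my-model-Q4_K_M.gguf Q4_K_M
```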
## Troubleshooting
- Connection refused: Make sure your llama.cpp server is running and accessible
- Model not found: Verify the model path and that the GGUF file exists
- Out of memory: Reduce context size (`-c`) or use GPU offloading (`-ngl`)
- Slow inference: Consider using GPU acceleration or quantized models
## Advanced Configuration
You can customize the llama.cpp server configuration by modifying the server startup command. For production use, consider:
- Using GPU acceleration with the `-ngl` parameter
- Adjusting batch size with the `-b` parameter
- Setting an appropriate context size with the `-c` parameter
- Using multiple threads with the `-t` parameter
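For instance, a production-style launch combining these options might look like the following; the layer count, batch size, and thread count are illustrative and should be tuned to your hardware and model:

```bash
# Offload 35 layers to the GPU, 8K context, batch size 512, 8 CPU threads
./llama-server -m /path/to/your/YOUR_MODEL.gguf \
  -c 8192 -ngl 35 -b 512 -t 8 -cb \
  --host 0.0.0.0 --port 8080 \
  --api-key YOUR_API_KEY --jinja --alias llama-model
```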
For more llama.cpp server options, run:
```bash
./llama-server --help
```