# Llama Stack with llama.cpp
This template shows you how to run Llama Stack with llama.cpp as the inference provider.
## Prerequisites
- Install llama.cpp: Follow the installation instructions from the llama.cpp repository
- Download a model: Download a GGUF format model file (e.g., from Hugging Face)
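If you do not already have a GGUF file, one common route is the `huggingface-cli` tool from the `huggingface_hub` package. The repository and file names below are placeholders; substitute the model you actually want to serve:

```bash
# Placeholder example: download a quantized GGUF file from Hugging Face
# (requires: pip install -U "huggingface_hub[cli]")
huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
  Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf --local-dir ./models
```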
## Starting the llama.cpp Server
Before running Llama Stack, you need to start the llama.cpp server:
```bash
# Example: start the llama.cpp server with a model
./llama-server -m /path/to/your/YOUR_MODEL.gguf -c 4096 --host 0.0.0.0 --port 8080 --api-key YOUR_API_KEY --jinja -cb
```
Common llama.cpp server options:
- `-m`: Path to the GGUF model file
- `-c`: Context size (default: 512)
- `--host`: Host to bind to (default: 127.0.0.1)
- `--port`: Port to bind to (default: 8080)
- `-ngl`: Number of layers to offload to the GPU
- `--chat-template`: Chat template to use
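Before pointing Llama Stack at the server, it is worth confirming that it responds. A minimal check, assuming the host and port from the example above (the `/health` endpoint is exposed by recent llama.cpp builds and does not require the API key):

```bash
# Should return a small JSON status object once the model has finished loading
curl http://localhost:8080/health
```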
## Environment Variables
Set these environment variables before running Llama Stack:
```bash
export LLAMACPP_URL=http://localhost:8080   # URL of your llama.cpp server (without the /v1 suffix)
export INFERENCE_MODEL=your-model-name      # Model name: the GGUF file name without the .gguf extension
export LLAMACPP_API_KEY="YOUR_API_KEY"      # API key (leave empty if the server was started without --api-key)
```
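Optionally, verify that these variables point at a reachable server before starting the stack. This is a sketch assuming the server was started with an API key; note that `LLAMACPP_URL` carries no `/v1` suffix, so it is appended explicitly here:

```bash
# List the models the llama.cpp server exposes via its OpenAI-compatible API
curl -H "Authorization: Bearer ${LLAMACPP_API_KEY}" "${LLAMACPP_URL}/v1/models"
```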
## Running Llama Stack
The model name is your GGUF file name without the extension (for example, `YOUR_MODEL` for `YOUR_MODEL.gguf`).
```bash
llama stack build --template llamacpp --image-type conda
llama stack run llamacpp --image-type conda
```
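Once the stack is up, you can run a quick smoke test from another terminal. A sketch using the `llama-stack-client` CLI (`pip install llama-stack-client`), assuming the stack is serving on its default port 8321:

```bash
# Ask the stack's inference API for a single chat completion
llama-stack-client --endpoint http://localhost:8321 \
  inference chat-completion --message "Hello, which model are you?"
```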
## Configuration
The template uses the following configuration:
- Inference Provider: `remote::llamacpp`, which connects to your llama.cpp server via its OpenAI-compatible API
  - Default URL: `http://localhost:8080` (configurable via `LLAMACPP_URL`)
- Vector Store: FAISS for local vector storage
- Safety: Llama Guard for content safety
- Other providers: Standard Meta reference implementations
## Model Support
This template works with any GGUF format model supported by llama.cpp, including:
- Llama 2/3 models
- Code Llama models
- Other transformer-based models converted to GGUF format
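If your model only exists as Hugging Face safetensors, llama.cpp ships a conversion script and a quantization tool. A rough sketch, with illustrative paths and quantization type (script and binary names as found in current llama.cpp releases):

```bash
# Convert a Hugging Face checkpoint directory to GGUF, then quantize it
python convert_hf_to_gguf.py /path/to/hf-model --outfile model-f16.gguf
./llama-quantize model-f16.gguf model-Q4_K_M.gguf Q4_K_M
```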
## Troubleshooting
- Connection refused: Make sure your llama.cpp server is running and accessible
- Model not found: Verify the model path and that the GGUF file exists
- Out of memory: Reduce the context size (`-c`) or use GPU offloading (`-ngl`)
- Slow inference: Consider using GPU acceleration or quantized models
## Advanced Configuration
You can customize the llama.cpp server configuration by modifying the server startup command. For production use, consider:
- Using GPU acceleration with the `-ngl` parameter
- Adjusting the batch size with the `-b` parameter
- Setting an appropriate context size with the `-c` parameter
- Using multiple threads with the `-t` parameter
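For example, a production-leaning invocation might combine these options as follows; the numeric values are illustrative and should be tuned for your hardware and model:

```bash
# Larger context, aggressive GPU offload, explicit batch size and thread count
./llama-server -m /path/to/your/YOUR_MODEL.gguf \
  -c 8192 -ngl 99 -b 512 -t 8 \
  --host 0.0.0.0 --port 8080 --api-key YOUR_API_KEY --jinja -cb
```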
For more llama.cpp server options, run:
```bash
./llama-server --help
```