
Llama Stack with llama.cpp

This template shows you how to run Llama Stack with llama.cpp as the inference provider.

Prerequisites

  1. Install llama.cpp: Follow the installation instructions from the llama.cpp repository
  2. Download a model: Download a GGUF format model file (e.g., from Hugging Face); an example download command follows this list
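
If you use Hugging Face, a minimal sketch with the huggingface_hub CLI looks like this (the repository and file names below are only illustrative; substitute the model you actually want):

# Install the Hugging Face CLI and download a GGUF file (repo/file names are illustrative)
pip install -U "huggingface_hub[cli]"
huggingface-cli download TheBloke/Llama-2-7B-GGUF llama-2-7b.Q4_K_M.gguf --local-dir ./models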

Starting llama.cpp Server

Before running Llama Stack, you need to start the llama.cpp server:

# Example: Start llama.cpp server with a model
./llama-server -m /path/to/your/YOUR_MODEL.gguf -c 4096 --host 0.0.0.0 --port 8080 --api-key YOUR_API_KEY --jinja -cb

Common llama.cpp server options (a combined example follows the list):

  • -m: Path to the GGUF model file
  • -c: Context size (default: 512)
  • --host: Host to bind to (default: 127.0.0.1)
  • --port: Port to bind to (default: 8080)
  • -ngl: Number of layers to offload to GPU
  • --chat-template: Chat template to use
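
For example, to combine GPU offloading with a larger context, the server start might look like this (a sketch; the model path and layer count are illustrative and depend on your hardware):

# Example: offload 35 layers to the GPU and use an 8K context (values are illustrative)
./llama-server -m /path/to/your/YOUR_MODEL.gguf -c 8192 -ngl 35 --host 0.0.0.0 --port 8080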

Environment Variables

Set these environment variables before running Llama Stack:

export LLAMACPP_URL=http://localhost:8080   # URL of your llama.cpp server (without the /v1 suffix)
export INFERENCE_MODEL=your-model-name      # Your GGUF file name without the .gguf extension
export LLAMACPP_API_KEY="YOUR_API_KEY"      # API key, if the server was started with --api-key (leave unset otherwise)

Running Llama Stack

The model name is your GGUF file name without the extension.

llama stack build --template llamacpp --image-type conda
llama stack run llamacpp --image-type conda
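
Once the stack is running, a quick smoke test with the llama-stack-client CLI is a reasonable check (this assumes the CLI is installed and that the stack listens on the port shown in its startup logs; 8321 below is only an assumption):

# List the models registered with the stack (the port is an assumption; use the one from the startup logs)
llama-stack-client configure --endpoint http://localhost:8321
llama-stack-client models list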

Configuration

The template uses the following configuration:

  • Inference Provider: remote::llamacpp - Connects to your llama.cpp server via OpenAI-compatible API
  • Default URL: http://localhost:8080 (configurable via LLAMACPP_URL)
  • Vector Store: FAISS for local vector storage
  • Safety: Llama Guard for content safety
  • Other providers: Standard Meta reference implementations
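
If your llama.cpp server is not at the default URL, you can pass the environment variables through at launch. A sketch using the --env flag of llama stack run (the host and model name below are illustrative):

# Example: point the template at a llama.cpp server on another host (URL and model name are illustrative)
llama stack run llamacpp --image-type conda \
  --env LLAMACPP_URL=http://192.168.1.100:8080 \
  --env INFERENCE_MODEL=your-model-name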

Model Support

This template works with any GGUF format model supported by llama.cpp, including:

  • Llama 2/3 models
  • Code Llama models
  • Other transformer-based models converted to GGUF format

Troubleshooting

  1. Connection refused: Make sure your llama.cpp server is running and accessible
  2. Model not found: Verify the model path and that the GGUF file exists
  3. Out of memory: Reduce context size (-c) or use GPU offloading (-ngl)
  4. Slow inference: Consider using GPU acceleration or quantized models
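
For connection problems in particular, it helps to query the llama.cpp server directly before debugging the stack; the health and OpenAI-compatible models endpoints below are served by llama-server itself (adjust host, port, and API key to your setup):

# Check that the llama.cpp server is up and serving the expected model
curl http://localhost:8080/health
curl -H "Authorization: Bearer YOUR_API_KEY" http://localhost:8080/v1/models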

Advanced Configuration

You can customize the llama.cpp server configuration by modifying the server startup command. For production use, consider the following (an example command follows the list):

  • Using GPU acceleration with -ngl parameter
  • Adjusting batch size with -b parameter
  • Setting appropriate context size with -c parameter
  • Using multiple threads with -t parameter
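
Putting these together, a production-oriented startup command might look like the following (all values are illustrative and should be tuned for your model and hardware):

# Example: GPU offload, larger batch and context, explicit thread count (values are illustrative)
./llama-server -m /path/to/your/YOUR_MODEL.gguf \
  -ngl 99 -b 512 -c 8192 -t 8 \
  --host 0.0.0.0 --port 8080 --api-key YOUR_API_KEY --jinja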

For more llama.cpp server options, run:

./llama-server --help