# Llama Stack with llama.cpp
This template demonstrates how to use Llama Stack with llama.cpp as the inference provider. Previously, using quantized models with Llama Stack was restricted; this is now fully supported through llama.cpp. You can use any .gguf model available on Hugging Face with this template.
## Prerequisites
- Install llama.cpp: Follow the installation instructions from the llama.cpp repository
- Download a model: Download a GGUF format model file (e.g., from Hugging Face; see the example below)
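As an example, you can fetch a GGUF file with the Hugging Face CLI. The repository and file names below are placeholders, not part of this template; substitute any GGUF model you want to serve:

```bash
# Install the Hugging Face CLI if it is not already available
pip install -U "huggingface_hub[cli]"

# Download a single GGUF file into ./models
# (repository and file names are examples only)
huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
  Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  --local-dir ./models
```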
## Starting llama.cpp Server
Before running Llama Stack, you need to start the llama.cpp server:
```bash
# Example: Start llama.cpp server with a model
./llama-server -m /path/to/your/YOUR_MODEL.gguf -c 4096 --host 0.0.0.0 --port 8080 --api-key YOUR_API_KEY --jinja -cb --alias llama-model
```
Common llama.cpp server options:
- `-m`: Path to the GGUF model file
- `-c`: Context size (default: 512)
- `--host`: Host to bind to (default: 127.0.0.1)
- `--port`: Port to bind to (default: 8080)
- `-ngl`: Number of layers to offload to GPU
- `--chat-template`: Chat template to use
- `--api-key`: API key to use for authentication
- `--alias`: Alias name for the model
- `--jinja`: Enable Jinja template support for tool calling in Llama Stack
- `-cb`: Enable continuous batching to improve throughput
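Once the server is running, you can confirm it is reachable before wiring up Llama Stack. The endpoints below are part of llama.cpp's built-in HTTP server; exact behavior (for example, whether `/health` requires the API key) can vary between llama.cpp versions:

```bash
# Basic liveness check
curl http://localhost:8080/health

# List the served model via the OpenAI-compatible API;
# pass the API key if the server was started with --api-key
curl -H "Authorization: Bearer YOUR_API_KEY" http://localhost:8080/v1/models
```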
## Environment Variables
Set these environment variables before running Llama Stack:
```bash
export LLAMACPP_URL=http://localhost:8080   # URL of your llama.cpp server (without /v1 suffix)
export INFERENCE_MODEL=llama-model          # Model name/identifier (the alias set above)
export LLAMACPP_API_KEY="YOUR_API_KEY"      # API key (leave empty for local servers without --api-key)
```
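If you prefer not to retype these, one option is to keep them in a small env file and source it before starting the stack (the file name here is arbitrary):

```bash
# Save the settings once...
cat > llamacpp.env <<'EOF'
export LLAMACPP_URL=http://localhost:8080
export INFERENCE_MODEL=llama-model
export LLAMACPP_API_KEY="YOUR_API_KEY"
EOF

# ...and load them in any shell before running Llama Stack
source llamacpp.env
```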
## Running Llama Stack
The model name will be your GGUF file name without the extension, or the alias if you set one with `--alias` when starting the server.
```bash
llama stack build --template llamacpp --image-type conda
llama stack run llamacpp --image-type conda
```
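Once the stack is up, you can exercise it with the `llama-stack-client` CLI. The commands below reflect recent llama-stack-client releases and assume the default stack port of 8321; adjust if your setup differs:

```bash
# Install the client CLI if needed
pip install llama-stack-client

# Point the client at the locally running stack
llama-stack-client configure --endpoint http://localhost:8321

# Verify the llama.cpp-backed model is registered, then send a test prompt
llama-stack-client models list
llama-stack-client inference chat-completion --message "Hello from llama.cpp!"
```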
## Configuration
The template uses the following configuration:
- Inference Provider: `remote::llamacpp` connects to your llama.cpp server via its OpenAI-compatible API
- Default URL: `http://localhost:8080` (configurable via `LLAMACPP_URL`; see the override example below)
- Vector Store: FAISS for local vector storage
- Safety: Llama Guard for content safety
- Other providers: Standard Meta reference implementations
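Because the provider reads its settings from environment variables, you can also override them per run instead of editing the template. A sketch, assuming your llama-stack version supports `--env` overrides on `llama stack run`:

```bash
# Point the stack at a llama.cpp server on another machine for this run only
llama stack run llamacpp --image-type conda \
  --env LLAMACPP_URL=http://192.168.1.50:8080 \
  --env INFERENCE_MODEL=llama-model
```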
## Model Support
This template works with any GGUF format model supported by llama.cpp, including:
- Llama 2/3 models
- Code Llama models
- Other transformer-based models converted to GGUF format (see the conversion sketch below)
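If the model you want is only published as Hugging Face safetensors, you can usually convert and quantize it yourself with the tools that ship with llama.cpp. The script and binary names below match recent llama.cpp checkouts and may differ in older versions:

```bash
# Convert a Hugging Face checkpoint to a full-precision GGUF file...
python convert_hf_to_gguf.py /path/to/hf-model --outfile my-model-f16.gguf

# ...then quantize it to a smaller format such as Q4_K_M
./llama-quantize my-model-f16.gguf my-model-Q4_K_M.gguf Q4_K_M
```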
## Troubleshooting
- Connection refused: Make sure your llama.cpp server is running and accessible
- Model not found: Verify the model path and that the GGUF file exists
- Out of memory: Reduce context size (`-c`) or use GPU offloading (`-ngl`)
- Slow inference: Consider using GPU acceleration or quantized models
## Advanced Configuration
You can customize the llama.cpp server configuration by modifying the server startup command. For production use, consider:
- Using GPU acceleration with the `-ngl` parameter
- Adjusting batch size with the `-b` parameter
- Setting an appropriate context size with the `-c` parameter
- Using multiple threads with the `-t` parameter
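For instance, a production-style launch combining these options might look like the following; the layer count, batch size, and thread count are illustrative and should be tuned to your hardware and model:

```bash
# Offload 35 layers to the GPU, 8K context, batch size 512, 8 CPU threads
./llama-server -m /path/to/your/YOUR_MODEL.gguf \
  -c 8192 -ngl 35 -b 512 -t 8 -cb \
  --host 0.0.0.0 --port 8080 \
  --api-key YOUR_API_KEY --jinja --alias llama-model
```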
For more llama.cpp server options, run:
```bash
./llama-server --help
```