# Llama Stack with llama.cpp
This template shows you how to run Llama Stack with [llama.cpp](https://github.com/ggerganov/llama.cpp) as the inference provider.
## Prerequisites
1. **Install llama.cpp**: Follow the installation instructions from the [llama.cpp repository](https://github.com/ggerganov/llama.cpp)
2. **Download a model**: Download a GGUF format model file, e.g. from Hugging Face (see the example below)
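One way to get a model is to pull a single GGUF file with the `huggingface-cli` tool from the `huggingface_hub` package; the repository ID and filename below are placeholders, so substitute the model you actually want:
```bash
# Hypothetical repo ID and filename - substitute the GGUF model you want to use.
pip install -U huggingface_hub
huggingface-cli download <org>/<model-repo-GGUF> <model-file>.Q4_K_M.gguf \
  --local-dir ./models
```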
## Starting llama.cpp Server
Before running Llama Stack, you need to start the llama.cpp server:
```bash
# Example: Start llama.cpp server with a model
./llama-server -m /path/to/your/model.gguf -c 4096 --host 0.0.0.0 --port 8080
```
Common llama.cpp server options:
- `-m`: Path to the GGUF model file
- `-c`: Context size in tokens
- `--host`: Host to bind to (default: 127.0.0.1)
- `--port`: Port to bind to (default: 8080)
- `-ngl`: Number of layers to offload to GPU
- `--chat-template`: Chat template to use
## Environment Variables
Set these environment variables before running Llama Stack:
```bash
export LLAMACPP_URL=http://localhost:8080 # URL of your llama.cpp server (without /v1 suffix)
export INFERENCE_MODEL=your-model-name # Name/identifier for your model
export LLAMACPP_API_KEY="" # API key (leave empty for local servers)
```
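Before starting the stack, a quick sanity check is to confirm that the URL in `LLAMACPP_URL` answers requests. This sketch assumes your llama.cpp server exposes the usual OpenAI-compatible endpoints:
```bash
# Should return a JSON object listing the loaded model if the server is reachable.
curl "$LLAMACPP_URL/v1/models"
```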
## Running Llama Stack
```bash
llama stack run llamacpp
```
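Once the server is up, you can sanity-check it by listing the registered models. This is a sketch that assumes the default Llama Stack port of 8321; pass `--port` to `llama stack run` if you changed it:
```bash
# Expect to see the model you exported as INFERENCE_MODEL in the response.
curl http://localhost:8321/v1/models
```
If you have the optional `llama-stack-client` CLI installed, `llama-stack-client models list` should show the same model.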
## Configuration
The template uses the following configuration:
- **Inference Provider**: `remote::llamacpp` - Connects to your llama.cpp server via its OpenAI-compatible API (illustrated below)
- **Default URL**: `http://localhost:8080` (configurable via `LLAMACPP_URL`)
- **Vector Store**: FAISS for local vector storage
- **Safety**: Llama Guard for content safety
- **Other providers**: Standard Meta reference implementations
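To illustrate the OpenAI-compatible API that the `remote::llamacpp` provider talks to, here is a rough sketch of a direct chat request to the llama.cpp server itself, reusing the environment variables set above:
```bash
curl "$LLAMACPP_URL/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "'"$INFERENCE_MODEL"'",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64
      }'
```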
## Model Support
This template works with any GGUF format model supported by llama.cpp, including:
- Llama 2/3 models
- Code Llama models
- Other transformer-based models converted to GGUF format (see the conversion sketch below)
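If the model you want is only published as Hugging Face safetensors, a rough conversion sketch using the tooling shipped with llama.cpp looks like this (script and binary names can differ between llama.cpp versions, so check your checkout):
```bash
# From the root of your llama.cpp checkout: convert an HF model directory to GGUF...
python convert_hf_to_gguf.py /path/to/hf-model --outfile model-f16.gguf
# ...then optionally quantize it to reduce memory usage.
./llama-quantize model-f16.gguf model-Q4_K_M.gguf Q4_K_M
```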
## Troubleshooting
1. **Connection refused**: Make sure the llama.cpp server is running and reachable at the URL set in `LLAMACPP_URL` (see the checks below)
2. **Model not found**: Verify the model path and that the GGUF file exists
3. **Out of memory**: Reduce context size (`-c`) or use GPU offloading (`-ngl`)
4. **Slow inference**: Consider using GPU acceleration or quantized models
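For the connection problems above, a couple of quick checks can narrow things down. This is a sketch that assumes a recent `llama-server` build (which exposes a `/health` endpoint) and a Linux host:
```bash
# Does the llama.cpp server answer at all?
curl "$LLAMACPP_URL/health"

# Is anything listening on the expected port?
ss -tlnp | grep 8080
```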
## Advanced Configuration
You can customize the llama.cpp server configuration by modifying the server startup command. For production use, consider the following (an example command follows this list):
- Using GPU acceleration with `-ngl` parameter
- Adjusting batch size with `-b` parameter
- Setting appropriate context size with `-c` parameter
- Using multiple threads with `-t` parameter
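For example, a sketch of a more production-oriented startup command combining these flags (the exact values depend on your hardware and model, so treat them as placeholders):
```bash
# Larger context (-c), full GPU offload (-ngl), bigger batch (-b), more CPU threads (-t).
./llama-server -m /path/to/your/model.gguf \
  -c 8192 -ngl 99 -b 512 -t 8 \
  --host 0.0.0.0 --port 8080
```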
For more llama.cpp server options, run:
```bash
./llama-server --help
```