From f65a260cddc936d3d3dbb201635f83bdf4c641a2 Mon Sep 17 00:00:00 2001
From: Young Han
Date: Mon, 14 Jul 2025 13:21:33 -0700
Subject: [PATCH] feat: pre-commit results

---
 .../self_hosted_distro/llamacpp.md            | 86 +++++++++++++++++++
 docs/source/providers/inference/index.md      |  1 +
 .../providers/inference/remote_llamacpp.md    | 17 ++++
 3 files changed, 104 insertions(+)
 create mode 100644 docs/source/distributions/self_hosted_distro/llamacpp.md
 create mode 100644 docs/source/providers/inference/remote_llamacpp.md

diff --git a/docs/source/distributions/self_hosted_distro/llamacpp.md b/docs/source/distributions/self_hosted_distro/llamacpp.md
new file mode 100644
index 000000000..f3a5e3630
--- /dev/null
+++ b/docs/source/distributions/self_hosted_distro/llamacpp.md
@@ -0,0 +1,86 @@
+
+# Llama Stack with llama.cpp
+
+This template shows you how to run Llama Stack with [llama.cpp](https://github.com/ggerganov/llama.cpp) as the inference provider.
+
+## Prerequisites
+
+1. **Install llama.cpp**: Follow the installation instructions in the [llama.cpp repository](https://github.com/ggerganov/llama.cpp)
+2. **Download a model**: Download a GGUF-format model file (e.g., from Hugging Face)
+
+## Starting the llama.cpp Server
+
+Before running Llama Stack, you need to start the llama.cpp server:
+
+```bash
+# Example: start the llama.cpp server with a model
+./llama-server -m /path/to/your/YOUR_MODEL.gguf -c 4096 --host 0.0.0.0 --port 8080 --api-key YOUR_API_KEY --jinja -cb
+```
+
+Common llama.cpp server options:
+
+- `-m`: Path to the GGUF model file
+- `-c`: Context size (default: 512)
+- `--host`: Host to bind to (default: 127.0.0.1)
+- `--port`: Port to bind to (default: 8080)
+- `-ngl`: Number of layers to offload to the GPU
+- `--chat-template`: Chat template to use
+
+## Environment Variables
+
+Set these environment variables before running Llama Stack:
+
+```bash
+export LLAMACPP_URL=http://localhost:8080  # URL of your llama.cpp server (without the /v1 suffix)
+export INFERENCE_MODEL=your-model-name     # Model name/identifier without the .gguf extension
+export LLAMACPP_API_KEY="YOUR_API_KEY"     # API key (leave empty for local servers)
+```
+
+## Running Llama Stack
+
+The model name is your GGUF file name without the extension.
+
+```bash
+llama stack build --template llamacpp --image-type conda
+llama stack run llamacpp --image-type conda
+```
+
+## Configuration
+
+The template uses the following configuration:
+
+- **Inference Provider**: `remote::llamacpp` - Connects to your llama.cpp server via its OpenAI-compatible API
+- **Default URL**: `http://localhost:8080` (configurable via `LLAMACPP_URL`)
+- **Vector Store**: FAISS for local vector storage
+- **Safety**: Llama Guard for content safety
+- **Other providers**: Standard Meta reference implementations
+
+## Model Support
+
+This template works with any GGUF-format model supported by llama.cpp, including:
+
+- Llama 2/3 models
+- Code Llama models
+- Other transformer-based models converted to GGUF format
+
+## Troubleshooting
+
+1. **Connection refused**: Make sure your llama.cpp server is running and accessible
+2. **Model not found**: Verify the model path and that the GGUF file exists
+3. **Out of memory**: Reduce the context size (`-c`) or use GPU offloading (`-ngl`)
+4. **Slow inference**: Consider using GPU acceleration or quantized models
+
+## Advanced Configuration
+
+You can customize the llama.cpp server configuration by modifying the server startup command.
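+
+For example, a GPU-accelerated invocation might look like the sketch below (the layer count, thread count, and context size are illustrative assumptions; tune them for your model and hardware):
+
+```bash
+# Illustrative example (adjust values for your setup): offload 35 layers to the GPU,
+# use 8 CPU threads and an 8192-token context, and keep continuous batching enabled
+./llama-server -m /path/to/your/YOUR_MODEL.gguf -c 8192 -ngl 35 -t 8 --host 0.0.0.0 --port 8080 --api-key YOUR_API_KEY --jinja -cb
+```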
+
+For production use, consider:
+
+- Using GPU acceleration with the `-ngl` parameter
+- Adjusting the batch size with the `-b` parameter
+- Setting an appropriate context size with the `-c` parameter
+- Using multiple threads with the `-t` parameter
+
+For more llama.cpp server options, run:
+
+```bash
+./llama-server --help
+```
diff --git a/docs/source/providers/inference/index.md b/docs/source/providers/inference/index.md
index 05773efce..a163a9324 100644
--- a/docs/source/providers/inference/index.md
+++ b/docs/source/providers/inference/index.md
@@ -18,6 +18,7 @@ This section contains documentation for all available providers for the **infere
 - [remote::hf::endpoint](remote_hf_endpoint.md)
 - [remote::hf::serverless](remote_hf_serverless.md)
 - [remote::llama-openai-compat](remote_llama-openai-compat.md)
+- [remote::llamacpp](remote_llamacpp.md)
 - [remote::nvidia](remote_nvidia.md)
 - [remote::ollama](remote_ollama.md)
 - [remote::openai](remote_openai.md)
diff --git a/docs/source/providers/inference/remote_llamacpp.md b/docs/source/providers/inference/remote_llamacpp.md
new file mode 100644
index 000000000..291c96614
--- /dev/null
+++ b/docs/source/providers/inference/remote_llamacpp.md
@@ -0,0 +1,17 @@
+# remote::llamacpp
+
+## Configuration
+
+| Field | Type | Required | Default | Description |
+|-------|------|----------|---------|-------------|
+| `api_key` | `str \| None` | No | | The llama.cpp server API key (optional for local servers) |
+| `openai_compat_api_base` | `str` | No | http://localhost:8080/v1 | The URL of the llama.cpp server's OpenAI-compatible API |
+
+## Sample Configuration
+
+```yaml
+openai_compat_api_base: ${env.LLAMACPP_URL:http://localhost:8080}/v1
+api_key: ${env.LLAMACPP_API_KEY:}
+```
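+
+To verify that the configured endpoint is reachable, you can list the models exposed by the llama.cpp server's OpenAI-compatible API (an illustrative sanity check; adjust the URL and API key to match your configuration):
+
+```bash
+# Illustrative check: query the OpenAI-compatible /v1/models endpoint of the llama.cpp server
+curl -H "Authorization: Bearer $LLAMACPP_API_KEY" "${LLAMACPP_URL:-http://localhost:8080}/v1/models"
+```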