From f65a260cddc936d3d3dbb201635f83bdf4c641a2 Mon Sep 17 00:00:00 2001
From: Young Han
Date: Mon, 14 Jul 2025 13:21:33 -0700
Subject: [PATCH] feat: pre-commit results

---
 .../self_hosted_distro/llamacpp.md            | 86 +++++++++++++++++++
 docs/source/providers/inference/index.md      |  1 +
 .../providers/inference/remote_llamacpp.md    | 17 ++++
 3 files changed, 104 insertions(+)
 create mode 100644 docs/source/distributions/self_hosted_distro/llamacpp.md
 create mode 100644 docs/source/providers/inference/remote_llamacpp.md

diff --git a/docs/source/distributions/self_hosted_distro/llamacpp.md b/docs/source/distributions/self_hosted_distro/llamacpp.md
new file mode 100644
index 000000000..f3a5e3630
--- /dev/null
+++ b/docs/source/distributions/self_hosted_distro/llamacpp.md
@@ -0,0 +1,86 @@
+
+# Llama Stack with llama.cpp
+
+This template shows you how to run Llama Stack with [llama.cpp](https://github.com/ggerganov/llama.cpp) as the inference provider.
+
+## Prerequisites
+
+1. **Install llama.cpp**: Follow the installation instructions in the [llama.cpp repository](https://github.com/ggerganov/llama.cpp)
+2. **Download a model**: Download a GGUF-format model file (e.g., from Hugging Face)
+
+## Starting the llama.cpp Server
+
+Before running Llama Stack, you need to start the llama.cpp server:
+
+```bash
+# Example: start the llama.cpp server with a model
+./llama-server -m /path/to/your/YOUR_MODEL.gguf -c 4096 --host 0.0.0.0 --port 8080 --api-key YOUR_API_KEY --jinja -cb
+```
+
+Common llama.cpp server options:
+
+- `-m`: Path to the GGUF model file
+- `-c`: Context size (default: 512)
+- `--host`: Host to bind to (default: 127.0.0.1)
+- `--port`: Port to bind to (default: 8080)
+- `-ngl`: Number of layers to offload to the GPU
+- `--chat-template`: Chat template to use
+
+## Environment Variables
+
+Set these environment variables before running Llama Stack:
+
+```bash
+export LLAMACPP_URL=http://localhost:8080  # URL of your llama.cpp server (without the /v1 suffix)
+export INFERENCE_MODEL=your-model-name     # Model name/identifier without the .gguf extension
+export LLAMACPP_API_KEY="YOUR_API_KEY"     # API key (leave empty for local servers)
+```
+
+## Running Llama Stack
+
+The model name is your GGUF file name without the extension.
+
+```bash
+llama stack build --template llamacpp --image-type conda
+llama stack run llamacpp --image-type conda
+```
+
+## Configuration
+
+The template uses the following configuration:
+
+- **Inference Provider**: `remote::llamacpp` - Connects to your llama.cpp server via its OpenAI-compatible API
+- **Default URL**: `http://localhost:8080` (configurable via `LLAMACPP_URL`)
+- **Vector Store**: FAISS for local vector storage
+- **Safety**: Llama Guard for content safety
+- **Other providers**: Standard Meta reference implementations
+
+## Model Support
+
+This template works with any GGUF-format model supported by llama.cpp, including:
+
+- Llama 2/3 models
+- Code Llama models
+- Other transformer-based models converted to GGUF format
+
+## Troubleshooting
+
+1. **Connection refused**: Make sure your llama.cpp server is running and accessible
+2. **Model not found**: Verify the model path and that the GGUF file exists
+3. **Out of memory**: Reduce the context size (`-c`) or use GPU offloading (`-ngl`)
+4. **Slow inference**: Consider using GPU acceleration or quantized models
+
+## Advanced Configuration
+
+You can customize the llama.cpp server configuration by modifying the server startup command.
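+
+For example, a GPU-accelerated invocation might look like the sketch below (the layer count, thread count, and context size are illustrative assumptions; tune them for your model and hardware):
+
+```bash
+# Illustrative example (adjust values for your setup): offload 35 layers to the GPU,
+# use 8 CPU threads and an 8192-token context, and keep continuous batching enabled
+./llama-server -m /path/to/your/YOUR_MODEL.gguf -c 8192 -ngl 35 -t 8 --host 0.0.0.0 --port 8080 --api-key YOUR_API_KEY --jinja -cb
+```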
+
+For production use, consider:
+
+- Using GPU acceleration with the `-ngl` parameter
+- Adjusting the batch size with the `-b` parameter
+- Setting an appropriate context size with the `-c` parameter
+- Using multiple threads with the `-t` parameter
+
+For more llama.cpp server options, run:
+
+```bash
+./llama-server --help
+```
diff --git a/docs/source/providers/inference/index.md b/docs/source/providers/inference/index.md
index 05773efce..a163a9324 100644
--- a/docs/source/providers/inference/index.md
+++ b/docs/source/providers/inference/index.md
@@ -18,6 +18,7 @@ This section contains documentation for all available providers for the **infere
 - [remote::hf::endpoint](remote_hf_endpoint.md)
 - [remote::hf::serverless](remote_hf_serverless.md)
 - [remote::llama-openai-compat](remote_llama-openai-compat.md)
+- [remote::llamacpp](remote_llamacpp.md)
 - [remote::nvidia](remote_nvidia.md)
 - [remote::ollama](remote_ollama.md)
 - [remote::openai](remote_openai.md)
diff --git a/docs/source/providers/inference/remote_llamacpp.md b/docs/source/providers/inference/remote_llamacpp.md
new file mode 100644
index 000000000..291c96614
--- /dev/null
+++ b/docs/source/providers/inference/remote_llamacpp.md
@@ -0,0 +1,17 @@
+# remote::llamacpp
+
+## Configuration
+
+| Field | Type | Required | Default | Description |
+|-------|------|----------|---------|-------------|
+| `api_key` | `str \| None` | No | | The llama.cpp server API key (optional for local servers) |
+| `openai_compat_api_base` | `str` | No | http://localhost:8080/v1 | The URL of the llama.cpp server's OpenAI-compatible API |
+
+## Sample Configuration
+
+```yaml
+openai_compat_api_base: ${env.LLAMACPP_URL:http://localhost:8080}/v1
+api_key: ${env.LLAMACPP_API_KEY:}
+```
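+
+To verify that the configured endpoint is reachable, you can list the models exposed by the llama.cpp server's OpenAI-compatible API (an illustrative sanity check; adjust the URL and API key to match your configuration):
+
+```bash
+# Illustrative check: query the OpenAI-compatible /v1/models endpoint of the llama.cpp server
+curl -H "Authorization: Bearer $LLAMACPP_API_KEY" "${LLAMACPP_URL:-http://localhost:8080}/v1/models"
+```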