Mirror of https://github.com/meta-llama/llama-stack.git, synced 2025-12-23 03:12:25 +00:00
feat: pre-commit results
This commit is contained in:
parent a562d81825
commit f65a260cdd
3 changed files with 104 additions and 0 deletions
86 docs/source/distributions/self_hosted_distro/llamacpp.md Normal file

@@ -0,0 +1,86 @@
<!-- This file was auto-generated by distro_codegen.py, please edit source -->

# Llama Stack with llama.cpp

This template shows you how to run Llama Stack with [llama.cpp](https://github.com/ggerganov/llama.cpp) as the inference provider.

## Prerequisites

1. **Install llama.cpp**: Follow the installation instructions from the [llama.cpp repository](https://github.com/ggerganov/llama.cpp)
2. **Download a model**: Download a GGUF format model file (e.g., from Hugging Face)
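For step 2, one convenient way to fetch a GGUF file is the Hugging Face CLI. This is only a sketch: the repository ID and file name are placeholders for whichever model you actually want.

```bash
# Install the Hugging Face CLI (ships with the huggingface_hub package)
pip install -U huggingface_hub

# Download a single GGUF file; REPO_ID and MODEL_FILE.gguf are placeholders
huggingface-cli download REPO_ID MODEL_FILE.gguf --local-dir ./models
```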
## Starting llama.cpp Server

Before running Llama Stack, you need to start the llama.cpp server:

```bash
# Example: Start llama.cpp server with a model
./llama-server -m /path/to/your/YOUR_MODEL.gguf -c 4096 --host 0.0.0.0 --port 8080 --api-key YOUR_API_KEY --jinja -cb
```

Common llama.cpp server options:

- `-m`: Path to the GGUF model file
- `-c`: Context size (default: 512)
- `--host`: Host to bind to (default: 127.0.0.1)
- `--port`: Port to bind to (default: 8080)
- `-ngl`: Number of layers to offload to GPU
- `--chat-template`: Chat template to use
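Once the server is up, a quick reachability check looks like the following; exact endpoint names can differ between llama.cpp versions, so treat this as a sketch:

```bash
# Basic liveness check (endpoint availability varies by llama.cpp version)
curl http://localhost:8080/health

# List models via the OpenAI-compatible API; the Authorization header is only
# needed if you started the server with --api-key
curl -H "Authorization: Bearer YOUR_API_KEY" http://localhost:8080/v1/models
```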
## Environment Variables

Set these environment variables before running Llama Stack:

```bash
export LLAMACPP_URL=http://localhost:8080   # URL of your llama.cpp server (without the /v1 suffix)
export INFERENCE_MODEL=your-model-name      # Model name/identifier without the .gguf extension
export LLAMACPP_API_KEY="YOUR_API_KEY"      # API key (leave empty for local servers)
```

## Running Llama Stack

The model name will be your GGUF file name without the extension.
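For example, using an illustrative file name (not one shipped with this template): if the server was started with `-m ./models/my-model-Q4_K_M.gguf`, the matching identifier would be:

```bash
# The identifier is the GGUF file name minus the .gguf extension
# (my-model-Q4_K_M is a placeholder)
export INFERENCE_MODEL=my-model-Q4_K_M
```

Then build and run the stack: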
```bash
llama stack build --template llamacpp --image-type conda
llama stack run llamacpp --image-type conda
```
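Once the stack is running, you can sanity-check it from another terminal. This is a sketch that assumes the default Llama Stack port (8321) and the standard model-listing path; adjust the URL if your server listens elsewhere.

```bash
# List the models registered with the running stack
# (port 8321 is assumed to be the Llama Stack default)
curl http://localhost:8321/v1/models
```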
## Configuration

The template uses the following configuration:

- **Inference Provider**: `remote::llamacpp` - Connects to your llama.cpp server via OpenAI-compatible API
- **Default URL**: `http://localhost:8080` (configurable via `LLAMACPP_URL`)
- **Vector Store**: FAISS for local vector storage
- **Safety**: Llama Guard for content safety
- **Other providers**: Standard Meta reference implementations
## Model Support

This template works with any GGUF format model supported by llama.cpp, including:

- Llama 2/3 models
- Code Llama models
- Other transformer-based models converted to GGUF format
## Troubleshooting

1. **Connection refused**: Make sure your llama.cpp server is running and accessible
2. **Model not found**: Verify the model path and that the GGUF file exists
3. **Out of memory**: Reduce context size (`-c`) or use GPU offloading (`-ngl`); see the example after this list
4. **Slow inference**: Consider using GPU acceleration or quantized models
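For the out-of-memory case above, a hedged example of trading context size for GPU offload; the context size and layer count are placeholders to tune for your hardware:

```bash
# Smaller context plus partial GPU offload; -c 2048 and -ngl 35 are placeholders
./llama-server -m /path/to/your/YOUR_MODEL.gguf -c 2048 -ngl 35 --host 0.0.0.0 --port 8080
```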
## Advanced Configuration

You can customize the llama.cpp server configuration by modifying the server startup command. For production use, consider:

- Using GPU acceleration with the `-ngl` parameter
- Adjusting batch size with the `-b` parameter
- Setting an appropriate context size with the `-c` parameter
- Using multiple threads with the `-t` parameter
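A sketch that combines these options in one startup command; every number is a placeholder to adjust for your machine, not a recommendation:

```bash
# GPU offload, batch size, context size, and thread count combined;
# all values are placeholders
./llama-server -m /path/to/your/YOUR_MODEL.gguf -ngl 99 -b 512 -c 8192 -t 8 --host 0.0.0.0 --port 8080
```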
For more llama.cpp server options, run:

```bash
./llama-server --help
```
@@ -18,6 +18,7 @@ This section contains documentation for all available providers for the **inference** API.

- [remote::hf::endpoint](remote_hf_endpoint.md)
- [remote::hf::serverless](remote_hf_serverless.md)
- [remote::llama-openai-compat](remote_llama-openai-compat.md)
- [remote::llamacpp](remote_llamacpp.md)
- [remote::nvidia](remote_nvidia.md)
- [remote::ollama](remote_ollama.md)
- [remote::openai](remote_openai.md)
17 docs/source/providers/inference/remote_llamacpp.md Normal file

@@ -0,0 +1,17 @@
# remote::llamacpp

## Configuration

| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `api_key` | `str \| None` | No | | The llama.cpp server API key (optional for local servers) |
| `openai_compat_api_base` | `str` | No | http://localhost:8080/v1 | The URL for the llama.cpp server with OpenAI-compatible API |
## Sample Configuration

```yaml
openai_compat_api_base: ${env.LLAMACPP_URL:http://localhost:8080}/v1
api_key: ${env.LLAMACPP_API_KEY:}
```
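In this sample, the value after the colon in `${env.VAR:...}` acts as the fallback when the variable is unset, so pointing the provider at a different llama.cpp server is just a matter of exporting the variables before starting the stack. The address below is purely illustrative:

```bash
# Point the provider at a non-local llama.cpp server (illustrative address)
export LLAMACPP_URL=http://192.168.1.50:8080
export LLAMACPP_API_KEY="YOUR_API_KEY"
llama stack run llamacpp --image-type conda
```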