mirror of https://github.com/meta-llama/llama-stack.git, synced 2025-12-23 02:59:40 +00:00

docs: add detailed explanation

This commit is contained in:
parent ccef7e325d
commit 723a870171

2 changed files with 20 additions and 6 deletions
@@ -1,7 +1,10 @@
 <!-- This file was auto-generated by distro_codegen.py, please edit source -->
 # Llama Stack with llama.cpp

-This template shows you how to run Llama Stack with [llama.cpp](https://github.com/ggerganov/llama.cpp) as the inference provider.
+This template demonstrates how to use Llama Stack with [llama.cpp](https://github.com/ggerganov/llama.cpp) as the inference provider.
+Previously, support for quantized models in Llama Stack was limited; they are now fully supported through llama.cpp.
+You can use any .gguf model available on [Hugging Face](https://huggingface.co/models) with this template.


 ## Prerequisites
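For readers following along, the .gguf files mentioned in the new intro text can be pulled from Hugging Face ahead of time. A minimal sketch using huggingface-cli; the repository and filename below are placeholders, not names taken from this commit:

```bash
# Install the Hugging Face CLI and fetch a quantized .gguf file
# (SOME_ORG/SOME_MODEL-GGUF and the filename are illustrative placeholders)
pip install -U "huggingface_hub[cli]"
huggingface-cli download SOME_ORG/SOME_MODEL-GGUF SOME_MODEL-Q4_K_M.gguf \
  --local-dir ./models
# The file ends up under ./models/ and can be passed to llama-server with -m
```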
@@ -14,7 +17,7 @@ Before running Llama Stack, you need to start the llama.cpp server:

 ```bash
 # Example: Start llama.cpp server with a model
-./llama-server -m /path/to/your/YOUR_MODEL.gguf -c 4096 --host 0.0.0.0 --port 8080 --api-key YOUR_API_KEY --jinja -cb
+./llama-server -m /path/to/your/YOUR_MODEL.gguf -c 4096 --host 0.0.0.0 --port 8080 --api-key YOUR_API_KEY --jinja -cb --alias llama-model
 ```

 Common llama.cpp server options:
@@ -25,6 +28,10 @@ Common llama.cpp server options:
 - `--port`: Port to bind to (default: 8080)
 - `-ngl`: Number of layers to offload to GPU
 - `--chat-template`: Chat template to use
+- `--api-key`: API key to use for authentication
+- `--alias`: Alias name for the model
+- `--jinja`: Enable Jinja template support for tool calling in llama-stack
+- `-cb`: Enable continuous batching to improve throughput

 ## Environment Variables
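To make the -ngl option listed above concrete, here is a GPU-offload variant of the start command from the Prerequisites section. A sketch only, assuming a GPU-enabled (CUDA, Metal, or HIP) build of llama.cpp; -ngl 99 simply requests more layers than typical models have, i.e. offload everything:

```bash
# Same example server, but with all layers offloaded to the GPU
./llama-server -m /path/to/your/YOUR_MODEL.gguf -c 4096 -ngl 99 \
  --host 0.0.0.0 --port 8080 --api-key YOUR_API_KEY --jinja -cb --alias llama-model
```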
@@ -32,7 +39,7 @@ Set these environment variables before running Llama Stack:

 ```bash
 export LLAMACPP_URL=http://localhost:8080  # URL of your llama.cpp server (without /v1 suffix)
-export INFERENCE_MODEL=your-model-name     # Name/identifier without gguf extension
+export INFERENCE_MODEL=llama-model         # Aliased name/identifier (set with --alias)
 export LLAMACPP_API_KEY="YOUR_API_KEY"     # API key (leave empty for local servers)
 ```
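With those three variables exported, the aliased model can be exercised directly against llama.cpp's OpenAI-compatible endpoint before Llama Stack is started. A sanity-check sketch, assuming the server from the Prerequisites section is still running:

```bash
export LLAMACPP_URL=http://localhost:8080
export INFERENCE_MODEL=llama-model
export LLAMACPP_API_KEY="YOUR_API_KEY"

# The aliased model should answer an OpenAI-style chat completion request
curl -s "$LLAMACPP_URL/v1/chat/completions" \
  -H "Authorization: Bearer $LLAMACPP_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "'"$INFERENCE_MODEL"'", "messages": [{"role": "user", "content": "Hello"}]}'
```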
The second changed file, the hand-edited source that the auto-generated copy above points to, receives the same changes:

@@ -1,6 +1,9 @@
 # Llama Stack with llama.cpp

-This template shows you how to run Llama Stack with [llama.cpp](https://github.com/ggerganov/llama.cpp) as the inference provider.
+This template demonstrates how to use Llama Stack with [llama.cpp](https://github.com/ggerganov/llama.cpp) as the inference provider.
+Previously, support for quantized models in Llama Stack was limited; they are now fully supported through llama.cpp.
+You can use any .gguf model available on [Hugging Face](https://huggingface.co/models) with this template.


 ## Prerequisites
@@ -13,7 +16,7 @@ Before running Llama Stack, you need to start the llama.cpp server:

 ```bash
 # Example: Start llama.cpp server with a model
-./llama-server -m /path/to/your/YOUR_MODEL.gguf -c 4096 --host 0.0.0.0 --port 8080 --api-key YOUR_API_KEY --jinja -cb
+./llama-server -m /path/to/your/YOUR_MODEL.gguf -c 4096 --host 0.0.0.0 --port 8080 --api-key YOUR_API_KEY --jinja -cb --alias llama-model
 ```

 Common llama.cpp server options:
@@ -24,6 +27,10 @@ Common llama.cpp server options:
 - `--port`: Port to bind to (default: 8080)
 - `-ngl`: Number of layers to offload to GPU
 - `--chat-template`: Chat template to use
+- `--api-key`: API key to use for authentication
+- `--alias`: Alias name for the model
+- `--jinja`: Enable Jinja template support for tool calling in llama-stack
+- `-cb`: Enable continuous batching to improve throughput

 ## Environment Variables
@@ -31,7 +38,7 @@ Set these environment variables before running Llama Stack:

 ```bash
 export LLAMACPP_URL=http://localhost:8080  # URL of your llama.cpp server (without /v1 suffix)
-export INFERENCE_MODEL=your-model-name     # Name/identifier without gguf extension
+export INFERENCE_MODEL=llama-model         # Aliased name/identifier (set with --alias)
 export LLAMACPP_API_KEY="YOUR_API_KEY"     # API key (leave empty for local servers)
 ```
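Putting the pieces together: once the variables from the last hunk are set, Llama Stack itself can be pointed at the llama.cpp server. A hedged sketch of what that could look like; the distribution name "llamacpp" is an assumption, not something this commit specifies, so check the actual template name in your checkout:

```bash
# Sketch: run a Llama Stack distribution against the llama.cpp server
# ("llamacpp" is an assumed template name; adjust to the name in your tree)
llama stack run llamacpp \
  --port 8321 \
  --env LLAMACPP_URL=$LLAMACPP_URL \
  --env INFERENCE_MODEL=$INFERENCE_MODEL \
  --env LLAMACPP_API_KEY=$LLAMACPP_API_KEY
```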