Mirror of https://github.com/meta-llama/llama-stack.git (synced 2025-12-28 01:41:59 +00:00)

Merge branch 'meta-llama:main' into feat/litellm_sambanova_usage

Commit 13c660f5a5: 57 changed files with 10986 additions and 93 deletions
docs/_static/js/detect_theme.js (vendored, new file, 9 lines)
@@ -0,0 +1,9 @@
document.addEventListener("DOMContentLoaded", function () {
  const prefersDark = window.matchMedia("(prefers-color-scheme: dark)").matches;
  const htmlElement = document.documentElement;
  if (prefersDark) {
    htmlElement.setAttribute("data-theme", "dark");
  } else {
    htmlElement.setAttribute("data-theme", "light");
  }
});

@@ -112,6 +112,8 @@ html_theme_options = {
    # "style_nav_header_background": "#c3c9d4",
}

default_dark_mode = False

html_static_path = ["../_static"]
# html_logo = "../_static/llama-stack-logo.png"
# html_style = "../_static/css/my_theme.css"
@@ -119,6 +121,7 @@ html_static_path = ["../_static"]

def setup(app):
    app.add_css_file("css/my_theme.css")
    app.add_js_file("js/detect_theme.js")

def dockerhub_role(name, rawtext, text, lineno, inliner, options={}, content=[]):
    url = f"https://hub.docker.com/r/llamastack/{text}"
@@ -7,13 +7,13 @@ In this guide, we'll use a local [Kind](https://kind.sigs.k8s.io/) cluster and a

First, create a local Kubernetes cluster via Kind:

-```bash
+```
kind create cluster --image kindest/node:v1.32.0 --name llama-stack-test
```

Next, create a Kubernetes PVC and Secret for downloading and storing the Hugging Face model:

-```bash
+```
cat <<EOF |kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
@@ -39,7 +39,7 @@ data:

Next, start the vLLM server as a Kubernetes Deployment and Service:

-```bash
+```
cat <<EOF |kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
@@ -95,7 +95,7 @@ EOF

We can verify that the vLLM server has started successfully via the logs (this might take a couple of minutes to download the model):

-```bash
+```
$ kubectl logs -l app.kubernetes.io/name=vllm
...
INFO:     Started server process [1]
@@ -119,7 +119,7 @@ providers:

Once we have defined the run configuration for Llama Stack, we can build an image with that configuration and the server source code:

-```bash
+```
cat >/tmp/test-vllm-llama-stack/Containerfile.llama-stack-run-k8s <<EOF
FROM distribution-myenv:dev

@@ -135,7 +135,7 @@ podman build -f /tmp/test-vllm-llama-stack/Containerfile.llama-stack-run-k8s -t

We can then start the Llama Stack server by deploying a Kubernetes Pod and Service:

-```bash
+```
cat <<EOF |kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
@@ -195,7 +195,7 @@ EOF
### Verifying the Deployment
We can check that the Llama Stack server has started:

-```bash
+```
$ kubectl logs -l app.kubernetes.io/name=llama-stack
...
INFO:     Started server process [1]
@@ -207,7 +207,7 @@ INFO:     Uvicorn running on http://['::', '0.0.0.0']:5000 (Press CTRL+C to quit

Finally, we forward the Kubernetes service to a local port and test some inference requests against it via the Llama Stack Client:

-```bash
+```
kubectl port-forward service/llama-stack-service 5000:5000
llama-stack-client --endpoint http://localhost:5000 inference chat-completion --message "hello, what model are you?"
```
@@ -25,7 +25,7 @@ The `llamastack/distribution-remote-vllm` distribution consists of the following
| vector_io | `inline::faiss`, `remote::chromadb`, `remote::pgvector` |


-You can use this distribution if you have GPUs and want to run an independent vLLM server container for running inference.
+You can use this distribution if you want to run an independent vLLM server for inference.

### Environment Variables
@@ -41,6 +41,83 @@ The following environment variables can be configured:

## Setting up vLLM server

In the following sections, we'll use either AMD or NVIDIA GPUs to serve as hardware accelerators for the vLLM
server, which acts as both the LLM inference provider and the safety provider. Note that vLLM also
[supports many other hardware accelerators](https://docs.vllm.ai/en/latest/getting_started/installation.html) and
that we only use GPUs here for demonstration purposes.

### Setting up vLLM server on AMD GPU

AMD provides two main vLLM container options:
- rocm/vllm: Production-ready container
- rocm/vllm-dev: Development container with the latest vLLM features

Please check the [Blog about ROCm vLLM Usage](https://rocm.blogs.amd.com/software-tools-optimization/vllm-container/README.html) for more details.

Here is a sample script to start a ROCm vLLM server locally via Docker:

```bash
export INFERENCE_PORT=8000
export INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct
export CUDA_VISIBLE_DEVICES=0
export VLLM_DIMG="rocm/vllm-dev:main"

docker run \
    --pull always \
    --ipc=host \
    --privileged \
    --shm-size 16g \
    --device=/dev/kfd \
    --device=/dev/dri \
    --group-add video \
    --cap-add=SYS_PTRACE \
    --cap-add=CAP_SYS_ADMIN \
    --security-opt seccomp=unconfined \
    --security-opt apparmor=unconfined \
    --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
    --env "HIP_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES" \
    -p $INFERENCE_PORT:$INFERENCE_PORT \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    $VLLM_DIMG \
    python -m vllm.entrypoints.openai.api_server \
    --model $INFERENCE_MODEL \
    --port $INFERENCE_PORT
```

Note that you'll also need to set `--enable-auto-tool-choice` and `--tool-call-parser` to [enable tool calling in vLLM](https://docs.vllm.ai/en/latest/features/tool_calling.html).
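
For example, here is a minimal sketch of how those flags could be added to the launch command above; the parser name is an assumption that depends on the model you serve, so check the linked vLLM documentation for the right value:

```bash
# Hypothetical extension of the vLLM launch command shown above.
# "llama3_json" is an assumed parser name; pick the parser that matches your model.
python -m vllm.entrypoints.openai.api_server \
    --model $INFERENCE_MODEL \
    --port $INFERENCE_PORT \
    --enable-auto-tool-choice \
    --tool-call-parser llama3_json
```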

If you are using the Llama Stack Safety / Shield APIs, you will also need to run another instance of vLLM with a corresponding safety model such as `meta-llama/Llama-Guard-3-1B`, using a script like:

```bash
export SAFETY_PORT=8081
export SAFETY_MODEL=meta-llama/Llama-Guard-3-1B
export CUDA_VISIBLE_DEVICES=1
export VLLM_DIMG="rocm/vllm-dev:main"

docker run \
    --pull always \
    --ipc=host \
    --privileged \
    --shm-size 16g \
    --device=/dev/kfd \
    --device=/dev/dri \
    --group-add video \
    --cap-add=SYS_PTRACE \
    --cap-add=CAP_SYS_ADMIN \
    --security-opt seccomp=unconfined \
    --security-opt apparmor=unconfined \
    --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
    --env "HIP_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES" \
    -p $SAFETY_PORT:$SAFETY_PORT \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    $VLLM_DIMG \
    python -m vllm.entrypoints.openai.api_server \
    --model $SAFETY_MODEL \
    --port $SAFETY_PORT
```

### Setting up vLLM server on NVIDIA GPU

Please check the [vLLM Documentation](https://docs.vllm.ai/en/v0.5.5/serving/deploying_with_docker.html) to get a vLLM endpoint. Here is a sample script to start a vLLM server locally via Docker:

```bash
@@ -6,13 +6,13 @@ Llama Stack is a stateful service with REST APIs to support seamless transition
In this guide, we'll walk through how to build a RAG agent locally using Llama Stack with [Ollama](https://ollama.com/) to run inference on a Llama Model.


-### 1. Start Ollama
+### 1. Download a Llama model with Ollama

```bash
-ollama run llama3.2:3b --keepalive 60m
+ollama pull llama3.2:3b-instruct-fp16
```

-By default, Ollama keeps the model loaded in memory for 5 minutes which can be too short. We set the `--keepalive` flag to 60 minutes to ensure the model remains loaded for sometime.
+This will instruct the Ollama service to download the Llama 3.2 3B Instruct model, which we'll use in the rest of this guide.

```{admonition} Note
:class: tip
@@ -103,7 +103,5 @@ llama stack run together

2. Start Streamlit UI
```bash
-cd llama_stack/distribution/ui
-pip install -r requirements.txt
-streamlit run app.py
+uv run --with ".[ui]" streamlit run llama_stack/distribution/ui/app.py
```
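
Once Streamlit is running it serves the UI on its default port, 8501, unless you pass a different one; the URL below assumes that unchanged default:

```bash
# Open the Streamlit UI in a browser (8501 is Streamlit's default port).
open http://localhost:8501       # macOS
xdg-open http://localhost:8501   # Linux
```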
docs/source/providers/external.md (new file, 234 lines)

@@ -0,0 +1,234 @@
# External Providers

Llama Stack supports external providers that live outside of the main codebase. This allows you to:
- Create and maintain your own providers independently
- Share providers with others without contributing to the main codebase
- Keep provider-specific code separate from the core Llama Stack code

## Configuration

To enable external providers, you need to configure the `external_providers_dir` in your Llama Stack configuration. This directory should contain your external provider specifications:

```yaml
external_providers_dir: /etc/llama-stack/providers.d/
```

## Directory Structure

The external providers directory should follow this structure:

```
providers.d/
  remote/
    inference/
      custom_ollama.yaml
      vllm.yaml
    vector_io/
      qdrant.yaml
    safety/
      llama-guard.yaml
  inline/
    inference/
      custom_ollama.yaml
      vllm.yaml
    vector_io/
      qdrant.yaml
    safety/
      llama-guard.yaml
```

Each YAML file in these directories defines a provider specification for that particular API.
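
For example, to bootstrap the part of this layout used later in this guide (the path mirrors the sample `external_providers_dir` above; adjust it to your own location):

```bash
# Create the subdirectory that will hold remote inference provider specs.
# The path matches the sample external_providers_dir used throughout this guide.
mkdir -p /etc/llama-stack/providers.d/remote/inference
```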

## Provider Types

Llama Stack supports two types of external providers:

1. **Remote Providers**: Providers that communicate with external services (e.g., cloud APIs)
2. **Inline Providers**: Providers that run locally within the Llama Stack process

## Known External Providers

Here's a list of known external providers that you can use with Llama Stack:

| Type | Name | Description | Repository |
|------|------|-------------|------------|
| Remote | KubeFlow Training | Train models with KubeFlow | [llama-stack-provider-kft](https://github.com/opendatahub-io/llama-stack-provider-kft) |

### Remote Provider Specification

Remote providers are used when you need to communicate with external services. Here's an example for a custom Ollama provider:

```yaml
adapter:
  adapter_type: custom_ollama
  pip_packages:
  - ollama
  - aiohttp
  config_class: llama_stack_ollama_provider.config.OllamaImplConfig
  module: llama_stack_ollama_provider
api_dependencies: []
optional_api_dependencies: []
```

#### Adapter Configuration

The `adapter` section defines how to load and configure the provider:

- `adapter_type`: A unique identifier for this adapter
- `pip_packages`: List of Python packages required by the provider
- `config_class`: The full path to the configuration class
- `module`: The Python module containing the provider implementation

### Inline Provider Specification

Inline providers run locally within the Llama Stack process. Here's an example for a custom vector store provider:

```yaml
module: llama_stack_vector_provider
config_class: llama_stack_vector_provider.config.VectorStoreConfig
pip_packages:
  - faiss-cpu
  - numpy
api_dependencies:
  - inference
optional_api_dependencies:
  - vector_io
provider_data_validator: llama_stack_vector_provider.validator.VectorStoreValidator
container_image: custom-vector-store:latest # optional
```

#### Inline Provider Fields

- `module`: The Python module containing the provider implementation
- `config_class`: The full path to the configuration class
- `pip_packages`: List of Python packages required by the provider
- `api_dependencies`: List of Llama Stack APIs that this provider depends on
- `optional_api_dependencies`: List of optional Llama Stack APIs that this provider can use
- `provider_data_validator`: Optional validator for provider data
- `container_image`: Optional container image to use instead of pip packages

## Required Implementation

### Remote Providers

Remote providers must expose a `get_adapter_impl()` function in their module that takes two arguments:
1. `config`: An instance of the provider's config class
2. `deps`: A dictionary of API dependencies

This function must return an instance of the provider's adapter class that implements the required protocol for the API.

Example:
```python
async def get_adapter_impl(
    config: OllamaImplConfig, deps: Dict[Api, Any]
) -> OllamaInferenceAdapter:
    return OllamaInferenceAdapter(config)
```

### Inline Providers

Inline providers must expose a `get_provider_impl()` function in their module that takes two arguments:
1. `config`: An instance of the provider's config class
2. `deps`: A dictionary of API dependencies

Example:
```python
async def get_provider_impl(
    config: VectorStoreConfig, deps: Dict[Api, Any]
) -> VectorStoreImpl:
    impl = VectorStoreImpl(config, deps[Api.inference])
    await impl.initialize()
    return impl
```

## Dependencies

The provider package must be installed on the system. For example:

```bash
$ uv pip show llama-stack-ollama-provider
Name: llama-stack-ollama-provider
Version: 0.1.0
Location: /path/to/venv/lib/python3.10/site-packages
```

## Example: Custom Ollama Provider

Here's a complete example of creating and using a custom Ollama provider:

1. First, create the provider package:

```bash
mkdir -p llama-stack-provider-ollama
cd llama-stack-provider-ollama
git init
uv init
```

2. Edit `pyproject.toml`:

```toml
[project]
name = "llama-stack-provider-ollama"
version = "0.1.0"
description = "Ollama provider for Llama Stack"
requires-python = ">=3.10"
dependencies = ["llama-stack", "pydantic", "ollama", "aiohttp"]
```

3. Create the provider specification:

```yaml
# /etc/llama-stack/providers.d/remote/inference/custom_ollama.yaml
adapter:
  adapter_type: custom_ollama
  pip_packages: ["ollama", "aiohttp"]
  config_class: llama_stack_provider_ollama.config.OllamaImplConfig
  module: llama_stack_provider_ollama
api_dependencies: []
optional_api_dependencies: []
```

4. Install the provider:

```bash
uv pip install -e .
```

5. Configure Llama Stack to use external providers:

```yaml
external_providers_dir: /etc/llama-stack/providers.d/
```

The provider will now be available in Llama Stack with the type `remote::custom_ollama`.

## Best Practices

1. **Package Naming**: Use the prefix `llama-stack-provider-` for your provider packages to make them easily identifiable.

2. **Version Management**: Keep your provider package versioned and compatible with the Llama Stack version you're using.

3. **Dependencies**: Only include the minimum required dependencies in your provider package.

4. **Documentation**: Include clear documentation in your provider package about:
   - Installation requirements
   - Configuration options
   - Usage examples
   - Any limitations or known issues

5. **Testing**: Include tests in your provider package to ensure it works correctly with Llama Stack.
   You can refer to the [integration tests guide](https://github.com/meta-llama/llama-stack/blob/main/tests/integration/README.md) for more information. Execute the tests for the provider type you are developing, as sketched after this list.
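
A minimal sketch of what that might look like for an inference provider, assuming you run the tests from a checkout of the llama-stack repository; the flags for pointing the tests at your stack configuration and models are described in the linked README:

```bash
# Run the integration tests for the API your provider implements
# (inference is shown as an example; see tests/integration/README.md for the
# options that select your stack configuration and models).
uv run pytest -s -v tests/integration/inference
```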

## Troubleshooting

If your external provider isn't being loaded:

1. Check that the `external_providers_dir` path is correct and accessible.
2. Verify that the YAML files are properly formatted.
3. Ensure all required Python packages are installed.
4. Check the Llama Stack server logs for any error messages; turn on debug logging to get more information using `LLAMA_STACK_LOGGING=all=debug` (see the sketch after this list).
5. Verify that the provider package is installed in your Python environment.
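
As a concrete starting point for items 3 to 5, the commands below re-run the server with debug logging and check that the package is importable; the module and package names are the ones from the custom Ollama example above and are placeholders for your own:

```bash
# Re-run the server with debug logging to see why a provider fails to load
# (run.yaml is a placeholder for your own run configuration).
LLAMA_STACK_LOGGING=all=debug llama stack run run.yaml

# Confirm the provider package is installed and its module is importable
# in the same environment that runs Llama Stack.
uv pip show llama-stack-provider-ollama
python -c "import llama_stack_provider_ollama; print('importable')"
```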

@@ -11,6 +11,10 @@ Providers come in two flavors:

Importantly, Llama Stack always strives to provide at least one fully inline provider for each API so you can iterate on a fully featured environment locally.

## External Providers

Llama Stack supports external providers that live outside of the main codebase. This allows you to create and maintain your own providers independently. See the [External Providers Guide](external) for details.

## Agents
Run multi-step agentic workflows with LLMs, with tool usage, memory (RAG), etc.

@@ -50,6 +54,7 @@ The following providers (i.e., databases) are available for Vector IO:
```{toctree}
:maxdepth: 1

external
vector_io/faiss
vector_io/sqlite-vec
vector_io/chromadb