mirror of https://github.com/meta-llama/llama-stack.git (synced 2025-10-04 12:07:34 +00:00)

Commit c8e0fc1a7d (parent 5b9bea02c3): old md files deprecation

173 changed files with 0 additions and 12955 deletions

@ -1,443 +0,0 @@
# Build your own Distribution

This guide will walk you through the steps to get started with building a Llama Stack distribution from scratch with your choice of API providers.

### Setting your log level

To specify the logging level, set the `LLAMA_STACK_LOGGING` environment variable, which uses the following format:

`LLAMA_STACK_LOGGING=server=debug;core=info`

Each category in the following list:

- all
- core
- server
- router
- inference
- agents
- safety
- eval
- tools
- client

can be set to any of the following log levels:

- debug
- info
- warning
- error
- critical

The default global log level is `info`. `all` sets the log level for all components.

A user can also set `LLAMA_STACK_LOG_FILE`, which will pipe the logs to the specified path as well as to the terminal. An example would be: `export LLAMA_STACK_LOG_FILE=server.log`
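The same settings can also be applied programmatically before launching the server. The sketch below is illustrative only: it assumes the `llama` CLI is on your PATH and that a distribution named `starter` has already been built.

```python
import os
import subprocess

# Inherit the current environment and add the logging settings described above.
env = dict(os.environ)
env["LLAMA_STACK_LOGGING"] = "server=debug;core=info"  # per-category levels
env["LLAMA_STACK_LOG_FILE"] = "server.log"             # also mirror logs to a file

# Launch a previously built distribution (the name "starter" is an assumption).
subprocess.run(["llama", "stack", "run", "starter"], env=env, check=True)
```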
### Llama Stack Build

In order to build your own distribution, we recommend you clone the `llama-stack` repository.

```
git clone git@github.com:meta-llama/llama-stack.git
cd llama-stack
pip install -e .
```

Use the CLI to build your distribution.
The main points to consider are:
1. **Image Type** - Do you want a venv environment or a Container (e.g., Docker)?
2. **Template** - Do you want to use a template to build your distribution, or start from scratch?
3. **Config** - Do you want to use a pre-existing config file to build your distribution?

```
llama stack build -h
usage: llama stack build [-h] [--config CONFIG] [--template TEMPLATE] [--distro DISTRIBUTION] [--list-distros] [--image-type {container,venv}] [--image-name IMAGE_NAME] [--print-deps-only]
                         [--run] [--providers PROVIDERS]

Build a Llama stack container

options:
  -h, --help            show this help message and exit
  --config CONFIG       Path to a config file to use for the build. You can find example configs in llama_stack.cores/**/build.yaml. If this argument is not provided, you will be prompted to
                        enter information interactively (default: None)
  --template TEMPLATE   (deprecated) Name of the example template config to use for build. You may use `llama stack build --list-distros` to check out the available distributions (default:
                        None)
  --distro DISTRIBUTION, --distribution DISTRIBUTION
                        Name of the distribution to use for build. You may use `llama stack build --list-distros` to check out the available distributions (default: None)
  --list-distros, --list-distributions
                        Show the available distributions for building a Llama Stack distribution (default: False)
  --image-type {container,venv}
                        Image Type to use for the build. If not specified, will use the image type from the template config. (default: None)
  --image-name IMAGE_NAME
                        [for image-type=container|venv] Name of the virtual environment to use for the build. If not specified, currently active environment will be used if found. (default:
                        None)
  --print-deps-only     Print the dependencies for the stack only, without building the stack (default: False)
  --run                 Run the stack after building using the same image type, name, and other applicable arguments (default: False)
  --providers PROVIDERS
                        Build a config for a list of providers and only those providers. This list is formatted like: api1=provider1,api2=provider2. Where there can be multiple providers per
                        API. (default: None)
```

After this step is complete, a file named `<name>-build.yaml` and template file `<name>-run.yaml` will be generated and saved at the output file path specified at the end of the command.

::::{tab-set}

:::{tab-item} Building from a template
To build from alternative API providers, we provide distribution templates for users to get started building a distribution backed by different providers.

The following command will allow you to see the available templates and their corresponding providers.
```
llama stack build --list-templates
```

```
+------------------------------+-----------------------------------------------------------------------------+
| Template Name                | Description                                                                 |
+------------------------------+-----------------------------------------------------------------------------+
| watsonx                      | Use watsonx for running LLM inference                                       |
+------------------------------+-----------------------------------------------------------------------------+
| vllm-gpu                     | Use a built-in vLLM engine for running LLM inference                        |
+------------------------------+-----------------------------------------------------------------------------+
| together                     | Use Together.AI for running LLM inference                                   |
+------------------------------+-----------------------------------------------------------------------------+
| tgi                          | Use (an external) TGI server for running LLM inference                      |
+------------------------------+-----------------------------------------------------------------------------+
| starter                      | Quick start template for running Llama Stack with several popular providers |
+------------------------------+-----------------------------------------------------------------------------+
| sambanova                    | Use SambaNova for running LLM inference and safety                          |
+------------------------------+-----------------------------------------------------------------------------+
| remote-vllm                  | Use (an external) vLLM server for running LLM inference                     |
+------------------------------+-----------------------------------------------------------------------------+
| postgres-demo                | Quick start template for running Llama Stack with several popular providers |
+------------------------------+-----------------------------------------------------------------------------+
| passthrough                  | Use Passthrough hosted llama-stack endpoint for LLM inference               |
+------------------------------+-----------------------------------------------------------------------------+
| open-benchmark               | Distribution for running open benchmarks                                    |
+------------------------------+-----------------------------------------------------------------------------+
| ollama                       | Use (an external) Ollama server for running LLM inference                   |
+------------------------------+-----------------------------------------------------------------------------+
| nvidia                       | Use NVIDIA NIM for running LLM inference, evaluation and safety             |
+------------------------------+-----------------------------------------------------------------------------+
| meta-reference-gpu           | Use Meta Reference for running LLM inference                                |
+------------------------------+-----------------------------------------------------------------------------+
| llama_api                    | Distribution for running e2e tests in CI                                    |
+------------------------------+-----------------------------------------------------------------------------+
| hf-serverless                | Use (an external) Hugging Face Inference Endpoint for running LLM inference |
+------------------------------+-----------------------------------------------------------------------------+
| hf-endpoint                  | Use (an external) Hugging Face Inference Endpoint for running LLM inference |
+------------------------------+-----------------------------------------------------------------------------+
| groq                         | Use Groq for running LLM inference                                          |
+------------------------------+-----------------------------------------------------------------------------+
| fireworks                    | Use Fireworks.AI for running LLM inference                                  |
+------------------------------+-----------------------------------------------------------------------------+
| experimental-post-training   | Experimental template for post training                                     |
+------------------------------+-----------------------------------------------------------------------------+
| dell                         | Dell's distribution of Llama Stack. TGI inference via Dell's custom         |
|                              | container                                                                   |
+------------------------------+-----------------------------------------------------------------------------+
| ci-tests                     | Distribution for running e2e tests in CI                                    |
+------------------------------+-----------------------------------------------------------------------------+
| cerebras                     | Use Cerebras for running LLM inference                                      |
+------------------------------+-----------------------------------------------------------------------------+
| bedrock                      | Use AWS Bedrock for running LLM inference and safety                        |
+------------------------------+-----------------------------------------------------------------------------+
```

You may then pick a template to build your distribution with providers fitted to your liking.

For example, to build a distribution from the `starter` template, you can run:
```
$ llama stack build --distro starter
...
You can now edit ~/.llama/distributions/llamastack-starter/starter-run.yaml and run `llama stack run ~/.llama/distributions/llamastack-starter/starter-run.yaml`
```

```{tip}
The generated `run.yaml` file is a starting point for your configuration. For comprehensive guidance on customizing it for your specific needs, infrastructure, and deployment scenarios, see [Customizing Your run.yaml Configuration](customizing_run_yaml.md).
```
:::
:::{tab-item} Building from Scratch

If the provided templates do not fit your use case, you can run `llama stack build`, which launches an interactive wizard that prompts you for the build configuration.

It is best to start with a template and understand the structure of the config file and the various concepts (APIs, providers, resources, etc.) before starting from scratch.
```
llama stack build

> Enter a name for your Llama Stack (e.g. my-local-stack): my-stack
> Enter the image type you want your Llama Stack to be built as (container or venv): venv

Llama Stack is composed of several APIs working together. Let's select
the provider types (implementations) you want to use for these APIs.

Tip: use <TAB> to see options for the providers.

> Enter provider for API inference: inline::meta-reference
> Enter provider for API safety: inline::llama-guard
> Enter provider for API agents: inline::meta-reference
> Enter provider for API memory: inline::faiss
> Enter provider for API datasetio: inline::meta-reference
> Enter provider for API scoring: inline::meta-reference
> Enter provider for API eval: inline::meta-reference
> Enter provider for API telemetry: inline::meta-reference

> (Optional) Enter a short description for your Llama Stack:

You can now edit ~/.llama/distributions/llamastack-my-local-stack/my-local-stack-run.yaml and run `llama stack run ~/.llama/distributions/llamastack-my-local-stack/my-local-stack-run.yaml`
```
:::
:::{tab-item} Building from a pre-existing build config file
- In addition to templates, you may customize the build to your liking by editing a config file and building from it with the following command.

- The config file has the same structure as the ones in `llama_stack/distributions/*build.yaml`.

```
llama stack build --config llama_stack/distributions/starter/build.yaml
```
:::
:::{tab-item} Building with External Providers

Llama Stack supports external providers that live outside of the main codebase. This allows you to create and maintain your own providers independently or use community-provided providers.

To build a distribution with external providers, you need to:

1. Configure the `external_providers_dir` in your build configuration file:

```yaml
# Example my-external-stack.yaml with external providers
version: '2'
distribution_spec:
  description: Custom distro for CI tests
  providers:
    inference:
    - remote::custom_ollama
# Add more providers as needed
image_type: container
image_name: ci-test
# Path to external provider implementations
external_providers_dir: ~/.llama/providers.d
```

Here's an example for a custom Ollama provider:

```yaml
adapter:
  adapter_type: custom_ollama
  pip_packages:
  - ollama
  - aiohttp
  - llama-stack-provider-ollama # This is the provider package
  config_class: llama_stack_ollama_provider.config.OllamaImplConfig
  module: llama_stack_ollama_provider
api_dependencies: []
optional_api_dependencies: []
```

The `pip_packages` section lists the Python packages required by the provider, as well as the
provider package itself. The package must be available on PyPI or can be provided from a local
directory or a git repository (git must be installed on the build environment).

2. Build your distribution using the config file:

```
llama stack build --config my-external-stack.yaml
```

For more information on external providers, including directory structure, provider types, and implementation requirements, see the [External Providers documentation](../providers/external.md).
:::
:::{tab-item} Building Container

```{admonition} Podman Alternative
:class: tip

Podman is supported as an alternative to Docker. Set `CONTAINER_BINARY` to `podman` in your environment to use Podman.
```

To build a container image, you may start off from a template and use the `--image-type container` flag to specify `container` as the build image type.

```
llama stack build --distro starter --image-type container
```

```
$ llama stack build --distro starter --image-type container
...
Containerfile created successfully in /tmp/tmp.viA3a3Rdsg/Containerfile
FROM python:3.10-slim
...

You can now edit ~/meta-llama/llama-stack/tmp/configs/ollama-run.yaml and run `llama stack run ~/meta-llama/llama-stack/tmp/configs/ollama-run.yaml`
```

Now set some environment variables for the inference model ID and Llama Stack Port and create a local directory to mount into the container's file system.
```
export INFERENCE_MODEL="llama3.2:3b"
export LLAMA_STACK_PORT=8321
mkdir -p ~/.llama
```

After this step is successful, you should be able to find the built container image and test it with the below Docker command:

```
docker run -d \
  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
  -v ~/.llama:/root/.llama \
  localhost/distribution-ollama:dev \
  --port $LLAMA_STACK_PORT \
  --env INFERENCE_MODEL=$INFERENCE_MODEL \
  --env OLLAMA_URL=http://host.docker.internal:11434
```

Here are the docker flags and their uses:

* `-d`: Runs the container in detached mode as a background process

* `-p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT`: Maps the container port to the host port for accessing the server

* `-v ~/.llama:/root/.llama`: Mounts the local .llama directory to persist configurations and data

* `localhost/distribution-ollama:dev`: The name and tag of the container image to run

* `--port $LLAMA_STACK_PORT`: Port number for the server to listen on

* `--env INFERENCE_MODEL=$INFERENCE_MODEL`: Sets the model to use for inference

* `--env OLLAMA_URL=http://host.docker.internal:11434`: Configures the URL for the Ollama service
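Once the container is up, you can sanity-check it from Python. This is only a sketch: it assumes the `requests` package is installed and uses the `/v1/providers` endpoint (the same one queried with `curl` elsewhere in these docs) on the port exported above.

```python
import os
import requests

port = os.environ.get("LLAMA_STACK_PORT", "8321")
base_url = f"http://localhost:{port}"

# List the providers the running stack is serving; a 200 response means the
# server inside the container is reachable from the host.
resp = requests.get(f"{base_url}/v1/providers", timeout=10)
resp.raise_for_status()
print(resp.json())
```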
:::

::::
### Running your Stack server
Now, let's start the Llama Stack Distribution Server. You will need the YAML configuration file which was written out at the end by the `llama stack build` step.

```
llama stack run -h
usage: llama stack run [-h] [--port PORT] [--image-name IMAGE_NAME] [--env KEY=VALUE]
                       [--image-type {venv}] [--enable-ui]
                       [config | template]

Start the server for a Llama Stack Distribution. You should have already built (or downloaded) and configured the distribution.

positional arguments:
  config | template     Path to config file to use for the run or name of known template (`llama stack list` for a list). (default: None)

options:
  -h, --help            show this help message and exit
  --port PORT           Port to run the server on. It can also be passed via the env var LLAMA_STACK_PORT. (default: 8321)
  --image-name IMAGE_NAME
                        Name of the image to run. Defaults to the current environment (default: None)
  --env KEY=VALUE       Environment variables to pass to the server in KEY=VALUE format. Can be specified multiple times. (default: None)
  --image-type {venv}   Image Type used during the build. This should be venv. (default: None)
  --enable-ui           Start the UI server (default: False)
```

**Note:** Container images built with `llama stack build --image-type container` cannot be run using `llama stack run`. Instead, they must be run directly using Docker or Podman commands as shown in the container building section above.

```
# Start using template name
llama stack run tgi

# Start using config file
llama stack run ~/.llama/distributions/llamastack-my-local-stack/my-local-stack-run.yaml

# Start using a venv
llama stack run --image-type venv ~/.llama/distributions/llamastack-my-local-stack/my-local-stack-run.yaml
```

```
$ llama stack run ~/.llama/distributions/llamastack-my-local-stack/my-local-stack-run.yaml

Serving API inspect
 GET /health
 GET /providers/list
 GET /routes/list
Serving API inference
 POST /inference/chat_completion
 POST /inference/completion
 POST /inference/embeddings
...
Serving API agents
 POST /agents/create
 POST /agents/session/create
 POST /agents/turn/create
 POST /agents/delete
 POST /agents/session/delete
 POST /agents/session/get
 POST /agents/step/get
 POST /agents/turn/get

Listening on ['::', '0.0.0.0']:8321
INFO:     Started server process [2935911]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://['::', '0.0.0.0']:8321 (Press CTRL+C to quit)
INFO:     2401:db00:35c:2d2b:face:0:c9:0:54678 - "GET /models/list HTTP/1.1" 200 OK
```
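With the server running, you can talk to it from Python. A minimal sketch, assuming the separate `llama-stack-client` SDK is installed (`pip install llama-stack-client`); it only lists the models the server knows about.

```python
from llama_stack_client import LlamaStackClient

# Point the client at the port the server printed above (8321 by default).
client = LlamaStackClient(base_url="http://localhost:8321")

for model in client.models.list():
    print(model)
```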
### Listing Distributions
Using the list command, you can view all existing Llama Stack distributions, including stacks built from templates, from scratch, or using custom configuration files.

```
llama stack list -h
usage: llama stack list [-h]

list the build stacks

options:
  -h, --help  show this help message and exit
```

Example Usage

```
llama stack list
```

```
+-------------+-------------------------------------+--------------+------------+
| Stack Name  | Path                                | Build Config | Run Config |
+-------------+-------------------------------------+--------------+------------+
| together    | ~/.llama/distributions/together     | Yes          | No         |
+-------------+-------------------------------------+--------------+------------+
| bedrock     | ~/.llama/distributions/bedrock      | Yes          | No         |
+-------------+-------------------------------------+--------------+------------+
| starter     | ~/.llama/distributions/starter      | Yes          | Yes        |
+-------------+-------------------------------------+--------------+------------+
| remote-vllm | ~/.llama/distributions/remote-vllm  | Yes          | Yes        |
+-------------+-------------------------------------+--------------+------------+
```

### Removing a Distribution
Use the remove command to delete a distribution you've previously built.

```
llama stack rm -h
usage: llama stack rm [-h] [--all] [name]

Remove the build stack

positional arguments:
  name        Name of the stack to delete (default: None)

options:
  -h, --help  show this help message and exit
  --all, -a   Delete all stacks (use with caution) (default: False)
```

Example
```
llama stack rm llamastack-test
```

To keep your environment organized and avoid clutter, consider using `llama stack list` to review old or unused distributions and `llama stack rm <name>` to delete them when they're no longer needed.

### Troubleshooting

If you encounter any issues, ask questions in our Discord or search through our [GitHub Issues](https://github.com/meta-llama/llama-stack/issues), or file a new issue.
@ -1,802 +0,0 @@
# Configuring a "Stack"

The Llama Stack runtime configuration is specified as a YAML file. Here is a simplified version of an example configuration file for the Ollama distribution:

```{note}
The default `run.yaml` files generated by templates are starting points for your configuration. For guidance on customizing these files for your specific needs, see [Customizing Your run.yaml Configuration](customizing_run_yaml.md).
```

```{dropdown} 👋 Click here for a Sample Configuration File

```yaml
version: 2
apis:
- agents
- inference
- vector_io
- safety
- telemetry
providers:
  inference:
  - provider_id: ollama
    provider_type: remote::ollama
    config:
      url: ${env.OLLAMA_URL:=http://localhost:11434}
  vector_io:
  - provider_id: faiss
    provider_type: inline::faiss
    config:
      kvstore:
        type: sqlite
        namespace: null
        db_path: ${env.SQLITE_STORE_DIR:=~/.llama/distributions/ollama}/faiss_store.db
  safety:
  - provider_id: llama-guard
    provider_type: inline::llama-guard
    config: {}
  agents:
  - provider_id: meta-reference
    provider_type: inline::meta-reference
    config:
      persistence_store:
        type: sqlite
        namespace: null
        db_path: ${env.SQLITE_STORE_DIR:=~/.llama/distributions/ollama}/agents_store.db
  telemetry:
  - provider_id: meta-reference
    provider_type: inline::meta-reference
    config: {}
metadata_store:
  namespace: null
  type: sqlite
  db_path: ${env.SQLITE_STORE_DIR:=~/.llama/distributions/ollama}/registry.db
models:
- metadata: {}
  model_id: ${env.INFERENCE_MODEL}
  provider_id: ollama
  provider_model_id: null
shields: []
server:
  port: 8321
  auth:
    provider_config:
      type: "oauth2_token"
      jwks:
        uri: "https://my-token-issuing-svc.com/jwks"
```

Let's break this down into the different sections. The first section specifies the set of APIs that the stack server will serve:
```yaml
apis:
- agents
- inference
- vector_io
- safety
- telemetry
```
## Providers
Next up is the most critical part: the set of providers that the stack will use to serve the above APIs. Consider the `inference` API:
```yaml
providers:
  inference:
  # provider_id is a string you can choose freely
  - provider_id: ollama
    # provider_type is a string that specifies the type of provider.
    # in this case, the provider for inference is ollama and it runs remotely (outside of the distribution)
    provider_type: remote::ollama
    # config is a dictionary that contains the configuration for the provider.
    # in this case, the configuration is the url of the ollama server
    config:
      url: ${env.OLLAMA_URL:=http://localhost:11434}
```
A few things to note:
- A _provider instance_ is identified with an (id, type, config) triplet.
- The id is a string you can choose freely.
- You can instantiate any number of provider instances of the same type.
- The configuration dictionary is provider-specific.
- Notice that configuration can reference environment variables (with default values), which are expanded at runtime. When you run a stack server (via docker or via `llama stack run`), you can specify `--env OLLAMA_URL=http://my-server:11434` to override the default value.
### Environment Variable Substitution

Llama Stack supports environment variable substitution in configuration values using the
`${env.VARIABLE_NAME}` syntax. This allows you to externalize configuration values and provide
different settings for different environments. The syntax is inspired by [bash parameter expansion](https://www.gnu.org/software/bash/manual/html_node/Shell-Parameter-Expansion.html)
and follows similar patterns.

#### Basic Syntax

The basic syntax for environment variable substitution is:

```yaml
config:
  api_key: ${env.API_KEY}
  url: ${env.SERVICE_URL}
```

If the environment variable is not set, the server will raise an error during startup.

#### Default Values

You can provide default values using the `:=` operator:

```yaml
config:
  url: ${env.OLLAMA_URL:=http://localhost:11434}
  port: ${env.PORT:=8321}
  timeout: ${env.TIMEOUT:=60}
```

If the environment variable is not set, the default value `http://localhost:11434` will be used.
Empty defaults are allowed, so `url: ${env.OLLAMA_URL:=}` will be set to `None` if the environment variable is not set.

#### Conditional Values

You can use the `:+` operator to provide a value only when the environment variable is set:

```yaml
config:
  # Only include this field if ENVIRONMENT is set
  environment: ${env.ENVIRONMENT:+production}
```

If the environment variable is set, the value after `:+` will be used. If it's not set, the field
will be omitted with a `None` value.

Do not use conditional values (`${env.OLLAMA_URL:+}`) in place of empty defaults (`${env.OLLAMA_URL:=}`):
the empty default is what evaluates to `None` when the environment variable is not set, while the
conditional form should only be used when you want a value supplied specifically when the environment variable is set.

#### Examples

Here are some common patterns:

```yaml
# Required environment variable (will error if not set)
api_key: ${env.OPENAI_API_KEY}

# Optional with default
base_url: ${env.API_BASE_URL:=https://api.openai.com/v1}

# Conditional field
debug_mode: ${env.DEBUG:+true}

# Optional field that becomes None if not set
optional_token: ${env.OPTIONAL_TOKEN:+}
```

#### Runtime Override

You can override environment variables at runtime when starting the server:

```bash
# Override specific environment variables
llama stack run --config run.yaml --env API_KEY=sk-123 --env BASE_URL=https://custom-api.com

# Or set them in your shell
export API_KEY=sk-123
export BASE_URL=https://custom-api.com
llama stack run --config run.yaml
```

#### Type Safety

The environment variable substitution system is type-safe:

- String values remain strings
- Empty values (`${env.VAR:+}`) are converted to `None` for fields that accept `str | None`
- Numeric defaults are properly typed (e.g., `${env.PORT:=8321}` becomes an integer)
- Boolean defaults work correctly (e.g., `${env.DEBUG:=false}` becomes a boolean)
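To make the expansion rules concrete, here is a small, self-contained re-implementation in Python. It is not the code Llama Stack uses internally; it is only a sketch of the `${env.VAR}`, `:=`, and `:+` semantics described above.

```python
import os
import re

_PATTERN = re.compile(r"\$\{env\.([A-Z0-9_]+)(:=([^}]*)|:\+([^}]*))?\}")

def expand(value: str) -> str | None:
    """Expand a single ${env.VAR[:=default|:+conditional]} expression."""
    match = _PATTERN.fullmatch(value)
    if not match:
        return value
    name, op, default, conditional = match.group(1, 2, 3, 4)
    env_value = os.environ.get(name)
    if op is None:                      # ${env.VAR}: required, error if unset
        if env_value is None:
            raise ValueError(f"Environment variable {name} is not set")
        return env_value
    if op.startswith(":="):             # default: env value, else default (empty -> None)
        return env_value if env_value is not None else (default or None)
    return conditional if env_value else None   # :+ conditional: only when set

# Example: behaves like `url: ${env.OLLAMA_URL:=http://localhost:11434}`
print(expand("${env.OLLAMA_URL:=http://localhost:11434}"))
```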
## Resources

Let's look at the `models` section:

```yaml
models:
- metadata: {}
  model_id: ${env.INFERENCE_MODEL}
  provider_id: ollama
  provider_model_id: null
  model_type: llm
```
A Model is an instance of a "Resource" (see [Concepts](../concepts/index)) and is associated with a specific inference provider (in this case, the provider with identifier `ollama`). This is an instance of a "pre-registered" model. While we always encourage the clients to register models before using them, some Stack servers may come up with a list of "already known and available" models.

What's with the `provider_model_id` field? This is an identifier for the model inside the provider's model catalog. Contrast it with `model_id`, which is the identifier for the same model for Llama Stack's purposes. For example, you may want to name "llama3.2:vision-11b" as "image_captioning_model" when you use it in your Stack interactions. When omitted, the server will set `provider_model_id` to be the same as `model_id`.

If you need to conditionally register a model in the configuration, such as only when specific environment variable(s) are set, this can be accomplished by utilizing a special `__disabled__` string as the default value of an environment variable substitution, as shown below:

```yaml
models:
- metadata: {}
  model_id: ${env.INFERENCE_MODEL:__disabled__}
  provider_id: ollama
  provider_model_id: ${env.INFERENCE_MODEL:__disabled__}
```

The snippet above will only register this model if the environment variable `INFERENCE_MODEL` is set and non-empty. If the environment variable is not set, the model will not get registered at all.
## Server Configuration

The `server` section configures the HTTP server that serves the Llama Stack APIs:

```yaml
server:
  port: 8321  # Port to listen on (default: 8321)
  tls_certfile: "/path/to/cert.pem"  # Optional: Path to TLS certificate for HTTPS
  tls_keyfile: "/path/to/key.pem"  # Optional: Path to TLS key for HTTPS
  cors: true  # Optional: Enable CORS (dev mode) or full config object
```

### CORS Configuration

CORS (Cross-Origin Resource Sharing) can be configured in two ways:

**Local development** (allows localhost origins only):
```yaml
server:
  cors: true
```

**Explicit configuration** (custom origins and settings):
```yaml
server:
  cors:
    allow_origins: ["https://myapp.com", "https://app.example.com"]
    allow_methods: ["GET", "POST", "PUT", "DELETE"]
    allow_headers: ["Content-Type", "Authorization"]
    allow_credentials: true
    max_age: 3600
```

When `cors: true`, the server enables secure localhost-only access for local development. For production, specify exact origins to maintain security.
### Authentication Configuration

> **Breaking Change (v0.2.14)**: The authentication configuration structure has changed. The previous format with `provider_type` and `config` fields has been replaced with a unified `provider_config` field that includes the `type` field. Update your configuration files accordingly.

The `auth` section configures authentication for the server. When configured, all API requests must include a valid Bearer token in the Authorization header:

```
Authorization: Bearer <token>
```

The server supports multiple authentication providers:

#### OAuth 2.0/OpenID Connect Provider with Kubernetes

The server can be configured to use service account tokens for authorization, validating these against the Kubernetes API server, e.g.:
```yaml
server:
  auth:
    provider_config:
      type: "oauth2_token"
      jwks:
        uri: "https://kubernetes.default.svc:8443/openid/v1/jwks"
        token: "${env.TOKEN:+}"
        key_recheck_period: 3600
      tls_cafile: "/path/to/ca.crt"
      issuer: "https://kubernetes.default.svc"
      audience: "https://kubernetes.default.svc"
```

To find your cluster's jwks uri (from which the public key(s) to verify the token signature are obtained), run:
```
kubectl get --raw /.well-known/openid-configuration | jq -r .jwks_uri
```

For the tls_cafile, you can use the CA certificate of the OIDC provider:
```bash
kubectl config view --minify -o jsonpath='{.clusters[0].cluster.certificate-authority}'
```

For the issuer, you can use the OIDC provider's URL:
```bash
kubectl get --raw /.well-known/openid-configuration | jq .issuer
```

The audience can be obtained from a token, e.g. run:
```bash
kubectl create token default --duration=1h | cut -d. -f2 | base64 -d | jq .aud
```

The jwks token is used to authorize access to the jwks endpoint. You can obtain a token by running:

```bash
kubectl create namespace llama-stack
kubectl create serviceaccount llama-stack-auth -n llama-stack
kubectl create token llama-stack-auth -n llama-stack > llama-stack-auth-token
export TOKEN=$(cat llama-stack-auth-token)
```

Alternatively, you can configure the jwks endpoint to allow anonymous access. To do this, make sure
the `kube-apiserver` runs with `--anonymous-auth=true` to allow unauthenticated requests
and that the correct RoleBinding is created to allow the service account to access the necessary
resources. If that is not the case, you can create a RoleBinding for the service account to access
the necessary resources:

```yaml
# allow-anonymous-openid.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: allow-anonymous-openid
rules:
- nonResourceURLs: ["/openid/v1/jwks"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: allow-anonymous-openid
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: allow-anonymous-openid
subjects:
- kind: User
  name: system:anonymous
  apiGroup: rbac.authorization.k8s.io
```

And then apply the configuration:
```bash
kubectl apply -f allow-anonymous-openid.yaml
```

The provider extracts user information from the JWT token:
- Username from the `sub` claim becomes a role
- Kubernetes groups become teams

You can easily validate a request by running:

```bash
curl -s -L -H "Authorization: Bearer $(cat llama-stack-auth-token)" http://127.0.0.1:8321/v1/providers
```

#### Kubernetes Authentication Provider

The server can be configured to use the Kubernetes SelfSubjectReview API to validate tokens directly against the Kubernetes API server:

```yaml
server:
  auth:
    provider_config:
      type: "kubernetes"
      api_server_url: "https://kubernetes.default.svc"
      claims_mapping:
        username: "roles"
        groups: "roles"
        uid: "uid_attr"
      verify_tls: true
      tls_cafile: "/path/to/ca.crt"
```

Configuration options:
- `api_server_url`: The Kubernetes API server URL (e.g., https://kubernetes.default.svc:6443)
- `verify_tls`: Whether to verify TLS certificates (default: true)
- `tls_cafile`: Path to CA certificate file for TLS verification
- `claims_mapping`: Mapping of Kubernetes user claims to access attributes

The provider validates tokens by sending a SelfSubjectReview request to the Kubernetes API server at `/apis/authentication.k8s.io/v1/selfsubjectreviews`. The provider extracts user information from the response:
- Username from the `userInfo.username` field
- Groups from the `userInfo.groups` field
- UID from the `userInfo.uid` field

To obtain a token for testing:
```bash
kubectl create namespace llama-stack
kubectl create serviceaccount llama-stack-auth -n llama-stack
kubectl create token llama-stack-auth -n llama-stack > llama-stack-auth-token
```

You can validate a request by running:
```bash
curl -s -L -H "Authorization: Bearer $(cat llama-stack-auth-token)" http://127.0.0.1:8321/v1/providers
```

#### GitHub Token Provider
Validates GitHub personal access tokens or OAuth tokens directly:
```yaml
server:
  auth:
    provider_config:
      type: "github_token"
      github_api_base_url: "https://api.github.com"  # Or GitHub Enterprise URL
```

The provider fetches user information from GitHub and maps it to access attributes based on the `claims_mapping` configuration.

#### Custom Provider
Validates tokens against a custom authentication endpoint:
```yaml
server:
  auth:
    provider_config:
      type: "custom"
      endpoint: "https://auth.example.com/validate"  # URL of the auth endpoint
```

The custom endpoint receives a POST request with:
```json
{
  "api_key": "<token>",
  "request": {
    "path": "/api/v1/endpoint",
    "headers": {
      "content-type": "application/json",
      "user-agent": "curl/7.64.1"
    },
    "params": {
      "key": ["value"]
    }
  }
}
```

And must respond with:
```json
{
  "access_attributes": {
    "roles": ["admin", "user"],
    "teams": ["ml-team", "nlp-team"],
    "projects": ["llama-3", "project-x"],
    "namespaces": ["research"]
  },
  "message": "Authentication successful"
}
```

If no access attributes are returned, the token is used as a namespace.
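For illustration, here is a minimal custom authentication endpoint that follows the request/response contract shown above. This is only a sketch: it assumes FastAPI and uvicorn are installed, and the token check is a placeholder you would replace with real validation logic.

```python
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()

class AuthRequest(BaseModel):
    api_key: str
    request: dict

@app.post("/validate")
def validate(body: AuthRequest):
    # Placeholder check; replace with a real token lookup (database, IdP, etc.).
    if body.api_key != "expected-token":
        raise HTTPException(status_code=401, detail="Invalid token")
    # Response shape mirrors the example response documented above.
    return {
        "access_attributes": {
            "roles": ["user"],
            "teams": ["ml-team"],
        },
        "message": "Authentication successful",
    }

# Run with: uvicorn auth_endpoint:app --port 8000
```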
### Access control

When authentication is enabled, access to resources is controlled
through the `access_policy` attribute of the auth config section under
server. The value for this is a list of access rules.

Each access rule defines a list of actions either to permit or to
forbid. It may specify a principal or a resource that must match for
the rule to take effect.

Valid actions are create, read, update, and delete. The resource to
match should be specified in the form of a type qualified identifier,
e.g. model::my-model or vector_db::some-db, or a wildcard for all
resources of a type, e.g. model::*. If the principal or resource are
not specified, they will match all requests.

The valid resource types are model, shield, vector_db, dataset,
scoring_function, benchmark, tool, tool_group and session.

A rule may also specify a condition, either a 'when' or an 'unless',
with additional constraints as to where the rule applies. The
constraints supported at present are:

- 'user with <attr-value> in <attr-name>'
- 'user with <attr-value> not in <attr-name>'
- 'user is owner'
- 'user is not owner'
- 'user in owners <attr-name>'
- 'user not in owners <attr-name>'

The attributes defined for a user will depend on how the auth
configuration is defined.

When checking whether a particular action is allowed by the current
user for a resource, all the defined rules are tested in order to find
a match. If a match is found, the request is permitted or forbidden
depending on the type of rule. If no match is found, the request is
denied.

If no explicit rules are specified, a default policy is defined with
which all users can access all resources defined in config but
resources created dynamically can only be accessed by the user that
created them.

Examples:

The following restricts access to particular GitHub users:

```yaml
server:
  auth:
    provider_config:
      type: "github_token"
      github_api_base_url: "https://api.github.com"
    access_policy:
    - permit:
        principal: user-1
        actions: [create, read, delete]
      description: user-1 has full access to all resources
    - permit:
        principal: user-2
        actions: [read]
        resource: model::model-1
      description: user-2 has read access to model-1 only
```

Similarly, the following restricts access to particular Kubernetes
service accounts:

```yaml
server:
  auth:
    provider_config:
      type: "oauth2_token"
      audience: https://kubernetes.default.svc.cluster.local
      issuer: https://kubernetes.default.svc.cluster.local
      tls_cafile: /home/gsim/.minikube/ca.crt
      jwks:
        uri: https://kubernetes.default.svc.cluster.local:8443/openid/v1/jwks
        token: ${env.TOKEN}
    access_policy:
    - permit:
        principal: system:serviceaccount:my-namespace:my-serviceaccount
        actions: [create, read, delete]
      description: specific serviceaccount has full access to all resources
    - permit:
        principal: system:serviceaccount:default:default
        actions: [read]
        resource: model::model-1
      description: default account has read access to model-1 only
```

The following policy, which assumes that users are defined with roles
and teams by whichever authentication system is in use, allows any
user with a valid token to use models, create resources other than
models, read and delete resources they created and read resources
created by users sharing a team with them:

```
access_policy:
- permit:
    actions: [read]
    resource: model::*
  description: all users have read access to models
- forbid:
    actions: [create, delete]
    resource: model::*
    unless: user with admin in roles
  description: only user with admin role can create or delete models
- permit:
    actions: [create, read, delete]
    when: user is owner
  description: users can create resources other than models and read and delete those they own
- permit:
    actions: [read]
    when: user in owner teams
  description: any user has read access to any resource created by a user with the same team
```
#### API Endpoint Authorization with Scopes

In addition to resource-based access control, Llama Stack supports endpoint-level authorization using OAuth 2.0 style scopes. When authentication is enabled, specific API endpoints require users to have particular scopes in their authentication token.

**Scope-Gated APIs:**
The following APIs are currently gated by scopes:

- **Telemetry API** (scope: `telemetry.read`):
  - `POST /telemetry/traces` - Query traces
  - `GET /telemetry/traces/{trace_id}` - Get trace by ID
  - `GET /telemetry/traces/{trace_id}/spans/{span_id}` - Get span by ID
  - `POST /telemetry/spans/{span_id}/tree` - Get span tree
  - `POST /telemetry/spans` - Query spans
  - `POST /telemetry/metrics/{metric_name}` - Query metrics

**Authentication Configuration:**

For **JWT/OAuth2 providers**, scopes should be included in the JWT's claims:
```json
{
  "sub": "user123",
  "scope": "telemetry.read",
  "aud": "llama-stack"
}
```

For **custom authentication providers**, the endpoint must return user attributes including the `scopes` array:
```json
{
  "principal": "user123",
  "attributes": {
    "scopes": ["telemetry.read"]
  }
}
```

**Behavior:**
- Users without the required scope receive a 403 Forbidden response
- When authentication is disabled, scope checks are bypassed
- Endpoints without `required_scope` work normally for all authenticated users
### Quota Configuration

The `quota` section allows you to enable server-side request throttling for both
authenticated and anonymous clients. This is useful for preventing abuse, enforcing
fairness across tenants, and controlling infrastructure costs without requiring
client-side rate limiting or external proxies.

Quotas are disabled by default. When enabled, each client is tracked using either:

* Their authenticated `client_id` (derived from the Bearer token), or
* Their IP address (fallback for anonymous requests)

Quota state is stored in a SQLite-backed key-value store, and rate limits are applied
within a configurable time window (currently only `day` is supported).

#### Example

```yaml
server:
  quota:
    kvstore:
      type: sqlite
      db_path: ./quotas.db
    anonymous_max_requests: 100
    authenticated_max_requests: 1000
    period: day
```

#### Configuration Options

| Field                        | Description                                                                 |
| ---------------------------- | --------------------------------------------------------------------------- |
| `kvstore`                    | Required. Backend storage config for tracking request counts.               |
| `kvstore.type`               | Must be `"sqlite"` for now. Other backends may be supported in the future.  |
| `kvstore.db_path`            | File path to the SQLite database.                                           |
| `anonymous_max_requests`     | Max requests per period for unauthenticated clients.                        |
| `authenticated_max_requests` | Max requests per period for authenticated clients.                          |
| `period`                     | Time window for quota enforcement. Only `"day"` is supported.               |

> Note: if `authenticated_max_requests` is set but no authentication provider is
> configured, the server will fall back to applying `anonymous_max_requests` to all
> clients.

#### Example with Authentication Enabled

```yaml
server:
  port: 8321
  auth:
    provider_config:
      type: custom
      endpoint: https://auth.example.com/validate
  quota:
    kvstore:
      type: sqlite
      db_path: ./quotas.db
    anonymous_max_requests: 100
    authenticated_max_requests: 1000
    period: day
```

If a client exceeds their limit, the server responds with:

```http
HTTP/1.1 429 Too Many Requests
Content-Type: application/json

{
  "error": {
    "message": "Quota exceeded"
  }
}
```
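Clients should be prepared to back off when they receive this response. A small illustrative sketch in Python, assuming the `requests` package is installed; the endpoint and wait times are arbitrary choices, not part of Llama Stack itself.

```python
import time
import requests

def get_with_backoff(url: str, token: str, retries: int = 3):
    headers = {"Authorization": f"Bearer {token}"}
    for attempt in range(retries):
        resp = requests.get(url, headers=headers, timeout=30)
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp.json()
        # Quota exceeded: wait before retrying (simple linear backoff).
        time.sleep(30 * (attempt + 1))
    raise RuntimeError("Quota still exceeded after retries")

# e.g. get_with_backoff("http://localhost:8321/v1/providers", token="...")
```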
### CORS Configuration

Configure CORS to allow web browsers to make requests from different domains. Disabled by default.

#### Quick Setup

For development, use the simple boolean flag:

```yaml
server:
  cors: true  # Auto-enables localhost with any port
```

This automatically allows `http://localhost:*` and `https://localhost:*` with secure defaults.

#### Custom Configuration

For specific origins and full control:

```yaml
server:
  cors:
    allow_origins: ["https://myapp.com", "https://staging.myapp.com"]
    allow_credentials: true
    allow_methods: ["GET", "POST", "PUT", "DELETE"]
    allow_headers: ["Content-Type", "Authorization"]
    allow_origin_regex: "https://.*\\.example\\.com"  # Optional regex pattern
    expose_headers: ["X-Total-Count"]
    max_age: 86400
```

#### Configuration Options

| Field                | Description                                    | Default |
| -------------------- | ---------------------------------------------- | ------- |
| `allow_origins`      | List of allowed origins. Use `["*"]` for any.  | `["*"]` |
| `allow_origin_regex` | Regex pattern for allowed origins (optional).  | `None`  |
| `allow_methods`      | Allowed HTTP methods.                          | `["*"]` |
| `allow_headers`      | Allowed headers.                               | `["*"]` |
| `allow_credentials`  | Allow credentials (cookies, auth headers).     | `false` |
| `expose_headers`     | Headers exposed to browser.                    | `[]`    |
| `max_age`            | Preflight cache time (seconds).                | `600`   |

**Security Notes**:
- `allow_credentials: true` requires explicit origins (no wildcards)
- `cors: true` enables localhost access only (secure for development)
- For public APIs, always specify exact allowed origins
## Extending to handle Safety

Configuring Safety can be a little involved, so it is instructive to go through an example.

The Safety API works with the associated Resource called a `Shield`. Providers can support various kinds of Shields. Good examples include the [Llama Guard](https://ai.meta.com/research/publications/llama-guard-llm-based-input-output-safeguard-for-human-ai-conversations/) system-safety models, or [Bedrock Guardrails](https://aws.amazon.com/bedrock/guardrails/).

To configure a Bedrock Shield, you would need to add:
- A Safety API provider instance with type `remote::bedrock`
- A Shield resource served by this provider.

```yaml
...
providers:
  safety:
  - provider_id: bedrock
    provider_type: remote::bedrock
    config:
      aws_access_key_id: ${env.AWS_ACCESS_KEY_ID}
      aws_secret_access_key: ${env.AWS_SECRET_ACCESS_KEY}
...
shields:
- provider_id: bedrock
  params:
    guardrailVersion: ${env.GUARDRAIL_VERSION}
  provider_shield_id: ${env.GUARDRAIL_ID}
...
```

The situation is more involved if the Shield needs _Inference_ of an associated model. This is the case with Llama Guard. In that case, you would need to add:
- A Safety API provider instance with type `inline::llama-guard`
- An Inference API provider instance for serving the model.
- A Model resource associated with this provider.
- A Shield resource served by the Safety provider.

The yaml configuration for this setup, assuming you were using vLLM as your inference server, would look like:
```yaml
...
providers:
  safety:
  - provider_id: llama-guard
    provider_type: inline::llama-guard
    config: {}
  inference:
  # this vLLM server serves the "normal" inference model (e.g., llama3.2:3b)
  - provider_id: vllm-0
    provider_type: remote::vllm
    config:
      url: ${env.VLLM_URL:=http://localhost:8000}
  # this vLLM server serves the llama-guard model (e.g., llama-guard:3b)
  - provider_id: vllm-1
    provider_type: remote::vllm
    config:
      url: ${env.SAFETY_VLLM_URL:=http://localhost:8001}
...
models:
- metadata: {}
  model_id: ${env.INFERENCE_MODEL}
  provider_id: vllm-0
  provider_model_id: null
- metadata: {}
  model_id: ${env.SAFETY_MODEL}
  provider_id: vllm-1
  provider_model_id: null
shields:
- provider_id: llama-guard
  shield_id: ${env.SAFETY_MODEL}  # Llama Guard shields are identified by the corresponding LlamaGuard model
  provider_shield_id: null
...
```
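Once the server starts with this configuration, you can confirm the shield is registered from Python. A minimal sketch assuming the `llama-stack-client` SDK is installed; exact fields on the returned objects may vary between SDK versions.

```python
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

# The Llama Guard shield registered in the config above should show up here.
for shield in client.shields.list():
    print(shield)
```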
@ -1,40 +0,0 @@
# Customizing run.yaml Files

The `run.yaml` files generated by Llama Stack templates are **starting points** designed to be customized for your specific needs. They are not meant to be used as-is in production environments.

## Key Points

- **Templates are starting points**: Generated `run.yaml` files contain defaults for development/testing
- **Customization expected**: Update URLs, credentials, models, and settings for your environment
- **Version control separately**: Keep customized configs in your own repository
- **Environment-specific**: Create different configurations for dev, staging, production

## What You Can Customize

You can customize (a short sketch of one approach follows this list):
- **Provider endpoints**: Change `http://localhost:8000` to your actual servers
- **Swap providers**: Replace default providers (e.g., swap Tavily with Brave for search)
- **Storage paths**: Move from `/tmp/` to production directories
- **Authentication**: Add API keys, SSL, timeouts
- **Models**: Different model sizes for dev vs prod
- **Database settings**: Switch from SQLite to PostgreSQL
- **Tool configurations**: Add custom tools and integrations
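One simple way to keep environment-specific customizations out of the generated file is to apply overrides programmatically. This is only a sketch, assuming PyYAML is installed; the file names and override values are illustrative, not prescribed by Llama Stack.

```python
import yaml

# Load the template-generated run.yaml as a base.
with open("starter-run.yaml") as f:
    config = yaml.safe_load(f)

# Example overrides for a production environment (values are illustrative).
config["server"]["port"] = 8321
for provider in config["providers"]["inference"]:
    if provider["provider_id"] == "ollama":
        provider["config"]["url"] = "http://inference.internal:11434"

with open("prod-run.yaml", "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)
```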
## Best Practices

- Use environment variables for secrets and environment-specific values
- Create separate `run.yaml` files for different environments (dev, staging, prod)
- Document your changes with comments
- Test configurations before deployment
- Keep your customized configs in version control

Example structure:
```
your-project/
├── configs/
│   ├── dev-run.yaml
│   ├── prod-run.yaml
└── README.md
```

The goal is to take the generated template and adapt it to your specific infrastructure and operational needs.
@ -1,19 +0,0 @@
#!/usr/bin/env bash

# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.

set -euo pipefail

SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
K8S_DIR="${SCRIPT_DIR}/../k8s"

echo "Setting up AWS EKS-specific storage class..."
kubectl apply -f gp3-topology-aware.yaml

echo "Running main Kubernetes deployment..."
cd "${K8S_DIR}"
./apply.sh "$@"
@ -1,15 +0,0 @@
# Set up default storage class on AWS EKS
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-topology-aware
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
parameters:
  type: gp3
  iops: "3000"
  throughput: "125"
provisioner: ebs.csi.aws.com
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
@ -1,34 +0,0 @@
|
|||
# Using Llama Stack as a Library
|
||||
|
||||
## Setup Llama Stack without a Server
|
||||
If you are planning to use an external service for Inference (even Ollama or TGI counts as external), it is often easier to use Llama Stack as a library.
|
||||
This avoids the overhead of setting up a server.
|
||||
```bash
|
||||
# setup
|
||||
uv pip install llama-stack
|
||||
llama stack build --distro starter --image-type venv
|
||||
```
|
||||
|
||||
```python
|
||||
import os

from llama_stack.core.library_client import LlamaStackAsLibraryClient
|
||||
|
||||
client = LlamaStackAsLibraryClient(
|
||||
"starter",
|
||||
# provider_data is optional, but if you need to pass in any provider specific data, you can do so here.
|
||||
provider_data={"tavily_search_api_key": os.environ["TAVILY_SEARCH_API_KEY"]},
|
||||
)
|
||||
```
|
||||
|
||||
This will parse your config and set up any inline implementations and remote clients needed for your implementation.
|
||||
|
||||
Then, you can access the APIs like `models` and `inference` on the client and call their methods directly:
|
||||
|
||||
```python
|
||||
response = client.models.list()
|
||||
```
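
For example, a simple chat completion through the `inference` API looks like the sketch below. The model ID is only an example and must match a model registered in your distribution; exact parameter and field names can vary slightly between client versions.

```python
# A minimal chat completion through the library client; the model ID is an
# example and must correspond to a model configured in your distribution.
response = client.inference.chat_completion(
    model_id="meta-llama/Llama-3.2-3B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a haiku about distributions."},
    ],
)
print(response.completion_message.content)
```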
|
||||
|
||||
If you've created a [custom distribution](building_distro.md), you can also use the run.yaml configuration file directly:
|
||||
|
||||
```python
|
||||
client = LlamaStackAsLibraryClient(config_path)
|
||||
```
|
|
@ -1,15 +0,0 @@
|
|||
# Distributions Overview
|
||||
|
||||
A distribution is a pre-packaged set of Llama Stack components that can be deployed together.
|
||||
|
||||
This section provides an overview of the distributions available in Llama Stack.
|
||||
|
||||
```{toctree}
|
||||
:maxdepth: 3
|
||||
list_of_distributions
|
||||
building_distro
|
||||
customizing_run_yaml
|
||||
starting_llama_stack_server
|
||||
importing_as_library
|
||||
configuration
|
||||
```
|
|
@ -1,63 +0,0 @@
|
|||
#!/usr/bin/env bash
|
||||
|
||||
# Copyright (c) Meta Platforms, Inc. and affiliates.
|
||||
# All rights reserved.
|
||||
#
|
||||
# This source code is licensed under the terms described in the LICENSE file in
|
||||
# the root directory of this source tree.
|
||||
|
||||
export POSTGRES_USER=llamastack
|
||||
export POSTGRES_DB=llamastack
|
||||
export POSTGRES_PASSWORD=llamastack
|
||||
|
||||
export INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct
|
||||
export SAFETY_MODEL=meta-llama/Llama-Guard-3-1B
|
||||
|
||||
# HF_TOKEN should be set by the user; base64 encode it for the secret
|
||||
if [ -n "${HF_TOKEN:-}" ]; then
|
||||
export HF_TOKEN_BASE64=$(echo -n "$HF_TOKEN" | base64)
|
||||
else
|
||||
echo "ERROR: HF_TOKEN not set. You need it for vLLM to download models from Hugging Face."
|
||||
exit 1
|
||||
fi
|
||||
|
||||
if [ -z "${GITHUB_CLIENT_ID:-}" ]; then
|
||||
echo "ERROR: GITHUB_CLIENT_ID not set. You need it for Github login to work. See the Kubernetes Deployment Guide in the Llama Stack documentation."
|
||||
exit 1
|
||||
fi
|
||||
|
||||
if [ -z "${GITHUB_CLIENT_SECRET:-}" ]; then
|
||||
echo "ERROR: GITHUB_CLIENT_SECRET not set. You need it for Github login to work. See the Kubernetes Deployment Guide in the Llama Stack documentation."
|
||||
exit 1
|
||||
fi
|
||||
|
||||
if [ -z "${LLAMA_STACK_UI_URL:-}" ]; then
|
||||
echo "ERROR: LLAMA_STACK_UI_URL not set. Should be set to the external URL of the UI (excluding port). You need it for Github login to work. See the Kubernetes Deployment Guide in the Llama Stack documentation."
|
||||
exit 1
|
||||
fi
|
||||
|
||||
|
||||
|
||||
|
||||
set -euo pipefail
|
||||
set -x
|
||||
|
||||
# Apply the HF token secret if HF_TOKEN is provided
|
||||
if [ -n "${HF_TOKEN:-}" ]; then
|
||||
envsubst < ./hf-token-secret.yaml.template | kubectl apply -f -
|
||||
fi
|
||||
|
||||
envsubst < ./vllm-k8s.yaml.template | kubectl apply -f -
|
||||
envsubst < ./vllm-safety-k8s.yaml.template | kubectl apply -f -
|
||||
envsubst < ./postgres-k8s.yaml.template | kubectl apply -f -
|
||||
envsubst < ./chroma-k8s.yaml.template | kubectl apply -f -
|
||||
|
||||
kubectl create configmap llama-stack-config --from-file=stack_run_config.yaml \
|
||||
--dry-run=client -o yaml > stack-configmap.yaml
|
||||
|
||||
kubectl apply -f stack-configmap.yaml
|
||||
|
||||
envsubst < ./stack-k8s.yaml.template | kubectl apply -f -
|
||||
envsubst < ./ingress-k8s.yaml.template | kubectl apply -f -
|
||||
|
||||
envsubst < ./ui-k8s.yaml.template | kubectl apply -f -
|
|
@ -1,66 +0,0 @@
|
|||
apiVersion: v1
|
||||
kind: PersistentVolumeClaim
|
||||
metadata:
|
||||
name: chromadb-pvc
|
||||
spec:
|
||||
accessModes:
|
||||
- ReadWriteOnce
|
||||
resources:
|
||||
requests:
|
||||
storage: 20Gi
|
||||
---
|
||||
apiVersion: apps/v1
|
||||
kind: Deployment
|
||||
metadata:
|
||||
name: chromadb
|
||||
spec:
|
||||
replicas: 1
|
||||
selector:
|
||||
matchLabels:
|
||||
app: chromadb
|
||||
template:
|
||||
metadata:
|
||||
labels:
|
||||
app: chromadb
|
||||
spec:
|
||||
containers:
|
||||
- name: chromadb
|
||||
image: chromadb/chroma:latest
|
||||
ports:
|
||||
- containerPort: 6000
|
||||
env:
|
||||
- name: CHROMA_HOST
|
||||
value: "0.0.0.0"
|
||||
- name: CHROMA_PORT
|
||||
value: "6000"
|
||||
- name: PERSIST_DIRECTORY
|
||||
value: "/chroma/chroma"
|
||||
- name: CHROMA_DB_IMPL
|
||||
value: "duckdb+parquet"
|
||||
resources:
|
||||
requests:
|
||||
memory: "512Mi"
|
||||
cpu: "250m"
|
||||
limits:
|
||||
memory: "2Gi"
|
||||
cpu: "1000m"
|
||||
volumeMounts:
|
||||
- name: chromadb-storage
|
||||
mountPath: /chroma/chroma
|
||||
volumes:
|
||||
- name: chromadb-storage
|
||||
persistentVolumeClaim:
|
||||
claimName: chromadb-pvc
|
||||
---
|
||||
apiVersion: v1
|
||||
kind: Service
|
||||
metadata:
|
||||
name: chromadb
|
||||
spec:
|
||||
selector:
|
||||
app: chromadb
|
||||
ports:
|
||||
- protocol: TCP
|
||||
port: 6000
|
||||
targetPort: 6000
|
||||
type: ClusterIP
|
|
@ -1,7 +0,0 @@
|
|||
apiVersion: v1
|
||||
kind: Secret
|
||||
metadata:
|
||||
name: hf-token-secret
|
||||
type: Opaque
|
||||
data:
|
||||
token: ${HF_TOKEN_BASE64}
|
|
@ -1,17 +0,0 @@
|
|||
apiVersion: v1
|
||||
kind: Service
|
||||
metadata:
|
||||
name: llama-stack-service
|
||||
spec:
|
||||
type: LoadBalancer
|
||||
selector:
|
||||
app.kubernetes.io/name: llama-stack
|
||||
ports:
|
||||
- name: llama-stack-api
|
||||
port: 8321
|
||||
targetPort: 8321
|
||||
protocol: TCP
|
||||
- name: llama-stack-ui
|
||||
port: 8322
|
||||
targetPort: 8322
|
||||
protocol: TCP
|
|
@ -1,66 +0,0 @@
|
|||
apiVersion: v1
|
||||
kind: PersistentVolumeClaim
|
||||
metadata:
|
||||
name: postgres-pvc
|
||||
spec:
|
||||
accessModes:
|
||||
- ReadWriteOnce
|
||||
resources:
|
||||
requests:
|
||||
storage: 10Gi
|
||||
---
|
||||
apiVersion: apps/v1
|
||||
kind: Deployment
|
||||
metadata:
|
||||
name: postgres
|
||||
spec:
|
||||
replicas: 1
|
||||
selector:
|
||||
matchLabels:
|
||||
app.kubernetes.io/name: postgres
|
||||
template:
|
||||
metadata:
|
||||
labels:
|
||||
app.kubernetes.io/name: postgres
|
||||
spec:
|
||||
containers:
|
||||
- name: postgres
|
||||
image: postgres:15
|
||||
env:
|
||||
- name: POSTGRES_DB
|
||||
value: "${POSTGRES_DB}"
|
||||
- name: POSTGRES_USER
|
||||
value: "${POSTGRES_USER}"
|
||||
- name: POSTGRES_PASSWORD
|
||||
value: "${POSTGRES_PASSWORD}"
|
||||
- name: PGDATA
|
||||
value: "/var/lib/postgresql/data/pgdata"
|
||||
ports:
|
||||
- containerPort: 5432
|
||||
resources:
|
||||
requests:
|
||||
memory: "512Mi"
|
||||
cpu: "250m"
|
||||
limits:
|
||||
memory: "1Gi"
|
||||
cpu: "500m"
|
||||
volumeMounts:
|
||||
- name: postgres-storage
|
||||
mountPath: /var/lib/postgresql/data
|
||||
volumes:
|
||||
- name: postgres-storage
|
||||
persistentVolumeClaim:
|
||||
claimName: postgres-pvc
|
||||
---
|
||||
apiVersion: v1
|
||||
kind: Service
|
||||
metadata:
|
||||
name: postgres-server
|
||||
spec:
|
||||
selector:
|
||||
app.kubernetes.io/name: postgres
|
||||
ports:
|
||||
- protocol: TCP
|
||||
port: 5432
|
||||
targetPort: 5432
|
||||
type: ClusterIP
|
|
@ -1,56 +0,0 @@
|
|||
apiVersion: v1
|
||||
data:
|
||||
stack_run_config.yaml: "version: '2'\nimage_name: kubernetes-demo\napis:\n- agents\n-
|
||||
inference\n- files\n- safety\n- telemetry\n- tool_runtime\n- vector_io\nproviders:\n
|
||||
\ inference:\n - provider_id: vllm-inference\n provider_type: remote::vllm\n
|
||||
\ config:\n url: ${env.VLLM_URL:=http://localhost:8000/v1}\n max_tokens:
|
||||
${env.VLLM_MAX_TOKENS:=4096}\n api_token: ${env.VLLM_API_TOKEN:=fake}\n tls_verify:
|
||||
${env.VLLM_TLS_VERIFY:=true}\n - provider_id: vllm-safety\n provider_type:
|
||||
remote::vllm\n config:\n url: ${env.VLLM_SAFETY_URL:=http://localhost:8000/v1}\n
|
||||
\ max_tokens: ${env.VLLM_MAX_TOKENS:=4096}\n api_token: ${env.VLLM_API_TOKEN:=fake}\n
|
||||
\ tls_verify: ${env.VLLM_TLS_VERIFY:=true}\n - provider_id: sentence-transformers\n
|
||||
\ provider_type: inline::sentence-transformers\n config: {}\n vector_io:\n
|
||||
\ - provider_id: ${env.ENABLE_CHROMADB:+chromadb}\n provider_type: remote::chromadb\n
|
||||
\ config:\n url: ${env.CHROMADB_URL:=}\n kvstore:\n type: postgres\n
|
||||
\ host: ${env.POSTGRES_HOST:=localhost}\n port: ${env.POSTGRES_PORT:=5432}\n
|
||||
\ db: ${env.POSTGRES_DB:=llamastack}\n user: ${env.POSTGRES_USER:=llamastack}\n
|
||||
\ password: ${env.POSTGRES_PASSWORD:=llamastack}\n files:\n - provider_id:
|
||||
meta-reference-files\n provider_type: inline::localfs\n config:\n storage_dir:
|
||||
${env.FILES_STORAGE_DIR:=~/.llama/distributions/starter/files}\n metadata_store:\n
|
||||
\ type: sqlite\n db_path: ${env.SQLITE_STORE_DIR:=~/.llama/distributions/starter}/files_metadata.db
|
||||
\ \n safety:\n - provider_id: llama-guard\n provider_type: inline::llama-guard\n
|
||||
\ config:\n excluded_categories: []\n agents:\n - provider_id: meta-reference\n
|
||||
\ provider_type: inline::meta-reference\n config:\n persistence_store:\n
|
||||
\ type: postgres\n host: ${env.POSTGRES_HOST:=localhost}\n port:
|
||||
${env.POSTGRES_PORT:=5432}\n db: ${env.POSTGRES_DB:=llamastack}\n user:
|
||||
${env.POSTGRES_USER:=llamastack}\n password: ${env.POSTGRES_PASSWORD:=llamastack}\n
|
||||
\ responses_store:\n type: postgres\n host: ${env.POSTGRES_HOST:=localhost}\n
|
||||
\ port: ${env.POSTGRES_PORT:=5432}\n db: ${env.POSTGRES_DB:=llamastack}\n
|
||||
\ user: ${env.POSTGRES_USER:=llamastack}\n password: ${env.POSTGRES_PASSWORD:=llamastack}\n
|
||||
\ telemetry:\n - provider_id: meta-reference\n provider_type: inline::meta-reference\n
|
||||
\ config:\n service_name: \"${env.OTEL_SERVICE_NAME:=\\u200B}\"\n sinks:
|
||||
${env.TELEMETRY_SINKS:=console}\n tool_runtime:\n - provider_id: brave-search\n
|
||||
\ provider_type: remote::brave-search\n config:\n api_key: ${env.BRAVE_SEARCH_API_KEY:+}\n
|
||||
\ max_results: 3\n - provider_id: tavily-search\n provider_type: remote::tavily-search\n
|
||||
\ config:\n api_key: ${env.TAVILY_SEARCH_API_KEY:+}\n max_results:
|
||||
3\n - provider_id: rag-runtime\n provider_type: inline::rag-runtime\n config:
|
||||
{}\n - provider_id: model-context-protocol\n provider_type: remote::model-context-protocol\n
|
||||
\ config: {}\nmetadata_store:\n type: postgres\n host: ${env.POSTGRES_HOST:=localhost}\n
|
||||
\ port: ${env.POSTGRES_PORT:=5432}\n db: ${env.POSTGRES_DB:=llamastack}\n user:
|
||||
${env.POSTGRES_USER:=llamastack}\n password: ${env.POSTGRES_PASSWORD:=llamastack}\n
|
||||
\ table_name: llamastack_kvstore\ninference_store:\n type: postgres\n host:
|
||||
${env.POSTGRES_HOST:=localhost}\n port: ${env.POSTGRES_PORT:=5432}\n db: ${env.POSTGRES_DB:=llamastack}\n
|
||||
\ user: ${env.POSTGRES_USER:=llamastack}\n password: ${env.POSTGRES_PASSWORD:=llamastack}\nmodels:\n-
|
||||
metadata:\n embedding_dimension: 384\n model_id: all-MiniLM-L6-v2\n provider_id:
|
||||
sentence-transformers\n model_type: embedding\n- metadata: {}\n model_id: ${env.INFERENCE_MODEL}\n
|
||||
\ provider_id: vllm-inference\n model_type: llm\n- metadata: {}\n model_id:
|
||||
${env.SAFETY_MODEL:=meta-llama/Llama-Guard-3-1B}\n provider_id: vllm-safety\n
|
||||
\ model_type: llm\nshields:\n- shield_id: ${env.SAFETY_MODEL:=meta-llama/Llama-Guard-3-1B}\nvector_dbs:
|
||||
[]\ndatasets: []\nscoring_fns: []\nbenchmarks: []\ntool_groups:\n- toolgroup_id:
|
||||
builtin::websearch\n provider_id: tavily-search\n- toolgroup_id: builtin::rag\n
|
||||
\ provider_id: rag-runtime\nserver:\n port: 8321\n auth:\n provider_config:\n
|
||||
\ type: github_token\n"
|
||||
kind: ConfigMap
|
||||
metadata:
|
||||
creationTimestamp: null
|
||||
name: llama-stack-config
|
|
@ -1,69 +0,0 @@
|
|||
apiVersion: v1
|
||||
kind: PersistentVolumeClaim
|
||||
metadata:
|
||||
name: llama-pvc
|
||||
spec:
|
||||
accessModes:
|
||||
- ReadWriteOnce
|
||||
resources:
|
||||
requests:
|
||||
storage: 1Gi
|
||||
---
|
||||
apiVersion: apps/v1
|
||||
kind: Deployment
|
||||
metadata:
|
||||
name: llama-stack-server
|
||||
spec:
|
||||
replicas: 1
|
||||
selector:
|
||||
matchLabels:
|
||||
app.kubernetes.io/name: llama-stack
|
||||
app.kubernetes.io/component: server
|
||||
template:
|
||||
metadata:
|
||||
labels:
|
||||
app.kubernetes.io/name: llama-stack
|
||||
app.kubernetes.io/component: server
|
||||
spec:
|
||||
containers:
|
||||
- name: llama-stack
|
||||
image: llamastack/distribution-starter:latest
|
||||
imagePullPolicy: Always # since we have specified latest instead of a version
|
||||
env:
|
||||
- name: ENABLE_CHROMADB
|
||||
value: "true"
|
||||
- name: CHROMADB_URL
|
||||
value: http://chromadb.default.svc.cluster.local:6000
|
||||
- name: VLLM_URL
|
||||
value: http://vllm-server.default.svc.cluster.local:8000/v1
|
||||
- name: VLLM_MAX_TOKENS
|
||||
value: "3072"
|
||||
- name: VLLM_SAFETY_URL
|
||||
value: http://vllm-server-safety.default.svc.cluster.local:8001/v1
|
||||
- name: VLLM_TLS_VERIFY
|
||||
value: "false"
|
||||
- name: POSTGRES_HOST
|
||||
value: postgres-server.default.svc.cluster.local
|
||||
- name: POSTGRES_PORT
|
||||
value: "5432"
|
||||
- name: INFERENCE_MODEL
|
||||
value: "${INFERENCE_MODEL}"
|
||||
- name: SAFETY_MODEL
|
||||
value: "${SAFETY_MODEL}"
|
||||
- name: TAVILY_SEARCH_API_KEY
|
||||
value: "${TAVILY_SEARCH_API_KEY}"
|
||||
command: ["python", "-m", "llama_stack.core.server.server", "/etc/config/stack_run_config.yaml", "--port", "8321"]
|
||||
ports:
|
||||
- containerPort: 8321
|
||||
volumeMounts:
|
||||
- name: llama-storage
|
||||
mountPath: /root/.llama
|
||||
- name: llama-config
|
||||
mountPath: /etc/config
|
||||
volumes:
|
||||
- name: llama-storage
|
||||
persistentVolumeClaim:
|
||||
claimName: llama-pvc
|
||||
- name: llama-config
|
||||
configMap:
|
||||
name: llama-stack-config
|
|
@ -1,140 +0,0 @@
|
|||
version: '2'
|
||||
image_name: kubernetes-demo
|
||||
apis:
|
||||
- agents
|
||||
- inference
|
||||
- files
|
||||
- safety
|
||||
- telemetry
|
||||
- tool_runtime
|
||||
- vector_io
|
||||
providers:
|
||||
inference:
|
||||
- provider_id: vllm-inference
|
||||
provider_type: remote::vllm
|
||||
config:
|
||||
url: ${env.VLLM_URL:=http://localhost:8000/v1}
|
||||
max_tokens: ${env.VLLM_MAX_TOKENS:=4096}
|
||||
api_token: ${env.VLLM_API_TOKEN:=fake}
|
||||
tls_verify: ${env.VLLM_TLS_VERIFY:=true}
|
||||
- provider_id: vllm-safety
|
||||
provider_type: remote::vllm
|
||||
config:
|
||||
url: ${env.VLLM_SAFETY_URL:=http://localhost:8000/v1}
|
||||
max_tokens: ${env.VLLM_MAX_TOKENS:=4096}
|
||||
api_token: ${env.VLLM_API_TOKEN:=fake}
|
||||
tls_verify: ${env.VLLM_TLS_VERIFY:=true}
|
||||
- provider_id: sentence-transformers
|
||||
provider_type: inline::sentence-transformers
|
||||
config: {}
|
||||
vector_io:
|
||||
- provider_id: ${env.ENABLE_CHROMADB:+chromadb}
|
||||
provider_type: remote::chromadb
|
||||
config:
|
||||
url: ${env.CHROMADB_URL:=}
|
||||
kvstore:
|
||||
type: postgres
|
||||
host: ${env.POSTGRES_HOST:=localhost}
|
||||
port: ${env.POSTGRES_PORT:=5432}
|
||||
db: ${env.POSTGRES_DB:=llamastack}
|
||||
user: ${env.POSTGRES_USER:=llamastack}
|
||||
password: ${env.POSTGRES_PASSWORD:=llamastack}
|
||||
files:
|
||||
- provider_id: meta-reference-files
|
||||
provider_type: inline::localfs
|
||||
config:
|
||||
storage_dir: ${env.FILES_STORAGE_DIR:=~/.llama/distributions/starter/files}
|
||||
metadata_store:
|
||||
type: sqlite
|
||||
db_path: ${env.SQLITE_STORE_DIR:=~/.llama/distributions/starter}/files_metadata.db
|
||||
safety:
|
||||
- provider_id: llama-guard
|
||||
provider_type: inline::llama-guard
|
||||
config:
|
||||
excluded_categories: []
|
||||
agents:
|
||||
- provider_id: meta-reference
|
||||
provider_type: inline::meta-reference
|
||||
config:
|
||||
persistence_store:
|
||||
type: postgres
|
||||
host: ${env.POSTGRES_HOST:=localhost}
|
||||
port: ${env.POSTGRES_PORT:=5432}
|
||||
db: ${env.POSTGRES_DB:=llamastack}
|
||||
user: ${env.POSTGRES_USER:=llamastack}
|
||||
password: ${env.POSTGRES_PASSWORD:=llamastack}
|
||||
responses_store:
|
||||
type: postgres
|
||||
host: ${env.POSTGRES_HOST:=localhost}
|
||||
port: ${env.POSTGRES_PORT:=5432}
|
||||
db: ${env.POSTGRES_DB:=llamastack}
|
||||
user: ${env.POSTGRES_USER:=llamastack}
|
||||
password: ${env.POSTGRES_PASSWORD:=llamastack}
|
||||
telemetry:
|
||||
- provider_id: meta-reference
|
||||
provider_type: inline::meta-reference
|
||||
config:
|
||||
service_name: "${env.OTEL_SERVICE_NAME:=\u200B}"
|
||||
sinks: ${env.TELEMETRY_SINKS:=console}
|
||||
tool_runtime:
|
||||
- provider_id: brave-search
|
||||
provider_type: remote::brave-search
|
||||
config:
|
||||
api_key: ${env.BRAVE_SEARCH_API_KEY:+}
|
||||
max_results: 3
|
||||
- provider_id: tavily-search
|
||||
provider_type: remote::tavily-search
|
||||
config:
|
||||
api_key: ${env.TAVILY_SEARCH_API_KEY:+}
|
||||
max_results: 3
|
||||
- provider_id: rag-runtime
|
||||
provider_type: inline::rag-runtime
|
||||
config: {}
|
||||
- provider_id: model-context-protocol
|
||||
provider_type: remote::model-context-protocol
|
||||
config: {}
|
||||
metadata_store:
|
||||
type: postgres
|
||||
host: ${env.POSTGRES_HOST:=localhost}
|
||||
port: ${env.POSTGRES_PORT:=5432}
|
||||
db: ${env.POSTGRES_DB:=llamastack}
|
||||
user: ${env.POSTGRES_USER:=llamastack}
|
||||
password: ${env.POSTGRES_PASSWORD:=llamastack}
|
||||
table_name: llamastack_kvstore
|
||||
inference_store:
|
||||
type: postgres
|
||||
host: ${env.POSTGRES_HOST:=localhost}
|
||||
port: ${env.POSTGRES_PORT:=5432}
|
||||
db: ${env.POSTGRES_DB:=llamastack}
|
||||
user: ${env.POSTGRES_USER:=llamastack}
|
||||
password: ${env.POSTGRES_PASSWORD:=llamastack}
|
||||
models:
|
||||
- metadata:
|
||||
embedding_dimension: 384
|
||||
model_id: all-MiniLM-L6-v2
|
||||
provider_id: sentence-transformers
|
||||
model_type: embedding
|
||||
- metadata: {}
|
||||
model_id: ${env.INFERENCE_MODEL}
|
||||
provider_id: vllm-inference
|
||||
model_type: llm
|
||||
- metadata: {}
|
||||
model_id: ${env.SAFETY_MODEL:=meta-llama/Llama-Guard-3-1B}
|
||||
provider_id: vllm-safety
|
||||
model_type: llm
|
||||
shields:
|
||||
- shield_id: ${env.SAFETY_MODEL:=meta-llama/Llama-Guard-3-1B}
|
||||
vector_dbs: []
|
||||
datasets: []
|
||||
scoring_fns: []
|
||||
benchmarks: []
|
||||
tool_groups:
|
||||
- toolgroup_id: builtin::websearch
|
||||
provider_id: tavily-search
|
||||
- toolgroup_id: builtin::rag
|
||||
provider_id: rag-runtime
|
||||
server:
|
||||
port: 8321
|
||||
auth:
|
||||
provider_config:
|
||||
type: github_token
|
|
@ -1,68 +0,0 @@
|
|||
apiVersion: apps/v1
|
||||
kind: Deployment
|
||||
metadata:
|
||||
name: llama-stack-ui
|
||||
labels:
|
||||
app.kubernetes.io/name: llama-stack
|
||||
app.kubernetes.io/component: ui
|
||||
spec:
|
||||
replicas: 1
|
||||
selector:
|
||||
matchLabels:
|
||||
app.kubernetes.io/name: llama-stack
|
||||
app.kubernetes.io/component: ui
|
||||
template:
|
||||
metadata:
|
||||
labels:
|
||||
app.kubernetes.io/name: llama-stack
|
||||
app.kubernetes.io/component: ui
|
||||
spec:
|
||||
containers:
|
||||
- name: llama-stack-ui
|
||||
image: node:18-alpine
|
||||
command: ["/bin/sh"]
|
||||
env:
|
||||
- name: LLAMA_STACK_BACKEND_URL
|
||||
value: "http://llama-stack-service:8321"
|
||||
- name: LLAMA_STACK_UI_PORT
|
||||
value: "8322"
|
||||
- name: GITHUB_CLIENT_ID
|
||||
value: "${GITHUB_CLIENT_ID}"
|
||||
- name: GITHUB_CLIENT_SECRET
|
||||
value: "${GITHUB_CLIENT_SECRET}"
|
||||
- name: NEXTAUTH_URL
|
||||
value: "${LLAMA_STACK_UI_URL}:8322"
|
||||
args:
|
||||
- -c
|
||||
- |
|
||||
# Install git (not included in alpine by default)
|
||||
apk add --no-cache git
|
||||
|
||||
# Clone the repository
|
||||
echo "Cloning repository..."
|
||||
git clone https://github.com/meta-llama/llama-stack.git /app
|
||||
|
||||
# Navigate to the UI directory
|
||||
echo "Navigating to UI directory..."
|
||||
cd /app/llama_stack/ui
|
||||
|
||||
# Check if package.json exists
|
||||
if [ ! -f "package.json" ]; then
|
||||
echo "ERROR: package.json not found in $(pwd)"
|
||||
ls -la
|
||||
exit 1
|
||||
fi
|
||||
|
||||
# Install dependencies with verbose output
|
||||
echo "Installing dependencies..."
|
||||
npm install --verbose
|
||||
|
||||
# Verify next is installed
|
||||
echo "Checking if next is installed..."
|
||||
npx next --version || echo "Next.js not found, checking node_modules..."
|
||||
ls -la node_modules/.bin/ | grep next || echo "No next binary found"
|
||||
|
||||
npm run dev
|
||||
ports:
|
||||
- containerPort: 8322
|
||||
workingDir: /app
|
|
@ -1,70 +0,0 @@
|
|||
apiVersion: v1
|
||||
kind: PersistentVolumeClaim
|
||||
metadata:
|
||||
name: vllm-models
|
||||
spec:
|
||||
accessModes:
|
||||
- ReadWriteOnce
|
||||
volumeMode: Filesystem
|
||||
resources:
|
||||
requests:
|
||||
storage: 50Gi
|
||||
---
|
||||
apiVersion: apps/v1
|
||||
kind: Deployment
|
||||
metadata:
|
||||
name: vllm-server
|
||||
spec:
|
||||
replicas: 1
|
||||
selector:
|
||||
matchLabels:
|
||||
app.kubernetes.io/name: vllm
|
||||
template:
|
||||
metadata:
|
||||
labels:
|
||||
app.kubernetes.io/name: vllm
|
||||
workload-type: inference
|
||||
spec:
|
||||
nodeSelector:
|
||||
eks.amazonaws.com/nodegroup: gpu
|
||||
containers:
|
||||
- name: vllm
|
||||
image: vllm/vllm-openai:latest
|
||||
command: ["/bin/sh", "-c"]
|
||||
args:
|
||||
- "vllm serve ${INFERENCE_MODEL} --dtype float16 --enforce-eager --max-model-len 4096 --gpu-memory-utilization 0.6 --enable-auto-tool-choice --tool-call-parser llama4_pythonic"
|
||||
env:
|
||||
- name: INFERENCE_MODEL
|
||||
value: "${INFERENCE_MODEL}"
|
||||
- name: HUGGING_FACE_HUB_TOKEN
|
||||
valueFrom:
|
||||
secretKeyRef:
|
||||
name: hf-token-secret
|
||||
key: token
|
||||
ports:
|
||||
- containerPort: 8000
|
||||
resources:
|
||||
limits:
|
||||
nvidia.com/gpu: 1
|
||||
requests:
|
||||
nvidia.com/gpu: 1
|
||||
volumeMounts:
|
||||
- name: llama-storage
|
||||
mountPath: /root/.cache/huggingface
|
||||
volumes:
|
||||
- name: llama-storage
|
||||
persistentVolumeClaim:
|
||||
claimName: vllm-models
|
||||
---
|
||||
apiVersion: v1
|
||||
kind: Service
|
||||
metadata:
|
||||
name: vllm-server
|
||||
spec:
|
||||
selector:
|
||||
app.kubernetes.io/name: vllm
|
||||
ports:
|
||||
- protocol: TCP
|
||||
port: 8000
|
||||
targetPort: 8000
|
||||
type: ClusterIP
|
|
@ -1,71 +0,0 @@
|
|||
apiVersion: v1
|
||||
kind: PersistentVolumeClaim
|
||||
metadata:
|
||||
name: vllm-models-safety
|
||||
spec:
|
||||
accessModes:
|
||||
- ReadWriteOnce
|
||||
volumeMode: Filesystem
|
||||
resources:
|
||||
requests:
|
||||
storage: 30Gi
|
||||
---
|
||||
apiVersion: apps/v1
|
||||
kind: Deployment
|
||||
metadata:
|
||||
name: vllm-server-safety
|
||||
spec:
|
||||
replicas: 1
|
||||
selector:
|
||||
matchLabels:
|
||||
app.kubernetes.io/name: vllm-safety
|
||||
template:
|
||||
metadata:
|
||||
labels:
|
||||
app.kubernetes.io/name: vllm-safety
|
||||
workload-type: inference
|
||||
spec:
|
||||
nodeSelector:
|
||||
eks.amazonaws.com/nodegroup: gpu
|
||||
containers:
|
||||
- name: vllm-safety
|
||||
image: vllm/vllm-openai:latest
|
||||
command: ["/bin/sh", "-c"]
|
||||
args: [
|
||||
"vllm serve ${SAFETY_MODEL} --dtype float16 --enforce-eager --max-model-len 4096 --port 8001 --gpu-memory-utilization 0.3"
|
||||
]
|
||||
env:
|
||||
- name: SAFETY_MODEL
|
||||
value: "${SAFETY_MODEL}"
|
||||
- name: HUGGING_FACE_HUB_TOKEN
|
||||
valueFrom:
|
||||
secretKeyRef:
|
||||
name: hf-token-secret
|
||||
key: token
|
||||
ports:
|
||||
- containerPort: 8001
|
||||
resources:
|
||||
limits:
|
||||
nvidia.com/gpu: 1
|
||||
requests:
|
||||
nvidia.com/gpu: 1
|
||||
volumeMounts:
|
||||
- name: llama-storage
|
||||
mountPath: /root/.cache/huggingface
|
||||
volumes:
|
||||
- name: llama-storage
|
||||
persistentVolumeClaim:
|
||||
claimName: vllm-models-safety
|
||||
---
|
||||
apiVersion: v1
|
||||
kind: Service
|
||||
metadata:
|
||||
name: vllm-server-safety
|
||||
spec:
|
||||
selector:
|
||||
app.kubernetes.io/name: vllm-safety
|
||||
ports:
|
||||
- protocol: TCP
|
||||
port: 8001
|
||||
targetPort: 8001
|
||||
type: ClusterIP
|
|
@ -1,127 +0,0 @@
|
|||
# Available Distributions
|
||||
|
||||
Llama Stack provides several pre-configured distributions to help you get started quickly. Choose the distribution that best fits your hardware and use case.
|
||||
|
||||
## Quick Reference
|
||||
|
||||
| Distribution | Use Case | Hardware Requirements | Provider |
|
||||
|--------------|----------|----------------------|----------|
|
||||
| `distribution-starter` | General purpose, prototyping | Any (CPU/GPU) | Ollama, Remote APIs |
|
||||
| `distribution-meta-reference-gpu` | High-performance inference | GPU required | Local GPU inference |
|
||||
| Remote-hosted | Production, managed service | None | Partner providers |
|
||||
| iOS/Android SDK | Mobile applications | Mobile device | On-device inference |
|
||||
|
||||
## Choose Your Distribution
|
||||
|
||||
### 🚀 Getting Started (Recommended for Beginners)
|
||||
|
||||
**Use `distribution-starter` if you want to:**
|
||||
- Prototype quickly without GPU requirements
|
||||
- Use remote inference providers (Fireworks, Together, vLLM etc.)
|
||||
- Run locally with Ollama for development
|
||||
|
||||
```bash
|
||||
docker pull llamastack/distribution-starter
|
||||
```
|
||||
|
||||
**Guides:** [Starter Distribution Guide](self_hosted_distro/starter)
|
||||
|
||||
### 🖥️ Self-Hosted with GPU
|
||||
|
||||
**Use `distribution-meta-reference-gpu` if you:**
|
||||
- Have access to GPU hardware
|
||||
- Want maximum performance and control
|
||||
- Need to run inference locally
|
||||
|
||||
```bash
|
||||
docker pull llamastack/distribution-meta-reference-gpu
|
||||
```
|
||||
|
||||
**Guides:** [Meta Reference GPU Guide](self_hosted_distro/meta-reference-gpu)
|
||||
|
||||
### 🖥️ Self-Hosted with NVIDIA NeMo Microservices
|
||||
|
||||
**Use `nvidia` if you:**
|
||||
- Want to use Llama Stack with NVIDIA NeMo Microservices
|
||||
|
||||
**Guides:** [NVIDIA Distribution Guide](self_hosted_distro/nvidia)
|
||||
|
||||
### ☁️ Managed Hosting
|
||||
|
||||
**Use remote-hosted endpoints if you:**
|
||||
- Don't want to manage infrastructure
|
||||
- Need production-ready reliability
|
||||
- Prefer managed services
|
||||
|
||||
**Partners:** [Fireworks.ai](https://fireworks.ai) and [Together.xyz](https://together.xyz)
|
||||
|
||||
**Guides:** [Remote-Hosted Endpoints](remote_hosted_distro/index)
|
||||
|
||||
### 📱 Mobile Development
|
||||
|
||||
**Use mobile SDKs if you:**
|
||||
- Are building iOS or Android applications
|
||||
- Need on-device inference capabilities
|
||||
- Want offline functionality
|
||||
|
||||
- [iOS SDK](ondevice_distro/ios_sdk)
|
||||
- [Android SDK](ondevice_distro/android_sdk)
|
||||
|
||||
### 🔧 Custom Solutions
|
||||
|
||||
**Build your own distribution if:**
|
||||
- None of the above fit your specific needs
|
||||
- You need custom configurations
|
||||
- You want to optimize for your specific use case
|
||||
|
||||
**Guides:** [Building Custom Distributions](building_distro.md)
|
||||
|
||||
## Detailed Documentation
|
||||
|
||||
### Self-Hosted Distributions
|
||||
|
||||
```{toctree}
|
||||
:maxdepth: 1
|
||||
|
||||
self_hosted_distro/starter
|
||||
self_hosted_distro/meta-reference-gpu
|
||||
```
|
||||
|
||||
### Remote-Hosted Solutions
|
||||
|
||||
```{toctree}
|
||||
:maxdepth: 1
|
||||
|
||||
remote_hosted_distro/index
|
||||
```
|
||||
|
||||
### Mobile SDKs
|
||||
|
||||
```{toctree}
|
||||
:maxdepth: 1
|
||||
|
||||
ondevice_distro/ios_sdk
|
||||
ondevice_distro/android_sdk
|
||||
```
|
||||
|
||||
## Decision Flow
|
||||
|
||||
```mermaid
|
||||
graph TD
|
||||
A[What's your use case?] --> B{Need mobile app?}
|
||||
B -->|Yes| C[Use Mobile SDKs]
|
||||
B -->|No| D{Have GPU hardware?}
|
||||
D -->|Yes| E[Use Meta Reference GPU]
|
||||
D -->|No| F{Want managed hosting?}
|
||||
F -->|Yes| G[Use Remote-Hosted]
|
||||
F -->|No| H[Use Starter Distribution]
|
||||
```
|
||||
|
||||
## Next Steps
|
||||
|
||||
1. **Choose your distribution** from the options above
|
||||
2. **Follow the setup guide** for your selected distribution
|
||||
3. **Configure your providers** with API keys or local models
|
||||
4. **Start building** with Llama Stack!
|
||||
|
||||
For help choosing or troubleshooting, check our [Getting Started Guide](../getting_started/index.md) or [Community Support](https://github.com/llama-stack/llama-stack/discussions).
|
|
@ -1,262 +0,0 @@
|
|||
# Llama Stack Client Kotlin API Library
|
||||
|
||||
We are excited to share a guide for a Kotlin library that brings the benefits of Llama Stack to your Android device. This library is a set of SDKs that provide a simple and effective way to integrate AI capabilities into your Android app, whether you are using local (on-device) or remote inference.
|
||||
|
||||
Features:
|
||||
- Local Inferencing: Run Llama models purely on-device with real-time processing. We currently utilize ExecuTorch as the local inference distributor and may support others in the future.
|
||||
- [ExecuTorch](https://github.com/pytorch/executorch/tree/main) is a complete end-to-end solution within the PyTorch framework for inferencing capabilities on-device with high portability and seamless performance.
|
||||
- Remote Inferencing: Perform inferencing tasks remotely with Llama models hosted on a remote connection (or serverless localhost).
|
||||
- Simple Integration: With easy-to-use APIs, a developer can quickly integrate Llama Stack into their Android app. The difference between local and remote inferencing is minimal.
|
||||
|
||||
Latest Release Notes: [link](https://github.com/meta-llama/llama-stack-client-kotlin/tree/latest-release)
|
||||
|
||||
*Tagged releases are stable versions of the project. While we strive to maintain a stable main branch, it's not guaranteed to be free of bugs or issues.*
|
||||
|
||||
## Android Demo App
|
||||
Check out our demo app to see how to integrate Llama Stack into your Android app: [Android Demo App](https://github.com/meta-llama/llama-stack-client-kotlin/tree/latest-release/examples/android_app)
|
||||
|
||||
The key files in the app are `ExampleLlamaStackLocalInference.kt`, `ExampleLlamaStackRemoteInference.kts`, and `MainActivity.java`. Together with the surrounding business logic, the app shows how to use Llama Stack in both environments.
|
||||
|
||||
## Quick Start
|
||||
|
||||
### Add Dependencies
|
||||
#### Kotlin Library
|
||||
Add the following dependency in your `build.gradle.kts` file:
|
||||
```
|
||||
dependencies {
|
||||
implementation("com.llama.llamastack:llama-stack-client-kotlin:0.2.2")
|
||||
}
|
||||
```
|
||||
This will download the .jar files into your Gradle cache, in a directory like `~/.gradle/caches/modules-2/files-2.1/com.llama.llamastack/`
|
||||
|
||||
If you plan on doing remote inferencing, this is sufficient to get started.
|
||||
|
||||
#### Dependency for Local
|
||||
|
||||
For local inferencing, you must include the ExecuTorch library in your app.
|
||||
|
||||
Include the ExecuTorch library by:
|
||||
1. Download the `download-prebuilt-et-lib.sh` script file from the [llama-stack-client-kotlin-client-local](https://github.com/meta-llama/llama-stack-client-kotlin/tree/latest-release/llama-stack-client-kotlin-client-local/download-prebuilt-et-lib.sh) directory to your local machine.
|
||||
2. Move the script to the top level of your Android app where the `app` directory resides.
|
||||
3. Run `sh download-prebuilt-et-lib.sh` to create an `app/libs` directory and download the `executorch.aar` in that path. This generates an ExecuTorch library for the XNNPACK delegate.
|
||||
4. Add the `executorch.aar` dependency in your `build.gradle.kts` file:
|
||||
```
|
||||
dependencies {
|
||||
...
|
||||
implementation(files("libs/executorch.aar"))
|
||||
...
|
||||
}
|
||||
```
|
||||
|
||||
See other dependencies for the local RAG in Android app [README](https://github.com/meta-llama/llama-stack-client-kotlin/tree/latest-release/examples/android_app#quick-start).
|
||||
|
||||
## Llama Stack APIs in Your Android App
|
||||
Breaking down the demo app, this section will show the core pieces that are used to initialize and run inference with Llama Stack using the Kotlin library.
|
||||
|
||||
### Setup Remote Inferencing
|
||||
Start a Llama Stack server on localhost. Here is an example of how you can do this using the starter distribution with Fireworks.ai as the inference provider:
|
||||
```
|
||||
uv venv starter --python 3.12
|
||||
source starter/bin/activate # On Windows: starter\Scripts\activate
|
||||
pip install --no-cache llama-stack==0.2.2
|
||||
llama stack build --distro starter --image-type venv
|
||||
export FIREWORKS_API_KEY=<SOME_KEY>
|
||||
llama stack run starter --port 5050
|
||||
```
|
||||
|
||||
Ensure the Llama Stack server version is the same as the Kotlin SDK Library for maximum compatibility.
|
||||
|
||||
Other inference providers: [Table](../../index.md#supported-llama-stack-implementations)
|
||||
|
||||
How to set remote localhost in Demo App: [Settings](https://github.com/meta-llama/llama-stack-client-kotlin/tree/latest-release/examples/android_app#settings)
|
||||
|
||||
### Initialize the Client
|
||||
A client serves as the primary interface for interacting with a specific inference type and its associated parameters. Only after the client is initialized can you configure and run inferences.
|
||||
|
||||
<table>
|
||||
<tr>
|
||||
<th>Local Inference</th>
|
||||
<th>Remote Inference</th>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>
|
||||
|
||||
```
|
||||
client = LlamaStackClientLocalClient
|
||||
.builder()
|
||||
.modelPath(modelPath)
|
||||
.tokenizerPath(tokenizerPath)
|
||||
.temperature(temperature)
|
||||
.build()
|
||||
```
|
||||
</td>
|
||||
<td>
|
||||
|
||||
```
|
||||
// remoteURL is a string like "http://localhost:5050"
|
||||
client = LlamaStackClientOkHttpClient
|
||||
.builder()
|
||||
.baseUrl(remoteURL)
|
||||
.build()
|
||||
```
|
||||
</td>
|
||||
</tr>
|
||||
</table>
|
||||
|
||||
|
||||
### Run Inference
|
||||
With the Kotlin Library managing all the major operational logic, there are minimal to no changes when running simple chat inference for local or remote:
|
||||
|
||||
```
|
||||
val result = client!!.inference().chatCompletion(
|
||||
InferenceChatCompletionParams.builder()
|
||||
.modelId(modelName)
|
||||
.messages(listOfMessages)
|
||||
.build()
|
||||
)
|
||||
|
||||
// response contains string with response from model
|
||||
var response = result.asChatCompletionResponse().completionMessage().content().string();
|
||||
```
|
||||
|
||||
[Remote only] For inference with a streaming response:
|
||||
|
||||
```
|
||||
val result = client!!.inference().chatCompletionStreaming(
|
||||
InferenceChatCompletionParams.builder()
|
||||
.modelId(modelName)
|
||||
.messages(listOfMessages)
|
||||
.build()
|
||||
)
|
||||
|
||||
// Response can be received as an asChatCompletionResponseStreamChunk as part of a callback.
|
||||
// See Android demo app for a detailed implementation example.
|
||||
```
|
||||
|
||||
### Setup Custom Tool Calling
|
||||
|
||||
Android demo app for more details: [Custom Tool Calling](https://github.com/meta-llama/llama-stack-client-kotlin/tree/latest-release/examples/android_app#tool-calling)
|
||||
|
||||
## Advanced Users
|
||||
|
||||
The purpose of this section is to share more details with users that would like to dive deeper into the Llama Stack Kotlin Library. Whether you’re interested in contributing to the open source library, debugging or just want to learn more, this section is for you!
|
||||
|
||||
### Prerequisite
|
||||
|
||||
You must complete the following steps:
|
||||
1. Clone the repo (`git clone https://github.com/meta-llama/llama-stack-client-kotlin.git -b latest-release`)
|
||||
2. Port the appropriate ExecuTorch libraries over into your Llama Stack Kotlin library environment.
|
||||
```
|
||||
cd llama-stack-client-kotlin-client-local
|
||||
sh download-prebuilt-et-lib.sh --unzip
|
||||
```
|
||||
|
||||
Now you will notice that the `jni/`, `libs/`, and `AndroidManifest.xml` files from `executorch.aar` are present in the local module, which allows the local client module to use the ExecuTorch SDK.
|
||||
|
||||
### Building for Development/Debugging
|
||||
If you’d like to contribute to the Kotlin library, debug it, or just play around with it using print statements, run the following command in your terminal under the llama-stack-client-kotlin directory.
|
||||
|
||||
```
|
||||
sh build-libs.sh
|
||||
```
|
||||
|
||||
Output: .jar files located in the build-jars directory
|
||||
|
||||
Copy the .jar files over to the lib directory in your Android app. At the same time, make sure to remove the llama-stack-client-kotlin dependency from the `build.gradle.kts` file in your app (or in the demo app, if you are using it) to avoid having multiple Llama Stack client dependencies.
|
||||
|
||||
### Additional Options for Local Inferencing
|
||||
Currently we provide additional-properties support with local inferencing. To get the tokens/sec metric for each inference call, add the following code in your Android app after you run your chatCompletion inference function. The reference app has this implementation as well:
|
||||
```
|
||||
var tps = (result.asChatCompletionResponse()._additionalProperties()["tps"] as JsonNumber).value as Float
|
||||
```
|
||||
We will be adding more properties in the future.
|
||||
|
||||
### Additional Options for Remote Inferencing
|
||||
|
||||
#### Network options
|
||||
|
||||
##### Retries
|
||||
|
||||
Requests that experience certain errors are automatically retried 2 times by default, with a short exponential backoff. Connection errors (for example, due to a network connectivity problem), 408 Request Timeout, 409 Conflict, 429 Rate Limit, and >=500 Internal errors will all be retried by default.
|
||||
You can provide a `maxRetries` on the client builder to configure this:
|
||||
|
||||
```kotlin
|
||||
val client = LlamaStackClientOkHttpClient.builder()
|
||||
.fromEnv()
|
||||
.maxRetries(4)
|
||||
.build()
|
||||
```
|
||||
|
||||
##### Timeouts
|
||||
|
||||
Requests time out after 1 minute by default. You can configure this on the client builder:
|
||||
|
||||
```kotlin
|
||||
val client = LlamaStackClientOkHttpClient.builder()
|
||||
.fromEnv()
|
||||
.timeout(Duration.ofSeconds(30))
|
||||
.build()
|
||||
```
|
||||
|
||||
##### Proxies
|
||||
|
||||
Requests can be routed through a proxy. You can configure this on the client builder:
|
||||
|
||||
```kotlin
|
||||
val client = LlamaStackClientOkHttpClient.builder()
|
||||
.fromEnv()
|
||||
  .proxy(Proxy(
    Proxy.Type.HTTP,
    InetSocketAddress("proxy.com", 8080)
|
||||
))
|
||||
.build()
|
||||
```
|
||||
|
||||
##### Environments
|
||||
|
||||
Requests are made to the production environment by default. You can connect to other environments, like `sandbox`, via the client builder:
|
||||
|
||||
```kotlin
|
||||
val client = LlamaStackClientOkHttpClient.builder()
|
||||
.fromEnv()
|
||||
.sandbox()
|
||||
.build()
|
||||
```
|
||||
|
||||
### Error Handling
|
||||
This library throws exceptions in a single hierarchy for easy handling:
|
||||
|
||||
- **`LlamaStackClientException`** - Base exception for all exceptions
|
||||
|
||||
- **`LlamaStackClientServiceException`** - HTTP errors with a well-formed response body we were able to parse. The exception message and the `.debuggingRequestId()` will be set by the server.
|
||||
|
||||
| Status code | Exception |
| ------ | ----------------------------- |
| 400 | BadRequestException |
| 401 | AuthenticationException |
| 403 | PermissionDeniedException |
| 404 | NotFoundException |
| 422 | UnprocessableEntityException |
| 429 | RateLimitException |
| 5xx | InternalServerException |
| others | UnexpectedStatusCodeException |
|
||||
|
||||
- **`LlamaStackClientIoException`** - I/O networking errors
|
||||
- **`LlamaStackClientInvalidDataException`** - any other exceptions on the client side, e.g.:
|
||||
- We failed to serialize the request body
|
||||
- We failed to parse the response body (has access to response code and body)
|
||||
|
||||
## Reporting Issues
|
||||
If you encounter any bugs or issues while following this guide, please file an issue on our [GitHub issue tracker](https://github.com/meta-llama/llama-stack-client-kotlin/issues).
|
||||
|
||||
## Known Issues
|
||||
We're aware of the following issues and are working to resolve them:
|
||||
1. Streaming response is a work-in-progress for local and remote inference
|
||||
2. Due to #1, agents are not supported at this time; Llama Stack agents only work in streaming mode
|
||||
3. Changing to another model is a work in progress for local and remote platforms
|
||||
|
||||
## Thanks
|
||||
We'd like to extend our thanks to the ExecuTorch team for providing their support as we integrated ExecuTorch as one of the local inference distributors for Llama Stack. Check out the [ExecuTorch GitHub repo](https://github.com/pytorch/executorch/tree/main) for more information.
|
||||
|
||||
---
|
||||
|
||||
The API interface is generated using the OpenAPI standard with [Stainless](https://www.stainlessapi.com/).
|
|
@ -1,134 +0,0 @@
|
|||
# iOS SDK
|
||||
|
||||
We offer both remote and on-device use of Llama Stack in Swift via a single SDK [llama-stack-client-swift](https://github.com/meta-llama/llama-stack-client-swift/) that contains two components:
|
||||
1. LlamaStackClient for remote
|
||||
2. Local Inference for on-device
|
||||
|
||||
```{image} ../../../_static/remote_or_local.gif
|
||||
:alt: Seamlessly switching between local, on-device inference and remote hosted inference
|
||||
:width: 412px
|
||||
:align: center
|
||||
```
|
||||
|
||||
## Remote Only
|
||||
|
||||
If you don't want to run inference on-device, you can connect to any hosted Llama Stack distribution using only the first component, `LlamaStackClient`.
|
||||
|
||||
1. Add `https://github.com/meta-llama/llama-stack-client-swift/` as a Package Dependency in Xcode
|
||||
|
||||
2. Add `LlamaStackClient` as a framework to your app target
|
||||
|
||||
3. Call an API:
|
||||
|
||||
```swift
|
||||
import LlamaStackClient
|
||||
|
||||
let agents = RemoteAgents(url: URL(string: "http://localhost:8321")!)
|
||||
let request = Components.Schemas.CreateAgentTurnRequest(
|
||||
agent_id: agentId,
|
||||
messages: [
|
||||
.UserMessage(Components.Schemas.UserMessage(
|
||||
content: .case1("Hello Llama!"),
|
||||
role: .user
|
||||
))
|
||||
],
|
||||
session_id: self.agenticSystemSessionId,
|
||||
stream: true
|
||||
)
|
||||
|
||||
for try await chunk in try await agents.createTurn(request: request) {
|
||||
let payload = chunk.event.payload
|
||||
// ...
|
||||
```
|
||||
|
||||
Check out [iOSCalendarAssistant](https://github.com/meta-llama/llama-stack-client-swift/tree/main/examples/ios_calendar_assistant) for a complete app demo.
|
||||
|
||||
## LocalInference
|
||||
|
||||
LocalInference provides a local inference implementation powered by [executorch](https://github.com/pytorch/executorch/).
|
||||
|
||||
Llama Stack currently supports on-device inference for iOS with Android coming soon. You can run on-device inference on Android today using [executorch](https://github.com/pytorch/executorch/tree/main/examples/demo-apps/android/LlamaDemo), PyTorch’s on-device inference library.
|
||||
|
||||
The APIs *work the same as remote* – the only difference is you'll instead use the `LocalAgents` / `LocalInference` classes and pass in a `DispatchQueue`:
|
||||
|
||||
```swift
|
||||
private let runnerQueue = DispatchQueue(label: "org.llamastack.stacksummary")
|
||||
let inference = LocalInference(queue: runnerQueue)
|
||||
let agents = LocalAgents(inference: self.inference)
|
||||
```
|
||||
|
||||
Check out [iOSCalendarAssistantWithLocalInf](https://github.com/meta-llama/llama-stack-client-swift/tree/main/examples/ios_calendar_assistant) for a complete app demo.
|
||||
|
||||
### Installation
|
||||
|
||||
We're working on making LocalInference easier to set up. For now, you'll need to import it via `.xcframework`:
|
||||
|
||||
1. Clone the executorch submodule in this repo and its dependencies: `git submodule update --init --recursive`
|
||||
1. Install [CMake](https://cmake.org/) for the executorch build
|
||||
1. Drag `LocalInference.xcodeproj` into your project
|
||||
1. Add `LocalInference` as a framework in your app target
|
||||
|
||||
### Preparing a model
|
||||
|
||||
1. Prepare a `.pte` file [following the executorch docs](https://github.com/pytorch/executorch/blob/main/examples/models/llama/README.md#step-2-prepare-model)
|
||||
2. Bundle the `.pte` and `tokenizer.model` file into your app
|
||||
|
||||
We now support models quantized using SpinQuant and QAT-LoRA, which offer a significant performance boost (demo app on iPhone 13 Pro):
|
||||
|
||||
|
||||
| Llama 3.2 1B | Tokens / Second (total) | | Time-to-First-Token (sec) | |
|
||||
| :---- | :---- | :---- | :---- | :---- |
|
||||
| | Haiku | Paragraph | Haiku | Paragraph |
|
||||
| BF16 | 2.2 | 2.5 | 2.3 | 1.9 |
|
||||
| QAT+LoRA | 7.1 | 3.3 | 0.37 | 0.24 |
|
||||
| SpinQuant | 10.1 | 5.2 | 0.2 | 0.2 |
|
||||
|
||||
|
||||
### Using LocalInference
|
||||
|
||||
1. Instantiate LocalInference with a DispatchQueue. Optionally, pass it into your agents service:
|
||||
|
||||
```swift
|
||||
init () {
|
||||
runnerQueue = DispatchQueue(label: "org.meta.llamastack")
|
||||
inferenceService = LocalInferenceService(queue: runnerQueue)
|
||||
agentsService = LocalAgentsService(inference: inferenceService)
|
||||
}
|
||||
```
|
||||
|
||||
2. Before making any inference calls, load your model from your bundle:
|
||||
|
||||
```swift
|
||||
let mainBundle = Bundle.main
|
||||
inferenceService.loadModel(
|
||||
modelPath: mainBundle.url(forResource: "llama32_1b_spinquant", withExtension: "pte"),
|
||||
tokenizerPath: mainBundle.url(forResource: "tokenizer", withExtension: "model"),
|
||||
completion: {_ in } // use to handle load failures
|
||||
)
|
||||
```
|
||||
|
||||
3. Make inference calls (or agents calls) as you normally would with LlamaStack:
|
||||
|
||||
```
|
||||
for await chunk in try await agentsService.initAndCreateTurn(
|
||||
messages: [
|
||||
.UserMessage(Components.Schemas.UserMessage(
|
||||
content: .case1("Call functions as needed to handle any actions in the following text:\n\n" + text),
|
||||
role: .user))
|
||||
]
|
||||
) {
|
||||
```
|
||||
|
||||
### Troubleshooting
|
||||
|
||||
If you receive errors like "missing package product" or "invalid checksum", try cleaning the build folder and resetting the Swift package cache:
|
||||
|
||||
(Opt+Click) Product > Clean Build Folder Immediately
|
||||
|
||||
```
|
||||
rm -rf \
|
||||
~/Library/org.swift.swiftpm \
|
||||
~/Library/Caches/org.swift.swiftpm \
|
||||
~/Library/Caches/com.apple.dt.Xcode \
|
||||
~/Library/Developer/Xcode/DerivedData
|
||||
```
|
|
@ -1,20 +0,0 @@
|
|||
# Remote-Hosted Distributions
|
||||
|
||||
Remote-hosted distributions are hosted endpoints serving the Llama Stack API that you can connect to directly.
|
||||
|
||||
| Distribution | Endpoint | Inference | Agents | Memory | Safety | Telemetry |
|
||||
|-------------|----------|-----------|---------|---------|---------|------------|
|
||||
| Together | [https://llama-stack.together.ai](https://llama-stack.together.ai) | remote::together | meta-reference | remote::weaviate | meta-reference | meta-reference |
|
||||
| Fireworks | [https://llamastack-preview.fireworks.ai](https://llamastack-preview.fireworks.ai) | remote::fireworks | meta-reference | remote::weaviate | meta-reference | meta-reference |
|
||||
|
||||
## Connecting to Remote-Hosted Distributions
|
||||
|
||||
You can use `llama-stack-client` to interact with these endpoints. For example, to list the available models served by the Fireworks endpoint:
|
||||
|
||||
```bash
|
||||
$ pip install llama-stack-client
|
||||
$ llama-stack-client configure --endpoint https://llamastack-preview.fireworks.ai
|
||||
$ llama-stack-client models list
|
||||
```
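
The same check can be done from Python with the `llama-stack-client` package linked below. This is a short sketch; attribute names such as `identifier` may vary slightly between client versions.

```python
# Python equivalent of the CLI commands above; the endpoint URL comes from
# the table at the top of this page.
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="https://llamastack-preview.fireworks.ai")

for model in client.models.list():
    print(model.identifier)
```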
|
||||
|
||||
Check out the [llama-stack-client-python](https://github.com/meta-llama/llama-stack-client-python/blob/main/docs/cli_reference.md) repo for more details on how to use the `llama-stack-client` CLI. Check out [llama-stack-apps](https://github.com/meta-llama/llama-stack-apps/tree/main) for example applications built on top of Llama Stack.
|
|
@ -1,78 +0,0 @@
|
|||
---
|
||||
orphan: true
|
||||
---
|
||||
<!-- This file was auto-generated by distro_codegen.py, please edit source -->
|
||||
# watsonx Distribution
|
||||
|
||||
```{toctree}
|
||||
:maxdepth: 2
|
||||
:hidden:
|
||||
|
||||
self
|
||||
```
|
||||
|
||||
The `llamastack/distribution-watsonx` distribution consists of the following provider configurations.
|
||||
|
||||
| API | Provider(s) |
|
||||
|-----|-------------|
|
||||
| agents | `inline::meta-reference` |
|
||||
| datasetio | `remote::huggingface`, `inline::localfs` |
|
||||
| eval | `inline::meta-reference` |
|
||||
| inference | `remote::watsonx`, `inline::sentence-transformers` |
|
||||
| safety | `inline::llama-guard` |
|
||||
| scoring | `inline::basic`, `inline::llm-as-judge`, `inline::braintrust` |
|
||||
| telemetry | `inline::meta-reference` |
|
||||
| tool_runtime | `remote::brave-search`, `remote::tavily-search`, `inline::rag-runtime`, `remote::model-context-protocol` |
|
||||
| vector_io | `inline::faiss` |
|
||||
|
||||
|
||||
|
||||
### Environment Variables
|
||||
|
||||
The following environment variables can be configured:
|
||||
|
||||
- `LLAMASTACK_PORT`: Port for the Llama Stack distribution server (default: `5001`)
|
||||
- `WATSONX_API_KEY`: watsonx API Key (default: ``)
|
||||
- `WATSONX_PROJECT_ID`: watsonx Project ID (default: ``)
|
||||
|
||||
### Models
|
||||
|
||||
The following models are available by default:
|
||||
|
||||
- `meta-llama/llama-3-3-70b-instruct (aliases: meta-llama/Llama-3.3-70B-Instruct)`
|
||||
- `meta-llama/llama-2-13b-chat (aliases: meta-llama/Llama-2-13b)`
|
||||
- `meta-llama/llama-3-1-70b-instruct (aliases: meta-llama/Llama-3.1-70B-Instruct)`
|
||||
- `meta-llama/llama-3-1-8b-instruct (aliases: meta-llama/Llama-3.1-8B-Instruct)`
|
||||
- `meta-llama/llama-3-2-11b-vision-instruct (aliases: meta-llama/Llama-3.2-11B-Vision-Instruct)`
|
||||
- `meta-llama/llama-3-2-1b-instruct (aliases: meta-llama/Llama-3.2-1B-Instruct)`
|
||||
- `meta-llama/llama-3-2-3b-instruct (aliases: meta-llama/Llama-3.2-3B-Instruct)`
|
||||
- `meta-llama/llama-3-2-90b-vision-instruct (aliases: meta-llama/Llama-3.2-90B-Vision-Instruct)`
|
||||
- `meta-llama/llama-guard-3-11b-vision (aliases: meta-llama/Llama-Guard-3-11B-Vision)`
|
||||
|
||||
|
||||
### Prerequisite: API Keys
|
||||
|
||||
Make sure you have access to a watsonx API Key. You can get one by referring to [watsonx.ai](https://www.ibm.com/docs/en/masv-and-l/maximo-manage/continuous-delivery?topic=setup-create-watsonx-api-key).
|
||||
|
||||
|
||||
## Running Llama Stack with watsonx
|
||||
|
||||
You can do this via venv or via Docker, which has a pre-built image.
|
||||
|
||||
### Via Docker
|
||||
|
||||
This method allows you to get started quickly without having to build the distribution code.
|
||||
|
||||
```bash
|
||||
LLAMA_STACK_PORT=5001
|
||||
docker run \
|
||||
-it \
|
||||
-p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
|
||||
-v ./run.yaml:/root/my-run.yaml \
|
||||
llamastack/distribution-watsonx \
|
||||
--config /root/my-run.yaml \
|
||||
--port $LLAMA_STACK_PORT \
|
||||
--env WATSONX_API_KEY=$WATSONX_API_KEY \
|
||||
--env WATSONX_PROJECT_ID=$WATSONX_PROJECT_ID \
|
||||
--env WATSONX_BASE_URL=$WATSONX_BASE_URL
|
||||
```
|
|
@ -1,78 +0,0 @@
|
|||
---
|
||||
orphan: true
|
||||
---
|
||||
# Dell-TGI Distribution
|
||||
|
||||
```{toctree}
|
||||
:maxdepth: 2
|
||||
:hidden:
|
||||
|
||||
self
|
||||
```
|
||||
|
||||
The `llamastack/distribution-tgi` distribution consists of the following provider configurations.
|
||||
|
||||
|
||||
| **API** | **Inference** | **Agents** | **Memory** | **Safety** | **Telemetry** |
|
||||
|----------------- |--------------- |---------------- |-------------------------------------------------- |---------------- |---------------- |
|
||||
| **Provider(s)** | remote::tgi | meta-reference | meta-reference, remote::pgvector, remote::chroma | meta-reference | meta-reference |
|
||||
|
||||
|
||||
The only difference vs. the `tgi` distribution is that it runs the Dell-TGI server for inference.
|
||||
|
||||
|
||||
### Start the Distribution (Single Node GPU)
|
||||
|
||||
> [!NOTE]
|
||||
> This assumes you have access to a GPU to start a TGI server.
|
||||
|
||||
```
|
||||
$ cd distributions/dell-tgi/
|
||||
$ ls
|
||||
compose.yaml README.md run.yaml
|
||||
$ docker compose up
|
||||
```
|
||||
|
||||
The script will first start up the TGI server, then start up the Llama Stack distribution server, hooking up to the remote TGI provider for inference. You should see output like the following:
|
||||
```
|
||||
[text-generation-inference] | 2024-10-15T18:56:33.810397Z INFO text_generation_router::server: router/src/server.rs:1813: Using config Some(Llama)
|
||||
[text-generation-inference] | 2024-10-15T18:56:33.810448Z WARN text_generation_router::server: router/src/server.rs:1960: Invalid hostname, defaulting to 0.0.0.0
|
||||
[text-generation-inference] | 2024-10-15T18:56:33.864143Z INFO text_generation_router::server: router/src/server.rs:2353: Connected
|
||||
INFO: Started server process [1]
|
||||
INFO: Waiting for application startup.
|
||||
INFO: Application startup complete.
|
||||
INFO: Uvicorn running on http://[::]:8321 (Press CTRL+C to quit)
|
||||
```
|
||||
|
||||
To kill the server
|
||||
```
|
||||
docker compose down
|
||||
```
|
||||
|
||||
### (Alternative) Dell-TGI server + llama stack run (Single Node GPU)
|
||||
|
||||
#### Start Dell-TGI server locally
|
||||
```
|
||||
docker run -it --pull always --shm-size 1g -p 80:80 --gpus 4 \
|
||||
-e NUM_SHARD=4 \
|
||||
-e MAX_BATCH_PREFILL_TOKENS=32768 \
|
||||
-e MAX_INPUT_TOKENS=8000 \
|
||||
-e MAX_TOTAL_TOKENS=8192 \
|
||||
registry.dell.huggingface.co/enterprise-dell-inference-meta-llama-meta-llama-3.1-8b-instruct
|
||||
```
|
||||
|
||||
|
||||
#### Start Llama Stack server pointing to TGI server
|
||||
|
||||
```
|
||||
docker run --pull always --network host -it -p 8321:8321 -v ./run.yaml:/root/my-run.yaml --gpus=all llamastack/distribution-tgi --yaml_config /root/my-run.yaml
|
||||
```
|
||||
|
||||
Make sure that in your `run.yaml` file, your inference provider is pointing to the correct TGI server endpoint. E.g.
|
||||
```
|
||||
inference:
|
||||
- provider_id: tgi0
|
||||
provider_type: remote::tgi
|
||||
config:
|
||||
url: http://127.0.0.1:5009
|
||||
```
|
|
@ -1,190 +0,0 @@
|
|||
---
orphan: true
---
<!-- This file was auto-generated by distro_codegen.py, please edit source -->

# Dell Distribution of Llama Stack

```{toctree}
:maxdepth: 2
:hidden:

self
```

The `llamastack/distribution-dell` distribution consists of the following provider configurations.

| API | Provider(s) |
|-----|-------------|
| agents | `inline::meta-reference` |
| datasetio | `remote::huggingface`, `inline::localfs` |
| eval | `inline::meta-reference` |
| inference | `remote::tgi`, `inline::sentence-transformers` |
| safety | `inline::llama-guard` |
| scoring | `inline::basic`, `inline::llm-as-judge`, `inline::braintrust` |
| telemetry | `inline::meta-reference` |
| tool_runtime | `remote::brave-search`, `remote::tavily-search`, `inline::rag-runtime` |
| vector_io | `inline::faiss`, `remote::chromadb`, `remote::pgvector` |

You can use this distribution if you have GPUs and want to run an independent TGI or Dell Enterprise Hub container for running inference.

### Environment Variables

The following environment variables can be configured:

- `DEH_URL`: URL for the Dell inference server (default: `http://0.0.0.0:8181`)
- `DEH_SAFETY_URL`: URL for the Dell safety inference server (default: `http://0.0.0.0:8282`)
- `CHROMA_URL`: URL for the Chroma server (default: `http://localhost:6601`)
- `INFERENCE_MODEL`: Inference model loaded into the TGI server (default: `meta-llama/Llama-3.2-3B-Instruct`)
- `SAFETY_MODEL`: Name of the safety (Llama-Guard) model to use (default: `meta-llama/Llama-Guard-3-1B`)

## Setting up the inference server using Dell Enterprise Hub's custom TGI container

NOTE: This is a placeholder to run inference with TGI. This will be updated to use [Dell Enterprise Hub's containers](https://dell.huggingface.co/authenticated/models) once verified.

```bash
export INFERENCE_PORT=8181
export DEH_URL=http://0.0.0.0:$INFERENCE_PORT
export INFERENCE_MODEL=meta-llama/Llama-3.1-8B-Instruct
export CHROMADB_HOST=localhost
export CHROMADB_PORT=6601
export CHROMA_URL=http://$CHROMADB_HOST:$CHROMADB_PORT
export CUDA_VISIBLE_DEVICES=0
export LLAMA_STACK_PORT=8321

docker run --rm -it \
  --pull always \
  --network host \
  -v $HOME/.cache/huggingface:/data \
  -e HF_TOKEN=$HF_TOKEN \
  -p $INFERENCE_PORT:$INFERENCE_PORT \
  --gpus $CUDA_VISIBLE_DEVICES \
  ghcr.io/huggingface/text-generation-inference \
  --dtype bfloat16 \
  --usage-stats off \
  --sharded false \
  --cuda-memory-fraction 0.7 \
  --model-id $INFERENCE_MODEL \
  --port $INFERENCE_PORT --hostname 0.0.0.0
```

If you are using Llama Stack Safety / Shield APIs, then you will also need to run another instance of TGI with a corresponding safety model such as `meta-llama/Llama-Guard-3-1B`, using a script like:

```bash
export SAFETY_INFERENCE_PORT=8282
export DEH_SAFETY_URL=http://0.0.0.0:$SAFETY_INFERENCE_PORT
export SAFETY_MODEL=meta-llama/Llama-Guard-3-1B
export CUDA_VISIBLE_DEVICES=1

docker run --rm -it \
  --pull always \
  --network host \
  -v $HOME/.cache/huggingface:/data \
  -e HF_TOKEN=$HF_TOKEN \
  -p $SAFETY_INFERENCE_PORT:$SAFETY_INFERENCE_PORT \
  --gpus $CUDA_VISIBLE_DEVICES \
  ghcr.io/huggingface/text-generation-inference \
  --dtype bfloat16 \
  --usage-stats off \
  --sharded false \
  --cuda-memory-fraction 0.7 \
  --model-id $SAFETY_MODEL \
  --hostname 0.0.0.0 \
  --port $SAFETY_INFERENCE_PORT
```

## The Dell distribution relies on ChromaDB as its vector database

You can start a ChromaDB container easily using Docker or Podman.

```bash
# This is where the indices are persisted
mkdir -p $HOME/chromadb

podman run --rm -it \
  --network host \
  --name chromadb \
  -v $HOME/chromadb:/chroma/chroma \
  -e IS_PERSISTENT=TRUE \
  chromadb/chroma:latest \
  --port $CHROMADB_PORT \
  --host $CHROMADB_HOST
```

## Running Llama Stack

Now you are ready to run Llama Stack with TGI as the inference provider. You can do this via venv, or via Docker, which has a pre-built image.

### Via Docker

This method allows you to get started quickly without having to build the distribution code.

```bash
# NOTE: mount the llama-stack / llama-models directories only if testing local changes; otherwise not needed:
#   -v /home/hjshah/git/llama-stack:/app/llama-stack-source -v /home/hjshah/git/llama-models:/app/llama-models-source \
# Use localhost/distribution-dell:dev instead of llamastack/distribution-dell if building / testing locally.

docker run -it \
  --pull always \
  --network host \
  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
  -v $HOME/.llama:/root/.llama \
  llamastack/distribution-dell \
  --port $LLAMA_STACK_PORT \
  --env INFERENCE_MODEL=$INFERENCE_MODEL \
  --env DEH_URL=$DEH_URL \
  --env CHROMA_URL=$CHROMA_URL
```

If you are using Llama Stack Safety / Shield APIs, use:

```bash
# You need a local checkout of llama-stack to run this, get it using
# git clone https://github.com/meta-llama/llama-stack.git
cd /path/to/llama-stack

export SAFETY_INFERENCE_PORT=8282
export DEH_SAFETY_URL=http://0.0.0.0:$SAFETY_INFERENCE_PORT
export SAFETY_MODEL=meta-llama/Llama-Guard-3-1B

docker run \
  -it \
  --pull always \
  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
  -v $HOME/.llama:/root/.llama \
  -v ./llama_stack/distributions/tgi/run-with-safety.yaml:/root/my-run.yaml \
  llamastack/distribution-dell \
  --config /root/my-run.yaml \
  --port $LLAMA_STACK_PORT \
  --env INFERENCE_MODEL=$INFERENCE_MODEL \
  --env DEH_URL=$DEH_URL \
  --env SAFETY_MODEL=$SAFETY_MODEL \
  --env DEH_SAFETY_URL=$DEH_SAFETY_URL \
  --env CHROMA_URL=$CHROMA_URL
```

### Via venv

Make sure you have done `pip install llama-stack` and have the Llama Stack CLI available.

```bash
llama stack build --distro dell --image-type venv
llama stack run dell \
  --port $LLAMA_STACK_PORT \
  --env INFERENCE_MODEL=$INFERENCE_MODEL \
  --env DEH_URL=$DEH_URL \
  --env CHROMA_URL=$CHROMA_URL
```

If you are using Llama Stack Safety / Shield APIs, use:

```bash
llama stack run ./run-with-safety.yaml \
  --port $LLAMA_STACK_PORT \
  --env INFERENCE_MODEL=$INFERENCE_MODEL \
  --env DEH_URL=$DEH_URL \
  --env SAFETY_MODEL=$SAFETY_MODEL \
  --env DEH_SAFETY_URL=$DEH_SAFETY_URL \
  --env CHROMA_URL=$CHROMA_URL
```
@@ -1,125 +0,0 @@
---
orphan: true
---
<!-- This file was auto-generated by distro_codegen.py, please edit source -->

# Meta Reference GPU Distribution

```{toctree}
:maxdepth: 2
:hidden:

self
```

The `llamastack/distribution-meta-reference-gpu` distribution consists of the following provider configurations:

| API | Provider(s) |
|-----|-------------|
| agents | `inline::meta-reference` |
| datasetio | `remote::huggingface`, `inline::localfs` |
| eval | `inline::meta-reference` |
| inference | `inline::meta-reference` |
| safety | `inline::llama-guard` |
| scoring | `inline::basic`, `inline::llm-as-judge`, `inline::braintrust` |
| telemetry | `inline::meta-reference` |
| tool_runtime | `remote::brave-search`, `remote::tavily-search`, `inline::rag-runtime`, `remote::model-context-protocol` |
| vector_io | `inline::faiss`, `remote::chromadb`, `remote::pgvector` |

Note that you need access to NVIDIA GPUs to run this distribution. It is not compatible with CPU-only machines or machines with AMD GPUs.

### Environment Variables

The following environment variables can be configured:

- `LLAMA_STACK_PORT`: Port for the Llama Stack distribution server (default: `8321`)
- `INFERENCE_MODEL`: Inference model loaded into the Meta Reference server (default: `meta-llama/Llama-3.2-3B-Instruct`)
- `INFERENCE_CHECKPOINT_DIR`: Directory containing the Meta Reference model checkpoint (default: `null`)
- `SAFETY_MODEL`: Name of the safety (Llama-Guard) model to use (default: `meta-llama/Llama-Guard-3-1B`)
- `SAFETY_CHECKPOINT_DIR`: Directory containing the Llama-Guard model checkpoint (default: `null`)

## Prerequisite: Downloading Models

Please use `llama model list --downloaded` to check that you have llama model checkpoints downloaded in `~/.llama` before proceeding. See the [installation guide](../../references/llama_cli_reference/download_models.md) to download the models. Run `llama model list` to see the available models to download, and `llama model download` to download the checkpoints.

```
$ llama model list --downloaded
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓
┃ Model                                   ┃ Size     ┃ Modified Time       ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩
│ Llama3.2-1B-Instruct:int4-qlora-eo8     │ 1.53 GB  │ 2025-02-26 11:22:28 │
├─────────────────────────────────────────┼──────────┼─────────────────────┤
│ Llama3.2-1B                             │ 2.31 GB  │ 2025-02-18 21:48:52 │
├─────────────────────────────────────────┼──────────┼─────────────────────┤
│ Prompt-Guard-86M                        │ 0.02 GB  │ 2025-02-26 11:29:28 │
├─────────────────────────────────────────┼──────────┼─────────────────────┤
│ Llama3.2-3B-Instruct:int4-spinquant-eo8 │ 3.69 GB  │ 2025-02-26 11:37:41 │
├─────────────────────────────────────────┼──────────┼─────────────────────┤
│ Llama3.2-3B                             │ 5.99 GB  │ 2025-02-18 21:51:26 │
├─────────────────────────────────────────┼──────────┼─────────────────────┤
│ Llama3.1-8B                             │ 14.97 GB │ 2025-02-16 10:36:37 │
├─────────────────────────────────────────┼──────────┼─────────────────────┤
│ Llama3.2-1B-Instruct:int4-spinquant-eo8 │ 1.51 GB  │ 2025-02-26 11:35:02 │
├─────────────────────────────────────────┼──────────┼─────────────────────┤
│ Llama-Guard-3-1B                        │ 2.80 GB  │ 2025-02-26 11:20:46 │
├─────────────────────────────────────────┼──────────┼─────────────────────┤
│ Llama-Guard-3-1B:int4                   │ 0.43 GB  │ 2025-02-26 11:33:33 │
└─────────────────────────────────────────┴──────────┴─────────────────────┘
```
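
If a checkpoint you need is missing from this list, you can fetch it with `llama model download`. A hedged example is shown below; the model id is an illustration (pick one from `llama model list`), and `--source meta` will prompt you for the signed download URL you receive from Meta:

```bash
# Download a 3B instruct checkpoint from Meta; you will be prompted for the
# signed download URL. Use `llama model list` to see valid model ids.
llama model download --source meta --model-id Llama3.2-3B-Instruct
```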

## Running the Distribution

You can do this via venv, or via Docker, which has a pre-built image.

### Via Docker

This method allows you to get started quickly without having to build the distribution code.

```bash
LLAMA_STACK_PORT=8321
docker run \
  -it \
  --pull always \
  --gpus all \
  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
  -v ~/.llama:/root/.llama \
  llamastack/distribution-meta-reference-gpu \
  --port $LLAMA_STACK_PORT \
  --env INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct
```

If you are using Llama Stack Safety / Shield APIs, use:

```bash
docker run \
  -it \
  --pull always \
  --gpus all \
  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
  -v ~/.llama:/root/.llama \
  llamastack/distribution-meta-reference-gpu \
  --port $LLAMA_STACK_PORT \
  --env INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct \
  --env SAFETY_MODEL=meta-llama/Llama-Guard-3-1B
```

### Via venv

Make sure you have done `uv pip install llama-stack` and have the Llama Stack CLI available.

```bash
llama stack build --distro meta-reference-gpu --image-type venv
llama stack run distributions/meta-reference-gpu/run.yaml \
  --port 8321 \
  --env INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct
```

If you are using Llama Stack Safety / Shield APIs, use:

```bash
llama stack run distributions/meta-reference-gpu/run-with-safety.yaml \
  --port 8321 \
  --env INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct \
  --env SAFETY_MODEL=meta-llama/Llama-Guard-3-1B
```
@@ -1,171 +0,0 @@
---
orphan: true
---
<!-- This file was auto-generated by distro_codegen.py, please edit source -->

# NVIDIA Distribution

The `llamastack/distribution-nvidia` distribution consists of the following provider configurations.

| API | Provider(s) |
|-----|-------------|
| agents | `inline::meta-reference` |
| datasetio | `inline::localfs`, `remote::nvidia` |
| eval | `remote::nvidia` |
| files | `inline::localfs` |
| inference | `remote::nvidia` |
| post_training | `remote::nvidia` |
| safety | `remote::nvidia` |
| scoring | `inline::basic` |
| telemetry | `inline::meta-reference` |
| tool_runtime | `inline::rag-runtime` |
| vector_io | `inline::faiss` |

### Environment Variables

The following environment variables can be configured:

- `NVIDIA_API_KEY`: NVIDIA API Key (default: ``)
- `NVIDIA_APPEND_API_VERSION`: Whether to append the API version to the base_url (default: `True`)
- `NVIDIA_DATASET_NAMESPACE`: NVIDIA Dataset Namespace (default: `default`)
- `NVIDIA_PROJECT_ID`: NVIDIA Project ID (default: `test-project`)
- `NVIDIA_CUSTOMIZER_URL`: NVIDIA Customizer URL (default: `https://customizer.api.nvidia.com`)
- `NVIDIA_OUTPUT_MODEL_DIR`: NVIDIA Output Model Directory (default: `test-example-model@v1`)
- `GUARDRAILS_SERVICE_URL`: URL for the NeMo Guardrails Service (default: `http://0.0.0.0:7331`)
- `NVIDIA_GUARDRAILS_CONFIG_ID`: NVIDIA Guardrail Configuration ID (default: `self-check`)
- `NVIDIA_EVALUATOR_URL`: URL for the NeMo Evaluator Service (default: `http://0.0.0.0:7331`)
- `INFERENCE_MODEL`: Inference model (default: `Llama3.1-8B-Instruct`)
- `SAFETY_MODEL`: Name of the model to use for safety (default: `meta/llama-3.1-8b-instruct`)

### Models

The following models are available by default:

- `meta/llama3-8b-instruct`
- `meta/llama3-70b-instruct`
- `meta/llama-3.1-8b-instruct`
- `meta/llama-3.1-70b-instruct`
- `meta/llama-3.1-405b-instruct`
- `meta/llama-3.2-1b-instruct`
- `meta/llama-3.2-3b-instruct`
- `meta/llama-3.2-11b-vision-instruct`
- `meta/llama-3.2-90b-vision-instruct`
- `meta/llama-3.3-70b-instruct`
- `nvidia/vila`
- `nvidia/llama-3.2-nv-embedqa-1b-v2`
- `nvidia/nv-embedqa-e5-v5`
- `nvidia/nv-embedqa-mistral-7b-v2`
- `snowflake/arctic-embed-l`

## Prerequisites
### NVIDIA API Keys

Make sure you have access to an NVIDIA API key. You can get one by visiting [https://build.nvidia.com/](https://build.nvidia.com/). Use this key for the `NVIDIA_API_KEY` environment variable.
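
For example (the key value below is a placeholder):

```bash
# Placeholder value; substitute the key obtained from build.nvidia.com
export NVIDIA_API_KEY=nvapi-your-key-here
```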

### Deploy NeMo Microservices Platform

The NVIDIA NeMo microservices platform supports end-to-end microservice deployment of a complete AI flywheel on your Kubernetes cluster through the NeMo Microservices Helm Chart. Please refer to the [NVIDIA NeMo Microservices documentation](https://docs.nvidia.com/nemo/microservices/latest/about/index.html) for platform prerequisites and instructions to install and deploy the platform.

## Supported Services

Each Llama Stack API corresponds to a specific NeMo microservice. The core microservices (Customizer, Evaluator, Guardrails) are exposed by the same endpoint. The platform components (Data Store) are each exposed by separate endpoints.

### Inference: NVIDIA NIM

NVIDIA NIM is used for running inference with registered models. There are two ways to access NVIDIA NIMs:

1. Hosted (default): Preview APIs hosted at https://integrate.api.nvidia.com (requires an API key)
2. Self-hosted: NVIDIA NIMs that run on your own infrastructure.

The deployed platform includes the NIM Proxy microservice, which is the service that provides access to your NIMs (for example, to run inference on a model). Set the `NVIDIA_BASE_URL` environment variable to use your NVIDIA NIM Proxy deployment.
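
For a self-hosted deployment, that might look like the following (the URL is an example value for a NIM Proxy exposed in your cluster, not an official endpoint):

```bash
# Example only: point Llama Stack at your own NIM Proxy service
export NVIDIA_BASE_URL=http://nim-proxy.test:8000
```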
|
||||
|
||||
### Datasetio API: NeMo Data Store
|
||||
The NeMo Data Store microservice serves as the default file storage solution for the NeMo microservices platform. It exposts APIs compatible with the Hugging Face Hub client (`HfApi`), so you can use the client to interact with Data Store. The `NVIDIA_DATASETS_URL` environment variable should point to your NeMo Data Store endpoint.
|
||||
|
||||
See the {repopath}`NVIDIA Datasetio docs::llama_stack/providers/remote/datasetio/nvidia/README.md` for supported features and example usage.
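
Because the Data Store speaks the Hugging Face Hub API, one quick way to sanity-check connectivity is to hit a standard Hub route. This is a hedged sketch: the endpoint value is a placeholder, and the `/api/datasets` path is assumed to be served because it is part of the standard Hub API; adjust the path prefix if your deployment mounts the HfApi routes elsewhere.

```bash
# Example value; substitute your NeMo Data Store endpoint
export NVIDIA_DATASETS_URL=http://data-store.test

# List datasets through the Hub-compatible API (path prefix may differ per deployment)
curl -s "$NVIDIA_DATASETS_URL/api/datasets"
```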

### Eval API: NeMo Evaluator

The NeMo Evaluator microservice supports evaluation of LLMs. Launching an Evaluation job with NeMo Evaluator requires an Evaluation Config (an object that contains metadata needed by the job). A Llama Stack Benchmark maps to an Evaluation Config, so registering a Benchmark creates an Evaluation Config in NeMo Evaluator. The `NVIDIA_EVALUATOR_URL` environment variable should point to your NeMo Microservices endpoint.

See the {repopath}`NVIDIA Eval docs::llama_stack/providers/remote/eval/nvidia/README.md` for supported features and example usage.

### Post-Training API: NeMo Customizer

The NeMo Customizer microservice supports fine-tuning models. You can reference {repopath}`this list of supported models::llama_stack/providers/remote/post_training/nvidia/models.py` that can be fine-tuned using Llama Stack. The `NVIDIA_CUSTOMIZER_URL` environment variable should point to your NeMo Microservices endpoint.

See the {repopath}`NVIDIA Post-Training docs::llama_stack/providers/remote/post_training/nvidia/README.md` for supported features and example usage.

### Safety API: NeMo Guardrails

The NeMo Guardrails microservice sits between your application and the LLM, and adds checks and content moderation to a model. The `GUARDRAILS_SERVICE_URL` environment variable should point to your NeMo Microservices endpoint.

See the {repopath}`NVIDIA Safety docs::llama_stack/providers/remote/safety/nvidia/README.md` for supported features and example usage.

## Deploying models

In order to use a registered model with the Llama Stack APIs, ensure the corresponding NIM is deployed to your environment. For example, you can use the NIM Proxy microservice to deploy `meta/llama-3.2-1b-instruct`.

Note: For improved inference speeds, we need to use NIM with the `fast_outlines` guided decoding system (specified in the request body). This is the default if you deployed the platform with the NeMo Microservices Helm Chart.

```sh
# URL to NeMo NIM Proxy service
export NEMO_URL="http://nemo.test"

curl --location "$NEMO_URL/v1/deployment/model-deployments" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "name": "llama-3.2-1b-instruct",
    "namespace": "meta",
    "config": {
      "model": "meta/llama-3.2-1b-instruct",
      "nim_deployment": {
        "image_name": "nvcr.io/nim/meta/llama-3.2-1b-instruct",
        "image_tag": "1.8.3",
        "pvc_size": "25Gi",
        "gpu": 1,
        "additional_envs": {
          "NIM_GUIDED_DECODING_BACKEND": "fast_outlines"
        }
      }
    }
  }'
```

This NIM deployment should take approximately 10 minutes to go live. [See the docs](https://docs.nvidia.com/nemo/microservices/latest/get-started/tutorials/deploy-nims.html) for more information on how to deploy a NIM and verify it's available for inference.
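
While you wait, you can check on the deployment by querying the same model-deployments route used below for deletion. This is a hedged sketch that assumes the route also supports `GET` for status lookups; refer to the NeMo docs linked above for the authoritative API.

```bash
# Check the status of the NIM deployment created above (assumes GET is supported)
curl -s "$NEMO_URL/v1/deployment/model-deployments/meta/llama-3.2-1b-instruct"
```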

You can also remove a deployed NIM to free up GPU resources, if needed.

```sh
export NEMO_URL="http://nemo.test"

curl -X DELETE "$NEMO_URL/v1/deployment/model-deployments/meta/llama-3.1-8b-instruct"
```

## Running Llama Stack with NVIDIA

You can do this via venv (building the code), or via Docker, which has a pre-built image.

### Via Docker

This method allows you to get started quickly without having to build the distribution code.

```bash
LLAMA_STACK_PORT=8321
docker run \
  -it \
  --pull always \
  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
  -v ./run.yaml:/root/my-run.yaml \
  llamastack/distribution-nvidia \
  --config /root/my-run.yaml \
  --port $LLAMA_STACK_PORT \
  --env NVIDIA_API_KEY=$NVIDIA_API_KEY
```

### Via venv

If you've set up your local development environment, you can also build the image using your local virtual environment.

```bash
INFERENCE_MODEL=meta-llama/Llama-3.1-8B-Instruct
llama stack build --distro nvidia --image-type venv
llama stack run ./run.yaml \
  --port 8321 \
  --env NVIDIA_API_KEY=$NVIDIA_API_KEY \
  --env INFERENCE_MODEL=$INFERENCE_MODEL
```

## Example Notebooks

For examples of how to use the NVIDIA Distribution to run inference, fine-tune, evaluate, and run safety checks on your LLMs, you can reference the example notebooks in {repopath}`docs/notebooks/nvidia`.
@@ -1,42 +0,0 @@
---
orphan: true
---
<!-- This file was auto-generated by distro_codegen.py, please edit source -->

# Passthrough Distribution

```{toctree}
:maxdepth: 2
:hidden:

self
```

The `llamastack/distribution-passthrough` distribution consists of the following provider configurations.

| API | Provider(s) |
|-----|-------------|
| agents | `inline::meta-reference` |
| datasetio | `remote::huggingface`, `inline::localfs` |
| eval | `inline::meta-reference` |
| inference | `remote::passthrough`, `inline::sentence-transformers` |
| safety | `inline::llama-guard` |
| scoring | `inline::basic`, `inline::llm-as-judge`, `inline::braintrust` |
| telemetry | `inline::meta-reference` |
| tool_runtime | `remote::brave-search`, `remote::tavily-search`, `remote::wolfram-alpha`, `inline::rag-runtime`, `remote::model-context-protocol` |
| vector_io | `inline::faiss`, `remote::chromadb`, `remote::pgvector` |

### Environment Variables

The following environment variables can be configured:

- `LLAMA_STACK_PORT`: Port for the Llama Stack distribution server (default: `8321`)
- `PASSTHROUGH_API_KEY`: Passthrough API Key (default: ``)
- `PASSTHROUGH_URL`: Passthrough URL (default: ``)

### Models

The following models are available by default:

- `llama3.1-8b-instruct`
- `llama3.2-11b-vision-instruct`
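
To run this distribution, point it at the upstream endpoint you want to pass requests through to. Below is a hedged sketch that follows the same Docker pattern used by the other distributions in this guide; the URL and key values are placeholders.

```bash
LLAMA_STACK_PORT=8321
export PASSTHROUGH_URL=https://your-upstream-endpoint.example.com  # placeholder
export PASSTHROUGH_API_KEY=your_passthrough_api_key                # placeholder

docker run \
  -it \
  --pull always \
  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
  llamastack/distribution-passthrough \
  --port $LLAMA_STACK_PORT \
  --env PASSTHROUGH_URL=$PASSTHROUGH_URL \
  --env PASSTHROUGH_API_KEY=$PASSTHROUGH_API_KEY
```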
@@ -1,232 +0,0 @@
---
orphan: true
---
<!-- This file was auto-generated by distro_codegen.py, please edit source -->

# Starter Distribution

```{toctree}
:maxdepth: 2
:hidden:

self
```

The `llamastack/distribution-starter` distribution is a comprehensive, multi-provider distribution that includes most of the available inference providers in Llama Stack. It's designed to be a one-stop solution for developers who want to experiment with different AI providers without having to configure each one individually.

## Provider Composition

The starter distribution consists of the following provider configurations:

| API | Provider(s) |
|-----|-------------|
| agents | `inline::meta-reference` |
| datasetio | `remote::huggingface`, `inline::localfs` |
| eval | `inline::meta-reference` |
| files | `inline::localfs` |
| inference | `remote::openai`, `remote::fireworks`, `remote::together`, `remote::ollama`, `remote::anthropic`, `remote::gemini`, `remote::groq`, `remote::sambanova`, `remote::vllm`, `remote::tgi`, `remote::cerebras`, `remote::llama-openai-compat`, `remote::nvidia`, `remote::hf::serverless`, `remote::hf::endpoint`, `inline::sentence-transformers` |
| safety | `inline::llama-guard` |
| scoring | `inline::basic`, `inline::llm-as-judge`, `inline::braintrust` |
| telemetry | `inline::meta-reference` |
| tool_runtime | `remote::brave-search`, `remote::tavily-search`, `inline::rag-runtime`, `remote::model-context-protocol` |
| vector_io | `inline::faiss`, `inline::sqlite-vec`, `inline::milvus`, `remote::chromadb`, `remote::pgvector` |

## Inference Providers

The starter distribution includes a comprehensive set of inference providers:

### Hosted Providers
- **[OpenAI](https://openai.com/api/)**: GPT-4, GPT-3.5, O1, O3, O4 models and text embeddings - provider ID: `openai` - reference documentation: [openai](../../providers/inference/remote_openai.md)
- **[Fireworks](https://fireworks.ai/)**: Llama 3.1, 3.2, 3.3, 4 Scout, 4 Maverick models and embeddings - provider ID: `fireworks` - reference documentation: [fireworks](../../providers/inference/remote_fireworks.md)
- **[Together](https://together.ai/)**: Llama 3.1, 3.2, 3.3, 4 Scout, 4 Maverick models and embeddings - provider ID: `together` - reference documentation: [together](../../providers/inference/remote_together.md)
- **[Anthropic](https://www.anthropic.com/)**: Claude 3.5 Sonnet, Claude 3.7 Sonnet, Claude 3.5 Haiku, and Voyage embeddings - provider ID: `anthropic` - reference documentation: [anthropic](../../providers/inference/remote_anthropic.md)
- **[Gemini](https://gemini.google.com/)**: Gemini 1.5, 2.0, 2.5 models and text embeddings - provider ID: `gemini` - reference documentation: [gemini](../../providers/inference/remote_gemini.md)
- **[Groq](https://groq.com/)**: Fast Llama models (3.1, 3.2, 3.3, 4 Scout, 4 Maverick) - provider ID: `groq` - reference documentation: [groq](../../providers/inference/remote_groq.md)
- **[SambaNova](https://www.sambanova.ai/)**: Llama 3.1, 3.2, 3.3, 4 Scout, 4 Maverick models - provider ID: `sambanova` - reference documentation: [sambanova](../../providers/inference/remote_sambanova.md)
- **[Cerebras](https://www.cerebras.ai/)**: Cerebras AI models - provider ID: `cerebras` - reference documentation: [cerebras](../../providers/inference/remote_cerebras.md)
- **[NVIDIA](https://www.nvidia.com/)**: NVIDIA NIM - provider ID: `nvidia` - reference documentation: [nvidia](../../providers/inference/remote_nvidia.md)
- **[HuggingFace](https://huggingface.co/)**: Serverless and endpoint models - provider IDs: `hf::serverless` and `hf::endpoint` - reference documentation: [huggingface-serverless](../../providers/inference/remote_hf_serverless.md) and [huggingface-endpoint](../../providers/inference/remote_hf_endpoint.md)
- **[Bedrock](https://aws.amazon.com/bedrock/)**: AWS Bedrock models - provider ID: `bedrock` - reference documentation: [bedrock](../../providers/inference/remote_bedrock.md)

### Local/Remote Providers
- **[Ollama](https://ollama.ai/)**: Local Ollama models - provider ID: `ollama` - reference documentation: [ollama](../../providers/inference/remote_ollama.md)
- **[vLLM](https://docs.vllm.ai/en/latest/)**: Local or remote vLLM server - provider ID: `vllm` - reference documentation: [vllm](../../providers/inference/remote_vllm.md)
- **[TGI](https://github.com/huggingface/text-generation-inference)**: Text Generation Inference server, including Dell Enterprise Hub's custom TGI container (use `DEH_URL`) - provider ID: `tgi` - reference documentation: [tgi](../../providers/inference/remote_tgi.md)
- **[Sentence Transformers](https://www.sbert.net/)**: Local embedding models - provider ID: `sentence-transformers` - reference documentation: [sentence-transformers](../../providers/inference/inline_sentence-transformers.md)

All providers are disabled by default, so you need to enable them by setting the appropriate environment variables.

## Vector IO

The starter distribution includes a comprehensive set of vector IO providers:

- **[FAISS](https://github.com/facebookresearch/faiss)**: Local FAISS vector store - enabled by default - provider ID: `faiss`
- **[SQLite](https://www.sqlite.org/index.html)**: Local SQLite vector store - disabled by default - provider ID: `sqlite-vec`
- **[ChromaDB](https://www.trychroma.com/)**: Remote ChromaDB vector store - disabled by default - provider ID: `chromadb`
- **[PGVector](https://github.com/pgvector/pgvector)**: PostgreSQL vector store - disabled by default - provider ID: `pgvector`
- **[Milvus](https://milvus.io/)**: Milvus vector store - disabled by default - provider ID: `milvus`

## Environment Variables

The following environment variables can be configured:

### Server Configuration
- `LLAMA_STACK_PORT`: Port for the Llama Stack distribution server (default: `8321`)

### API Keys for Hosted Providers
- `OPENAI_API_KEY`: OpenAI API key
- `FIREWORKS_API_KEY`: Fireworks API key
- `TOGETHER_API_KEY`: Together API key
- `ANTHROPIC_API_KEY`: Anthropic API key
- `GEMINI_API_KEY`: Google Gemini API key
- `GROQ_API_KEY`: Groq API key
- `SAMBANOVA_API_KEY`: SambaNova API key
- `CEREBRAS_API_KEY`: Cerebras API key
- `LLAMA_API_KEY`: Llama API key
- `NVIDIA_API_KEY`: NVIDIA API key
- `HF_API_TOKEN`: HuggingFace API token

### Local Provider Configuration
- `OLLAMA_URL`: Ollama server URL (default: `http://localhost:11434`)
- `VLLM_URL`: vLLM server URL (default: `http://localhost:8000/v1`)
- `VLLM_MAX_TOKENS`: vLLM max tokens (default: `4096`)
- `VLLM_API_TOKEN`: vLLM API token (default: `fake`)
- `VLLM_TLS_VERIFY`: vLLM TLS verification (default: `true`)
- `TGI_URL`: TGI server URL

### Model Configuration
- `INFERENCE_MODEL`: HuggingFace model for serverless inference
- `INFERENCE_ENDPOINT_NAME`: HuggingFace endpoint name

### Vector Database Configuration
- `SQLITE_STORE_DIR`: SQLite store directory (default: `~/.llama/distributions/starter`)
- `ENABLE_SQLITE_VEC`: Enable SQLite vector provider
- `ENABLE_CHROMADB`: Enable ChromaDB provider
- `ENABLE_PGVECTOR`: Enable PGVector provider
- `CHROMADB_URL`: ChromaDB server URL
- `PGVECTOR_HOST`: PGVector host (default: `localhost`)
- `PGVECTOR_PORT`: PGVector port (default: `5432`)
- `PGVECTOR_DB`: PGVector database name
- `PGVECTOR_USER`: PGVector username
- `PGVECTOR_PASSWORD`: PGVector password

### Tool Configuration
- `BRAVE_SEARCH_API_KEY`: Brave Search API key
- `TAVILY_SEARCH_API_KEY`: Tavily Search API key

### Telemetry Configuration
- `OTEL_SERVICE_NAME`: OpenTelemetry service name
- `TELEMETRY_SINKS`: Telemetry sinks (default: `console,sqlite`)

## Enabling Providers

You can enable specific providers by setting the appropriate environment variables. For example:

```bash
# self-hosted
export OLLAMA_URL=http://localhost:11434  # enables the Ollama inference provider
export VLLM_URL=http://localhost:8000/v1  # enables the vLLM inference provider
export TGI_URL=http://localhost:8000/v1   # enables the TGI inference provider

# cloud-hosted providers requiring API key configuration on the server
export CEREBRAS_API_KEY=your_cerebras_api_key  # enables the Cerebras inference provider
export NVIDIA_API_KEY=your_nvidia_api_key      # enables the NVIDIA inference provider

# vector providers
export MILVUS_URL=http://localhost:19530      # enables the Milvus vector provider
export CHROMADB_URL=http://localhost:8000/v1  # enables the ChromaDB vector provider
export PGVECTOR_DB=llama_stack_db             # enables the PGVector vector provider
```

This distribution comes with a default "llama-guard" shield that can be enabled by setting the `SAFETY_MODEL` environment variable to point to an appropriate Llama Guard model id. Use `llama-stack-client models list` to see the list of available models.
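
For example (the model id below is the default Llama Guard model used elsewhere in this guide; verify it appears in your model list first):

```bash
# Pick a Llama Guard model id from the output of this command
llama-stack-client models list

# Enable the default llama-guard shield with that model
export SAFETY_MODEL=meta-llama/Llama-Guard-3-1B
```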

## Running the Distribution

You can run the starter distribution via Docker or venv.

### Via Docker

This method allows you to get started quickly without having to build the distribution code.

```bash
LLAMA_STACK_PORT=8321
docker run \
  -it \
  --pull always \
  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
  -e OPENAI_API_KEY=your_openai_key \
  -e FIREWORKS_API_KEY=your_fireworks_key \
  -e TOGETHER_API_KEY=your_together_key \
  llamastack/distribution-starter \
  --port $LLAMA_STACK_PORT
```

### Via venv

Ensure you have configured the starter distribution using the environment variables explained above.

```bash
uv run --with llama-stack llama stack build --distro starter --image-type venv --run
```

## Example Usage

Once the distribution is running, you can use any of the available models. Here are some examples:

### Using OpenAI Models
```bash
llama-stack-client --endpoint http://localhost:8321 \
  inference chat-completion \
  --model-id openai/gpt-4o \
  --message "Hello, how are you?"
```

### Using Fireworks Models
```bash
llama-stack-client --endpoint http://localhost:8321 \
  inference chat-completion \
  --model-id fireworks/meta-llama/Llama-3.2-3B-Instruct \
  --message "Write a short story about a robot."
```

### Using Local Ollama Models
```bash
# First, make sure Ollama is running and you have a model
ollama run llama3.2:3b

# Then use it through Llama Stack
export OLLAMA_INFERENCE_MODEL=llama3.2:3b
llama-stack-client --endpoint http://localhost:8321 \
  inference chat-completion \
  --model-id ollama/llama3.2:3b \
  --message "Explain quantum computing in simple terms."
```

## Storage

The starter distribution uses SQLite for local storage of various components:

- **Metadata store**: `~/.llama/distributions/starter/registry.db`
- **Inference store**: `~/.llama/distributions/starter/inference_store.db`
- **FAISS store**: `~/.llama/distributions/starter/faiss_store.db`
- **SQLite vector store**: `~/.llama/distributions/starter/sqlite_vec.db`
- **Files metadata**: `~/.llama/distributions/starter/files_metadata.db`
- **Agents store**: `~/.llama/distributions/starter/agents_store.db`
- **Responses store**: `~/.llama/distributions/starter/responses_store.db`
- **Trace store**: `~/.llama/distributions/starter/trace_store.db`
- **Evaluation store**: `~/.llama/distributions/starter/meta_reference_eval.db`
- **Dataset I/O stores**: Various HuggingFace and local filesystem stores

## Benefits of the Starter Distribution

1. **Comprehensive Coverage**: Includes most popular AI providers in one distribution
2. **Flexible Configuration**: Easy to enable/disable providers based on your needs
3. **No Local GPU Required**: Most providers are cloud-based, making it accessible to developers without high-end hardware
4. **Easy Migration**: Start with hosted providers and gradually move to local ones as needed
5. **Production Ready**: Includes safety, evaluation, and telemetry components
6. **Tool Integration**: Comes with web search, RAG, and model context protocol tools

The starter distribution is ideal for developers who want to experiment with different AI providers, build prototypes quickly, or create applications that can work with multiple AI backends.
@@ -1,25 +0,0 @@
# Starting a Llama Stack Server

You can run a Llama Stack server in one of the following ways:

## As a Library

This is the simplest way to get started. Using Llama Stack as a library means you do not need to start a server. This is especially useful when you are not running inference locally and are relying on an external inference service (e.g. Fireworks, Together, Groq, etc.). See [Using Llama Stack as a Library](importing_as_library).

## Container

Another simple way to start interacting with Llama Stack is to spin up a container (via Docker or Podman) that is pre-built with all the providers you need. We provide a number of pre-built images so you can start a Llama Stack server instantly. You can also build your own custom container. Which distribution to choose depends on the hardware you have. See [Selection of a Distribution](selection) for more details.

## Kubernetes

If you have built a container image and want to deploy it in a Kubernetes cluster instead of starting the Llama Stack server locally, see the [Kubernetes Deployment Guide](kubernetes_deployment) for more details.

```{toctree}
:maxdepth: 1
:hidden:

importing_as_library
configuration
```