# What does this PR do?
Rename environment var for consistency
## Test Plan
No regressions
## Sources
## Before submitting
- [X] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [X] Ran pre-commit to handle lint / formatting issues.
- [X] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
Pull Request section?
- [X] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.
---------
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
Co-authored-by: Yuan Tang <terrytangyuan@gmail.com>
# What does this PR do?
Automatically generates
- build.yaml
- run.yaml
- run-with-safety.yaml
- parts of markdown docs
for the distributions.
## Test Plan
At this point, this only updates the YAMLs and the docs. Some testing
(especially with ollama and vllm) has been performed but needs to be
much more tested.
# What does this PR do?
We'd like our docker steps to require _ZERO EDITS_ to a YAML file in
order to get going. This is often not possible because depending on the
provider, we do need some configuration input from the user. Environment
variables are the best way to obtain this information.
This PR allows our run.yaml to contain `${env.FOO_BAR}` placeholders
which can be replaced using `docker run -e FOO_BAR=baz` (and similar
`docker compose` equivalent).
## Test Plan
For remote-vllm, example `run.yaml` snippet looks like this:
```yaml
providers:
inference:
# serves main inference model
- provider_id: vllm-0
provider_type: remote::vllm
config:
# NOTE: replace with "localhost" if you are running in "host" network mode
url: ${env.LLAMA_INFERENCE_VLLM_URL:http://host.docker.internal:5100/v1}
max_tokens: ${env.MAX_TOKENS:4096}
api_token: fake
# serves safety llama_guard model
- provider_id: vllm-1
provider_type: remote::vllm
config:
# NOTE: replace with "localhost" if you are running in "host" network mode
url: ${env.LLAMA_SAFETY_VLLM_URL:http://host.docker.internal:5101/v1}
max_tokens: ${env.MAX_TOKENS:4096}
api_token: fake
```
`compose.yaml` snippet looks like this:
```yaml
llamastack:
depends_on:
- vllm-0
- vllm-1
# image: llamastack/distribution-remote-vllm
image: llamastack/distribution-remote-vllm:test-0.0.52rc3
volumes:
- ~/.llama:/root/.llama
- ~/local/llama-stack/distributions/remote-vllm/run.yaml:/root/llamastack-run-remote-vllm.yaml
# network_mode: "host"
environment:
- LLAMA_INFERENCE_VLLM_URL=${LLAMA_INFERENCE_VLLM_URL:-http://host.docker.internal:5100/v1}
- LLAMA_INFERENCE_MODEL=${LLAMA_INFERENCE_MODEL:-Llama3.1-8B-Instruct}
- MAX_TOKENS=${MAX_TOKENS:-4096}
- SQLITE_STORE_DIR=${SQLITE_STORE_DIR:-$HOME/.llama/distributions/remote-vllm}
- LLAMA_SAFETY_VLLM_URL=${LLAMA_SAFETY_VLLM_URL:-http://host.docker.internal:5101/v1}
- LLAMA_SAFETY_MODEL=${LLAMA_SAFETY_MODEL:-Llama-Guard-3-1B}
```
Splits the meta-reference safety implementation into three distinct providers:
- inline::llama-guard
- inline::prompt-guard
- inline::code-scanner
Note that this PR is a backward incompatible change to the llama stack server. I have added deprecation_error field to ProviderSpec -- the server reads it and immediately barfs. This is used to direct the user with a specific message on what action to perform. An automagical "config upgrade" is a bit too much work to implement right now :/
(Note that we will be gradually prefixing all inline providers with inline:: -- I am only doing this for this set of new providers because otherwise existing configuration files will break even more badly.)