This flips #2823 and #2805 by making the Stack periodically query the providers for their models, rather than having the providers go behind the Stack's back and call "register" on the registry themselves. It also adds support for model listing to all other providers via `ModelRegistryHelper`. Once this is done, we no longer need to manually list or register models via `run.yaml`, which removes both noise and annoyance (setting `INFERENCE_MODEL` environment variables, for example) from the new-user experience. In addition, it adds a configuration variable, `allowed_models`, which can be used to optionally restrict the set of models exposed by a provider.
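As a minimal sketch of how this might be used, a provider entry in `run.yaml` could opt into periodic refresh and restrict the exposed models. The exact placement of `allowed_models` alongside the other provider config fields is an assumption based on this change, and the model name is illustrative:

```yaml
providers:
  inference:
    - provider_id: vllm
      provider_type: remote::vllm
      config:
        url: ${env.VLLM_URL:=}
        # periodically re-query the vLLM server for its model list
        refresh_models: true
        # assumption: allowed_models restricts which discovered models are exposed
        allowed_models:
          - meta-llama/Llama-3.1-8B-Instruct
```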
# remote::vllm

## Description

Remote vLLM inference provider for connecting to vLLM servers.

## Configuration
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `url` | `str \| None` | No | | The URL for the vLLM model serving endpoint |
| `max_tokens` | `int` | No | 4096 | Maximum number of tokens to generate. |
| `api_token` | `str \| None` | No | fake | The API token |
| `tls_verify` | `bool \| str` | No | True | Whether to verify TLS certificates. Can be a boolean or a path to a CA certificate file. |
| `refresh_models` | `bool` | No | False | Whether to refresh models periodically |
## Sample Configuration

```yaml
url: ${env.VLLM_URL:=}
max_tokens: ${env.VLLM_MAX_TOKENS:=4096}
api_token: ${env.VLLM_API_TOKEN:=fake}
tls_verify: ${env.VLLM_TLS_VERIFY:=true}
```
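Since `tls_verify` accepts either a boolean or a path to a CA certificate file, a deployment behind a private CA might look like the following sketch (the URL and certificate path are illustrative):

```yaml
url: https://vllm.internal.example.com:8000/v1
max_tokens: ${env.VLLM_MAX_TOKENS:=4096}
api_token: ${env.VLLM_API_TOKEN:=fake}
# a CA-bundle path instead of a boolean, per the table above
tls_verify: /etc/ssl/certs/internal-ca.pem
```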