- Add new Vertex AI remote inference provider with litellm integration
- Support for Gemini models through Google Cloud Vertex AI platform
- Uses Google Cloud Application Default Credentials (ADC) for authentication
- Added VertexAI models: gemini-2.5-flash, gemini-2.5-pro, gemini-2.0-flash.
- Updated provider registry to include vertexai provider
- Updated starter template to support Vertex AI configuration
- Added comprehensive documentation and sample configuration
Signed-off-by: Eran Cohen <eranco@redhat.com>
# What does this PR do?
https://github.com/meta-llama/llama-stack/pull/2716/ broke commands
like:
```
python -m llama_stack.distribution.server.server --config
llama_stack/templates/starter/run.yaml
```
Such commands now fail with:
```
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/Users/leseb/Documents/AI/llama-stack/llama_stack/distribution/server/server.py", line 626, in <module>
main()
File "/Users/leseb/Documents/AI/llama-stack/llama_stack/distribution/server/server.py", line 402, in main
config_file = resolve_config_or_template(args.config, Mode.RUN)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/leseb/Documents/AI/llama-stack/llama_stack/distribution/utils/config_resolution.py", line 43, in resolve_config_or_template
config_path = Path(config_or_template)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/Cellar/python@3.12/3.12.8/Frameworks/Python.framework/Versions/3.12/lib/python3.12/pathlib.py", line 1162, in __init__
super().__init__(*args)
File "/opt/homebrew/Cellar/python@3.12/3.12.8/Frameworks/Python.framework/Versions/3.12/lib/python3.12/pathlib.py", line 373, in __init__
raise TypeError(
TypeError: argument should be a str or an os.PathLike object where __fspath__ returns a str, not 'NoneType'
```
The error arises because no positional argument is present. We now honour the
deprecated `--config` and `--template` flags until they are removed completely.
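The fallback described above can be sketched with `argparse`. This is a minimal illustration, not the actual llama-stack code: the function names `build_parser` and `pick_config_or_template` and the `dest` names are hypothetical, and the only behaviour taken from the description is that a positional argument is preferred while `--config` and `--template` are still accepted.

```python
import argparse
import warnings


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="server")
    # Preferred form: one optional positional (a config path or a template name).
    parser.add_argument("config_or_template", nargs="?", default=None)
    # Deprecated flags, still accepted until they are removed completely.
    parser.add_argument("--config", dest="deprecated_config", default=None)
    parser.add_argument("--template", dest="deprecated_template", default=None)
    return parser


def pick_config_or_template(argv: list[str]) -> str:
    """Return the positional value, falling back to the deprecated flags."""
    args = build_parser().parse_args(argv)
    if args.config_or_template is not None:
        return args.config_or_template
    deprecated = (
        ("--config", args.deprecated_config),
        ("--template", args.deprecated_template),
    )
    for flag, value in deprecated:
        if value is not None:
            warnings.warn(
                f"{flag} is deprecated; pass the value positionally instead",
                DeprecationWarning,
                stacklevel=2,
            )
            return value
    # Never hand None to Path(): fail with a readable message instead.
    raise SystemExit("error: a config file or template name is required")
```

With this shape, `None` never reaches `Path(...)`, which is what produced the `TypeError` in the traceback.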
## Test Plan
Both `python -m llama_stack.distribution.server.server --config
llama_stack/templates/starter/run.yaml` and `python -m
llama_stack.distribution.server.server
llama_stack/templates/starter/run.yaml` should run the server. Same for
`--template starter`.
Signed-off-by: Sébastien Han <seb@redhat.com>
- Remove --no-cache flags from uv pip install commands to enable caching
- Mount host uv cache directory to container for persistent caching
- Set UV_LINK_MODE=copy to prevent uv using hardlinks
- When building the starter image:
  - Build time reduced from ~4:45 to ~3:05 on subsequent builds
    (environment specific)
  - Eliminates re-downloading of 3G+ of data on each build
  - Cache size: ~6.2G (when building starter image)
Fixes excessive data downloads during distro container builds.
Signed-off-by: Derek Higgins <derekh@redhat.com>
This PR updates model registration and lookup behavior to be slightly
more general / flexible. See
https://github.com/meta-llama/llama-stack/issues/2843 for more details.
Note that this change is backwards compatible given the design of the
`lookup_model()` method.
## Test Plan
Added unit tests
# What does this PR do?
This PR fixes flaky telemetry tests
See https://github.com/meta-llama/llama-stack/pull/2814
## Test Plan
Signed-off-by: Mustafa Elbehery <melbeher@redhat.com>
# What does this PR do?
When podman is used and the registry is omitted, podman will prompt the
user. However, we're piping the output of podman to /dev/null and the
user will not see the prompt, the script will end abruptly and this is
confusing.
This commit explicitly uses the docker.io registry for the ollama image
and the llama-stack image so that the prompt is avoided.
## Test Plan
I ran the script on a machine with podman and the issue was resolved.
## Image
Before the fix, this is what would happen:
<img width="748" height="95" alt="image"
src="https://github.com/user-attachments/assets/9c609f88-c0a8-45e7-a789-834f64f601e5"
/>
Signed-off-by: Omer Tuchfeld <omer@tuchfeld.dev>
# What does this PR do?
chore: Making name optional in openai_create_vector_store
Closes https://github.com/meta-llama/llama-stack/issues/2706
## Test Plan
CI and unit tests
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
Ensures that session turns retrieved from the agent persistence layer
are sorted by their `started_at` timestamp, as the key-value store does
not guarantee order.
Closes #2852
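A minimal sketch of the ordering step, assuming each turn exposes a `started_at` datetime as described above. The `Turn` dataclass and `sort_turns` helper here are simplified stand-ins for illustration, not the project's actual types.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Turn:
    # Simplified stand-in: only the fields needed to show the ordering.
    turn_id: str
    started_at: datetime


def sort_turns(turns: list[Turn]) -> list[Turn]:
    """Return turns in chronological order, whatever order the store gave."""
    return sorted(turns, key=lambda turn: turn.started_at)
```

Sorting on read keeps the fix independent of whichever key-value backend is configured.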
## Test Plan
- [ ] Add unit tests
# What does this PR do?
Minor update to the pgvector doc, changing 'faiss' to 'pgvector'.
## Test Plan
# What does this PR do?
This PR adds the quickstart as a file to the docs so that it can be more
easily maintained and run, as mentioned in
https://github.com/meta-llama/llama-stack/pull/2800.
## Test Plan
I could add this as a test in the CI but I wasn't sure if we wanted to
add additional jobs there. 😅
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
Refactors the vector store routing logic by moving OpenAI-compatible
vector store operations from the `VectorIORouter` to the
`VectorDBsRoutingTable`.
Closes https://github.com/meta-llama/llama-stack/issues/2761
## Test Plan
Added unit tests to cover new routing logic and ACL checks.
---------
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
Part of #2696
## Test Plan
Run `llama stack run starter`
Error:
```
myenv ❯ llama stack run starters
WARNING 2025-07-10 12:12:43,052 llama_stack.cli.stack.run:82 server: Conda detected. Using conda environment myenv for the run.
usage: llama stack run [-h] [--port PORT] [--image-name IMAGE_NAME] [--env KEY=VALUE]
[--image-type {conda,venv}] [--enable-ui]
[config | template]
llama stack run: error: Could not resolve config or template 'starters'.
Tried the following locations:
1. As file path: /Users/erichuang/projects/llama-stack-git/starters
2. As template: /Users/erichuang/projects/llama-stack-git/llama_stack/templates/starters/run.yaml
3. As built distribution: (/Users/erichuang/.llama/distributions/llamastack-starters/starters-run.yaml, /Users/erichuang/.llama/distributions/starters/starters-run.yaml)
Available templates: dell, test-env, vllm-gpu, test-template, cerebras, openai-api-verification, sambanova, passthrough, direct-config, together, openai, fireworks, meta-reference-gpu, __pycache__, dev, ollama, watsonx, remote-vllm, llama_api, groq, dummy, oracle, nvidia, ci-tests, postgres-demo, test-stack, bedrock, starter, hf-serverless, hf-endpoint, tgi, open-benchmark, verification
Did you mean one of these templates?
- starter
- together
- postgres-demo
```
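"Did you mean" suggestions like those in the output above can be generated with fuzzy string matching. The sketch below uses the standard library's `difflib`. Whether llama-stack uses this approach is an assumption, and `suggest_templates` is a hypothetical helper name.

```python
import difflib


def suggest_templates(name: str, available: list[str], limit: int = 3) -> list[str]:
    """Return up to `limit` template names that closely resemble `name`."""
    # get_close_matches returns the best matches first; cutoff drops weak ones.
    return difflib.get_close_matches(name, available, n=limit, cutoff=0.6)
```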
# What does this PR do?
After https://github.com/meta-llama/llama-stack/pull/2818, SIGINT
prints a stack trace. This is because uvicorn re-raises SIGINT, and
Python's default SIGINT handler converts it into a `KeyboardInterrupt`
exception. We now simply catch the exception to get a clean exit; this
does not change the behavior on SIGINT.
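The pattern can be sketched as follows. This is an illustration only: `run_until_interrupted` and the `serve` callable are hypothetical names standing in for the real server entry point.

```python
import logging
from typing import Callable

logger = logging.getLogger("server")


def run_until_interrupted(serve: Callable[[], None]) -> int:
    """Run a blocking `serve` callable and exit cleanly on Ctrl+C."""
    try:
        serve()
    except KeyboardInterrupt:
        # SIGINT surfaced as KeyboardInterrupt: log it and return normally
        # so that no traceback is printed.
        logger.info("Received interrupt signal, shutting down")
    return 0
```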
## Test Plan
Run the server, hit Ctrl+C or `kill -2 <server pid>` and expect a clean
exit with no stack trace.
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
The pre-commit workflow was failing in the main branch and removing
`@pytest.mark.asyncio` from `test_get_raw_document_text.py` fixed that.
## Test Plan
# What does this PR do?
This PR adds a `provider_id` field to the `VectorDBInput` class.
Fixes https://github.com/meta-llama/llama-stack/issues/2819
Signed-off-by: Mustafa Elbehery <melbeher@redhat.com>
# What does this PR do?
The workflow that automatically creates a PR to update the Coverage
Badge fails as the `GITHUB_TOKEN` doesn't have write permissions.
As opposed to providing write permissions to the token, we can provide
the permissions for just this workflow with this PR.
Just like #2805 but for vLLM.
We also make VLLM_URL env variable optional (not required) -- if not
specified, the provider silently sits idle and yells eventually if
someone tries to call a completion on it. This is done so as to allow
this provider to be present in the `starter` distribution.
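A rough sketch of the "idle until called" behaviour described above. The class name `LazyRemoteProvider`, its methods, and the error type are hypothetical; only the idea that a missing `VLLM_URL` is tolerated at startup and reported at call time comes from the description.

```python
from typing import Optional


class LazyRemoteProvider:
    """Provider that tolerates a missing URL until it is actually used."""

    def __init__(self, url: Optional[str]) -> None:
        self.url = url

    def initialize(self) -> None:
        # Nothing to connect to without a URL; stay idle instead of failing
        # so the provider can still be listed in a distribution.
        return None

    def completion(self, prompt: str) -> str:
        if not self.url:
            raise ValueError("VLLM_URL is not set; configure it to use this provider")
        # Placeholder for the real request to the backing server.
        return f"would send {prompt!r} to {self.url}"
```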
## Test Plan
Set up vLLM, copy the starter template and set `{ refresh_models: true,
refresh_models_interval: 10 }` for the vllm provider and then run:
```
ENABLE_VLLM=vllm VLLM_URL=http://localhost:8000/v1 \
uv run llama stack run --image-type venv /tmp/starter.yaml
```
Verify that `llama-stack-client models list` brings up the model
correctly from vLLM.
Inline _inference_ providers haven't proved to be very useful -- they
are rarely used. And for good reason -- it is almost never a good idea
to include a complex (distributed) inference engine bundled into a
distributed stateful front-end server serving many other things.
Responsibility should be split properly.
See Discord discussion:
1395849853
For self-hosted providers like Ollama (or vLLM), the backing server is
running a set of models. That server should be treated as the source of
truth and the Stack registry should just be a cache for those models. Of
course, in production environments, you may not want this (because you
know what model you are running statically) hence there's a config
boolean to control this behavior.
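The cache-with-refresh idea can be sketched with a background `asyncio` task. The option names `refresh_models` and `refresh_models_interval` are borrowed from the test plans in this log; the `ModelRegistryCache` class and the `list_models` callable are assumptions for illustration.

```python
import asyncio
from typing import Awaitable, Callable, Optional


class ModelRegistryCache:
    """Treat the backing server as the source of truth for available models."""

    def __init__(
        self,
        list_models: Callable[[], Awaitable[list[str]]],
        refresh_models: bool,
        refresh_models_interval: float,
    ) -> None:
        self._list_models = list_models
        self._refresh_models = refresh_models
        self._interval = refresh_models_interval
        self.models: set[str] = set()
        self._task: Optional[asyncio.Task] = None

    async def refresh_once(self) -> None:
        # Replace the cache wholesale so pruned models disappear as well.
        self.models = set(await self._list_models())

    async def _refresh_forever(self) -> None:
        while True:
            await self.refresh_once()
            await asyncio.sleep(self._interval)

    def start(self) -> None:
        # Must be called from within a running event loop.
        if self._refresh_models:
            self._task = asyncio.get_running_loop().create_task(self._refresh_forever())

    async def stop(self) -> None:
        if self._task is not None:
            self._task.cancel()
            try:
                await self._task
            except asyncio.CancelledError:
                pass
```

With `refresh_models` set to false, `start()` does nothing and the registry stays static, which matches the production case mentioned above.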
_This is part of a series of PRs aimed at removing the requirement of
needing to set `INFERENCE_MODEL` env variables for running Llama Stack
server._
## Test Plan
Copy and modify the starter.yaml template / config and enable
`refresh_models: true, refresh_models_interval: 10` for the ollama
provider. Then, run:
```
LLAMA_STACK_LOGGING=all=debug \
ENABLE_OLLAMA=ollama uv run llama stack run --image-type venv /tmp/starter.yaml
```
See a gargantuan amount of logs, but verify that the provider is
periodically refreshing models. Stop and prune a model from ollama
server, restart the server. Verify that the model goes away when I call
`uv run llama-stack-client models list`
# What does this PR do?
This PR fixes the `DPOAlignmentConfig` schema to use the correct Direct
Preference Optimization (DPO) parameters.
The current schema incorrectly uses PPO-inspired parameters
(`reward_scale`, `reward_clip`, `epsilon`, `gamma`) that are not part of
the DPO algorithm. This PR updates it to use the standard DPO
parameters:
- `beta`: The KL divergence coefficient that controls deviation from the
reference model
- `loss_type`: The type of DPO loss function (sigmoid, hinge, ipo,
kto_pair)
These parameters align with standard DPO implementations like
HuggingFace's TRL library.
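To show how the two parameters are used, here is a small sketch of a config object and the standard sigmoid DPO loss for a single preference pair. The class and function names are hypothetical and this is not the schema from the PR. The formula is the usual DPO objective, `-log(sigmoid(beta * margin))`.

```python
import math
from dataclasses import dataclass
from enum import Enum


class DPOLossType(str, Enum):
    sigmoid = "sigmoid"
    hinge = "hinge"
    ipo = "ipo"
    kto_pair = "kto_pair"


@dataclass
class DPOConfigSketch:
    beta: float  # strength of the pull toward the reference model
    loss_type: DPOLossType


def sigmoid_dpo_loss(
    cfg: DPOConfigSketch,
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
) -> float:
    """Sigmoid DPO loss for one (chosen, rejected) pair of log-probabilities."""
    # How much more the policy prefers "chosen" over "rejected" than the
    # reference model does.
    margin = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp
    )
    # -log(sigmoid(x)) == log(1 + exp(-x))
    return math.log1p(math.exp(-cfg.beta * margin))
```

None of `reward_scale`, `reward_clip`, `epsilon`, or `gamma` appears in this objective, which is why they do not belong in a DPO config.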
---------
Co-authored-by: Ubuntu <ubuntu@ip-172-31-43-83.ec2.internal>
When we call `construct_stack()`, providers are instantiated and
`initialize()` is called. This call can end up doing _anything_ at all
-- specifically, providers are free to create long running background
tasks as part of this. If we wrapped this within an `asyncio.run()` as in
the current code, these tasks get canceled when the stack construction
finishes. This is not correct. The PR addresses the issue by creating a
persistent event loop which is used for both the stack as well as for
running the uvicorn server. In other words, the lifetime of the
providers (and downstream async code) is now the same as the lifetime of
the uvicorn server.
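The difference can be demonstrated with a small, self-contained sketch. All names here (`heartbeat`, `construct_stack_sketch`, `serve_sketch`) are hypothetical stand-ins; the real code runs uvicorn rather than a short sleep.

```python
import asyncio

ticks: list[int] = []


async def heartbeat() -> None:
    # Stand-in for a long-running task started by a provider's initialize().
    count = 0
    while True:
        ticks.append(count)
        count += 1
        await asyncio.sleep(0.01)


async def construct_stack_sketch() -> asyncio.Task:
    return asyncio.get_running_loop().create_task(heartbeat())


async def serve_sketch() -> None:
    # Stand-in for serving requests for a while.
    await asyncio.sleep(0.05)


async def _cancel(task: asyncio.Task) -> None:
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass


def main() -> bool:
    # One persistent loop for both construction and serving. With
    # asyncio.run(construct_stack_sketch()) the heartbeat task would be
    # cancelled as soon as construction returned.
    loop = asyncio.new_event_loop()
    try:
        task = loop.run_until_complete(construct_stack_sketch())
        loop.run_until_complete(serve_sketch())
        still_running = not task.done()
        loop.run_until_complete(_cancel(task))
        return still_running
    finally:
        loop.close()
```

Calling `main()` returns whether the background task was still alive after the serving phase, which is the property the persistent loop is meant to guarantee.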
## Test Plan
This should not affect any current code since we don't have background
tasks created right now. However,
https://github.com/meta-llama/llama-stack/pull/2805 will start using
this functionality.
# What does this PR do?
The 'build' command didn't take into account the ENABLE flags for the
starter distro. Also, for some reason I was having issues with HuggingFace
access for the embedding model, so I added a tip for that as well.
Closes #2779
## Test Plan
I ran the described steps manually, but it would be nice if someone else
could try it and verify this still works
We might consider having some CI job ensure the QSG remains functional -
it's not a great experience for new users if they try Llama Stack for
the first time and it doesn't work as we describe
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
# What does this PR do?
- Added coverage badge to README. - [See my
fork](https://github.com/ChristianZaccaria/llama-stack)
- Added a GitHub Actions workflow that runs the tests and updates the
coverage badge. - [See
run](4574811323)
- Documented steps in `testing.md` for running the tests locally, and
viewing the `html` report.
- Excluded non-essential files from coverage reporting to provide a more
accurate measurement.
Automatically created PR to update coverage badge:
https://github.com/ChristianZaccaria/llama-stack/pull/9
# Note for reviewers
1. Currently the coverage report shows 45% coverage. Wondering if
there are other files or directories that should also be excluded from
the report to increase the percentage. The directories with the least
test coverage are `llama_stack/cli`, `llama_stack/models`, and
`llama_stack/ui`. - Should we exclude these?
2. **[Required]** The `GITHUB_TOKEN` should have write permissions to
open a PR to update the coverage badge.
# GitHub Issue
Closes #2355
## Test Plan
The `testing.md` file describes how to run the unit tests locally.
# What does this PR do?
Trigger integration tests on ALL changes to `tests/` to catch failures
before they merge into main.
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
This PR adds static type coverage to `llama-stack`
Part of https://github.com/meta-llama/llama-stack/issues/2647
## Test Plan
Signed-off-by: Mustafa Elbehery <melbeher@redhat.com>
# What does this PR do?
This PR adds static type coverage to `llama-stack`
Part of https://github.com/meta-llama/llama-stack/issues/2647
## Test Plan
Signed-off-by: Mustafa Elbehery <melbeher@redhat.com>
# What does this PR do?
Some async test markers remain in the codebase, causing pre-commit to fail
due to #2744.
This PR removes these pytest markers.
## Test Plan
pre-commit passes
Signed-off-by: Charlie Doern <cdoern@redhat.com>
If I am running `uv run llama stack run --image-type venv` it should not
be saying to me "Conda detected" because I am pretty clearly telling it
I need venv. The root cause is the offending line.
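The intended precedence can be sketched as below. This is an illustration under stated assumptions: `resolve_image_type` is a hypothetical helper, `CONDA_DEFAULT_ENV` is the variable conda sets for an active environment, and falling back to `venv` when nothing is detected is an assumed default.

```python
from typing import Mapping, Optional


def resolve_image_type(
    explicit: Optional[str], environ: Mapping[str, str]
) -> tuple[str, str]:
    """Return (image_type, reason), letting an explicit flag win."""
    if explicit is not None:
        # The user said what they want; skip environment detection entirely.
        return explicit, "explicit --image-type"
    if environ.get("CONDA_DEFAULT_ENV"):
        return "conda", "Conda detected"
    return "venv", "default"
```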
# What does this PR do?
## Test Plan
`ENABLE_OLLAMA=ollama LLAMA_STACK_CONFIG=starter uv run pytest tests/integration/telemetry --text-model="ollama/llama3.2:3b-instruct-fp16"`
# What does this PR do?
Lets users register models available at
https://integrate.api.nvidia.com/v1/models that aren't already in
llama_stack/providers/remote/inference/nvidia/models.py
## Test Plan
1. run the nvidia distro
2. register a model from https://integrate.api.nvidia.com/v1/models that
isn't already known; as of this writing,
nvidia/llama-3.1-nemotron-ultra-253b-v1 is a good example
3. perform inference w/ the model
- POST /v1/models accepts optional provider_model_id
- ModelsRoutingTable.register_model handler ensures it is non-None,
providing a default
usage of Model.provider_model_id will no longer need to detect None
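
A minimal sketch of the defaulting behaviour described above. The `Model` dataclass and the standalone `register_model` function are simplified stand-ins, and defaulting `provider_model_id` to the model identifier is an assumption, since the description only says that a default is provided.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Model:
    identifier: str
    provider_id: str
    provider_model_id: str  # never None once registration has run


def register_model(
    model_id: str,
    provider_id: str,
    provider_model_id: Optional[str] = None,
) -> Model:
    """Register a model, filling in provider_model_id when it is omitted."""
    if provider_model_id is None:
        # Assumed default: reuse the model identifier.
        provider_model_id = model_id
    return Model(
        identifier=model_id,
        provider_id=provider_id,
        provider_model_id=provider_model_id,
    )
```

Downstream code can then read `provider_model_id` directly without a `None` check.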