# What does this PR do?
When clients called the Open AI API with invalid input that wasn't
caught by our own Pydantic API validation but instead only caught by the
backend inference provider, that backend inference provider was
returning a HTTP 400 error. However, we were wrapping that into a HTTP
500 error, obfuscating the actual issue from calling clients and
triggering OpenAI client retry logic.
This change adjusts our existing `translate_exception` method in
`server.py` to wrap `openai.BadRequestError` as HTTP 400 errors, passing
through the string representation of the error message to the calling
user so they can see the actual input validation error and correct it. I
tried changing this in a few other places, but ultimately
`translate_exception` was the only real place to handle this for both
streaming and non-streaming requests across all inference providers that
use the OpenAI server APIs.
This also tightens up our validation a bit for the OpenAI chat
completions API, to catch empty `messages` parameters, invalid
`tool_choice` parameters, invalid `tools` items, or passing
`tool_choice` when `tools` isn't given.
Lastly, this extends our OpenAI API chat completions verifications to
also check for consistent input validation across providers. Providers
behind Llama Stack should automatically pass all the new tests due to
the input validation added here, but some of the providers fail this
test when not run behind Llama Stack due to differences in how they
handle input validation and errors.
(Closes#1951)
## Test Plan
To test this, start an OpenAI API verification stack:
```
llama stack run --image-type venv tests/verifications/openai-api-verification-run.yaml
```
Then, run the new verification tests with your provider(s) of choice:
```
python -m pytest -s -v \
tests/verifications/openai_api/test_chat_completion.py \
--provider openai-llama-stack
python -m pytest -s -v \
tests/verifications/openai_api/test_chat_completion.py \
--provider together-llama-stack
```
Signed-off-by: Ben Browning <bbrownin@redhat.com>
# What does this PR do?
This PR adds the name of the tool that is used by the agent on the
"tools" page of the playground. See image below for an example.

## Test Plan
Run the playground and navigate to the tools page. There users can see
that this additional text is present when tools are invoked and absent
when they are not.
```
streamlit run llama_stack/distribution/ui/app.py
```
Signed-off-by: Michael Clifford <mcliffor@redhat.com>
# What does this PR do?
Previously, when a streaming client would disconnect before we were
finished streaming the entire response, an error like the below would
get raised from the `sse_generator` function in
`llama_stack/distribution/server/server.py`:
```
AttributeError: 'coroutine' object has no attribute 'aclose'. Did you mean: 'close'?
```
This was because we were calling `aclose` on a coroutine instead of the
awaited value from that coroutine. This change fixes that, so that we
save off the awaited value and then can call `aclose` on it if we
encounter an `asyncio.CancelledError`, like we see when a client
disconnects before we're finished streaming.
The other changes in here are to add a simple set of tests for the happy
path of our SSE streaming and this client disconnect path.
That unfortunately requires adding one more dependency into our unit
test section of pyproject.toml since `server.py` requires loading some
of the telemetry code for me to test this functionality.
## Test Plan
I wrote the tests in `tests/unit/server/test_sse.py` first, verified the
client disconnected test failed before my change, and that it passed
afterwards.
```
python -m pytest -s -v tests/unit/server/test_sse.py
```
Signed-off-by: Ben Browning <bbrownin@redhat.com>
# What does this PR do?
Add examples for how to define RAGDocuments. Not sure if this is the
best place for these docs. @raghotham Please advise
## Test Plan
None, documentation
[//]: # (## Documentation)
Signed-off-by: Kevin <kpostlet@redhat.com>
# What does this PR do?
Closes#1968.
The asynchronous client in `VLLMInferenceAdapter` is now initialized
directly before first use and not in `VLLMInferenceAdapter.initialize`.
This prevents issues arising due to accessing an expired event loop from
a completed `asyncio.run`.
## Test Plan
Ran unit tests, including `test_remote_vllm.py`.
Ran the code snippet mentioned in #1968.
---------
Co-authored-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
Now, tool outputs and retrieved chunks from the vector DB (i.e.,
everything except for the actual model reply) are hidden under an
expander form when presented to the user.
# Test Plan
Navigate to the RAG page in the Playground UI.
# What does this PR do?
The together inference provider was throwing a stack trace every time it
shut down, as it was trying to call a non-existent `close` method on the
AsyncTogether client. While fixing that, I also adjusted its shutdown
logic to close the OpenAI client if we've created one of those, as that
client does have a `close` method.
In testing that, I also realized we were defaulting to treating all
requests as streaming requests instead of defaulting to non-streaming.
So, this flips that default to non-streaming to match how the other
providers work.
## Test Plan
I tested this by ensuring the together inference provider no longer
spits out a long stack trace when shutting it down and by running the
OpenAI API chat completion verification suite to ensure the change in
default streaming logic didn't mess anything else up.
Signed-off-by: Ben Browning <bbrownin@redhat.com>
# What does this PR do?
This PR cleans up the sidebar on the tools page of the playground in the
following ways:
* created a clearer hierarchy of configuration options and tool
selections.
* Removed the `mcp::` or `builtin::` prefixes from the tool selection
buttons.
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
Run the playground and see the updated sidebar does not cause any new
errors.
```
streamlit run llama_stack/distribution/ui/app.py
```
[//]: # (## Documentation)
Signed-off-by: Michael Clifford <mcliffor@redhat.com>
37da47ef8e (diff-4d7c51b1efe9043e44439a949dfd92e5827321b34082903477fd04876edb7552)
Pydantic was updated from v1 to v2 in this commit which caused this
breaking change
# What does this PR do?
Part of #1857
This won't fix the Validation error with the example, but it will
correctly supply user with a proper error rather than a 5xx code.
Signed-off-by: Kevin <kpostlet@redhat.com>
# What does this PR do?
We were passing a dict into the compat mixin for OpenAI Completions when
using Llama models with Fireworks, and that was breaking some strong
typing code that was added in openai_compat.py. We shouldn't have been
converting these params to a dict in that case anyway, so this adjusts
things to pass the params in as their actual original types when calling
the OpenAIChatCompletionToLlamaStackMixin.
## Test Plan
All of the fireworks provider verification tests were failing due to
some OpenAI compatibility cleanup in #1962. The changes in that PR were
good to make, and this just cleans up the fireworks provider code to
stop passing in untyped dicts to some of those `openai_compat.py`
methods since we have the original strongly-typed parameters we can pass
in.
```
llama stack run --image-type venv tests/verifications/openai-api-verification-run.yaml
```
```
python -m pytest -s -v tests/verifications/openai_api/test_chat_completion.py --provider=fireworks-llama-stack
```
Before this PR, all of the fireworks OpenAI verification tests were
failing. Now, most of them are passing.
Signed-off-by: Ben Browning <bbrownin@redhat.com>
# What does this PR do?
- Update NVIDIA documentation links to GA docs
- Remove reference to notebooks until merged
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]
[//]: # (## Documentation)
Co-authored-by: Jash Gulabrai <jgulabrai@nvidia.com>
# What does this PR do?
This is helpful when debugging issues with vLLM + Llama Stack after this
PR https://github.com/vllm-project/vllm/pull/15593
---------
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
# What does this PR do?
NVIDIA Inference provider was using the ModelRegistryHelper to map input
model ids to provider model ids. this updates it to use the model_store.
## Test Plan
`LLAMA_STACK_CONFIG=http://localhost:8321 uv run pytest -v
tests/integration/inference/{test_embedding.py,test_text_inference.py,test_openai_completion.py}
--embedding-model nvidia/llama-3.2-nv-embedqa-1b-v2
--text-model=meta-llama/Llama-3.1-70B-Instruct`
# What does this PR do?
Fixes the UBI 9 container build failure ( `error: command 'gcc' failed`
when installing `polyleven`, `faiss`, etc.) by installing the missing
compiler tool‑chain:
- `python3.11-devel gcc` make added to the UBI 9 `dnf install` line.
### Closes#1970
## Test Plan
- Build a distro with an UBI image
# What does this PR do?
This was not caught as part of the CI build:
dd62a2388c.
[This PR](https://github.com/meta-llama/llama-stack/pull/1354) was too
old and didn't include the additional CI builds yet.
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
# What does this PR do?
- Adds a note about unexpected Brave Search output appearing even when
Tavily Search is called. This behavior is expected for now and is a work
in progress https://github.com/meta-llama/llama-stack/issues/1229. The
note aims to clear any confusion for new users.
- Adds two example scripts demonstrating how to build an agent using:
1. WebSearch tool
2. WolframAlpha tool
These examples provide new users with an instant understanding of how to
integrate these tools.
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
Tested these example scripts using following steps:
step 1. `ollama run llama3.2:3b-instruct-fp16 --keepalive 60m`
step 2.
```
export INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct"
export LLAMA_STACK_PORT=8321
```
step 3: `llama stack run --image-type conda
~/llama-stack/llama_stack/templates/ollama/run.yaml`
step 4: run the example script with your api keys.
expected output:


[//]: # (## Documentation)
Test plan:
python tests/verifications/generate_report.py --providers
fireworks,together,llama_meta_ref,openai
Co-authored-by: Eric Huang <erichuang@fb.com>
# What does this PR do?
Allow users to name an agent and use the name in telemetry instead of
relying on randomly generated agent_ids. This improves the developer
experience by making it easier to find specific agents in telemetry
logs.
Closes#1832
## Test Plan
- Added tests to verify the agent name is properly stored and retrieved
- Ran `uv run -- pytest -v
tests/integration/telemetry/test_telemetry.py::test_agent_name_filtering`
from the root of the project and made sure the tests pass
- Ran `uv run -- pytest -v
tests/integration/telemetry/test_telemetry.py::test_agent_query_spans`
to verify existing code without agent names still works correctly
## Use Example
```
agent = Agent(
llama_stack_client,
model=text_model_id,
name="CustomerSupportAgent", # New parameter
instructions="You are a helpful customer support assistant"
)
session_id = agent.create_session(f"test-session-{uuid4()}")
```
## Implementation Notes
- Agent names are optional string parameters with no additional
validation
- Names are not required to be unique - multiple agents can have the
same name
- The agent_id remains the unique identifier for an agent
---------
Co-authored-by: raghotham <raghotham@gmail.com>
# What does this PR do?
Some of our multi-turn verification tests were failing because I had
accidentally marked content as a required field in the OpenAI chat
completion request assistant messages, but it's actually optional. It is
required for messages from other roles, but assistant is explicitly
allowed to be optional.
Similarly, the assistant message tool_calls field should default to None
instead of an empty list.
These two changes get the openai-llama-stack verification test back to
100% passing, just like it passes 100% when not behind Llama Stack. They
also increase the pass rate of some of the other providers in the
verification test, but don't get them to 100%.
## Test Plan
I started a Llama Stack server setup to run all the verification tests
(requires OPENAI_API_KEY env variable)
```
llama stack run --image-type venv tests/verifications/openai-api-verification-run.yaml
```
Then, I manually ran the verification tests to see which were failing,
fix them, and ran them again after these changes to ensure they were all
passing.
```
python -m pytest -s -v tests/verifications/openai_api/test_chat_completion.py --provider=openai-llama-stack
```
Signed-off-by: Ben Browning <bbrownin@redhat.com>
# What does this PR do?
Add NVIDIA platform docs that serve as a starting point for Llama Stack
users and explains all supported microservices.
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]
[//]: # (## Documentation)
---------
Co-authored-by: Jash Gulabrai <jgulabrai@nvidia.com>
# What does this PR do?
This PR handles the case where a Customization Job's status is
`unknown`. Since we don't map `unknown` to a valid `JobStatus`, the
PostTraining provider throws an exception when fetching/listing a job.
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]
`./scripts/unit-tests.sh
tests/unit/providers/nvidia/test_supervised_fine_tuning.py` succeeds
[//]: # (## Documentation)
Co-authored-by: Jash Gulabrai <jgulabrai@nvidia.com>
# What does this PR do?
Fixes a crash that occurred when building a stack as a container image
via the interactive wizard without supplying --template or --config.
- Root cause: template_or_config was None; only the container path
relies on that parameter, which later reaches subprocess.run() and
triggers
`TypeError: expected str, bytes or os.PathLike object, not NoneType.`
- Change: in `_run_stack_build_command_from_build_config` we now fall
back to the freshly‑written build‑spec file whenever both optional
sources are missing. Also adds a spy‑based unit test that asserts a
valid string path is passed to build_image() for container builds.
### Closes#1976
## Test Plan
- New unit test: test_build_path.py. Monkey‑patches build_image,
captures the fourth argument, and verifies it is a real path
- Manual smoke test:
```
llama stack build --image-type container
# answer wizard prompts
```
Build proceeds into Docker without raising the previous TypeError.
## Future Work
Harmonise `build_image` arguments so every image type receives the same
inputs, eliminating this asymmetric special‑case.
# What does this PR do?
Build failures are hard to read, sometimes we get errors like:
```
Error building stack: 'key'
```
Which are difficult to debug without a proper trace.
## Test Plan
If `llama stack build` fails you get a traceback now.
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
This PR lets users select an existing vdb to use with their agent on the
tools page of the playground. The drop down menu that lets users select
a vdb only appears when the rag tool is selected. Without this change,
there is no way for a user to specify which vdb they want their rag tool
to use on the tools page. I have intentionally left the RAG options
sparse here since the full RAG options are exposed on the RAG page.
## Test Plan
Without these changes the RAG tool will throw the following error:
`name: knowledge_search) does not have any content `
With these changes the RAG tool works as expected.
Signed-off-by: Michael Clifford <mcliffor@redhat.com>
# What does this PR do?
Adds `meta/llama-3.2-1b-instruct` to list of models that NeMo Customizer
can fine-tune. This is the model our example notebooks typically use for
fine-tuning.
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]
[//]: # (## Documentation)
Co-authored-by: Jash Gulabrai <jgulabrai@nvidia.com>
# What does this PR do?
This PR persists the theme preference across page navigation.
Currently, if the default theme is detected, it is used.
But if a user flips **_the default theme_** and goes to a new page, the
theme will switch back to the default.
This resolves that issue.
## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]
[//]: # (## Documentation)
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
Fixes: #1955
Since 0.2.0, the vLLM gets an empty list (vs ``None``in 0.1.9 and
before) when there are no tools configured which causes the issue
described in #1955 p. This patch avoids sending the 'tools' param to the
vLLM altogether instead of an empty list.
It also adds a small unit test to avoid regressions.
The OpenAI
[specification](https://platform.openai.com/docs/api-reference/chat/create)
does not explicitly state that the list cannot be empty but I found this
out through experimentation and it might depend on the actual remote
vllm. In any case, as this parameter is Optional, is best to skip it
altogether if there's no tools configured.
Signed-off-by: Daniel Alvarez <dalvarez@redhat.com>
# What does this PR do?
This PR adds a `max_tokens` slider to playground tools page. I have
found that in some instances the llama stack server throws a 500 error
if the max_tokens value is not explicitly set in the agent's
`sampling_params`. This PR, uses the same implementation of the
`max_tokens` slider from the chat page, and includes it on the tools
page.
## Test Plan
1. Attempting to call a tool without these changes results in a `500:
Internal server error: An unexpected error occurred`.
2. Attempting to call a tool with these changes results in the expected
output.
Signed-off-by: Michael Clifford <mcliffor@redhat.com>
# What does this PR do?
PR adds instructions to setup vLLM remote endpoint for vllm-remote llama
stack distribution.
## Test Plan
* Verified with manual tests of the configured vllm-remote against vllm
endpoint running on the system with Intel GPU
* Also verified with ci pytests (see cmdline below). Test passes in the
same capacity as it does on the A10 Nvidia setup (some tests do fail
which seems to be known issues with vllm remote llama stack
distribution)
```
pytest -s -v tests/integration/inference/test_text_inference.py \
--stack-config=http://localhost:5001 \
--text-model=meta-llama/Llama-3.2-3B-Instruct
```
CC: @ashwinb
Signed-off-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
# What does this PR do?
allow users to specify only the providers they want in the llama stack
build command. If a user wants a non-interactive build, but doesn't want
to use a template, `--providers` allows someone to specify something
like `--providers inference=remote::ollama` for a distro with JUST
ollama
## Test Plan
`llama stack build --providers inference=remote::ollama --image-type
venv`
<img width="1084" alt="Screenshot 2025-03-20 at 9 34 14 AM"
src="https://github.com/user-attachments/assets/502b5fa2-edab-4267-a595-4f987204a6a9"
/>
`llama stack run --image-type venv
/Users/charliedoern/projects/Documents/llama-stack/venv-run.yaml`
<img width="1149" alt="Screenshot 2025-03-20 at 9 35 19 AM"
src="https://github.com/user-attachments/assets/433765f3-6b7f-4383-9241-dad085b69228"
/>
---------
Signed-off-by: Charlie Doern <cdoern@redhat.com>
Signed-off-by: Sébastien Han <seb@redhat.com>
Co-authored-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
## Test Plan
(myenv) ➜ llama-stack python tests/verifications/generate_report.py
--providers fireworks,together,openai --run-tests
f27f617629/tests/verifications/REPORT.md
## What does this PR do?
This PR improves the server's request routing logic by ensuring built-in
FastAPI paths such as `/docs`, `/redoc`, `/openapi.json`,
`/favicon.ico`, and `/static` bypass the custom `TracingMiddleware`.
This prevents unnecessary tracing logic for documentation and static
file requests, ensuring better performance and cleaner logs.
Additionally, it adds proper metadata (`title`, `description`, and
`version`) to the FastAPI application initialization and updates the
requirements document accordingly.
[//]: # (Closes#1822 )
---
## Test Plan
- Ran the server locally with `uvicorn` using the provided `run.yaml`
config
- Verified that:
- FastAPI docs (`/docs`, `/redoc`) load correctly without triggering the
custom tracing middleware
- All other routes still go through the middleware and trace logic
- Application metadata appears as expected in the OpenAPI docs
To reproduce:
1. Start the server with `python server.py --template <template-name>`
2. Navigate to `/docs` and `/redoc`
3. Confirm that no extra trace headers are added for those routes
4. Confirm other API endpoints behave as expected and include
`x-trace-id` in the response headers
[//]: # (## Documentation)
---
Froze the requirements file to include many of the other libraries that
have been added in the past few releases to make install easier.
---------
Co-authored-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
ollama's CLI supports running models via commands such as 'ollama run
llama3.2' this syntax does not work with the INFERENCE_MODEL llamastack
var as currently specifying a tag such as 'latest' is required
this commit will check to see if the 'latest' model is available and use
that model if a user passes a model name without a tag but the 'latest'
is available in ollama
## Test Plan
Behavior pre-code change
```bash
$ INFERENCE_MODEL=llama3.2 llama stack build --template ollama --image-type venv --run
...
INFO 2025-04-08 13:42:42,842 llama_stack.providers.remote.inference.ollama.ollama:80 inference: checking
connectivity to Ollama at `http://beanlab1.bss.redhat.com:11434`...
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/home/nathan/ai/llama-stack/repos/llama-stack/llama_stack/distribution/server/server.py", line 502, in <module>
main()
File "/home/nathan/ai/llama-stack/repos/llama-stack/llama_stack/distribution/server/server.py", line 401, in main
impls = asyncio.run(construct_stack(config))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib64/python3.12/asyncio/runners.py", line 195, in run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "/usr/lib64/python3.12/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib64/python3.12/asyncio/base_events.py", line 691, in run_until_complete
return future.result()
^^^^^^^^^^^^^^^
File "/home/nathan/ai/llama-stack/repos/llama-stack/llama_stack/distribution/stack.py", line 222, in construct_stack
await register_resources(run_config, impls)
File "/home/nathan/ai/llama-stack/repos/llama-stack/llama_stack/distribution/stack.py", line 99, in register_resources
await method(**obj.model_dump())
File "/home/nathan/ai/llama-stack/repos/llama-stack/llama_stack/providers/utils/telemetry/trace_protocol.py", line 102, in async_wrapper
result = await method(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nathan/ai/llama-stack/repos/llama-stack/llama_stack/distribution/routers/routing_tables.py", line 294, in register_model
registered_model = await self.register_object(model)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nathan/ai/llama-stack/repos/llama-stack/llama_stack/distribution/routers/routing_tables.py", line 228, in register_object
registered_obj = await register_object_with_provider(obj, p)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nathan/ai/llama-stack/repos/llama-stack/llama_stack/distribution/routers/routing_tables.py", line 77, in register_object_with_provider
return await p.register_model(obj)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nathan/ai/llama-stack/repos/llama-stack/llama_stack/providers/utils/telemetry/trace_protocol.py", line 102, in async_wrapper
result = await method(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nathan/ai/llama-stack/repos/llama-stack/llama_stack/providers/remote/inference/ollama/ollama.py", line 315, in register_model
raise ValueError(
ValueError: Model 'llama3.2' is not available in Ollama. Available models: llama3.2:latest
++ error_handler 108
++ echo 'Error occurred in script at line: 108'
Error occurred in script at line: 108
++ exit 1
```
Behavior post-code change
```bash
$ INFERENCE_MODEL=llama3.2 llama stack build --template ollama --image-type venv --run
...
INFO 2025-04-08 13:58:17,365 llama_stack.providers.remote.inference.ollama.ollama:80 inference: checking
connectivity to Ollama at `http://beanlab1.bss.redhat.com:11434`...
WARNING 2025-04-08 13:58:18,190 llama_stack.providers.remote.inference.ollama.ollama:317 inference: Imprecise provider
resource id was used but 'latest' is available in Ollama - using 'llama3.2:latest'
INFO 2025-04-08 13:58:18,191 llama_stack.providers.remote.inference.ollama.ollama:308 inference: Pulling embedding
model `all-minilm:latest` if necessary...
INFO 2025-04-08 13:58:18,799 __main__:478 server: Listening on ['::', '0.0.0.0']:8321
INFO: Started server process [28378]
INFO: Waiting for application startup.
INFO 2025-04-08 13:58:18,803 __main__:148 server: Starting up
INFO: Application startup complete.
INFO: Uvicorn running on http://['::', '0.0.0.0']:8321 (Press CTRL+C to quit)
...
```
## Documentation
Did not document this anywhere but happy to do so if there is an
appropriate place
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
# What does this PR do?
Now a separate thread is started to execute training jobs. Training
requests now return job ID before the job completes. (Which fixes API
timeouts for any jobs that take longer than a minute.)
Note: the scheduler code is meant to be spun out in the future into a
common provider service that can be reused for different APIs and
providers. It is also expected to back the /jobs API proposed here:
https://github.com/meta-llama/llama-stack/discussions/1238
Hence its somewhat generalized form which is expected to simplify its
adoption elsewhere in the future.
Note: this patch doesn't attempt to implement missing APIs (e.g. cancel
or job removal). This work will belong to follow-up PRs.
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]
Added unit tests for the scheduler module. For the API coverage, did
manual testing and was able to run a training cycle on GPU. The initial
call returned job ID before the training completed, as (now) expected.
Artifacts are returned as expected.
```
JobArtifactsResponse(checkpoints=[{'identifier': 'meta-llama/Llama-3.2-3B-Instruct-sft-0', 'created_at': '2025-03-07T22:45:19.892714', 'epoch': 0, 'post_training_job_id': 'test-job2ee77104-2fd3-4a4e-84cf-f83f8b8f1f50', 'path': '/home/ec2-user/.llama/checkpoints/meta-llama/Llama-3.2-3B-Instruct-sft-0', 'training_metrics': None}], job_uuid='test-job2ee77104-2fd3-4a4e-84cf-f83f8b8f1f50')
```
The integration test is currently disabled for the provider. I will look
into how it can be enabled in a different PR / issue context.
[//]: # (## Documentation)
Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
# What does this PR do?
TLDR: Changes needed to get 100% passing tests for OpenAI API
verification tests when run against Llama Stack with the `together`,
`fireworks`, and `openai` providers. And `groq` is better than before,
at 88% passing.
This cleans up the OpenAI API support for image message types
(specifically `image_url` types) and handling of the `response_format`
chat completion parameter. Both of these required a few more Pydantic
model definitions in our Inference API, just to move from the
not-quite-right stubs I had in place to something fleshed out to match
the actual OpenAI API specs.
As part of testing this, I also found and fixed a bug in the litellm
implementation of openai_completion and openai_chat_completion, so the
providers based on those should actually be working now.
The method `prepare_openai_completion_params` in
`llama_stack/providers/utils/inference/openai_compat.py` was improved to
actually recursively clean up input parameters, including handling of
lists, dicts, and dumping of Pydantic models to dicts. These changes
were required to get to 100% passing tests on the OpenAI API
verification against the `openai` provider.
With the above, the together.ai provider was passing as well as it is
without Llama Stack. But, since we have Llama Stack in the middle, I
took the opportunity to clean up the together.ai provider so that it now
also passes the OpenAI API spec tests we have at 100%. That means
together.ai is now passing our verification test better when using an
OpenAI client talking to Llama Stack than it is when hitting together.ai
directly, without Llama Stack in the middle.
And, another round of work for Fireworks to improve translation of
incoming OpenAI chat completion requests to Llama Stack chat completion
requests gets the fireworks provider passing at 100%. The server-side
fireworks.ai tool calling support with OpenAI chat completions and Llama
4 models isn't great yet, but by pointing the OpenAI clients at Llama
Stack's API we can clean things up and get everything working as
expected for Llama 4 models.
## Test Plan
### OpenAI API Verification Tests
I ran the OpenAI API verification tests as below and 100% of the tests
passed.
First, start a Llama Stack server that runs the `openai` provider with
the `gpt-4o` and `gpt-4o-mini` models deployed. There's not a template
setup to do this out of the box, so I added a
`tests/verifications/openai-api-verification-run.yaml` to do this.
First, ensure you have the necessary API key environment variables set:
```
export TOGETHER_API_KEY="..."
export FIREWORKS_API_KEY="..."
export OPENAI_API_KEY="..."
```
Then, run a Llama Stack server that serves up all these providers:
```
llama stack run \
--image-type venv \
tests/verifications/openai-api-verification-run.yaml
```
Finally, generate a new verification report against all these providers,
both with and without the Llama Stack server in the middle.
```
python tests/verifications/generate_report.py \
--run-tests \
--provider \
together \
fireworks \
groq \
openai \
together-llama-stack \
fireworks-llama-stack \
groq-llama-stack \
openai-llama-stack
```
You'll see that most of the configurations with Llama Stack in the
middle now pass at 100%, even though some of them do not pass at 100%
when hitting the backend provider's API directly with an OpenAI client.
### OpenAI Completion Integration Tests with vLLM:
I also ran the smaller `test_openai_completion.py` test suite (that's
not yet merged with the verification tests) on multiple of the
providers, since I had to adjust the method signature of
openai_chat_completion a bit and thus had to touch lots of these
providers to match. Here's the tests I ran there, all passing:
```
VLLM_URL="http://localhost:8000/v1" INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" llama stack build --template remote-vllm --image-type venv --run
```
in another terminal
```
LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "meta-llama/Llama-3.2-3B-Instruct"
```
### OpenAI Completion Integration Tests with ollama
```
INFERENCE_MODEL="llama3.2:3b-instruct-q8_0" llama stack build --template ollama --image-type venv --run
```
in another terminal
```
LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="llama3.2:3b-instruct-q8_0" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "llama3.2:3b-instruct-q8_0"
```
### OpenAI Completion Integration Tests with together.ai
```
INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct-Turbo" llama stack build --template together --image-type venv --run
```
in another terminal
```
LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct-Turbo" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "meta-llama/Llama-3.2-3B-Instruct-Turbo"
```
### OpenAI Completion Integration Tests with fireworks.ai
```
INFERENCE_MODEL="meta-llama/Llama-3.1-8B-Instruct" llama stack build --template fireworks --image-type venv --run
```
in another terminal
```
LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="meta-llama/Llama-3.1-8B-Instruct" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "meta-llama/Llama-3.1-8B-Instruct"
---------
Signed-off-by: Ben Browning <bbrownin@redhat.com>
Bumps [astral-sh/setup-uv](https://github.com/astral-sh/setup-uv) from
5.4.0 to 5.4.1.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/astral-sh/setup-uv/releases">astral-sh/setup-uv's
releases</a>.</em></p>
<blockquote>
<h2>v5.4.1 🌈 Add support for pep440 version specifiers</h2>
<h2>Changes</h2>
<p>With this release you can also use <a
href="https://peps.python.org/pep-0440/#version-specifiers">pep440
version specifiers</a> as <code>required-version</code> in
files<code>uv.toml</code>, <code>pyroject.toml</code> and in the
<code>version</code> input:</p>
<pre lang="yaml"><code>- name: Install a pep440-specifier-satisfying
version of uv
uses: astral-sh/setup-uv@v5
with:
version: ">=0.4.25,<0.5"
</code></pre>
<h2>🐛 Bug fixes</h2>
<ul>
<li>Add support for pep440 version identifiers <a
href="https://github.com/eifinger"><code>@eifinger</code></a> (<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/353">#353</a>)</li>
</ul>
<h2>🧰 Maintenance</h2>
<ul>
<li>chore: update known checksums for 0.6.10 @<a
href="https://github.com/apps/github-actions">github-actions[bot]</a>
(<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/345">#345</a>)</li>
</ul>
<h2>📚 Documentation</h2>
<ul>
<li>Add pep440 to docs header <a
href="https://github.com/eifinger"><code>@eifinger</code></a> (<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/355">#355</a>)</li>
<li>Fix glob syntax link <a
href="https://github.com/flying-sheep"><code>@flying-sheep</code></a>
(<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/349">#349</a>)</li>
<li>Add link to supported glob patterns <a
href="https://github.com/eifinger"><code>@eifinger</code></a> (<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/348">#348</a>)</li>
</ul>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="0c5e2b8115"><code>0c5e2b8</code></a>
Add pep440 to docs header (<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/355">#355</a>)</li>
<li><a
href="794ea9455c"><code>794ea94</code></a>
Add support for pep440 version identifiers (<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/353">#353</a>)</li>
<li><a
href="2d49baf2b6"><code>2d49baf</code></a>
chore: update known checksums for 0.6.10 (<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/345">#345</a>)</li>
<li><a
href="4fa25599ce"><code>4fa2559</code></a>
Fix glob syntax link (<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/349">#349</a>)</li>
<li><a
href="224dce1d79"><code>224dce1</code></a>
Add link to supported glob patterns (<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/348">#348</a>)</li>
<li>See full diff in <a
href="22695119d7...0c5e2b8115">compare
view</a></li>
</ul>
</details>
<br />
[](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.
[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)
---
<details>
<summary>Dependabot commands and options</summary>
<br />
You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)
</details>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
# What does this PR do?
Currently the instructions for Llama 4 take quite some space before
people can see the overview and other sections about Llama Stack. Moving
this to a collapsed section would make it less verbose.
workflow -
0. Checkout
1. Install uv
2. Install Ollama
3. Pull Ollama image
4. Start Ollama in background
5. Set Up Environment and Install Dependencies
6. Wait for Ollama to start
7. Start Llama Stack server in background
8. Wait for Llama Stack server to be ready
9. Run Integration Tests
changes -
(4) starts the loading of the ollama model, it does not start ollama.
the model will be loaded when used. this step is removed.
(6) is handled in (2). this step is removed.
(2) is renamed to reflect it's dual purpose.