note: the openai provider exposes the litellm specific model names to
the user. this change is compatible with that. the litellm names should
be deprecated.
# What does this PR do?
Closes#2113.
Closes#1783.
Fixes a bug in handling the end of tool execution request stream where
no `finish_reason` is provided by the model.
## Test Plan
1. Ran existing unit tests
2. Added a dedicated test verifying correct behavior in this edge case
3. Ran the code snapshot from #2113
[//]: # (## Documentation)
# What does this PR do?
Closes#2111.
Fixes an error causing Llama Stack to just return `<tool_call>` and
complete the turn without actually executing the tool. See the issue
description for more detail.
## Test Plan
1) Ran existing unit tests
2) Added a dedicated test verifying correct behavior in this edge case
3) Ran the code snapshot from #2111
# What does this PR do?
The ollama provider was using an older variant of the code to convert
incoming parameters from the OpenAI API completions and chat completion
endpoints into requests that get sent to the backend provider over its
own OpenAI client. This updates it to use the common
`prepare_openai_completion_params` method used elsewhere, which takes
care of removing stray `None` values even for nested structures.
Without this, some other parameters, even if they have values of `None`,
make their way to ollama and actually influence its inference output as
opposed to when those parameters are not sent at all.
## Test Plan
This passes tests/integration/inference/test_openai_completion.py and
fixes the issue found in #2098, which was tested via manual curl
requests crafted a particular way.
Closes#2098
Signed-off-by: Ben Browning <bbrownin@redhat.com>
# What does this PR do?
In our OpenAI API verification tests, some providers were still calling
tools even when `tool_choice="none"` was passed in the chat completion
requests. Because they aren't all respecting `tool_choice` properly,
this adjusts our routing implementation to remove the `tools` and
`tool_choice` from the request if `tool_choice="none"` is passed in so
that it does not attempt to call any of those tools. Adjusting this in
the router fixes this across all providers.
This also cleans up the non-streaming together.ai responses for tools,
ensuring it returns `None` instead of an empty list when there are no
tool calls, to exactly match the OpenAI API responses in that case.
## Test Plan
I observed existing failures in our OpenAI API verification suite - see
https://github.com/bbrowning/llama-stack-tests/blob/main/openai-api-verification/2025-04-27.md#together-llama-stack
for the failing `test_chat_*_tool_choice_none` tests. All streaming and
non-streaming variants were failing across all 3 tested models.
After this change, all of those 6 failing tests are now passing with no
regression in the other tests.
I verified this via:
```
llama stack run --image-type venv \
tests/verifications/openai-api-verification-run.yaml
```
```
python -m pytest -s -v \
'tests/verifications/openai_api/test_chat_completion.py' \
--provider=together-llama-stack
```
The entire verification suite is not 100% on together.ai yet, but it's
getting closer.
This also increased the pass rate for fireworks.ai, and did not regress
the groq or openai tests at all.
Signed-off-by: Ben Browning <bbrownin@redhat.com>
# What does this PR do?
switch sambanova inference adaptor to LiteLLM usage to simplify
integration and solve issues with current adaptor when streaming and
tool calling, models and templates updated
## Test Plan
pytest -s -v tests/integration/inference/test_text_inference.py
--stack-config=sambanova
--text-model=sambanova/Meta-Llama-3.3-70B-Instruct
pytest -s -v tests/integration/inference/test_vision_inference.py
--stack-config=sambanova
--vision-model=sambanova/Llama-3.2-11B-Vision-Instruct
# What does this PR do?
Mainly tried to cover the entire llama_stack/apis directory, we only
have one left. Some excludes were just noop.
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
The goal of this PR is code base modernization.
Schema reflection code needed a minor adjustment to handle UnionTypes
and collections.abc.AsyncIterator. (Both are preferred for latest Python
releases.)
Note to reviewers: almost all changes here are automatically generated
by pyupgrade. Some additional unused imports were cleaned up. The only
change worth of note can be found under `docs/openapi_generator` and
`llama_stack/strong_typing/schema.py` where reflection code was updated
to deal with "newer" types.
Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
# What does this PR do?
Add several new pre-commit hooks to improve code quality and security:
- no-commit-to-branch: prevent direct commits to protected branches like
`main`
- check-yaml: validate YAML files
- detect-private-key: prevent accidental commit of private keys
- requirements-txt-fixer: maintain consistent requirements.txt format
and sorting
- mixed-line-ending: enforce LF line endings to avoid mixed line endings
- check-executables-have-shebangs: ensure executable scripts have
shebangs
- check-json: validate JSON files
- check-shebang-scripts-are-executable: verify shebang scripts are
executable
- check-symlinks: validate symlinks and report broken ones
- check-toml: validate TOML files mainly for pyproject.toml
The respective fixes have been included.
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
In our OpenAI API verification tests, ollama was still calling tools
even when `tool_choice="none"` was passed in its chat completion
requests. Because ollama isn't respecting `tool_choice` properly, this
adjusts our provider implementation to remove the `tools` from the
request if `tool_choice="none"` is passed in so that it does not attempt
to call any of those tools.
## Test Plan
I tested this with a couple of Llama models, using both our OpenAI
completions integration tests and our verification test suites.
### OpenAI Completions / Chat Completions integration tests
These all passed before, and still do.
```
INFERENCE_MODEL="llama3.2:3b-instruct-fp16" \
llama stack build --template ollama --image-type venv --run
```
```
LLAMA_STACK_CONFIG=http://localhost:8321 \
python -m pytest -v \
tests/integration/inference/test_openai_completion.py \
--text-model "llama3.2:3b-instruct-fp16"
```
### OpenAI API Verification test suite
test_chat_*_tool_choice_none OpenAI API verification tests pass now,
when they failed before.
See
https://github.com/bbrowning/llama-stack-tests/blob/main/openai-api-verification/2025-04-27.md#ollama-llama-stack
for an example of these failures from a recent nightly CI run.
```
INFERENCE_MODEL="llama3.3:70b-instruct-q3_K_M" \
llama stack build --template ollama --image-type venv --run
```
```
cat <<-EOF > tests/verifications/conf/ollama-llama-stack.yaml
base_url: http://localhost:8321/v1/openai/v1
api_key_var: OPENAI_API_KEY
models:
- llama3.3:70b-instruct-q3_K_M
model_display_names:
llama3.3:70b-instruct-q3_K_M: Llama-3.3-70B-Instruct
test_exclusions:
llama3.3:70b-instruct-q3_K_M:
- test_chat_non_streaming_image
- test_chat_streaming_image
- test_chat_multi_turn_multiple_images
EOF
```
```
python -m pytest -s -v \
'tests/verifications/openai_api/test_chat_completion.py' \
--provider=ollama-llama-stack
```
Signed-off-by: Ben Browning <bbrownin@redhat.com>
# What does this PR do?
There are new changes in repo which needs to add some additional
functions to the inference which is fixed. Also need one additional
params to pass some extra arguments to watsonx.ai
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]
[//]: # (## Documentation)
---------
Co-authored-by: Sajikumar JS <sajikumar.js@ibm.com>
# What does this PR do?
IBM watsonx ai added as the inference [#1741
](https://github.com/meta-llama/llama-stack/issues/1741)
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
---------
Co-authored-by: Sajikumar JS <sajikumar.js@ibm.com>
# What does this PR do?
Adds custom model registration functionality to NVIDIAInferenceAdapter
which let's the inference happen on:
- post-training model
- non-llama models in API Catalogue(behind
https://integrate.api.nvidia.com and endpoints compatible with
AyncOpenAI)
## Example Usage:
```python
from llama_stack.apis.models import Model, ModelType
from llama_stack.distribution.library_client import LlamaStackAsLibraryClient
client = LlamaStackAsLibraryClient("nvidia")
_ = client.initialize()
client.models.register(
model_id=model_name,
model_type=ModelType.llm,
provider_id="nvidia"
)
response = client.inference.chat_completion(
model_id=model_name,
messages=[{"role":"system","content":"You are a helpful assistant."},{"role":"user","content":"Write a limerick about the wonders of GPU computing."}],
)
```
## Test Plan
```bash
pytest tests/unit/providers/nvidia/test_supervised_fine_tuning.py
========================================================== test session starts ===========================================================
platform linux -- Python 3.10.0, pytest-8.3.5, pluggy-1.5.0
rootdir: /home/ubuntu/llama-stack
configfile: pyproject.toml
plugins: anyio-4.9.0
collected 6 items
tests/unit/providers/nvidia/test_supervised_fine_tuning.py ...... [100%]
============================================================ warnings summary ============================================================
../miniconda/envs/nvidia-1/lib/python3.10/site-packages/pydantic/fields.py:1076
/home/ubuntu/miniconda/envs/nvidia-1/lib/python3.10/site-packages/pydantic/fields.py:1076: PydanticDeprecatedSince20: Using extra keyword arguments on `Field` is deprecated and will be removed. Use `json_schema_extra` instead. (Extra keys: 'contentEncoding'). Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
warn(
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
====================================================== 6 passed, 1 warning in 1.51s ======================================================
```
[//]: # (## Documentation)
Updated Readme.md
cc: @dglogo, @sumitb, @mattf
# What does this PR do?
Closes#1968.
The asynchronous client in `VLLMInferenceAdapter` is now initialized
directly before first use and not in `VLLMInferenceAdapter.initialize`.
This prevents issues arising due to accessing an expired event loop from
a completed `asyncio.run`.
## Test Plan
Ran unit tests, including `test_remote_vllm.py`.
Ran the code snippet mentioned in #1968.
---------
Co-authored-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
The together inference provider was throwing a stack trace every time it
shut down, as it was trying to call a non-existent `close` method on the
AsyncTogether client. While fixing that, I also adjusted its shutdown
logic to close the OpenAI client if we've created one of those, as that
client does have a `close` method.
In testing that, I also realized we were defaulting to treating all
requests as streaming requests instead of defaulting to non-streaming.
So, this flips that default to non-streaming to match how the other
providers work.
## Test Plan
I tested this by ensuring the together inference provider no longer
spits out a long stack trace when shutting it down and by running the
OpenAI API chat completion verification suite to ensure the change in
default streaming logic didn't mess anything else up.
Signed-off-by: Ben Browning <bbrownin@redhat.com>
# What does this PR do?
We were passing a dict into the compat mixin for OpenAI Completions when
using Llama models with Fireworks, and that was breaking some strong
typing code that was added in openai_compat.py. We shouldn't have been
converting these params to a dict in that case anyway, so this adjusts
things to pass the params in as their actual original types when calling
the OpenAIChatCompletionToLlamaStackMixin.
## Test Plan
All of the fireworks provider verification tests were failing due to
some OpenAI compatibility cleanup in #1962. The changes in that PR were
good to make, and this just cleans up the fireworks provider code to
stop passing in untyped dicts to some of those `openai_compat.py`
methods since we have the original strongly-typed parameters we can pass
in.
```
llama stack run --image-type venv tests/verifications/openai-api-verification-run.yaml
```
```
python -m pytest -s -v tests/verifications/openai_api/test_chat_completion.py --provider=fireworks-llama-stack
```
Before this PR, all of the fireworks OpenAI verification tests were
failing. Now, most of them are passing.
Signed-off-by: Ben Browning <bbrownin@redhat.com>
# What does this PR do?
NVIDIA Inference provider was using the ModelRegistryHelper to map input
model ids to provider model ids. this updates it to use the model_store.
## Test Plan
`LLAMA_STACK_CONFIG=http://localhost:8321 uv run pytest -v
tests/integration/inference/{test_embedding.py,test_text_inference.py,test_openai_completion.py}
--embedding-model nvidia/llama-3.2-nv-embedqa-1b-v2
--text-model=meta-llama/Llama-3.1-70B-Instruct`
# What does this PR do?
Add NVIDIA platform docs that serve as a starting point for Llama Stack
users and explains all supported microservices.
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]
[//]: # (## Documentation)
---------
Co-authored-by: Jash Gulabrai <jgulabrai@nvidia.com>
Fixes: #1955
Since 0.2.0, the vLLM gets an empty list (vs ``None``in 0.1.9 and
before) when there are no tools configured which causes the issue
described in #1955 p. This patch avoids sending the 'tools' param to the
vLLM altogether instead of an empty list.
It also adds a small unit test to avoid regressions.
The OpenAI
[specification](https://platform.openai.com/docs/api-reference/chat/create)
does not explicitly state that the list cannot be empty but I found this
out through experimentation and it might depend on the actual remote
vllm. In any case, as this parameter is Optional, is best to skip it
altogether if there's no tools configured.
Signed-off-by: Daniel Alvarez <dalvarez@redhat.com>
# What does this PR do?
ollama's CLI supports running models via commands such as 'ollama run
llama3.2' this syntax does not work with the INFERENCE_MODEL llamastack
var as currently specifying a tag such as 'latest' is required
this commit will check to see if the 'latest' model is available and use
that model if a user passes a model name without a tag but the 'latest'
is available in ollama
## Test Plan
Behavior pre-code change
```bash
$ INFERENCE_MODEL=llama3.2 llama stack build --template ollama --image-type venv --run
...
INFO 2025-04-08 13:42:42,842 llama_stack.providers.remote.inference.ollama.ollama:80 inference: checking
connectivity to Ollama at `http://beanlab1.bss.redhat.com:11434`...
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/home/nathan/ai/llama-stack/repos/llama-stack/llama_stack/distribution/server/server.py", line 502, in <module>
main()
File "/home/nathan/ai/llama-stack/repos/llama-stack/llama_stack/distribution/server/server.py", line 401, in main
impls = asyncio.run(construct_stack(config))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib64/python3.12/asyncio/runners.py", line 195, in run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "/usr/lib64/python3.12/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib64/python3.12/asyncio/base_events.py", line 691, in run_until_complete
return future.result()
^^^^^^^^^^^^^^^
File "/home/nathan/ai/llama-stack/repos/llama-stack/llama_stack/distribution/stack.py", line 222, in construct_stack
await register_resources(run_config, impls)
File "/home/nathan/ai/llama-stack/repos/llama-stack/llama_stack/distribution/stack.py", line 99, in register_resources
await method(**obj.model_dump())
File "/home/nathan/ai/llama-stack/repos/llama-stack/llama_stack/providers/utils/telemetry/trace_protocol.py", line 102, in async_wrapper
result = await method(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nathan/ai/llama-stack/repos/llama-stack/llama_stack/distribution/routers/routing_tables.py", line 294, in register_model
registered_model = await self.register_object(model)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nathan/ai/llama-stack/repos/llama-stack/llama_stack/distribution/routers/routing_tables.py", line 228, in register_object
registered_obj = await register_object_with_provider(obj, p)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nathan/ai/llama-stack/repos/llama-stack/llama_stack/distribution/routers/routing_tables.py", line 77, in register_object_with_provider
return await p.register_model(obj)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nathan/ai/llama-stack/repos/llama-stack/llama_stack/providers/utils/telemetry/trace_protocol.py", line 102, in async_wrapper
result = await method(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nathan/ai/llama-stack/repos/llama-stack/llama_stack/providers/remote/inference/ollama/ollama.py", line 315, in register_model
raise ValueError(
ValueError: Model 'llama3.2' is not available in Ollama. Available models: llama3.2:latest
++ error_handler 108
++ echo 'Error occurred in script at line: 108'
Error occurred in script at line: 108
++ exit 1
```
Behavior post-code change
```bash
$ INFERENCE_MODEL=llama3.2 llama stack build --template ollama --image-type venv --run
...
INFO 2025-04-08 13:58:17,365 llama_stack.providers.remote.inference.ollama.ollama:80 inference: checking
connectivity to Ollama at `http://beanlab1.bss.redhat.com:11434`...
WARNING 2025-04-08 13:58:18,190 llama_stack.providers.remote.inference.ollama.ollama:317 inference: Imprecise provider
resource id was used but 'latest' is available in Ollama - using 'llama3.2:latest'
INFO 2025-04-08 13:58:18,191 llama_stack.providers.remote.inference.ollama.ollama:308 inference: Pulling embedding
model `all-minilm:latest` if necessary...
INFO 2025-04-08 13:58:18,799 __main__:478 server: Listening on ['::', '0.0.0.0']:8321
INFO: Started server process [28378]
INFO: Waiting for application startup.
INFO 2025-04-08 13:58:18,803 __main__:148 server: Starting up
INFO: Application startup complete.
INFO: Uvicorn running on http://['::', '0.0.0.0']:8321 (Press CTRL+C to quit)
...
```
## Documentation
Did not document this anywhere but happy to do so if there is an
appropriate place
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
# What does this PR do?
TLDR: Changes needed to get 100% passing tests for OpenAI API
verification tests when run against Llama Stack with the `together`,
`fireworks`, and `openai` providers. And `groq` is better than before,
at 88% passing.
This cleans up the OpenAI API support for image message types
(specifically `image_url` types) and handling of the `response_format`
chat completion parameter. Both of these required a few more Pydantic
model definitions in our Inference API, just to move from the
not-quite-right stubs I had in place to something fleshed out to match
the actual OpenAI API specs.
As part of testing this, I also found and fixed a bug in the litellm
implementation of openai_completion and openai_chat_completion, so the
providers based on those should actually be working now.
The method `prepare_openai_completion_params` in
`llama_stack/providers/utils/inference/openai_compat.py` was improved to
actually recursively clean up input parameters, including handling of
lists, dicts, and dumping of Pydantic models to dicts. These changes
were required to get to 100% passing tests on the OpenAI API
verification against the `openai` provider.
With the above, the together.ai provider was passing as well as it is
without Llama Stack. But, since we have Llama Stack in the middle, I
took the opportunity to clean up the together.ai provider so that it now
also passes the OpenAI API spec tests we have at 100%. That means
together.ai is now passing our verification test better when using an
OpenAI client talking to Llama Stack than it is when hitting together.ai
directly, without Llama Stack in the middle.
And, another round of work for Fireworks to improve translation of
incoming OpenAI chat completion requests to Llama Stack chat completion
requests gets the fireworks provider passing at 100%. The server-side
fireworks.ai tool calling support with OpenAI chat completions and Llama
4 models isn't great yet, but by pointing the OpenAI clients at Llama
Stack's API we can clean things up and get everything working as
expected for Llama 4 models.
## Test Plan
### OpenAI API Verification Tests
I ran the OpenAI API verification tests as below and 100% of the tests
passed.
First, start a Llama Stack server that runs the `openai` provider with
the `gpt-4o` and `gpt-4o-mini` models deployed. There's not a template
setup to do this out of the box, so I added a
`tests/verifications/openai-api-verification-run.yaml` to do this.
First, ensure you have the necessary API key environment variables set:
```
export TOGETHER_API_KEY="..."
export FIREWORKS_API_KEY="..."
export OPENAI_API_KEY="..."
```
Then, run a Llama Stack server that serves up all these providers:
```
llama stack run \
--image-type venv \
tests/verifications/openai-api-verification-run.yaml
```
Finally, generate a new verification report against all these providers,
both with and without the Llama Stack server in the middle.
```
python tests/verifications/generate_report.py \
--run-tests \
--provider \
together \
fireworks \
groq \
openai \
together-llama-stack \
fireworks-llama-stack \
groq-llama-stack \
openai-llama-stack
```
You'll see that most of the configurations with Llama Stack in the
middle now pass at 100%, even though some of them do not pass at 100%
when hitting the backend provider's API directly with an OpenAI client.
### OpenAI Completion Integration Tests with vLLM:
I also ran the smaller `test_openai_completion.py` test suite (that's
not yet merged with the verification tests) on multiple of the
providers, since I had to adjust the method signature of
openai_chat_completion a bit and thus had to touch lots of these
providers to match. Here's the tests I ran there, all passing:
```
VLLM_URL="http://localhost:8000/v1" INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" llama stack build --template remote-vllm --image-type venv --run
```
in another terminal
```
LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "meta-llama/Llama-3.2-3B-Instruct"
```
### OpenAI Completion Integration Tests with ollama
```
INFERENCE_MODEL="llama3.2:3b-instruct-q8_0" llama stack build --template ollama --image-type venv --run
```
in another terminal
```
LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="llama3.2:3b-instruct-q8_0" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "llama3.2:3b-instruct-q8_0"
```
### OpenAI Completion Integration Tests with together.ai
```
INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct-Turbo" llama stack build --template together --image-type venv --run
```
in another terminal
```
LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct-Turbo" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "meta-llama/Llama-3.2-3B-Instruct-Turbo"
```
### OpenAI Completion Integration Tests with fireworks.ai
```
INFERENCE_MODEL="meta-llama/Llama-3.1-8B-Instruct" llama stack build --template fireworks --image-type venv --run
```
in another terminal
```
LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="meta-llama/Llama-3.1-8B-Instruct" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "meta-llama/Llama-3.1-8B-Instruct"
---------
Signed-off-by: Ben Browning <bbrownin@redhat.com>
# What does this PR do?
This PR adds two methods to the Inference API:
- `batch_completion`
- `batch_chat_completion`
The motivation is for evaluations targeting a local inference engine
(like meta-reference or vllm) where batch APIs provide for a substantial
amount of acceleration.
Why did I not add this to `Api.batch_inference` though? That just
resulted in a _lot_ more book-keeping given the structure of Llama
Stack. Had I done that, I would have needed to create a notion of a
"batch model" resource, setup routing based on that, etc. This does not
sound ideal.
So what's the future of the batch inference API? I am not sure. Maybe we
can keep it for true _asynchronous_ execution. So you can submit
requests, and it can return a Job instance, etc.
## Test Plan
Run meta-reference-gpu using:
```bash
export INFERENCE_MODEL=meta-llama/Llama-4-Scout-17B-16E-Instruct
export INFERENCE_CHECKPOINT_DIR=../checkpoints/Llama-4-Scout-17B-16E-Instruct-20250331210000
export MODEL_PARALLEL_SIZE=4
export MAX_BATCH_SIZE=32
export MAX_SEQ_LEN=6144
LLAMA_MODELS_DEBUG=1 llama stack run meta-reference-gpu
```
Then run the batch inference test case.
# What does this PR do?
This stubs in some OpenAI server-side compatibility with three new
endpoints:
/v1/openai/v1/models
/v1/openai/v1/completions
/v1/openai/v1/chat/completions
This gives common inference apps using OpenAI clients the ability to
talk to Llama Stack using an endpoint like
http://localhost:8321/v1/openai/v1 .
The two "v1" instances in there isn't awesome, but the thinking is that
Llama Stack's API is v1 and then our OpenAI compatibility layer is
compatible with OpenAI V1. And, some OpenAI clients implicitly assume
the URL ends with "v1", so this gives maximum compatibility.
The openai models endpoint is implemented in the routing layer, and just
returns all the models Llama Stack knows about.
The following providers should be working with the new OpenAI
completions and chat/completions API:
* remote::anthropic (untested)
* remote::cerebras-openai-compat (untested)
* remote::fireworks (tested)
* remote::fireworks-openai-compat (untested)
* remote::gemini (untested)
* remote::groq-openai-compat (untested)
* remote::nvidia (tested)
* remote::ollama (tested)
* remote::openai (untested)
* remote::passthrough (untested)
* remote::sambanova-openai-compat (untested)
* remote::together (tested)
* remote::together-openai-compat (untested)
* remote::vllm (tested)
The goal to support this for every inference provider - proxying
directly to the provider's OpenAI endpoint for OpenAI-compatible
providers. For providers that don't have an OpenAI-compatible API, we'll
add a mixin to translate incoming OpenAI requests to Llama Stack
inference requests and translate the Llama Stack inference responses to
OpenAI responses.
This is related to #1817 but is a bit larger in scope than just chat
completions, as I have real use-cases that need the older completions
API as well.
## Test Plan
### vLLM
```
VLLM_URL="http://localhost:8000/v1" INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" llama stack build --template remote-vllm --image-type venv --run
LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "meta-llama/Llama-3.2-3B-Instruct"
```
### ollama
```
INFERENCE_MODEL="llama3.2:3b-instruct-q8_0" llama stack build --template ollama --image-type venv --run
LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="llama3.2:3b-instruct-q8_0" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "llama3.2:3b-instruct-q8_0"
```
## Documentation
Run a Llama Stack distribution that uses one of the providers mentioned
in the list above. Then, use your favorite OpenAI client to send
completion or chat completion requests with the base_url set to
http://localhost:8321/v1/openai/v1 . Replace "localhost:8321" with the
host and port of your Llama Stack server, if different.
---------
Signed-off-by: Ben Browning <bbrownin@redhat.com>
# What does this PR do?
closes#1853
## Test Plan
```
uv run llama stack build --image-type conda --image-name ollama --config llama_stack/templates/ollama/build.yaml
ollama pull llama3.2:3b
LLAMA_STACK_CONFIG=http://localhost:8321 uv run pytest tests/integration/inference/test_text_inference.py -v --text-model=llama3.2:3b
```
# What does this PR do?
## Test Plan
export MODEL=accounts/fireworks/models/llama4-scout-instruct-basic;
LLAMA_STACK_CONFIG=verification pytest -s -v tests/integration/inference
--vision-model $MODEL --text-model $MODEL
# What does this PR do?
Move around bits. This makes the copies from llama-models _much_ easier
to maintain and ensures we don't entangle meta-reference specific
tidbits into llama-models code even by accident.
Also, kills the meta-reference-quantized-gpu distro and rolls
quantization deps into meta-reference-gpu.
## Test Plan
```
LLAMA_MODELS_DEBUG=1 \
with-proxy llama stack run meta-reference-gpu \
--env INFERENCE_MODEL=meta-llama/Llama-4-Scout-17B-16E-Instruct \
--env INFERENCE_CHECKPOINT_DIR=<DIR> \
--env MODEL_PARALLEL_SIZE=4 \
--env QUANTIZATION_TYPE=fp8_mixed
```
Start a server with and without quantization. Point integration tests to
it using:
```
pytest -s -v tests/integration/inference/test_text_inference.py \
--stack-config http://localhost:8321 --text-model meta-llama/Llama-4-Scout-17B-16E-Instruct
```
# What does this PR do?
- **chore: mypy for strong_typing**
- **chore: mypy for remote::vllm**
- **chore: mypy for remote::ollama**
- **chore: mypy for providers.datatype**
---------
Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
# What does this PR do?
This PR adds support for NVIDIA's NeMo Customizer API to the Llama Stack
post-training module. The integration enables users to fine-tune models
using NVIDIA's cloud-based customization service through a consistent
Llama Stack interface.
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]
Yet to be done
Things pending under this PR:
- [x] Integration of fine-tuned model(new checkpoint) for inference with
nvidia llm distribution
- [x] distribution integration of API
- [x] Add test cases for customizer(In Progress)
- [x] Documentation
```
LLAMA_STACK_BASE_URL=http://localhost:5002 pytest -v tests/client-sdk/post_training/test_supervised_fine_tuning.py
============================================================================================================================================================================ test session starts =============================================================================================================================================================================
platform linux -- Python 3.10.0, pytest-8.3.4, pluggy-1.5.0 -- /home/ubuntu/llama-stack/.venv/bin/python
cachedir: .pytest_cache
metadata: {'Python': '3.10.0', 'Platform': 'Linux-6.8.0-1021-gcp-x86_64-with-glibc2.35', 'Packages': {'pytest': '8.3.4', 'pluggy': '1.5.0'}, 'Plugins': {'nbval': '0.11.0', 'metadata': '3.1.1', 'anyio': '4.8.0', 'html': '4.1.1', 'asyncio': '0.25.3'}}
rootdir: /home/ubuntu/llama-stack
configfile: pyproject.toml
plugins: nbval-0.11.0, metadata-3.1.1, anyio-4.8.0, html-4.1.1, asyncio-0.25.3
asyncio: mode=strict, asyncio_default_fixture_loop_scope=None
collected 2 items
tests/client-sdk/post_training/test_supervised_fine_tuning.py::test_post_training_provider_registration[txt=8B] PASSED [ 50%]
tests/client-sdk/post_training/test_supervised_fine_tuning.py::test_list_training_jobs[txt=8B] PASSED [100%]
======================================================================================================================================================================== 2 passed, 1 warning in 0.10s ========================================================================================================================================================================
```
cc: @mattf @dglogo @sumitb
---------
Co-authored-by: Ubuntu <ubuntu@llama-stack-customizer-dev-inst-2tx95fyisatvlic4we8hidx5tfj.us-central1-a.c.brevdevprod.internal>
### What does this PR do?
Currently, `ToolCall.arguments` is a `Dict[str, RecursiveType]`.
However, on the client SDK side -- the `RecursiveType` gets deserialized
into a number ( both int and float get collapsed ) and hence when params
are `int` they get converted to float which might break client side
tools that might be doing type checking.
Closes: https://github.com/meta-llama/llama-stack/issues/1683
### Test Plan
Stainless changes --
https://github.com/meta-llama/llama-stack-client-python/pull/204
```
pytest -s -v --stack-config=fireworks tests/integration/agents/test_agents.py --text-model meta-llama/Llama-3.1-8B-Instruct
```
# What does this PR do?
support nvidia hosted 3.2 11b/90b vision models. they are not hosted on
the common https://integrate.api.nvidia.com/v1. they are hosted on their
own individual urls.
## Test Plan
`LLAMA_STACK_BASE_URL=http://localhost:8321 pytest -s -v
tests/client-sdk/inference/test_vision_inference.py
--inference-model=meta/llama-3.2-11b-vision-instruct -k image`
# What does this PR do?
Add the option to not verify SSL certificates for the remote-vllm
provider. This allows llama stack server to talk to remote LLMs which
have self-signed certificates
Partially addresses #1545
# What does this PR do?
current passthrough impl returns chatcompletion_message.content as a
TextItem() , not a straight string. so it's not compatible with other
providers, and causes parsing error downstream.
change away from the generic pydantic conversion, and explicitly parse
out content.text
## Test Plan
setup llama server with passthrough
```
llama-stack-client eval run-benchmark "MMMU_Pro_standard" --model-id meta-llama/Llama-3-8B --output-dir /tmp/ --num-examples 20
```
works without parsing error
## What does this PR do?
We noticed that the passthrough inference provider doesn't work agent
due to the type mis-match between client and server. We manually cast
the llama stack client type to llama stack server type to fix the issue.
## test
run `python -m examples.agents.hello localhost 8321` within
llama-stack-apps
<img width="1073" alt="Screenshot 2025-03-11 at 8 43 44 PM"
src="https://github.com/user-attachments/assets/bd1bdd31-606a-420c-a249-95f6184cc0b1"
/>
fix https://github.com/meta-llama/llama-stack/issues/1560
## What does this PR do?
As title, add codegen for open-benchmark template
## test
checked the new generated run.yaml file and it's identical before and
after the change
Also add small improvement to together template so that missing
TOGETHER_API_KEY won't crash the server which is the consistent user
experience as other remote providers
# What does this PR do?
remove Llama-3.2-1B-Instruct for fireworks as its no longer appears to
be hosted on website.
## Test Plan
python distro_codegen.py
# What does this PR do?
Uses together async client instead of sync client
[//]: # (If resolving an issue, uncomment and update the line below)
## Test Plan
Command to run the test is in the image below(2 tests fail, and they
were failing for the old stable version as well with the same errors.)
<img width="1689" alt="image"
src="https://github.com/user-attachments/assets/503db720-5379-425d-9844-0225010e41a1"
/>
[//]: # (## Documentation)
---------
Co-authored-by: sarthakdeshpande <sarthak.deshpande@engati.com>
Concurrent requests should not trample (or reuse) each others' provider
data. Provider data should be scoped to each request.
## Test Plan
Set the uvicorn server to have a single worker process + thread by
updating the config:
```python
uvicorn_config = {
...
"workers": 1,
"loop": "asyncio",
}
```
Then perform the following steps on `origin/main` (without this change).
(1) Run the server using `llama stack run dev` without having
`FIREWORKS_API_KEY` in the environment.
(2) Run a test by specifying the FIREWORKS_API_KEY env var so it gets
stored in the thread local
```
pytest -s -v tests/integration/inference/test_text_inference.py \
--stack-config http://localhost:8321 \
--text-model accounts/fireworks/models/llama-v3p1-8b-instruct \
-k test_text_chat_completion_with_tool_calling_and_streaming \
--env FIREWORKS_API_KEY=<...>
```
Ensure you don't have any other API keys in the environment (otherwise
the bug will not reproduce due to other specifics in our testing code.)
Verify this works.
(3) Run the same command again without specifying FIREWORKS_API_KEY. See
that the request actually succeeds when it *should have failed*.
----
Now do the same tests on this branch, verify step (3) results in
failure.
Finally, run the full `test_text_inference.py` test suite with this
change, verify it succeeds.
# What does this PR do?
This switches from an OpenAI client to the AsyncOpenAI client in the
remote vllm provider. The main benefit of this is that instead of each
client call being a blocking operation that was blocking our server
event loop, the client calls are now async operations that do not block
the event loop.
The actual fix is quite simple and straightforward. Creating a reliable
reproducer of this with a unit test that verifies we were blocking the
event loop before and are not blocking it any longer was a bit harder.
Some other inference providers have this same issue, so we may want to
make that simple delayed http server a bit more generic and pull it into
a common place as other inference providers get fixed.
(Closes#1457)
## Test Plan
I verified the unit tests and test_text_inference tests pass with this
change like below:
```
python -m pytest -v tests/unit
```
```
VLLM_URL="http://localhost:8000/v1" \
INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" \
LLAMA_STACK_CONFIG=remote-vllm \
python -m pytest -v -s \
tests/integration/inference/test_text_inference.py \
--text-model "meta-llama/Llama-3.2-3B-Instruct"
```
Signed-off-by: Ben Browning <bbrownin@redhat.com>
# What does this PR do?
This commit introduces a new logging system that allows loggers to be
assigned
a category while retaining the logger name based on the file name. The
log
format includes both the logger name and the category, producing output
like:
```
INFO 2025-03-03 21:44:11,323 llama_stack.distribution.stack:103 [core]: Tool_groups: builtin::websearch served by
tavily-search
```
Key features include:
- Category-based logging: Loggers can be assigned a category (e.g.,
"core", "server") when programming. The logger can be loaded like
this: `logger = get_logger(name=__name__, category="server")`
- Environment variable control: Log levels can be configured
per-category using the
`LLAMA_STACK_LOGGING` environment variable. For example:
`LLAMA_STACK_LOGGING="server=DEBUG;core=debug"` enables DEBUG level for
the "server"
and "core" categories.
- `LLAMA_STACK_LOGGING="all=debug"` sets DEBUG level globally for all
categories and
third-party libraries.
This provides fine-grained control over logging levels while maintaining
a clean and
informative log format.
The formatter uses the rich library which provides nice colors better
stack traces like so:
```
ERROR 2025-03-03 21:49:37,124 asyncio:1758 [uncategorized]: unhandled exception during asyncio.run() shutdown
task: <Task finished name='Task-16' coro=<handle_signal.<locals>.shutdown() done, defined at
/Users/leseb/Documents/AI/llama-stack/llama_stack/distribution/server/server.py:146>
exception=UnboundLocalError("local variable 'loop' referenced before assignment")>
╭────────────────────────────────────── Traceback (most recent call last) ───────────────────────────────────────╮
│ /Users/leseb/Documents/AI/llama-stack/llama_stack/distribution/server/server.py:178 in shutdown │
│ │
│ 175 │ │ except asyncio.CancelledError: │
│ 176 │ │ │ pass │
│ 177 │ │ finally: │
│ ❱ 178 │ │ │ loop.stop() │
│ 179 │ │
│ 180 │ loop = asyncio.get_running_loop() │
│ 181 │ loop.create_task(shutdown()) │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
UnboundLocalError: local variable 'loop' referenced before assignment
```
Co-authored-by: Ashwin Bharambe <@ashwinb>
Signed-off-by: Sébastien Han <seb@redhat.com>
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
```
python -m llama_stack.distribution.server.server --yaml-config ./llama_stack/templates/ollama/run.yaml
INFO 2025-03-03 21:55:35,918 __main__:365 [server]: Using config file: llama_stack/templates/ollama/run.yaml
INFO 2025-03-03 21:55:35,925 __main__:378 [server]: Run configuration:
INFO 2025-03-03 21:55:35,928 __main__:380 [server]: apis:
- agents
```
[//]: # (## Documentation)
---------
Signed-off-by: Sébastien Han <seb@redhat.com>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
# What does this PR do?
The commit addresses the Ruff warning B008 by refactoring the code to
avoid calling SamplingParams() directly in function argument defaults.
Instead, it either uses Field(default_factory=SamplingParams) for
Pydantic models or sets the default to None and instantiates
SamplingParams inside the function body when the argument is None.
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
This gracefully handles the case where the vLLM server responded to a
completion request with no choices, which can happen in certain vLLM
error situations. Previously, we'd error out with a stack trace about a
list index out of range. Now, we just log a warning to the user and move
past any chunks with an empty choices list.
A specific example of the type of stack trace this fixes:
```
File "/app/llama-stack-source/llama_stack/providers/remote/inference/vllm/vllm.py", line 170, in _process_vllm_chat_completion_stream_response
choice = chunk.choices[0]
~~~~~~~~~~~~~^^^
IndexError: list index out of range
```
Now, instead of erroring out with that stack trace, we log a warning
that vLLM failed to generate any completions and alert the user to check
the vLLM server logs for details.
This is related to #1277 and addresses the stack trace shown in that
issue, although does not in and of itself change the functional behavior
of vLLM tool calling.
## Test Plan
As part of this fix, I added new unit tests to trigger this same error
and verify it no longer happens. That is
`test_process_vllm_chat_completion_stream_response_no_choices` in the
new `tests/unit/providers/inference/test_remote_vllm.py`. I also added a
couple of more tests to trigger and verify the last couple of remote
vllm provider bug fixes - specifically a test for #1236 (builtin tool
calling) and #1325 (vLLM <= v0.6.3).
This required fixing the signature of
`_process_vllm_chat_completion_stream_response` to accept the actual
type of chunks it was getting passed - specifically changing from our
openai_compat `OpenAICompatCompletionResponse` to
`openai.types.chat.chat_completion_chunk.ChatCompletionChunk`. It was
not actually getting passed `OpenAICompatCompletionResponse` objects
before, and was using attributes that didn't exist on those objects. So,
the signature now matches the type of object it's actually passed.
Run these new unit tests like this:
```
pytest tests/unit/providers/inference/test_remote_vllm.py
```
Additionally, I ensured the existing `test_text_inference.py` tests
passed via:
```
VLLM_URL="http://localhost:8000/v1" \
INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" \
LLAMA_STACK_CONFIG=remote-vllm \
python -m pytest -v tests/integration/inference/test_text_inference.py \
--inference-model "meta-llama/Llama-3.2-3B-Instruct" \
--vision-inference-model ""
```
Signed-off-by: Ben Browning <bbrownin@redhat.com>