Commit graph

323 commits

Author SHA1 Message Date
Derek Higgins
2e807b38cc
chore: Add fixtures to conftest.py (#2067)
Add fixtures for SqliteKVStore, DiskDistributionRegistry and
CachedDiskDistributionRegistry. And use them in tests that had all been
duplicating similar setups.

## Test Plan
unit tests continue to run

Signed-off-by: Derek Higgins <derekh@redhat.com>
2025-05-06 13:57:48 +02:00
ehhuang
4597145011
chore: remove recordable mock (#2088)
# What does this PR do?
We've disabled it for a while given that this hasn't worked as well as
expected given the frequent changes of llama_stack_client and how this
requires both repos to be in sync.

## Test Plan

Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
2025-05-05 10:08:55 -07:00
Ashwin Bharambe
d27a0f276c fix: pytest.mark.skip, not pytest.skip 2025-05-04 13:22:06 -07:00
Ashwin Bharambe
c69f14bfaa fix: disable rag_and_code_agent test because no code interpreter anymore 2025-05-03 14:29:06 -07:00
Ben Browning
f1b103e6c8
fix: openai_compat messages system/assistant non-str content (#2095)
# What does this PR do?

When converting OpenAI message content for the "system" and "assistant"
roles to Llama Stack inference APIs (used for some providers when
dealing with Llama models via OpenAI API requests to get proper prompt /
tool handling), we were not properly converting any non-string content.

I discovered this while running the new Responses AI verification suite
against the Fireworks provider, but instead of fixing it as part of some
ongoing work there split this out into a separate PR.

This fixes that, by using the `openai_content_to_content` helper we used
elsewhere to ensure content parts were mapped properly.

## Test Plan

I added a couple of new tests to `test_openai_compat` to reproduce this
issue and validate its fix. I ran those as below:

```
python -m pytest -s -v tests/unit/providers/utils/inference/test_openai_compat.py
```

Signed-off-by: Ben Browning <bbrownin@redhat.com>
2025-05-02 13:09:27 -07:00
Ashwin Bharambe
272d3359ee
fix: remove code interpeter implementation (#2087)
# What does this PR do?

The builtin implementation of code interpreter is not robust and has a
really weak sandboxing shell (the `bubblewrap` container). Given the
availability of better MCP code interpreter servers coming up, we should
use them instead of baking an implementation into the Stack and
expanding the vulnerability surface to the rest of the Stack.

This PR only does the removal. We will add examples with how to
integrate with MCPs in subsequent ones.

## Test Plan

Existing tests.
2025-05-01 14:35:08 -07:00
Ihar Hrachyshka
9e6561a1ec
chore: enable pyupgrade fixes (#1806)
# What does this PR do?

The goal of this PR is code base modernization.

Schema reflection code needed a minor adjustment to handle UnionTypes
and collections.abc.AsyncIterator. (Both are preferred for latest Python
releases.)

Note to reviewers: almost all changes here are automatically generated
by pyupgrade. Some additional unused imports were cleaned up. The only
change worth of note can be found under `docs/openapi_generator` and
`llama_stack/strong_typing/schema.py` where reflection code was updated
to deal with "newer" types.

Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
2025-05-01 14:23:50 -07:00
Matthew Farrellee
88a796ca5a
fix: allow use of models registered at runtime (#1980)
# What does this PR do?

fix a bug where models registered at runtime could not be used.

```
$ llama-stack-client models register test-model --provider-id nvidia --provider-model-id meta/llama-3.1-70b-instruct

$ curl http://localhost:8321/v1/openai/v1/chat/completions \                                                        
-H "Content-Type: application/json" \
-d '{
  "model": "test-model",
  "messages": [{"role": "user", "content": "What is the weather like in Boston today?"}]
}'

=(client)=> {"detail":"Internal server error: An unexpected error occurred."}
=(server)=> TypeError: Missing required arguments; Expected either ('messages' and 'model') or ('messages', 'model' and 'stream') arguments to be given
```

*root cause:* test-model is not added to ModelRegistryHelper's
alias_to_provider_id_map.

as part of the fix, this adds tests for ModelRegistryHelper and defines
its expected behavior.

user visible behavior changes -

| action | existing behavior | new behavior |
| -- | -- | -- |
| double register | success (but no change) | error |
| register unknown | success (fail when used) | error |

existing behavior for register unknown model and double register -
```
$ llama-stack-client models register test-model --provider-id nvidia --provider-model-id meta/llama-3.1-70b-instruct-unknown
Successfully registered model test-model

$ llama-stack-client models list | grep test-model
│ llm │ test-model                               │ meta/llama-3.1-70b-instruct-unknown │     │ nv… │

$ llama-stack-client models register test-model --provider-id nvidia --provider-model-id meta/llama-3.1-70b-instruct       
Successfully registered model test-model

$ llama-stack-client models list | grep test-model
│ llm │ test-model                               │ meta/llama-3.1-70b-instruct-unknown │     │ nv… │
```

new behavior for register unknown -
```
$ llama-stack-client models register test-model --provider-id nvidia --provider-model-id meta/llama-3.1-70b-instruct-unknown
╭──────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Failed to register model                                                                         │
│                                                                                                  │
│ Error Type: BadRequestError                                                                      │
│ Details: Error code: 400 - {'detail': "Invalid value: Model id                                   │
│ 'meta/llama-3.1-70b-instruct-unknown' is not supported. Supported ids are:                       │
│ meta/llama-3.1-70b-instruct, snowflake/arctic-embed-l, meta/llama-3.2-1b-instruct,               │
│ nvidia/nv-embedqa-mistral-7b-v2, meta/llama-3.2-90b-vision-instruct, meta/llama-3.2-3b-instruct, │
│ meta/llama-3.2-11b-vision-instruct, meta/llama-3.1-405b-instruct, meta/llama3-8b-instruct,       │
│ meta/llama3-70b-instruct, nvidia/llama-3.2-nv-embedqa-1b-v2, meta/llama-3.1-8b-instruct,         │
│ nvidia/nv-embedqa-e5-v5"}                                                                        │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
```

new behavior for double register -
```
$ llama-stack-client models register test-model --provider-id nvidia --provider-model-id meta/llama-3.1-70b-instruct
Successfully registered model test-model

$ llama-stack-client models register test-model --provider-id nvidia --provider-model-id meta/llama-3.2-1b-instruct 
╭──────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Failed to register model                                                                         │
│                                                                                                  │
│ Error Type: BadRequestError                                                                      │
│ Details: Error code: 400 - {'detail': "Invalid value: Model id 'test-model' is already           │
│ registered. Please use a different id or unregister it first."}                                  │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
```


## Test Plan

```
uv run pytest -v tests/unit/providers/utils/test_model_registry.py
```
2025-05-01 12:00:58 -07:00
Derek Higgins
64829947d0
feat: Add temperature support to responses API (#2065)
# What does this PR do?
Add support for the temperature to the responses API 


## Test Plan
Manually tested simple case
unit tests added for simple case and tool calls

Signed-off-by: Derek Higgins <derekh@redhat.com>
2025-05-01 11:47:58 -07:00
Ben Browning
6378c2a2f3
fix: resolve BuiltinTools to strings for vllm tool_call messages (#2071)
# What does this PR do?

When the result of a ToolCall gets passed back into vLLM for the model
to handle the tool call result (as is often the case in agentic
tool-calling workflows), we forgot to handle the case where BuiltinTool
calls are not string values but instead instances of the BuiltinTool
enum. This fixes that, properly converting those enums to string values
before trying to serialize them into an OpenAI chat completion request
to vLLM.

PR #1931 fixed a bug where we weren't passing these tool calling results
back into vLLM, but as a side-effect it created this serialization bug
when using BuiltinTools.

Closes #2070

## Test Plan

I added a new unit test to the openai_compat unit tests to cover this
scenario, ensured the new test failed before this fix, and all the
existing tests there plus the new one passed with this fix.

```
python -m pytest -s -v tests/unit/providers/utils/inference/test_openai_compat.py
```

Signed-off-by: Ben Browning <bbrownin@redhat.com>
2025-05-01 08:47:29 -04:00
Sébastien Han
dc94433072
feat(pre-commit): enhance pre-commit hooks with additional checks (#2014)
# What does this PR do?

Add several new pre-commit hooks to improve code quality and security:

- no-commit-to-branch: prevent direct commits to protected branches like
`main`
- check-yaml: validate YAML files
- detect-private-key: prevent accidental commit of private keys
- requirements-txt-fixer: maintain consistent requirements.txt format
and sorting
- mixed-line-ending: enforce LF line endings to avoid mixed line endings
- check-executables-have-shebangs: ensure executable scripts have
shebangs
- check-json: validate JSON files
- check-shebang-scripts-are-executable: verify shebang scripts are
executable
- check-symlinks: validate symlinks and report broken ones
- check-toml: validate TOML files mainly for pyproject.toml

The respective fixes have been included.

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-04-30 11:35:49 -07:00
Jash Gulabrai
eab550f7d2
fix: Fix messages format in NVIDIA safety check request body (#2063)
# What does this PR do?
When running a Llama Stack server and invoking the
`/v1/safety/run-shield` endpoint, the NVIDIA Guardrails endpoint in some
cases errors with a `422: Unprocessable Entity` due to malformed input.

For example, given an request body like:
```
{
  "model": "test",
  "messages": [
    { "role": "user", "content": "You are stupid." }
  ]
}
```
`convert_pydantic_to_json_value` converts the message to:
```
{ "role": "user", "content": "You are stupid.", "context": null }
```
Which causes NVIDIA Guardrails to return an error `HTTPError: 422 Client
Error: Unprocessable Entity for url:
http://nemo.test/v1/guardrail/checks`, because `context` shouldn't be
included in the body.

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
I ran the Llama Stack server locally and manually verified that the
endpoint now succeeds.

```
message = {"role": "user", "content": "You are stupid."}
response = client.safety.run_shield(messages=[message], shield_id=shield_id, params={})
```
Server logs:
```
14:29:09.656 [START] /v1/safety/run-shield
INFO:     127.0.0.1:54616 - "POST /v1/safety/run-shield HTTP/1.1" 200 OK
14:29:09.918 [END] /v1/safety/run-shield [StatusCode.OK] (262.26ms
```

[//]: # (## Documentation)

Co-authored-by: Jash Gulabrai <jgulabrai@nvidia.com>
2025-04-30 18:01:28 +02:00
Derek Higgins
78ef6a6099
chore: Increase unit test coverage of routing_tables.py (#2057)
# What does this PR do?
Adds some unit tests for the routing logic

## Test Plan
Overall unit test coverage goes from 
TOTAL 12434 8030 35%
to
TOTAL 12434 7871 37%

Better coverage on router.py, before:

```
llama_stack/distribution/routers/routers.py | 342 | 219 | 0 | 36%
llama_stack/distribution/routers/routing_tables.py | 346 | 236 | 0 | 32%
```

After:

```
llama_stack/distribution/routers/routers.py | 342 | 219 | 0 | 36%
llama_stack/distribution/routers/routing_tables.py | 349 | 89 | 0 | 74%
```

Signed-off-by: Derek Higgins <derekh@redhat.com>
2025-04-30 16:00:43 +02:00
Roland Huß
5a2bfd6ad5
refactor: Replace SQLITE_DB_PATH by SQLITE_STORE_DIR env in templates (#2055)
# What does this PR do?

The telemetry provider configs is the only one who leverages the env var
`SQLITE_DB_PATH` for pointing to persistent data in the respective
templates, whereas usually `SQLITE_STORE_DIR` is used.

This PR modifies the `sqlite_db_path` in various telemetry configuration
files to use the environment variable `SQLITE_STORE_DIR` instead of
`SQLITE_DB_PATH`. This change ensures that _only_ the SQLITE_STORE_DIR
needs to be set to point to a different persistence location for
providers.

All references to `SQLITE_DB_PATH` have been removed.

Another improvement could be to move `sqlite_db_path` to `db_path` in
the telemetry provider config, to align with the other provider
configurations. That could be done by another PR (if wanted).
2025-04-29 15:28:10 -07:00
Ben Browning
8dfce2f596
feat: OpenAI Responses API (#1989)
# What does this PR do?

This provides an initial [OpenAI Responses
API](https://platform.openai.com/docs/api-reference/responses)
implementation. The API is not yet complete, and this is more a
proof-of-concept to show how we can store responses in our key-value
stores and use them to support the Responses API concepts like
`previous_response_id`.

## Test Plan

I've added a new
`tests/integration/openai_responses/test_openai_responses.py` as part of
a test-driven development for this new API. I'm only testing this
locally with the remote-vllm provider for now, but it should work with
any of our inference providers since the only API it requires out of the
inference provider is the `openai_chat_completion` endpoint.

```
VLLM_URL="http://localhost:8000/v1" \
INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" \
llama stack build --template remote-vllm --image-type venv --run
```

```
LLAMA_STACK_CONFIG="http://localhost:8321" \
python -m pytest -v \
  tests/integration/openai_responses/test_openai_responses.py \
  --text-model "meta-llama/Llama-3.2-3B-Instruct"
 ```

---------

Signed-off-by: Ben Browning <bbrownin@redhat.com>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
2025-04-28 14:06:00 -07:00
Sébastien Han
79851d93aa
feat: Add Kubernetes authentication (#1778)
# What does this PR do?

This commit adds a new authentication system to the Llama Stack server
with support for Kubernetes and custom authentication providers. Key
changes include:

- Implemented KubernetesAuthProvider for validating Kubernetes service
account tokens
- Implemented CustomAuthProvider for validating tokens against external
endpoints - this is the same code that was already present.
- Added test for Kubernetes
- Updated server configuration to support authentication settings
- Added documentation for authentication configuration and usage

The authentication system supports:
- Bearer token validation
- Kubernetes service account token validation
- Custom authentication endpoints

## Test Plan

Setup a Kube cluster using Kind or Minikube.

Run a server with:

```
server:
  port: 8321
  auth:
    provider_type: kubernetes
    config:
      api_server_url: http://url
      ca_cert_path: path/to/cert (optional)
```

Run:

```
curl -s -L -H "Authorization: Bearer $(kubectl create token my-user)" http://127.0.0.1:8321/v1/providers
```

Or replace "my-user" with your service account.

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-04-28 22:24:58 +02:00
Rashmi Pawar
e6bbf8d20b
feat: Add NVIDIA NeMo datastore (#1852)
# What does this PR do?
Implemetation of NeMO Datastore register, unregister API.

Open Issues: 
- provider_id gets set to `localfs` in client.datasets.register() as it
is specified in routing_tables.py: DatasetsRoutingTable
see: #1860

Currently I have passed `"provider_id":"nvidia"` in metadata and have
parsed that in `DatasetsRoutingTable`
(Not the best approach, but just a quick workaround to make it work for
now.)

## Test Plan
- Unit test cases: `pytest
tests/unit/providers/nvidia/test_datastore.py`
```bash
========================================================== test session starts ===========================================================
platform linux -- Python 3.10.0, pytest-8.3.5, pluggy-1.5.0
rootdir: /home/ubuntu/llama-stack
configfile: pyproject.toml
plugins: anyio-4.9.0, asyncio-0.26.0, nbval-0.11.0, metadata-3.1.1, html-4.1.1, cov-6.1.0
asyncio: mode=strict, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 2 items                                                                                                                        

tests/unit/providers/nvidia/test_datastore.py ..                                                                                   [100%]

============================================================ warnings summary ============================================================

====================================================== 2 passed, 1 warning in 0.84s ======================================================
```

cc: @dglogo, @mattf, @yanxi0830
2025-04-28 09:41:59 -07:00
Ashwin Bharambe
bb1a85c9a0 fix: make sure test works equally well against llama stack as a server 2025-04-25 15:24:11 -07:00
Jash Gulabrai
8713d67ce3
fix: Correctly parse algorithm_config when launching NVIDIA customization job; fix internal request handler (#2025)
# What does this PR do?
This addresses 2 bugs I ran into when launching a fine-tuning job with
the NVIDIA Adapter:
1. Session handling in `_make_request` helper function returns an error.
```
INFO:     127.0.0.1:55831 - "POST /v1/post-training/supervised-fine-tune HTTP/1.1" 500 Internal Server Error
16:11:45.643 [END] /v1/post-training/supervised-fine-tune [StatusCode.OK] (270.44ms)
 16:11:45.643 [ERROR] Error executing endpoint route='/v1/post-training/supervised-fine-tune' method='post'
Traceback (most recent call last):
  File "/Users/jgulabrai/Projects/forks/llama-stack/llama_stack/distribution/server/server.py", line 201, in endpoint
    return await maybe_await(value)
  File "/Users/jgulabrai/Projects/forks/llama-stack/llama_stack/distribution/server/server.py", line 161, in maybe_await
    return await value
  File "/Users/jgulabrai/Projects/forks/llama-stack/llama_stack/providers/remote/post_training/nvidia/post_training.py", line 408, in supervised_fine_tune
    response = await self._make_request(
  File "/Users/jgulabrai/Projects/forks/llama-stack/llama_stack/providers/remote/post_training/nvidia/post_training.py", line 98, in _make_request
    async with self.session.request(method, url, params=params, json=json, **kwargs) as response:
  File "/Users/jgulabrai/Projects/forks/llama-stack/.venv/lib/python3.10/site-packages/aiohttp/client.py", line 1425, in __aenter__
    self._resp: _RetType = await self._coro
  File "/Users/jgulabrai/Projects/forks/llama-stack/.venv/lib/python3.10/site-packages/aiohttp/client.py", line 579, in _request
    handle = tm.start()
  File "/Users/jgulabrai/Projects/forks/llama-stack/.venv/lib/python3.10/site-packages/aiohttp/helpers.py", line 587, in start
    return self._loop.call_at(when, self.__call__)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/asyncio/base_events.py", line 724, in call_at
    self._check_closed()
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/asyncio/base_events.py", line 510, in _check_closed
    raise RuntimeError('Event loop is closed')
RuntimeError: Event loop is closed
```
Note: This only occurred when initializing the client like so:
```
client = LlamaStackClient(
    base_url="http://0.0.0.0:8321"
)
response = client.post_training.supervised_fine_tune(...) # Returns error
```
I didn't run into this issue when using the library client:
```
client =  LlamaStackAsLibraryClient("nvidia")
client.initialize()
response = client.post_training.supervised_fine_tune(...) # Works fine
```

2. The `algorithm_config` param in `supervised_fine_tune` is parsed as a
`dict` when run from unit tests, but a Pydantic model when invoked using
the Llama Stack client. So, the call fails outside of unit tests:
```
INFO:     127.0.0.1:54024 - "POST /v1/post-training/supervised-fine-tune HTTP/1.1" 500 Internal Server Error
21:14:02.315 [END] /v1/post-training/supervised-fine-tune [StatusCode.OK] (71.18ms)
 21:14:02.314 [ERROR] Error executing endpoint route='/v1/post-training/supervised-fine-tune' method='post'
Traceback (most recent call last):
  File "/Users/jgulabrai/Projects/forks/llama-stack/llama_stack/distribution/server/server.py", line 205, in endpoint
    return await maybe_await(value)
  File "/Users/jgulabrai/Projects/forks/llama-stack/llama_stack/distribution/server/server.py", line 164, in maybe_await
    return await value
  File "/Users/jgulabrai/Projects/forks/llama-stack/llama_stack/providers/remote/post_training/nvidia/post_training.py", line 407, in supervised_fine_tune
    "adapter_dim": algorithm_config.get("adapter_dim"),
  File "/Users/jgulabrai/Projects/forks/llama-stack/.venv/lib/python3.10/site-packages/pydantic/main.py", line 891, in __getattr__
    raise AttributeError(f'{type(self).__name__!r} object has no attribute {item!r}')
AttributeError: 'LoraFinetuningConfig' object has no attribute 'get'
```
The code assumes `algorithm_config` should be `dict`, so I just handle
both cases.

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
1. I ran a local Llama Stack server with the necessary env vars:
```
lama stack run llama_stack/templates/nvidia/run.yaml --port 8321 --env ...
```
And invoked `supervised_fine_tune` to confirm neither of the errors
above occur.
```
client = LlamaStackClient(
    base_url="http://0.0.0.0:8321"
)
response = client.post_training.supervised_fine_tune(...)
```
2. I confirmed the unit tests still pass: `./scripts/unit-tests.sh
tests/unit/providers/nvidia/test_supervised_fine_tuning.py`

[//]: # (## Documentation)

---------

Co-authored-by: Jash Gulabrai <jgulabrai@nvidia.com>
2025-04-25 13:21:50 -07:00
Ashwin Bharambe
b5d8e44e81 fix: only sleep for tests when they pass or fail 2025-04-25 13:16:22 -07:00
Ashwin Bharambe
4fb583b407
fix: check that llama stack client plain can be used as a subst for OpenAI client (#2032)
With https://github.com/meta-llama/llama-stack-client-python/pull/226,
now we have llama-stack-client be able to used as a substitute for
OpenAI client (duck-typed) so you don't need to change downstream
library code.

<img width="1399" alt="image"
src="https://github.com/user-attachments/assets/abab6bfd-e6ff-4a7d-a965-fd93e3c105d7"
/>
2025-04-25 12:23:33 -07:00
Rashmi Pawar
ace82836c1
feat: NVIDIA allow non-llama model registration (#1859)
# What does this PR do?
Adds custom model registration functionality to NVIDIAInferenceAdapter
which let's the inference happen on:
- post-training model
- non-llama models in API Catalogue(behind
https://integrate.api.nvidia.com and endpoints compatible with
AyncOpenAI)

## Example Usage:
```python
from llama_stack.apis.models import Model, ModelType
from llama_stack.distribution.library_client import LlamaStackAsLibraryClient
client = LlamaStackAsLibraryClient("nvidia")
_ = client.initialize()

client.models.register(
        model_id=model_name,
        model_type=ModelType.llm,
        provider_id="nvidia"
)

response = client.inference.chat_completion(
    model_id=model_name,
    messages=[{"role":"system","content":"You are a helpful assistant."},{"role":"user","content":"Write a limerick about the wonders of GPU computing."}],
)
```

## Test Plan
```bash
pytest tests/unit/providers/nvidia/test_supervised_fine_tuning.py 
========================================================== test session starts ===========================================================
platform linux -- Python 3.10.0, pytest-8.3.5, pluggy-1.5.0
rootdir: /home/ubuntu/llama-stack
configfile: pyproject.toml
plugins: anyio-4.9.0
collected 6 items                                                                                                                        

tests/unit/providers/nvidia/test_supervised_fine_tuning.py ......                                                                  [100%]

============================================================ warnings summary ============================================================
../miniconda/envs/nvidia-1/lib/python3.10/site-packages/pydantic/fields.py:1076
  /home/ubuntu/miniconda/envs/nvidia-1/lib/python3.10/site-packages/pydantic/fields.py:1076: PydanticDeprecatedSince20: Using extra keyword arguments on `Field` is deprecated and will be removed. Use `json_schema_extra` instead. (Extra keys: 'contentEncoding'). Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
    warn(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
====================================================== 6 passed, 1 warning in 1.51s ======================================================
```

[//]: # (## Documentation)
Updated Readme.md

cc: @dglogo, @sumitb, @mattf
2025-04-24 17:13:33 -07:00
Jash Gulabrai
cc77f79f55
feat: Add NVIDIA Eval integration (#1890)
# What does this PR do?
This PR adds support for NVIDIA's NeMo Evaluator API to the Llama Stack
eval module. The integration enables users to evaluate models via the
Llama Stack interface.

## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]
1. Added unit tests and successfully ran from root of project:
`./scripts/unit-tests.sh tests/unit/providers/nvidia/test_eval.py`
```
tests/unit/providers/nvidia/test_eval.py::TestNVIDIAEvalImpl::test_job_cancel PASSED
tests/unit/providers/nvidia/test_eval.py::TestNVIDIAEvalImpl::test_job_result PASSED
tests/unit/providers/nvidia/test_eval.py::TestNVIDIAEvalImpl::test_job_status PASSED
tests/unit/providers/nvidia/test_eval.py::TestNVIDIAEvalImpl::test_register_benchmark PASSED
tests/unit/providers/nvidia/test_eval.py::TestNVIDIAEvalImpl::test_run_eval PASSED
```
2. Verified I could build the Llama Stack image: `LLAMA_STACK_DIR=$(pwd)
llama stack build --template nvidia --image-type venv`

Documentation added to
`llama_stack/providers/remote/eval/nvidia/README.md`

---------

Co-authored-by: Jash Gulabrai <jgulabrai@nvidia.com>
2025-04-24 17:12:42 -07:00
Ben Browning
0b6cd45950
fix: Additional streaming error handling (#2007)
# What does this PR do?

This expands the `test_sse` test suite and fixes some edge cases with
bugs in our SSE error handling to ensure streaming clients always get a
proper error response.

First, we handle the case where a client disconnects before we actually
start streaming the response back. Previously we only handled the case
where a client disconnected as we were streaming the response, but there
was an edge case where a client disconnecting before we streamed any
response back did not trigger our logic to cleanly handle that
disconnect.

Second, we handle the case where an error is thrown from the server
before the actual async generator gets created from the provider. This
happens in scenarios like the newly merged OpenAI API input validation,
where we eagerly raise validation errors before returning the async
generator object that streams the responses back.

## Test Plan

Tested via:

```
python -m pytest -s -v tests/unit/server/test_sse.py
```

Both test cases failed before, and passed afterwards. The test cases
were written based on me experimenting with actual clients that would do
bad things like randomly disconnect or send invalid input in streaming
mode and I hit these two cases, where things were misbehaving in our
error handling.

Signed-off-by: Ben Browning <bbrownin@redhat.com>
2025-04-24 17:01:45 -07:00
Derek Higgins
c8797f1125
fix: Including tool call in chat (#1931)
Include the tool call details with the chat when doing Rag with Remote
vllm

Fixes: #1929

With this PR the tool call is included in the chat returned to vllm, the
model (meta-llama/Llama-3.1-8B-Instruct) the returns the answer as
expected.

Signed-off-by: Derek Higgins <derekh@redhat.com>
2025-04-24 16:59:10 -07:00
Sébastien Han
14e60e3c02
feat: include run.yaml in the container image (#2005)
As part of the build process, we now include the generated run.yaml
(based of the provided build configuration file) into the container. We
updated the entrypoint to use this run configuration as well.

Given this simple distribution configuration:

```
# build.yaml
version: '2'
distribution_spec:
  description: Use (an external) Ollama server for running LLM inference
  providers:
    inference:
    - remote::ollama
    vector_io:
    - inline::faiss
    safety:
    - inline::llama-guard
    agents:
    - inline::meta-reference
    telemetry:
    - inline::meta-reference
    eval:
    - inline::meta-reference
    datasetio:
    - remote::huggingface
    - inline::localfs
    scoring:
    - inline::basic
    - inline::llm-as-judge
    - inline::braintrust
    tool_runtime:
    - remote::brave-search
    - remote::tavily-search
    - inline::code-interpreter
    - inline::rag-runtime
    - remote::model-context-protocol
    - remote::wolfram-alpha
  container_image: "registry.access.redhat.com/ubi9"
image_type: container
image_name: test
```

Build it:
```
llama stack build --config build.yaml
```

Run it:

```
podman run --rm \
         -p 8321:8321 \
         -e OLLAMA_URL=http://host.containers.internal:11434 \
         --name llama-stack-server \
         localhost/leseb-test:0.2.2
```

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-04-24 11:29:53 +02:00
Ben Browning
fa5dfee07b
fix: Return HTTP 400 for OpenAI API validation errors (#2002)
# What does this PR do?

When clients called the Open AI API with invalid input that wasn't
caught by our own Pydantic API validation but instead only caught by the
backend inference provider, that backend inference provider was
returning a HTTP 400 error. However, we were wrapping that into a HTTP
500 error, obfuscating the actual issue from calling clients and
triggering OpenAI client retry logic.

This change adjusts our existing `translate_exception` method in
`server.py` to wrap `openai.BadRequestError` as HTTP 400 errors, passing
through the string representation of the error message to the calling
user so they can see the actual input validation error and correct it. I
tried changing this in a few other places, but ultimately
`translate_exception` was the only real place to handle this for both
streaming and non-streaming requests across all inference providers that
use the OpenAI server APIs.

This also tightens up our validation a bit for the OpenAI chat
completions API, to catch empty `messages` parameters, invalid
`tool_choice` parameters, invalid `tools` items, or passing
`tool_choice` when `tools` isn't given.

Lastly, this extends our OpenAI API chat completions verifications to
also check for consistent input validation across providers. Providers
behind Llama Stack should automatically pass all the new tests due to
the input validation added here, but some of the providers fail this
test when not run behind Llama Stack due to differences in how they
handle input validation and errors.

(Closes #1951)

## Test Plan

To test this, start an OpenAI API  verification stack:

```
llama stack run --image-type venv tests/verifications/openai-api-verification-run.yaml
```

Then, run the new verification tests with your provider(s) of choice:

```
python -m pytest -s -v \
  tests/verifications/openai_api/test_chat_completion.py \
  --provider openai-llama-stack

python -m pytest -s -v \
  tests/verifications/openai_api/test_chat_completion.py \
  --provider together-llama-stack
```

Signed-off-by: Ben Browning <bbrownin@redhat.com>
2025-04-23 17:48:32 +02:00
Ben Browning
dc46725f56
fix: properly handle streaming client disconnects (#2000)
# What does this PR do?

Previously, when a streaming client would disconnect before we were
finished streaming the entire response, an error like the below would
get raised from the `sse_generator` function in
`llama_stack/distribution/server/server.py`:

```
AttributeError: 'coroutine' object has no attribute 'aclose'. Did you mean: 'close'?
```

This was because we were calling `aclose` on a coroutine instead of the
awaited value from that coroutine. This change fixes that, so that we
save off the awaited value and then can call `aclose` on it if we
encounter an `asyncio.CancelledError`, like we see when a client
disconnects before we're finished streaming.

The other changes in here are to add a simple set of tests for the happy
path of our SSE streaming and this client disconnect path.

That unfortunately requires adding one more dependency into our unit
test section of pyproject.toml since `server.py` requires loading some
of the telemetry code for me to test this functionality.

## Test Plan

I wrote the tests in `tests/unit/server/test_sse.py` first, verified the
client disconnected test failed before my change, and that it passed
afterwards.

```
python -m pytest -s -v tests/unit/server/test_sse.py
```

Signed-off-by: Ben Browning <bbrownin@redhat.com>
2025-04-23 15:44:28 +02:00
Sébastien Han
94f83382eb
feat: allow building distro with external providers (#1967)
# What does this PR do?

We can now build a distribution that includes external providers.
Closes: https://github.com/meta-llama/llama-stack/issues/1948

## Test Plan

Build a distro with an external provider following the doc instructions.

[//]: # (## Documentation)

Added.

Rendered:


![Screenshot 2025-04-18 at 11 26
39](https://github.com/user-attachments/assets/afcf3d50-8d30-48c3-8d24-06a4b3662881)

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-04-18 17:18:28 +02:00
ehhuang
0ed41aafbf
test: add multi_image test (#1972)
# What does this PR do?


## Test Plan
pytest tests/verifications/openai_api/test_chat_completion.py --provider
openai -k 'test_chat_multiple_images'
2025-04-17 12:51:42 -07:00
ehhuang
2976b5d992
fix: OAI compat endpoint for meta reference inference provider (#1962)
Test plan:
python tests/verifications/generate_report.py --providers
fireworks,together,llama_meta_ref,openai

Co-authored-by: Eric Huang <erichuang@fb.com>
2025-04-17 11:16:04 -07:00
ehhuang
8bd6665775
chore(verification): update README and reorganize generate_report.py (#1978)
# What does this PR do?


## Test Plan
uv run --with-editable ".[dev]" python
tests/verifications/generate_report.py --run-tests
2025-04-17 10:41:22 -07:00
Alexey Rybak
326cbba579
feat(agents): add agent naming functionality (#1922)
# What does this PR do?
Allow users to name an agent and use the name in telemetry instead of
relying on randomly generated agent_ids. This improves the developer
experience by making it easier to find specific agents in telemetry
logs.

Closes #1832

## Test Plan

- Added tests to verify the agent name is properly stored and retrieved
- Ran `uv run -- pytest -v
tests/integration/telemetry/test_telemetry.py::test_agent_name_filtering`
from the root of the project and made sure the tests pass
- Ran `uv run -- pytest -v
tests/integration/telemetry/test_telemetry.py::test_agent_query_spans`
to verify existing code without agent names still works correctly

## Use Example
```
agent = Agent(
    llama_stack_client, 
    model=text_model_id, 
    name="CustomerSupportAgent",  # New parameter
    instructions="You are a helpful customer support assistant"
)
session_id = agent.create_session(f"test-session-{uuid4()}")
```

## Implementation Notes
- Agent names are optional string parameters with no additional
validation
- Names are not required to be unique - multiple agents can have the
same name
- The agent_id remains the unique identifier for an agent

---------

Co-authored-by: raghotham <raghotham@gmail.com>
2025-04-17 07:02:47 -07:00
Jash Gulabrai
45e08ff417
fix: Handle case when Customizer Job status is unknown (#1965)
# What does this PR do?
This PR handles the case where a Customization Job's status is
`unknown`. Since we don't map `unknown` to a valid `JobStatus`, the
PostTraining provider throws an exception when fetching/listing a job.

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]
`./scripts/unit-tests.sh
tests/unit/providers/nvidia/test_supervised_fine_tuning.py` succeeds

[//]: # (## Documentation)

Co-authored-by: Jash Gulabrai <jgulabrai@nvidia.com>
2025-04-17 10:27:07 +02:00
Alexey Rybak
8f57b08f2c
fix(build): always pass path when no template/config provided (#1982)
# What does this PR do?

Fixes a crash that occurred when building a stack as a container image
via the interactive wizard without supplying --template or --config.

- Root cause: template_or_config was None; only the container path
relies on that parameter, which later reaches subprocess.run() and
triggers

`TypeError: expected str, bytes or os.PathLike object, not NoneType.`

- Change: in `_run_stack_build_command_from_build_config` we now fall
back to the freshly‑written build‑spec file whenever both optional
sources are missing. Also adds a spy‑based unit test that asserts a
valid string path is passed to build_image() for container builds.

### Closes #1976

## Test Plan

- New unit test: test_build_path.py. Monkey‑patches build_image,
captures the fourth argument, and verifies it is a real path
- Manual smoke test: 

```
llama stack build --image-type container
# answer wizard prompts

```

Build proceeds into Docker without raising the previous TypeError.

## Future Work
Harmonise `build_image` arguments so every image type receives the same
inputs, eliminating this asymmetric special‑case.
2025-04-17 10:20:43 +02:00
ehhuang
b44f84ce18
test: disable flaky dataset (#1979)
# What does this PR do?


## Test Plan
2025-04-16 15:33:37 -07:00
Daniel Alvarez Sanchez
b5a9ef4c6d
fix: Do not send an empty 'tools' list to remote vllm (#1957)
Fixes: #1955

Since 0.2.0, the vLLM gets an empty list (vs ``None``in 0.1.9 and
before) when there are no tools configured which causes the issue
described in #1955 p. This patch avoids sending the 'tools' param to the
vLLM altogether instead of an empty list.

It also adds a small unit test to avoid regressions.

The OpenAI
[specification](https://platform.openai.com/docs/api-reference/chat/create)
does not explicitly state that the list cannot be empty but I found this
out through experimentation and it might depend on the actual remote
vllm. In any case, as this parameter is Optional, is best to skip it
altogether if there's no tools configured.

Signed-off-by: Daniel Alvarez <dalvarez@redhat.com>
2025-04-15 20:31:12 -04:00
ehhuang
32e3da7392
test(verification): more tests, multiturn tool use tests (#1954)
# What does this PR do?


## Test Plan
(myenv) ➜ llama-stack python tests/verifications/generate_report.py
--providers fireworks,together,openai --run-tests

f27f617629/tests/verifications/REPORT.md
2025-04-14 18:45:22 -07:00
Ihar Hrachyshka
3ed4316ed5
feat: Implement async job execution for torchtune training (#1437)
# What does this PR do?

Now a separate thread is started to execute training jobs. Training
requests now return job ID before the job completes. (Which fixes API
timeouts for any jobs that take longer than a minute.)

Note: the scheduler code is meant to be spun out in the future into a
common provider service that can be reused for different APIs and
providers. It is also expected to back the /jobs API proposed here:

https://github.com/meta-llama/llama-stack/discussions/1238

Hence its somewhat generalized form which is expected to simplify its
adoption elsewhere in the future.

Note: this patch doesn't attempt to implement missing APIs (e.g. cancel
or job removal). This work will belong to follow-up PRs.

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]

Added unit tests for the scheduler module. For the API coverage, did
manual testing and was able to run a training cycle on GPU. The initial
call returned job ID before the training completed, as (now) expected.
Artifacts are returned as expected.

```
JobArtifactsResponse(checkpoints=[{'identifier': 'meta-llama/Llama-3.2-3B-Instruct-sft-0', 'created_at': '2025-03-07T22:45:19.892714', 'epoch': 0, 'post_training_job_id': 'test-job2ee77104-2fd3-4a4e-84cf-f83f8b8f1f50', 'path': '/home/ec2-user/.llama/checkpoints/meta-llama/Llama-3.2-3B-Instruct-sft-0', 'training_metrics': None}], job_uuid='test-job2ee77104-2fd3-4a4e-84cf-f83f8b8f1f50')
```

The integration test is currently disabled for the provider. I will look
into how it can be enabled in a different PR / issue context.

[//]: # (## Documentation)

Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
2025-04-14 08:59:11 -07:00
Ben Browning
7641a5cd0b
fix: 100% OpenAI API verification for together and fireworks (#1946)
# What does this PR do?

TLDR: Changes needed to get 100% passing tests for OpenAI API
verification tests when run against Llama Stack with the `together`,
`fireworks`, and `openai` providers. And `groq` is better than before,
at 88% passing.

This cleans up the OpenAI API support for image message types
(specifically `image_url` types) and handling of the `response_format`
chat completion parameter. Both of these required a few more Pydantic
model definitions in our Inference API, just to move from the
not-quite-right stubs I had in place to something fleshed out to match
the actual OpenAI API specs.

As part of testing this, I also found and fixed a bug in the litellm
implementation of openai_completion and openai_chat_completion, so the
providers based on those should actually be working now.

The method `prepare_openai_completion_params` in
`llama_stack/providers/utils/inference/openai_compat.py` was improved to
actually recursively clean up input parameters, including handling of
lists, dicts, and dumping of Pydantic models to dicts. These changes
were required to get to 100% passing tests on the OpenAI API
verification against the `openai` provider.

With the above, the together.ai provider was passing as well as it is
without Llama Stack. But, since we have Llama Stack in the middle, I
took the opportunity to clean up the together.ai provider so that it now
also passes the OpenAI API spec tests we have at 100%. That means
together.ai is now passing our verification test better when using an
OpenAI client talking to Llama Stack than it is when hitting together.ai
directly, without Llama Stack in the middle.

And, another round of work for Fireworks to improve translation of
incoming OpenAI chat completion requests to Llama Stack chat completion
requests gets the fireworks provider passing at 100%. The server-side
fireworks.ai tool calling support with OpenAI chat completions and Llama
4 models isn't great yet, but by pointing the OpenAI clients at Llama
Stack's API we can clean things up and get everything working as
expected for Llama 4 models.

## Test Plan

### OpenAI API Verification Tests

I ran the OpenAI API verification tests as below and 100% of the tests
passed.

First, start a Llama Stack server that runs the `openai` provider with
the `gpt-4o` and `gpt-4o-mini` models deployed. There's not a template
setup to do this out of the box, so I added a
`tests/verifications/openai-api-verification-run.yaml` to do this.

First, ensure you have the necessary API key environment variables set:

```
export TOGETHER_API_KEY="..."
export FIREWORKS_API_KEY="..."
export OPENAI_API_KEY="..."
```

Then, run a Llama Stack server that serves up all these providers:

```
llama stack run \
      --image-type venv \
      tests/verifications/openai-api-verification-run.yaml
```

Finally, generate a new verification report against all these providers,
both with and without the Llama Stack server in the middle.

```
python tests/verifications/generate_report.py \
      --run-tests \
      --provider \
        together \
        fireworks \
        groq \
        openai \
        together-llama-stack \
        fireworks-llama-stack \
        groq-llama-stack \
        openai-llama-stack
```

You'll see that most of the configurations with Llama Stack in the
middle now pass at 100%, even though some of them do not pass at 100%
when hitting the backend provider's API directly with an OpenAI client.

### OpenAI Completion Integration Tests with vLLM:

I also ran the smaller `test_openai_completion.py` test suite (that's
not yet merged with the verification tests) on multiple of the
providers, since I had to adjust the method signature of
openai_chat_completion a bit and thus had to touch lots of these
providers to match. Here's the tests I ran there, all passing:

```
VLLM_URL="http://localhost:8000/v1" INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" llama stack build --template remote-vllm --image-type venv --run
```

in another terminal

```
LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "meta-llama/Llama-3.2-3B-Instruct"
```

### OpenAI Completion Integration Tests with ollama

```
INFERENCE_MODEL="llama3.2:3b-instruct-q8_0" llama stack build --template ollama --image-type venv --run
```

in another terminal

```
LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="llama3.2:3b-instruct-q8_0" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "llama3.2:3b-instruct-q8_0"
```

### OpenAI Completion Integration Tests with together.ai

```
INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct-Turbo" llama stack build --template together --image-type venv --run
```

in another terminal

```
LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct-Turbo" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "meta-llama/Llama-3.2-3B-Instruct-Turbo"
```

### OpenAI Completion Integration Tests with fireworks.ai

```
INFERENCE_MODEL="meta-llama/Llama-3.1-8B-Instruct" llama stack build --template fireworks --image-type venv --run
```

in another terminal

```
LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="meta-llama/Llama-3.1-8B-Instruct" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "meta-llama/Llama-3.1-8B-Instruct"

---------

Signed-off-by: Ben Browning <bbrownin@redhat.com>
2025-04-14 08:56:29 -07:00
Ashwin Bharambe
429f6de7d7 fix: misc fixes for tests kill horrible warnings 2025-04-12 17:12:11 -07:00
ehhuang
ad86a68a32
feat: support '-' in tool names (#1807)
# What does this PR do?
titled

## Test Plan
added new unit tests
pytest -s -v tests/unit/models/llama/llama3/test_tool_utils.py
2025-04-12 14:23:03 -07:00
Ashwin Bharambe
ef3dc143ec fix: test_registration was borked somehow 2025-04-12 12:04:01 -07:00
Ashwin Bharambe
f34f22f8c7
feat: add batch inference API to llama stack inference (#1945)
# What does this PR do?

This PR adds two methods to the Inference API:
- `batch_completion`
- `batch_chat_completion`

The motivation is for evaluations targeting a local inference engine
(like meta-reference or vllm) where batch APIs provide for a substantial
amount of acceleration.

Why did I not add this to `Api.batch_inference` though? That just
resulted in a _lot_ more book-keeping given the structure of Llama
Stack. Had I done that, I would have needed to create a notion of a
"batch model" resource, setup routing based on that, etc. This does not
sound ideal.

So what's the future of the batch inference API? I am not sure. Maybe we
can keep it for true _asynchronous_ execution. So you can submit
requests, and it can return a Job instance, etc.

## Test Plan

Run meta-reference-gpu using:
```bash
export INFERENCE_MODEL=meta-llama/Llama-4-Scout-17B-16E-Instruct
export INFERENCE_CHECKPOINT_DIR=../checkpoints/Llama-4-Scout-17B-16E-Instruct-20250331210000
export MODEL_PARALLEL_SIZE=4
export MAX_BATCH_SIZE=32
export MAX_SEQ_LEN=6144

LLAMA_MODELS_DEBUG=1 llama stack run meta-reference-gpu
```

Then run the batch inference test case.
2025-04-12 11:41:12 -07:00
Ben Browning
2b2db5fbda
feat: OpenAI-Compatible models, completions, chat/completions (#1894)
# What does this PR do?

This stubs in some OpenAI server-side compatibility with three new
endpoints:

/v1/openai/v1/models
/v1/openai/v1/completions
/v1/openai/v1/chat/completions

This gives common inference apps using OpenAI clients the ability to
talk to Llama Stack using an endpoint like
http://localhost:8321/v1/openai/v1 .

The two "v1" instances in there isn't awesome, but the thinking is that
Llama Stack's API is v1 and then our OpenAI compatibility layer is
compatible with OpenAI V1. And, some OpenAI clients implicitly assume
the URL ends with "v1", so this gives maximum compatibility.

The openai models endpoint is implemented in the routing layer, and just
returns all the models Llama Stack knows about.

The following providers should be working with the new OpenAI
completions and chat/completions API:
* remote::anthropic (untested)
* remote::cerebras-openai-compat (untested)
* remote::fireworks (tested)
* remote::fireworks-openai-compat (untested)
* remote::gemini (untested)
* remote::groq-openai-compat (untested)
* remote::nvidia (tested)
* remote::ollama (tested)
* remote::openai (untested)
* remote::passthrough (untested)
* remote::sambanova-openai-compat (untested)
* remote::together (tested)
* remote::together-openai-compat (untested)
* remote::vllm (tested)

The goal to support this for every inference provider - proxying
directly to the provider's OpenAI endpoint for OpenAI-compatible
providers. For providers that don't have an OpenAI-compatible API, we'll
add a mixin to translate incoming OpenAI requests to Llama Stack
inference requests and translate the Llama Stack inference responses to
OpenAI responses.

This is related to #1817 but is a bit larger in scope than just chat
completions, as I have real use-cases that need the older completions
API as well.

## Test Plan

### vLLM

```
VLLM_URL="http://localhost:8000/v1" INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" llama stack build --template remote-vllm --image-type venv --run

LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "meta-llama/Llama-3.2-3B-Instruct"
```

### ollama
```
INFERENCE_MODEL="llama3.2:3b-instruct-q8_0" llama stack build --template ollama --image-type venv --run

LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="llama3.2:3b-instruct-q8_0" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "llama3.2:3b-instruct-q8_0"
```



## Documentation

Run a Llama Stack distribution that uses one of the providers mentioned
in the list above. Then, use your favorite OpenAI client to send
completion or chat completion requests with the base_url set to
http://localhost:8321/v1/openai/v1 . Replace "localhost:8321" with the
host and port of your Llama Stack server, if different.

---------

Signed-off-by: Ben Browning <bbrownin@redhat.com>
2025-04-11 13:14:17 -07:00
Jash Gulabrai
c1cb6aad11
feat: Add unit tests for NVIDIA safety (#1897)
# What does this PR do?
This PR adds unit tests for the NVIDIA Safety provider implementation.

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]
1. Ran `./scripts/unit-tests.sh
tests/unit/providers/nvidia/test_safety.py` from the root of the
project. Verified tests pass.
```
tests/unit/providers/nvidia/test_safety.py::TestNVIDIASafetyAdapter::test_init_nemo_guardrails Initializing NVIDIASafetyAdapter(http://nemo.test)...
PASSED
tests/unit/providers/nvidia/test_safety.py::TestNVIDIASafetyAdapter::test_init_nemo_guardrails_invalid_temperature Initializing NVIDIASafetyAdapter(http://nemo.test)...
PASSED
tests/unit/providers/nvidia/test_safety.py::TestNVIDIASafetyAdapter::test_register_shield_with_valid_id Initializing NVIDIASafetyAdapter(http://nemo.test)...
PASSED
tests/unit/providers/nvidia/test_safety.py::TestNVIDIASafetyAdapter::test_register_shield_without_id Initializing NVIDIASafetyAdapter(http://nemo.test)...
PASSED
tests/unit/providers/nvidia/test_safety.py::TestNVIDIASafetyAdapter::test_run_shield_allowed Initializing NVIDIASafetyAdapter(http://nemo.test)...
PASSED
tests/unit/providers/nvidia/test_safety.py::TestNVIDIASafetyAdapter::test_run_shield_blocked Initializing NVIDIASafetyAdapter(http://nemo.test)...
PASSED
tests/unit/providers/nvidia/test_safety.py::TestNVIDIASafetyAdapter::test_run_shield_http_error Initializing NVIDIASafetyAdapter(http://nemo.test)...
PASSED
tests/unit/providers/nvidia/test_safety.py::TestNVIDIASafetyAdapter::test_run_shield_not_found Initializing NVIDIASafetyAdapter(http://nemo.test)...
PASSED
```

[//]: # (## Documentation)

---------

Co-authored-by: Jash Gulabrai <jgulabrai@nvidia.com>
2025-04-11 11:49:55 -07:00
ehhuang
2fcb70b789
test(verification): overwrite test result instead of creating new ones (#1934)
# What does this PR do?


## Test Plan
(myenv) ➜ llama-stack python tests/verifications/generate_report.py
--providers fireworks,together,openai --run-tests
2025-04-10 16:59:28 -07:00
ehhuang
a4cc4b7e31
test(verification): add streaming tool calling test (#1933)
# What does this PR do?


## Test Plan

---
[//]: # (BEGIN SAPLING FOOTER)
Stack created with [Sapling](https://sapling-scm.com). Best reviewed
with
[ReviewStack](https://reviewstack.dev/meta-llama/llama-stack/pull/1933).
* #1934
* __->__ #1933
2025-04-10 16:58:06 -07:00
Francisco Arceo
de6ec5803e
fix: Fix linter failures from #1921 (#1932)
# What does this PR do?
fix: Fix linter failures from #1921

Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
2025-04-10 10:37:31 -07:00
ehhuang
14146e4b3f
feat(verification): various improvements (#1921)
# What does this PR do?
- provider and their models now live in config.yaml
- better distinguish different cases within a test
- add model key to surface provider's model_id
- include example command to rerun single test case

## Test Plan
<img width="1173" alt="image"
src="https://github.com/user-attachments/assets/b414baf0-c768-451f-8c3b-c2905cf36fac"
/>
2025-04-10 10:26:19 -07:00