Commit graph

106 commits

Author SHA1 Message Date
Sébastien Han
dbdc811d16
chore: isolate bare minimum project dependencies (#2282)
Some checks failed
Integration Tests / test-matrix (library, 3.12, tool_runtime) (push) Failing after 12s
Integration Tests / test-matrix (http, 3.12, datasets) (push) Failing after 20s
Integration Tests / test-matrix (library, 3.13, post_training) (push) Failing after 10s
Integration Tests / test-matrix (library, 3.12, inspect) (push) Failing after 14s
Integration Tests / test-matrix (library, 3.13, datasets) (push) Failing after 7s
Test Llama Stack Build / generate-matrix (push) Successful in 7s
Integration Tests / test-matrix (http, 3.13, scoring) (push) Failing after 16s
Integration Tests / test-matrix (library, 3.12, providers) (push) Failing after 16s
Integration Tests / test-matrix (http, 3.12, tool_runtime) (push) Failing after 18s
Test Llama Stack Build / build-custom-container-distribution (push) Failing after 8s
Python Package Build Test / build (3.12) (push) Failing after 5s
Integration Tests / test-matrix (library, 3.12, vector_io) (push) Failing after 17s
Python Package Build Test / build (3.13) (push) Failing after 4s
Test Llama Stack Build / build-single-provider (push) Failing after 8s
Integration Tests / test-matrix (library, 3.13, inspect) (push) Failing after 8s
Integration Tests / test-matrix (library, 3.13, inference) (push) Failing after 11s
Integration Tests / test-matrix (http, 3.12, inference) (push) Failing after 26s
Integration Tests / test-matrix (http, 3.12, scoring) (push) Failing after 19s
Integration Tests / test-matrix (http, 3.13, vector_io) (push) Failing after 15s
Integration Tests / test-matrix (library, 3.13, scoring) (push) Failing after 11s
Integration Tests / test-matrix (library, 3.13, tool_runtime) (push) Failing after 8s
Test External Providers / test-external-providers (venv) (push) Failing after 8s
Integration Tests / test-matrix (library, 3.13, providers) (push) Failing after 10s
Test Llama Stack Build / build-ubi9-container-distribution (push) Failing after 12s
Unit Tests / unit-tests (3.12) (push) Failing after 7s
Integration Tests / test-matrix (library, 3.13, vector_io) (push) Failing after 10s
Unit Tests / unit-tests (3.13) (push) Failing after 6s
Update ReadTheDocs / update-readthedocs (push) Failing after 4s
Test Llama Stack Build / build (push) Failing after 7s
Pre-commit / pre-commit (push) Successful in 48s
# What does this PR do?

The goal is to promote the minimal set of dependencies the project needs
to run, this includes:

* dependencies needed to work with the CLI
* dependencies needed for the server to run with no providers

This also:
* Relocate redundant dependencies out of the core project and into the
  individual providers that actually require them.
* Include all necessary server dependencies so the project can run
  standalone, even without any providers.

<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->

<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->

## Test Plan

Build and run distro a server.

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-06-26 10:14:27 +02:00
Sébastien Han
ac5fd57387
chore: remove nested imports (#2515)
# What does this PR do?

* Given that our API packages use "import *" in `__init.py__` we don't
need to do `from llama_stack.apis.models.models` but simply from
llama_stack.apis.models. The decision to use `import *` is debatable and
should probably be revisited at one point.

* Remove unneeded Ruff F401 rule
* Consolidate Ruff F403 rule in the pyprojectfrom
llama_stack.apis.models.models

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-06-26 08:01:05 +05:30
Sébastien Han
9c8be89fb6
chore: bump python supported version to 3.12 (#2475)
Some checks failed
Integration Tests / test-matrix (library, 3.12, providers) (push) Failing after 7s
Integration Tests / test-matrix (library, 3.12, inspect) (push) Failing after 12s
Integration Tests / test-matrix (library, 3.12, datasets) (push) Failing after 16s
Test Llama Stack Build / build-single-provider (push) Failing after 9s
Integration Tests / test-matrix (library, 3.13, inference) (push) Failing after 10s
Integration Tests / test-matrix (library, 3.12, agents) (push) Failing after 7s
Python Package Build Test / build (3.13) (push) Failing after 5s
Test Llama Stack Build / build-ubi9-container-distribution (push) Failing after 7s
Integration Tests / test-matrix (http, 3.13, datasets) (push) Failing after 14s
Integration Tests / test-matrix (library, 3.12, tool_runtime) (push) Failing after 15s
Integration Tests / test-matrix (library, 3.13, agents) (push) Failing after 14s
Integration Tests / test-matrix (library, 3.13, datasets) (push) Failing after 11s
Integration Tests / test-matrix (library, 3.13, vector_io) (push) Failing after 10s
Integration Tests / test-matrix (library, 3.13, scoring) (push) Failing after 11s
Integration Tests / test-matrix (library, 3.13, post_training) (push) Failing after 12s
Integration Tests / test-matrix (library, 3.12, inference) (push) Failing after 12s
Integration Tests / test-matrix (http, 3.13, providers) (push) Failing after 13s
Integration Tests / test-matrix (library, 3.12, vector_io) (push) Failing after 14s
Integration Tests / test-matrix (library, 3.13, tool_runtime) (push) Failing after 7s
Integration Tests / test-matrix (library, 3.12, post_training) (push) Failing after 11s
Unit Tests / unit-tests (3.12) (push) Failing after 7s
Integration Tests / test-matrix (library, 3.13, inspect) (push) Failing after 6s
Update ReadTheDocs / update-readthedocs (push) Failing after 5s
Unit Tests / unit-tests (3.13) (push) Failing after 8s
Test Llama Stack Build / build (push) Failing after 6s
Integration Tests / test-matrix (library, 3.13, providers) (push) Failing after 41s
Python Package Build Test / build (3.12) (push) Failing after 33s
Test Llama Stack Build / build-custom-container-distribution (push) Failing after 36s
Test External Providers / test-external-providers (venv) (push) Failing after 31s
Pre-commit / pre-commit (push) Successful in 1m54s
# What does this PR do?

The project now supports Python >= 3.12

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-06-24 09:22:04 +05:30
ehhuang
6fde601765
chore: upgrade hf hub dependency (#2487)
Some checks failed
Integration Tests / test-matrix (library, 3.11, vector_io) (push) Failing after 6s
Integration Tests / test-matrix (library, 3.11, tool_runtime) (push) Failing after 7s
Integration Tests / test-matrix (library, 3.11, providers) (push) Failing after 9s
Integration Tests / test-matrix (library, 3.12, inference) (push) Failing after 7s
Integration Tests / test-matrix (library, 3.12, agents) (push) Failing after 9s
Integration Tests / test-matrix (library, 3.12, inspect) (push) Failing after 8s
Integration Tests / test-matrix (http, 3.12, tool_runtime) (push) Failing after 10s
Integration Tests / test-matrix (library, 3.12, post_training) (push) Failing after 8s
Test Llama Stack Build / generate-matrix (push) Successful in 7s
Test Llama Stack Build / build-custom-container-distribution (push) Failing after 6s
Integration Tests / test-matrix (library, 3.11, scoring) (push) Failing after 10s
Integration Tests / test-matrix (library, 3.12, vector_io) (push) Failing after 9s
Python Package Build Test / build (3.11) (push) Failing after 2s
Integration Tests / test-matrix (library, 3.12, scoring) (push) Failing after 10s
Test Llama Stack Build / build-ubi9-container-distribution (push) Failing after 4s
Integration Tests / test-matrix (library, 3.12, datasets) (push) Failing after 9s
Python Package Build Test / build (3.13) (push) Failing after 2s
Integration Tests / test-matrix (library, 3.12, providers) (push) Failing after 8s
Python Package Build Test / build (3.12) (push) Failing after 4s
Unit Tests / unit-tests (3.12) (push) Failing after 5s
Test External Providers / test-external-providers (venv) (push) Failing after 8s
Unit Tests / unit-tests (3.13) (push) Failing after 6s
Update ReadTheDocs / update-readthedocs (push) Failing after 11s
Unit Tests / unit-tests (3.11) (push) Failing after 13s
Test Llama Stack Build / build (push) Failing after 8s
Integration Tests / test-matrix (library, 3.12, tool_runtime) (push) Failing after 33s
Test Llama Stack Build / build-single-provider (push) Failing after 31s
Pre-commit / pre-commit (push) Successful in 1m12s
Integration Tests / test-matrix (library, 3.11, datasets) (push) Failing after 11s
Integration Tests / test-matrix (library, 3.11, post_training) (push) Failing after 10s
# What does this PR do?
CI tests have been failing with
.venv/lib/python3.12/site-packages/peft/auto.py:21: in <module>
    from transformers import (
.venv/lib/python3.12/site-packages/transformers/__init__.py:27: in
<module>
    from . import dependency_versions_check

.venv/lib/python3.12/site-packages/transformers/dependency_versions_check.py:57:
in <module>
    require_version_core(deps[pkg])
.venv/lib/python3.12/site-packages/transformers/utils/versions.py:117:
in require_version_core
    return require_version(requirement, hint)
.venv/lib/python3.12/site-packages/transformers/utils/versions.py:111:
in require_version
    _compare_versions(op, got_ver, want_ver, requirement, pkg, hint)
.venv/lib/python3.12/site-packages/transformers/utils/versions.py:44: in
_compare_versions
    raise ImportError(
E ImportError: huggingface-hub>=0.30.0,<1.0 is required for a normal
functioning of this module, but found huggingface-hub==0.29.0.
E Try: `pip install transformers -U` or `pip install -e '.[dev]'` if
you're working with git main
------------------------------ Captured log setup
------------------------------
INFO llama_stack.providers.remote.inference.ollama.ollama:ollama.py:106
checking connectivity to Ollama at `http://0.0.0.0:11434`.../
=========================== short test summary info
============================
ERROR
tests/integration/providers/test_providers.py::TestProviders::test_providers
- ImportError: huggingface-hub>=0.30.0,<1.0 is required for a normal
functioning of this module, but found huggingface-hub==0.29.0.
Try: `pip install transformers -U` or `pip install -e '.[dev]'` if
you're working with git main
=================== 1 skipped, 4 warnings, 1 error in 9.52s
====================

## Test Plan
CI
2025-06-20 15:50:54 -07:00
github-actions[bot]
d70573bd47 build: Bump version to 0.2.12 2025-06-20 21:06:17 +00:00
Charlie Doern
d12f195f56
feat: drop python 3.10 support (#2469)
# What does this PR do?

dropped python3.10, updated pyproject and dependencies, and also removed
some blocks of code with special handling for enum.StrEnum

Closes #2458

Signed-off-by: Charlie Doern <cdoern@redhat.com>
2025-06-19 12:07:14 +05:30
github-actions[bot]
7d812e3bf0 build: Bump version to 0.2.11
Some checks failed
Integration Tests / test-matrix (library, 3.10, post_training) (push) Failing after 8s
Integration Tests / test-matrix (library, 3.10, providers) (push) Failing after 12s
Integration Tests / test-matrix (library, 3.10, scoring) (push) Failing after 10s
Integration Tests / test-matrix (library, 3.10, vector_io) (push) Failing after 8s
Integration Tests / test-matrix (library, 3.10, tool_runtime) (push) Failing after 9s
Integration Tests / test-matrix (library, 3.11, inspect) (push) Failing after 4s
Integration Tests / test-matrix (library, 3.11, agents) (push) Failing after 8s
Integration Tests / test-matrix (library, 3.11, scoring) (push) Failing after 4s
Integration Tests / test-matrix (library, 3.11, providers) (push) Failing after 5s
Integration Tests / test-matrix (library, 3.11, inference) (push) Failing after 7s
Integration Tests / test-matrix (library, 3.11, datasets) (push) Failing after 7s
Integration Tests / test-matrix (library, 3.11, tool_runtime) (push) Failing after 6s
Integration Tests / test-matrix (library, 3.11, vector_io) (push) Failing after 7s
Integration Tests / test-matrix (library, 3.12, agents) (push) Failing after 8s
Integration Tests / test-matrix (library, 3.12, post_training) (push) Failing after 6s
Integration Tests / test-matrix (library, 3.11, post_training) (push) Failing after 9s
Integration Tests / test-matrix (library, 3.12, vector_io) (push) Failing after 6s
Integration Tests / test-matrix (library, 3.12, inspect) (push) Failing after 8s
Integration Tests / test-matrix (library, 3.12, providers) (push) Failing after 8s
Unit Tests / unit-tests (3.12) (push) Failing after 5s
Integration Tests / test-matrix (library, 3.12, datasets) (push) Failing after 10s
Integration Tests / test-matrix (library, 3.12, scoring) (push) Failing after 8s
Test External Providers / test-external-providers (venv) (push) Failing after 7s
Integration Tests / test-matrix (library, 3.12, inference) (push) Failing after 10s
Unit Tests / unit-tests (3.10) (push) Failing after 7s
Update ReadTheDocs / update-readthedocs (push) Failing after 7s
Integration Tests / test-matrix (library, 3.12, tool_runtime) (push) Failing after 10s
Unit Tests / unit-tests (3.11) (push) Failing after 9s
Unit Tests / unit-tests (3.13) (push) Failing after 17s
Pre-commit / pre-commit (push) Successful in 55s
2025-06-17 19:08:17 +00:00
Ibrahim Haroon
a34cef925b
fix(faiss): handle case where distance is 0 by setting d to minimum positive… (#2387)
# What does this PR do?
Adds try-catch to faiss `query_vector` function for when the distance
between the query embedding and an embedding within the vector db is 0
(identical vectors). Catches `ZeroDivisionError` and then appends `(1.0
/ sys.float_info.min)` to `scores` to represent maximum similarity.

<!-- If resolving an issue, uncomment and update the line below -->
Closes [#2381]

## Test Plan
Checkout this PR

Execute this code and there will no longer be a `ZeroDivisionError`
exception
```
from llama_stack_client import LlamaStackClient

base_url = "http://localhost:8321"
client = LlamaStackClient(base_url=base_url)

models = client.models.list()
embedding_model = (
    em := next(m for m in models if m.model_type == "embedding")
).identifier
embedding_dimension = 384

_ = client.vector_dbs.register(
    vector_db_id="foo_db",
    embedding_model=embedding_model,
    embedding_dimension=embedding_dimension,
    provider_id="faiss",
)

chunk = {
    "content": "foo",
    "mime_type": "text/plain",
    "metadata": {
        "document_id": "foo-id"
    }
}

client.vector_io.insert(vector_db_id="foo_db", chunks=[chunk])
client.vector_io.query(vector_db_id="foo_db", query="foo")
```

### Running unit tests
`uv run pytest tests/unit/rag/test_rag_query.py -v`

---------

Signed-off-by: Ben Browning <bbrownin@redhat.com>
Co-authored-by: Ben Browning <bbrownin@redhat.com>
2025-06-07 16:09:46 -04:00
github-actions[bot]
692709cd45 build: Bump version to 0.2.10
Some checks failed
Integration Tests / test-matrix (library, 3.10, tool_runtime) (push) Failing after 8s
Integration Tests / test-matrix (library, 3.10, scoring) (push) Failing after 10s
Integration Tests / test-matrix (library, 3.11, agents) (push) Failing after 10s
Integration Tests / test-matrix (library, 3.11, datasets) (push) Failing after 8s
Integration Tests / test-matrix (library, 3.11, inspect) (push) Failing after 8s
Integration Tests / test-matrix (library, 3.11, post_training) (push) Failing after 9s
Integration Tests / test-matrix (library, 3.11, inference) (push) Failing after 10s
Integration Tests / test-matrix (library, 3.11, providers) (push) Failing after 10s
Integration Tests / test-matrix (library, 3.11, scoring) (push) Failing after 9s
Integration Tests / test-matrix (library, 3.12, agents) (push) Failing after 8s
Integration Tests / test-matrix (library, 3.11, tool_runtime) (push) Failing after 10s
Integration Tests / test-matrix (library, 3.12, datasets) (push) Failing after 9s
Integration Tests / test-matrix (library, 3.12, inference) (push) Failing after 9s
Integration Tests / test-matrix (library, 3.12, inspect) (push) Failing after 10s
Integration Tests / test-matrix (library, 3.12, post_training) (push) Failing after 8s
Integration Tests / test-matrix (library, 3.12, providers) (push) Failing after 10s
Integration Tests / test-matrix (library, 3.12, scoring) (push) Failing after 8s
Integration Tests / test-matrix (library, 3.12, tool_runtime) (push) Failing after 9s
Test Llama Stack Build / generate-matrix (push) Successful in 6s
Test Llama Stack Build / build-custom-container-distribution (push) Failing after 7s
Test Llama Stack Build / build-ubi9-container-distribution (push) Failing after 6s
Test External Providers / test-external-providers (venv) (push) Failing after 7s
Unit Tests / unit-tests (3.10) (push) Failing after 8s
Unit Tests / unit-tests (3.11) (push) Failing after 7s
Unit Tests / unit-tests (3.12) (push) Failing after 7s
Update ReadTheDocs / update-readthedocs (push) Failing after 6s
Unit Tests / unit-tests (3.13) (push) Failing after 9s
Test Llama Stack Build / build-single-provider (push) Failing after 27s
Test Llama Stack Build / build (push) Failing after 7s
Pre-commit / pre-commit (push) Failing after 1m16s
2025-06-05 22:56:39 +00:00
Hardik Shah
102516f33c
fix: Pin fastapi to avoid picking up spurious versions in test pypi (#2409)
as titled
2025-06-05 15:33:30 -07:00
Hardik Shah
04592b9590
fix: update pyproject to include recursive LS deps (#2404)
trying to run `llama` cli after installing wheel fails with this error 
```
Traceback (most recent call last):
  File "/tmp/tmp.wdZath9U6j/.venv/bin/llama", line 4, in <module>
    from llama_stack.cli.llama import main
  File "/tmp/tmp.wdZath9U6j/.venv/lib/python3.10/site-packages/llama_stack/__init__.py", line 7, in <module>
    from llama_stack.distribution.library_client import (  # noqa: F401
ModuleNotFoundError: No module named 'llama_stack.distribution.library_client'
``` 

This PR fixes it by ensurring that all sub-directories of `llama_stack`
are also included.

Also, fixes the missing `fastapi` dependency issue.
2025-06-05 11:46:48 -07:00
ehhuang
3c9a10d2fe
feat: reference implementation for files API (#2330)
Some checks failed
Integration Auth Tests / test-matrix (oauth2_token) (push) Failing after 2s
Integration Tests / test-matrix (http, post_training) (push) Failing after 9s
Integration Tests / test-matrix (http, agents) (push) Failing after 10s
Integration Tests / test-matrix (http, providers) (push) Failing after 8s
Integration Tests / test-matrix (http, inference) (push) Failing after 11s
Integration Tests / test-matrix (http, inspect) (push) Failing after 10s
Integration Tests / test-matrix (http, datasets) (push) Failing after 11s
Integration Tests / test-matrix (library, datasets) (push) Failing after 8s
Integration Tests / test-matrix (http, scoring) (push) Failing after 10s
Integration Tests / test-matrix (library, inference) (push) Failing after 8s
Integration Tests / test-matrix (library, agents) (push) Failing after 10s
Integration Tests / test-matrix (http, tool_runtime) (push) Failing after 11s
Integration Tests / test-matrix (library, inspect) (push) Failing after 8s
Test External Providers / test-external-providers (venv) (push) Failing after 7s
Integration Tests / test-matrix (library, post_training) (push) Failing after 9s
Integration Tests / test-matrix (library, scoring) (push) Failing after 8s
Integration Tests / test-matrix (library, tool_runtime) (push) Failing after 8s
Integration Tests / test-matrix (library, providers) (push) Failing after 9s
Unit Tests / unit-tests (3.11) (push) Failing after 7s
Unit Tests / unit-tests (3.10) (push) Failing after 7s
Unit Tests / unit-tests (3.12) (push) Failing after 8s
Unit Tests / unit-tests (3.13) (push) Failing after 8s
Update ReadTheDocs / update-readthedocs (push) Failing after 6s
Pre-commit / pre-commit (push) Successful in 53s
# What does this PR do?
TSIA
Added Files provider to the fireworks template. Might want to add to all
templates as a follow-up.

## Test Plan
llama-stack pytest tests/unit/files/test_files.py

llama-stack llama stack build --template fireworks --image-type conda
--run
LLAMA_STACK_CONFIG=http://localhost:8321 pytest -s -v
tests/integration/files/
2025-06-02 21:54:24 -07:00
Sébastien Han
af65207ebd
chore: help setuptools finding the project path (#2333) 2025-06-02 07:20:46 -07:00
github-actions[bot]
ad15276da1 build: Bump version to 0.2.9
Some checks failed
Integration Auth Tests / test-matrix (oauth2_token) (push) Failing after 4s
Integration Tests / test-matrix (http, inspect) (push) Failing after 9s
Integration Tests / test-matrix (http, providers) (push) Failing after 9s
Integration Tests / test-matrix (http, agents) (push) Failing after 10s
Integration Tests / test-matrix (library, agents) (push) Failing after 8s
Integration Tests / test-matrix (http, scoring) (push) Failing after 9s
Integration Tests / test-matrix (http, datasets) (push) Failing after 10s
Integration Tests / test-matrix (http, post_training) (push) Failing after 10s
Integration Tests / test-matrix (http, inference) (push) Failing after 11s
Integration Tests / test-matrix (library, inference) (push) Failing after 8s
Integration Tests / test-matrix (http, tool_runtime) (push) Failing after 10s
Integration Tests / test-matrix (library, datasets) (push) Failing after 10s
Test External Providers / test-external-providers (venv) (push) Failing after 5s
Integration Tests / test-matrix (library, inspect) (push) Failing after 7s
Integration Tests / test-matrix (library, post_training) (push) Failing after 8s
Unit Tests / unit-tests (3.10) (push) Failing after 7s
Integration Tests / test-matrix (library, providers) (push) Failing after 9s
Integration Tests / test-matrix (library, scoring) (push) Failing after 9s
Unit Tests / unit-tests (3.11) (push) Failing after 8s
Integration Tests / test-matrix (library, tool_runtime) (push) Failing after 9s
Unit Tests / unit-tests (3.12) (push) Failing after 8s
Update ReadTheDocs / update-readthedocs (push) Failing after 6s
Unit Tests / unit-tests (3.13) (push) Failing after 10s
Pre-commit / pre-commit (push) Failing after 1m34s
2025-05-30 19:43:09 +00:00
Sébastien Han
63a9f08c9e
chore: use starlette built-in Route class (#2267)
# What does this PR do?

Use a more common pattern and known terminology from the ecosystem,
where Route is more approved than Endpoint.

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-05-28 09:53:33 -07:00
Sébastien Han
6352078e4b
chore: use groups when running commands (#2298)
# What does this PR do?

Followup of https://github.com/meta-llama/llama-stack/pull/2287. We must
use `--group` when running commands with uv.

<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->

<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-05-28 09:13:16 -07:00
Sébastien Han
4f3f28f718
chore: use dependency-groups for dev (#2287)
# What does this PR do?

The previous `[project.optional-dependencies]` was misrepresenting what
the packages were. They were NOT optional dependencies to the project
but development dependencies. Unlike optional dependencies, development
dependencies are local-only and will not be included in the project
requirements when published to PyPI or other indexes. As such,
development dependencies are not included in the [project] table.
Additionally, the dev group is synced by default.

Source:

https://docs.astral.sh/uv/concepts/projects/dependencies/#development-dependencies

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-05-27 23:00:17 +02:00
github-actions[bot]
7105a25b0f build: Bump version to 0.2.8 2025-05-27 20:28:29 +00:00
Sébastien Han
448f00903d
chore: mark blobpath as optional (#2271)
# What does this PR do?

This is not a core dependency of the distro server. It's only necessary
when using `inline::rag-runtime` or `inline::meta-reference` providers.

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-05-27 10:55:24 +02:00
Ashwin Bharambe
298721c238
chore: split routing_tables into individual files (#2259) 2025-05-24 23:15:05 -07:00
Ashwin Bharambe
3faf1e4a79
feat: enable MCP execution in Responses impl (#2240)
## Test Plan

```
pytest -s -v 'tests/verifications/openai_api/test_responses.py' \
  --provider=stack:together --model meta-llama/Llama-4-Scout-17B-16E-Instruct
```
2025-05-24 14:20:42 -07:00
ehhuang
b054023800
chore: add sqlalchemy to test dependencies (#2236)
# What does this PR do?


## Test Plan
2025-05-23 10:33:38 -07:00
ehhuang
549812f51e
feat: implement get chat completions APIs (#2200)
# What does this PR do?
* Provide sqlite implementation of the APIs introduced in
https://github.com/meta-llama/llama-stack/pull/2145.
* Introduced a SqlStore API: llama_stack/providers/utils/sqlstore/api.py
and the first Sqlite implementation
* Pagination support will be added in a future PR.

## Test Plan
Unit test on sql store:
<img width="1005" alt="image"
src="https://github.com/user-attachments/assets/9b8b7ec8-632b-4667-8127-5583426b2e29"
/>


Integration test:
```
INFERENCE_MODEL="llama3.2:3b-instruct-fp16" llama stack build --template ollama --image-type conda --run
```
```
LLAMA_STACK_CONFIG=http://localhost:5001 INFERENCE_MODEL="llama3.2:3b-instruct-fp16" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "llama3.2:3b-instruct-fp16" -k 'inference_store and openai'
```
2025-05-21 22:21:52 -07:00
Jorge Piedrahita Ortiz
633bb9c5b3
feat(providers): sambanova safety provider (#2221)
# What does this PR do?

Includes SambaNova safety adaptor to use the sambanova cloud served
Meta-Llama-Guard-3-8B
minor updates in sambanova docs

## Test Plan
pytest -s -v tests/integration/safety/test_safety.py
--stack-config=sambanova --safety-shield=sambanova/Meta-Llama-Guard-3-8B
2025-05-21 15:33:02 -07:00
Sébastien Han
85b5f3172b
docs: misc cleanup (#2223)
# What does this PR do?

* remove requirements.txt to use pyproject.toml as the source of truth
* update relevant docs

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-05-21 17:35:27 +02:00
Sébastien Han
c25acedbcd
chore: remove k8s auth in favor of k8s jwks endpoint (#2216)
# What does this PR do?

Kubernetes since 1.20 exposes a JWKS endpoint that we can use with our
recent oauth2 recent implementation.
The CI test has been kept intact for validation.

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-05-21 16:23:54 +02:00
Ashwin Bharambe
c7015d3d60
feat: introduce OAuth2TokenAuthProvider and notion of "principal" (#2185)
This PR adds a notion of `principal` (aka some kind of persistent
identity) to the authentication infrastructure of the Stack. Until now
we only used access attributes ("claims" in the more standard OAuth /
OIDC setup) but we need the notion of a User fundamentally as well.
(Thanks @rhuss for bringing this up.)

This value is not yet _used_ anywhere downstream but will be used to
segregate access to resources.

In addition, the PR introduces a built-in JWT token validator so the
Stack does not need to contact an authentication provider to validating
the authorization and merely check the signed token for the represented
claims. Public keys are refreshed via the configured JWKS server. This
Auth Provider should overwhelmingly be considered the default given the
seamless integration it offers with OAuth setups.
2025-05-18 17:54:19 -07:00
Charlie Doern
f02f7b28c1
feat: add huggingface post_training impl (#2132)
# What does this PR do?


adds an inline HF SFTTrainer provider. Alongside touchtune -- this is a
super popular option for running training jobs. The config allows a user
to specify some key fields such as a model, chat_template, device, etc

the provider comes with one recipe `finetune_single_device` which works
both with and without LoRA.

any model that is a valid HF identifier can be given and the model will
be pulled.

this has been tested so far with CPU and MPS device types, but should be
compatible with CUDA out of the box

The provider processes the given dataset into the proper format,
establishes the various steps per epoch, steps per save, steps per eval,
sets a sane SFTConfig, and runs n_epochs of training

if checkpoint_dir is none, no model is saved. If there is a checkpoint
dir, a model is saved every `save_steps` and at the end of training.


## Test Plan

re-enabled post_training integration test suite with a singular test
that loads the simpleqa dataset:
https://huggingface.co/datasets/llamastack/simpleqa and a tiny granite
model: https://huggingface.co/ibm-granite/granite-3.3-2b-instruct. The
test now uses the llama stack client and the proper post_training API

runs one step with a batch_size of 1. This test runs on CPU on the
Ubuntu runner so it needs to be a small batch and a single step.

[//]: # (## Documentation)

---------

Signed-off-by: Charlie Doern <cdoern@redhat.com>
2025-05-16 14:41:28 -07:00
github-actions[bot]
65cf076f13 build: Bump version to 0.2.7 2025-05-16 20:32:06 +00:00
Ashwin Bharambe
1a6d4af5e9
refactor: rename dev distro as starter (#2181)
We want this to be a "flagship" distribution we can advertize to a
segment of users to get started quickly. This distro should package a
bunch of remote providers and some cheap inline providers so they get a
solid "AI Platform in a box" setup instantly.
2025-05-15 12:52:34 -07:00
Francisco Arceo
8e7ab146f8
feat: Adding support for customizing chunk context in RAG insertion and querying (#2134)
# What does this PR do?
his PR allows users to customize the template used for chunks when
inserted into the context. Additionally, this enables metadata injection
into the context of an LLM for RAG. This makes a naive and crude
assumption that each chunk should include the metadata, this is
obviously redundant when multiple chunks are returned from the same
document. In order to remove any sort of duplication of chunks, we'd
have to make much more significant changes so this is a reasonable first
step that unblocks users requesting this enhancement in
https://github.com/meta-llama/llama-stack/issues/1767.

In the future, this can be extended to support citations.


List of Changes:
- `llama_stack/apis/tools/rag_tool.py`
    - Added  `chunk_template` field in `RAGQueryConfig`.
- Added `field_validator` to validate the `chunk_template` field in
`RAGQueryConfig`.
- Ensured the `chunk_template` field includes placeholders `{index}` and
`{chunk.content}`.
- Updated the `query` method to use the `chunk_template` for formatting
chunk text content.
- `llama_stack/providers/inline/tool_runtime/rag/memory.py`
- Modified the `insert` method to pass `doc.metadata` for chunk
creation.
- Enhanced the `query` method to format results using `chunk_template`
and exclude unnecessary metadata fields like `token_count`.
- `llama_stack/providers/utils/memory/vector_store.py`
- Updated `make_overlapped_chunks` to include metadata serialization and
token count for both content and metadata.
    - Added error handling for metadata serialization issues.
- `pyproject.toml`
- Added `pydantic.field_validator` as a recognized `classmethod`
decorator in the linting configuration.
- `tests/integration/tool_runtime/test_rag_tool.py`
- Refactored test assertions to separate `assert_valid_chunk_response`
and `assert_valid_text_response`.
- Added integration tests to validate `chunk_template` functionality
with and without metadata inclusion.
- Included a test case to ensure `chunk_template` validation errors are
raised appropriately.
- `tests/unit/rag/test_vector_store.py`
- Added unit tests for `make_overlapped_chunks`, verifying chunk
creation with overlapping tokens and metadata integrity.
- Added tests to handle metadata serialization errors, ensuring proper
exception handling.
- `docs/_static/llama-stack-spec.html`
- Added a new `chunk_template` field of type `string` with a default
template for formatting retrieved chunks in RAGQueryConfig.
    - Updated the `required` fields to include `chunk_template`.
- `docs/_static/llama-stack-spec.yaml`
- Introduced `chunk_template` field with a default value for
RAGQueryConfig.
- Updated the required configuration list to include `chunk_template`.
- `docs/source/building_applications/rag.md`
- Documented the `chunk_template` configuration, explaining how to
customize metadata formatting in RAG queries.
- Added examples demonstrating the usage of the `chunk_template` field
in RAG tool queries.
    - Highlighted default values for `RAG` agent configurations.

# Resolves https://github.com/meta-llama/llama-stack/issues/1767

## Test Plan
Updated both `test_vector_store.py` and `test_rag_tool.py` and tested
end-to-end with a script.

I also tested the quickstart to enable this and specified this metadata:
```python
document = RAGDocument(
    document_id="document_1",
    content=source,
    mime_type="text/html",
    metadata={"author": "Paul Graham", "title": "How to do great work"},
)
```
Which produced the output below: 

![Screenshot 2025-05-13 at 10 53
43 PM](https://github.com/user-attachments/assets/bb199d04-501e-4217-9c44-4699d43d5519)

This highlights the usefulness of the additional metadata. Notice how
the metadata is redundant for different chunks of the same document. I
think we can update that in a subsequent PR.

# Documentation
I've added a brief comment about this in the documentation to outline
this to users and updated the API documentation.

---------

Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
2025-05-14 21:56:20 -04:00
github-actions[bot]
23d9f3b1fb build: Bump version to 0.2.6 2025-05-12 18:02:05 +00:00
Sébastien Han
1a529705da
chore: more mypy fixes (#2029)
# What does this PR do?

Mainly tried to cover the entire llama_stack/apis directory, we only
have one left. Some excludes were just noop.

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-05-06 09:52:31 -07:00
Christian Zaccaria
18d2312690
fix: test_datasets HF scenario in CI (#2090)
# What does this PR do?
**Fixes** #1959 

HuggingFace provides several loading paths that the datasets library can
use. My theory on why the test would previously fail intermittently is
because when calling `load_dataset(...)`, it may be trying several
options such as local cache, Hugging Face Hub, or a dataset script, or
other. There's one of these options that seem to work inconsistently in
the CI.

The HuggingFace datasets library relies on the `transformers` package to
load certain datasets such as `llamastack/simpleqa`, and by adding the
package, we can see the dataset is loaded consistently via the Hugging
Face Hub.

Please see PR in my fork demonstrating over 7 consecutive passes:
https://github.com/ChristianZaccaria/llama-stack/pull/1 

**Some References:**
- https://github.com/huggingface/transformers/issues/8690
- https://huggingface.co/docs/datasets/en/loading 

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]

[//]: # (## Documentation)
2025-05-06 14:09:15 +02:00
github-actions[bot]
6b4c218788 build: Bump version to 0.2.5 2025-05-03 21:31:01 +00:00
Ihar Hrachyshka
9e6561a1ec
chore: enable pyupgrade fixes (#1806)
# What does this PR do?

The goal of this PR is code base modernization.

Schema reflection code needed a minor adjustment to handle UnionTypes
and collections.abc.AsyncIterator. (Both are preferred for latest Python
releases.)

Note to reviewers: almost all changes here are automatically generated
by pyupgrade. Some additional unused imports were cleaned up. The only
change worth of note can be found under `docs/openapi_generator` and
`llama_stack/strong_typing/schema.py` where reflection code was updated
to deal with "newer" types.

Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
2025-05-01 14:23:50 -07:00
Sébastien Han
2c7aba4158
fix: enforce stricter ASCII rules lint rules in Ruff (#2062)
# What does this PR do?

- Added new Ruff lint rules to detect ambiguous or non-ASCII characters:
- Added per-file ignores where Unicode usage is still required.
- Fixed whatever had to be fixed

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-04-30 18:05:27 +02:00
Ashwin Bharambe
799286fe52 fix: Bump version to 0.2.4 2025-04-29 10:34:17 -07:00
Ashwin Bharambe
4d0bfbf984
feat: add api.llama provider, llama-guard-4 model (#2058)
This PR adds a llama-stack inference provider for `api.llama.com`, as
well as adds entries for Llama-Guard-4 and updated Prompt-Guard models.
2025-04-29 10:07:41 -07:00
Sébastien Han
79851d93aa
feat: Add Kubernetes authentication (#1778)
# What does this PR do?

This commit adds a new authentication system to the Llama Stack server
with support for Kubernetes and custom authentication providers. Key
changes include:

- Implemented KubernetesAuthProvider for validating Kubernetes service
account tokens
- Implemented CustomAuthProvider for validating tokens against external
endpoints - this is the same code that was already present.
- Added test for Kubernetes
- Updated server configuration to support authentication settings
- Added documentation for authentication configuration and usage

The authentication system supports:
- Bearer token validation
- Kubernetes service account token validation
- Custom authentication endpoints

## Test Plan

Setup a Kube cluster using Kind or Minikube.

Run a server with:

```
server:
  port: 8321
  auth:
    provider_type: kubernetes
    config:
      api_server_url: http://url
      ca_cert_path: path/to/cert (optional)
```

Run:

```
curl -s -L -H "Authorization: Bearer $(kubectl create token my-user)" http://127.0.0.1:8321/v1/providers
```

Or replace "my-user" with your service account.

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-04-28 22:24:58 +02:00
Rashmi Pawar
e6bbf8d20b
feat: Add NVIDIA NeMo datastore (#1852)
# What does this PR do?
Implemetation of NeMO Datastore register, unregister API.

Open Issues: 
- provider_id gets set to `localfs` in client.datasets.register() as it
is specified in routing_tables.py: DatasetsRoutingTable
see: #1860

Currently I have passed `"provider_id":"nvidia"` in metadata and have
parsed that in `DatasetsRoutingTable`
(Not the best approach, but just a quick workaround to make it work for
now.)

## Test Plan
- Unit test cases: `pytest
tests/unit/providers/nvidia/test_datastore.py`
```bash
========================================================== test session starts ===========================================================
platform linux -- Python 3.10.0, pytest-8.3.5, pluggy-1.5.0
rootdir: /home/ubuntu/llama-stack
configfile: pyproject.toml
plugins: anyio-4.9.0, asyncio-0.26.0, nbval-0.11.0, metadata-3.1.1, html-4.1.1, cov-6.1.0
asyncio: mode=strict, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 2 items                                                                                                                        

tests/unit/providers/nvidia/test_datastore.py ..                                                                                   [100%]

============================================================ warnings summary ============================================================

====================================================== 2 passed, 1 warning in 0.84s ======================================================
```

cc: @dglogo, @mattf, @yanxi0830
2025-04-28 09:41:59 -07:00
Yuan Tang
28687b0e85
fix: Bump h11 to 0.16.0 to fix cve-2025-43859 (#2041)
This resolves a new critical severity on h11. See
https://access.redhat.com/security/cve/cve-2025-43859. We should
consider releasing a new patch with this fix.

This was updated via:

```
uv add "h11>=0.16.0"
uv export --frozen --no-hashes --no-emit-project --output-file=requirements.txt
```

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
2025-04-27 11:45:35 -07:00
Sajikumar JS
1bb1d9b2ba
feat: Add watsonx inference adapter (#1895)
# What does this PR do?
IBM watsonx ai added as the inference [#1741
](https://github.com/meta-llama/llama-stack/issues/1741)

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

---------

Co-authored-by: Sajikumar JS <sajikumar.js@ibm.com>
2025-04-25 11:29:21 -07:00
Ben Browning
dc46725f56
fix: properly handle streaming client disconnects (#2000)
# What does this PR do?

Previously, when a streaming client would disconnect before we were
finished streaming the entire response, an error like the below would
get raised from the `sse_generator` function in
`llama_stack/distribution/server/server.py`:

```
AttributeError: 'coroutine' object has no attribute 'aclose'. Did you mean: 'close'?
```

This was because we were calling `aclose` on a coroutine instead of the
awaited value from that coroutine. This change fixes that, so that we
save off the awaited value and then can call `aclose` on it if we
encounter an `asyncio.CancelledError`, like we see when a client
disconnects before we're finished streaming.

The other changes in here are to add a simple set of tests for the happy
path of our SSE streaming and this client disconnect path.

That unfortunately requires adding one more dependency into our unit
test section of pyproject.toml since `server.py` requires loading some
of the telemetry code for me to test this functionality.

## Test Plan

I wrote the tests in `tests/unit/server/test_sse.py` first, verified the
client disconnected test failed before my change, and that it passed
afterwards.

```
python -m pytest -s -v tests/unit/server/test_sse.py
```

Signed-off-by: Ben Browning <bbrownin@redhat.com>
2025-04-23 15:44:28 +02:00
ehhuang
8bd6665775
chore(verification): update README and reorganize generate_report.py (#1978)
# What does this PR do?


## Test Plan
uv run --with-editable ".[dev]" python
tests/verifications/generate_report.py --run-tests
2025-04-17 10:41:22 -07:00
Ashwin Bharambe
ff14773fa7 fix: update llama stack client dependency 2025-04-12 18:14:33 -07:00
Ben Browning
2b2db5fbda
feat: OpenAI-Compatible models, completions, chat/completions (#1894)
# What does this PR do?

This stubs in some OpenAI server-side compatibility with three new
endpoints:

/v1/openai/v1/models
/v1/openai/v1/completions
/v1/openai/v1/chat/completions

This gives common inference apps using OpenAI clients the ability to
talk to Llama Stack using an endpoint like
http://localhost:8321/v1/openai/v1 .

The two "v1" instances in there isn't awesome, but the thinking is that
Llama Stack's API is v1 and then our OpenAI compatibility layer is
compatible with OpenAI V1. And, some OpenAI clients implicitly assume
the URL ends with "v1", so this gives maximum compatibility.

The openai models endpoint is implemented in the routing layer, and just
returns all the models Llama Stack knows about.

The following providers should be working with the new OpenAI
completions and chat/completions API:
* remote::anthropic (untested)
* remote::cerebras-openai-compat (untested)
* remote::fireworks (tested)
* remote::fireworks-openai-compat (untested)
* remote::gemini (untested)
* remote::groq-openai-compat (untested)
* remote::nvidia (tested)
* remote::ollama (tested)
* remote::openai (untested)
* remote::passthrough (untested)
* remote::sambanova-openai-compat (untested)
* remote::together (tested)
* remote::together-openai-compat (untested)
* remote::vllm (tested)

The goal to support this for every inference provider - proxying
directly to the provider's OpenAI endpoint for OpenAI-compatible
providers. For providers that don't have an OpenAI-compatible API, we'll
add a mixin to translate incoming OpenAI requests to Llama Stack
inference requests and translate the Llama Stack inference responses to
OpenAI responses.

This is related to #1817 but is a bit larger in scope than just chat
completions, as I have real use-cases that need the older completions
API as well.

## Test Plan

### vLLM

```
VLLM_URL="http://localhost:8000/v1" INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" llama stack build --template remote-vllm --image-type venv --run

LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "meta-llama/Llama-3.2-3B-Instruct"
```

### ollama
```
INFERENCE_MODEL="llama3.2:3b-instruct-q8_0" llama stack build --template ollama --image-type venv --run

LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="llama3.2:3b-instruct-q8_0" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "llama3.2:3b-instruct-q8_0"
```



## Documentation

Run a Llama Stack distribution that uses one of the providers mentioned
in the list above. Then, use your favorite OpenAI client to send
completion or chat completion requests with the base_url set to
http://localhost:8321/v1/openai/v1 . Replace "localhost:8321" with the
host and port of your Llama Stack server, if different.

---------

Signed-off-by: Ben Browning <bbrownin@redhat.com>
2025-04-11 13:14:17 -07:00
Sébastien Han
770b38f8b5
chore: simplify running the demo UI (#1907)
# What does this PR do?

* Manage UI deps in pyproject
* Use a new "ui" dep group to pull the deps with "uv"
* Simplify the run command
* Bump versions in requirements.txt

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-04-09 11:22:29 -07:00
Ashwin Bharambe
530d4bdfe1
refactor: move all llama code to models/llama out of meta reference (#1887)
# What does this PR do?

Move around bits. This makes the copies from llama-models _much_ easier
to maintain and ensures we don't entangle meta-reference specific
tidbits into llama-models code even by accident.

Also, kills the meta-reference-quantized-gpu distro and rolls
quantization deps into meta-reference-gpu.

## Test Plan

```
LLAMA_MODELS_DEBUG=1 \
  with-proxy llama stack run meta-reference-gpu \
  --env INFERENCE_MODEL=meta-llama/Llama-4-Scout-17B-16E-Instruct \
   --env INFERENCE_CHECKPOINT_DIR=<DIR> \
   --env MODEL_PARALLEL_SIZE=4 \
   --env QUANTIZATION_TYPE=fp8_mixed
```

Start a server with and without quantization. Point integration tests to
it using:

```
pytest -s -v  tests/integration/inference/test_text_inference.py \
   --stack-config http://localhost:8321 --text-model meta-llama/Llama-4-Scout-17B-16E-Instruct
```
2025-04-07 15:03:58 -07:00
Ashwin Bharambe
5a31e66a91 fix: update llama-stack-client dependency to fix integration tests 2025-04-06 19:11:05 -07:00