Commit graph

1975 commits

Author SHA1 Message Date
Ashwin Bharambe
ae7272d8ff
chore: split routers into individual files (datasets) (#2249) 2025-05-24 22:11:43 -07:00
Ashwin Bharambe
a2160dc0af
chore: split routers into individual files (safety)
Reviewers:
bbrowning, leseb, ehhuang, terrytangyuan, raghotham, yanxi0830, hardikjshah

Reviewed By: raghotham

Pull Request: https://github.com/meta-llama/llama-stack/pull/2248
2025-05-24 22:00:32 -07:00
Ashwin Bharambe
c290999c63
fix(telemetry): get rid of annoying sqlite span export error (#2245) 2025-05-24 20:24:34 -07:00
Ashwin Bharambe
3faf1e4a79
feat: enable MCP execution in Responses impl (#2240)
## Test Plan

```
pytest -s -v 'tests/verifications/openai_api/test_responses.py' \
  --provider=stack:together --model meta-llama/Llama-4-Scout-17B-16E-Instruct
```
2025-05-24 14:20:42 -07:00
Ashwin Bharambe
66f09f24ed
fix: disable test_responses_store (#2244)
The test depends on llama's tool calling ability. In the CI, we run with
a small ollama model.

The fix might be to check for either message or function_call because
the model is flaky and we aren't really testing that behavior?
2025-05-24 08:18:06 -07:00
raghotham
84751f3e55
fix: skip failing tests (#2243)
as title. trying release 0.2.8
2025-05-24 07:31:08 -07:00
Yuan Tang
a411029d7e
docs: Update CHANGELOG.md (#2241)
# What does this PR do?

This PR adds release notes for recent releases.

---------

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
2025-05-24 07:06:36 -07:00
ehhuang
15b0a67555
feat: add responses input items api (#2239)
# What does this PR do?
TSIA

## Test Plan
added integration and unit tests
2025-05-24 07:05:53 -07:00
Yuan Tang
055f48b6a2
fix(security): Upgrade setuptools to v80.8.0. Fixes CVE-2025-47273 (#2242)
# What does this PR do?

This fixes a high vulnerable CVE in `setuptools`:
https://github.com/advisories/GHSA-5rjg-fvgr-3xxf

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
Co-authored-by: Francisco Arceo <arceofrancisco@gmail.com>
2025-05-24 06:57:24 -07:00
ehhuang
ca65617a71
feat: start ui server in llama stack run (#2170)
# What does this PR do?
TSIA
`--enable-ui` to enable


## Test Plan
`llama stack run dev --image-type conda --enable-ui`
`localhost:8322` shows UI


llama stack run dev --image-type conda
`localhost:8322` does not work
2025-05-23 20:00:09 -07:00
ehhuang
5844c2da68
feat: add list responses API (#2233)
# What does this PR do?
This is not part of the official OpenAI API, but we'll use this for the
logs UI.
In order to support more filtering options, I'm adopting the newly
introduced sql store in in place of the kv store.

## Test Plan
Added integration/unit tests.
2025-05-23 13:16:48 -07:00
Ashwin Bharambe
6463ee7633
feat: allow using llama-stack-library-client from verifications (#2238)
Having to run (and re-run) a server while running verifications can be
annoying while you are iterating on code. This makes it so you can use
the library client -- and because it is OpenAI client compatible, it all
works.

## Test Plan

```
pytest -s -v tests/verifications/openai_api/test_responses.py \
   --provider=stack:together \
   --model meta-llama/Llama-4-Scout-17B-16E-Instruct
```
2025-05-23 11:43:41 -07:00
Ashwin Bharambe
558d109ab7
fix: signature change to match OpenAI SDK (#2237) 2025-05-23 10:59:30 -07:00
ehhuang
b054023800
chore: add sqlalchemy to test dependencies (#2236)
# What does this PR do?


## Test Plan
2025-05-23 10:33:38 -07:00
Ashwin Bharambe
51945f1e57
feat: accept MCP authorization headers for MCP toolgroups (#2230)
The most interesting MCP servers are those with an authorization wall in
front of them. This PR uses the existing `provider_data` mechanism of
passing provider API keys for passing MCP access tokens (in fact,
arbitrary headers in the style of the OpenAI Responses API) from the
client through to the MCP server.

```
class MCPProviderDataValidator(BaseModel):
    # mcp_endpoint => list of headers to send
    mcp_headers: dict[str, list[str]] | None = None
```

Note how we must stuff the headers for all MCP endpoints into a single
"MCPProviderDataValidator". Unlike existing providers (e.g., Together
and Fireworks for inference) where we could name the provider api keys
clearly (`together_api_key`, `fireworks_api_key`), we cannot name these
keys for MCP. We have a single generic MCP provider which can serve
multiple "toolgroups". So we use a dict to combine all the headers for
all MCP endpoints you may want to use in an agentic call.


## Test Plan

See the added integration test for usage.
2025-05-23 08:52:18 -07:00
ehhuang
2708312168
feat(ui): implement chat completion views (#2201)
# What does this PR do?
 Implements table and detail views for chat completions

<img width="1548" alt="image"
src="https://github.com/user-attachments/assets/01061b7f-0d47-4b3b-b5ac-2df8f9035ef6"
/>
<img width="1549" alt="image"
src="https://github.com/user-attachments/assets/738d8612-8258-4c2c-858b-bee39030649f"
/>


## Test Plan
npm run test
2025-05-22 22:05:54 -07:00
Ashwin Bharambe
d8c6ab9bfc
feat: add MCP tool signature to Responses API (#2232) 2025-05-22 16:43:08 -07:00
ehhuang
8feb1827c8
fix: openai provider model id (#2229)
# What does this PR do?
Since https://github.com/meta-llama/llama-stack/pull/2193 switched to
openai sdk, we need to strip 'openai/' from the model_id


## Test Plan
start server with openai provider and send a chat completion call
2025-05-22 14:51:01 -07:00
ehhuang
549812f51e
feat: implement get chat completions APIs (#2200)
# What does this PR do?
* Provide sqlite implementation of the APIs introduced in
https://github.com/meta-llama/llama-stack/pull/2145.
* Introduced a SqlStore API: llama_stack/providers/utils/sqlstore/api.py
and the first Sqlite implementation
* Pagination support will be added in a future PR.

## Test Plan
Unit test on sql store:
<img width="1005" alt="image"
src="https://github.com/user-attachments/assets/9b8b7ec8-632b-4667-8127-5583426b2e29"
/>


Integration test:
```
INFERENCE_MODEL="llama3.2:3b-instruct-fp16" llama stack build --template ollama --image-type conda --run
```
```
LLAMA_STACK_CONFIG=http://localhost:5001 INFERENCE_MODEL="llama3.2:3b-instruct-fp16" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "llama3.2:3b-instruct-fp16" -k 'inference_store and openai'
```
2025-05-21 22:21:52 -07:00
Jorge Piedrahita Ortiz
633bb9c5b3
feat(providers): sambanova safety provider (#2221)
# What does this PR do?

Includes SambaNova safety adaptor to use the sambanova cloud served
Meta-Llama-Guard-3-8B
minor updates in sambanova docs

## Test Plan
pytest -s -v tests/integration/safety/test_safety.py
--stack-config=sambanova --safety-shield=sambanova/Meta-Llama-Guard-3-8B
2025-05-21 15:33:02 -07:00
Sébastien Han
02e5e8a633
fix: only print routes that match the runtime config (#2226)
# What does this PR do?

We now only print the 'active' routes, not all the possible routes. This
is based on the distribution server config by looking at enabled APIs
and their respective providers.

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-05-21 15:30:29 -07:00
Sébastien Han
37f1e8a7f7
fix: use proper service account for kube auth (#2227)
# What does this PR do?

Not sure why it passed CI earlier...

Strange only 24 workflows run here
https://github.com/meta-llama/llama-stack/pull/2216 so the test never
ran...

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-05-21 15:28:21 -07:00
Varsha
e92301f2d7
feat(sqlite-vec): enable keyword search for sqlite-vec (#1439)
# What does this PR do?
This PR introduces support for keyword based FTS5 search with BM25
relevance scoring. It makes changes to the existing EmbeddingIndex base
class in order to support a search_mode and query_str parameter, that
can be used for keyword based search implementations.

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
run 
```
pytest llama_stack/providers/tests/vector_io/test_sqlite_vec.py -v -s --tb=short --disable-warnings --asyncio-mode=auto
```
Output:
```
pytest llama_stack/providers/tests/vector_io/test_sqlite_vec.py -v -s --tb=short --disable-warnings --asyncio-mode=auto
/Users/vnarsing/miniconda3/envs/stack-client/lib/python3.10/site-packages/pytest_asyncio/plugin.py:207: PytestDeprecationWarning: The configuration option "asyncio_default_fixture_loop_scope" is unset.
The event loop scope for asynchronous fixtures will default to the fixture caching scope. Future versions of pytest-asyncio will default the loop scope for asynchronous fixtures to function scope. Set the default fixture loop scope explicitly in order to avoid unexpected behavior in the future. Valid fixture loop scopes are: "function", "class", "module", "package", "session"

  warnings.warn(PytestDeprecationWarning(_DEFAULT_FIXTURE_LOOP_SCOPE_UNSET))
====================================================== test session starts =======================================================
platform darwin -- Python 3.10.16, pytest-8.3.4, pluggy-1.5.0 -- /Users/vnarsing/miniconda3/envs/stack-client/bin/python
cachedir: .pytest_cache
metadata: {'Python': '3.10.16', 'Platform': 'macOS-14.7.4-arm64-arm-64bit', 'Packages': {'pytest': '8.3.4', 'pluggy': '1.5.0'}, 'Plugins': {'html': '4.1.1', 'metadata': '3.1.1', 'asyncio': '0.25.3', 'anyio': '4.8.0'}}
rootdir: /Users/vnarsing/go/src/github/meta-llama/llama-stack
configfile: pyproject.toml
plugins: html-4.1.1, metadata-3.1.1, asyncio-0.25.3, anyio-4.8.0
asyncio: mode=auto, asyncio_default_fixture_loop_scope=None
collected 7 items                                                                                                                

llama_stack/providers/tests/vector_io/test_sqlite_vec.py::test_add_chunks PASSED
llama_stack/providers/tests/vector_io/test_sqlite_vec.py::test_query_chunks_vector PASSED
llama_stack/providers/tests/vector_io/test_sqlite_vec.py::test_query_chunks_fts PASSED
llama_stack/providers/tests/vector_io/test_sqlite_vec.py::test_chunk_id_conflict PASSED
llama_stack/providers/tests/vector_io/test_sqlite_vec.py::test_register_vector_db PASSED
llama_stack/providers/tests/vector_io/test_sqlite_vec.py::test_unregister_vector_db PASSED
llama_stack/providers/tests/vector_io/test_sqlite_vec.py::test_generate_chunk_id PASSED
```


For reference, with the implementation, the fts table looks like below:
```
Chunk ID: 9fbc39ce-c729-64a2-260f-c5ec9bb2a33e, Content: Sentence 0 from document 0
Chunk ID: 94062914-3e23-44cf-1e50-9e25821ba882, Content: Sentence 1 from document 0
Chunk ID: e6cfd559-4641-33ba-6ce1-7038226495eb, Content: Sentence 2 from document 0
Chunk ID: 1383af9b-f1f0-f417-4de5-65fe9456cc20, Content: Sentence 3 from document 0
Chunk ID: 2db19b1a-de14-353b-f4e1-085e8463361c, Content: Sentence 4 from document 0
Chunk ID: 9faf986a-f028-7714-068a-1c795e8f2598, Content: Sentence 5 from document 0
Chunk ID: ef593ead-5a4a-392f-7ad8-471a50f033e8, Content: Sentence 6 from document 0
Chunk ID: e161950f-021f-7300-4d05-3166738b94cf, Content: Sentence 7 from document 0
Chunk ID: 90610fc4-67c1-e740-f043-709c5978867a, Content: Sentence 8 from document 0
Chunk ID: 97712879-6fff-98ad-0558-e9f42e6b81d3, Content: Sentence 9 from document 0
Chunk ID: aea70411-51df-61ba-d2f0-cb2b5972c210, Content: Sentence 0 from document 1
Chunk ID: b678a463-7b84-92b8-abb2-27e9a1977e3c, Content: Sentence 1 from document 1
Chunk ID: 27bd63da-909c-1606-a109-75bdb9479882, Content: Sentence 2 from document 1
Chunk ID: a2ad49ad-f9be-5372-e0c7-7b0221d0b53e, Content: Sentence 3 from document 1
Chunk ID: cac53bcd-1965-082a-c0f4-ceee7323fc70, Content: Sentence 4 from document 1
```

Query results:
Result 1: Sentence 5 from document 0
Result 2: Sentence 5 from document 1
Result 3: Sentence 5 from document 2

[//]: # (## Documentation)

---------

Signed-off-by: Varsha Prasad Narsing <varshaprasad96@gmail.com>
2025-05-21 15:24:24 -04:00
Sébastien Han
85b5f3172b
docs: misc cleanup (#2223)
# What does this PR do?

* remove requirements.txt to use pyproject.toml as the source of truth
* update relevant docs

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-05-21 17:35:27 +02:00
Sébastien Han
6a62e783b9
chore: refactor workflow writting (#2225)
# What does this PR do?

Use a composite action to avoid similar steps repetitions and
centralization of the defaults.

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-05-21 17:31:14 +02:00
Sébastien Han
1862de4be5
chore: clarify cache_ttl to be key_recheck_period (#2220)
# What does this PR do?

The cache_ttl config value is not in fact tied to the lifetime of any of
the keys, it represents the time interval between for our key cache
refresher.

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-05-21 17:30:23 +02:00
Sébastien Han
c25acedbcd
chore: remove k8s auth in favor of k8s jwks endpoint (#2216)
# What does this PR do?

Kubernetes since 1.20 exposes a JWKS endpoint that we can use with our
recent oauth2 recent implementation.
The CI test has been kept intact for validation.

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-05-21 16:23:54 +02:00
liangwen12year
2890243107
feat(quota): add server‑side per‑client request quotas (requires auth) (#2096)
# What does this PR do?
feat(quota): add server‑side per‑client request quotas (requires auth)
    
Unrestricted usage can lead to runaway costs and fragmented client-side
    workarounds. This commit introduces a native quota mechanism to the
    server, giving operators a unified, centrally managed throttle for
    per-client requests—without needing extra proxies or custom client
logic. This helps contain cloud-compute expenses, enables fine-grained
usage control, and simplifies deployment and monitoring of Llama Stack
services. Quotas are fully opt-in and have no effect unless explicitly
    configured.
    
    Notice that Quotas are fully opt-in and require authentication to be
enabled. The 'sqlite' is the only supported quota `type` at this time,
any other `type` will be rejected. And the only supported `period` is
    'day'.
    
    Highlights:
    
    - Adds `QuotaMiddleware` to enforce per-client request quotas:
      - Uses `Authorization: Bearer <client_id>` (from
        AuthenticationMiddleware)
      - Tracks usage via a SQLite-based KV store
      - Returns 429 when the quota is exceeded
    
    - Extends `ServerConfig` with a `quota` section (type + config)
    
- Enforces strict coupling: quotas require authentication or the server
      will fail to start
    
    Behavior changes:
    - Quotas are disabled by default unless explicitly configured
    - SQLite defaults to `./quotas.db` if no DB path is set
    - The server requires authentication when quotas are enabled
    
    To enable per-client request quotas in `run.yaml`, add:
    ```
    server:
      port: 8321
      auth:
        provider_type: "custom"
        config:
          endpoint: "https://auth.example.com/validate"
      quota:
        type: sqlite
        config:
          db_path: ./quotas.db
          limit:
            max_requests: 1000
            period: day

[//]: # (If resolving an issue, uncomment and update the line below)
Closes #2093

## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]

[//]: # (## Documentation)

Signed-off-by: Wen Liang <wenliang@redhat.com>
Co-authored-by: Wen Liang <wenliang@redhat.com>
2025-05-21 10:58:45 +02:00
Abhishek koserwal
5a3d777b20
feat: add llama stack rm command (#2127)
# What does this PR do?
[Provide a short summary of what this PR does and why. Link to relevant
issues if applicable.]

```
llama stack rm llamastack-test
```

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
#225 

## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]

[//]: # (## Documentation)
2025-05-21 10:25:51 +02:00
grs
091d8c48f2
feat: add additional auth provider that uses oauth token introspection (#2187)
# What does this PR do?

This adds an alternative option to the oauth_token auth provider that
can be used with existing authorization services which support token
introspection as defined in RFC 7662. This could be useful where token
revocation needs to be handled or where opaque tokens (or other non jwt
formatted tokens) are used

## Test Plan
Tested against keycloak

Signed-off-by: Gordon Sim <gsim@redhat.com>
2025-05-20 19:45:11 -07:00
grs
87a4b9cb28
fix: synchronize concurrent coroutines checking & updating key set (#2215)
# What does this PR do?

This PR adds a lock to coordinate concurrent coroutines passing through
the jwt verification. As _refresh_jwks() was setting _jwks to an empty
dict then repopulating it, having multiple coroutines doing this
concurrently risks losing keys. The PR also builds the updated dict as a
separate object and assigns it to _jwks once completed. This avoids
impacting any coroutines using the key set as it is being updated.

Signed-off-by: Gordon Sim <gsim@redhat.com>
2025-05-20 10:00:44 -07:00
Derek Higgins
3339844fda
feat: Add "instructions" support to responses API (#2205)
# What does this PR do?
Add support for "instructions" to the responses API. Instructions
provide a way to swap out system (or developer) messages in new
responses.


## Test Plan
unit tests added

Signed-off-by: Derek Higgins <derekh@redhat.com>
2025-05-20 09:52:10 -07:00
Jash Gulabrai
1a770cf8ac
fix: Pass model parameter as config name to NeMo Customizer (#2218)
# What does this PR do?
When launching a fine-tuning job, an upcoming version of NeMo Customizer
will expect the `config` name to be formatted as
`namespace/name@version`. Here, `config` is a reference to a model +
additional metadata. There could be multiple `config`s that reference
the same base model.

This PR updates NVIDIA's `supervised_fine_tune` to simply pass the
`model` param as-is to NeMo Customizer. Currently, it expects a
specific, allowlisted llama model (i.e. `meta/Llama3.1-8B-Instruct`) and
converts it to the provider format (`meta/llama-3.1-8b-instruct`).

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
From a notebook, I built an image with my changes: 
```
!llama stack build --template nvidia --image-type venv
from llama_stack.distribution.library_client import LlamaStackAsLibraryClient

client = LlamaStackAsLibraryClient("nvidia")
client.initialize()
```
And could successfully launch a job:
```
response = client.post_training.supervised_fine_tune(
    job_uuid="",
    model="meta/llama-3.2-1b-instruct@v1.0.0+A100", # Model passed as-is to Customimzer
    ...
)

job_id = response.job_uuid
print(f"Created job with ID: {job_id}")

Output:
Created job with ID: cust-Jm4oGmbwcvoufaLU4XkrRU
```

[//]: # (## Documentation)

---------

Co-authored-by: Jash Gulabrai <jgulabrai@nvidia.com>
2025-05-20 09:51:39 -07:00
Sébastien Han
2eae8568e1
chore: collapse all local hook under the same repo (#2217)
Signed-off-by: Sébastien Han <seb@redhat.com>
2025-05-20 09:51:09 -07:00
Sébastien Han
3f6368d56c
ci: enable ruff output format for github (#2214)
# What does this PR do?

Update output format to enable automatic inline annotations.

![Screenshot 2025-05-20 at 10 55
38](https://github.com/user-attachments/assets/f943aa00-9b60-4cdb-b434-67b2de8b79f2)

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-05-20 09:04:03 -07:00
Francisco Arceo
90d7612f5f
chore: Updated readme (#2219)
# What does this PR do?
chore: Updated readme

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]

[//]: # (## Documentation)

Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
2025-05-20 17:06:20 +02:00
Francisco Arceo
ed7b4731aa
fix: Setting default value for metadata_token_count in case the key is not found (#2199)
# What does this PR do?
If a user has previously serialized data into their vector store without
the `metadata_token_count` in the chunk, the `query` method will fail in
a server error. This fixes that edge case by returning 0 when the key is
not detected. This solution is suboptimal but I think it's better to
understate the token size rather than recalculate it and add unnecessary
complexity to the retrieval code.

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]

[//]: # (## Documentation)

Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
2025-05-20 08:03:22 -04:00
Ben Browning
6d20b720b8
feat: Propagate W3C trace context headers from clients (#2153)
# What does this PR do?

This extracts the W3C trace context headers (traceparent and tracestate)
from incoming requests, stuffs them as attributes on the spans we
create, and uses them within the tracing provider implementation to
actually wrap our spans in the proper context.

What this means in practice is that when a client (such as an OpenAI
client) is instrumented to create these traces, we'll continue that
distributed trace within Llama Stack as opposed to creating our own root
span that breaks the distributed trace between client and server.

It's slightly awkward to do this in Llama Stack because our Tracing API
knows nothing about opentelemetry, W3C trace headers, etc - that's only
knowledge the specific provider implementation has. So, that's why the
trace headers get extracted by in the server code but not actually used
until the provider implementation to form the proper context.

This also centralizes how we were adding the `__root__` and
`__root_span__` attributes, as those two were being added in different
parts of the code instead of from a single place.

Closes #2097

## Test Plan

This was tested manually using the helpful scripts from #2097. I
verified that Llama Stack properly joined the client's span when the
client was instrumented for distributed tracing, and that Llama Stack
properly started its own root span when the incoming request was not
part of an existing trace.

Here's an example of the joined spans:

![Screenshot 2025-05-13 at 8 46
09 AM](https://github.com/user-attachments/assets/dbefda28-9faa-4339-a08d-1441efefc149)

Signed-off-by: Ben Browning <bbrownin@redhat.com>
2025-05-19 18:56:54 -07:00
Sébastien Han
82778ecbb0
fix: remove wrong deprecated warning (#2202)
# What does this PR do?

`--yaml-config` is gone now with
https://github.com/meta-llama/llama-stack/pull/2196.

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-05-19 13:02:23 -07:00
Michael Anstis
0cc0731189
fix: Pass external_config_dir to BuildConfig (#2190)
# What does this PR do?

The `external_config_dir` configuration parameter is not being passed to
the `BuildConfig` for `LlamaStackAsLibraryClient`.

This prevents _plugin_ providers from being loaded when `llama-stack` is
uses as a library.

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
I ran `LlamaStackAsLibraryClient` with a configuration file that
contained `external_config_dir` and related configuration.

It does not work without this change: _external_ providers are not
resolved.

It does work with this change 👍 

[//]: # (## Documentation)
2025-05-19 14:01:28 +02:00
ehhuang
047303e339
feat: introduce APIs for retrieving chat completion requests (#2145)
# What does this PR do?
This PR introduces APIs to retrieve past chat completion requests, which
will be used in the LS UI.

Our current `Telemetry` is ill-suited for this purpose as it's untyped
so we'd need to filter by obscure attribute names, making it brittle.

Since these APIs are 'provided by stack' and don't need to be
implemented by inference providers, we introduce a new InferenceProvider
class, containing the existing inference protocol, which is implemented
by inference providers.

The APIs are OpenAI-compliant, with an additional `input_messages`
field.


## Test Plan
This PR just adds the API and marks them provided_by_stack. S
tart stack server -> doesn't crash
2025-05-18 21:43:19 -07:00
Ashwin Bharambe
c7015d3d60
feat: introduce OAuth2TokenAuthProvider and notion of "principal" (#2185)
This PR adds a notion of `principal` (aka some kind of persistent
identity) to the authentication infrastructure of the Stack. Until now
we only used access attributes ("claims" in the more standard OAuth /
OIDC setup) but we need the notion of a User fundamentally as well.
(Thanks @rhuss for bringing this up.)

This value is not yet _used_ anywhere downstream but will be used to
segregate access to resources.

In addition, the PR introduces a built-in JWT token validator so the
Stack does not need to contact an authentication provider to validating
the authorization and merely check the signed token for the represented
claims. Public keys are refreshed via the configured JWKS server. This
Auth Provider should overwhelmingly be considered the default given the
seamless integration it offers with OAuth setups.
2025-05-18 17:54:19 -07:00
dependabot[bot]
1341916caf
chore(github-deps): bump astral-sh/setup-uv from 5.4.1 to 6.0.1 (#2197) 2025-05-18 02:09:56 -04:00
Matthew Farrellee
f40693e720
feat: --image-type argument overrides value in --config build.yaml (#2179)
closes #2162

# test plan

run `llama stack build --image-name ollama --image-type
<venv/conda/container> --config llama_stack/templates/ollama/build.yaml`
and verify venv | conda | container are built.
2025-05-16 14:45:41 -07:00
Charlie Doern
f02f7b28c1
feat: add huggingface post_training impl (#2132)
# What does this PR do?


adds an inline HF SFTTrainer provider. Alongside touchtune -- this is a
super popular option for running training jobs. The config allows a user
to specify some key fields such as a model, chat_template, device, etc

the provider comes with one recipe `finetune_single_device` which works
both with and without LoRA.

any model that is a valid HF identifier can be given and the model will
be pulled.

this has been tested so far with CPU and MPS device types, but should be
compatible with CUDA out of the box

The provider processes the given dataset into the proper format,
establishes the various steps per epoch, steps per save, steps per eval,
sets a sane SFTConfig, and runs n_epochs of training

if checkpoint_dir is none, no model is saved. If there is a checkpoint
dir, a model is saved every `save_steps` and at the end of training.


## Test Plan

re-enabled post_training integration test suite with a singular test
that loads the simpleqa dataset:
https://huggingface.co/datasets/llamastack/simpleqa and a tiny granite
model: https://huggingface.co/ibm-granite/granite-3.3-2b-instruct. The
test now uses the llama stack client and the proper post_training API

runs one step with a batch_size of 1. This test runs on CPU on the
Ubuntu runner so it needs to be a small batch and a single step.

[//]: # (## Documentation)

---------

Signed-off-by: Charlie Doern <cdoern@redhat.com>
2025-05-16 14:41:28 -07:00
Matthew Farrellee
8f9964f46b
fix: update llama stack build --run to use new start_stack.sh signature (#2191)
# What does this PR do?
fixes #2188

## Test Plan
`INFERENCE_MODEL=meta-llama/Llama-3.3-70B-Instruct llama stack build
--image-name ollama --image-type conda --template ollama --run` without
error
2025-05-16 14:32:02 -07:00
Charlie Doern
1ae61e8d5f
fix: replace all instances of --yaml-config with --config (#2196)
# What does this PR do?

start_stack.sh was using --yaml-config which is deprecated.

a bunch of distro docs also mentioned --yaml-config. Replaces all
instances and logic for --yaml-config with --config

resolves #2189

Signed-off-by: Charlie Doern <cdoern@redhat.com>
2025-05-16 14:31:12 -07:00
github-actions[bot]
65cf076f13 build: Bump version to 0.2.7 2025-05-16 20:32:06 +00:00
grs
b8f7e1504d
feat: allow the interface on which the server will listen to be configured (#2015)
# What does this PR do?

It may not always be desirable to listen on all interfaces, which is the
default. As an example, by listening instead only on a loopback
interface, the server cannot be reached except from within the host it
is run on. This PR makes this configurable, through a CLI option, an env
var or an entry on the config file.

## Test Plan

I ran a server with and without the added CLI argument to verify that
the argument is used if provided, but the default is as it was before if
not.

Signed-off-by: Gordon Sim <gsim@redhat.com>
2025-05-16 12:59:31 -07:00
Matthew Farrellee
64f8d4c3ad
feat: use openai-python for openai inference provider (#2193)
# What does this PR do?

fixes #2121

this implementation splits reponsibility between litellm and openai
libraries -

 | Inference Method           | Implementation Source    |
 |----------------------------|--------------------------|
 | completion                 | LiteLLMOpenAIMixin       |
 | chat_completion            | LiteLLMOpenAIMixin       |
 | embedding                  | LiteLLMOpenAIMixin       |
 | batch_completion           | LiteLLMOpenAIMixin       |
 | batch_chat_completion      | LiteLLMOpenAIMixin       |
 | openai_completion          | AsyncOpenAI              |
 | openai_chat_completion     | AsyncOpenAI              |

## Test Plan

smoke test with -
```
$ OPENAI_API_KEY=$LLAMA_API_KEY OPENAI_BASE_URL=https://api.llama.com/compat/v1 llama stack build --image-type conda --image-name openai --providers inference=remote::openai --run

$ llama-stack-client models register Llama-4-Scout-17B-16E-Instruct-FP8

$ curl "http://localhost:8321/v1/openai/v1/chat/completions" -H "Content-Type: application/json" \ -d '{
      "model": "Llama-4-Scout-17B-16E-Instruct-FP8",
      "messages": [
        {"role": "user", "content": "Hello Llama! Can you give me a quick intro?"}
      ]
}'
{"id":"AmPwrrkc5JgVjejPdIPrpT2","choices":[{"finish_reason":"stop","index":0,"logprobs":{"content":null,"refusal":null},"message":{"content":"Hello! I'm Llama, a Meta-designed model that adapts to your conversational style. Whether you need quick answers, deep dives into ideas, or just want to vent, joke, or brainstorm—I'm here for it. What’s on your mind?","refusal":"","role":"assistant","annotations":null,"audio":null,"function_call":null,"tool_calls":null,"id":"AmPwrrkc5JgVjejPdIPrpT2"}}],"created":1747410061,"model":"Llama-4-Scout-17B-16E-Instruct-FP8","object":"chat.completions","service_tier":null,"system_fingerprint":null,"usage":{"completion_tokens":54,"prompt_tokens":22,"total_tokens":76,"completion_tokens_details":null,"prompt_tokens_details":null}}
```

and run full test suite.
2025-05-16 12:57:56 -07:00