llama-stack

forked from phoenix-oss/llama-stack-mirror

Author	SHA1	Message	Date
github-actions[bot]	7105a25b0f	build: Bump version to 0.2.8	2025-05-27 20:28:29 +00:00
Ashwin Bharambe	5cdb29758a	feat(responses): add output_text delta events to responses (#2265 ) This adds initial streaming support to the Responses API. This PR makes sure that the _first_ inference call made to chat completions streams out. There's more to be done: - tool call output tokens need to stream out when possible - we need to loop through multiple rounds of inference and they all need to stream out. ## Test Plan Added a test. Executed as: ``` FIREWORKS_API_KEY=... \ pytest -s -v 'tests/verifications/openai_api/test_responses.py' \ --provider=stack:fireworks --model meta-llama/Llama-4-Scout-17B-16E-Instruct ``` Then, started a llama stack fireworks distro and tested against it like this: ``` OPENAI_API_KEY=blah \ pytest -s -v 'tests/verifications/openai_api/test_responses.py' \ --base-url http://localhost:8321/v1/openai/v1 \ --model meta-llama/Llama-4-Scout-17B-16E-Instruct ```	2025-05-27 13:07:14 -07:00
Sébastien Han	6ee319ae08	fix: convert boolean string to boolean (#2284 ) # What does this PR do? Handles the case where the vllm config `tls_verify` is set to `false` or `true`. Closes: https://github.com/meta-llama/llama-stack/issues/2283 Signed-off-by: Sébastien Han <seb@redhat.com>	2025-05-27 13:05:38 -07:00
Sébastien Han	a8f75d3897	chore: remove dependencies.json (#2281 ) # What does this PR do? It's not used anywhere in the build process. Ancient artifact from an old attempt of using sub packages to build distros. ## Test Plan <!-- Describe the tests you ran to verify your changes with result summaries. Provide clear instructions so the plan can be easily re-executed. --> N/A Signed-off-by: Sébastien Han <seb@redhat.com>	2025-05-27 10:26:57 -07:00
Ignas Baranauskas	28930cdab6	fix: handle None external_providers_dir in build with run arg (#2269 ) # What does this PR do? Fixes an issue where running `llama stack build --template ollama --image-type venv --run` fails with a TypeError when validating external providers directory paths. The error occurs because `os.path.exists()` is called with `Path(None)` instead of converting it to a string first. This change ensures consistent handling of `None` values for `external_providers_dir` across both build and [run](https://github.com/meta-llama/llama-stack/blob/main/llama_stack/cli/stack/run.py#L134) commands by using `str()` conversion before path validation. [//]: # (If resolving an issue, uncomment and update the line below) [//]: # (Closes #[issue-number]) ## Test Plan ```bash INFERENCE_MODEL=llama3.2:3b uv run --with llama-stack llama stack build --template ollama --image-type venv --run ``` Command completes successfully without TypeError [//]: # (## Documentation)	2025-05-27 09:41:12 +02:00
Ashwin Bharambe	51e6f529f3	fix: index non-MCP toolgroups at registration time (#2272 ) Two somewhat annoying fixes: - we are going to index tools for non-MCP toolgroups always (like we used to do). because there are just random assumptions in our tests, etc. and I don't want to fix them right now - we need to handle the funny case of toolgroups like `builtin::rag/knowledge_search` where we added the tool name to use in the toolgroup itself.	2025-05-26 20:33:36 -07:00
Sébastien Han	39b33a3b01	chore: allow to pass CA cert to remote vllm (#2266 ) # What does this PR do? The `tls_verify` can now receive a path to a certificate file if the endpoint requires it. Signed-off-by: Sébastien Han <seb@redhat.com>	2025-05-26 20:59:03 +02:00
Sébastien Han	7710b2f43b	chore: removed unused class (#2268 ) Signed-off-by: Sébastien Han <seb@redhat.com>	2025-05-26 08:41:37 -07:00
Ashwin Bharambe	9623d5d230	fix: match mcp headers in provider data to Responses API shape (#2263 )	2025-05-25 14:33:10 -07:00
Ashwin Bharambe	ce33d02443	fix(tools): do not index tools, only index toolgroups (#2261 ) When registering a MCP endpoint, we cannot list tools (like we used to) since the MCP endpoint may be behind an auth wall. Registration can happen much sooner (via run.yaml). Instead, we do listing only when the _user_ actually calls listing. Furthermore, we cache the list in-memory in the server. Currently, the cache is not invalidated -- we may want to periodically re-list for MCP servers. Note that they must call `list_tools` before calling `invoke_tool` -- we use this critically. This will enable us to list MCP servers in run.yaml ## Test Plan Existing tests, updated tests accordingly.	2025-05-25 13:27:52 -07:00
raghotham	5a422e236c	chore: make cprint write to stderr (#2250 ) Also do sys.exit(1) in case of errors	2025-05-24 23:39:57 -07:00
Ashwin Bharambe	298721c238	chore: split routing_tables into individual files (#2259 )	2025-05-24 23:15:05 -07:00
Ashwin Bharambe	eedf21f19c	chore: split routers into individual files (inference, tool, vector_io, eval_scoring) (#2258 )	2025-05-24 22:59:07 -07:00
Ashwin Bharambe	ae7272d8ff	chore: split routers into individual files (datasets) (#2249 )	2025-05-24 22:11:43 -07:00
Ashwin Bharambe	a2160dc0af	chore: split routers into individual files (safety) Reviewers: bbrowning, leseb, ehhuang, terrytangyuan, raghotham, yanxi0830, hardikjshah Reviewed By: raghotham Pull Request: https://github.com/meta-llama/llama-stack/pull/2248	2025-05-24 22:00:32 -07:00
Ashwin Bharambe	c290999c63	fix(telemetry): get rid of annoying sqlite span export error (#2245 )	2025-05-24 20:24:34 -07:00
Ashwin Bharambe	3faf1e4a79	feat: enable MCP execution in Responses impl (#2240 ) ## Test Plan ``` pytest -s -v 'tests/verifications/openai_api/test_responses.py' \ --provider=stack:together --model meta-llama/Llama-4-Scout-17B-16E-Instruct ```	2025-05-24 14:20:42 -07:00
ehhuang	15b0a67555	feat: add responses input items api (#2239 ) # What does this PR do? TSIA ## Test Plan added integration and unit tests	2025-05-24 07:05:53 -07:00
ehhuang	ca65617a71	feat: start ui server in `llama stack run` (#2170 ) # What does this PR do? TSIA `--enable-ui` to enable ## Test Plan `llama stack run dev --image-type conda --enable-ui` `localhost:8322` shows UI llama stack run dev --image-type conda `localhost:8322` does not work	2025-05-23 20:00:09 -07:00
ehhuang	5844c2da68	feat: add list responses API (#2233 ) # What does this PR do? This is not part of the official OpenAI API, but we'll use this for the logs UI. In order to support more filtering options, I'm adopting the newly introduced sql store in in place of the kv store. ## Test Plan Added integration/unit tests.	2025-05-23 13:16:48 -07:00
Ashwin Bharambe	558d109ab7	fix: signature change to match OpenAI SDK (#2237 )	2025-05-23 10:59:30 -07:00
Ashwin Bharambe	51945f1e57	feat: accept MCP authorization headers for MCP toolgroups (#2230 ) The most interesting MCP servers are those with an authorization wall in front of them. This PR uses the existing `provider_data` mechanism of passing provider API keys for passing MCP access tokens (in fact, arbitrary headers in the style of the OpenAI Responses API) from the client through to the MCP server. ``` class MCPProviderDataValidator(BaseModel): # mcp_endpoint => list of headers to send mcp_headers: dict[str, list[str]] \| None = None ``` Note how we must stuff the headers for all MCP endpoints into a single "MCPProviderDataValidator". Unlike existing providers (e.g., Together and Fireworks for inference) where we could name the provider api keys clearly (`together_api_key`, `fireworks_api_key`), we cannot name these keys for MCP. We have a single generic MCP provider which can serve multiple "toolgroups". So we use a dict to combine all the headers for all MCP endpoints you may want to use in an agentic call. ## Test Plan See the added integration test for usage.	2025-05-23 08:52:18 -07:00
ehhuang	2708312168	feat(ui): implement chat completion views (#2201 ) # What does this PR do? Implements table and detail views for chat completions <img width="1548" alt="image" src="https://github.com/user-attachments/assets/01061b7f-0d47-4b3b-b5ac-2df8f9035ef6" /> <img width="1549" alt="image" src="https://github.com/user-attachments/assets/738d8612-8258-4c2c-858b-bee39030649f" /> ## Test Plan npm run test	2025-05-22 22:05:54 -07:00
Ashwin Bharambe	d8c6ab9bfc	feat: add MCP tool signature to Responses API (#2232 )	2025-05-22 16:43:08 -07:00
ehhuang	8feb1827c8	fix: openai provider model id (#2229 ) # What does this PR do? Since https://github.com/meta-llama/llama-stack/pull/2193 switched to openai sdk, we need to strip 'openai/' from the model_id ## Test Plan start server with openai provider and send a chat completion call	2025-05-22 14:51:01 -07:00
ehhuang	549812f51e	feat: implement get chat completions APIs (#2200 ) # What does this PR do? * Provide sqlite implementation of the APIs introduced in https://github.com/meta-llama/llama-stack/pull/2145. * Introduced a SqlStore API: llama_stack/providers/utils/sqlstore/api.py and the first Sqlite implementation * Pagination support will be added in a future PR. ## Test Plan Unit test on sql store: <img width="1005" alt="image" src="https://github.com/user-attachments/assets/9b8b7ec8-632b-4667-8127-5583426b2e29" /> Integration test: ``` INFERENCE_MODEL="llama3.2:3b-instruct-fp16" llama stack build --template ollama --image-type conda --run ``` ``` LLAMA_STACK_CONFIG=http://localhost:5001 INFERENCE_MODEL="llama3.2:3b-instruct-fp16" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "llama3.2:3b-instruct-fp16" -k 'inference_store and openai' ```	2025-05-21 22:21:52 -07:00
Jorge Piedrahita Ortiz	633bb9c5b3	feat(providers): sambanova safety provider (#2221 ) # What does this PR do? Includes SambaNova safety adaptor to use the sambanova cloud served Meta-Llama-Guard-3-8B minor updates in sambanova docs ## Test Plan pytest -s -v tests/integration/safety/test_safety.py --stack-config=sambanova --safety-shield=sambanova/Meta-Llama-Guard-3-8B	2025-05-21 15:33:02 -07:00
Sébastien Han	02e5e8a633	fix: only print routes that match the runtime config (#2226 ) # What does this PR do? We now only print the 'active' routes, not all the possible routes. This is based on the distribution server config by looking at enabled APIs and their respective providers. Signed-off-by: Sébastien Han <seb@redhat.com>	2025-05-21 15:30:29 -07:00
Varsha	e92301f2d7	feat(sqlite-vec): enable keyword search for sqlite-vec (#1439 ) # What does this PR do? This PR introduces support for keyword based FTS5 search with BM25 relevance scoring. It makes changes to the existing EmbeddingIndex base class in order to support a search_mode and query_str parameter, that can be used for keyword based search implementations. [//]: # (If resolving an issue, uncomment and update the line below) [//]: # (Closes #[issue-number]) ## Test Plan run ``` pytest llama_stack/providers/tests/vector_io/test_sqlite_vec.py -v -s --tb=short --disable-warnings --asyncio-mode=auto ``` Output: ``` pytest llama_stack/providers/tests/vector_io/test_sqlite_vec.py -v -s --tb=short --disable-warnings --asyncio-mode=auto /Users/vnarsing/miniconda3/envs/stack-client/lib/python3.10/site-packages/pytest_asyncio/plugin.py:207: PytestDeprecationWarning: The configuration option "asyncio_default_fixture_loop_scope" is unset. The event loop scope for asynchronous fixtures will default to the fixture caching scope. Future versions of pytest-asyncio will default the loop scope for asynchronous fixtures to function scope. Set the default fixture loop scope explicitly in order to avoid unexpected behavior in the future. Valid fixture loop scopes are: "function", "class", "module", "package", "session" warnings.warn(PytestDeprecationWarning(_DEFAULT_FIXTURE_LOOP_SCOPE_UNSET)) ====================================================== test session starts ======================================================= platform darwin -- Python 3.10.16, pytest-8.3.4, pluggy-1.5.0 -- /Users/vnarsing/miniconda3/envs/stack-client/bin/python cachedir: .pytest_cache metadata: {'Python': '3.10.16', 'Platform': 'macOS-14.7.4-arm64-arm-64bit', 'Packages': {'pytest': '8.3.4', 'pluggy': '1.5.0'}, 'Plugins': {'html': '4.1.1', 'metadata': '3.1.1', 'asyncio': '0.25.3', 'anyio': '4.8.0'}} rootdir: /Users/vnarsing/go/src/github/meta-llama/llama-stack configfile: pyproject.toml plugins: html-4.1.1, metadata-3.1.1, asyncio-0.25.3, anyio-4.8.0 asyncio: mode=auto, asyncio_default_fixture_loop_scope=None collected 7 items llama_stack/providers/tests/vector_io/test_sqlite_vec.py::test_add_chunks PASSED llama_stack/providers/tests/vector_io/test_sqlite_vec.py::test_query_chunks_vector PASSED llama_stack/providers/tests/vector_io/test_sqlite_vec.py::test_query_chunks_fts PASSED llama_stack/providers/tests/vector_io/test_sqlite_vec.py::test_chunk_id_conflict PASSED llama_stack/providers/tests/vector_io/test_sqlite_vec.py::test_register_vector_db PASSED llama_stack/providers/tests/vector_io/test_sqlite_vec.py::test_unregister_vector_db PASSED llama_stack/providers/tests/vector_io/test_sqlite_vec.py::test_generate_chunk_id PASSED ``` For reference, with the implementation, the fts table looks like below: ``` Chunk ID: 9fbc39ce-c729-64a2-260f-c5ec9bb2a33e, Content: Sentence 0 from document 0 Chunk ID: 94062914-3e23-44cf-1e50-9e25821ba882, Content: Sentence 1 from document 0 Chunk ID: e6cfd559-4641-33ba-6ce1-7038226495eb, Content: Sentence 2 from document 0 Chunk ID: 1383af9b-f1f0-f417-4de5-65fe9456cc20, Content: Sentence 3 from document 0 Chunk ID: 2db19b1a-de14-353b-f4e1-085e8463361c, Content: Sentence 4 from document 0 Chunk ID: 9faf986a-f028-7714-068a-1c795e8f2598, Content: Sentence 5 from document 0 Chunk ID: ef593ead-5a4a-392f-7ad8-471a50f033e8, Content: Sentence 6 from document 0 Chunk ID: e161950f-021f-7300-4d05-3166738b94cf, Content: Sentence 7 from document 0 Chunk ID: 90610fc4-67c1-e740-f043-709c5978867a, Content: Sentence 8 from document 0 Chunk ID: 97712879-6fff-98ad-0558-e9f42e6b81d3, Content: Sentence 9 from document 0 Chunk ID: aea70411-51df-61ba-d2f0-cb2b5972c210, Content: Sentence 0 from document 1 Chunk ID: b678a463-7b84-92b8-abb2-27e9a1977e3c, Content: Sentence 1 from document 1 Chunk ID: 27bd63da-909c-1606-a109-75bdb9479882, Content: Sentence 2 from document 1 Chunk ID: a2ad49ad-f9be-5372-e0c7-7b0221d0b53e, Content: Sentence 3 from document 1 Chunk ID: cac53bcd-1965-082a-c0f4-ceee7323fc70, Content: Sentence 4 from document 1 ``` Query results: Result 1: Sentence 5 from document 0 Result 2: Sentence 5 from document 1 Result 3: Sentence 5 from document 2 [//]: # (## Documentation) --------- Signed-off-by: Varsha Prasad Narsing <varshaprasad96@gmail.com>	2025-05-21 15:24:24 -04:00
Sébastien Han	1862de4be5	chore: clarify cache_ttl to be key_recheck_period (#2220 ) # What does this PR do? The cache_ttl config value is not in fact tied to the lifetime of any of the keys, it represents the time interval between for our key cache refresher. Signed-off-by: Sébastien Han <seb@redhat.com>	2025-05-21 17:30:23 +02:00
Sébastien Han	c25acedbcd	chore: remove k8s auth in favor of k8s jwks endpoint (#2216 ) # What does this PR do? Kubernetes since 1.20 exposes a JWKS endpoint that we can use with our recent oauth2 recent implementation. The CI test has been kept intact for validation. Signed-off-by: Sébastien Han <seb@redhat.com>	2025-05-21 16:23:54 +02:00
liangwen12year	2890243107	feat(quota): add server‑side per‑client request quotas (requires auth) (#2096 ) # What does this PR do? feat(quota): add server‑side per‑client request quotas (requires auth) Unrestricted usage can lead to runaway costs and fragmented client-side workarounds. This commit introduces a native quota mechanism to the server, giving operators a unified, centrally managed throttle for per-client requests—without needing extra proxies or custom client logic. This helps contain cloud-compute expenses, enables fine-grained usage control, and simplifies deployment and monitoring of Llama Stack services. Quotas are fully opt-in and have no effect unless explicitly configured. Notice that Quotas are fully opt-in and require authentication to be enabled. The 'sqlite' is the only supported quota `type` at this time, any other `type` will be rejected. And the only supported `period` is 'day'. Highlights: - Adds `QuotaMiddleware` to enforce per-client request quotas: - Uses `Authorization: Bearer <client_id>` (from AuthenticationMiddleware) - Tracks usage via a SQLite-based KV store - Returns 429 when the quota is exceeded - Extends `ServerConfig` with a `quota` section (type + config) - Enforces strict coupling: quotas require authentication or the server will fail to start Behavior changes: - Quotas are disabled by default unless explicitly configured - SQLite defaults to `./quotas.db` if no DB path is set - The server requires authentication when quotas are enabled To enable per-client request quotas in `run.yaml`, add: ``` server: port: 8321 auth: provider_type: "custom" config: endpoint: "https://auth.example.com/validate" quota: type: sqlite config: db_path: ./quotas.db limit: max_requests: 1000 period: day [//]: # (If resolving an issue, uncomment and update the line below) Closes #2093 ## Test Plan [Describe the tests you ran to verify your changes with result summaries. Provide clear instructions so the plan can be easily re-executed.] [//]: # (## Documentation) Signed-off-by: Wen Liang <wenliang@redhat.com> Co-authored-by: Wen Liang <wenliang@redhat.com>	2025-05-21 10:58:45 +02:00
Abhishek koserwal	5a3d777b20	feat: add llama stack rm command (#2127 ) # What does this PR do? [Provide a short summary of what this PR does and why. Link to relevant issues if applicable.] ``` llama stack rm llamastack-test ``` [//]: # (If resolving an issue, uncomment and update the line below) [//]: # (Closes #[issue-number]) #225 ## Test Plan [Describe the tests you ran to verify your changes with result summaries. Provide clear instructions so the plan can be easily re-executed.] [//]: # (## Documentation)	2025-05-21 10:25:51 +02:00
grs	091d8c48f2	feat: add additional auth provider that uses oauth token introspection (#2187 ) # What does this PR do? This adds an alternative option to the oauth_token auth provider that can be used with existing authorization services which support token introspection as defined in RFC 7662. This could be useful where token revocation needs to be handled or where opaque tokens (or other non jwt formatted tokens) are used ## Test Plan Tested against keycloak Signed-off-by: Gordon Sim <gsim@redhat.com>	2025-05-20 19:45:11 -07:00
grs	87a4b9cb28	fix: synchronize concurrent coroutines checking & updating key set (#2215 ) # What does this PR do? This PR adds a lock to coordinate concurrent coroutines passing through the jwt verification. As _refresh_jwks() was setting _jwks to an empty dict then repopulating it, having multiple coroutines doing this concurrently risks losing keys. The PR also builds the updated dict as a separate object and assigns it to _jwks once completed. This avoids impacting any coroutines using the key set as it is being updated. Signed-off-by: Gordon Sim <gsim@redhat.com>	2025-05-20 10:00:44 -07:00
Derek Higgins	3339844fda	feat: Add "instructions" support to responses API (#2205 ) # What does this PR do? Add support for "instructions" to the responses API. Instructions provide a way to swap out system (or developer) messages in new responses. ## Test Plan unit tests added Signed-off-by: Derek Higgins <derekh@redhat.com>	2025-05-20 09:52:10 -07:00
Jash Gulabrai	1a770cf8ac	fix: Pass model parameter as config name to NeMo Customizer (#2218 ) # What does this PR do? When launching a fine-tuning job, an upcoming version of NeMo Customizer will expect the `config` name to be formatted as `namespace/name@version`. Here, `config` is a reference to a model + additional metadata. There could be multiple `config`s that reference the same base model. This PR updates NVIDIA's `supervised_fine_tune` to simply pass the `model` param as-is to NeMo Customizer. Currently, it expects a specific, allowlisted llama model (i.e. `meta/Llama3.1-8B-Instruct`) and converts it to the provider format (`meta/llama-3.1-8b-instruct`). [//]: # (If resolving an issue, uncomment and update the line below) [//]: # (Closes #[issue-number]) ## Test Plan From a notebook, I built an image with my changes: ``` !llama stack build --template nvidia --image-type venv from llama_stack.distribution.library_client import LlamaStackAsLibraryClient client = LlamaStackAsLibraryClient("nvidia") client.initialize() ``` And could successfully launch a job: ``` response = client.post_training.supervised_fine_tune( job_uuid="", model="meta/llama-3.2-1b-instruct@v1.0.0+A100", # Model passed as-is to Customimzer ... ) job_id = response.job_uuid print(f"Created job with ID: {job_id}") Output: Created job with ID: cust-Jm4oGmbwcvoufaLU4XkrRU ``` [//]: # (## Documentation) --------- Co-authored-by: Jash Gulabrai <jgulabrai@nvidia.com>	2025-05-20 09:51:39 -07:00
Francisco Arceo	90d7612f5f	chore: Updated readme (#2219 ) # What does this PR do? chore: Updated readme [//]: # (If resolving an issue, uncomment and update the line below) [//]: # (Closes #[issue-number]) ## Test Plan [Describe the tests you ran to verify your changes with result summaries. Provide clear instructions so the plan can be easily re-executed.] [//]: # (## Documentation) Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>	2025-05-20 17:06:20 +02:00
Francisco Arceo	ed7b4731aa	fix: Setting default value for `metadata_token_count` in case the key is not found (#2199 ) # What does this PR do? If a user has previously serialized data into their vector store without the `metadata_token_count` in the chunk, the `query` method will fail in a server error. This fixes that edge case by returning 0 when the key is not detected. This solution is suboptimal but I think it's better to understate the token size rather than recalculate it and add unnecessary complexity to the retrieval code. [//]: # (If resolving an issue, uncomment and update the line below) [//]: # (Closes #[issue-number]) ## Test Plan [Describe the tests you ran to verify your changes with result summaries. Provide clear instructions so the plan can be easily re-executed.] [//]: # (## Documentation) Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>	2025-05-20 08:03:22 -04:00
Ben Browning	6d20b720b8	feat: Propagate W3C trace context headers from clients (#2153 ) # What does this PR do? This extracts the W3C trace context headers (traceparent and tracestate) from incoming requests, stuffs them as attributes on the spans we create, and uses them within the tracing provider implementation to actually wrap our spans in the proper context. What this means in practice is that when a client (such as an OpenAI client) is instrumented to create these traces, we'll continue that distributed trace within Llama Stack as opposed to creating our own root span that breaks the distributed trace between client and server. It's slightly awkward to do this in Llama Stack because our Tracing API knows nothing about opentelemetry, W3C trace headers, etc - that's only knowledge the specific provider implementation has. So, that's why the trace headers get extracted by in the server code but not actually used until the provider implementation to form the proper context. This also centralizes how we were adding the `__root__` and `__root_span__` attributes, as those two were being added in different parts of the code instead of from a single place. Closes #2097 ## Test Plan This was tested manually using the helpful scripts from #2097. I verified that Llama Stack properly joined the client's span when the client was instrumented for distributed tracing, and that Llama Stack properly started its own root span when the incoming request was not part of an existing trace. Here's an example of the joined spans: ![Screenshot 2025-05-13 at 8 46 09 AM](https://github.com/user-attachments/assets/dbefda28-9faa-4339-a08d-1441efefc149) Signed-off-by: Ben Browning <bbrownin@redhat.com>	2025-05-19 18:56:54 -07:00
Sébastien Han	82778ecbb0	fix: remove wrong deprecated warning (#2202 ) # What does this PR do? `--yaml-config` is gone now with https://github.com/meta-llama/llama-stack/pull/2196. Signed-off-by: Sébastien Han <seb@redhat.com>	2025-05-19 13:02:23 -07:00
Michael Anstis	0cc0731189	fix: Pass external_config_dir to BuildConfig (#2190 ) # What does this PR do? The `external_config_dir` configuration parameter is not being passed to the `BuildConfig` for `LlamaStackAsLibraryClient`. This prevents _plugin_ providers from being loaded when `llama-stack` is uses as a library. [//]: # (If resolving an issue, uncomment and update the line below) [//]: # (Closes #[issue-number]) ## Test Plan I ran `LlamaStackAsLibraryClient` with a configuration file that contained `external_config_dir` and related configuration. It does not work without this change: _external_ providers are not resolved. It does work with this change 👍 [//]: # (## Documentation)	2025-05-19 14:01:28 +02:00
ehhuang	047303e339	feat: introduce APIs for retrieving chat completion requests (#2145 ) # What does this PR do? This PR introduces APIs to retrieve past chat completion requests, which will be used in the LS UI. Our current `Telemetry` is ill-suited for this purpose as it's untyped so we'd need to filter by obscure attribute names, making it brittle. Since these APIs are 'provided by stack' and don't need to be implemented by inference providers, we introduce a new InferenceProvider class, containing the existing inference protocol, which is implemented by inference providers. The APIs are OpenAI-compliant, with an additional `input_messages` field. ## Test Plan This PR just adds the API and marks them provided_by_stack. S tart stack server -> doesn't crash	2025-05-18 21:43:19 -07:00
Ashwin Bharambe	c7015d3d60	feat: introduce OAuth2TokenAuthProvider and notion of "principal" (#2185 ) This PR adds a notion of `principal` (aka some kind of persistent identity) to the authentication infrastructure of the Stack. Until now we only used access attributes ("claims" in the more standard OAuth / OIDC setup) but we need the notion of a User fundamentally as well. (Thanks @rhuss for bringing this up.) This value is not yet _used_ anywhere downstream but will be used to segregate access to resources. In addition, the PR introduces a built-in JWT token validator so the Stack does not need to contact an authentication provider to validating the authorization and merely check the signed token for the represented claims. Public keys are refreshed via the configured JWKS server. This Auth Provider should overwhelmingly be considered the default given the seamless integration it offers with OAuth setups.	2025-05-18 17:54:19 -07:00
Matthew Farrellee	f40693e720	feat: --image-type argument overrides value in --config build.yaml (#2179 ) closes #2162 # test plan run `llama stack build --image-name ollama --image-type <venv/conda/container> --config llama_stack/templates/ollama/build.yaml` and verify venv \| conda \| container are built.	2025-05-16 14:45:41 -07:00
Charlie Doern	f02f7b28c1	feat: add huggingface post_training impl (#2132 ) # What does this PR do? adds an inline HF SFTTrainer provider. Alongside touchtune -- this is a super popular option for running training jobs. The config allows a user to specify some key fields such as a model, chat_template, device, etc the provider comes with one recipe `finetune_single_device` which works both with and without LoRA. any model that is a valid HF identifier can be given and the model will be pulled. this has been tested so far with CPU and MPS device types, but should be compatible with CUDA out of the box The provider processes the given dataset into the proper format, establishes the various steps per epoch, steps per save, steps per eval, sets a sane SFTConfig, and runs n_epochs of training if checkpoint_dir is none, no model is saved. If there is a checkpoint dir, a model is saved every `save_steps` and at the end of training. ## Test Plan re-enabled post_training integration test suite with a singular test that loads the simpleqa dataset: https://huggingface.co/datasets/llamastack/simpleqa and a tiny granite model: https://huggingface.co/ibm-granite/granite-3.3-2b-instruct. The test now uses the llama stack client and the proper post_training API runs one step with a batch_size of 1. This test runs on CPU on the Ubuntu runner so it needs to be a small batch and a single step. [//]: # (## Documentation) --------- Signed-off-by: Charlie Doern <cdoern@redhat.com>	2025-05-16 14:41:28 -07:00
Matthew Farrellee	8f9964f46b	fix: update llama stack build --run to use new start_stack.sh signature (#2191 ) # What does this PR do? fixes #2188 ## Test Plan `INFERENCE_MODEL=meta-llama/Llama-3.3-70B-Instruct llama stack build --image-name ollama --image-type conda --template ollama --run` without error	2025-05-16 14:32:02 -07:00
Charlie Doern	1ae61e8d5f	fix: replace all instances of --yaml-config with --config (#2196 ) # What does this PR do? start_stack.sh was using --yaml-config which is deprecated. a bunch of distro docs also mentioned --yaml-config. Replaces all instances and logic for --yaml-config with --config resolves #2189 Signed-off-by: Charlie Doern <cdoern@redhat.com>	2025-05-16 14:31:12 -07:00
grs	b8f7e1504d	feat: allow the interface on which the server will listen to be configured (#2015 ) # What does this PR do? It may not always be desirable to listen on all interfaces, which is the default. As an example, by listening instead only on a loopback interface, the server cannot be reached except from within the host it is run on. This PR makes this configurable, through a CLI option, an env var or an entry on the config file. ## Test Plan I ran a server with and without the added CLI argument to verify that the argument is used if provided, but the default is as it was before if not. Signed-off-by: Gordon Sim <gsim@redhat.com>	2025-05-16 12:59:31 -07:00
Matthew Farrellee	64f8d4c3ad	feat: use openai-python for openai inference provider (#2193 ) # What does this PR do? fixes #2121 this implementation splits reponsibility between litellm and openai libraries - \| Inference Method \| Implementation Source \| \|----------------------------\|--------------------------\| \| completion \| LiteLLMOpenAIMixin \| \| chat_completion \| LiteLLMOpenAIMixin \| \| embedding \| LiteLLMOpenAIMixin \| \| batch_completion \| LiteLLMOpenAIMixin \| \| batch_chat_completion \| LiteLLMOpenAIMixin \| \| openai_completion \| AsyncOpenAI \| \| openai_chat_completion \| AsyncOpenAI \| ## Test Plan smoke test with - ``` $ OPENAI_API_KEY=$LLAMA_API_KEY OPENAI_BASE_URL=https://api.llama.com/compat/v1 llama stack build --image-type conda --image-name openai --providers inference=remote::openai --run $ llama-stack-client models register Llama-4-Scout-17B-16E-Instruct-FP8 $ curl "http://localhost:8321/v1/openai/v1/chat/completions" -H "Content-Type: application/json" \ -d '{ "model": "Llama-4-Scout-17B-16E-Instruct-FP8", "messages": [ {"role": "user", "content": "Hello Llama! Can you give me a quick intro?"} ] }' {"id":"AmPwrrkc5JgVjejPdIPrpT2","choices":[{"finish_reason":"stop","index":0,"logprobs":{"content":null,"refusal":null},"message":{"content":"Hello! I'm Llama, a Meta-designed model that adapts to your conversational style. Whether you need quick answers, deep dives into ideas, or just want to vent, joke, or brainstorm—I'm here for it. What’s on your mind?","refusal":"","role":"assistant","annotations":null,"audio":null,"function_call":null,"tool_calls":null,"id":"AmPwrrkc5JgVjejPdIPrpT2"}}],"created":1747410061,"model":"Llama-4-Scout-17B-16E-Instruct-FP8","object":"chat.completions","service_tier":null,"system_fingerprint":null,"usage":{"completion_tokens":54,"prompt_tokens":22,"total_tokens":76,"completion_tokens_details":null,"prompt_tokens_details":null}} ``` and run full test suite.	2025-05-16 12:57:56 -07:00

1 2 3 4 5 ...

1185 commits