llama-stack-mirror

mirror of https://github.com/meta-llama/llama-stack.git synced 2025-07-02 12:30:38 +00:00

Author	SHA1	Message	Date
Sébastien Han	43c1f39bd6	refactor(env)!: enhanced environment variable substitution (#2490) # What does this PR do? This commit significantly improves the environment variable substitution functionality in Llama Stack configuration files: * The version field in configuration files has been changed from string to integer type for better type consistency across build and run configurations. * The environment variable substitution system for ${env.FOO:} was fixed and now properly returns an error * The environment variable substitution system for ${env.FOO+} returns None instead of an empty string, which better matches type annotations in config fields * The system includes automatic type conversion for boolean, integer, and float values. * The error messages have been enhanced to provide clearer guidance when environment variables are missing, including suggestions for using default values or conditional syntax. * Comprehensive documentation has been added to the configuration guide explaining all supported syntax patterns, best practices, and runtime override capabilities. * Multiple provider configurations have been updated to use the new conditional syntax for optional API keys, making the system more flexible for different deployment scenarios. The telemetry configuration has been improved to properly handle optional endpoints with appropriate validation, ensuring that required endpoints are specified when their corresponding sinks are enabled. * There were many instances of ${env.NVIDIA_API_KEY:} that should have caused the code to fail. However, due to a bug, the distro server was still being started, and early validation wasn’t triggered. As a result, failures were likely being handled downstream by the providers. I’ve maintained similar behavior by using ${env.NVIDIA_API_KEY:+}, though I believe this is incorrect for many configurations. I’ll leave it to each provider to correct it as needed. * Environment variable substitution now uses the same syntax as Bash parameter expansion. Signed-off-by: Sébastien Han <seb@redhat.com>	2025-06-26 08:20:08 +05:30
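The substitution rules described in this commit can be sketched as follows. This is a minimal illustration, not the Llama Stack implementation: the regular expression, the `substitute` and `_convert` names, the restriction to values that consist of a single reference, and the choice to map every empty result to `None` are all assumptions made for the example.

```python
import os
import re

# Hypothetical pattern for this sketch; not taken from the Llama Stack source.
_REF = re.compile(r"\$\{env\.(?P<name>[A-Za-z_][A-Za-z0-9_]*)(?P<rest>:[^}]*)?\}")


def _convert(value):
    """Turn 'true'/'false', integers, and floats into typed values."""
    lowered = value.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass
    return value


def substitute(text, env=None):
    """Resolve a value that consists of exactly one ${env.NAME...} reference."""
    env = os.environ if env is None else env
    match = _REF.fullmatch(text)
    if match is None:
        return text  # not a reference; leave the value untouched
    name, rest = match["name"], match["rest"]
    current = env.get(name, "")
    if rest is None:
        # Plain reference: the variable is required.
        if not current:
            raise ValueError(f"{name} is not set; consider ${{env.{name}:=default}}")
        result = current
    elif rest.startswith(":="):
        # Default syntax: fall back to the text after ':='.
        result = current or rest[2:]
    elif rest.startswith(":+"):
        # Conditional syntax: use the text after ':+' only when the variable is set.
        result = rest[2:] if current else ""
    else:
        # For example ${env.FOO:} with nothing usable after the colon.
        raise ValueError(f"malformed reference {text!r}")
    # Assumption of this sketch: an empty result becomes None.
    return None if result == "" else _convert(result)
```

With an empty environment, `substitute("${env.PORT:=8321}", {})` yields the integer `8321` and `substitute("${env.NVIDIA_API_KEY:+}", {})` yields `None`, while `${env.FOO:}` raises an error.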
Rohan Awhad	436c7aa751	feat: Add url field to PaginatedResponse and populate it using route path (#2419) # What does this PR do? Closes #1847 Changes: - llama_stack/apis/common/responses.py: adds optional `url` field to PaginatedResponse - llama_stack/distribution/server/server.py: automatically populate the URL field with route path ## Test Plan - Built and ran llama stack server using the following cmds: ```bash export INFERENCE_MODEL=llama3.1:8b llama stack build --run --template ollama --image-type container llama stack run llama_stack/templates/ollama/run.yaml ``` - Ran `curl` to test if we are seeing the `url` param in response: ```bash curl -X 'GET' \ 'http://localhost:8321/v1/agents' \ -H 'accept: application/json' ``` - Expected and Received Output: `{"data":[],"has_more":false,"url":"/v1/agents"}` --------- Co-authored-by: Rohan Awhad <rawhad@redhat.com>	2025-06-16 11:19:48 +02:00
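The shape of the change can be sketched with a plain dataclass. The field names follow the expected output quoted in the commit; the `with_route_url` and `to_json` helpers are hypothetical names for this example, and the real model in `llama_stack/apis/common/responses.py` may be defined differently.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, List, Optional


@dataclass
class PaginatedResponse:
    """Stand-in for the real model, with the optional `url` field added."""

    data: List[Any] = field(default_factory=list)
    has_more: bool = False
    url: Optional[str] = None


def with_route_url(response, route_path):
    """Hypothetical server-side hook: fill in `url` from the route path if unset."""
    if response.url is None:
        response.url = route_path
    return response


def to_json(response):
    """Serialize compactly, in field order."""
    return json.dumps(asdict(response), separators=(",", ":"))
```

Calling `to_json(with_route_url(PaginatedResponse(), "/v1/agents"))` produces the same JSON as the expected output in the test plan.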
grs	e2e15ebb6c	feat(auth): allow token to be provided for use against jwks endpoint (#2394) Though the jwks endpoint does not usually require authentication, it does in a kubernetes cluster. While the cluster can be configured to allow anonymous access to that endpoint, this avoids the need to do so.	2025-06-13 10:13:41 +02:00
grs	7c1998db25	feat: fine grained access control policy (#2264) This allows a set of rules to be defined for determining access to resources. The rules are (loosely) based on the cedar policy format. A rule defines a list of actions to either permit or forbid. It may specify a principal or a resource that must match for the rule to take effect. It may also specify a condition, either a 'when' or an 'unless', with additional constraints as to where the rule applies. A list of rules is held for each type to be protected and tried in order to find a match. If a match is found, the request is permitted or forbidden depending on the type of rule. If no match is found, the request is denied. If no rules are specified for a given type, a rule that allows any action as long as the resource attributes match the user attributes is added (i.e. the previous behaviour is the default). Some examples in yaml: ``` model: - permit: principal: user-1 actions: [create, read, delete] comment: user-1 has full access to all models - permit: principal: user-2 actions: [read] resource: model-1 comment: user-2 has read access to model-1 only - permit: actions: [read] when: user_in: resource.namespaces comment: any user has read access to models with matching attributes vector_db: - forbid: actions: [create, read, delete] unless: user_in: role::admin comment: only user with admin role can use vector_db resources ``` --------- Signed-off-by: Gordon Sim <gsim@redhat.com>	2025-06-03 14:51:12 -07:00
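The first-match evaluation described above can be sketched like this. It is a simplified illustration under stated assumptions: the `Rule` class and `is_allowed` function are hypothetical names, and the `when` / `unless` conditions are left out, so only principal, resource, and action matching is shown.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Rule:
    """One policy entry; `effect` is either "permit" or "forbid"."""

    effect: str
    actions: Sequence[str]
    principal: Optional[str] = None  # None matches any principal
    resource: Optional[str] = None   # None matches any resource


def is_allowed(rules, principal, action, resource):
    """Try rules in order; the first matching rule decides, no match means deny."""
    for rule in rules:
        if action not in rule.actions:
            continue
        if rule.principal is not None and rule.principal != principal:
            continue
        if rule.resource is not None and rule.resource != resource:
            continue
        return rule.effect == "permit"
    return False


# The first two `model` rules from the YAML example above.
MODEL_RULES = [
    Rule("permit", ("create", "read", "delete"), principal="user-1"),
    Rule("permit", ("read",), principal="user-2", resource="model-1"),
]
```

With these rules, `user-1` may delete any model, `user-2` may only read `model-1`, and any other principal is denied because no rule matches.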
Ashwin Bharambe	298721c238	chore: split routing_tables into individual files (#2259 )	2025-05-24 23:15:05 -07:00
Ashwin Bharambe	eedf21f19c	chore: split routers into individual files (inference, tool, vector_io, eval_scoring) (#2258 )	2025-05-24 22:59:07 -07:00
Sébastien Han	1862de4be5	chore: clarify cache_ttl to be key_recheck_period (#2220) # What does this PR do? The cache_ttl config value is not in fact tied to the lifetime of any of the keys; it represents the time interval between runs of our key cache refresher. Signed-off-by: Sébastien Han <seb@redhat.com>	2025-05-21 17:30:23 +02:00
Sébastien Han	c25acedbcd	chore: remove k8s auth in favor of k8s jwks endpoint (#2216) # What does this PR do? Since 1.20, Kubernetes exposes a JWKS endpoint that we can use with our recent oauth2 implementation. The CI test has been kept intact for validation. Signed-off-by: Sébastien Han <seb@redhat.com>	2025-05-21 16:23:54 +02:00
liangwen12year	2890243107	feat(quota): add server‑side per‑client request quotas (requires auth) (#2096) # What does this PR do? Unrestricted usage can lead to runaway costs and fragmented client-side workarounds. This commit introduces a native quota mechanism to the server, giving operators a unified, centrally managed throttle for per-client requests, without needing extra proxies or custom client logic. This helps contain cloud-compute expenses, enables fine-grained usage control, and simplifies deployment and monitoring of Llama Stack services. Quotas are fully opt-in and require authentication to be enabled. `sqlite` is the only supported quota `type` at this time; any other `type` will be rejected. The only supported `period` is `day`. Highlights: - Adds `QuotaMiddleware` to enforce per-client request quotas: - Uses `Authorization: Bearer <client_id>` (from AuthenticationMiddleware) - Tracks usage via a SQLite-based KV store - Returns 429 when the quota is exceeded - Extends `ServerConfig` with a `quota` section (type + config) - Enforces strict coupling: quotas require authentication or the server will fail to start Behavior changes: - Quotas are disabled by default unless explicitly configured - SQLite defaults to `./quotas.db` if no DB path is set - The server requires authentication when quotas are enabled To enable per-client request quotas in `run.yaml`, add: ``` server: port: 8321 auth: provider_type: "custom" config: endpoint: "https://auth.example.com/validate" quota: type: sqlite config: db_path: ./quotas.db limit: max_requests: 1000 period: day ``` Closes #2093 Signed-off-by: Wen Liang <wenliang@redhat.com> Co-authored-by: Wen Liang <wenliang@redhat.com>	2025-05-21 10:58:45 +02:00
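The quota behaviour described in this commit (a per-client daily counter kept in SQLite, with a 429 once the limit is reached) can be sketched as below. The `DailyQuota` class, its table layout, and the `check` method are hypothetical and are not the `QuotaMiddleware` code; the sketch uses an in-memory database by default and returns a status code instead of building an HTTP response.

```python
import sqlite3
import time


class DailyQuota:
    """Count requests per client and UTC day in a small SQLite table."""

    def __init__(self, db_path=":memory:", max_requests=1000):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS quota (key TEXT PRIMARY KEY, count INTEGER NOT NULL)"
        )
        self.max_requests = max_requests

    def check(self, client_id, now=None):
        """Return 200 and record the request, or 429 if the daily limit is used up."""
        day = time.strftime("%Y-%m-%d", time.gmtime(now))  # now=None means current time
        key = f"{client_id}:{day}"
        row = self.conn.execute("SELECT count FROM quota WHERE key = ?", (key,)).fetchone()
        count = row[0] if row else 0
        if count >= self.max_requests:
            return 429
        self.conn.execute(
            "INSERT OR REPLACE INTO quota (key, count) VALUES (?, ?)", (key, count + 1)
        )
        self.conn.commit()
        return 200
```

Keying the counter on client and day means that a new day starts from zero without any cleanup job; stale rows simply stop being read.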
grs	091d8c48f2	feat: add additional auth provider that uses oauth token introspection (#2187 ) # What does this PR do? This adds an alternative option to the oauth_token auth provider that can be used with existing authorization services which support token introspection as defined in RFC 7662. This could be useful where token revocation needs to be handled or where opaque tokens (or other non jwt formatted tokens) are used ## Test Plan Tested against keycloak Signed-off-by: Gordon Sim <gsim@redhat.com>	2025-05-20 19:45:11 -07:00
Ashwin Bharambe	c7015d3d60	feat: introduce OAuth2TokenAuthProvider and notion of "principal" (#2185) This PR adds a notion of `principal` (aka some kind of persistent identity) to the authentication infrastructure of the Stack. Until now we only used access attributes ("claims" in the more standard OAuth / OIDC setup) but we need the notion of a User fundamentally as well. (Thanks @rhuss for bringing this up.) This value is not yet _used_ anywhere downstream but will be used to segregate access to resources. In addition, the PR introduces a built-in JWT token validator so the Stack does not need to contact an authentication provider to validate the authorization and can merely check the signed token for the represented claims. Public keys are refreshed via the configured JWKS server. This Auth Provider should overwhelmingly be considered the default given the seamless integration it offers with OAuth setups.	2025-05-18 17:54:19 -07:00
Derek Higgins	2e807b38cc	chore: Add fixtures to conftest.py (#2067 ) Add fixtures for SqliteKVStore, DiskDistributionRegistry and CachedDiskDistributionRegistry. And use them in tests that had all been duplicating similar setups. ## Test Plan unit tests continue to run Signed-off-by: Derek Higgins <derekh@redhat.com>	2025-05-06 13:57:48 +02:00
Ihar Hrachyshka	9e6561a1ec	chore: enable pyupgrade fixes (#1806) # What does this PR do? The goal of this PR is code base modernization. Schema reflection code needed a minor adjustment to handle UnionTypes and collections.abc.AsyncIterator. (Both are preferred in recent Python releases.) Note to reviewers: almost all changes here are automatically generated by pyupgrade. Some additional unused imports were cleaned up. The only change worth noting can be found under `docs/openapi_generator` and `llama_stack/strong_typing/schema.py` where reflection code was updated to deal with "newer" types. Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>	2025-05-01 14:23:50 -07:00
Sébastien Han	79851d93aa	feat: Add Kubernetes authentication (#1778) # What does this PR do? This commit adds a new authentication system to the Llama Stack server with support for Kubernetes and custom authentication providers. Key changes include: - Implemented KubernetesAuthProvider for validating Kubernetes service account tokens - Implemented CustomAuthProvider for validating tokens against external endpoints - this is the same code that was already present. - Added test for Kubernetes - Updated server configuration to support authentication settings - Added documentation for authentication configuration and usage The authentication system supports: - Bearer token validation - Kubernetes service account token validation - Custom authentication endpoints ## Test Plan Set up a Kube cluster using Kind or Minikube. Run a server with: ``` server: port: 8321 auth: provider_type: kubernetes config: api_server_url: http://url ca_cert_path: path/to/cert (optional) ``` Run: ``` curl -s -L -H "Authorization: Bearer $(kubectl create token my-user)" http://127.0.0.1:8321/v1/providers ``` Or replace "my-user" with your service account. Signed-off-by: Sébastien Han <seb@redhat.com>	2025-04-28 22:24:58 +02:00
Ben Browning	0b6cd45950	fix: Additional streaming error handling (#2007 ) # What does this PR do? This expands the `test_sse` test suite and fixes some edge cases with bugs in our SSE error handling to ensure streaming clients always get a proper error response. First, we handle the case where a client disconnects before we actually start streaming the response back. Previously we only handled the case where a client disconnected as we were streaming the response, but there was an edge case where a client disconnecting before we streamed any response back did not trigger our logic to cleanly handle that disconnect. Second, we handle the case where an error is thrown from the server before the actual async generator gets created from the provider. This happens in scenarios like the newly merged OpenAI API input validation, where we eagerly raise validation errors before returning the async generator object that streams the responses back. ## Test Plan Tested via: ``` python -m pytest -s -v tests/unit/server/test_sse.py ``` Both test cases failed before, and passed afterwards. The test cases were written based on me experimenting with actual clients that would do bad things like randomly disconnect or send invalid input in streaming mode and I hit these two cases, where things were misbehaving in our error handling. Signed-off-by: Ben Browning <bbrownin@redhat.com>	2025-04-24 17:01:45 -07:00
Ben Browning	dc46725f56	fix: properly handle streaming client disconnects (#2000 ) # What does this PR do? Previously, when a streaming client would disconnect before we were finished streaming the entire response, an error like the below would get raised from the `sse_generator` function in `llama_stack/distribution/server/server.py`: ``` AttributeError: 'coroutine' object has no attribute 'aclose'. Did you mean: 'close'? ``` This was because we were calling `aclose` on a coroutine instead of the awaited value from that coroutine. This change fixes that, so that we save off the awaited value and then can call `aclose` on it if we encounter an `asyncio.CancelledError`, like we see when a client disconnects before we're finished streaming. The other changes in here are to add a simple set of tests for the happy path of our SSE streaming and this client disconnect path. That unfortunately requires adding one more dependency into our unit test section of pyproject.toml since `server.py` requires loading some of the telemetry code for me to test this functionality. ## Test Plan I wrote the tests in `tests/unit/server/test_sse.py` first, verified the client disconnected test failed before my change, and that it passed afterwards. ``` python -m pytest -s -v tests/unit/server/test_sse.py ``` Signed-off-by: Ben Browning <bbrownin@redhat.com>	2025-04-23 15:44:28 +02:00
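The bug and its fix can be reproduced in isolation. The sketch below is not the `sse_generator` code from `server.py`; `open_stream` and `read_first_chunk` are hypothetical names, and the client disconnect is simulated by returning after the first chunk rather than by an `asyncio.CancelledError`.

```python
import asyncio


async def open_stream(closed_flags):
    """Hypothetical provider call: a coroutine that returns an async generator."""

    async def events():
        try:
            for i in range(3):
                yield f"data: {i}\n\n"
        finally:
            closed_flags.append(True)  # runs when the generator is closed

    return events()


async def read_first_chunk(closed_flags):
    pending = open_stream(closed_flags)
    # A coroutine object has close() but no aclose(); calling aclose() on it
    # is the AttributeError quoted in the commit message.
    assert not hasattr(pending, "aclose")
    stream = await pending  # save the awaited value: the async generator
    try:
        async for chunk in stream:
            return chunk  # stop early, as if the client had disconnected
    finally:
        await stream.aclose()  # valid here: `stream` is an async generator
```

Running `asyncio.run(read_first_chunk(flags))` with an empty list returns the first SSE chunk and leaves one entry in `flags`, showing that the generator's cleanup ran.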
Ashwin Bharambe	01a25d9744	feat(server): add attribute based access control for resources (#1703 ) This PR introduces a way to implement Attribute Based Access Control (ABAC) for the Llama Stack server. The rough design is: - https://github.com/meta-llama/llama-stack/pull/1626 added a way for the Llama Stack server to query an authenticator - We build upon that and expect "access attributes" as part of the response. These attributes indicate the scopes available for the request. - We use these attributes to perform access control for registered resources as well as for constructing the default access control policies for newly created resources. - By default, if you support authentication but don't return access attributes, we will add a unique namespace pointing to the API_KEY. That way, all resources by default will be scoped to API_KEYs. An important aspect of this design is that Llama Stack stays out of the business of credential management or the CRUD for attributes. How you manage your namespaces or projects is entirely up to you. The design only implements access control checks for the metadata / book-keeping information that the Stack tracks. ### Limitations - Currently, read vs. write vs. admin permissions aren't made explicit, but this can be easily extended by adding appropriate attributes to the `AccessAttributes` data structure. - This design does not apply to agent instances since they are not considered resources the Stack knows about. Agent instances are completely within the scope of the Agents API provider. ### Test Plan Added unit tests, existing integration tests	2025-03-19 21:28:52 -07:00
Ashwin Bharambe	5b39d5a76a	feat(auth, rfc): Add support for Bearer (api_key) Authentication (#1626) This PR adds (or rather proposes) support for API key authentication on the Llama Stack server. `llama-stack-client` already supports accepting an api_key parameter and passes it down through every request as an `Authorization: ` header. Currently, Llama Stack does not propose APIs for handling authentication or authorization for resources of any kind. Given that, and the fact that any deployment will typically have _some_ authentication system present, we simply adopt a delegation mechanism: delegate to an HTTPS endpoint performing key management / authentication. It is configured via: ```yaml server: auth: endpoint: <...> ``` in the run.yaml configuration. ## How It Works When authentication is enabled: 1. Every API request must include an `Authorization: Bearer <token>` header 2. The server will send a _POST_ validation request to the configured endpoint with the following payload: ```json { "api_key": "<token>", "request": { "path": "/api/path", "headers": { "header1": "value1", ... }, "params": { "param1": "value1", ... } } } ``` 3. If the authentication endpoint returns a 200 status code, the request is allowed to proceed 4. If the authentication endpoint returns any other status code, a 401 Unauthorized response is returned ## Test Plan Unit tests	2025-03-18 16:24:18 -07:00
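The four steps above can be sketched without any network code. The function names below are hypothetical; only the payload shape and the 200 / 401 mapping come from the commit message, and the actual POST to the configured endpoint is left out.

```python
def extract_bearer_token(authorization):
    """Step 1: pull the token out of an `Authorization: Bearer <token>` header value."""
    scheme, _, token = (authorization or "").partition(" ")
    if scheme.lower() != "bearer" or not token.strip():
        return None
    return token.strip()


def build_validation_payload(token, path, headers, params):
    """Step 2: the JSON body sent to the authentication endpoint."""
    return {
        "api_key": token,
        "request": {"path": path, "headers": headers, "params": params},
    }


def rejection_status(auth_status_code):
    """Steps 3 and 4: None means 'proceed'; anything but 200 becomes a 401."""
    return None if auth_status_code == 200 else 401
```

For example, a header value of `Bearer abc123` yields the token `abc123`, and an endpoint reply of 403 or 500 is reported to the caller as 401.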
Sébastien Han	7cf1e24c4e	feat(logging): implement category-based logging (#1362 ) # What does this PR do? This commit introduces a new logging system that allows loggers to be assigned a category while retaining the logger name based on the file name. The log format includes both the logger name and the category, producing output like: ``` INFO 2025-03-03 21:44:11,323 llama_stack.distribution.stack:103 [core]: Tool_groups: builtin::websearch served by tavily-search ``` Key features include: - Category-based logging: Loggers can be assigned a category (e.g., "core", "server") when programming. The logger can be loaded like this: `logger = get_logger(name=__name__, category="server")` - Environment variable control: Log levels can be configured per-category using the `LLAMA_STACK_LOGGING` environment variable. For example: `LLAMA_STACK_LOGGING="server=DEBUG;core=debug"` enables DEBUG level for the "server" and "core" categories. - `LLAMA_STACK_LOGGING="all=debug"` sets DEBUG level globally for all categories and third-party libraries. This provides fine-grained control over logging levels while maintaining a clean and informative log format. 
The formatter uses the rich library, which provides nice colors and better stack traces like so: ``` ERROR 2025-03-03 21:49:37,124 asyncio:1758 [uncategorized]: unhandled exception during asyncio.run() shutdown task: <Task finished name='Task-16' coro=<handle_signal.<locals>.shutdown() done, defined at /Users/leseb/Documents/AI/llama-stack/llama_stack/distribution/server/server.py:146> exception=UnboundLocalError("local variable 'loop' referenced before assignment")> ╭────────────────────────────────────── Traceback (most recent call last) ───────────────────────────────────────╮ │ /Users/leseb/Documents/AI/llama-stack/llama_stack/distribution/server/server.py:178 in shutdown │ │ │ │ 175 │ │ except asyncio.CancelledError: │ │ 176 │ │ │ pass │ │ 177 │ │ finally: │ │ ❱ 178 │ │ │ loop.stop() │ │ 179 │ │ │ 180 │ loop = asyncio.get_running_loop() │ │ 181 │ loop.create_task(shutdown()) │ ╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯ UnboundLocalError: local variable 'loop' referenced before assignment ``` Co-authored-by: Ashwin Bharambe <@ashwinb> Signed-off-by: Sébastien Han <seb@redhat.com> ## Test Plan ``` python -m llama_stack.distribution.server.server --yaml-config ./llama_stack/templates/ollama/run.yaml INFO 2025-03-03 21:55:35,918 __main__:365 [server]: Using config file: llama_stack/templates/ollama/run.yaml INFO 2025-03-03 21:55:35,925 __main__:378 [server]: Run configuration: INFO 2025-03-03 21:55:35,928 __main__:380 [server]: apis: - agents ``` --------- Signed-off-by: Sébastien Han <seb@redhat.com> Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>	2025-03-07 11:34:30 -08:00
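The `LLAMA_STACK_LOGGING` format shown in this commit (`category=level` pairs separated by semicolons) can be parsed along these lines. The `parse_category_levels` function is a hypothetical helper for illustration, not the parser in Llama Stack, and how the special `all` category is applied afterwards is left out.

```python
import logging


def parse_category_levels(spec):
    """Parse 'server=DEBUG;core=debug' into {'server': 10, 'core': 10}."""
    levels = {}
    for pair in spec.split(";"):
        if not pair.strip():
            continue  # tolerate empty segments such as a trailing ';'
        category, _, level_name = pair.partition("=")
        # getLevelName maps a known level name to its number and returns a
        # string of the form 'Level X' for unknown names.
        level = logging.getLevelName(level_name.strip().upper())
        if not isinstance(level, int):
            raise ValueError(f"unknown log level in {pair!r}")
        levels[category.strip().lower()] = level
    return levels
```

Level names are matched case-insensitively here, which is why `DEBUG` and `debug` in the commit's example give the same result.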
Ashwin Bharambe	290cc843fc	test: first unit test for resolver (#1475 ) Starting to create unit tests to cover critical (and mostly undocumented) provider resolution and routing logic. ## Test Plan Unit tests	2025-03-07 10:20:51 -08:00
Ashwin Bharambe	4ca58eb987	refactor: tests/unittests -> tests/unit; tests/api -> tests/integration	2025-03-04 09:57:00 -08:00

21 commits